model roundup

Qwen 3.6

50 items · started 2026-06-05 · closed 2026-06-16

Qwen 3.6 93B with MTP on 2×RTX 3090 NVLink=187 tokens/SEC,LLM lost bleat-a-thon (github.com via hn)

+1 12d copilot qwen mcp
RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8 (imil.net via hn)

+2 13d qwen

RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8 A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving into the joys of local LLM setups.
How to Setup a Local Coding Agent on macOS (ikyle.me via hn)

+4110 13d gemma llama

How to Setup a Local Coding Agent on macOS Running Gemma 4 26B-A4B and Qwen3.6 35B-A3B locally with llama.cpp, MTP speculative decoding, multimodal support, and PI as a coding agent. I'd had my internet fail a few times recently leaving me…
advice for dual-gpu asymmetric (www.reddit.com via reddit)

2w gemma qwen llama

Hello everyone, i had a 3080ti 12gb and added a 3080 20gb, so it has a bit less speed but more memory than my main card. I could finally get some speed with the usual suspects (i am testing gemma 4 31b/26b-a4b and qwen 3.6 27b/35b-a3b), BU…
Running Claude Code Offline on an M3 Pro with Qwen3.6 (har-ki.github.io via hn)

+71 2w claude-code

06: Air-Gapped Claude Code. The setup, the fixes that make it work, and the hardware that sets the pace. Claude Code connects to a model running locally on the laptop. You provide a Kubernetes incident for investigation.
Reasoning, but without actually *drafting* replies? (www.reddit.com via reddit)

2w gemma

I've been experimenting a bit today with letting models reason for creative tasks, rationale being that it might help with keeping track of details and prompt adherence. And predictably, the wall I'm running into is that they all want to d…
Is Qwen 3.6 27B IQ4XS better than Gemma 4 31B QAT as a Hermes agent? (www.reddit.com via reddit)

2w gemma qwen

If Gemma 4 is better, does anyone have a link for the latest fixed template? Using LMstudio.
MTP hyperparameter search (www.reddit.com via reddit)

2w qwen llama

TLDR; I only got a 6% improvement on tokens/sec over naïve parameters. I was messing around and ran a hyperparameter search with optuna over the MTP and speculative decoding options of llama-server for Qwen3.6 27b on strix halo.
Executing a plan under context constraints (www.reddit.com via reddit)

2w qwen llama

I'm running Qwen 3.6 35B-A3B via Pi harness on a 32gb unified RAM setup (Framework 13). llama.cpp, 64k context window.
Harnesses seem to have an issue. (www.reddit.com via reddit)

2w qwen llama opus

There's a post i saw about Claude Fable where a user asked the model the car wash question and it sent me down a rabbit hole. I spun up qwen on llama.cpp and in the llama.cpp chat interface I asked the model and it got it right consistentl…
Need help improving speed of inference (www.reddit.com via reddit)

2w qwen llama

Hello i'm running the qwen 3.6 27b in ud q5k xl, and with all the optimizations it barely fits in my 3090 vram with a 120k context, i'm sure it does not spill when context is full but i would like to improve the token generation speed. I w…
Deploy a Qwen 3.6 Agentic RAG — Step-by-Step Walkthrough (medium.com via reddit)

2w rag qwen agentic

Deploy an Agentic RAG powered by Alibaba’s latest Qwen 3.6, running fully on your machine.
Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss? (www.reddit.com via reddit)

2w llama

Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput…
Qwen 3.6 27B AutoRound GGUF, need your feedback (www.reddit.com via reddit)

2w qwen

I have always been a fan of the AutoRound quants of this model, for some reason, it thinks less (sort of like Qwopus models) and comes up with solutions quicker than Unsloth quants for instance. https://huggingface.co/sphaela/Qwen3.6-27B-A…
How useful is qwopus compared to qwen3.6 27b (www.reddit.com via reddit)

2w agentic

I see a lot of conflicting comments on this sub and elsewhere on how useful qwopus is compared to, for example, unsloth quants of qwen3.6 27b. Some say it’s worse, some say it’s much better.
The 'storage tax' on cloud GPUs for short LLM runs is brutal. What's your workflow? (www.reddit.com via reddit)

2w cline llama agentic

I’m trying to test Qwen3.6-27B for agentic coding through Cline / llama.cpp, but my local box struggles once the context gets longer. (my poor 3080 just can't keep up).
What's up on CPU inference these days? (www.reddit.com via reddit)

2w moe llama

What are the best models, quants and llama.cpp versions/forks for CPU inference these days? I have AVX2 but no AVX512 - Intel core ultra 7 165H; 64G RAM This seems to ask for massive MoE (a lot of RAM, not a lot of bandwidth/compute).
I'm brand new to running LLMs and the sheer number of tools is overwhelming (www.reddit.com via reddit)

2w ollama gemma qwen

Hey everyone. I'm brand new to running LLMs in general, even more new to running them locally, and the sheer number of tools available is absolutely overwhelming.
Qwen 3.6 35b A3B Speed Help (www.reddit.com via reddit)

2w moe qwen
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit)

2w moe gemma qwen+1

I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
Gemma 4 31B's competence surprised me (www.reddit.com via reddit)

2w gemma qwen

I'm just getting started using local LLMs for code. I'm not interested in vibe coding, but I am hoping to increase my productivity in the publish or perish world of academia.
Qwen 3.6 for coding with 5090 - Your settings recommendations? (www.reddit.com via reddit)

2w qwen

Hi, totally new to using LLMs for coding purposes, I am on Ubuntu and currently using LM Studio with Qwen 3.6 27B Q4 on a 5090. Finding it slow and context runs out fast.
Pursuit of performance Llama.cpp to MLX (www.reddit.com via reddit)

2w llama

Right now, I am running llama.cpp on a M2 ultra 64gig. Having great fun with unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL - Running opencode and finding it amazing to have such great tools running locally.
2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. (github.com via reddit)

2w

Forgive the Claude summary in the readme, but the base works. I'm still working on the HIP kernel and having it combine with MTP.
Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance (www.reddit.com via reddit)

2w

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointe…
5070 Ti + 5060 Ti on vLLM hangs on GDN with Qwen3.6 (www.reddit.com via reddit)

2w vllm
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit)

2w gemma qwen

These last few weeks have been a godsend for 24GB (and below) gpu poor peeps. Killer models released (Gemma 4 / Qwen 3.6). Free intelligence via QAT. Bonus speed via MTP. We're at the tipping point where GPU poor (24gb and below) people are act…
[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup (www.reddit.com via reddit)

2w

Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative…
Weird to get near linear scaling by adding another GPU? (www.reddit.com via reddit)

2w

Single stream benchmarks (club-3090) model: qwen3.6-27b-autoround-int4 BEFORE: 1x3090 Their default script recipe for single 3090s (4-bit quant and 4-bit kv cache, mtp=2) NARRATIVE decode_TPS: mean = 53 std = 0.6 CODE decode_TPS: mean =…
Local LLMs are not as amazing as some people will lead you to believe (www.reddit.com via reddit)

2w qwen

Local LLMs are great, in fact lots of simpler things like a fastapi web server they can do quite well. The moment you move outside of that - things get a bit worse.
club-3090 adds experimental FP8 support for Qwen3.6-27B! (www.reddit.com via reddit)

2w vllm qwen

It’s finally here! Something many of us running dual RTX 3090 rigs have been anticipating.
why I have just installed OpenLumara, my first Agentic Framework. Using only local models, served by LMStudio (www.reddit.com via reddit)

2w qwen agentic

Where I came across it: https://www.reddit.com/r/LocalLLaMA/comments/1txxgpq/openlumara_a_different_kind_of_ai_agent_written/ DISCLAIMER: A good posting would be: This is what I wanted to do with Lumara. Here is what worked, here is what d…
Local agents on a MacBook Pro M5 finally feel practical to me (www.reddit.com via reddit)

2w agentic

I have been pretty pessimistic about local models for agentic workflows for a while. Not because they were useless, but because in practice they often felt just a bit too slow, too fragile, or too…
Why aren't languages/frameworks offering retrained models for their project? (news.ycombinator.com)

+1 2w qwen

The cost of training is coming down. We have incredible open source models (especially smaller ones) like qwen 3.6-27b.
Need some guidance toying with local models (www.reddit.com via reddit)

2w cline

Hi, so I have a pretty low-end laptop regarding running LLMs locally (NVIDIA GeForce RTX 3050 with 4GB VRAM, AMD Ryzen 7 5800H and 16GB DDR4) and while I'm not looking for anything to realistically work with, I'd be interested in how could…
Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ (www.reddit.com via reddit)

2w qwen llama

Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine…
Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) (local-llm.utop.workers.dev via hn)

+2 2w qwen llama

Hi folks, I found this setup on consumer hardware that seems to have great results on local hardware. - qwen 3.6 q6 - 450K context using turboquant turbo3 mode llama.cpp fork - multimodal support This AI generated blog article is a kind…
Just received RTX 6000 Pro, have 5090- how would you use? (www.reddit.com via reddit)

2w qwen gemini

Just received an RTX 6000 PRO, and I have an 5090 Astral. I am considering running a Qwen 3.6 27B on the 5090 and maybe two or three more on the 6000 to play roles such as lead SWE and coder and researcher.
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit)

2w gpt-5 gemma mythos+1

Just kidding. Are there any distills that actually improve a model's quality?
Model recommendations for cybersecurity (www.reddit.com via reddit)

2w qwen

My goal: I want to use an LLM to learn more about software/firmware reverse engineering and binary analysis. Eventually I would like to learn how to build agents to augment parts of this process.
Best Coding Harness for Qwen3.6 35B? (www.reddit.com via reddit)

2w copilot

I've been happily using GitHub Copilot for 7-8 months, primarily in Visual Studio and VS Code, mostly with the built-in flagship models and have felt like the output is worth the cost. Lately I've been playing with a lot of different local…
Z.ai, we need Air! GLM GGUF wen? (www.reddit.com via reddit)

2w glm gemma qwen+1

First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding.
Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed (www.reddit.com via reddit)

2w moe gemma qwen+1

I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.
I’m upset… (www.reddit.com via reddit)

2w codex chatgpt openai

So long story short - openai 20$ subscription is much better than my local AI stack… r7900xtx+32GB RAM (Qwen3.6-35B_Q4+OpenWebUI+SearXNG+Playwright+opencode). I wasn’t expecting much but it’s literally impossible to replace chatGPT level of…
Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization (www.reddit.com via reddit)

2w qwen llama

I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s: bash ./llama-server \ -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \ --host 0.0.0.0 \ --port 8080 \ --ch…
Local vs Frontier on low-level systems engineering (www.reddit.com via reddit)

2w qwen opus

Hey r/LocalLLaMA, Before anyone jumps on me, this is absolutely not a post about how great Qwen is 😄 Even though I use Qwen 3.6 35B-A3B daily, I’ve found a massive gap between Opus and every other model, local or frontier (including GPT 5)…
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss (www.reddit.com via reddit)

2w gemma agentic

I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought.
Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result (www.reddit.com via reddit)

2w moe llama

TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps).
qwen3.6 35B has much worse vision capability than gemma4? (www.reddit.com via reddit)

2w qwen

How different are the image recognition capabilities between gemma4 and qwen3.6? I give the model the task to extract calendar events from a photo of a calendar that is cropped to the calendar.
Maybe KV cache offload to RAM isn't bad (www.reddit.com via reddit)

2w llama

So, llama.cpp has the -nkvo (--no-kv-offload) option to offload KV cache to RAM instead of VRAM. Many people avoid this because obviously it hurts performance.