model roundup

Qwen 3.6

562 items · started 2026-04-16 · closed 2026-05-30

Qwen3.6 35B - TXT vs Markdown vs HTML vs HTML+CSS (www.reddit.com)

+154 4w claude-code

There's been talk of late about using HTML rather than markdown in Claude Code. I was curious how this worked with a local model, so I loaded up Qwen3.6 35B A3B at Q8 and F16 KV cache.
VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? (www.reddit.com)

+939 4w vllm llama

EDIT - IGNORE. I MADE A MISTAKE.
I'm seeing low draft acceptance when using Qwen3.x MTP, what am I doing wrong? (www.reddit.com)

+314 4w llama

I'm using llama.cpp, and I've tried Bartowski's and my own quants. When using Qwen3.5-122B or Qwen3.6-27B, I'm seeing really low draft acceptance in chats with interleaved code snippets (chatting with the LLM about programming / a code pro…
Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model (www.reddit.com)

+1221 4w llama

I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler).
Krasis update: Qwen3.6-35B-A3B (Q4) at reading speed, 1x 8GB 3070 Mobile laptop (32GB RAM) (www.reddit.com)

+1813 4w

Context: Krasis is an LLM runtime for running models that don't fit into VRAM. Krasis streams the model through VRAM from system RAM efficiently and handles prefill and decode as separate architectures and optimised use cases.
Qwen/Qwen-Image-Bench · Hugging Face (huggingface.co via reddit)

+6013 4w qwen

Model Description: Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality crit…
Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context. (www.reddit.com)

+819 4w vllm qwen llama

Used the vLLM version of https://github.com/noonghunna/club-3090 and it worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new llama.cpp patched one for a single 3090?
Local LLMs on Refurb M4 Max vs new M5 Max (www.reddit.com)

+216 4w gemma

Hoping the community can guide me on this one. I'm on the fence about the following purchase: Refurbished 16-inch MacBook Pro Apple M4 Max Chip with 16‑Core CPU and 40‑Core GPU, 64gb ram for $3,479.00 vs The new 16-inch MacBook Pro Apple M…
Anyone tried a setup like this? Is it a bad idea? 😅 (www.reddit.com)

+12 4w qwen

I’m considering building a local machine for AI inference using a Dell Precision T5820 and 2 Intel Arc A770s. From this I could get 32GB DDR4 RAM, 1TB SSD and 32GB VRAM, all for like $1000.
Need some advice on AI workflow (www.reddit.com)

+210 4w llama chatgpt mcp

Hi all, I'm somewhat new to the scene (been lurking for maybe 4-5 months now), but I think I have all the basics figured out. My setup: 9800x3d with 64GB of RAM, 6900xt with 16GB VRAM.
Qwen3.6 huge quality gain from Q4 to Q6 for coding agent (www.reddit.com)

+1312 4w ollama deepseek llama

So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and deepseek was too cheap.
Noob here, curious about roughly how advanced of a video game a model like Qwen3.6 27b could create, if kept fully offline, and got unlimited attempts/revisions (maybe ~1 month project time limit). Like, could it make something equivalent to Pokemon Red? Doom? Doom II? What if using GLM 5.1? (www.reddit.com)

+125 4w glm

So, I got interested in local LLMs a few months ago, but, I don't have a background in coding, and I don't know how to code, and I am not good with computers or anything. So far I mainly just was having fun with comparing different local L…
KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche (www.reddit.com)

+8763 4w qwen llama

Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantizations affects…
Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference? (www.reddit.com)

+11 4w vllm llama agentic

I was given the great opportunity to borrow an H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system RAM I will get, but I guess they are a bit flexible on this).
LMStudio with MTP support - which model? (www.reddit.com)

+13 4w qwen

Looks like LMStudio released support for Multi-Token-Prediction (MTP) and the release notes say to use a MTP-compatible model. What model is everyone using with MTP support?
Is Granite-4.1-30b Overshadowed by Qwen3.6 & Gemma4 models? (www.reddit.com)

+107 4w retrieval-augmented rag

I don't see any threads on this model. Is it because it's dense and/or without-reasoning?
Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m? (www.reddit.com)

+1919 4w qwen agentic

I don't have a good experience running q4_k_m; the difference to q6 is "a few errors an hour" versus "a few errors every couple of days". Edit: How does it fail?
How Qwen3.6-35B-A3B fails differently as a sub agent compared to solo (www.reddit.com)

+21 4w moe

Been running Qwen3.6-35B-A3B as a sub agent on a single 4090 for a few weeks. The failure modes are different from solo use and I haven't seen this written up anywhere.
2 RTX A6000 at 96GB VRAM with nvlink. Best local coding model/what you would daily drive? (www.reddit.com)

8 4w moe qwen

Really been testing qwen 3.6 27b and 35b a3b so far, with 27b at q8 and 35b a3b at q4 (byteshape quant is insane). But I feel I'm not utilizing it the best, esp. for long-context messy coding of large repos.
Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? (www.reddit.com)

+18 4w qwen llama

Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774) My run command: ``` llama-server \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --presence_penalty 0.0 \ --min-p 0.00 \ --gpu-layers all \ -m /home/eleung/huggingface…
$400 Qwen 3.6-27B Setup - Dual RTX 3060 - 30-50 t/s (www.reddit.com)

+611 4w qwen

I picked up a 7900 XTX earlier which runs qwen3.6-27b fine, but not to my liking. Its compute performance is quite unstable for me.
Looking for Suggestions — Single 5090 & 64gb DDR5 (www.reddit.com)

+211 4w vllm qwen llama+1

Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…
Poor performance on RX 9070 XT (www.reddit.com)

+15 4w llama

I was thinking about upgrading from an MI50 to an AMD AI PRO9700, and I happen to have an RX 9070 XT on my gaming pc, so I tested the performance on it to have an idea of what to expect. So, install rocm, build llama.cpp, download Qwen3.6-…
Llamacpp server : How do the -np and -c flags interact? (www.reddit.com)

+53 4w moe qwen llama

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.
qwen 3.6 27B AR-> Diffusion - local training on 5090 (www.reddit.com)

+31 4w qwen

Based on the work of open-dllm (which achieved a Qwen 2.5 autoregressive -> diffusion realignment head, the same exact model under the hood delivering a 4x improvement). TL;DR: I haven't got a trained model yet, just a burnt-out GPU cable an…
Sharing INT4-W4A16 version of Jackrong/Qwopus3.6-27B-v2 for VLLM/SGLang users (www.reddit.com)

+11 4w vllm

link: https://huggingface.co/JC1DA/Qwopus3.6-27B-v2-INT4-W4A16-Autoround Super surprised how good Jackrong's model is... It's taking so much time to evaluate all of the base qwen3.6-27B, Jackrong's version and others' quantized models but…
Anyone use QwQ-32B? It's over a year old? Has Qwen 3.6 27b basically replaced it? (www.reddit.com)

+316 4w gemma qwen

I've seen this one mentioned, but the source was from about 14 months ago. In the age of Qwen 3.6 and Gemma 4, is there still a use for QwQ 32B?
Is there any case of a less quantised smaller model outperforming a more quantised larger model? (www.reddit.com)

+611 4w gemma qwen

As per the title, such as Gemma 4 31B Q4_K_S vs Gemma 4 26B A4B Q8, or Qwen 3.6 27B Q4_K_M vs Qwen 3.6 35B A3B Q6_K, etc. At what point is it worth switching? My use case is mostly creative writing.
Is Qwen3.6 current king for local agentic use? (www.reddit.com)

+1122 4w glm moe agentic

I've been testing other models but it seems like nothing even comes close to Qwen3.6 35B A3B for agentic use. The worst I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash…
llama.cpp oom issue (www.reddit.com)

+113 4w llama

I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least it gets killed and restarted before it st…
Qwen 3.6 benchmarks on 2x RTX PRO 6000 (www.reddit.com)

+56 4w vllm qwen

Got a chance to play around with a 2x RTX PRO 6000 setup, so sharing some numbers for Qwen 3.6. All of these were run using the latest stable vLLM backend.
Could someone please help explain these results? (www.reddit.com)

+22 4w moe llama

I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled!
hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) (www.reddit.com)

+111 4w moe qwen llama

A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…
Qwen 3.6 27B MTP speed on 3080ti (getting 4.5 t/s) (www.reddit.com)

+110 4w qwen

Using LM Studio with 3080ti (12gb of VRAM) and 128gb of ddr4. Model version: Qwen 3.6 27B MTP UD q4_k_xl Is this my hardware limit?
qwen3.6-35b-a3b-mtp running on GTX 1060 6GB (www.reddit.com)

+63 4w moe

I have this 10-year-old Dell T5810 workstation with 32GB ddr3(?) memory and an E5-2698v3 (16 cores 32 threads), plus a GTX 1060 6GB that was used for mining back in the old days (paid itself back many times over). I managed to get the model ru…
Need Help Choosing a Harness for Qwen 3.6 27B (www.reddit.com)

+33 4w qwen llama

I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a spec…
GPU VRAM only for small models with llama.cpp: is it possible? (www.reddit.com)

+311 4w qwen llama

I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large co…
Qwen3.6-35B-A3B vs Gemma4-26B-A4B (www.reddit.com)

+1221 4w qwen llama

Just wondering how people's experiences are with both of these models! I've had some nice results with Qwen but Gemma4 runs so much faster here.
Qwen Plays ̶p̶̶o̶̶k̶̶e̶̶m̶̶o̶̶n̶ ? / QWEN PLAYS DCSS! - qwen3.6-35b-a3b@q4_k_xl plays open source roguelike adventure DCSS (and does a decent job) (www.reddit.com)

+41 4w qwen codex

Hi, (TL;DR): Qwen in its MTP version has tool call bugs and outputs everything into tool/thinking blocks, mangling the output and canceling the speed gain with repeated wrong tool calls! DCSS works well with non-MTP Qwen even on smaller quants.
Is there any reason for an uncensored model if you have no interest in roleplaying? (www.reddit.com)

+28 4w rag

The RAG I've been building is largely a response to wanting an LLM where I feel more confident knowing where the knowledge base is coming from, especially after the OpenAI deal with the Pentagon. So, when I saw "uncensored" heretic models, I…
It's OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD (www.reddit.com)

4 4w llama

Please forgive the mildly clickbait title; it's hard to fit everything in it. I've seen a lot of discussion here about KV-cache quantization, especially with the recent llama.cpp improvements, leading to some debate on the tradeoffs between KV q…
Claude code in terminal models / combine with local llm? (www.reddit.com)

+11 4w ollama qwen opus+1

Hi, I’m pretty sure I have seen people typing /model and seeing all available models. I have to type models from memory.
Local model doing accounting tasks (www.reddit.com)

+72 4w qwen

So I've been using qwen 3.6 27b for monthly closes, bank recs, payables and receivables. Built a simple SQLite database it manages.
If you're missing Jeeves, you might want to check out my weekend project. (www.reddit.com)

5 4w

Just wanted to share my amusing weekend project. https://www.askjeebus.com 100% vibe coded.
Did a 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32gb VRAM GPU - two models tested Gemma4 and Qwen3.6 - Figured I'd share in case it helps anyone else (www.reddit.com)

+1 4w llama

I'm running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it's just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) are just a real pain in the behind to get working…
Any reason to run dense over MOE for RAGs? (www.reddit.com)

+512 4w moe rag

I tend to use Claude for a lot of research and I also increasingly worry about things like misinformation or things in the model I can't audit. So, I'm building my own all in one RAG with big datasets like all of Wiki, research papers, all…
Removing Vision from model (www.reddit.com)

+99 4w qwen agentic

I removed the mmproj file from models to remove vision and save VRAM. But just curious, does this really not affect its text ability?
I added native MTP to exo for Qwen3.6 MLX models; here are the exactness and speed results (www.reddit.com)

2 4w

I opened my first contribution to exo: native multi-token prediction support for Qwen3.6-style MLX checkpoints. I hope it is useful.
Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image (www.reddit.com)

4w llama

Sharing this because I didn't believe the first run. Setup: laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s), Linux.
Optimizing speed & quality on Qwen3.6 27b (www.reddit.com)

+34 4w agentic

Does the inference speed below seem optimal for the hardware, or could there be further room for improvement? I’ve been trying to use Qwen3.6 27b for agentic harnesses like Pi/Hermes.
- minor speed bump for MTP with Qwen3.6-27B-MTP Q6_K_XL (www.reddit.com)
DGX Spark agentic usage numbers (www.reddit.com)

+16 4w openclaw agentic

What I need it to do: Be able to support an openclaw-type agent which is used by multiple people. What I tried: So I read on the internet about the atlas thing.
club-rdna16: practical 16GB AMD/Radeon local LLM testing repo (www.reddit.com)

+3 4w llama

Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/…
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM (www.reddit.com)

+1733 4w llama

Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fit into my RTX 5060 Ti 16 GB.
Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps (www.reddit.com)

+83 4w moe

..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M.
Blackwell and PDL performance increase (www.reddit.com)

+95 4w qwen llama

Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), which is a new feature in Nvidia GPUs (CC >= 90, not including ADA) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels…
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model) (www.reddit.com)

+54 4w moe llama

I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me spl…
BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline. (www.reddit.com)

+5038 4w gemma qwen

BeeLlama v0.2.0 is here! Not quite a pegasus, but close enough.
Experts first llama.cpp (www.reddit.com)

+69 4w moe llama

This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers.
Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM (www.reddit.com)

+277 4w qwen llama

Hi everyone, I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp.
Some tests with qwen3.6 27b + 35b a3b about MTP vs ngram-mod (www.reddit.com)

14 4w glm

I will try to keep this short ;) I used GLM 5.1 to vibecode a vague prompt on my vibecoded react web app and have GLM 5.1 rank the plans made with each other and the one it made itself. Test strategy: - use starter prompt as always - add v…
Qwen 3.6. struggling with German (www.reddit.com)

11 4w gemma qwen

Hi everyone, I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.
Comparison of Qwen 3.6 and Gemma4 (MoE and Dense models, Q4_K_M), generating a moderately complex MySQL query, only one produced acceptable results (www.reddit.com)

14 5w moe qwen

I tried Qwen3.6 35B A3B MoE, Qwen3.6 27B Dense, Gemma4 26B A4B MoE, Gemma4 31B Dense. In all cases I was using Q4_K_M and thinking mode enabled.
Qwen3.6 35Ba3 has changed my workflows and even how I use my computer (www.reddit.com)

+197 5w qwen codex chatgpt

My workflow has changed basically to ask Codex to do certain tasks and then document how to do them (including errors it found on its way) into a skill. I feed that skill to pi, and suddenly my qwen3.6 gets that hard stuff done: - devops o…
For the users who have add bad luck with QWEN 3.6 27B, and Gemma 4 31B. "Actually..wait..actually". Endless reasoning. Horrible output. I found a solution. rtx pro 6000. (www.reddit.com)

26 5w vllm gemma qwen

Edit: does this happen every time a newbie tries to post here? Getting roasted despite having valid results?
Show HN: Modernizing my old PhD work in an evening with little Qwen3.6 MoE (github.com via hn)

+2 5w moe

pge-jax: JAX implementation of the Prioritized Grammar Enumeration (PGE) algorithm for symbolic regression. Overview: pge-jax is a complete symbolic regression system that automatically discovers mathematical formulas from data.
Ask HN: Is the next big thing locally running coding agents? (news.ycombinator.com)

+18 5w qwen anthropic claude-code

There's extreme price escalation on the part of Anthropic, with token spend now approaching levels that have made many an enterprise scratch their heads. At the same time, judging by open-source advances (E.g.
I'm running an agentic system with kobold.cpp as my backend. Am I losing performance? (www.reddit.com)

+11 5w moe llama agentic+1

Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp (www.reddit.com)

+6724 5w llama

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP.
Continue config for Qwen 3.6 and llamacpp (www.reddit.com)

+1 5w continue-dev qwen llama

If anyone is using the Continue.dev extension in VSCode, what config settings are you using for Continue and the llama-server? Mine keeps hanging after bad tool calls.
One Night Werewolf played by LLMs (www.reddit.com)

+13 5w gemma qwen

The other day I posted about playing one night werewolf on my custom made UI via tool calls. Since then I’ve played a few games and improved the prompts.
Qwen3.6 27B and llama.cpp appreciation post (www.reddit.com)

+2011 5w llama

To preface, here's my config: llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinj…
Same task in github-copilot, pi, claude-code, and opencode with Qwen3.6 27B (www.reddit.com)

+2117 5w copilot agentic

I wanted to know how much of a coding agent's performance came from the model and how much came from the harness, so I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. All the images above a…
How can you stop your model from looping (www.reddit.com)

+59 5w copilot qwen

So I thought this was a small-model issue, but after I added a new GPU and was able to run a low-to-mid-size model like Qwen 3.6 35B at Q4 or Q5, the issue still exists. It's not as frequent as with small models, but it does break when linking the model to copilo…
Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark? (www.reddit.com)

+4 5w vllm qwen openai

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g…
Volatile prefill speed after each reboot - llama.cpp (www.reddit.com)

+11 5w moe llama

After every machine restart I get a different prefill speed, it can be only 300t/s or 1500t/s. It's like a lottery at each restart.
Kinda New to all this, couple of questions about how to set pcs and what models (www.reddit.com)

+25 5w qwen

I'll address all the questions here rather than spam the sub. What would be a better setup: 1 PC with 2 3090s and a 5080, where the 3090s will have to run at x4 PCIe slots, OR 1 PC with the 5080 and another PC with the 2 3090s on x16 split into 2x8 mai…
Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs (www.reddit.com)

+6817 5w mmlu qwen

Hey r/LocalLLaMA, we’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (Next Token Prediction or non-MTP) and MTP. TL;DR: For NTP, “pick the largest quan…
A streamlined Hugging Face model search utility coded by Qwen 3.6-27B (www.reddit.com)

+3 5w qwen

Hi all. As some may have been aware, Hugging Face's model search had issues recently.
How accurate can “whichllm” be? (www.reddit.com)

+44 5w

Hello people, I think the question is clear, but I wanted to add some context: I work on internal tools at my job, and some of the tools are for us developers (most tools are for marketing and factory production). I am currently working on a…
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help (www.reddit.com)

+3736 5w moe llama

MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round.
Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell (www.reddit.com)

+11 5w llama

Hi, I run llama.cpp inside LXC on a Proxmox server. The hardware is a recent AMD Epyc with two 6000 Blackwell MaxQ.
IGPU 780 Unsloth Q2_K_XL Qwen 3.6 27b 8t/s with MTP LM Studio (www.reddit.com)

5w qwen

Man Loving MTP. And Unsloth.
unsloth/Qwen3.6-35B-A3B-GGUF has worked very well on my 24GB 3090 Ti for coding. Any recommendations for other models? Also, my perspective as an experienced coder just trying this stuff out now (www.reddit.com)

+11 5w qwen

I've tried Gemma4 and a few other variations of Qwen, but they're either not as robust with their output, or they take too long or too much VRAM and force the context limit down from 131K to 20K or even 4K, or they're slow AND low-context…
Agents creating their own language : reality or not ? Compliance issue. (www.reddit.com)

2 5w agentic

Hi! I read a while ago that some AIs tend to agree on their own language to talk to one another over time.
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM (www.reddit.com)

+2320 5w vllm qwen

Greetings from TurboQuant's former biggest defender, now a middle-sized, niche-aware TurboQuant defender. Today I'm presenting to you the results of thoroughly exploring the world of PPL and KLD benchmarks with my single RTX 3090 using Bee…
The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b (www.reddit.com)

+178 5w glm qwen chatgpt+2

One way I like to test new models is by one-shotting (with a good prompt) a single-webpage clone of the classic arcade game pacman. I usually do 3 attempts and keep the best one.
Find bugs in YOUR code using OpenCode, Llama.cpp and Qwen3.6 (wtarreau.blogspot.com via hn)

+1 5w llama

Background: For quite some time I had been submitting tasks to LLMs via llama-cli (natively) or llama-server (API), both from the excellent llama.cpp project. On CPU-only, llama-cli starts fast and can restart from a checkpoint which has alr…
Qwen3-Coder-Next-UD-Q4_K_XL vs. Qwen3.6-27B-MTP-UD-Q4_K_XL on Strix Halo (www.reddit.com)

+215 5w agentic

I wanted to switch from Qwen3-Coder-Next-UD-Q4_K_XL to Qwen3.6-27B-MTP-UD-Q4_K_XL for local agentic coding. The Qwen3.6-27B is perceived to be "smarter" than Qwen3-Coder-Next, and I wanted to "upgrade" my local AI coders.
Qwen3.6 35B MTP, t/s varies on different scenario (www.reddit.com)

+11 5w moe llama

Tried Qwen3.6 35B Q5_K_M MTP, HW: 9700x, 64GB 5600 RAM, 5060 TI 16GB. --n-cpu-moe 30 ^ -ngl 99 ^ -c 131072 ^ --no-mmap ^ --flash-attn on ^ --cache-type-v q8_0 ^ --cache-type-k q8_0 ^ --threads 8 ^ --parallel 1 ^ -rea off ^ --reasoning-budg…
TurboQuant on 16 GB VRAM (www.reddit.com)

6 5w llama

I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute.
Weird performance depending on quant (www.reddit.com)

+26 5w llama

Hi, I'm using llama.cpp with qwen3.6 35B A3B on two different machines. I noticed that on both machines tokens per second is better while using Q4_K_S and Q4_K_M quants than lower Q3_K_M quants.
RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed (www.reddit.com)

+1 5w llama

What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only. A custom binary that bypasses some assert statement causing a crash in today’s stock releases.
Why might MTP be net negative for tool heavy agentic flows? (www.reddit.com)

+31 5w agentic

The Qwen3.6-27B MTP benchmarks that have been circulating put factual tasks at 62-70% acceptance vs code at 79-89%. Tool calls probably sit in that factual range or lower, structured output, constrained format, less predictable than pure c…
club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes (www.reddit.com)

+35 5w vllm llama

I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes.
Distilled Model's Vision Problem (www.reddit.com)

+1 5w openclaw qwen

Have been using Qwen 3.6 Claude distilled version, 27b at Q4 for openclaw, Hermes and other local harnesses. But recently noticed that the Claude distilled version that I use lost its vision abilities.
MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro (www.reddit.com)

+136 5w qwen

https://youtu.be/MI0Pm1d6YF4 MTP can accelerate LLM inference 2x, especially for coding agents. This video covers what MTP…
Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo (www.reddit.com)

+1 5w

Update to Lemonade v10.5.1, then: ``` Get the model lemonade pull Qwen3.6-27B-MTP-GGUF Get ROCm 7.13 lemonade backends install llamacpp:rocm Load the model (MTP args auto-applied) lemonade load Qwen3.6-27B-MTP-GGUF --llamacpp rocm --ctx-si…
llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig (www.reddit.com)

+1620 5w moe llama

PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
- Benchmarking llama.cpp's new MTP support on Strix Halo (calebcoffie.com)
MLX engine comparison… and oMLX is the top choice. (www.reddit.com)

+54 5w

Just stumbled on this blog. A very interesting read if you are picking an inference engine.
Tesla P40 running qwen 3.6 (www.reddit.com)

+2 5w qwen llama

Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40.
Qwen 3.6-27B giving me attitude! (www.reddit.com)

11 5w qwen

I'm laughing here. I'm messing about with Qwen3.6-27B in order to gauge just how capable it is with local vibe-coding.
Not getting any faster with MTP on Macbook Pro M1 Max 32gb (www.reddit.com)

4 5w llama

Using latest llama.cpp with mtp and these settings, I only get 10 tps, should I be getting more? [unsloth/Qwen3.6-27B-MTP-Q4_K_M] jinja = true model = /Users/[username]/llms/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf cache-type-k…
If you use continue.dev and Qwen 3.6 (dense / MoE) - I could use your help (www.reddit.com)

+65 5w continue-dev moe qwen+1

Someone suggested I give Continue (VS Code extension) a try. I've been using Roo / Zoo now and liking it, but it is pretty tough on context and I was told Continue has more control over it.
Quantizing MTP KV Cache = free lunch? (www.reddit.com)

+4630 5w llama

With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized: -cache-type-k-draft q8_0 -cache…
Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) (www.reddit.com)

+10548 5w vllm qwen llama

TL;DR: best setup I tested on an RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode. llama.cpp…
Luce DFlash + PFlash on 7900XTX: Qwen3.6-27B at 2.24x decode and 3.05x prefill vs llama.cpp HIP (www.reddit.com)

+1 5w humaneval llama

Tested a bit on my XTX; sharing in the hope it's helpful. Thanks to Lucebox! Lucebox DFlash + PFlash PR #119 Reproduction Report (RX 7900 XTX) Hardware Environment Component Spec GPU AMD Radeon RX 7900 XTX (Navi 31, gfx1100) VRAM 24 GiB GDDR6 (~93…
Built an agent that builds agents — pure Python, Qwen3.6 35b a3b Q8_0 MTP (github.com via reddit)

+1 5w agentic

Hi, I built this agentic AI: a closed-loop system that ships standalone Python agents. What's different: - Interviews you until it understands the request before building anything - Two testing stages: prompt validation via LLM invoke, then…
Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled (www.reddit.com)

+51 5w qwen llama

My setup is heterogeneous, I originally acquired my server (Lenovo ThinkStation P3 Tower Gen 2) to run OpenShift/K8s clusters (because I work on that), and…
- Qwen 27b MTP Config, Llama.cpp Single 3090 (www.reddit.com)
Cutoff dates of open source models (www.reddit.com)

1 5w qwen mcp

I was trying Qwen 3.6-27b and Gemma4 in a simple web chat. Asked them both a question like 'recommend the best llm for a 5060ti' and was surprised when they both replied 'user is asking about a card that doesn't exist'.
Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 (www.reddit.com)

+2 5w qwen llama agentic

I'll be UPDATING this as it seems I was benchmarking and testing Just before the UPDATE LOL TL;DR If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and co…
Follow-up: adding Ollama support to my open-source cursor-aware AI app - looking for beta testers with vision-capable local models (www.reddit.com)

+812 5w ollama cursor

EDIT 2: Trick-Assignment-828 pointed me at the actual rule update from the mods - Rule 3 Low Effort was expanded to cover LLM-assisted posts without disclosure. Disclosing now: Disclosure: I'm a non-native English speaker (German).
While waiting for Fara-1.5 for my coding harness (www.reddit.com)

+31 5w vllm llama

Hi all, Not sure many people are aware so wanted to give a word about Fara-1.5 release. => this release will likely be the big sister of Fara-7B and built on top of Qwen3.5 Actual Fara-7B performs not bad at all but actually requires a pro…
Build Own Docker Image with llama.cpp and MTP (www.reddit.com)

+14 5w llama

Hi All! Saw some folks waiting for the Docker images with llama.cpp and MTP when it released.
MTP experiences on 7900xtx? (www.reddit.com)

+99 5w moe llama

Hi! I have been using Qwen3.6 35B A3B happily the past few weeks, and I wanted to try out Qwen3.6 27B with the new fancy MTP speculative draft!
ik_llama: Qwen3.6 27B and 35B on very low VRAM (www.reddit.com)

+23 5w llama opus agentic

Thank you to the people at ik_llama and llama.cpp. It's amazing how far you've all pushed mtp and other tech so that I can run 27B and 35B Qwen3.6 models on an old gaming laptop with a RTX2060 mobile at 6GB VRAM and 32GB RAM.
Noticing qwen-27b@q2 better than qwen-35b@q8? (www.reddit.com)

18 5w qwen llama

The Latest qwen3.6 models. Is this odd?
I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why (www.reddit.com)

+23 5w qwen llama

In my real-world usage (opencode) and in my synthetic benchmarks, Coder-Next (Q5) demolishes the whole Qwen3.6 family including the 27B Dense model (All Q8). Everybody else is hailing that 27B is superior and is an amazing model, but I hav…
Seeing the activity pop up big time in this sub due to various open models. Most of them require at least 16gb vram. What can I do with 8? (www.reddit.com)

+18 5w gemma qwen

Not deeply technically fluent but have run a few models locally before, around the time before gemma 4 dropped. I tried some low quant of qwen 2.5 coder and after some tinkering I got it to run but it was just so slow, obviously.
Developers who use local AI - Q4_0 vs Q8_0 KV quant? (www.reddit.com)

+325 5w moe qwen llama

I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…
MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it (www.reddit.com)

+1012 5w llama

I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model.
85 GPU-hours comparing 5 abliteration methods on Qwen3.6-27B: benchmarks, safety, weight forensics - Abliterlitics (www.reddit.com)

+5811 5w

I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed u…
M1 Ultra vs M3 Ultra speed (www.reddit.com)

1 5w qwen

Anyone have both of these and tested them? How much faster is the M3 Ultra in PP and TG speed compared to the M1?
Convert With MTP Support? (www.reddit.com)

+23 5w qwen

Hi All, I'm trying to understand the process of creating GGUF with MTP support. Does the original Qwen/Qwen3.6-27B support MTP?
Using Local LLMs for research (www.reddit.com)

+11 5w qwen

Hey there. I am an undergrad who has been doing mostly SWE, but will be doing ML research under my professor over the summer.
Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks (www.reddit.com)

+25 5w qwen llama

Here are some results (llama.cpp)! Task 1: write a short poem 27B Dense: 12.5 tokens/s 27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s 27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s Task 2: edit a hello world html artifact 27B Dense:…
Best local model for C# coding with 24GB VRAM? (www.reddit.com)

+12 5w gemma qwen

I can't decide that Qwen 3.6 35b q4 (130k context) or Gemma 4 26b q4 (95k context) is better for C# coding with 24GB VRAM. Please share your experiences!
Testing llama.cpp MTP support on Qwen3.6 - RTX 5090 (www.reddit.com)

+218 5w llama

Setup: - RTX 5090, 32 GB, Linux - Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120) - Unsloth's…
LeanLoop, the Tool Claude Leans on (github.com via reddit)

+2 5w

So I bought a second graphics card the other week to get in on the local AI craze and I've been having the hardest time using it to build my website. It's been unreliable, the context gets eaten up, kind of hallucinates sometimes.
Now that MTP is merged... What's the best outputs you're getting on Qwen 3.6 35B on 2x3090s? (www.reddit.com)

+52 5w qwen llama

We've got great outputs for 27B via club 3090, but what about those of us who love the blazing speed of 35B on dual 3090s? I was getting 1500 p/p and 120 t/g with split layers, but MTP slowed it down to 80 t/g when I tested last week.
Best llama.cpp launch config for Qwen3.6 27B on RX 7800 XT (16 GB VRAM) for OpenClaw? (www.reddit.com)

2 5w moe openclaw llama+1

I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm. I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup.
Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs (www.reddit.com)

+7029 5w qwen

Saw this post comparing Qwen 3.6 variants on coding primitives, so I wanted to see how local quants stack up against frontier models on a similar dense, single-file coding task. I ran the exact same prompt across local and web-based models…
Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed (www.reddit.com)

+239 5w llama

TL;DR All models were Qwen3.6 27B-MTP vs Base 27B (15k single-turn): Faster overall Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%) Generation: 7.63 → 16.15 t/s (+111.77% speedup) Prompt Processing: 279.75 → 244.90 t/s (-12.46…
Extension idea: llama-server with custom samplers (www.reddit.com)

+3 5w llama

Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for allowing users to add custom sampling logic to llama-server without having to maintain their own entire fork and without having to make a wrapper that reimpleme…
I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash (github.com via reddit)

5w qwen llama openai

LLM Inference Server A single-container, idle-aware, OpenAI-compatible inference router for a Tesla P40. Routes between Qwen 3.6 27B (MTP self-speculative decoding, TurboQuant turbo4 KV cache), Qwen 3.5 0.8B (multimodal transcription), Whi…
local llama.cpp parallel users - still so fast?! (www.reddit.com)

1 5w qwen llama

I am running a dual gpu rig with a 5090 and a 5060. Running qwen 3.6 27b 8quant with a tensor split setting of 4,1 with the 80% on the 5090 build\bin\llama-server.exe ^ -m "!MODEL_FILE!" ^ --mmproj "!MMPROJ_FILE!" ^ -ngl 99 ^ --ctx-size !MO…
Can a 5090 with qwen3.6 achieve > 3,000 tok/s ? bring your pitchforks (open-dllm) (www.reddit.com)

+8 5w deepseek

so background - these people. Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong, worked on converting AR -> diffusion (its already working from older models).
Finding the 4x 3090 Sweet Spot (www.reddit.com)

+1010 5w vllm

https://preview.redd.it/8o43bjhe9d1h1.png?width=5346&format=png&auto=webp&s=1c87c2ee8b8ffff43495f543266056b0e26d3947 In another post I had someone ask me about the power draw of the 4x 3090 setup so I'm sharing a full test I conducted to…
Just tried Ollama for the first time, it runs terrible with half GPU power on the default model it provides compared to the one you add, any reason why? (www.reddit.com)

13 5w ollama

My GPU power consumption is 250w (undervolted rtx3090) when I added Qwen3.5-27B-GGUF to Ollama using a template (Modelfile made by gpt). I gave it 3 tasks to test it: build a snake game, build a flappy bird game, and make an interactive gri…
Qwen 3.6 27B: IQ3XXS KV Q8 vs Q4XL KV Q4 (262K context) (www.reddit.com)

+510 5w qwen

hey yall. So I have a 24GB gpu.
how would you set up a local llm server for a business of 7 people? (www.reddit.com)

+419 5w gemma rag qwen

Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to…
Qwen3.6 9B will release around Google I/O? (www.reddit.com)

+49 5w gemma

I don't think Alibaba officially stated "no qwen3.6 smaller models", and according to the patterns, it should have been released in the first week of May, but I think they delayed a little bit to catch the spotlight from Google I/…
RLM models and Qwen3.6 (www.reddit.com)

+21 6w qwen codex

Does anyone here have an RLM setup and how could I set it up? I want to make my Hermes agent even more powerful and I don't like that I need to open a new context window every time after just a few prompts.
2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache (www.reddit.com)

+711 6w llama

PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise) ------- Just wanted to share my current setup, that might help some users out there... services: llama-server: image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b912…
Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) (www.reddit.com)

+1215 6w qwen llama

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests.
I am not sure if I should be proud or not. (www.reddit.com)

+102 6w deepseek

I managed to get working 4 sub-agents Qwen3.6 35b on dual rtx 3090, I am using deepseek as orchestrator. https://preview.redd.it/biksbgq0n81h1.png?width=783&format=png&auto=webp&s=cf8a4481c1ac439c3283925001c12841b8e6c2e7 They all working l…
I built a little game for my local agents to play via API and it's so cute seeing their feedback (www.reddit.com)

+26 6w

I made a text based craft/trade/cooperate game for my agents to play on intervals when I don't have anything else for them, and it's been so fun watching them plan things out and form little factions with each other to cooperate on trades…
club-5060ti: practical RTX 5060 Ti local LLM notes and configs (github.com via reddit)

+83 6w vllm llama openai

I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to…
Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct (www.reddit.com)

+298 6w vllm llama

Ok, hear me out. This all started when I was trying to understand why this Qwen3.6 27B INT8 Autoround (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B q…
Llama.cpp server running ~2 weeks straight. Loses its mind? (www.reddit.com)

+110 6w llama

I’ve got Qwen3.6 27b and Qwen3.6 35b running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. is this a thing?
Small OpenCode plugin that helped me with broken tool calls from a local Qwen model (www.reddit.com)

+11 6w qwen llama

I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8. For reference my llama.cpp build is compiled with CUDA 12.9.
Is there a big gap between Q4 and Q6 on Qwen3.6? (www.reddit.com)

+1 6w qwen

I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high.
Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future. (www.reddit.com)

+9 6w gemma qwen llama

I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".
Automated AI researcher running locally with llama.cpp (www.reddit.com)

+13 6w ollama llama opus+1

Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure: https://github.com/huggingface/ml-int…
Turboquant+MTP for ROCm(Llama CPP) (www.reddit.com)

+32 6w llama

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.
- Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (www.reddit.com)
[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090 (www.reddit.com)

21 6w llama

DUAL 5090s!!! Absolutely amazing results with dual 5090s, basically doubling my tps.
Playing One Night Werewolf (Gemma4 & Qwen3.6) (www.reddit.com)

+84 6w llama

Finally feel like it’s possible. I have a custom build (vibe coded) UI on llama.cpp, allows model switching in the same chat.
Simpler self hosted alt to Open WebUI (www.reddit.com)

+116 6w chatgpt agentic

Got Qwen3.6 27B running on my newly assembled 4x 3090 rig (s/o 3090-club) and I'm trying to get the people in my house to adopt the local workflow. Open WebUI has improved a lot in the recent updates, but I still found it pretty rough for…
running Qwen 3.6 35b A3B on 2x 5060TI (www.reddit.com)

+312 6w qwen

I ran Qwen 3.6 35b A3B on two 5060TI 16gb (32gb vram; I also have 32gb dram but I don't like offloading). I used Q4 on LM Studio to get full context and I get 90t/s. Any tricks to optimize this more to upgrade to Q6 or Q8? Thanks!
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (www.reddit.com)

+5 6w moe gemma qwen+1

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…
MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) (www.reddit.com)

+5821 6w vllm qwen agentic+1

TL;DR Results from the title are for single inference with 2 prompt of 1k and 15k tokens. So no MTP (as it’s slower for big prompt), no DFlash (working too but slower for big prompt), no quant used (full precision wanted) and the results a…
Is it worth getting a 5090 for my needs? (www.reddit.com)

+222 6w

I'm considering biting the bullet and getting a pc with the following specs: 5090 Amd 9950x3d X870 motherboard 32gb ram (16x2) CL32 EDIT2: Price for this is falling in the arena of 5500-6000 USD where I live. Obviously costs a bomb.
qwen3.6 just stops (www.reddit.com)

+89 6w vllm qwen openai

https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens…
Can I improve performance for qwen 3.6 27b? (www.reddit.com)

+1 6w ollama qwen

Hardware OS: Windows 11 Pro 10.0.26200, Build 26200 CPU: Intel Core Ultra 7 270K Plus, 24 cores / 24 threads, max clock 3.7 GHz RAM: 32 GB DDR5 @ 5600 MHz, 2x16 GB Crucial CP16G56C46U5.C8D GPU: 2x NVIDIA GeForce RTX 3090, 24 GB VRAM each,…
Model for reverse engineering (www.reddit.com)

6w gemma

For a system with 4x RTX 3090: what's the best model you could use for reverse engineering C# code? Qwen3.5-122b-A10B?
Building the QWEN3.6 - Codex Bridge Further + Kindergarten Harness Reality Check (www.reddit.com)

+11 6w ollama qwen llama+2

I got a bit further with my harness for running Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool where every harness finding, modi…
Q: Does DFlash (and PFlash) work with Heretic models? (www.reddit.com)

+42 6w gemma

Z-Lab did some good work with speeding up output, while Luce managed to use smaller models of the same family to accelerate prefill... Since Heretic and other "smart ablation" tools can decensor a model, would they work with these multi-mo…
How many of you tried BeeLlama.cpp? How's it? Agentic coding possible with 8GB VRAM? (www.reddit.com)

+67 6w gemma agentic

We'll be getting those features (check bottom link) on mainline sooner or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs (and also big, large GPUs).
Qwen3.6:27b single-shot fixed a CSS UI bug that had Gemma4:26B doom looping uselessly for 15 minutes (www.reddit.com)

+421 6w moe agentic

Warning: long post ahead. On the bright side, it's 100 percent human-written, typos and all.
Local AI video pipeline review: Qwen3 27B beat Gemma 4 26B for tool calling (www.reddit.com)

+66 6w gemma qwen

Watched All About AI's 100% local Fireship-style video automation experiment over the weekend (link in comments). A few things worth flagging if you're trying the same stack.
I've seen a lot of folks ask "can local LLMs actually do anything useful?" (www.reddit.com)

+1641 6w qwen

And I'm here to share my experience. The answer is resoundingly 'yes'.
llama bench kv cache f32 error (www.reddit.com)

+13 6w llama

I did a quick google, but found nothing on this and I am scratching my head. Trying to do a llama-bench run with the kv cache set to f32 under Vulkan with a Strix Halo.
High VRAM local coding model — still Qwen 3.6 27B? (www.reddit.com)

+87 6w deepseek qwen opus

I’ve been using Qwen 3.6 27B and it’s amazing. Not exactly your Opus replacement, but great for small tasks and checking work.
Thoughts on "production" model setups (www.reddit.com)

+22 6w qwen

I've been working with Qwen 3.6 27B and 35B-A3B models and am pretty happy with them. The point I've reached now is how to split my use cases.
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com)

+319 6w vllm gemma copilot+2

I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.
New Qwen3.6 27b Autoround Quant (int4) Best Recipe (www.reddit.com)

+46 6w vllm

I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm.
Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM (www.reddit.com)

+11 6w moe agentic

Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable. Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL Why these models: Qwen2.…
MagicQuant (v2.0) - Hybrid Mixed GGUF Models + Unsloth Dynamic Learned Quant Configurations + Benchmark table with collapsed winners and more (www.reddit.com)

+2 6w llama

I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment.
MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp (www.reddit.com)

+911 6w llama mcp

I was wondering what will be the difference in results with flag: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 vs MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Results are quite interesting 49tok/sec without MTP vs 64 tok/sec with MTP. PC: RTX5090+128GB DDR5…
How do I use MTP? (www.reddit.com)

+57 6w llama

Hi, I'm trying to use MTP with llama.cpp, I built from source the mtp-pr, download an MTP model from huggingface https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf But when I run the model I have an erro…
Stop wasting electricity (www.reddit.com)

+10841 6w llama

Run on my rtx4090 llama.cpp params: llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144 Power limit was set using sudo nvidia-smi -pl N On my observation, GPU const…
Models and Quants quality test results - the chessboard svg (Qwen3.6 27B/35B-A3B/Zaya1) (www.reddit.com)

+133 6w qwen

According to this. I run several more tests to cover more models and quants.
Estimate inference speed of local Qwen3.6-35B on Mac M5... (www.reddit.com)

12 6w grok moe deepseek+2

"Based on currently available information, estimate the prefill/decode speed of Qwen3.6-35B-A3B Q8 with 262K context on a Mac M5 Ultra 128GB." I'm surprised that almost every LLM fails at this task (ChatGPT/Gemini/Grok/Claude/DeepSeek/Kimi…
New Qwen3.6 35B finetune - 0GM-1.0-35B-A3B-0427 (www.reddit.com)

10 6w

https://huggingface.co/0G-AI/0GM-1.0-35B-A3B-0427 So far it behaves better than for example Qwopus in terms of consistent answers. I've been testing Q6K from https://huggingface.co/mradermacher/0GM-1.0-35B-A3B-0427-i1-GGUF Also I checked the…
- Why is my Qwen3.6-35B-A3B so much dumber than Qwen3.5-35B-A3B? (www.reddit.com)
Will unsloth release MLX versions of the MTP qwen3.6 and gemma 4 models? (www.reddit.com)

+32 6w gemma

Question in title. Would be awesome to have this on macs, especially q8 or whatever the minimal-loss quant is, since macs can have lots of ram.
Does anyone else have issues with Qwen-3.6-27B stability in the Codex harness? (www.reddit.com)

+1 6w qwen llama codex

I run the 4 bit quant of Qwen-3.6-27B in the codex harness with unsloth recommended llama-server settings, thinking enabled. I have tried the default chat template and the updated ones and have updated both my GGUFs and llama-cpp to the mo…
Will there be any more Qwen3.6 series models? (www.reddit.com)

+4125 6w qwen

I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.
Anyone with 4x 5060ti based setups? (www.reddit.com)

+1016 6w qwen

I am currently running 2x RTX 5060 ti and happened across some good sales for additional ones coinciding with a really good sale of a highend Z890 motherboard (replacing my B860 board) that could support quad GPUs (with 2 M.2 adapters, end…
Does 'preserve_thinking' work with openwebui? (www.reddit.com)

+311 6w llama

I'm running qwen3.6-35b with llama.cpp connected to openwebui. And I noticed the model fails the number guessing game test on openwebui while it works perfectly with the llama.cpp web ui.
Don't you have issues in W11 with AMD GPU where llama.cpp suddenly drops performance for no reason ? (www.reddit.com)

12 6w moe qwen llama

I have this issue in all Windows installations I have done in my system, which of course, does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5 Issue is that for some reason, after sometime, llama.cpp performance cuts in half, even restar…
PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server (www.reddit.com)

+306 6w llama

Hey folks, just a heads-up for anyone running Qwen3.6 through llama-server. I ran into an issue where the preserve_thinking parameter wasn't working as expected, even though I had it explicitly enabled in my models.ini config.
Why is opencode so slow in processing the prompt with llama server? (www.reddit.com)

+650 6w llama

I'm running opencode and llama-server locally. I have 32gb ram and 780m igpu.
The Qwen 3.6 35B A3B hype is real!!! (www.reddit.com)

+20967 6w qwen

My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research. My research is on some pretty niche topics and I doubt that anything like it is…
rtx 5070ti with Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf token speed 564/41 (www.reddit.com)

3 6w moe

--model "/mnt/e/my-path-change-to-yours/qwen3.6-35b/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf" \ --ctx-size 262144 \ --parallel 1 \ --n-cpu-moe 29 \ --no-mmap \ --mlock \ --cache-type-k q4_0 \ --cache-type-v q4_0 10.8/16 d…
OpenClaw + oMLX shows 0 cached tokens, but Hermes uses cache fine with the same local model, what am I missing? (www.reddit.com)

6 6w openclaw qwen

Hey everyone, I’m trying to debug a weird prompt cache issue with OpenClaw + oMLX, and I’d appreciate help from anyone running local agents on MLX/oMLX. The short version is this: I’m running oMLX v0.3.8 on my Mac, serving: Qwen3.6-35B-A3B…
As of today, what's the *most stable* model to run on a 32Gb RAM Mac w/ 256k context? (www.reddit.com)

+1032 6w llama agentic

Hey everyone, I've been playing around with Gemma4 and Qwen3.6 on my 32Gb Macbook Pro M2 Max since their release but I'm struggling at finding: The best software to run it (oMLX, llama.cpp, ...) The best model + quant to pick The best sett…
MTP benchmark results: the nature of the generative task dictates whether you will benefit (coding) or get slower inference (creative) from speculative inference. No other factor comes close. (www.reddit.com)

+2712 6w qwen

I recently published MTP quants of Qwen 3.6 27B and I was surprised by the reports here on reddit, and on HF, of users who were experiencing worse speed with speculative inference than without. This did not match what I was seeing, but when…
Getting a feel for how fast X tokens/second really is. (www.reddit.com)

+9332 6w qwen

I love following all your adventures with local LLM setups. Quality and size of the models are important, but so is performance.
Building out my tool library, any recommendations? I just added email capability and im starting to get hyped! (www.reddit.com)

+52 6w qwen

I'm using OpenWebUI and making tools/skills to improve my model's functionality. I am currently using Qwen 3.6 35B A3B Q8 (F16) 256k. I grabbed `parallel tools` to be able to run multiple tool calls at once.
Speeding up local LLM for usable coding agent (www.reddit.com)

+616 6w qwen

TL;DR: Qwen 3.6 35B-A3B (Q4_K_M) is running slow at around 9 t/s with 72% filled context (36147 tokens window) and a total response time of 77s including prefill and token generation. Ran this using LM Studio on Windows with the attached i…
Hello from 10KM high! - Thanks to Qwen 3.6 35b a3b! (www.reddit.com)

+3413 6w qwen

Typing this on a cramped flight, but I was having issues connecting to the plane's wifi on my ubuntu laptop, when it was effortless on my phone. The issue I was having was the Laptop WiFi connected to the plane wifi network, but captive po…
Has anyone bought a 3080 20GB mod recently? (www.reddit.com)

+811 6w qwen

I think it would suit my needs perfectly, but I'm scared of getting scammed on Alibaba so looking for some sellers who have delivered. Follow-up question for those who have the card, how well does it run Qwen 3.6 27B?
am I running this llama-bench of Qwen3.6-27B on these V100s right? (www.reddit.com)

+210 6w llama

basically what I'm doing here is trying to validate whether or not it's a reasonable idea to get a couple of V100s, either SXMs with PCIe adapters or straight-up PCIe cards in the first place, for the sake of running this model or models l…
What are the best 40-500 B MoE LLM models now? (www.reddit.com)

7 6w moe gemma qwen

Due to an old GPU I run on CPU and came to appreciate the value of MoE. I know of MoE for Qwen 3.6 and Gemma-4, which are <40B.
Hugging Face co-founder says Qwen 3.6 27B running on airplane mode is close to latest Opus in Claude Code (www.reddit.com)

+3723 6w qwen opus claude-code

I'm keeping a close eye on the development of local llms.
Probe-Detected Grokking in Multi-Probe DPO (openinterp.org via hn)

+3 6w dpo

Probe-Detected Grokking in Multi-Probe DPO: Orthogonal Learning Beyond Task-Specific Detectors in Qwen3.6-27B. Abstract: We report a phase-transiti…
After you’ve setup local models, where can you find interesting apps that can use them? (www.reddit.com)

+610 6w openclaw

I have Qwen3.6-27B as my main model, I use it for coding with opencode and chatting with open-webui, yet to try out hermes or openclaw. I found out about their existence basically by searching or through reddit - but maybe there’s more tha…
9070xt inference for q3 qwen 27B (www.reddit.com)

+2 6w qwen llama

In llamacpp I'm getting 12tok/s, does this number look right to you and what can I do to increase this number (if possible)? cd ~/llama.cpp && ./build/bin/llama-server -m models/qwen-3.6-27b-abliterated-q3.gguf -ngl 999 -c 65536 (i need th…
BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) (www.reddit.com)

+2520 6w qwen llama

TL;DR New llama.cpp fork! I wanted a Windows-friendly inference to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled.
vLLM + NVFP4 + Qwen3.6 27B: "Checkpoint does not provide a q scaling factor"? (www.reddit.com)

1 6w vllm qwen

I have been trying various NVFP4 based variations of Qwen 3.6 27B, and I am seeing this for the ones that look most interesting to run on my 2x 16GB VRAM with KV cache fp8. vllm | (Worker_TP0 pid=136) WARNING 05-09 13:49:27 [kv_cache.py:10…
Should we use a non-thinking model for code after using a thinking one for plan? (Agentic coding) (www.reddit.com)

+2 6w agentic

I usually use Qwen3.6 27B (slow as heck on my RX 6800 but it works) for plan and Qwen3.6 35B A3B for the coding. But I was thinking the other day if I should remove the thinking from the code model.
More Qwen3.6-27B MTP success but on dual Mi50s (www.reddit.com)

+123 6w llama

TLDR: The hype is real! 1.5x speedup.
Pi and Qwen3.6 27B make setting up Archlinux really easy. (www.reddit.com)

+313 6w qwen

Just thought I'd share this use case. I was setting up a miniPC as a home theatre with Archlinux (It's the OS I'm most familiar with).
Show HN: Transformer Math Explorer (simonramstedt.com via hn)

+21 6w moe qwen

Interactive reference for transformer models, presented via dataflow graphs, drillable down to elementary mathematical operations. Covers models from GPT-2 to Qwen 3.6, with MLA, MoE, RoPE, MTP, hybrid attention, and other variants togglea…
potentially stupid problem trying to llama-bench Qwen3.6-27B across two V100s in llama.cpp (www.reddit.com)

+13 6w llama

this is almost certainly a skill issue, however: ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 1 --flash-attn 1 --device CUDA0,CUDA1 -p 2048 -d 4096,16384,65536 rather than splitting across those two cards, it firs…
Just got a 8x 32gb v100 server... now what (www.reddit.com)

+335 6w qwen llama

Looking for suggestions. Current setup llama.cpp and ran qwen 3.5 397b 256k context.
RTX Pro 4500 Blackwell - Qwen 3.6 27B? (www.reddit.com)

+214 6w qwen codex

I have a server running a 4500 Blackwell on CUDA 13.1 and nvidia/595.58.03 with 48GB mem assigned to it. I have build: dcad77cc3 (8933) with Qwen3.6-27B UD-Q5_K_XL loaded and connected it to Roo code.
Those of you who like Gemma4 models - how are you guys using them? (www.reddit.com)

+515 6w sonnet opus

I have been using local LLM for coding quite a lot as well as some other tasks (like data extraction from images) and I had quite a good success with Qwen3.6 models. It's obviously not Sonnet/Opus, but I am able to get quite a lot of work…
Qwen 35B-A3B is very usable with 12GB of VRAM (www.reddit.com)

+4713 6w moe qwen llama

Hardware: RTX 3060 12GB 32GB DDR4-3200 Windows CUDA 13.x Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.
Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090 (www.reddit.com)

+3842 6w deepseek llama

So I've been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use. So after a day of vibecoding I think I may have gotten something viable.
My severely discounted rusted (!!!) 2x 3090 + 128GB system is beginning to die — what's the most perf-equivalent, cost-effective (lol) next hop? (www.reddit.com)

16 6w

Just before RAMpocalypse, I built a 2x3090 using dirt-cheap humid-env-run GPUs and 128GB RAM on a Ryzen 5 9600X. I've been pretty happy with the local models it lets me run, spilling into RAM occasionally as needed.
Local LLM for electronics design work? (www.reddit.com)

+22 6w

Another hobby is working on electronics projects ranging from low-voltage control and signal processing to HV tube amp circuits. I design and simulate in LTspice before prototyping.
z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet? (huggingface.co via reddit)

+135 6w vllm gemma qwen

Past few days, it's all been about MTPs. Somehow people missed the fact that Z lab released the DFlash for Gemma4 26B a couple of days ago.
Comprehensive guide on renting/setting up beefy LLM server for local models? (www.reddit.com)

2 6w

Hey there everyone, I've been struggling to find an actual good guide that's not some fluffy video or AI slop on renting hardware from a service to run a local LLM with high token output. Before I invest in some serious hardware, I thought I…
Qwen3.6:27b vs qwen3-coder:30b vs deepseek-coder:33b on code gen, tool calling, and agent tasks (www.reddit.com)

6 7w humaneval function-calling ollama+1

Ran a full eval against four local models last weekend and the spread between them is wider than I expected. All running through Ollama on CPU, no cloud, same prompts, same hardware.
Support for spec prefill and spec decode on qwen3.6 model family (www.reddit.com)

+1 7w

Anyone familiar with getting both to work? I've got a few work systems and I want to make a case for inhouse data generation for the team, and I've got a very very crusty implementation going by putting a bifrost service on one of them, an…
I've created the fastest local AI engine for Apple Silicon. Optimised for agentic use. (www.reddit.com)

+83 7w agentic

For weeks I've been working on creating the fastest local AI engine for Apple Silicon... And I finally did!
Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK (www.reddit.com)

7w vllm qwen

TL;DR On 4× RTX 3090 with NVLink bonded between GPU pairs (0↔2 and 1↔3), pinning TP=2 to a NVLinked pair gave +25% throughput at concurrency 1 and +53% at concurrency 4 vs running TP=2 over PCIe. Adding the other two GPUs to make it TP=4 m…
Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB (www.reddit.com)

+120 7w qwen

I am a freelance developer. Qwen 3.6 27B is great on the 5060s but a bit slow.
Qwen 3.6 Looping with Tools? (www.reddit.com)

+111 7w qwen llama mcp

For some reason, my qwen started looping a lot recently, ever since I introduced MCP tool calls. I don't know why as I didn't really change anything other than that.
Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? (www.reddit.com)

+1 7w qwen llama

Ok so, I will try to explain myself as much as possible because online I really cannot find much about this. Let's start with my settings for running Qwen 3.6 35B: Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinkin…
Is it my imagination or... (www.reddit.com)

6 7w qwen llama

Is Qwen 3.6 35b now considerably stupider in the latest llama-server releases? I had this model doing cartwheels two upgrades ago.
Qwen 36 27B + Gemma 4 - the best set for 1x 3090 ? (www.reddit.com)

+1 7w gemma qwen

Hi guys 👋 When I started my adventure with Qwen 3.6 27B I felt wow.... Now when I connect it with Gemma 4 I'm feeling more wow...
Disappointed in Qwen 3.6 coding capabilities (www.reddit.com)

31 7w qwen llama codex

I know that coming from Codex I should adjust my expectations, but still. I'm working on a midsize project.
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide (www.reddit.com)

+25 7w moe llama opus

I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday.
Mimo2.5 (not pro) under llama.cpp? - primary model opencoder? (www.reddit.com)

+14 7w llama

I tried running AesSedai/MiMo-2.5-GGUF:Q4-K-M under llama.cpp (main tree, compiled 36hours ago) Hardware: nvidia A6000 with 48GB RAM + 300GB CPU RAM I had no success: error loading model: missing tensor blk.0.attn_q.weight ... Is Mimo alre…
why llama.cpp can’t combine speculative decode methods? (www.reddit.com)

+105 7w llama agentic

dicking around with the new mtp speculative decode with qwen3.6 27b, and it’s great. but for agentic coding i’ve seen significant improvements from ngram, because a decent fraction of the time (e.g.
MTP - The proofs in the puddin! Using it with Qwen3.6-27b (www.reddit.com)

+1 7w llama

Been running llama.cpp MTP with Qwen3.6-27B Q4_K_M as my daily coding assistant and got curious what was actually happening under the hood. Pulled the metrics from llama-server and charted a full session.
Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. (www.reddit.com)

+616 7w qwen

I’m looking for a tool or calculator that can estimate the minimum hardware needed to run a specific model locally. For example, I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds.
What models for coding are you running for a mid level PC? (www.reddit.com)

+13 7w moe gemma qwen

I have a 4060 (8GB VRAM) and 16GB of RAM and am wondering which models could fit in my setup for coding. The new Qwen 3.6 and Gemma 4 MoE models look good but might not fit; wondering about your experiences.
Fine-tuned Qwen3.6-35B-A3B DeltaNet experiment (www.reddit.com)

+23 7w moe

I fine-tuned Qwen3.6-35B-A3B on its own outputs for $7 on Apple Silicon + Modal. DeltaNet LoRA targeting was the hard part.
Has Qwen3.6-27B Surpassed GPT-5.5? (Not Joking) (www.reddit.com)

7w gpt-5

So I had this idea for a project which was to try to fix a pretty hard coding problem using local agents running in a loop. The project is a compiler for biology protocols from vendors.
Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development (www.reddit.com)

+26 7w qwen opus agentic

tl;dr - For software development, Qwen3.6 27B, 5090 gives you ~3x speed over M5 Max, letting you plow through code, while M5 Max gives you ~4x memory, letting you use higher quantization and bigger context. Which would you choose and why?
Get faster qwen 3.6 27b (www.reddit.com)

+4411 7w qwen llama

Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp. Thought I would share the knowledge. Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF and the am17an commit: /media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/ll…
Group cluster rental as a service (www.reddit.com)

7w qwen claude-code

With the explosion of apps like open claw, and the launch of my own app (trigger warning, not open source), there is massive demand for tokens. It used to be possible to avoid anxiety about your monthly bill by just buying a claude code su…
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results (www.reddit.com)

+149 7w llama

Following my previous post https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq, a few people asked for the 35B A3B version. The model is up on HuggingFace at https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF if anyone wants to ch…
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot (www.reddit.com)

+4 7w copilot llama chatgpt

Long post, but hopefully helps somebody. Llama-cpp vulkan server running single AMD R9700.
The GB10 Solution Atlas is now open source, the inference engine made for the community with breakneck inference speeds (Qwen3.6-35B-FP8 100+ tok/s) (www.reddit.com)

+4 7w vllm

Some of you saw our post a couple weeks back about hitting 102 tok/s stable on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: Github Atlas is open source.
Why people cares token/s in decoding more? (www.reddit.com)

+316 7w agentic

What I've noticed while using local LLM recently is that in most cases, bottlenecks occur not in decoding but in prompt processing. If the prompt processing speed is usable, in most settings (since it takes about 15k when starting based on…
open weights agents in pi/opencode, anyone else hitting endless loops with nested tool calls? (www.reddit.com)

11 7w qwen

Both opencode and pi coding work, but I've hit the same wall with open weights. Qwen 3.6 and even fine-tuned variants, they drift into loops once the tool calls get nested or ambiguous.
HOT TAKE: local models + agent harnesses are now capable enough to hand off junior-level IT professional tasks to [human written] (www.reddit.com)

+3023 7w agentic

This post will have a slight old-man-shakes-fist-at-sky vibe, because….well… I’m older, so if you’re not into that, then please feel free to skip it. I have been contributing to this sub for like 3 years now but I’m fearful this post will lik…
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM (www.reddit.com)

+225 7w vllm llama

So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF. This is not a "best possible setup" claim.
Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM RTX 5080)? (www.reddit.com)

+35 7w moe llama agentic

Hey folks, looking for advice before I delete or keep a huge model file. I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM.
Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR (www.reddit.com)

+2617 7w llama

Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF…
I am trying to replace Claude in an agentic TDD pipeline with local LLM (www.reddit.com)

12 7w agentic

Based on my last post and some comments, I added Qwen3.6:latest and Devstral to the evaluation. I am still looking for suggestions on which local model can run a complete TDD loop autonomously.
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints (www.reddit.com)

+14938 7w qwen llama agentic+2

WARNING: wait before downloading from HF: I just realised my upload of the new versions with the additional fix in the chat template has not completed yet. I will remove this warning once done. The recent PR to llama.cpp brings MTP support to Q…
5060ti 16gb or 5070 12gb for local LLM (www.reddit.com)

14 7w qwen

As the title says, which is better, taking into consideration that it will probably offload to CPU anyway? Models: Qwen 3.6 35b and maybe (I am not sure it will be usable) Qwen 3.6 27b...
Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? (www.reddit.com)

+46 7w qwen llama

How is this dual setup's performance? Is it difficult to set-up everything with for example llama.cpp?
Best local model for MBP 48GB UM (www.reddit.com)

2 7w glm openclaw qwen

I have been toying with GLM 4.7 flash mlx a while ago using lmstudio. I had integrated it successfully with openclaw and it was kinda stable in tool calling.
Solidity LM surpasses Opus (www.reddit.com)

+1 7w opus

My weekend project overran a little but happy with the end result. soleval pass@1 beat Opus 4.7 on the same set of tasks.
Qwen 3.6 and inline comments (www.reddit.com)

+2 7w qwen

I've been using Qwen 3.6 with the Pi harness, and so far I'm really enjoying the experience. I've noticed Qwen is great at leaving inline comments when writing Typescript (haven't tried other languages).
Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) (www.reddit.com)

+8944 7w qwen

The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB VRAM setup.
Thinking mode is becoming a liability for production agents (www.reddit.com)

+24 7w

Every new model release I see now has thinking on by default. But then the production results I'm seeing don't justify it.
Qwen 3.6 27B MTP on v100 32GB: 54 t/s (www.reddit.com)

+42 7w copilot qwen llama

Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on v100 32GB SXM module using one of those pcie card adapters. Pulled and built in one shot, and llama-server ran without a hitch.
What do you use Gemma 4 for? (www.reddit.com)

+514 7w gemma qwen agentic

Both Gemma 4 and Qwen 3.6 seem to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks.
Claude Code @ Opus 4.7 vs OpenCode @ qwen3.6:27b. Both shipped a playable cozy roguelite. (www.reddit.com)

+1 7w opus claude-code

Gemma4:31b-coding-mtp-bf16 - slow on Macbook M5 128gb (www.reddit.com)

+42 7w ollama gemma llama

Very quick initial test of Gemma 4's new MTP model via Ollama (llama.cpp doesn't support it yet) https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/ Running in Open Webui to view token/s output and I…
Smaller gguf getting way less tokens per second?? So confused! (www.reddit.com)

+37 7w moe llama

Noob here, Running Qwen3.6 35B A3B in LM Studio on a 3080 10GB + Ryzen 5 3600 on Windows 10. Tried some unsloth quants with identical settings (GPU offload 40, MoE layers to CPU 40, context 8192, flash attention on).
My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) (www.reddit.com)

+211 7w llama

UPDATE: i have switched to vulkan (image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9014) and now i am getting prompt eval: 591.01 tok/s generation: 41.90 tok/s which is faster than rocm new config: services: llama-cpp: container_name: lla…
Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. (open.substack.com via reddit)

+237 7w gemma qwen

Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article...
[Benchmark] Llama.cpp: Mac vs CPU vs GPU + CPU, Qwen3.6 27B, Q8 (www.reddit.com)

+1 7w llama

llama.cpp parameters: -c 260000 --jinja --no-mmap model: HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced:Q8_K_P Based on…
Use Qwen3.6 right way -> send it to pi coding agent and forget (www.reddit.com)

+54 7w

Just a reminder, the harness you use can make a huge difference (your LLM client and interface, basically). It's way mo…
Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark (www.reddit.com)

+1129 7w moe llama

Just ran some llama-bench comparisons between ROCm and Vulkan backends on my Strix Halo system. Vulkan came out ahead, which surprised me.
Preserve thinking on or off? (Qwen 3.6) (www.reddit.com)

+1011 7w qwen

Are y'all using the preserve thinking flag or do you have it off? If so, why?
Qwen3.6 merged chat template from allanchan339 and froggeric (www.reddit.com)

+3711 7w opus

Hi, recently froggeric and allanchan339 released enhanced/fixed template for Qwen3.6 each one addressing different topics. I didn't know which one to use so I merged both with the help of Claude Opus to have the best of both.
Thoughts on GRM-2.6-Plus-GGUF ? (www.reddit.com)

+22 7w qwen

Judging by what they state, it should be better than Qwen 3.6 27B
MacBook Pro M1 (64GB) + VSCode + Roo + LM Studio + Qwen3.6-35B-A3B-Q6_K.gguf = 😞 (www.reddit.com)

+12 7w

I've tried the setup in the title today for some vibe coding (ctx=262144, temp=0.6). I must be doing something wrong because it doesn't really work for me.
Considering two Sparks for local coding (www.reddit.com)

+422 7w minimax

I'm currently running a 4x RTX 3090 system (96GB VRAM, DDR4 2133 RAM) and have tested opencode and pi.dev using Qwen3.5-122B-A10B (AWQ) up to 200k context for web app coding (html/js/python). I'm now seriously considering picking up two Sp…
Best config for Qwen3.6? (www.reddit.com)

+17 7w agentic

With all the high praise for the model all around, I also want to try it on my own. I have an rtx3060 12gb vram and 16gb system ram.
Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching (www.reddit.com)

+12 7w deepseek qwen llama

Hey everyone, I’ve been experimenting with running Qwen models locally on my setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB CPU: Ryzen 5700X OS: Windows 11 What I’m currently running Qwen 3.6 35B (UD Q4_K_M) llama-server.exe -m "C:\Users\Dino\…
Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB (www.reddit.com)

+3626 7w qwen agentic

----START HUMAN TEXT---- Hi all, I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with quantized KV wil…
qwen 3.6 27B looping problem (www.reddit.com)

+1 7w gemma qwen llama

Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…
Mtplx – 2.24x faster TPS – The native MTP inference engine for Apple Silicon (github.com via hn)

+2 7w

Native MTP speculative decoding on Apple Silicon ~2.24× over no-MTP AR at temp=0.6 on Qwen3.6-27B · math-correct rejection sampling · MLX-native · zero external drafter Multiplier is hardware-independent. Absolute tok/s scales with memory…
How do you estimate total memory usage? (www.reddit.com)

+41 7w llama

Qwen3.6 35B A3B UD IQ4_NL_XL. 512k context tokens for 4 parallel processing, key cache quantized to Q_8 and value cache quantized to Q_4.
Best Llama Config for Turboquant_Plus? (Stats below) (www.reddit.com)

+1 7w moe qwen llama

So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.
The more I use it, the more I'm impressed (www.reddit.com)

+1912 7w qwen codex opus

Qwen 3.6 27b vs Codex GPT 5.5 / Claude Opus 4.7. My local LLM discovered a bug that they both missed, and it turns out it's critical. GPT 5.5 and Claude both stood their ground and didn't give up until the end - they claimed to be right all a…
Sglang is better for serving a model for a personal agent harness? (www.reddit.com)

3 7w vllm

If one has enough vram, would Sglang be a superior choice than vLLM or llamacpp in terms of inference speed for serving a model dedicated to powering a personal (single user) agent harness like Hermes agent? Sglang has MTP for speculative…
Deep research + report "a la McKinsey" with Hermes Agent and qwen3.6-35b-a3b Q6_K. (github.com via reddit)

+6 7w qwen

Hi there. Not native English speaker.
Kvaser - Moving beyond simple agents: Building a Local-First AI Orchestrator with Qwen 3.6, Kiwix, and Wolfram (www.reddit.com)

+1 7w model-context-protocol rag qwen+1

For the past two weeks, I’ve been spending 4–5 hours a day building a custom MCP (Model Context Protocol) orchestration server. What started as a simple experiment with Qwen 3.6 35B has evolved into a full-scale "Man-in-the-Middle" proxy t…
Local Harness Benchmark: Pi Coding Agent vs. OpenCode with Qwen3.6 35B A3B (grigio.org via hn)

+1 7w

In the rapidly evolving landscape of AI-assisted development, choosing the right "harness" is as critical as choosing the model itself. This benchmark explores two of the most prominent open-source harnesses for local LLMs: Pi Coding Agent…
Advice needed on eGPU and Mini PC (www.reddit.com)

+15 7w vllm moe qwen

Hi all, I came across a relatively niche problem and could not find many useful posts or guides about it. I have a mini pc (Beelink Ser 8, 8745HS and 32GB 5600 DDR5 SODIMM) headless server for hosting some routing services, and I am wonde…
Qwen 3.6 35B MoE at full 262K context on an RTX 3090. Here's exactly how I did it. (low.li via reddit)

7w moe qwen llama

I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights. KV cache at q8_0/q8_0 and 262K context only uses 2.7GB because MoE only has 10 attention layers out of 40. Total VR…
Does running a model (like qwen3.6-27b) on vllm or transformers use less VRAM than llama.cpp? (www.reddit.com)

5 7w glm vllm qwen+1

I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…
First time GPU buyer. Got a RTX 5000 Pro. Was it a bad decision compared to two 3090s? (www.reddit.com)

+1129 7w

I’ve run models exclusively on apple silicon up until now, but wanted to up my inference game. I bought a slightly used RTX 5000 Pro Blackwell for a bit more than twice as much as two 3090s.
General vs Reasoning [Qwen 3.6] (www.reddit.com)

6 7w qwen

I want to play with Qwen 3.6. Unsloth shows 4 different parameter options for different use-cases.
Anyone tried +- 100B models locally with foreign languages? (www.reddit.com)

+56 7w glm gemma qwen

I am quite curious as I tried Gemma 4 31B, Qwen 3.6 27B, GLM 4.7 30B and some others in my native language (Czech). Gemma performs "best" and considering the fact it's "just" an 18GB model - it actually blows my mind how well it can respond in…
Using ollama for Openclaw (www.reddit.com)

6 7w ollama openclaw

Hi all, I have recently installed openclaw on a Raspberry Pi 4, linking it to my local Ollama instance (RTX 3090 with 24GB, as well as 96GB of DDR5 RAM bought before the madness), in my case running Qwen3.6 (latest) capped at 16k context. A…
3xR9700 for semi-autonomous research and development - looking for setup/config ideas. (www.reddit.com)

+43 7w qwen llama

Hello everyone. Over the last couple months I have been assembling my local AI setup for personal use, and I thought to write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.
If you've been waiting to try local AI development, please try it (www.reddit.com)

+2821 7w llama cursor claude-code

I have snobbishly long felt that the local models were not 'up to my standards' for local development, or otherwise able to compete with GHCP, Claude Code, Cursor etc. Boy was I wrong.
Qwen 3.6 seems to have a lot of trouble with tool calling (www.reddit.com)

+133 7w ollama qwen codex

(I'm on Windows system running these models locally) I've used both Codex and OpenCode with Qwen 3.6 27b and 35b running locally. I'm having a bitch of a time getting them to correctly create files.
Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... (www.reddit.com)

+1532 7w opus

I've had better results quality wise with 35B AND it's much faster than 27B. Just curious cause I see lots of people post about 27B.
I made a visualizer for Hugging Face models (www.reddit.com)

+8810 7w gemma qwen

I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different mod…
What could they mean by "warmed steady-state"? (www.reddit.com)

+1 7w llama

https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/ Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, v…
Need advice on Qwen 3.6 27B INT4 quantization (www.reddit.com)

3 7w vllm qwen llama+1

Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.
Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. (www.reddit.com)

+138 7w qwen llama

I wanted to share an open-source app that I built for running LLMs locally on my setup. My setup Hardware FEVM FAEX1 (128GB) RTX Pro 5000 Blackwell (48GB), connected over OCuLink Aoostar AG02 2x2TB internal m.2 drives on raid-0 using mdadm.
Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real. (www.reddit.com)

8 7w vllm gemma qwen

Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants)…
Kv cache quantization: ignorance, or malice? (www.reddit.com)

+1533 7w vllm qwen agentic

I run Qwen-3.6 27B FP8 on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between spee…
Is it worth adding local LLM to agentic coding stack? (www.reddit.com)

+110 7w qwen codex agentic

Hey all, my agentic coding stack includes claude-code 20x max and codex 20x max. I use heavy scripting for orchestrating and testing multiple projects; been AI coding for 3 years.
5070 Ti —> 3090 move. Worth it? (www.reddit.com)

25 7w

I got into LLMs late 2024, and local in Jan 2025. since then, I’ve upgraded my mini PC then added eGPU with 5070 Ti back when it was retailing for $750-$800.
What's your tps on 3090 + Qwen 3.6 27B in real tasks? (www.reddit.com)

+614 7w qwen llama

I struggle to wrap my head around all this. My goal is local agent to solve low complexity tasks, in the same harness where I would use frontier models.
We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local (www.reddit.com)

+489 7w tool-calling ollama opus+1

LDR maintainer here. Thanks to the strong support of r/LocalLLaMA community LDR got very far.
Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer (www.reddit.com)

+104 7w vllm openai

The angle here is native Windows, no WSL. Simple installation, open source, no telemetry.
Create Plan.md with Claude Code Opus, Execute Plan.md locally in Open Code using Qwen 3.6 27B Q8 (www.reddit.com)

+13 7w qwen opus claude-code

Does anyone do this? Any tips?
Have Qwen said anything about further Qwen 3.6 models? (www.reddit.com)

+2611 7w qwen

Have Qwen hinted at whether other models (9B, 122B, 397B) would be getting the 3.6 treatment? Or have they in any way confirmed or hinted at "this is it"?
"LLM is created so engineer don't have to write a report", anyway found out ONLYOFFICE can connect to OpenAI compatible, using Qwen 3.6 to do elaboration. (www.reddit.com)

+21 7w gemma qwen openai

It is a plugin made for ONLYOFFICE, much simpler than copy-pasting from the web UI. PS.
Qwen3.6-27B-NVFP4 - images (www.reddit.com)

+7 7w llama

Model: Abiray-Qwen3.6-27B-NVFP4.gguf Specs: - Legion 7i Gen10 - NVIDIA GeForce RTX™ 5090 - Intel® Core™ Ultra 9 275HX × 24 - RAM 32.0 GiB llamacpp settings: ./build/bin/llama-server \ -m ~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GG…
- Qwen3.6-27B - Closed-loop SVG Images (www.reddit.com)
Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver (www.reddit.com)

+2044 7w gemma copilot qwen

So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up…
4080 Super > RTX 6000 Pro, Wow! (www.reddit.com)

36 7w qwen

A friend is going on vacation for a couple weeks and is lending me an RTX 6000 Pro rig to mess around with. Holy cow, it is so much faster than my 4080 Super!
Which other models will my system support? (www.reddit.com)

10 7w llama

This is my system: OS: Nobara Linux 43 Processor: Ryzen 9 5980HX RAM: 16 GB GPU: Radeon RX 6800M (12GB) I'm using llama.cpp and Qwen3.6-35B-A3B-UD-Q4_K_M is working okay in this system using vulkan. I'm getting a speed of ~17 t/s.
PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (www.reddit.com)

+2 7w llama

Hey fellow Llamas, thank you for all the nice words and great feedback on the last post I made. We have something new we thought would be useful to share.
Need help optimizing qwen 3.6 on my 2x 5060ti 16gb (www.reddit.com)

+11 7w ollama qwen llama

Hi all, I tried to set up my PC to run LLMs but ran into an issue: the first question of the chat is generally fine, but from the 3rd follow-up question, the backend often becomes unresponsive and I have to manually restart the llama cpp server, or…
Model stuck in some thinking zone where it keeps saying a similar thing again and again (www.reddit.com)

+12 8w gemma

I experienced this with Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. It starts saying things which sound similar in thinking mode: I must do ....
love it - Qwen3.6-27B — UD-Q5_K_XL evaluation (www.reddit.com)

+31 8w agentic

by Kyle Hessling A hands-on benchmark of the Unsloth dynamic Q5 quantization, self-hosted on a single RTX 5090. 19 runs, 93.9 k generation tokens, across agentic reasoning, production-grade front-end design, and canvas / WebGL creative cod…
Qwen 3.6 27B Neo Code Q4 KM I matrix is badass (www.reddit.com)

+32 8w qwen

So I am using this model in tax accounting. I have a shitty Ryzen 9 7940HS (8C/16T), 60 GB RAM, Radeon 780M iGPU, 1 TB Kingston NVMe, Win 11 Pro.
Looking for feedback: using Ollama with local Office/PDF files in a desktop app (www.reddit.com)

+15 8w ollama qwen

I’m building OpenYak, a desktop AI workspace for using local models with real files on your computer. In this demo I’m using Ollama with Qwen/Qwen3.6-35B-A3B to review an attached budget workbook.
What is best code editor for local LLM deployment (LM Studio, llama.cpp) as of May 2026? (www.reddit.com)

10 8w gemma qwen llama+3

Hello folks What is best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (www.reddit.com)

+10834 8w gemma qwen

Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM. Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens.
Why is Step-3.5-Flash (196B-A11B) much cheaper to run than Qwen3.6-35B-A3B? (www.reddit.com)

+11 8w

Surely a x4 bigger model should be more expensive for inference?! API prices at e.g.
Qwen 3.6 - Loops and repetitions (www.reddit.com)

+24 8w qwen

I normally seldom experience loops, either reasoning or responses, using Qwen 3.6 27B Q8 with 256k context window in Agent Zero. But the 35B A3B Q8 with 256k context window gets constant loops and is basically unusable within Agent Zero.
What's the best suscription under 20$? (www.reddit.com)

+21 8w minimax deepseek qwen+1

I’m pretty overwhelmed. I feel like there are so many options that I don’t know which one to choose, and trying things until I find a decent one isn’t really my thing—even though I enjoy it.
Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops) (www.reddit.com)

+25 8w gemma qwen

I've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking".
Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix) (www.reddit.com)

+207 8w vllm

Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS). We’ve been pushing further on both context length and stability for tool-agent workloads.
Long-context coding on RTX 5080 16GB: Qwen3.6-35B-A3B holds 30 t/s at 128K (89 t/s fresh), no quality drop (www.reddit.com)

+4 8w llama anthropic claude-code

I wanted to see how much of my coding-agent workflow I could move local instead of paying for hosted tools forever. There was another push: Anthropic's own April 23 postmortem confirmed product-layer regressions through March/April.
Would implementing a dual GPU configuration enhance the TPS? (www.reddit.com)

+115 8w qwen

I am currently utilizing a single RX9070 16GB, achieving a performance of 20 tokens per second with Qwen 3.6 27B. Would integrating an additional RX9070 enhance this performance, or would the output remain consistent?
Are Qwen 3.6 27B and 35B making other ~30B models obsolete? (www.reddit.com)

+1338 8w gemma qwen

Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows.
12GB-Club: 4070S qwen3.6 27b + 35b a3b, and Gemma 4 26b a4b + 31b speeds (www.reddit.com)

+72 8w cline gemma

Longtime lurker here, thought i should post my speeeeds... I have a RTX 4070S 12 GB Vram (+10% OC), AMD 9800x3D with 4x16 Gb DDR5 6000Mhz CL30.
Qwen3.6 27B seems struggling at 90k on 128k ctx windows (www.reddit.com)

+513 8w llama

I have an RX 7900 XTX running Qwen3.6 27B Q4_K_XL. Got 400ish pp and 30s tps.
Actual comparison between locally ran Qwen-3.6-27B and proprietary models (www.reddit.com)

+3 8w qwen

Hey y'all! I've recently written a text in Russian about my experience comparing Qwen-3.6-27B with lower tier cloud models on hard tasks -- I wanted to share the translation of the post, since I found the results interesting and surprising.
Qwen-27B as a Local Agent — It Actually Works Now (www.reddit.com)

+8 8w gpt-4 qwen

It's been a busy week testing and trying to get the 27B model set up correctly. TL;DR: The only setup that worked for my dual 3090s was this one.
Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer (www.reddit.com)

8w vllm moe agentic

I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops. This isn’t a criticism of Qwen3.6.
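As an illustration of the kind of check a proxy-layer guard could run on streamed reasoning text, here is a minimal sketch. It is not the post's implementation: the function name `has_runaway_loop`, the word-level n-gram approach, and the default thresholds are all assumptions chosen for the example.

```python
from collections import Counter


def has_runaway_loop(text, ngram_size=8, max_repeats=4):
    """Return True if any word n-gram repeats at least max_repeats times.

    Hypothetical heuristic: a proxy could call this on the accumulated
    reasoning text and abort or retry the request when it returns True.
    """
    words = text.split()
    if len(words) < ngram_size:
        return False
    # Count every sliding window of ngram_size consecutive words.
    counts = Counter(
        tuple(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    )
    return max(counts.values()) >= max_repeats
```

The thresholds trade false positives against detection latency: smaller n-grams or lower repeat counts trip sooner but may flag legitimate repetition, such as code with many similar lines.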
Tenstorrent TT-QuietBox 2 Specifications (Blackhole) (www.reddit.com)

+5 8w minimax qwen

Source: https://docs.tenstorrent.com/systems/quietbox/quietbox-bh-2/specifications.html. Currently supported models: https://tenstorrent.com/developers. From the specification docs above: CPU: Ryzen 7 9700X 65W Granite Ridge 3.8GHz; Memory: 2…
Qwen3.6-27B 4.256bpw in full VRAM on a 5070 Ti with 50000 q4_0 context - not turbo! (www.reddit.com)

+2112 8w qwen

Hugging Face link here. I've been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs, as his Qwen 3.5 was my GGUF of choice.
Larger Gemma-4/Qwen3.6 (www.reddit.com)

+1711 8w moe gemma

Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?
Qwen Models are such good models? (www.reddit.com)

+713 8w moe qwen

https://preview.redd.it/o1uxb57u47yg1.png?width=862&format=png&auto=webp&s=d38204fe6ccd0d8326dcd98a534e9a226d213f99 How trustworthy is the Artificial Analysis Intelligence Index? According to them, Qwen 3.6 27B is better than bigger MoE mod…
Strip Qwen3.6 dense of its multimodal capabilities (www.reddit.com)

+814 8w moe

This may be naive but if we stripped a model of its image processing/voice processing capabilities, can it make it smaller or faster? Is that even possible?
Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max (www.reddit.com)

+101 8w qwen

Followup to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point.
Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop (www.reddit.com)

+34 8w llama

Hi all, I'm running Qwen3.6-27B-UD-Q6_K_XL.gguf using llama swap and llama-server with these parameters (actually stolen from some posts on this subreddit): llama-server \ -m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \ --mmproj /models…
Don't forget about dem free gains! (www.reddit.com)

+12 8w llama

Looks like progress has been made on -sm tensor. Couldn't even run llama-bench a few weeks ago: 1 card - 1580/44: $ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB): Device 0: NV…
Sorry if it's not the best place to ask this, of the models in the image, which is the best for (problem solving)/Coding and the best one for studying (ask LLM concepts) ? My PC build is RX 9060 XT 16GB + I3 12100F + 16 GB DDR4 + llama.cpp with Vulkan backend + Linux Mint. (www.reddit.com)

+58 8w moe qwen llama

I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…
llama.cpp benchmark native vs. non native NVFP4 on Blackwell - summary (www.reddit.com)

+175 8w llama

I tested two llama.cpp builds on the same Qwen3.6-27B-NVFP4 model. llama-bench reports the model label as qwen35 27B NVFP4, but the actual tested model is Qwen3.6-27B-NVFP4.
How do you objectively tell if your custom agent tools are actually better? (www.reddit.com)

+2 8w

I've been running Qwen3.6-35B-A3B locally in pi agent and hit a cat spam problem. The agent just ignores the read tool, and the model gets stuck reading the same file 3-4 times using cat, or dumping entire 2k-line logs instead of grepping.
What it feels like to have to have Qwen 3.6 or Gemma 4 running locally (www.reddit.com)

+164 8w mistral gemma qwen

Well, or pretty close to it; they are excellent workhorses. I run them in real work scenarios doing some of the work I used to do myself as a skilled expert in my field, billing $200 an hour.
Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working (www.reddit.com)

+64 8w vllm

I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards. Hardware: 2x RTX 5060 Ti 16GB (32GB total VRAM), Proxmox LXC, 16 vCPU, ~60GB RAM, CU…
Ran my own benchmark Qwen 3.6 35B vs Gemma 4 26B.... theres a clear winner here (www.reddit.com)

7 8w hallucination gemma qwen

Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989? According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe…
3.6 27B Tool Calling Issues (vLLM) (www.reddit.com)

+1 8w vllm qwen

Has anyone got a reliable vLLM recipe for 3.6 27B that fixes the tool calling issues? I am getting "Not let me..." - then nothing.
Field report: Qwen 3.6 27b on an M2 Macbook Pro with 32GB RAM (www.reddit.com)

11 8w qwen

This post is a lot shorter than my 35B-A3B field report because almost everything is the same. But if you want to know how to reproduce it, see my earlier post.
Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) (www.reddit.com)

+12 8w qwen llama

Hello, I would like a suggestion from those who are already actively involved in this world. Basically, I own this workstation: Ryzen 9 5900X, 32GB of DDR4 RAM, RTX 5060Ti, PCCOOLER CPS YS1000 1000W. Currently, I can quite easily code with Qwe…
Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK) (www.reddit.com)

+14 8w llama

Hey all, I'm having a crisis that I just can't figure out... I've used Qwen3.6-27B-GGUF:UD-Q8_K_XL ever since it came out (on a DGX SPARK) and it worked like magic with decent performance (~50 t/s). I'm updating SPARK and llama.cpp on a daily basis…
Qwen3.6-27B created this Open Webui tool (www.reddit.com)

+22 8w qwen gemini chatgpt

I usually go for Claude for those kinds of Open WebUI tool creations, but rate limits are getting tight so I decided to just let Qwen3.6-27B-Q5 handle it through Open WebUI. It did it in one shot.
Gemma4-31B-3bit-mlx · Hugging Face: 3 & 5 mixed quant for RAM poor Mac users. (huggingface.co via reddit)

+52 8w

Just dropped another 3&5 mixed quant for the RAM Poor Base-model-only Mac users who want to try the top-of-the-line Gemma4 LLM. 6GB smaller than the other 3bit-mlx out there and 25% faster.
Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max (www.reddit.com)

+159 8w qwen llama

Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K.
Devstral Small 2 24B vs Qwen 3.6 27b or both? 1x 3090 (www.reddit.com)

+24 8w qwen

Hi, I've got 1x 3090 and I'm thinking about both of these models. I've been using Qwen since Friday and this model is amazing!
I'm Not a Dev But I Use Qwen 3.6 35b to Code (www.reddit.com)

+247 8w glm qwen

Full disclosure: I used to program a bit, but I was garbage at it so I found a new career. This was eons ago so I'm not a dev, obviously.
Qwen3.6-27B IQ4_XS FULL VRAM with 110k context (www.reddit.com)

+1 8w llama

Qwen3.6-27B IQ4_XS Bloat: Reverting llama.cpp commit saves 16GB VRAM (14.7GB vs 15.1GB) + KVCache Tests With the release of Qwen3.6-27B, I noticed that compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 vers…
Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com)

+9035 8w humaneval function-calling qwen+1

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval (code generation), HellaSwag (commonsense reasoning), BFCL (function calling). Total samples: HumanE…
Anyone tried Qwen 3.6 27b on the r9700 yet? (www.reddit.com)

1 8w qwen

The memory bandwidth on the r9700 looks quite good compared to my Mac or a Strix Halo and I'm wondering how this turns out. Thanks!
Qwen 3.6 27b S2 Opus + GLM + Kimi (huggingface.co via reddit)

1 8w glm qwen opus

My first time releasing a fine-tune publicly! If anyone wants to independently eval against base, that’d be awesome.
[7900XT] Qwen3.6 27B for OpenCode (www.reddit.com)

+63 8w moe llama

I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. The VRAM is a little bit scarce, but I ended up with this so far: llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \ --port 8080 \ --host 127.0.0.1 \ --t…
anyone know where to use qwen 3.6 27b via api/coding plan? (www.reddit.com)

+315 8w qwen

I want to test this model out but I don't have a setup that can do it locally. openrouter and all my coding plans don't include it.
Power-limit vs TG/s for 2x3090 (www.reddit.com)

+12 8w vllm openai

Trying to find the sweet-spot to tradeoff between power and tg/s. 250W seems to be a sweet spot for Qwen3.6-27B.
I test'ed the number of Ll's in Qwen 3.6 35B.. It required 3 tries (www.reddit.com)

2 8w qwen

How many ll's are in Stargate's TV Show's leader? The answer depends on which "leader" of the Stargate TV series you're referring to, as command changes throughout the franchise: General George Hammond (Seasons 1-3…
Qwen 3.6 27B (IQ3XXS) vs 35B A3B (IQ4XS)? (www.reddit.com)

+12 8w openclaw qwen

Just was wondering what people feel is better. I do need 262K context so these are the biggest quants of each I can fit on my 3090 with KVcache at Q8.
how fast can qwen3.6 35b get (www.reddit.com)

+13 8w openai anthropic claude-code

i wanted to see how fast i could make qwen3.6 35b run on a single h100, so i put together a sglang setup for it. it exposes an openai compatible api and also works with claude code through anthropic compatible routing from the connect tab.
Local model on coding has reached a certain threshold to be feasible for real work (www.reddit.com)

+3010 8w moe qwen

We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. Best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout — the same constraint the public…
Last llama.cpp update broke web search tool calling with Qwen 3.6 27b. (www.reddit.com)

+24 8w qwen llama

At least in open-webui. Nothing has changed except for the backend update.
Are OSS runnable model good now? (www.reddit.com)

4 8w copilot qwen gemini

Hi, I currently have access to 2–3 RTX 3090 GPUs (ideally I’d like something that runs well on 2). I can install models up to around 100 GB in size.
Is mlx-optiq legit? Has anyone tested the new quants for Gemma4/qwen3.6 yet? (www.reddit.com)

+1 8w gemma

https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-31B…
Qwen 3.6 27B on Strix Halo 128GB: any experiences? (www.reddit.com)

+1016 8w qwen

I'd jump on runpod and ssh in to test my workloads, but they don't have it. Would love to know how well this runs, particularly as context approaches a full 256K.
For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. (www.reddit.com)

+59 8w vllm llama

I have a 4 x R9700 system on Threadripper pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking any new model I try out with llama-benchy so that I can get a better idea of how models of diff…
Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 (www.reddit.com)

+5015 8w humaneval

Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding.
- which is faster and better for coding? Luce-Org/Dflash or noonghunna/qwen36-27b-single-3090 (www.reddit.com)
GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B (www.reddit.com)

+111 8w llama

Hi folks, Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general purpose - it's able to solve puzzles correctly more often too. The initial intent was to optimise the 35B-A3B reasoning traces since it's the most efficien…
Question regarding 4 t/s Qwen 3.6 performance (www.reddit.com)

10 8w qwen llama

I am getting 4 t/s with Qwen3.6-27B-Q4_K_M which seems much slower than I'd expect. I am running LM Studio on Ubuntu 22.04 with the following specs: Dell Precision 5690 AI-ready workstation NVIDIA RTX 5000 Ada Generation GPU with 16GB VRAM…
Agents for end-to-end document redaction and review tasks (OCR and PII identification - Qwen 3.6 vs closed-source comparison) (www.reddit.com)

+12 8w qwen agentic

(Links to all files, apps, and repos mentioned in this post can be found in the 'full post' link in my first comment) Agents for document redaction and review tasks Document redaction tasks involve text and vision capabilities, and long co…
Why are there so few small local creative writing models from the Chinese? (www.reddit.com)

+334 8w mistral gemma qwen+1

At this moment, the models such as Qwen 3.6 35b/27b crush the competition, yet I can't help, but notice this pattern. While the local RP scene is abundant with the Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…
PI agent integrated with Cline-Kanban repo: All using PI and Qwen 3.6 35B MOE UD 4K_XL (www.reddit.com)

+22 8w cline moe qwen+3

Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…
Simple to use vLLM Docker Container for Qwen3.6 27b with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s (github.com via reddit)

+115 8w vllm

Qwen3.6-27B vLLM Docker: Docker-based vLLM serving for Qwen3.6-27B with Lorbus AutoRound INT4 quant and MTP speculative decoding. Model is downloaded at runtime and stored on a host volume so the container can be upgraded without redownload…
Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B (www.reddit.com)

+2 8w llama

Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice speed increase for working on the same cod…
Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP? (www.reddit.com)

+28 8w vllm llama

I'm a daily llama-cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM.
What is the best coding agent (CLI) like Claude Code for Local Development (www.reddit.com)

+520 8w llama claude-code

Hey all: I am trying to set up claude code to work with llama.cpp, I am using the Qwen3.6-35B-A3B. I usually use claude code + ZLM subscription i got lucky with $30 yearly - the set up is very simple with their automated script, but for th…
Qwen 3.6 27B in Claude Code says it will do something then stops and prompts for user reply (not failing a tool call) (www.reddit.com)

+714 8w vllm qwen claude-code

I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool-call-parser qwen3_xml \ --enable-prefi…
VSCode and agent integration (www.reddit.com)

8w copilot qwen llama

I've been using VSCode with Github Copilot for a bit (free tier) and looking to try running locally due to running in to all of the limits with GHCP. I'd like to have as close of an experience as possible with both code autocomplete and ch…
Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model. Best 35B I have found! (huggingface.co via reddit)

+478 8w qwen

Been using this for a few days. It is BY FAR the best uncensored model I have found for Qwen 3.6 35B.
[Qwen3.6 35b a3b] Used the top config for my setup 8gb vram and 32gb ram, and found that somehow the Q4_K_XL model from Unsloth runs just slightly faster and used less tokens for output compared to Q4_K_M despite more memory usage (www.reddit.com)

+3 8w moe

Config: CtxSize: 131,072; GpuLayers: 99; CpuMoeLayers: 38; Threads: 16; BatchSize/UBatchSize: 4096/4096; CacheType K/V: q8_0; Tool Context: file mode (tools.kilocode.official.md). Results (Metric: M Model / XL Model / Difference): Avg Tokens/sec: 28.92 / 29.78 / +0.86…
Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. (www.reddit.com)

+1936 8w gemma llama

As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.
qwen3.6 27b poor experience (www.reddit.com)

6 8w sonnet qwen claude-code

Seeing how people praise it, I tried giving it implementation plan that Sonnet generated, but qwen keeps breaking files and goes in circles: Thinking… The file got corrupted from multiple overlapping edits. Let me just rewrite the whole fi…
Vs code extension (www.reddit.com)

+1 8w cline qwen codex

Which coding agent extension are most of you finding best with LM Studio as the local server 🤔 I'm running qwen 3.6 27b. I've used Cline and Continue mostly. I haven't checked out all the options, but I'm looking for something that looks and feels…
Qwen3.6-27B-FP8 - JS file is too long and causing JSON truncation (www.reddit.com)

+14 8w vllm qwen

Apologies in advance, if this is a newbie question. When running Qwen3.6-27B-FP8 using the below command on an Nvidia RTX PRO 5000, in opencode, I am seeing errors such as: "The issue is that the JS file is too long and causing JSON trunca…
Qwen3.6-35B-A3B KLDs - INTs and NVFPs (www.reddit.com)

+52 8w vllm

https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932 KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important.
- Qwen3.6 35b a3b Particle System (www.reddit.com)
Quant Qwen3.6-27B on 16GB VRAM with 100k context length (www.reddit.com)

+2 8w llama

https://preview.redd.it/tblmrwxkbexg1.png?width=1193&format=png&auto=webp&s=6dea1e6684e75e22852d57c0c72e9171deb56ae2 I have experimented with how to run Qwen3.6-27B on my laptop with an A5000 16GB GPU. I have created my own IQ4_XS GGUF "qwen3.6…
RTX 3090 + 27B model performance issues (llama.cpp) what am I doing wrong (www.reddit.com)

+217 8w llama agentic

Hey folks, looking for some advice on improving my local LLM setup (and also exploring agentic coding workflows). Current setup: GPU: RTX 3090 (24GB VRAM); RAM: 64GB; using llama.cpp with a Qwen3.6 27B Q6 model (GGUF); running through OpenCo…
Were Qwen3.6 models scrubbed from openrouter? (www.reddit.com)

+210 8w moe gemma qwen

I made a simple app using openrouter, hoping to use the new small qwen models (the a3b moe and the 27b dense one), but they aren’t listed. Also, I swear some qwen3.6 models that were listed before are missing now.
I like my models dense. Can model makers please bring back or update the dense models from like 2 years ago? A nice 39b or 72b maybe? (www.reddit.com)

12 8w moe

Seriously, Qwen3.6 27b is mopping the floor against models like 5 times its size right now. It doesn’t take a rocket scientist to figure out that maybe the whole a2b and a3b MoE thing isn’t the best solution after all.
What do you consider to be the minimum performance (t/s) for local Agent workflows? (www.reddit.com)

+58 8w llama anthropic claude-code

What would you say is the minimum amount of tokens per second you would tolerate for your local agent workflows? I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context running on an RTX A60…
Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? (www.reddit.com)

8w qwen

Right now I'm running Qwen3-27B-Q4_K_M on a 2060 12G + 5060 Ti 16G with tensor split 15/7. Gen speed sits around 16.5 t/s and prompt eval drops from 653 to 356 t/s as context grows.
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19 (www.reddit.com)

+201 8w vllm

Qwen3.6-27B has been out for a few days, and the NVFP4 with MTP was dropped earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at…
- Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19 (www.reddit.com)
Throughput and TTFT comparisons of Qwen 3.6 27B, Qwen 3.6 35B A3B and Gemma 4 models on H100 (www.reddit.com)

+85 8w vllm gemma qwen

I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple.
Best local gui setup Mac (www.reddit.com)

+1 8w codex

Hi all, I have a server (dual 7900xt) running qwen3.6 27b in LMStudio, because I love LMlink for its ease of use and I am okay with the model chugging along at ~25t/s in the background. I then serve the model to my Mac via LMlink.
Qwen3.6-35B-A3B-UD-IQ4_XS C++ to Rust Code Port Test: It Worked (Mostly)! (www.reddit.com)

+11 8w gemma qwen

When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5…
How are you running Qwen 3.6 27B on windows? (www.reddit.com)

9 8w qwen llama gemini

I've been trying to fix performance with llama-server and seem to be hitting a wall. Using Q4_K_M by unsloth and IQ4_K_M by DavidAU, when asking a question with no context, 39 t/s.
Local LLaMA server GPU upgrade advice (www.reddit.com)

+49 8w llama

TLDR: Should an RTX 3090 + T4 be faster than a P40 + T4 for OpenCode with Qwen3.6 35B A3B? Hi, I currently have a setup running a Tesla P40 w/ 24GB VRAM and a Tesla T4 w/ 16GB VRAM. I mainly use this setup to run models like GPT…
Open source multi-cursor/background computer-use (Codex-like) using Hermes Agent + Qwen3.6-35B-A3B-4bit + Cua-Driver (www.reddit.com)

+13 8w codex cursor

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) (www.reddit.com)

+7036 8w llama

I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it's been released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that K cache compression is not really recommended in most case…
Qwen3.6-35B-A3B - even in VRAM limited scenarios it can be better to use bigger quants than you'd expect! (www.reddit.com)

+15746 8w llama

So maybe this is a no-brainer to many experienced local LLM users but it was not obvious for me. I am running a 3070 8gb + 64gb DDR4.
Best settings for Qwen 3.6 -27B for 2X3090? (cannot make it to be smarter than Qwen 3.6 35B-A3B! (www.reddit.com)

+41 8w vllm qwen

I'm sure people have asked before for settings for these GPUs, but for me, no matter what I do, it doesn't work as well as 3.6 35B! I've tried vLLM and llama.cpp.
Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help (www.reddit.com)

+718 8w qwen llama

The new dense model is great, but I’m trying to figure out how to increase PP and Token generation speed. I’m running Q8 quants across 3 7900xtx GPUs and I’m consistently only getting 18-20 t/s generation speed and ~650 t/s prompt processi…
Using the iGPU as the primary graphics card may improve token generation speed for PCIe graphics cards (www.reddit.com)

+1217 8w llama

A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At that time, I had my monitor plugged in my 4070 and didn't even…
Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results (localbench.substack.com via reddit)

+31358 8w gemma qwen

4 models tested with q8_0 and q4_0 KV cache against full-precision baseline. What this measures: KV cache quantization stores the key-value cache in lower precision to s…
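For readers unfamiliar with the metric, a KL divergence comparison of this kind can be sketched as follows. This is an illustrative assumption about the general technique, not the linked article's code: it computes KL(baseline || test) per token position from raw logits and averages over positions, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def kl_divergence(baseline_logits, test_logits):
    """KL(P || Q), with P from the full-precision run and Q from the quantized-cache run."""
    log_p = log_softmax(baseline_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(baseline_rows, test_rows):
    """Average per-position KL divergence over a sequence of positions."""
    values = [kl_divergence(b, t) for b, t in zip(baseline_rows, test_rows)]
    return sum(values) / len(values)
```

A value of zero means the two runs produce identical next-token distributions at every position; larger values mean the quantized cache shifts the model's predictions further from the baseline.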
What are your most interesting and hard Vision use cases? I plan to do side by side comparison of Gemma 4 (31B) vs Qwen 3.6(27B) Vision and I look for inspiration (www.reddit.com)

+318 8w vllm gemma qwen

Hey guys, I built a custom vLLM pipeline to run Gemma 4 (31B FP8) and Qwen 3.5 side-by-side locally to see how they actually perform in the wild with preprocessing of audio and images. But of course new model Qwen 3.6 27B came out just whe…
Qwen3.6 uncensored AWQ (www.reddit.com)

+63 8w vllm

I have tested Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my 4x3090 system (opencode) and find it really good and fast. However, I can't find any uncensored models for vllm (preferably as AWQ).
Complete beginner to Agentic coding, is Qwen3.6-27B + pi.dev the right starting point or should I be looking elsewhere? (www.reddit.com)

+324 8w chatgpt agentic

Hello fellow members of this lovely community, Let me start by saying that I’m about as far from a professional developer as it gets. I’m a hobbyist whose entire coding experience consists of building various Python/VBA tools and simple Ja…
Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) (www.reddit.com)

+7138 9w moe qwen llama

I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…
My 12-agent Qwen 35B stack on Ollama died at 500 tokens every single time. Raw MLX fixed it and broke 4 other things I didn't see coming. (www.reddit.com)

11 9w moe ollama qwen

TLDR: Swapped Ollama for MLX on M1 Max (64GB) to run a 12-agent trading stack using Qwen 35B MoE. MLX wins on throughput and fine-grained sampler control, but I lost the "it just works" convenience of Ollama.
Get goosebumps (www.reddit.com)

8 9w

Please comment here if you just cancelled your claude subscription. So that we can see how much you have confidence in open source or open weight models especially with qwen3.6 release.
Qwen 3.6 27b IQ4_XS - 22 tp/s on RTX 5060TI 16b, 24k ctx (www.reddit.com)

+3920 9w qwen llama

Maybe this will be helpful for someone: llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000 Can't run this model with higher KV quants at >8192 ctx size. -ub & -b set to 256 allowed me for…
local models are getting crazy good but why is agent memory still so cooked? (www.reddit.com)

8 9w qwen

Been running qwen 3.6 locally and I'm shook. But what are we doing about agent memory? Because it's still a complete mess.
IQ2XXS Qwen 3.6 35b is actually very usable on 32 gb macbooks (www.reddit.com)

5 9w moe qwen

Just tested the MoE qwen model with 2-bit precision and it's surprisingly good. I used the 2-bit xxs from unsloth and it seems to maintain intelligence really well, never failed a tool call so far and surprisingly good at 3js, even better than…
Severe instability and looping issues with local LLMs (Qwen, Zen4, llama.cpp) (www.reddit.com)

+221 9w qwen llama

I tried working on a local LLM project today and honestly ended up pretty frustrated. I tested several approaches, but none of them worked reliably.
Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives (www.reddit.com)

+262116 9w qwen

MacBook Pro M5 MAX 64GB. Qwen 3.6 35B - 72 TPS.
Qwen3.6-35b-a3b MLX variants benchmarking: Vanilla - TurboQuant - RotorQuant @5bit (www.reddit.com)

+4 9w

Trying to find out the best local LLM inference engine for Hermes in terms of performance and memory footprint. Tool calling accuracy is already up there, so I focused on pure token crunching.
Qwen3.6-27B with the tiny harness of Kon (www.reddit.com)

+1 9w openai

It's working very well out of the box on the tiny harness of kon; ~270 tokens without the tool schema (~1000 tokens including). https://github.com/0xku/kon Members from LocalLLaMA have already contributed many interesting features recently…
coding with Qwen3.6-27B-UD-Q2_K_XL.gguf (www.reddit.com)

+58 9w llama

pi llama.cpp awesome torus Windows, 5070 (12GB)
Qwen3.6-27B Uncensored Aggressive is out with K_P quants! (www.reddit.com)

+654 9w qwen

The dense sibling of the 35B-A3B drop is here, Qwen3.6 27B Uncensored Aggressive is out! Aggressive = no refusals; NO personality changes/alterations or any of that, it is the ORIGINAL release of Qwen just completely uncensored https://hug…
What am I missing about samplers? (www.reddit.com)

+13 9w qwen

Hi all, With the recent release of models that require temp = 1, top_k = N, and top_p = 0.95, I'm wondering why labs actually prefer those truncation samplers over just min_p? As far as I understand, min_p isn't supported everywhere, and t…
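A minimal sketch of the two truncation styles that question contrasts, assuming the usual definitions: top-k then top-p keeps a fixed-size, fixed-mass head of the distribution, while min_p keeps every token whose probability is at least a fraction of the top token's. The function names and the toy logits are illustrative only and are not taken from the thread or from any particular inference engine.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities at the given temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_top_p(probs, k, p):
    """Keep the k most likely tokens, renormalize, then keep the
    smallest prefix whose cumulative mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i] / total
        if mass >= p:
            break
    return kept

def min_p(probs, threshold):
    """Keep every token whose probability is at least threshold * max prob."""
    cutoff = threshold * max(probs)
    return [i for i, q in enumerate(probs) if q >= cutoff]

# Toy distribution over four tokens (hypothetical logits).
probs = softmax([2.0, 1.0, 0.0, -1.0], temperature=1.0)
print(top_k_top_p(probs, k=3, p=0.95))
print(min_p(probs, threshold=0.1))
```

The practical difference is that the min_p cutoff scales with how confident the model is at each step, whereas top_k and top_p use the same budget regardless of how peaked the distribution is.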
Windows freezing up as VRAM fills up - Does this happen for everyone? (www.reddit.com)

+12 9w gemma

Hey everyone, I run llamacpp precompiled with CUDA 12.4 on Windows 11 with a RTX 4090. With small models like gemma-4-E4B everything runs fine, but as soon as I run a bigger model like Qwen3.6-27B (IQ4_NL) or a medium sized model with larg…
What speed is everyone getting on Qwen3.6 27b? (www.reddit.com)

+879 9w llama

I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0 this is on 3x GPUS (1x2060super 8gb, 2x5060ti 16gb), via llamacpp unsure if this is slow or to be expected? */llama-server --port 8080 --model */llama.cpp/Qwen3.6…
Are Qwens v3.6 good at vectorizing raster images? (www.reddit.com)

4 9w qwen

original image Qwen3.6-27B-UD-Q5_K_XL.gguf Qwen3.6-35B-A3B-UD-Q5_K_S.gguf ...you tell me. system prompt: You are Qwen, created by Alibaba Cloud.
Got a RTX a5000 24gb, what models could I use? (www.reddit.com)

4 9w llama

I just got a used RTX a5000 24gb to use for local models, I mainly use AI to code, but I prefer to spend some money now instead of $200 per month on claude to use 50% of it in a single prompt. My current specs are: Ryzen 7 9800x3d 64Gb DDR…
Qwen3.6 One Shot Tetris Game (www.reddit.com)

+1532 9w moe llama

I am blown away by what this model can generate locally. I asked for a flashy Tetris game with particle effect and boy did it deliver!
The Claude Code Pro removal is getting framed as 'just go local' but for production systems it's messier (www.reddit.com)

6 9w rag qwen anthropic+1

Yesterday's Claude Code Pro removal thread hit 350+ comments in a few hours, and the dominant take was basically "switch to Kimi K2.6, go local, done." I upvoted that thread and tbh im mostly there — but im building voice agents and RAG pi…
Show HN: Qwen Lens Studio – multimodal app on Qwen3.6-35B-A3B, runs on Ollama (github.com via hn)

+21 9w ollama qwen

Qwen Lens Studio A multimodal AI studio built around a single Qwen vision-language model, exposed through five focused tools plus a batch runner and a persistent session log. Ship a screenshot → get code.
Qwen 3.6 35B A3B vs Qwen 3.5 122B A10B (www.reddit.com)

+2237 9w qwen

Does anyone else have the same experience comparing these two - for me 3.5 122B outperforms 3.6 by a large margin. 3.6 gets lost as long as the task requires a couple of more steps.
- Optimizing Qwen 3.6 35B A3B sampling parameters. (www.reddit.com)
- PSA re Qwen 3.6 35B A3B q4 + agents (www.reddit.com)
Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried? (www.reddit.com)

+217 9w qwen llama

Do I understand correctly, based on this comment, that I can potentially fit Qwen 3.6 27B FP8 precision model and have around 256K context available and fit it fully in my RTX 5090 VRAM? Of course with the help of TurboQuant compression, a…
is Qwen3.6-27B comparable with Opus 4.5? (www.reddit.com)

+113 9w opus agentic

https://preview.redd.it/qtzdx5ud0rwg1.jpg?width=1200&format=pjpg&auto=webp&s=aa25d9f0bb8007ee6e4065cfa46a9685454c89cd - Outstanding agentic coding, surpasses Qwen3.5-397B-A17B across all major coding benchmarks - Strong reasoning across te…
Kimi 2.6 and qwen3.6 is out but still as slow as ever (www.reddit.com)

6 9w ollama

Has anyone tried these? I found this on ollama: https://ollama.com/library/kimi-k2.6, https://ollama.com/library/qwen3.6 My issue is that they are extremely slow on my local.
Llama.cpp parameters for Qwen 3.6 with RTX 3090 (www.reddit.com)

+1112 9w qwen llama agentic

Hi, I'm trying to run Qwen 3.6-35B on my RTX 3090 (24 GB of VRAM) but I'm not sure about 2 things: - Which variant of the model to use? (Q4_K_S, Q3_K_XL, other?
Qwen3.6-27B released! (www.reddit.com)

+549141 9w qwen agentic

Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight.
- Qwen3.6 27B really good? (www.reddit.com)
Qwen3.6-35B becomes competitive with cloud models when paired with the right agent (www.reddit.com)

+478125 9w qwen

A short follow-up to my previous post, where I showed that changing the scaffold around the same 9B Qwen model moved benchmark performance from 19.11% to 45.56%: https://www.reddit.com/r/LocalLLaMA/s/JMHuAGj1LV After feedback from people h…
Consider running a bigger quant if possible (www.reddit.com)

+4040 9w qwen agentic

Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context and was very much disappointed because it would loop, make formatting errors, implement wrong things etc.
Tested how OpenCode Works with SelfHosted LLMS: Qwen 3.5, 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash - v2 (www.reddit.com)

+1732 9w glm gemma qwen

I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: - Create IndexNow CLI in Golang (Easy Task) and - Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwe…
R9700 Qwen3.6 Benchmarks? (www.reddit.com)

+310 9w llama

Can someone who owns an R9700 (a single GPU is enough) add a llama-bench output with Qwen3.6-35B-A3B Q5_K_P here in the thread? Other benchmarks are also welcome :) I just want to see the t/s and compare it with my local solution, because I m…
Need help with llama.cpp Qwen3.6 configuration on a single 3090 w/ 48GB RAM (www.reddit.com)

+1 9w llama

Hey there, I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I have noticeable stuttering when I run the model as it fills the VRAM completely, and I am sure I…
Qwen 3.6 35B-A3B takes a long time at image processing. Is it happening only to me? (www.reddit.com)

+12 9w moe qwen llama+1

9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.
How to best utilize local LLM give my hardware? (www.reddit.com)

1 9w qwen agentic claude-code

Hi all, I’m new to local LLMs but as someone who extensively uses agentic coding I thought I’d try it out. I am running a MacBook Pro with M3 Max 64gb ram.
Llama.cpp's auto fit works much better than I expected (www.reddit.com)

+6435 9w llama
I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX (www.reddit.com)

+1 9w moe gemma claude-code

I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.
Gemma 4 is much less popular on Hugging Face than Qwen 3.x. (www.reddit.com)

+111 9w gemma qwen

The difference is quite big:

model              likes   downloads last month   finetunes
Qwen3.5-27B          952              3,233,034         263
Qwen3.5-35B-A3B    1,397              3,977,637          87
Qwen3.6-35B-A3B    1,115                458,436          60
gemma-4-31B          323                343,895          13
gemma-4-26B-A4B      227                118,464          13
Gemma 4-31B vs Qwen 3.5-27B vs Qwen 3.6-35B-A3B on a browser-agent vision prompt — MoE wins on every axis (www.reddit.com)

+12 9w moe gemma qwen+1

I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…
Are commonly recommended sampling parameters often too high? (www.reddit.com)

+210 9w qwen
PSA : you don't need a Blackwell card to run mxfp4 models (RTX 3080 + Qwen 3.6 35B A3B) (www.reddit.com)

11 9w qwen llama
kIOGPUCommandBufferCallbackErrorImpactingInteractivity... recreate the backend to recover (www.reddit.com)

+2 9w qwen llama
Brand new dual 3090 PC - what should I install first for the best local agentic coding experience? (www.reddit.com)

6 9w vllm qwen llama+1
Doing real coding work locally for the first time (www.reddit.com)

+1217 9w codex agentic claude-code
Is there any way to implement multimodal RAG using some open-source multimodal large models? (www.reddit.com)

+11 9w rag
Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into (www.reddit.com)

+2528 9w moe llama agentic
Oculink eGPU for LLMs: RTX 5070 Ti (256-bit) vs 5060 Ti (128-bit) paired with 4090m (256-bit) laptop? (www.reddit.com)

+27 9w qwen
(Interactive)OpenCode Racing Game Comparison Qwen3.6 35B vs Qwen3.5 122B vs Qwen3.5 27B vs Qwen3.5 4B vs Gemma 4 31B vs Gemma 4 26B vs Qwen3 Coder Next vs GLM 4.7 Flash (www.reddit.com)

+6629 9w glm gemma mcp
Opus 4.7 Max subscriber. Switching to Kimi 2.6 (www.reddit.com)

+20868 9w qwen cursor opus+2
Thoughts on MoE Qwen 3.6 35B? (www.reddit.com)

14 9w moe qwen
SOLVED! Was "Help needed: Ollama > qwen3.6 in OpenCode on 64Gb M4" (www.reddit.com)

1 9w ollama
Qwen 3.6 comaprable with the old Qwen 3 coder 480B? (www.reddit.com)

3 9w cline qwen

I specifically remembered when qwen3 coder came out and it was like the only few models out there that can totally take over a repo and actually do things in VSCode without emptying bank account. and when that the qwen3 coder 30B was so fa…
Recommended parameters for Qwen 3.6 35B A3B on a 8GB VRAM card and 24GB RAM? (www.reddit.com)

+214 9w moe qwen llama
Qwen3.6-35B-A3B running on a Mac mini M4 16GB (www.reddit.com)

9 9w llama
Qwen 3.6 on rtx6000 96gb (www.reddit.com)

11 9w qwen
Help on jiberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S (www.reddit.com)

10 9w llama
LLM Neuroanatomy III - LLMs seem to think in geometry, not language (www.reddit.com)

+6053 9w gemma
llama-bench results with SYCL backend - Intel Arc B70 (on a pcie 3.0 motherboard) (www.reddit.com)

8 9w llama
Qwen3.6 agent + Cisco switch: local NetOps AI actually works! (www.reddit.com)

+124 9w cline qwen llama+1
5070 Ti (New) vs 3090 (Used) to pair with 4070 for local LLMs? (www.reddit.com)

13 9w moe gemma qwen
Dual GPU setup (yes, no)? (www.reddit.com)

5 9w llama
Running Qwen 3.6 35B-A3B-4B on MacBook Pro M5 64GB with tools (www.youtube.com via hn)

+3 9w qwen
"Browser OS" implemented by Qwen 3.6 35B: The best result I ever got from a local model (gist.github.com via reddit)

+4819 9w qwen
Small Gemma 4, Qwen 3.6 and Qwen 3 Coder Next comparison for a debugging use-case (www.reddit.com)

+819 9w gemma qwen
Alguém utilizando PI como headless? (www.reddit.com)

9w openclaw qwen
what is the state of using rotoquant at the moment? (www.reddit.com)

+34 9w llama
Full AMD workstation- dual 7900 XTX (www.reddit.com)

11 9w qwen
Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice (www.reddit.com)

15 9w moe qwen llama
How is Rotorquant/planarquant/iso qaunt better? (www.reddit.com)

+22 9w gemma qwen llama
Qwen3-30B-A3B-Instruct-2507 is better than the new Qwen 3.6 for our tasks (www.reddit.com)

9 9w vllm moe gemma+1
Qwen 3.6 35B different quant speeds ? (www.reddit.com)

9w qwen llama
What starts to become possible with two 3090s that wasn't with just one? (www.reddit.com)

+1978 9w qwen
Intel Arc B70 with HP z640 workstation (pcie 3) (www.reddit.com)

+117 9w llama
Qwen 3.6 CoT issue? (www.reddit.com)

+116 9w qwen llama openai
KV cache compression on Qwen 3.6 (1M context): 10.7GB → 6.9GB, V ≈ 3.5× smaller (www.reddit.com)

11 9w qwen
For chat and Q&A: Which MoE model is better: Qwen 3.6 35B or Gemma 4 26B (no coding or agents) (www.reddit.com)

+2018 9w moe gemma qwen
I'm running qwen3.6-35b-a3b with 8 bit quant and 64k context thru OpenCode on my mbp m5 max 128gb and it's as good as claude (www.reddit.com)

+560269 9w
Ask HN: How do you use Local LLMs? (April 2026) (news.ycombinator.com)

+3 9w gemma qwen gemini+1
Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context (www.reddit.com)

+515 9w qwen agentic

What would you choose if you were in my shoes? How viable is 125k for agentic coding really?
5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4 (www.reddit.com)

+2 9w qwen llama

Hi guys, just want to share with you guys a Frankenstein build I put together that is surprisingly decent I have a i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to a RX9070, then I…
Newbie here (www.reddit.com)

9w moe llama mcp

Hi guys, I'm on a 9950X with 196GB and a 4090. Are these parameters OK? My main use will be coding: llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --n-cpu-moe 20 -c 250000 --host 0.0.0.0 --port 8082 --reasoning-budget -1 --top-k 20 --top-p 0…
Intel Lunar Lake 258V (32GB) vs Qwen 3.6 35B-A3B: Pushing the limits of MoP architecture. (www.reddit.com)

+2 9w moe gemma qwen

Hardware: Intel Core Ultra 7 258V, 32GB Unified Memory. Model: Qwen 3.6 35B A3B (Quant: Q3_K_S) via LM Studio.
Imposing my laptop to run Qwen 3.6 (www.reddit.com)

10 9w moe qwen llama

So, I am excited with the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run in my hardware.
Generating Logisim Evolution circuits (www.reddit.com)

+12 9w qwen

Short: I want to generate with Qwen 3.6 something like this https://preview.redd.it/bd6rbgnoatvg1.png?width=960&format=png&auto=webp&s=a1c079f37c048fa2c687709465b0c830a0184a4c After many hours, I'm able to generate a working file without w…
This is very fair. Other interesting context behaviors you've experienced? (www.reddit.com)

9w moe qwen gemini

I guess the model didn't feel it needed to do anything beyond proving. Not entirely sure how I got it to act so..
I pray there is a Qwen 3.6 122b version (4x3090 owner) (www.reddit.com)

4 9w qwen

The 3.5 122b model already is fantastic at 4-bit. Really the best model I ever ran on my 4x3090, but from what I read how 35B 3.6 is doing, the 3.6 122b model would be an absolute value banger.
Qwen 3.6 35B crushes Gemma 4 26B on my tests (www.reddit.com)

+7329 9w gemma qwen agentic

I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information fr…
qwen3.6-35b-a3b tool calling input problem... too bad... (www.reddit.com)

7 9w

Hey guys. Some people including me are having trouble on qwen3.6-35b tool calling.
Testing Qwen3.6 with Hermes Agent on agentic coding. Locally with llama.cpp. (www.reddit.com)

9w llama agentic

I'll be testing the setup and try out the Hermes Agent live: https://www.youtube.com/live/q5vqvwZykRI
Don't ask Qwen 3.6 35b to give you aski image of Yoshi :) (www.reddit.com)

+1011 9w qwen agentic

https://preview.redd.it/dfqed57qgsvg1.png?width=1706&format=png&auto=webp&s=3859209698d2e844e2731326e355d60928658f8a The most fun part was reasoning, here is a gist: https://gist.github.com/anzax/5f06716c66180013cd715f6c2e5848df There is a…
Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant (www.reddit.com)

+1437 9w moe qwen llama

Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.
Qwen 3.6 q8 at 50t/s or q4 at 112 t/s? (www.reddit.com)

+1416 9w qwen

What are some ways that you would go about thinking about choosing between the two for use in a harness like pi? Did a good bit with q4 yesterday and it was so consistent and reliable I had it set to 131k context and it worked through 2 co…
best possible GPU setup for using qwen 3.6 ? (www.reddit.com)

15 9w deepseek qwen

Hi, I have recently been thinking of buying my own GPU for hosting open-source models. Can someone give any suggestions? Also, suppose I don't want to remain restricted to Qwen 3.6 but want to run some math-heavy tasks too, for which I want DeepSeek or…
Context Compaction / Summarization on Apple Silicon (www.reddit.com)

+15 9w

I've been very impressed with qwen3.6-35B-A3B on Apple Silicon (and actually my AMD iGPU setup with DDR5 and a 760M does well too). It can actually navigate a codebase and write useful code.
Qwen3.6 Fails n8n Tool Calling (www.reddit.com)

+13 9w llama

https://preview.redd.it/na4ub5yzprvg1.png?width=1654&format=png&auto=webp&s=e356e0ab0829bb275352d1035c35c645a381c3c7 I am using Kaggle to serve Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf but tool calling is not always working. I also tested it with R…
Qwen3.6-35B-A3B just dropped — quick thoughts after trying it (www.reddit.com)

4 9w moe gemma

Just gave the new Qwen3.6-35B-A3B a spin. It’s a MoE model (35B total, ~3B active), but honestly the more interesting part is how much they’re pushing agent-style coding.
Qwen3.6 is incredible with OpenCode! (www.reddit.com)

+14655 9w gemma opus claude-code

I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know).
7900XTX, Qwen 3.6 35B A3B, 150t/s that drops to 50t/s for no reason? (www.reddit.com)

+19 9w qwen llama

MSI B650 Gaming Plus 9800X3D 64GB DDR5 6400mts Windows 11 When I first boot my PC and I run this model, I get 155-160t/s, and for some reason, after a couple minutes, say, 10 minutes, not using AI or anything in particular, GPU temp at 40c…
- Low performance in 7900XTX in Qwen 3.6 35B A3B (www.reddit.com)
Qwen 3.6 is the first local model that actually feels worth the effort for me (www.reddit.com)

+20778 9w qwen opus

I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth. I've been using LLMs in my per…
Qwen 3.6 No think? (www.reddit.com)

+25 9w qwen

I’ve been seeing a lot of good feedback about the qwen 3.6 model and its reasoning performance but has anyone tested it with reasoning off? I’ve been building a low latency app using Qwen 3 30ba3b 2507 and 3.5 no think was not an improveme…
Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) (www.reddit.com)

+194 10w qwen llama

Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task.
Clearing up some memory while running llms locally. 25-32token per second gpu poor rx6700xt 12gb and 32gb ddr4 (www.reddit.com)

3 10w qwen

QWEN 3.6 35B A3B MXFP4 https://preview.redd.it/bclr8ukcoqvg1.png?width=904&format=png&auto=webp&s=853b211505ef6b9184d0571ca8fc46295437322a hey everyone this is my first post, anyways the thing is that there is this program called https://m…
Qwen3.6 35B A3B is THE ONE The Local LLM Champ on OpenCode benchmark dashboard [video] (www.youtube.com via hn)

+2 10w

GPU strategy for local LLM + mixed workloads (70-person company) — NVIDIA vs AMD? (www.reddit.com)

+43 10w rag agentic

Hey all, we’re a mid-sized company (~70 people) and currently planning to bring a lot of our workloads on-prem instead of relying on cloud APIs. The goal for the moment is to run small to mid-sized models in the range of 30B like Qwen3.6 o…
Benckmark Qwen 3.6-35b uncensored on Rtx3090 (www.reddit.com)

+11 10w ollama qwen

Hello I saw the new model is out but even with 24gb of vram, I have too many browser and task to use it , so I have downloaded and tested the version of HauHauCS https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressiv…
censorship in qwen3.6? (www.reddit.com)

+115 10w qwen

I do not want to spread conspiracies, please weigh my information carefully, and maybe someone can hopefully prove me wrong. I installed the brand-new qwen3.6 yesterday and ran a few of my own traditional tests, not a very deep dive, just…
Did you know that you can use Qwen3.5-35B-A3B-Base as an instruction/reasoning Model? (www.reddit.com)

+134 10w qwen

https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Base-GGUF Yes, Qwen 3.6 is out and it's a great model. However, who needs an even more "uncensored but official" model, can try out this one.
Context checkpoint erasure in llama.cpp ? (www.reddit.com)

+37 10w function-calling qwen llama

Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…
Strix Halo concurrency 4 16k context 64 t/s Qwen3.6-35B-A3B-Q8_0 (www.reddit.com)

+2 10w qwen llama

https://preview.redd.it/4906akj9dovg1.png?width=1527&format=png&auto=webp&s=c49e255ac79a3c5455f44603422f8af7ddc12594 First of all can we make https://www.youtube.com/watch?v=2lUC8Gimxz8 Angine de Poitrine this subs official band? Those guy…
Abliterated version of the new Qwen3.6-35B-A3B up on HF (www.reddit.com)

+17 10w moe gemini

Pushed an abliterated Qwen3.6-35B-A3B to HF. Worth noting because MoE abliteration is genuinely different from dense — the refusal signal lives in the expert path, not attention, so standard Q/K/V LoRA doesn’t cut it.
- What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M (www.reddit.com)
Qwen3.6-35B-A3B — full JANG suite (15 profiles, 1L through 6K) for Apple Silicon (www.reddit.com)

+1 10w moe

Full JANG adaptive mixed-precision quantization sweep of Qwen3.6-35B-A3B: https://huggingface.co/collections/bearzi/qwen36-35b-a3b-jang All 15 profiles, from extreme compression to near-lossless: JANG_1L JANG_2S/2M/2L JANG_3S/3M/3L/3K JANG…
Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way. (www.reddit.com)

5 10w gemma qwen llama

I was disappointed with Gemma 4 due to various bugs and in the end lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both mode quants and llama.cpp…
Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1 (www.reddit.com)

+75 10w qwen

Is there a way to have qwen-code CLI read images? (www.reddit.com)

+21 10w qwen llama codex

Basically I am asking the model to describe an image, but it says it can't process the images. The weird thing is that if I send the image encoded directly on the prompt, it works just fine, I am using llama-server with qwen3.5 (tried all…
Qwen3.6-35B is worse at tool use and reasoning loops than 3.5? (www.reddit.com)

+14 10w tool-use

Been running the new model the entire evening in different quants and coding tasks with OpenCode. Used oMLX and LM Studio.
GPoUr with ~12gb vram and a 3080 getting 40tg/s on qwen3.6 35BA3B w/ 260k ctx (www.reddit.com)

+912 10w llama

TheTom's GPU-accelerated turboquant (turbo3) has unlocked high context gains for the 35BA3B family. I can now achieve ~40tg/s via the following GPU-POOR compilation flags and configuration: cmake -B build -DGGML_CUDA=ON -D…
PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on. (www.reddit.com)

+8813 10w qwen

I had previously posted here about a fix to their 3.5 template to help resolve the KV cache invalidation issue from their template. A lot of you found it useful.
How to run MoE models without necessary RAM? (Apple Silicon) (www.reddit.com)

+113 10w moe qwen

Hey, I have a M1 Pro 16gb machine, and I wanted to run the Qwen3.6/3.5 35A3B model. However, this model cannot fit on a 4bit quant on my system.
The only metric that matters: "[Qwen3.6-35B-A3B-GGUF] drew a better pelican riding a bicycle than Opus 4.7 did!" (news.ycombinator.com via reddit)

+2613 10w opus

- Qwen3.6-35B-A3B draws a better pelican than Opus 4.7 (twitter.com)
Qwen3.6 local test (live) with llama.cpp. Is it going to be better than Gemma4? (www.youtube.com via reddit)

4 10w llama

Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp (www.reddit.com)

+144 10w vllm llama

Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamac…
Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp (www.reddit.com)

+4923 10w moe qwen llama

Note: First is Qwen3.5 35B MoE (Left) and Second is Qwen3.6 (Right) Hi Guys Just did quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE. with reasoning off using llama.cpp and same quant unsloth 4 K_XL GGUF First is Qwen3.5 outco…
Web OS result from Qwen3.6 35B is by far the best I tested in my laptop (codepen.io via reddit)

+1712 10w qwen llama

This is my first test with this model and Qwen impressed me. I will rate it 98% usable web os compared to my previous best 70% usable result from qwen3 next coder at q2.
My Qwen 3.6 fails the car wash vibe check (www.reddit.com)

8 10w gemma qwen claude-code

I configured it to the best of my abilities, even at Q8. It fails to give the correct number of tools it supports on Claude Code and it fails the car wash test.
Qwen 3.6: worse adherence? (www.reddit.com)

+3727 10w vllm rag qwen

Just swapped Qwen 3.5 for the 3.6 variant (FP8, RTX 6000 Pro) using the same recommended generation settings. My stack is vLLM (v0.19.0) + Open WebUI (v0.8.12) in a RAG setup where the model has access to several document retrieval tools.
Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) (www.reddit.com)

18 10w moe qwen

I was working on a simple frontend web design task earlier (styling some buttons) with Qwen3.5-35B-A3B. The end results weren't great, but at least it kept trying to change stuff and call tools properly.
Alibaba open-sources Qwen3.6-35B-A3B, a 35B MoE model with 3B active parameters (huggingface.co via hn)

+8 10w vllm moe

Qwen3.6-35B-A3B [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…
Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All (qwen.ai via hn)

+290157 10w qwen agentic

Qwen Studio offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
- Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (news.ycombinator.com)
Do you guys think there’s a high chance of Singularity being open source? (www.reddit.com)

+7467 10w glm gemma qwen+1

GLM 5.1 is dominant in almost every aspect in Design arena, surpassing Opus 4.6 in many tasks. Although user experiences vary dependent on subscription plans for both of those one of them is open source.
