Qwen3.6 35B - TXT vs Markdown vs HTML vs HTML+CSS (www.reddit.com)
model roundup
Qwen 3.6
-
There's been talk of late about using HTML rather than Markdown in Claude Code. I was curious how this would work with a local model, so I loaded up Qwen3.6 35B A3B at Q8 with an F16 KV cache.
-
EDIT - IGNORE. I MADE A MISTAKE.
-
I'm using llama.cpp, and I've tried Bartowski's and my own quants. When using Qwen3.5-122B or Qwen3.6-27B, I'm seeing really low draft acceptance in chats with interleaved code snippets (chatting with the LLM about programming / a code pro…
-
I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler).
-
Context Krasis is an LLM runtime for running models that don't fit into VRAM. Krasis streams the model through VRAM from system RAM efficiently and handles prefill and decode as separate architectures and optimised usecases.
-
Qwen/Qwen-Image-Bench · Hugging Face (huggingface.co via reddit)
Model Description Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality crit…
-
I used the vLLM version of https://github.com/noonghunna/club-3090. It worked fine for maybe 20-40k context; I haven't tried the new one. Has anyone used the new patched llama.cpp one for a single 3090?
-
Local LLMs on Refurb M4 Max vs new M5 Max (www.reddit.com)
Hoping the community can guide me on this one. I'm on the fence about the following purchase: Refurbished 16-inch MacBook Pro Apple M4 Max Chip with 16‑Core CPU and 40‑Core GPU, 64gb ram for $3,479.00 vs The new 16-inch MacBook Pro Apple M…
-
Anyone tried a setup like this? Is it a bad idea? 😅 (www.reddit.com)
I’m considering building a local machine for AI inference using a Dell Precision T5820 and 2 Intel Arc A770’s. From this I could get 32GB DDR4 RAM, 1TB SSD and 32GB VRAM, all for like $1000.
-
Need some advice on AI workflow (www.reddit.com)
Hi all, I'm somewhat new to the scene (been lurking for maybe 4-5 months now), but I think I have all the basics figured out. My setup: 9800x3d with 64GB of RAM, 6900xt with 16GB VRAM.
-
Qwen3.6 huge quality gain from Q4 to Q6 for coding agent (www.reddit.com)
So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and deepseek was too cheap.
-
So, I got interested in local LLMs a few months ago, but, I don't have a background in coding, and I don't know how to code, and I am not good with computers or anything. So far I mainly just was having fun with comparing different local L…
-
Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantizations affects…
-
I was given the great opportunity to borrow an H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system RAM I will get, but I guess they are a bit flexible on this.)
-
LMStudio with MTP support - which model? (www.reddit.com)
Looks like LMStudio released support for Multi-Token-Prediction (MTP) and the release notes say to use a MTP-compatible model. What model is everyone using with MTP support?
-
Is Granite-4.1-30b Overshadowed by Qwen3.6 & Gemma4 models? (www.reddit.com)
I don't see any threads on this model. Is it because it's dense and/or without-reasoning?
-
I don't have a good experience running q4_k_m; the difference from q6 is "a few errors an hour" versus "a few errors every couple of days". Edit: How does it fail?
-
Been running Qwen3.6-35B-A3B as a sub agent on a single 4090 for a few weeks. The failure modes are different from solo use and I haven't seen this written up anywhere.
-
I've really been testing Qwen 3.6 27B and 35B A3B so far, with 27B at Q8 and 35B A3B at Q4 (the ByteShape quant is insane). But I feel I'm not utilizing it in the best way, especially for long-context, messy coding in large repos.
-
Advice on local coding setup (www.reddit.com)
Just got an RTX 3090 to go with my Intel Core 9 Ultra 285K CPU and 32 GB of DDR5 6000 ram. I want to code locally on my Windows 11 PC.
-
Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774) My run command: ``` llama-server \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --presence_penalty 0.0 \ --min-p 0.00 \ --gpu-layers all \ -m /home/eleung/huggingface…
-
$400 Qwen 3.6-27B Setup - Dual RTX 3060 - 30-50 t/s (www.reddit.com)
I picked up a 7900 XTX earlier which runs qwen3.6-27b fine, but not to my liking. Its compute performance is quite unstable for me.
-
Looking for Suggestions — Single 5090 & 64gb DDR5 (www.reddit.com)
Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…
-
Poor performance on RX 9070 XT (www.reddit.com)
I was thinking about upgrading from an MI50 to an AMD AI PRO9700, and I happen to have an RX 9070 XT on my gaming pc, so I tested the performance on it to have an idea of what to expect. So, install rocm, build llama.cpp, download Qwen3.6-…
-
Llamacpp server : How do the -np and -c flags interact? (www.reddit.com)
I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.
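A hedged sketch of the arithmetic behind this question. It assumes the behaviour llama-server has historically had, where the total context given by -c is divided evenly across the -np slots. Newer builds may share one unified KV cache between slots instead, so this is an assumption to check against your own build. The function name is hypothetical.

```python
def per_slot_context(ctx_size: int, n_parallel: int) -> int:
    """Tokens available to each slot if -c is split evenly across -np slots.

    Assumption: even split. This does not model a unified KV cache.
    """
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return ctx_size // n_parallel


# Example: how a 131072-token context would be shared under this assumption.
for slots in (1, 2, 4):
    print(f"-c 131072 -np {slots}: {per_slot_context(131072, slots)} tokens per slot")
```

Under that assumption, an agent that needs a long context per request wants either a larger -c or fewer slots.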
-
qwen 3.6 27B AR-> Diffusion - local training on 5090 (www.reddit.com)
based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable an…
-
link: https://huggingface.co/JC1DA/Qwopus3.6-27B-v2-INT4-W4A16-Autoround Super surprised how good Jackrong's model is... It's taking so much time to evaluate all the base qwen3.6-27B, Jackrong's version and others' quantized models but…
-
I saw this one mentioned, but the source was from about 14 months ago. In the age of Qwen 3.6 and Gemma 4, is there still a use for QwQ 32B?
-
As per the title, such as Gemma 4 31B Q4_K_S vs Gemma 4 26B A4B Q8, or Qwen 3.6 27B Q4_K_M vs Qwen 3.6 35B A3B Q6_K, etc. At what point is it worth switching? My use case is mostly creative writing.
-
Is Qwen3.6 current king for local agentic use? (www.reddit.com)
I've been testing other models but it seems like nothing even comes close to Qwen3.6 35B A3B for agentic use. The worst I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash…
-
llama.cpp oom issue (www.reddit.com)
I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least it gets killed and restarted before it st…
-
Qwen 3.6 benchmarks on 2x RTX PRO 6000 (www.reddit.com)
Got a chance to play around with a 2x RTX PRO 6000 setup, so I'm sharing some numbers for Qwen 3.6. All of these were run using the latest stable vLLM backend.
-
Could someone please help explain these results? (www.reddit.com)
I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled!
-
A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…
-
Qwen 3.6 27B MTP speed on 3080ti (getting 4.5 t/s) (www.reddit.com)
Using LM Studio with 3080ti (12gb of VRAM) and 128gb of ddr4. Model version: Qwen 3.6 27B MTP UD q4_k_xl Is this my hardware limit?
-
qwen3.6-35b-a3b-mtp running on GTX 1060 6GB (www.reddit.com)
I have this 10-year-old Dell T5810 workstation with 32GB DDR3(?) memory, an E5-2698v3 (16 cores, 32 threads), and a GTX 1060 6GB that was used for mining back in the old days (paid itself back many times over). I managed to get the model ru…
-
Need Help Choosing a Harness for Qwen 3.6 27B (www.reddit.com)
I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a spec…
-
GPU VRAM only for small models with llama.cpp: is it possible? (www.reddit.com)
I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large co…
-
Qwen3.6-35B-A3B vs Gemma4-26B-A4B (www.reddit.com)
Just wondering how are people's experience with both these models! I've had some nice results with Qwen but Gemma4 runs so much faster here.
-
Hi, (TL;DR): Qwen in its MTP version has tool-call bugs and outputs everything into tool/thinking blocks, mangling the output and cancelling the speed gain with repeated wrong tool calls! DCSS works well with non-MTP Qwen even on smaller quants.
-
My rag I've been building is much in response to having a LLM that I feel more confident in knowing where the knowledge base is coming from especially after the Open AI deal with the Pentagon. So, when I saw "uncensored" heretic models, I…
-
please forgive the mildly clickbait title. hard to fit everything in it I've seen a lot of discussion here about KV-cache quantization, especially with the recent llama.cpp improvements, leading to some debate on the tradeoffs between KV q…
-
Claude code in terminal models / combine with local llm? (www.reddit.com)
Hi, I’m pretty sure I have seen people typing /model and seeing all available models. I have to type models from memory.
-
Local model doing accounting tasks (www.reddit.com)
So I've been using Qwen 3.6 27B for monthly closes, bank recs, payables and receivables. I built a simple SQLite database that it manages.
-
Just wanted to share my amusing weekend project. https://www.askjeebus.com 100% vibe coded.
-
I'm running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it's just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) are just a real pain in the behind to get working…
-
Any reason to run dense over MOE for RAGs? (www.reddit.com)
I tend to use Claude for a lot of research and I also increasingly worry about things like misinformation or things in the model I can't audit. So, I'm building my own all in one RAG with big datasets like all of Wiki, research papers, all…
-
Removing Vision from model (www.reddit.com)
I removed the mmproj file from my models to remove vision and save VRAM. But I'm just curious: does this really not affect their text ability?
-
I opened my first contribution to exo: native multi-token prediction support for Qwen3.6-style MLX checkpoints. I hope it is useful.
-
Sharing this because I didn't believe the first run. Setup: laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s), Linux.
-
Optimizing speed & quality on Qwen3.6 27b (www.reddit.com)
Does the inference speed below seem optimal for the hardware, or could there be further room for improvement ? I’ve been trying to use Qwen3.6 27b for agentic harnesses like Pi/Hermes.
- minor speed bump for MTP with Qwen3.6-27B-MTP Q6_K_XL (www.reddit.com)
-
DGX Spark agentic usage numbers (www.reddit.com)
What I need it to do: support an openclaw-type agent that is used by multiple people. What I tried: I read on the internet about the atlas thing.
-
club-rdna16: practical 16GB AMD/Radeon local LLM testing repo (www.reddit.com)
Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/…
-
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM (www.reddit.com)
Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fit into my RTX 5060 Ti 16 GB.
-
Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps (www.reddit.com)
..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M.
-
Blackwell and PDL performance increase (www.reddit.com)
Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), which is a new feature in Nvidia GPUs (CC >= 90, not including ADA) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels…
-
I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me spl…
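The loop named in this post (think, pick a tool, observe, repeat) can be sketched roughly as below. This is only a minimal illustration, not the author's implementation: the model is replaced by a hypothetical scripted function, no llama.cpp call is made, and every name here (react_loop, scripted_model, add_tool) is invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# A "model" is any callable mapping the transcript so far to
# (thought, tool_name, tool_input). A real agent would query an LLM here.
Model = Callable[[List[str]], Tuple[str, str, str]]


def react_loop(model: Model, tools: Dict[str, Callable[[str], str]],
               task: str, max_steps: int = 5) -> str:
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        thought, tool_name, tool_input = model(transcript)  # think, pick a tool
        transcript.append(f"thought: {thought}")
        if tool_name == "finish":  # the model signals that it is done
            return tool_input
        tool = tools.get(tool_name)
        if tool is None:
            observation = f"error: unknown tool {tool_name!r}"
        else:
            observation = tool(tool_input)  # act
        transcript.append(f"observation: {observation}")  # observe, then repeat
    return "stopped: step limit reached"


def scripted_model(transcript: List[str]) -> Tuple[str, str, str]:
    """Hypothetical stand-in for an LLM: calls 'add' once, then finishes."""
    last = transcript[-1]
    if last.startswith("observation: "):
        return ("I have the answer", "finish", last[len("observation: "):])
    return ("I should add the numbers", "add", "2 3")


def add_tool(arg: str) -> str:
    """Toy tool: sums whitespace-separated integers."""
    return str(sum(int(x) for x in arg.split()))


print(react_loop(scripted_model, {"add": add_tool}, "what is 2 + 3?"))
```

The step limit is there because a real model can loop; unknown tool names are fed back as observations rather than raised, so the model gets a chance to recover.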
-
BeeLlama v0.2.0 is here! Not quite a pegasus, but close enough.
-
Experts first llama.cpp (www.reddit.com)
This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers.
-
Hi everyone, I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp.
-
Some tests with qwen3.6 27b + 35b a3b about MTP vs ngram-mod (www.reddit.com)
I will try to keep this short ;) I used GLM 5.1 to vibecode a vague prompt on my vibecoded react web app and have GLM 5.1 rank the plans made with each other and the one it made itself. Test strategy: - use starter prompt as always - add v…
-
Qwen 3.6 struggling with German (www.reddit.com)
Hi everyone, I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.
-
I tried Qwen3.6 35B A3B MoE, Qwen3.6 27B Dense, Gemma4 26B A4B MoE, Gemma4 31B Dense. In all cases I was using Q4_K_M and thinking mode enabled.
-
My workflow has changed basically to ask Codex to do certain tasks and then document how to do them (including errors it found on its way) into a skill. I feed that skill to pi, and suddenly my qwen3.6 gets that hard stuff done: - devops o…
-
Edit: does this happen every time a newbie tries to post here? Getting roasted despite having valid results?
-
pge-jax JAX implementation of the Prioritized Grammar Enumeration (PGE) algorithm for symbolic regression. Overview pge-jax is a complete symbolic regression system that automatically discovers mathematical formulas from data.
-
Ask HN: Is the next big thing locally running coding agents? (news.ycombinator.com)
There's extreme price escalation on part of Anthropic, with token spend now approaching levels that have made many-an-enterprise scratch their heads. At the same time, judging by opensource advances (E.g.
-
Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.
-
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp (www.reddit.com)
Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP.
-
Continue config for Qwen 3.6 and llamacpp (www.reddit.com)
If anyone is using the Continue.dev extension in VSCode, what config settings are you using for Continue and the llama-server? Mine keeps hanging after bad tool calls.
-
One Night Werewolf played by LLMs (www.reddit.com)
The other day I posted about playing one night werewolf on my custom made UI via tool calls. Since then I’ve played a few games and improved the prompts.
-
Qwen3.6 27B and llama.cpp appreciation post (www.reddit.com)
To preface, here's my config: llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinj…
-
I wanted to know how much of a coding agent's performance came from the model and how much came from the harness, so I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. All the images above a…
-
How can you stop your model from looping (www.reddit.com)
So I thought this was a small-model issue, but after I added a new GPU and could run low-to-mid-size models like Qwen 3.6 35B at Q4 or Q5, the issue still exists. It's not as bad as with small models, but it does break when linking the model to copilo…
-
I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g…
-
Volatile prefill speed after each reboot - llama.cpp (www.reddit.com)
After every machine restart I get a different prefill speed, it can be only 300t/s or 1500t/s. It's like a lottery at each restart.
-
I'll address all the questions here so as not to spam the sub. What would be the better setup: one PC with two 3090s and a 5080, where the 3090s would have to run in x4 PCIe slots, or one PC with the 5080 and another PC with the two 3090s on an x16 slot split into 2x8 mai…
-
Hey r/LocalLLaMA, We’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (Next Token Prediction or non-MTP) and MTP. Blog / Download NTP Models / Download MTP Models TL;DR For NTP, “pick the largest quan…
-
Hi all. As some may have been aware, Hugging Face's model search had issues recently.
-
How accurate can “whichllm” be? (www.reddit.com)
Hello people I think the question is clear but I wanted to add some context: I work on internal tools in my job and some of the tools are for us developers (most tools are for marketing and factory production). I am currently working on a…
-
MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round.
-
Hi, i run llama.cpp inside LXC on a Proxmox server. The hardware is a recent AMD Epyc with two 6000 Blackwell MaxQ.
-
IGPU 780 Unsloth Q2_K_XL Qwen 3.6 27b 8t/s with MTP LM Studio (www.reddit.com)
Man Loving MTP. And Unsloth.
-
I've tried Gemma4 and a few other variations of Qwen, but they're either not as robust with their output, or they take too long or too much VRAM and force the context limit down from 131K to 20K or even 4K, or they're slow AND low-context…
-
Hi! I read a while ago that some AIs tend to agree on their own language for talking to one another over time.
-
Greetings from former TurboQuant's biggest defender, now middle-sized niche-aware TurboQuant defender. Today I'm presenting to you the results of me thoroughly exploring the world of PPL and KLD benchmarks with my single RTX 3090 using Bee…
-
One way I like to test new models is by one-shotting (with a good prompt) a single-webpage clone of the classic arcade game Pac-Man. I usually do 3 attempts and keep the best one.
-
Find bugs in YOUR code using OpenCode, Llama.cpp and Qwen3.6 (wtarreau.blogspot.com via hn)
Background For quite some time I had been submitting tasks to LLMs via llama-cli (natively) or llama-server (API), both from the excellent llama.cpp project. On CPU-only llama-cli starts fast and can restart from a checkpoint which has alr…
-
I wanted to switch from Qwen3-Coder-Next-UD-Q4_K_XL to Qwen3.6-27B-MTP-UD-Q4_K_XL for local agentic coding. The Qwen3.6-27B is perceived to be "smarter" than Qwen3-Coder-Next, and I wanted to "upgrade" my local AI coders.
-
Qwen3.6 35B MTP, t/s varies on different scenario (www.reddit.com)
Tried Qwen3.6 35B Q5_K_M MTP, HW: 9700x, 64GB 5600 RAM, 5060 TI 16GB. --n-cpu-moe 30 ^ -ngl 99 ^ -c 131072 ^ --no-mmap ^ --flash-attn on ^ --cache-type-v q8_0 ^ --cache-type-k q8_0 ^ --threads 8 ^ --parallel 1 ^ -rea off ^ --reasoning-budg…
-
TurboQuant on 16 GB VRAM (www.reddit.com)
I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute.
-
Weird performance depending on quant (www.reddit.com)
Hi, I'm using llama.cpp with qwen3.6 35B A3B on two different machines. I noticed that on both machines tokens per second is better while using Q4_K_S and Q4_K_M quants than lower Q3_K_M quants.
-
What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only. A custom binary that bypasses some assert statement causing a crash in today’s stock releases.
-
Why might MTP be net negative for tool heavy agentic flows? (www.reddit.com)
The Qwen3.6-27B MTP benchmarks that have been circulating put factual tasks at 62-70% acceptance vs code at 79-89%. Tool calls probably sit in that factual range or lower, structured output, constrained format, less predictable than pure c…
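A rough way to reason about those acceptance figures, under the usual idealisation from speculative-decoding analyses that each drafted token is accepted independently with probability α: with a draft length of k, one verification pass then yields (1 - α^(k+1)) / (1 - α) tokens on average. The sketch below applies that formula to the acceptance ranges quoted in the post. The draft length of 3 is an assumption, and the cost of producing the drafts is not modelled, so this gives an upper bound on the benefit rather than a measured speedup.

```python
def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. per-token acceptance; real acceptance varies by position
    and by content, so treat the result as a rough guide only.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


# Acceptance rates taken from the ranges quoted in the post.
for label, alpha in (("factual, low", 0.62), ("factual, high", 0.70),
                     ("code, low", 0.79), ("code, high", 0.89)):
    tokens = expected_tokens_per_pass(alpha, draft_len=3)
    print(f"{label}: about {tokens:.2f} tokens per pass")
```

If tool calls sit at or below the factual range, each pass yields fewer tokens for the same drafting cost, which is one way MTP could end up net negative for tool-heavy flows.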
-
We have sub-agents at home (www.reddit.com)
At work I get unfettered access to gpt 5.4 and sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks. At home I am VRAM poor and like to run the models locally for my own enjoyment.
-
I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes.
-
Distilled Model's Vision Problem (www.reddit.com)
Have been using Qwen 3.6 Claude distilled version, 27b at Q4 for openclaw, Hermes and other local harnesses. But recently noticed that the Claude distilled version that I use lost its vision abilities.
-
https://preview.redd.it/8gpkg8zxmy1h1.png?width=1672&format=png&auto=webp&s=a95db16a39cdc49c0ff155117b734d413a49c2d3 https://youtu.be/MI0Pm1d6YF4 MTP can accelerate LLM inference 2x, especially for coding agents. This video covers what MTP…
-
Lemonade v10.5.1: an MTP + ROCm 7.13 quick start for Strix Halo (www.reddit.com)
Update to Lemonade v10.5.1, then: ``` Get the model lemonade pull Qwen3.6-27B-MTP-GGUF Get ROCm 7.13 lemonade backends install llamacpp:rocm Load the model (MTP args auto-applied) lemonade load Qwen3.6-27B-MTP-GGUF --llamacpp rocm --ctx-si…
-
PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
- Benchmarking llama.cpp's new MTP support on Strix Halo (calebcoffie.com)
-
MLX engine comparison… and oMLX is the top choice. (www.reddit.com)
Just stumbled on this blog. A very interesting read if you are picking inference engine.
-
Tesla P40 running qwen 3.6 (www.reddit.com)
Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40.
-
Qwen 3.6-27B giving me attitude! (www.reddit.com)
I'm laughing here. I'm messing about with Qwen3.6-27B in order to gauge just how capable it is with local vibe-coding.
-
Not getting any faster with MTP on Macbook Pro M1 Max 32gb (www.reddit.com)
Using latest llama.cpp with mtp and these settings, I only get 10 tps, should I be getting more? [unsloth/Qwen3.6-27B-MTP-Q4_K_M] jinja = true model = /Users/[username]/llms/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf cache-type-k…
-
Someone suggested I give Continue (Vscode extension) a try. I've been using Roo / Zoo now and liking it but it is pretty tough on context and I was told continue has more control over it.
-
Quantizing MTP KV Cache = free lunch? (www.reddit.com)
With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized: -cache-type-k-draft q8_0 -cache…
-
TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp…
-
Hi, i built this agentic ai, Closed-loop system that ships standalone Python agents. What's different: - Interviews you until it understands the request before building anything - Two testing stages: prompt validation via LLM invoke, then…
-
Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with llama.cpp and MTP enabled. My setup is heterogeneous: I originally acquired my server (Lenovo ThinkStation P3 Tower Gen 2) to run OpenShift/K8s clusters (because I work on that), and…
- Qwen 27b MTP Config, Llama.cpp Single 3090 (www.reddit.com)
-
Cutoff dates of open source models (www.reddit.com)
I was trying Qwen 3.6-27b and Gemma4 in a simple web chat. I asked them both a question like 'recommend the best llm for a 5060ti' and was surprised when they both replied 'user is asking about a card that doesn't exist'.
-
I'll be UPDATING this as it seems I was benchmarking and testing Just before the UPDATE LOL TL;DR If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and co…
-
EDIT 2: Trick-Assignment-828 pointed me at the actual rule update from the mods - Rule 3 Low Effort was expanded to cover LLM-assisted posts without disclosure. Disclosing now: Disclosure: I'm a non-native English speaker (German).
-
While waiting for Fara-1.5 for my coding harness (www.reddit.com)
Hi all, Not sure many people are aware so wanted to give a word about Fara-1.5 release. => this release will likely be the big sister of Fara-7B and built on top of Qwen3.5 Actual Fara-7B performs not bad at all but actually requires a pro…
-
Build Own Docker Image with llama.cpp and MTP (www.reddit.com)
Hi All! Saw some folks waiting for the Docker images with llama.cpp and MTP when it released.
-
MTP experiences on 7900xtx? (www.reddit.com)
Hi! I have been using Qwen3.6 35B A3B happily the past few weeks, and I wanted to try out Qwen3.6 27B with the new fancy MTP speculative draft!
-
ik_llama: Qwen3.6 27B and 35B on very low VRAM (www.reddit.com)
Thank you to the people at ik_llama and llama.cpp. It's amazing how far you've all pushed mtp and other tech so that I can run 27B and 35B Qwen3.6 models on an old gaming laptop with a RTX2060 mobile at 6GB VRAM and 32GB RAM.
-
Noticing qwen-27b@q2 better than qwen-35b@q8? (www.reddit.com)
The Latest qwen3.6 models. Is this odd?
-
In my real-world usage (opencode) and in my synthetic benchmarks, Coder-Next (Q5) demolishes the whole Qwen3.6 family including the 27B Dense model (All Q8). Everybody else is hailing that 27B is superior and is an amazing model, but I hav…
-
I'm not deeply technically fluent, but I have run a few models locally before, around the time before Gemma 4 dropped. I tried a low quant of Qwen 2.5 Coder and after some tinkering I got it to run, but it was just so slow, obviously.
-
Developers who use local AI - Q4_0 vs Q8_0 KV quant? (www.reddit.com)
I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…
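For the memory side of this question, a back-of-envelope estimate is sketched below. It relies on the ggml block layouts: f16 uses 16 bits per element, q8_0 packs 32 elements into 34 bytes (8.5 bits each), and q4_0 packs 32 elements into 18 bytes (4.5 bits each). Going from q8_0 to q4_0 therefore saves about 47% of the cache, slightly less than half. The model dimensions in the example are hypothetical placeholders, not the real configuration of any Qwen model, and the function assumes every layer is a standard attention layer with its own K and V cache.

```python
# Effective bits per cached element, including per-block scale overhead.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int,
                   k_type: str = "f16", v_type: str = "f16") -> float:
    """Approximate KV cache size in bytes for a full context of n_ctx tokens."""
    elements = n_layers * n_kv_heads * head_dim * n_ctx  # per K, and again per V
    bits = elements * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return bits / 8


# Hypothetical dimensions, chosen only to make the comparison concrete.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=65536)
for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(**dims, k_type=cache_type, v_type=cache_type) / 2**30
    print(f"{cache_type}: {gib:.2f} GiB")
```

This only covers memory. Whether quality holds up at 50k+ context is an empirical question the estimate cannot answer.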
-
MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it (www.reddit.com)
I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model.
-
I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed u…
-
Dual GPU llama.cpp speedup (www.reddit.com)
Llama.cpp has had a long standing issue with "--split-mode tensor", you'll get great results but it only supports non-quantized KV caches, for this very reason a lot of people decide to go with a healthy sized KV cache and ignore tensor pa…
-
M1 Ultra vs M3 Ultra speed (www.reddit.com)
Anyone have both of these and tested them? How much faster is the M3 Ultra in PP and TG speed compared to the M1?
-
Convert With MTP Support? (www.reddit.com)
Hi All, I'm trying to understand the process of creating GGUF with MTP support. Does the original Qwen/Qwen3.6-27B support MTP?
-
Using Local LLMs for research (www.reddit.com)
Hey there. I am an undergrad who has been doing mostly SWE, but will be doing ML research under my professor over the summer.
-
Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks (www.reddit.com)
Here are some results (llama.cpp)! Task 1: write a short poem. 27B Dense: 12.5 tokens/s; 27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s; 27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s. Task 2: edit a hello world HTML artifact. 27B Dense:…
-
Best local model for C# coding with 24GB VRAM? (www.reddit.com)
I can't decide whether Qwen 3.6 35b q4 (130k context) or Gemma 4 26b q4 (95k context) is better for C# coding with 24GB VRAM. Please share your experiences!
-
Testing llama.cpp MTP support on Qwen3.6 - RTX 5090 (www.reddit.com)
Setup: - RTX 5090, 32 GB, Linux - Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120) - Unsloth's…
-
LeanLoop, the Tool Claude Leans on (github.com via reddit)
So I bought a second graphics card the other week to get in on the local AI craze and I've been having the hardest time using it to build my website. It's been unreliable, the context gets eaten up, kind of hallucinates sometimes.
-
We've got great outputs for 27B via club 3090, but what about those of us who love the blazing speed of 35B on dual 3090s? I was getting 1500 p/p and 120 t/g with split layers, but MTP slowed it down to 80 t/g when I tested last week.
-
I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm. I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup.
-
Saw this post comparing Qwen 3.6 variants on coding primitives, so I wanted to see how local quants stack up against frontier models on a similar dense, single-file coding task. I ran the exact same prompt across local and web-based models…
-
TL;DR All models were Qwen3.6 27B-MTP vs Base 27B (15k single-turn): Faster overall Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%) Generation: 7.63 → 16.15 t/s (+111.77% speedup) Prompt Processing: 279.75 → 244.90 t/s (-12.46…
-
Extension idea: llama-server with custom samplers (www.reddit.com)
Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for allowing users to add custom sampling logic to llama-server without having to maintain their own entire fork and without having to make a wrapper that reimpleme…
-
LLM Inference Server A single-container, idle-aware, OpenAI-compatible inference router for a Tesla P40. Routes between Qwen 3.6 27B (MTP self-speculative decoding, TurboQuant turbo4 KV cache), Qwen 3.5 0.8B (multimodal transcription), Whi…
-
local llama.cpp parallel users - still so fast?! (www.reddit.com)
I am running a dual-GPU rig with a 5090 and a 5060, running Qwen 3.6 27B at 8-bit quant with a tensor split setting of 4,1, with the 80% on the 5090: build\bin\llama-server.exe ^ -m "!MODEL_FILE!" ^ --mmproj "!MMPROJ_FILE!" ^ -ngl 99 ^ --ctx-size !MO…
-
So, background: these people (Fred Zhangzhi Peng, Shuibai Zhang, Alex Tong) worked on converting AR -> diffusion (it's already working on older models).
-
Finding the 4x 3090 Sweet Spot (www.reddit.com)
https://preview.redd.it/8o43bjhe9d1h1.png?width=5346&format=png&auto=webp&s=1c87c2ee8b8ffff43495f543266056b0e26d3947 In another post I had someone ask me about the power draw of the 4x 3090 setup so I'm sharing a full test I conducted to…
-
My GPU power consumption is 250W (undervolted RTX 3090) when I added Qwen3.5-27B-GGUF to Ollama using a template (Modelfile made by GPT). I gave it 3 tasks to test it: build a snake game, build a flappy bird game, and make an interactive gri…
-
Qwen 3.6 27B: IQ3XXS KV Q8 vs Q4XL KV Q4 (262K context) (www.reddit.com)
hey yall. So I have a 24GB gpu.
-
how would you set up a local llm server for a business of 7 people? (www.reddit.com)
Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to…
-
Qwen3.6 9B will release around Google I/O? (www.reddit.com)
I don't think Alibaba officially said "no smaller Qwen3.6 models", and according to the pattern it should have been released in the first week of May, but I think they delayed a little bit to catch the spotlight from Google I/…
-
RLM models and Qwen3.6 (www.reddit.com)
Does anyone here have an RLM setup and how could I set it up? I want to make my Hermes agent even more powerful and I don't like that I need to open a new context window every time after just a few prompts.
-
PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise) ------- Just wanted to share my current setup, that might help some users out there... services: llama-server: image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b912…
-
In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests.
-
I am not sure if I should be proud or not. (www.reddit.com)
I managed to get 4 Qwen3.6 35B sub-agents working on dual RTX 3090s; I am using DeepSeek as orchestrator. They all working l…
-
I made a text based craft/trade/cooperate game for my agents to play on intervals when I don't have anything else for them, and it's been so fun watching them plan things out and form little factions with each other to cooperate on trades…
-
club-5060ti: practical RTX 5060 Ti local LLM notes and configs (github.com via reddit)
I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to…
-
Ok, hear me out. This all started when I was trying to understand why this Qwen3.6 27B INT8 Autoround (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B q…
-
Llama.cpp server running ~2 weeks straight. Loses its mind? (www.reddit.com)
I’ve got Qwen3.6 27b and Qwen3.6 35b running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. Is this a thing?
-
I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8. For reference my llama.cpp build is compiled with CUDA 12.9.
-
Is there a big gap between Q4 and Q6 on Qwen3.6? (www.reddit.com)
I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high.
-
I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".
-
Automated AI researcher running locally with llama.cpp (www.reddit.com)
Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure: https://github.com/huggingface/ml-int…
-
Turboquant+MTP for ROCm(Llama CPP) (www.reddit.com)
TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.
-
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (www.reddit.com)
-
Playing One Night Werewolf (Gemma4 & Qwen3.6) (www.reddit.com)
Finally feel like it’s possible. I have a custom-built (vibe-coded) UI on llama.cpp that allows model switching in the same chat.
-
Simpler self hosted alt to Open WebUI (www.reddit.com)
Got Qwen3.6 27B running on my newly assembled 4x 3090 rig (s/o 3090-club) and I'm trying to get the people in my house to adopt the local workflow. Open WebUI has improved a lot in the recent updates, but I still found it pretty rough for…
-
running Qwen 3.6 35b A3B on 2x 5060TI (www.reddit.com)
I ran Qwen 3.6 35B A3B on two 5060 Ti 16GB cards (32GB VRAM; I also have 32GB DRAM, but I don't like offloading). I used Q4 on LM Studio to get full context and I get 90 t/s. Any tricks to optimize this further so I can upgrade to Q6 or Q8? Thanks!
-
I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…
-
MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) (www.reddit.com)
TL;DR Results from the title are for single inference with 2 prompts of 1k and 15k tokens. So no MTP (as it’s slower for big prompts), no DFlash (working too but slower for big prompts), no quant used (full precision wanted) and the results a…
-
Is it worth getting a 5090 for my needs? (www.reddit.com)
I'm considering biting the bullet and getting a PC with the following specs: 5090, AMD 9950X3D, X870 motherboard, 32GB RAM (16x2) CL32. EDIT2: Price for this is falling in the arena of 5500-6000 USD where I live. Obviously costs a bomb.
-
qwen3.6 just stops (www.reddit.com)
Sometimes Qwen 3.6 just stops in the middle of a task. Is there a way to avoid it? This is qwen-code CLI, but also happens…
-
Can I improve performance for qwen 3.6 27b? (www.reddit.com)
Hardware OS: Windows 11 Pro 10.0.26200, Build 26200 CPU: Intel Core Ultra 7 270K Plus, 24 cores / 24 threads, max clock 3.7 GHz RAM: 32 GB DDR5 @ 5600 MHz, 2x16 GB Crucial CP16G56C46U5.C8D GPU: 2x NVIDIA GeForce RTX 3090, 24 GB VRAM each,…
-
Model for reverse engineering (www.reddit.com)
For a system with 4x RTX 3090: what's the best model you could use for reverse engineering C# code? Qwen3.5-122b-A10B?
-
I got a bit further with my harness for running Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool where every harness finding, modi…
-
Q: Does DFlash (and PFlash) work with Heretic models? (www.reddit.com)
Z-Lab did some good work with speeding up output, while Luce managed to use smaller models of the same family to accelerate prefill... Since Heretic and other "smart ablation" tools can decensor a model, would they work with these multi-mo…
-
We'll be getting those features (check bottom link) on mainline sooner or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs (and also big, large GPUs).
-
Warning: long post ahead. On the bright side, it's 100 percent human-written, typos and all.
-
Watched All About AI's 100% local Fireship-style video automation experiment over the weekend (link in comments). A few things worth flagging if you're trying the same stack.
-
And I'm here to share my experience. The answer is resoundingly 'yes'.
-
llama bench kv cache f32 error (www.reddit.com)
I did a quick Google search, but found nothing on this and I am scratching my head. Trying to do a llama-bench run with the KV cache set to f32 under Vulkan with a Strix Halo.
-
High VRAM local coding model — still Qwen 3.6 27B? (www.reddit.com)
I’ve been using Qwen 3.6 27B and it’s amazing. Not exactly your Opus replacement, but great for small tasks and checking work.
-
Thoughts on "production" model setups (www.reddit.com)
I've been working with the Qwen 3.6 27B and 35B-A3B models and I'm pretty happy with them. The point I've reached now is how to split my use cases.
-
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com)
I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.
-
Hey fellow Llamas, keeping it short. We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory).
-
New Qwen3.6 27b Autoround Quant (int4) Best Recipe (www.reddit.com)
I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm.
-
Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable. Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL Why these models: Qwen2.…
-
I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment.
-
MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp (www.reddit.com)
I was wondering what will be the difference in results with flag: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 vs MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Results are quite interesting 49tok/sec without MTP vs 64 tok/sec with MTP. PC: RTX5090+128GB DDR5…
-
How do I use MTP? (www.reddit.com)
Hi, I'm trying to use MTP with llama.cpp, I built from source the mtp-pr, download an MTP model from huggingface https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf But when I run the model I have an erro…
-
Qwen3.6 27b q5_k_M MTP - 256k context - 5090 (www.reddit.com)
Straight to it: llama-server-mtp \ -m ~/models/Qwen3.6-27B-Q5_K_M-mtp.gguf \ --spec-type mtp \ --spec-draft-n-max 3 \ --ca…
-
Stop wasting electricity (www.reddit.com)
Run on my RTX 4090. llama.cpp params: llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144. Power limit was set using sudo nvidia-smi -pl N. By my observation, GPU const…
-
Following up on this, I ran several more tests to cover more models and quants.
-
Estimate inference speed of local Qwen3.6-35B on Mac M5... (www.reddit.com)
"Based on currently available information, estimate the prefill/decode speed of Qwen3.6-35B-A3B Q8 with 262K context on a Mac M5 Ultra 128GB." I'm surprised that almost every LLM fails at this task (ChatGPT/Gemini/Grok/Claude/DeepSeek/Kimi…
-
New Qwen3.6 35B finetune - 0GM-1.0-35B-A3B-0427 (www.reddit.com)
https://huggingface.co/0G-AI/0GM-1.0-35B-A3B-0427 So far it behaves better than, for example, Qwopus in terms of consistent answers. I've been testing Q6K from https://huggingface.co/mradermacher/0GM-1.0-35B-A3B-0427-i1-GGUF Also I checked the…
-
Question in title. Would be awesome to have this on macs, especially q8 or whatever the minimal-loss quant is, since macs can have lots of ram.
-
I run the 4 bit quant of Qwen-3.6-27B in the codex harness with unsloth recommended llama-server settings, thinking enabled. I have tried the default chat template and the updated ones and have updated both my GGUFs and llama-cpp to the mo…
-
Will there be any more Qwen3.6 series models? (www.reddit.com)
I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.
-
Anyone with 4x 5060ti based setups? (www.reddit.com)
I am currently running 2x RTX 5060 ti and happened across some good sales for additional ones coinciding with a really good sale of a highend Z890 motherboard (replacing my B860 board) that could support quad GPUs (with 2 M.2 adapters, end…
-
Does 'preserve_thinking' work with openwebui? (www.reddit.com)
I'm running qwen3.6-35b with llama.cpp connected to openwebui. And I noticed the model fails the number guessing game test on openwebui while it works perfectly with the llama.cpp web ui.
-
I have this issue in all Windows installations I have done on my system, which of course does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5. The issue is that for some reason, after some time, llama.cpp performance cuts in half, even restar…
-
Hey folks, just a heads-up for anyone running Qwen3.6 through llama-server. I ran into an issue where the preserve_thinking parameter wasn't working as expected, even though I had it explicitly enabled in my models.ini config.
-
Why is opencode so slow in processing the prompt with llama server? (www.reddit.com)
I'm running opencode and llama-server locally. I have 32gb ram and 780m igpu.
-
The Qwen 3.6 35B A3B hype is real!!! (www.reddit.com)
My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research. My research is on some pretty niche topics and I doubt that anything like it is…
-
--model "/mnt/e/my-path-change-to-yours/qwen3.6-35b/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf" \ --ctx-size 262144 \ --parallel 1 \ --n-cpu-moe 29 \ --no-mmap \ --mlock \ --cache-type-k q4_0 \ --cache-type-v q4_0 10.8/16 d…
-
Hey everyone, I’m trying to debug a weird prompt cache issue with OpenClaw + oMLX, and I’d appreciate help from anyone running local agents on MLX/oMLX. The short version is this: I’m running oMLX v0.3.8 on my Mac, serving: Qwen3.6-35B-A3B…
-
Hey everyone, I've been playing around with Gemma4 and Qwen3.6 on my 32Gb Macbook Pro M2 Max since their release but I'm struggling at finding: The best software to run it (oMLX, llama.cpp, ...) The best model + quant to pick The best sett…
-
I recently published MTP quants of Qwen 3.6 27B and I was surprised by the reports here on Reddit, and on HF, of users who were experiencing worse speed with speculative inference than without. This did not match what I was seeing, but when…
-
Getting a feel for how fast X tokens/second really is. (www.reddit.com)
I love following all your adventures with local LLM setups. Quality and size of the models are important, but so is performance.
-
I'm using OpenWebUI and making tools/skills to improve my model's functionality. I am currently using Qwen 3.6 35B A3B Q8 (F16) 256k. I grabbed `parallel tools` to be able to run multiple tool calls at once.
-
Speeding up local LLM for usable coding agent (www.reddit.com)
TL;DR: Qwen 3.6 35B-A3B (Q4_K_M) is running slow at around 9 t/s with 72% filled context (36147 tokens window) and a total response time of 77s including prefill and token generation. Ran this using LM Studio on Windows with the attached i…
-
Hello from 10KM high! - Thanks to Qwen 3.6 35b a3b! (www.reddit.com)
Typing this on a cramped flight, but I was having issues connecting to the plane's wifi on my ubuntu laptop, when it was effortless on my phone. The issue I was having was the Laptop WiFi connected to the plane wifi network, but captive po…
-
Has anyone bought a 3080 20GB mod recently? (www.reddit.com)
I think it would suit my needs perfectly, but I'm scared of getting scammed on Alibaba so looking for some sellers who have delivered. Follow-up question for those who have the card, how well does it run Qwen 3.6 27B?
-
am I running this llama-bench of Qwen3.6-27B on these V100s right? (www.reddit.com)
basically what I'm doing here is trying to validate whether or not it's a reasonable idea to get a couple of V100s, either SXMs with PCIe adapters or straight-up PCIe cards in the first place, for the sake of running this model or models l…
-
What are the best 40-500 B MoE LLM models now? (www.reddit.com)
Due to an old GPU I run on CPU and came to appreciate the value of MoE. I know of MoE models for Qwen 3.6 and Gemma-4, which are <40B.
-
I'm keeping a close eye on the development of local llms.
-
Probe-Detected Grokking in Multi-Probe DPO (openinterp.org via hn)
Orthogonal Learning Beyond Task-Specific Detectors in Qwen3.6-27B. Abstract: We report a phase-transiti…
-
I have Qwen3.6-27B as my main model, I use it for coding with opencode and chatting with open-webui, yet to try out hermes or openclaw. I found out about their existence basically by searching or through reddit - but maybe there’s more tha…
-
9070xt inference for q3 qwen 27B (www.reddit.com)
In llamacpp I'm getting 12tok/s, does this number look right to you and what can I do to increase this number (if possible)? cd ~/llama.cpp && ./build/bin/llama-server -m models/qwen-3.6-27b-abliterated-q3.gguf -ngl 999 -c 65536 (i need th…
-
TL;DR New llama.cpp fork! I wanted a Windows-friendly inference to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled.
-
I have been trying various NVFP4 based variations of Qwen 3.6 27B, and I am seeing this for the ones that look most interesting to run on my 2x 16GB VRAM with KV cache fp8. vllm | (Worker_TP0 pid=136) WARNING 05-09 13:49:27 [kv_cache.py:10…
-
I usually use Qwen3.6 27B (slow as heck on my RX 6800 but it works) for plan and Qwen3.6 35B A3B for the coding. But I was thinking the other day if I should remove the thinking from the code model.
-
More Qwen3.6-27B MTP success but on dual Mi50s (www.reddit.com)
TLDR: The hype is real! 1.5x speedup.
-
Pi and Qwen3.6 27B make setting up Archlinux really easy. (www.reddit.com)
Just thought I'd share this use case. I was setting up a miniPC as a home theatre with Archlinux (It's the OS I'm most familiar with).
-
Show HN: Transformer Math Explorer (simonramstedt.com via hn)
Interactive reference for transformer models, presented via dataflow graphs, drillable down to elementary mathematical operations. Covers models from GPT-2 to Qwen 3.6, with MLA, MoE, RoPE, MTP, hybrid attention, and other variants togglea…
-
this is almost certainly a skill issue, however: ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 1 --flash-attn 1 --device CUDA0,CUDA1 -p 2048 -d 4096,16384,65536 rather than splitting across those two cards, it firs…
-
Just got a 8x 32gb v100 server... now what (www.reddit.com)
Looking for suggestions. Current setup: llama.cpp, and I ran Qwen 3.5 397B at 256k context.
-
RTX Pro 4500 Blackwell - Qwen 3.6 27B? (www.reddit.com)
I have a server running a 4500 Blackwell on CUDA 13.1 and nvidia/595.58.03 with 48GB mem assigned to it. I have build: dcad77cc3 (8933) with Qwen3.6-27B UD-Q5_K_XL loaded and connected it to Roo code.
-
Those of you who like Gemma4 models - how are you guys using them? (www.reddit.com)
I have been using local LLM for coding quite a lot as well as some other tasks (like data extraction from images) and I had quite a good success with Qwen3.6 models. It's obviously not Sonnet/Opus, but I am able to get quite a lot of work…
-
Qwen 35B-A3B is very usable with 12GB of VRAM (www.reddit.com)
Hardware: RTX 3060 12GB 32GB DDR4-3200 Windows CUDA 13.x Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.
-
So I've been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use. So after a day of vibecoding I think I may have gotten something viable.
-
Just before RAMpocalypse, I built a 2x3090 rig using dirt-cheap humid-env-run GPUs and 128GB RAM on a Ryzen 5 9600X. I've been pretty happy with the local models it lets me run, spilling into RAM occasionally as needed.
-
Local LLM for electronics design work? (www.reddit.com)
Another hobby is working on electronics projects ranging from low-voltage control and signal processing to HV tube amp circuits. I design and simulate in LTspice before prototyping.
-
z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet? (huggingface.co via reddit)
For the past few days, it's all been about MTPs. Somehow people missed the fact that Z lab released the DFlash for Gemma4 26B a couple of days ago.
-
Hey there everyone, I've been struggling to find an actually good guide (not some fluffy video or AI slop) on renting hardware from a service to run a local LLM with high token output. Before I invest in some serious hardware, I thought I…
-
Ran a full eval against four local models last weekend and the spread between them is wider than I expected. All running through Ollama on CPU, no cloud, same prompts, same hardware.
-
Support for spec prefill and spec decode on qwen3.6 model family (www.reddit.com)
Anyone familiar with getting both to work? I've got a few work systems and I want to make a case for inhouse data generation for the team, and I've got a very very crusty implementation going by putting a bifrost service on one of them, an…
-
For weeks I've been working on creating the fastest local AI engine for Apple Silicon... And I finally did!
-
Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK (www.reddit.com)
TL;DR On 4× RTX 3090 with NVLink bonded between GPU pairs (0↔2 and 1↔3), pinning TP=2 to a NVLinked pair gave +25% throughput at concurrency 1 and +53% at concurrency 4 vs running TP=2 over PCIe. Adding the other two GPUs to make it TP=4 m…
-
Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB (www.reddit.com)
I am a freelance developer. Qwen 3.6 27B is great on the 5060s but a bit slow.
-
Qwen 3.6 Looping with Tools? (www.reddit.com)
For some reason, my qwen started looping a lot recently, ever since I introduced MCP tool calls. I don't know why as I didn't really change anything other than that.
-
OK, so I will try to explain myself as much as possible because I really cannot find much about this online. Let's start with my settings for running Qwen 3.6 35B: Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinkin…
-
Is it my imagination or... (www.reddit.com)
Is Qwen 3.6 35b now considerably stupider in the latest llama-server releases? I had this model doing cartwheels two upgrades ago.
-
Qwen 36 27B + Gemma 4 - the best set for 1x 3090 ? (www.reddit.com)
Hi guys 👋 When I started my adventure with Qwen 3.6 27B I felt wow.... Now when I connect it with Gemma 4 I'm feeling more wow...
-
Disappointed in Qwen 3.6 coding capabilities (www.reddit.com)
I know that coming from Codex I should adjust my expectations, but still. I'm working on a midsize project.
-
I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday.
-
Mimo2.5 (not pro) under llama.cpp? - primary model opencoder? (www.reddit.com)
I tried running AesSedai/MiMo-2.5-GGUF:Q4-K-M under llama.cpp (main tree, compiled 36 hours ago). Hardware: NVIDIA A6000 with 48GB RAM + 300GB CPU RAM. I had no success: error loading model: missing tensor blk.0.attn_q.weight ... Is Mimo alre…
-
why llama.cpp can’t combine speculative decode methods? (www.reddit.com)
dicking around with the new mtp speculative decode with qwen3.6 27b, and it’s great. but for agentic coding i’ve seen significant improvements from ngram, because a decent fraction of the time (e.g.
-
MTP - The proofs in the puddin! Using it with Qwen3.6-27b (www.reddit.com)
Been running llama.cpp MTP with Qwen3.6-27B Q4_K_M as my daily coding assistant and got curious what was actually happening under the hood. Pulled the metrics from llama-server and charted a full session.
-
I’m looking for a tool or calculator that can estimate the minimum hardware needed to run a specific model locally. For example, I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds.
-
What models for coding are you running for a mid level PC? (www.reddit.com)
I have a 4060 (8GB Vram) and 16GB of ram wondering which models could fit in my setup for coding, the new Qwen 3.6 and Gemma 4 MoE models look good but might not fit, wondering about your experiences
-
Fine-tuned Qwen3.6-35B-A3B DeltaNet experiment (www.reddit.com)
I fine-tuned Qwen3.6-35B-A3B on its own outputs for $7 on Apple Silicon + Modal. DeltaNet LoRA targeting was the hard part.
-
Has Qwen3.6-27B Surpassed GPT-5.5? (Not Joking) (www.reddit.com)
So I had this idea for a project which was to try to fix a pretty hard coding problem using local agents running in a loop. The project is a compiler for biology protocols from vendors.
-
tl;dr - For software development, Qwen3.6 27B, 5090 gives you ~3x speed over M5 Max, letting you plow through code, while M5 Max gives you ~4x memory, letting you use higher quantization and bigger context. Which would you choose and why?
-
Get faster qwen 3.6 27b (www.reddit.com)
Using 100k context on a 3090 with an MTP GGUF and getting 50 t/s on llama.cpp. Thought I would share the knowledge. Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF and the am17an commit: /media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/ll…
-
Group cluster rental as a service (www.reddit.com)
With the explosion of apps like open claw, and the launch of my own app (trigger warning, not open source), there is massive demand for tokens. It used to be possible to avoid anxiety about your monthly bill by just buying a claude code su…
-
Following my previous post https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq, a few people asked for the 35B A3B version. The model is up on HuggingFace at https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF if anyone wants to ch…
-
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot (www.reddit.com)
Long post, but hopefully helps somebody. Llama-cpp vulkan server running single AMD R9700.
-
Some of you saw our post a couple weeks back about hitting 102 tok/s stable on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: Github Atlas is open source.
-
Why people cares token/s in decoding more? (www.reddit.com)
What I've noticed while using local LLM recently is that in most cases, bottlenecks occur not in decoding but in prompt processing. If the prompt processing speed is usable, in most settings (since it takes about 15k when starting based on…
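The argument here can be made concrete with a simple latency model: time to a full response is roughly prompt tokens divided by prefill speed plus output tokens divided by decode speed. A minimal Python sketch; the 15k prompt size is the figure mentioned in the excerpt, while the output length and both speeds are hypothetical placeholders, not measurements from the post.

```python
def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (prefill seconds, decode seconds) for one request.

    Ignores prompt caching, batching, and speculative decoding.
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s


# Hypothetical numbers: 15k-token agent prompt, 500-token answer,
# 250 t/s prompt processing, 25 t/s generation.
prefill_s, decode_s = response_time(15_000, 500, prefill_tps=250.0, decode_tps=25.0)
total_s = prefill_s + decode_s
print(f"prefill: {prefill_s:.1f}s, decode: {decode_s:.1f}s, "
      f"prefill share: {prefill_s / total_s:.0%}")
```

With long agent prompts and short answers, the prefill term dominates unless most of the prompt is served from cache.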
-
Both opencode and pi coding work, but I've hit the same wall with open weights. Qwen 3.6 and even fine-tuned variants, they drift into loops once the tool calls get nested or ambiguous.
-
This post will have a slight old-man-shakes-fist-at-sky vibe, because….well… I’m older, so if you’re not into that, then please feel free to skip it. I have been contributing to this sub for like 3 years now but I’m fearful this post will lik…
-
So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF. This is not a "best possible setup" claim.
-
Hey folks, looking for advice before I delete or keep a huge model file. I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM.
-
Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3.6-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF…
-
Based on my last post and some comments, I added Qwen3.6:latest and Devstral to the evaluation. I am still looking for suggestions on which local model can run a complete TDD loop autonomously.
-
WARNING: wait before downloading from HF: I just realised my upload of the new versions with the additional fix in the chat template has not completed yet. I will remove this warning once done. The recent PR to llama.cpp brings MTP support to Q…
-
5060ti 16gb or 5070 12gb for local LLM (www.reddit.com)
As the title says, which is better, taking into consideration that it will probably offload to CPU anyway? Models: Qwen 3.6 35B and maybe Qwen 3.6 27B, though I am not sure it will be usable...
-
Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? (www.reddit.com)
How is this dual setup's performance? Is it difficult to set-up everything with for example llama.cpp?
-
Solidity LM surpasses Opus (www.reddit.com)
My weekend project overran a little but happy with the end result. soleval pass@1 beat Opus 4.7 on the same set of tasks.
-
Qwen 3.6 and inline comments (www.reddit.com)
I've been using Qwen 3.6 with the Pi harness, and so far I'm really enjoying the experience. I've noticed Qwen is great at leaving inline comments when writing Typescript (haven't tried other languages).
-
The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB VRAM setup.
-
Thinking mode is becoming a liability for production agents (www.reddit.com)
Every new model release I see now has thinking on by default. But then the production results I'm seeing don't justify it.
-
Qwen 3.6 27B MTP on v100 32GB: 54 t/s (www.reddit.com)
Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on v100 32GB SXM module using one of those pcie card adapters. Pulled and built in one shot, and llama-server ran without a hitch.
-
What do you use Gemma 4 for? (www.reddit.com)
Both Gemma 4 and Qwen 3.6 seem to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks.
-
Gemma4:31b-coding-mtp-bf16 - slow on Macbook M5 128gb (www.reddit.com)
Very quick initial test of Gemma 4's new MTP model via Ollama (llama.cpp doesn't support it yet) https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/ Running in Open WebUI to view token/s output and I…
-
Smaller gguf getting way less tokens per second?? So confused! (www.reddit.com)
Noob here, Running Qwen3.6 35B A3B in LM Studio on a 3080 10GB + Ryzen 5 3600 on Windows 10. Tried some unsloth quants with identical settings (GPU offload 40, MoE layers to CPU 40, context 8192, flash attention on).
-
UPDATE: i have switched to vulkan (image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9014) and now i am getting prompt eval: 591.01 tok/s generation: 41.90 tok/s which is faster than rocm new config: services: llama-cpp: container_name: lla…
-
Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. (open.substack.com via reddit)
Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article...
-
[Benchmark] Llama.cpp: Mac vs CPU vs GPU + CPU, Qwen3.6 27B, Q8 (www.reddit.com)
llama.cpp parameters: -c 260000 --jinja --no-mmap; model: HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced:Q8_K_P. Based on…
-
Use Qwen3.6 right way -> send it to pi coding agent and forget (www.reddit.com)
Just a reminder: the harness you use (basically your LLM client and interface) can make a huge difference. It's way mo…
-
Just ran some llama-bench comparisons between ROCm and Vulkan backends on my Strix Halo system. Vulkan came out ahead, which surprised me.
-
Preserve thinking on or off? (Qwen 3.6) (www.reddit.com)
Are y'all using the preserve thinking flag or do you have it off? If so, why?
-
Qwen3.6 merged chat template from allanchan339 and froggeric (www.reddit.com)
Hi, froggeric and allanchan339 each recently released an enhanced/fixed template for Qwen3.6, each one addressing different topics. I didn't know which one to use so I merged both with the help of Claude Opus to have the best of both.
-
Thoughts on GRM-2.6-Plus-GGUF ? (www.reddit.com)
Judging by what they state, it should be better than Qwen 3.6 27B
-
I've tried the setup in the title today for some vibe coding (ctx=262144, temp=0.6). I must be doing something wrong because it doesn't really work for me.
-
Considering two Sparks for local coding (www.reddit.com)
I'm currently running a 4x RTX 3090 system (96GB VRAM, DDR4 2133 RAM) and have tested opencode and pi.dev using Qwen3.5-122B-A10B (AWQ) up to 200k context for web app coding (html/js/python). I'm now seriously considering picking up two Sp…
-
Best config for Qwen3.6? (www.reddit.com)
With all the high praise for the model all around, I also want to try it on my own. I have an rtx3060 12gb vram and 16gb system ram.
-
Hey everyone, I’ve been experimenting with running Qwen models locally on my setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB CPU: Ryzen 5700X OS: Windows 11 What I’m currently running Qwen 3.6 35B (UD Q4_K_M) llama-server.exe -m "C:\Users\Dino\…
-
----START HUMAN TEXT---- Hi all, I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with quantized KV wil…
-
qwen 3.6 27B looping problem (www.reddit.com)
Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…
-
Native MTP speculative decoding on Apple Silicon ~2.24× over no-MTP AR at temp=0.6 on Qwen3.6-27B · math-correct rejection sampling · MLX-native · zero external drafter Multiplier is hardware-independent. Absolute tok/s scales with memory…
-
How do you estimate total memory usage? (www.reddit.com)
Qwen3.6 35B A3B UD IQ4_NL_XL. 512k context tokens for 4 parallel processing, key cache quantized to Q_8 and value cache quantized to Q_4.
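A rough way to approach this kind of estimate is to add up the GGUF file size, the KV cache, and some headroom for compute buffers. The sketch below covers only the KV cache term. It assumes llama.cpp's block layouts for q8_0 (34 bytes per 32 values) and q4_0 (18 bytes per 32 values). The function name is illustrative, and the layer count, KV head count, and head dimension in the usage lines are placeholders rather than confirmed Qwen3.6 35B A3B values; they should be read from the model's metadata.

```python
# Bytes per cached element for a few llama.cpp cache types.
# q8_0 and q4_0 pack 32 values per block plus one fp16 scale.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes for n_ctx cached tokens in total."""
    # Elements stored per token for K alone (V has the same count).
    elems_per_token = n_attn_layers * n_kv_heads * head_dim
    bytes_per_token = elems_per_token * (
        BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v]
    )
    return int(n_ctx * bytes_per_token)


if __name__ == "__main__":
    # Placeholder shape values (assumptions, not taken from the model card).
    size = kv_cache_bytes(n_ctx=512 * 1024, n_attn_layers=10, n_kv_heads=2,
                          head_dim=256, type_k="q8_0", type_v="q4_0")
    print(f"KV cache: {size / 2**30:.2f} GiB")
```

Whether the 512k tokens are shared by the 4 parallel slots or allocated per slot depends on the server settings, so `n_ctx` here should be the total number of tokens actually cached.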
-
Best Llama Config for Turboquant_Plus? (Stats below) (www.reddit.com)
So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.
-
The more I use it, the more I'm impressed (www.reddit.com)
Qwen 3.6 27b vs Codex GPT 5.5 / Claude Opus 4.7. My local LLM discovered a bug that they both missed, and it turns out it's critical. GPT 5.5 and Claude both stood their ground and didn't give up until the end - they claimed to be right all a…
-
Sglang is better for serving a model for a personal agent harness? (www.reddit.com)
If one has enough vram, would Sglang be a superior choice than vLLM or llamacpp in terms of inference speed for serving a model dedicated to powering a personal (single user) agent harness like Hermes agent? Sglang has MTP for speculative…
-
Hi there. Not native English speaker.
-
For the past two weeks, I’ve been spending 4–5 hours a day building a custom MCP (Model Context Protocol) orchestration server. What started as a simple experiment with Qwen 3.6 35B has evolved into a full-scale "Man-in-the-Middle" proxy t…
-
In the rapidly evolving landscape of AI-assisted development, choosing the right "harness" is as critical as choosing the model itself. This benchmark explores two of the most prominent open-source harnesses for local LLMs: Pi Coding Agent…
-
Advice needed on eGPU and Mini PC (www.reddit.com)
Hi all, I came across a relatively niche problem and could not find many useful posts or guides about it. I have a mini PC (Beelink Ser 8, 8745HS and 32GB 5600 DDR5 SODIMM) headless server for hosting some routing services, and I am wonde…
-
I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights. KV cache at q8_0/q8_0 and 262K context only uses 2.7GB, because the MoE only has 10 attention layers out of 40. Total VR…
-
I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…
-
I’ve run models exclusively on apple silicon up until now, but wanted to up my inference game. I bought a slightly used RTX 5000 Pro Blackwell for a bit more than twice as much as two 3090s.
-
General vs Reasoning [Qwen 3.6] (www.reddit.com)
I want to play with Qwen 3.6. Unsloth shows 4 different parameter options for different use-cases.
-
Anyone tried +- 100B models locally with foreign languages? (www.reddit.com)
I am quite curious as I tried Gemma 4 31B, Qwen 3.6 27B, GLM 4.7 30B and some others in my native language (czech). Gemma performs "best" and considering the fact its "just" 18GB model - it actually blows my mind how well it can respond in…
-
Using ollama for Openclaw (www.reddit.com)
Hi all, I have recently installed OpenClaw on a Raspberry Pi 4, linking it to my local Ollama instance (RTX 3090 with 24GB, as well as 96GB of DDR5 RAM bought before the madness), in my case running Qwen3.6 (latest) capped at 16k context. A…
-
Hello everyone. Over the last couple months I have been assembling my local AI setup for personal use, and I thought to write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.
-
Which model should I try? (www.reddit.com)
In my current workflow (coding in Python/C++ and technical reports) I mostly use Qwen3.6 27B and Gemma4 31B. In the past I tried other models like Deepseek with decent results, but they were painfully slow....
-
If you've been waiting to try local AI development, please try it (www.reddit.com)
I have snobbishly long felt that the local models were not 'up to my standards' for local development, or otherwise able to compete with GHCP, Claude Code, Cursor etc. Boy was I wrong.
-
Multi agent AI Trading Floor (www.reddit.com)
Hello, I built a multi agent AI trading floor for a school project: 10 agents (news, research, macro, crowd sim, trading…) Running 100% locally on Ollama, Gemma 4:26b, qwen3.6:35b, gemma4:31b. no paid APIs.
-
Qwen 3.6 seems to have a lot of trouble with tool calling (www.reddit.com)
(I'm on Windows system running these models locally) I've used both Codex and OpenCode with Qwen 3.6 27b and 35b running locally. I'm having a bitch of a time getting them to correctly create files.
-
I've had better results quality wise with 35B AND it's much faster than 27B. Just curious cause I see lots of people post about 27B.
-
I made a visualizer for Hugging Face models (www.reddit.com)
I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different mod…
-
What could they mean by "warmed steady-state"? (www.reddit.com)
https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/ Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, v…
-
Need advice on Qwen 3.6 27B INT4 quantization (www.reddit.com)
Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.
-
I wanted to share an open-source app that I built for running LLMs locally on my setup. My setup Hardware FEVM FAEX1 (128GB) RTX Pro 5000 Blackwell (48GB), connected over OCuLink Aoostar AG02 2x2TB internal m.2 drives on raid-0 using mdadm.
-
Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants)…
-
Kv cache quantization: ignorance, or malice? (www.reddit.com)
I run Qwen-3.6 27B FP8 on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between spee…
-
Is it worth adding local LLM to agentic coding stack? (www.reddit.com)
Hey all, my agentic coding stack includes claude-code 20x max and codex 20x max. I use heavy scripting for orchestrating and testing multiple projects, and have been AI coding for 3 years.
-
5070 Ti —> 3090 move. Worth it? (www.reddit.com)
I got into LLMs in late 2024, and local in Jan 2025. Since then, I’ve upgraded my mini PC, then added an eGPU with a 5070 Ti back when it was retailing for $750-$800.
-
What's your tps on 3090 + Qwen 3.6 27B in real tasks? (www.reddit.com)
I struggle to wrap my head around all this. My goal is a local agent to solve low-complexity tasks, in the same harness where I would use frontier models.
-
LDR maintainer here. Thanks to the strong support of the r/LocalLLaMA community, LDR got very far.
-
The angle here is native Windows, no WSL. Simple installation, open source, no telemetry.
-
Does anyone do this? Any tips?
-
Have Qwen said anything about further Qwen 3.6 models? (www.reddit.com)
Have Qwen hinted at whether other models (9B, 122B, 397B) would be getting the 3.6 treatment? Or have they in any way confirmed or hinted at "this is it"?
-
It is a plugin made for ONLYOFFICE, much simpler than copy-pasting from a web UI. PS.
-
Qwen3.6-27B-NVFP4 - images (www.reddit.com)
Model: Abiray-Qwen3.6-27B-NVFP4.gguf Specs: - Legion 7i Gen10 - NVIDIA GeForce RTX™ 5090 - Intel® Core™ Ultra 9 275HX × 24 - RAM 32.0 GiB llamacpp settings: ./build/bin/llama-server \ -m ~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GG…
-
Qwen3.6-27B - Closed-loop SVG Images (www.reddit.com)
-
So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up…
-
4080 Super > RTX 6000 Pro, Wow! (www.reddit.com)
A friend is going on vacation for a couple weeks and is lending me an RTX 6000 Pro rig to mess around with. Holy cow, it is so much faster than my 4080 Super!
-
Which other models will my system support? (www.reddit.com)
This is my system: OS: Nobara Linux 43 Processor: Ryzen 9 5980HX RAM: 16 GB GPU: Radeon RX 6800M (12GB) I'm using llama.cpp and Qwen3.6-35B-A3B-UD-Q4_K_M is working okay in this system using vulkan. I'm getting a speed of ~17 t/s.
-
Need help optimizing qwen 3.6 on my 2x 5060ti 16gb (www.reddit.com)
Hi all, I tried to set up my PC to run LLMs, but hit an issue: the first question of the chat is generally fine, but from the 3rd follow-up question on, the backend often becomes unresponsive and I have to manually restart the llama.cpp server, or…
-
I experienced this with Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. It starts saying things which sound similar in thinking mode: I must do ....
-
love it - Qwen3.6-27B — UD-Q5_K_XL evaluation (www.reddit.com)
by Kyle Hessling A hands-on benchmark of the Unsloth dynamic Q5 quantization, self-hosted on a single RTX 5090. 19 runs, 93.9 k generation tokens, across agentic reasoning, production-grade front-end design, and canvas / WebGL creative cod…
-
Qwen 3.6 27B Neo Code Q4 KM I matrix is badass (www.reddit.com)
So I am using this model in tax accounting. Have a shitty Ryzen 9 7940HS (8C/16T), 60 GB RAM, Radeon 780M iGPU, 1 TB Kingston NVMe, Win 11 Pro.
-
I’m building OpenYak, a desktop AI workspace for using local models with real files on your computer. In this demo I’m using Ollama with Qwen/Qwen3.6-35B-A3B to review an attached budget workbook.
-
Hello folks. What is the best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM Studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.
-
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (www.reddit.com)
Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens.
-
Surely a 4x bigger model should be more expensive for inference?! API prices at e.g.
-
Qwen 3.6 - Loops and repetitions (www.reddit.com)
I normally seldom experience loops, either reasoning or responses, using Qwen 3.6 27B Q8 with 256k context window in Agent Zero. But the 35B A3B Q8 with 256k context window gets constant loops and is basically unusable within Agent Zero.
-
What's the best suscription under 20$? (www.reddit.com)
I’m pretty overwhelmed. I feel like there are so many options that I don’t know which one to choose, and trying things until I find a decent one isn’t really my thing—even though I enjoy it.
-
Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops) (www.reddit.com)
I've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking".
-
Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS). We’ve been pushing further on both context length and stability for tool-agent workloads.
-
I wanted to see how much of my coding-agent workflow I could move local instead of paying for hosted tools forever. There was another push: Anthropic's own April 23 postmortem confirmed product-layer regressions through March/April.
-
Would implementing a dual GPU configuration enhance the TPS? (www.reddit.com)
I am currently utilizing a single RX9070 16GB, achieving a performance of 20 tokens per second with Qwen 3.6 27B. Would integrating an additional RX9070 enhance this performance, or would the output remain consistent?
-
Are Qwen 3.6 27B and 35B making other ~30B models obsolete? (www.reddit.com)
Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows.
-
Longtime lurker here, thought I should post my speeeeds... I have an RTX 4070S 12 GB VRAM (+10% OC), AMD 9800X3D with 4x16 GB DDR5 6000MHz CL30.
-
Qwen3.6 27B seems struggling at 90k on 128k ctx windows (www.reddit.com)
I have an RX 7900 XTX, running Qwen3.6 27B Q4_K_XL. I get 400ish pp and tps in the 30s.
-
Hey y'all! I've recently written a text in Russian about my experience comparing Qwen-3.6-27B with lower tier cloud models on hard tasks -- I wanted to share the translation of the post, since I found the results interesting and surprising.
-
Qwen-27B as a Local Agent — It Actually Works Now (www.reddit.com)
It's been a busy week testing and trying to get the 27B model set up correctly. TL;DR: The only setup that worked for my dual 3090s was this one.
-
Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer (www.reddit.com)
I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops. This isn’t a criticism of Qwen3.6.
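One simple way a guard like this could work, sketched here as an assumption rather than the post's actual design, is to watch the streamed reasoning text for a word n-gram that repeats too often and cut the stream when it does. The class name, the thresholds, and the whitespace tokenization are all hypothetical.

```python
from collections import Counter, deque


class LoopGuard:
    """Flags a stream once any word n-gram repeats max_repeats times
    within the most recent `window` n-grams."""

    def __init__(self, n=8, max_repeats=4, window=2000):
        self.n = n
        self.max_repeats = max_repeats
        self.window = window
        self.words = deque(maxlen=n)   # last n words seen
        self.recent = deque()          # n-grams currently inside the window
        self.counts = Counter()        # occurrences of each n-gram in the window

    def feed(self, text_chunk):
        """Feed a streamed chunk; return True if a loop is suspected.

        Simplification: chunk boundaries are treated as word boundaries.
        """
        for word in text_chunk.split():
            self.words.append(word)
            if len(self.words) < self.n:
                continue
            gram = tuple(self.words)
            self.recent.append(gram)
            self.counts[gram] += 1
            # Forget n-grams that have slid out of the window.
            if len(self.recent) > self.window:
                old = self.recent.popleft()
                self.counts[old] -= 1
                if self.counts[old] == 0:
                    del self.counts[old]
            if self.counts[gram] >= self.max_repeats:
                return True
        return False
```

A proxy would call `feed` on each streamed delta and, on `True`, stop forwarding and close the upstream request. Tuning `n` and `max_repeats` trades false positives on legitimately repetitive output against how long a loop runs before it is caught.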
-
Tenstorrent TT-QuietBox 2 Specifications (Blackhole) (www.reddit.com)
Source: https://docs.tenstorrent.com/systems/quietbox/quietbox-bh-2/specifications.html Currently supported models: https://tenstorrent.com/developers From the specification docs above: CPU: Ryzen 7 9700X 65W Granite Ridge 3.8GHz Memory: 2…
-
Hugging Face link here. I've been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs as his Qwen 3.5 was my GGUF of choice.
-
Larger Gemma-4/Qwen3.6 (www.reddit.com)
Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?
-
Qwen Models are such good models? (www.reddit.com)
https://preview.redd.it/o1uxb57u47yg1.png?width=862&format=png&auto=webp&s=d38204fe6ccd0d8326dcd98a534e9a226d213f99 How trustworthy is the Artificial Analysis intelligence index? According to them, Qwen 3.6 27B is better than bigger MoE mod…
-
Strip Qwen3.6 dense of its multimodal capabilities (www.reddit.com)
This may be naive, but if we stripped a model of its image processing/voice processing capabilities, would that make it smaller or faster? Is that even possible?
-
Followup to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point.
-
Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop (www.reddit.com)
Hi all, I'm running Qwen3.6-27B-UD-Q6_K_XL.gguf using llama swap and llama-server with these parameters (actually stolen from some posts on this subreddit): llama-server \ -m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \ --mmproj /models…
-
Don't forget about dem free gains! (www.reddit.com)
Looks like progress has been made on -sm tensor. Couldn't even run llama-bench a few weeks ago: 1 card - 1580/44: $ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB): Device 0: NV…
-
I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…
-
I tested two llama.cpp builds on the same Qwen3.6-27B-NVFP4 model. llama-bench reports the model label as qwen35 27B NVFP4, but the actual tested model is Qwen3.6-27B-NVFP4.
-
I've been running Qwen3.6-35B-A3B locally in pi agent and hit a cat spam problem. The agent just ignores the read tool, and the model gets stuck reading the same file 3-4 times using cat, or dumping entire 2k-line logs instead of grepping.
-
Well, or pretty close to it; they are excellent workhorses. I run them in real work scenarios doing some of the work I used to do myself as a skilled expert in my field, billing $200 an hour.
-
I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards. Hardware: 2x RTX 5060 Ti 16GB 32GB total VRAM Proxmox LXC 16 vCPU ~60GB RAM CU…
-
Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989? According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe…
-
3.6 27B Tool Calling Issues (vLLM) (www.reddit.com)
Has anyone got a reliable vLLM recipe for 3.6 27B that fixes the tool calling issues? I am getting "Not let me..." - then nothing.
-
Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) (www.reddit.com)
Hello, I would like a suggestion from those who are already actively involved in this world. Basically, I own this workstation: Ryzen 9 5900X, 32GB of DDR4 RAM, RTX 5060Ti, PCCOOLER CPS YS1000 1000W. Currently, I can quite easily code with Qwe…
-
Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK) (www.reddit.com)
Hey all, I'm having a crisis that I just can't figure out... I've used Qwen3.6-27B-GGUF:UD-Q8_K_XL ever since it came out (on a DGX Spark) and it worked like magic with decent performance (~50 t/s). I'm updating the Spark and llama.cpp on a daily basis…
-
Qwen3.6-27B created this Open Webui tool (www.reddit.com)
I usually go for Claude for those kinds of Open WebUI tool creations, but rate limits are getting tight so I decided to just let Qwen3.6-27B-Q5 handle it through Open WebUI. It did it in one shot.
-
Gemma4-31B-3bit-mlx · Hugging Face: 3 & 5 mixed quant for RAM poor Mac users. (huggingface.co via reddit)
Just dropped another 3&5 mixed quant for the RAM Poor Base-model-only Mac users that want to try Gemma4 top of the line LLM. 6GB smaller than the other 3bit-mlx out there and 25% faster.
-
Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K.
-
Devstral Small 2 24B vs Qwen 3.6 27b or both? 1x 3090 (www.reddit.com)
Hi, I've got 1x 3090 and I'm thinking about both of these models. I've been using Qwen since Friday and this model is amazing!
-
I'm Not a Dev But I Use Qwen 3.6 35b to Code (www.reddit.com)
Full disclosure: I used to program a bit, but I was garbage at it so I found a new career. This was eons ago so I'm not a dev, obviously.
-
Qwen3.6-27B IQ4_XS FULL VRAM with 110k context (www.reddit.com)
Qwen3.6-27B IQ4_XS Bloat: Reverting llama.cpp commit saves 16GB VRAM (14.7GB vs 15.1GB) + KVCache Tests With the release of Qwen3.6-27B, I noticed that compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 vers…
-
Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com)
Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval: code generation HellaSwag: commonsense reasoning BFCL: function calling Total samples: HumanE…
-
Anyone tried Qwen 3.6 27b on the r9700 yet? (www.reddit.com)
The memory bandwidth on the r9700 looks quite good compared to my Mac or a Strix Halo and I'm wondering how this turns out. Thanks!
-
Qwen 3.6 27b S2 Opus + GLM + Kimi (huggingface.co via reddit)
My first time releasing a fine-tune publicly! If anyone wants to independently eval against base, that’d be awesome.
-
[7900XT] Qwen3.6 27B for OpenCode (www.reddit.com)
I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. The VRAM is a little bit scarce, but I ended up with this so far: llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \ --port 8080 \ --host 127.0.0.1 \ --t…
-
anyone know where to use qwen 3.6 27b via api/coding plan? (www.reddit.com)
I want to test this model out but I don't have a setup that can do it locally. openrouter and all my coding plans don't include it.
-
Power-limit vs TG/s for 2x3090 (www.reddit.com)
Trying to find the sweet-spot to tradeoff between power and tg/s. 250W seems to be a sweet spot for Qwen3.6-27B.
-
I test'ed the number of Ll's in Qwen 3.6 35B.. It required 3 tries (www.reddit.com)
How many ll's are in Stargate's TV Show's leader? The answer depends on which "leader" of the Stargate TV series you're referring to, as command changes throughout the franchise: General George Hammond (Seasons 1-3…
-
Qwen 3.6 27B (IQ3XXS) vs 35B A3B (IQ4XS)? (www.reddit.com)
Just was wondering what people feel is better. I do need 262K context so these are the biggest quants of each I can fit on my 3090 with KVcache at Q8.
-
how fast can qwen3.6 35b get (www.reddit.com)
I wanted to see how fast I could make Qwen3.6 35B run on a single H100, so I put together an SGLang setup for it. It exposes an OpenAI-compatible API and also works with Claude Code through Anthropic-compatible routing from the connect tab.
-
We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. Best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout — the same constraint the public…
-
At least in open-webui. Nothing has changed except for the backend update.
-
Are OSS runnable model good now? (www.reddit.com)
Hi, I currently have access to 2–3 RTX 3090 GPUs (ideally I’d like something that runs well on 2). I can install models up to around 100 GB in size.
-
https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-31B…
-
Qwen 3.6 27B on Strix Halo 128GB: any experiences? (www.reddit.com)
I'd jump on runpod and ssh in to test my workloads, but they don't have it. Would love to know how well this runs, particularly as context approaches a full 256K.
-
I have a 4 x R9700 system on Threadripper pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking any new model I try out with llama-benchy so that I can get a better idea of how models of diff…
-
Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 (www.reddit.com)
Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding.
-
GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B (www.reddit.com)
Hi folks, Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general purpose - it's able to solve puzzles correctly more often too. The initial intent was to optimise the 35B-A3B reasoning traces since it's the most efficien…
-
Question regarding 4 t/s Qwen 3.6 performance (www.reddit.com)
I am getting 4 t/s with Qwen3.6-27B-Q4_K_M which seems much slower than I'd expect. I am running LM Studio on Ubuntu 22.04 with the following specs: Dell Precision 5690 AI-ready workstation NVIDIA RTX 5000 Ada Generation GPU with 16GB VRAM…
-
At this moment, models such as Qwen 3.6 35b/27b crush the competition, yet I can't help but notice this pattern. While the local RP scene is abundant with the Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…
-
Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…
-
(Links to all files, apps, and repos mentioned in this post can be found in the 'full post' link at the bottom) Agents for document redaction and review tasks Document redaction tasks involve text and vision capabilities, and long context…
-
Qwen3.6-27B vLLM Docker Docker-based vLLM serving for Qwen3.6-27B with Lorbus AutoRound INT4 quant and MTP speculative decoding. Model is downloaded at runtime and stored on a host volume so the container can be upgraded without redownload…
-
Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B (www.reddit.com)
Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice speed increase for working on the same cod…
-
I'm a daily llama-cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM.
-
Hey all: I am trying to set up Claude Code to work with llama.cpp, using Qwen3.6-35B-A3B. I usually use Claude Code + a ZLM subscription (I got lucky with $30 yearly) - the setup is very simple with their automated script, but for th…
-
I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool-call-parser qwen3_xml \ --enable-prefi…
-
VSCode and agent integration (www.reddit.com)
I've been using VSCode with GitHub Copilot for a bit (free tier) and am looking to try running locally after running into all of the limits with GHCP. I'd like to have as close of an experience as possible with both code autocomplete and ch…
-
Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model. Best 35B I have found! (huggingface.co via reddit)
Been using this for a few days. It is BY FAR the best uncensored model I have found for Qwen 3.6 35B.
-
Config CtxSize: 131,072 GpuLayers: 99 CpuMoeLayers: 38 Threads: 16 BatchSize/UBatchSize: 4096/4096 CacheType K/V: q8_0 Tool Context: file mode (tools.kilocode.official.md) Metric M Model XL Model Difference Avg Tokens/sec 28.92 29.78 +0.86…
-
As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.
-
qwen3.6 27b poor experience (www.reddit.com)
Seeing how people praise it, I tried giving it implementation plan that Sonnet generated, but qwen keeps breaking files and goes in circles: Thinking… The file got corrupted from multiple overlapping edits. Let me just rewrite the whole fi…
-
Vs code extension (www.reddit.com)
Which coding agent extension are most of you finding best with LM Studio as the local server? 🤔 I'm running Qwen 3.6 27B. I've used Cline and Continue mostly. I haven't checked out all the options, but I'm looking for something that looks and feels…
-
Qwen3.6-27B-FP8 - JS file is too long and causing JSON truncation (www.reddit.com)
Apologies in advance, if this is a newbie question. When running Qwen3.6-27B-FP8 using the below command on an Nvidia RTX PRO 5000, in opencode, I am seeing errors such as: "The issue is that the JS file is too long and causing JSON trunca…
-
Qwen3.6-35B-A3B KLDs - INTs and NVFPs (www.reddit.com)
https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932 KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important.
-
Quant Qwen3.6-27B on 16GB VRAM with 100k context length (www.reddit.com)
https://preview.redd.it/tblmrwxkbexg1.png?width=1193&format=png&auto=webp&s=6dea1e6684e75e22852d57c0c72e9171deb56ae2 I have experimented with running Qwen3.6-27B on my laptop with an A5000 16GB GPU. I have created my own IQ4_XS GGUF "qwen3.6…
-
Hey folks — looking for some advice on improving my local LLM setup (and also exploring agentic coding workflows). Current setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB Using llama.cpp with a Qwen3.6 27B Q6 model (GGUF) Running through OpenCo…
-
Were Qwen3.6 models scrubbed from openrouter? (www.reddit.com)
I made a simple app using openrouter, hoping to use the new small qwen models (the a3b moe and the 27b dense one), but they aren’t listed. Also, I swear some qwen3.6 models that were listed before are missing now.
-
Seriously, Qwen3.6 27b is mopping the floor against models like 5 times its size right now. It doesn’t take a rocket scientist to figure out that maybe the whole a2b and a3b MoE thing isn’t the best solution after all.
-
TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my perso…
-
What would you say is the minimum amount of tokens per second you would tolerate for your local agent workflows? I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context running on an RTX A60…
-
Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? (www.reddit.com)
Right now I'm running Qwen3-27B-Q4_K_M on a 2060 12G + 5060 Ti 16G with tensor split 15/7. Gen speed sits around 16.5 t/s and prompt eval drops from 653 to 356 t/s as context grows.
-
Qwen3.6-27B has been out for a few days and the NVFP4 with MTP was dropped earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP Can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at…
-
I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple.
-
Best local gui setup Mac (www.reddit.com)
Hi all, I have a server (dual 7900xt) running qwen3.6 27b in LMStudio, because I love LMlink for its ease of use and I am okay with the model chugging along at ~25t/s in the background. I then serve the model to my Mac, via LMlink.
-
When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5…
-
How are you running Qwen 3.6 27B on windows? (www.reddit.com)
I've been trying to fix performance with llama-server and seem to be hitting a wall. Using Q4_K_M by unsloth and IQ4_K_M by DavidAU, when asking a question with no context, 39 t/s.
-
Local LLaMA server GPU upgrade advice (www.reddit.com)
TLDR : Should an RTX 3090 + T4 be faster than a P40 + T4 for OpenCode with Qwen3.6 35B A3B ? --- Hi, Nowadays, I have an architecture running : A Tesla P40 w/ 24GB VRAM A Tesla T4 w/ 16GB VRAM I mainly use this setup to run models like GPT…
-
I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it's been released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that K cache compression is not really recommended in most case…
-
So maybe this is a no-brainer to many experienced local LLM users but it was not obvious for me. I am running a 3070 8gb + 64gb DDR4.
-
I'm sure people have asked before for settings for these GPUs, but for me, no matter what I do, it doesn't work as well as 3.6 35B! I've tried vLLM and llama.cpp.
-
Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help (www.reddit.com)
The new dense model is great, but I’m trying to figure out how to increase PP and Token generation speed. I’m running Q8 quants across 3 7900xtx GPUs and I’m consistently only getting 18-20 t/s generation speed and ~650 t/s prompt processi…
-
A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At that time, I had my monitor plugged in my 4070 and didn't even…
-
Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results (localbench.substack.com via reddit)
4 models tested with q8_0 and q4_0 KV cache against a full-precision baseline. What this measures: KV cache quantization stores the key-value cache in lower precision to s…
-
Hey guys, I built a custom vLLM pipeline to run Gemma 4 (31B FP8) and Qwen 3.5 side-by-side locally to see how they actually perform in the wild with preprocessing of audio and images. But of course new model Qwen 3.6 27B came out just whe…
-
Qwen3.6 uncensored AWQ (www.reddit.com)
I have tested Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my 4x3090 system (opencode) and find it really good and fast. However, I can't find any uncensored models for vllm (preferably as AWQ).
-
Hello fellow members of this lovely community, Let me start by saying that I’m about as far from a professional developer as it gets. I’m a hobbyist whose entire coding experience consists of building various Python/VBA tools and simple Ja…
-
Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) (www.reddit.com)
I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…
-
TLDR: Swapped Ollama for MLX on M1 Max (64GB) to run a 12-agent trading stack using Qwen 35B MoE. MLX wins on throughput and fine-grained sampler control, but I lost the "it just works" convenience of Ollama.
-
Get goosebumps (www.reddit.com)
Please comment here if you just cancelled your Claude subscription, so that we can see how much confidence you have in open-source or open-weight models, especially with the Qwen3.6 release.
-
Qwen 3.6 27b IQ4_XS - 22 tp/s on RTX 5060TI 16b, 24k ctx (www.reddit.com)
Maybe this will be helpful for someone: llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000 I can't run this model with higher KV quants at >8192 ctx size. Setting -ub & -b to 256 allowed me for…
-
I've been running Qwen 3.6 locally and I'm shook. But what are we doing about agent memory? It's still a complete mess.
-
IQ2XXS Qwen 3.6 35b is actually very usable on 32 gb macbooks (www.reddit.com)
Just tested the MoE Qwen model with 2-bit precision and it's surprisingly good. I used the 2-bit XXS from unsloth and it seems to maintain intelligence really well, never failed a tool call so far and surprisingly good at 3js, even better than…
-
I tried working on a local LLM project today and honestly ended up pretty frustrated. I tested several approaches, but none of them worked reliably.
-
Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives (www.reddit.com)
MacBook Pro M5 MAX 64GB. Qwen 3.6 35B - 72 TPS.
-
Trying to find out the best local LLM inference engine for Hermes in terms of performance and memory footprint. Tool calling accuracy is already up there, so I focused on pure token crunching.
-
Qwen3.6-27B with the tiny harness of Kon (www.reddit.com)
It's working very well out of the box on the tiny harness of kon; ~270 tokens without the tool schema (~1000 tokens including). https://github.com/0xku/kon Members from LocalLLaMA have already contributed many interesting features recently…
-
coding with Qwen3.6-27B-UD-Q2_K_XL.gguf (www.reddit.com)
pi llama.cpp awesome torus awesome torus Windows, 5070 (12GB)
-
Qwen3.6-27B Uncensored Aggressive is out with K_P quants! (www.reddit.com)
The dense sibling of the 35B-A3B drop is here, Qwen3.6 27B Uncensored Aggressive is out! Aggressive = no refusals; NO personality changes/alterations or any of that, it is the ORIGINAL release of Qwen just completely uncensored https://hug…
-
What am I missing about samplers? (www.reddit.com)
Hi all, With the recent release of models that require temp = 1, top_k = N, and top_p = 0.95, I'm wondering why labs actually prefer those truncation samplers over just min_p? As far as I understand, min_p isn't supported everywhere, and t…
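For reference, here is a minimal sketch of the two truncation rules the question contrasts, written from their common definitions: top_p keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, while min_p keeps every token whose probability is at least min_p times that of the most likely token. The function names and the toy distribution are illustrative assumptions, not taken from the post or from any particular inference library.

```python
def top_p_filter(probs, top_p):
    """Return indices of the smallest high-probability set whose mass reaches top_p."""
    # Visit tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return sorted(kept)


def min_p_filter(probs, min_p):
    """Return indices of tokens with probability >= min_p * (top token probability)."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def renormalize(probs, kept):
    """Rescale the surviving probabilities so they sum to 1 before sampling."""
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Hypothetical next-token distribution over five tokens (made up for illustration).
toy = [0.50, 0.30, 0.12, 0.06, 0.02]

top_p_kept = top_p_filter(toy, top_p=0.95)
min_p_kept = min_p_filter(toy, min_p=0.2)

print("top_p=0.95 keeps:", top_p_kept)
print("min_p=0.2 keeps:", min_p_kept)
print("renormalized min_p distribution:", renormalize(toy, min_p_kept))
```

The difference in behavior is that the min_p cutoff scales with the model's confidence in its top token, whereas top_p always keeps a fixed share of the probability mass regardless of how peaked the distribution is.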
-
Hey everyone, I run llamacpp precompiled with CUDA 12.4 on Windows 11 with a RTX 4090. With small models like gemma-4-E4B everything runs fine, but as soon as I run a bigger model like Qwen3.6-27B (IQ4_NL) or a medium sized model with larg…
-
What speed is everyone getting on Qwen3.6 27b? (www.reddit.com)
I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0. This is on 3x GPUs (1x 2060 Super 8GB, 2x 5060 Ti 16GB), via llama.cpp. Unsure if this is slow or to be expected? */llama-server --port 8080 --model */llama.cpp/Qwen3.6…
-
Are Qwens v3.6 good at vectorizing raster images? (www.reddit.com)
original image Qwen3.6-27B-UD-Q5_K_XL.gguf Qwen3.6-35B-A3B-UD-Q5_K_S.gguf ...you tell me. system prompt: You are Qwen, created by Alibaba Cloud.
-
Got a RTX a5000 24gb, what models could I use? (www.reddit.com)
I just got a used RTX a5000 24gb to use for local models, I mainly use AI to code, but I prefer to spend some money now instead of $200 per month on claude to use 50% of it in a single prompt. My current specs are: Ryzen 7 9800x3d 64Gb DDR…
-
What Agent systems do you use? (www.reddit.com)
Heard of hermes and openclaw; they are great but take a bit of time to set up properly. Now that the Qwen3.6 27B is out I want to have a forever running agent to track news and whatever cool shit there is.
-
Qwen3.6 One Shot Tetris Game (www.reddit.com)
I am blown away by what this model can generate locally. I asked for a flashy Tetris game with particle effect and boy did it deliver!
-
Yesterday's Claude Code Pro removal thread hit 350+ comments in a few hours, and the dominant take was basically "switch to Kimi K2.6, go local, done." I upvoted that thread and tbh I'm mostly there, but I'm building voice agents and RAG pi…
-
Qwen Lens Studio A multimodal AI studio built around a single Qwen vision-language model, exposed through five focused tools plus a batch runner and a persistent session log. Ship a screenshot → get code.
-
Qwen 3.6 35B A3B vs Qwen 3.5 122B A10B (www.reddit.com)
Does anyone else have the same experience comparing these two? For me, 3.5 122B outperforms 3.6 by a large margin. 3.6 gets lost as soon as the task requires a couple more steps.
- Optimizing Qwen 3.6 35B A3B sampling parameters. (www.reddit.com)
- PSA re Qwen 3.6 35B A3B q4 + agents (www.reddit.com)
-
Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried? (www.reddit.com)
Do I understand correctly, based on this comment, that I can potentially fit Qwen 3.6 27B FP8 precision model and have around 256K context available and fit it fully in my RTX 5090 VRAM? Of course with the help of TurboQuant compression, a…
-
is Qwen3.6-27B comparable with Opus 4.5? (www.reddit.com)
https://preview.redd.it/qtzdx5ud0rwg1.jpg?width=1200&format=pjpg&auto=webp&s=aa25d9f0bb8007ee6e4065cfa46a9685454c89cd - Outstanding agentic coding, surpasses Qwen3.5-397B-A17B across all major coding benchmarks - Strong reasoning across te…
-
Kimi 2.6 and qwen3.6 is out but still as slow as ever (www.reddit.com)
Has anyone tried these? I found this on ollama: https://ollama.com/library/kimi-k2.6, https://ollama.com/library/qwen3.6 My issue is that they are extremely slow on my local machine.
-
Qwen/Qwen3.6-27B · Hugging Face (huggingface.co via hn)
This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransforme…
-
Llama.cpp parameters for Qwen 3.6 with RTX 3090 (www.reddit.com)
Hi, I'm trying to run Qwen 3.6-35B on my RTX 3090 (24 GB of VRAM) but I'm not sure about 2 things: - Which variant of the model to use? (Q4_K_S, Q3_K_XL, other?
-
Qwen3.6-27B released! (www.reddit.com)
Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight.
- Qwen3.6 27B really good? (www.reddit.com)
-
A short follow-up to my previous post, where I showed that changing the scaffold around the same 9B Qwen model moved benchmark performance from 19.11% to 45.56%: https://www.reddit.com/r/LocalLLaMA/s/JMHuAGj1LV After feedback from people h…
-
Consider running a bigger quant if possible (www.reddit.com)
Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context and was very much disappointed because it would loop, make formatting errors, implement wrong things etc.
-
I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: - Create IndexNow CLI in Golang (Easy Task) and - Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwe…
-
Hey there, I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I have noticeable stuttering when I run the model as it fills the VRAM completely, and I am sure I…
-
9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.
-
How to best utilize local LLM give my hardware? (www.reddit.com)
Hi all, I’m new to local LLMs but as someone who extensively uses agentic coding I thought I’d try it out. I am running a MacBook Pro with M3 Max 64gb ram.
-
Llama.cpp's auto fit works much better than I expected (www.reddit.com)
-
I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.
-
Gemma 4 is much less popular on Hugging Face than Qwen 3.x. (www.reddit.com)
The difference is quite big:
Model | Likes | Downloads last month | Finetunes
Qwen3.5-27B | 952 | 3,233,034 | 263
Qwen3.5-35B-A3B | 1,397 | 3,977,637 | 87
Qwen3.6-35B-A3B | 1,115 | 458,436 | 60
gemma-4-31B | 323 | 343,895 | 13
gemma-4-26B-A4B | 227 | 118,464 | 13
-
I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…
-
Are commonly recommended sampling parameters often too high? (www.reddit.com)
-
Doing real coding work locally for the first time (www.reddit.com)
-
Opus 4.7 Max subscriber. Switching to Kimi 2.6 (www.reddit.com)
-
Thoughts on MoE Qwen 3.6 35B? (www.reddit.com)
-
SOLVED! Was "Help needed: Ollama > qwen3.6 in OpenCode on 64Gb M4" (www.reddit.com)
-
Qwen 3.6 comaprable with the old Qwen 3 coder 480B? (www.reddit.com)
I specifically remember when Qwen3 Coder came out: it was one of the only models out there that could totally take over a repo and actually do things in VSCode without emptying your bank account. And when the Qwen3 Coder 30B was so fa…
-
Qwen3.6-35B-A3B running on a Mac mini M4 16GB (www.reddit.com)
-
Qwen 3.6 on rtx6000 96gb (www.reddit.com)
-
Help on jiberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S (www.reddit.com)
-
LLM Neuroanatomy III - LLMs seem to think in geometry, not language (www.reddit.com)
-
Qwen3.6 agent + Cisco switch: local NetOps AI actually works! (www.reddit.com)
-
5070 Ti (New) vs 3090 (Used) to pair with 4070 for local LLMs? (www.reddit.com)
-
Dual GPU setup (yes, no)? (www.reddit.com)
-
Running Qwen 3.6 35B-A3B-4B on MacBook Pro M5 64GB with tools (www.youtube.com via hn)
-
Anyone using PI headless? (www.reddit.com)
-
what is the state of using rotoquant at the moment? (www.reddit.com)
-
Full AMD workstation- dual 7900 XTX (www.reddit.com)
-
How is Rotorquant/planarquant/iso qaunt better? (www.reddit.com)
-
Qwen 3.6 35B different quant speeds ? (www.reddit.com)
-
Intel Arc B70 with HP z640 workstation (pcie 3) (www.reddit.com)
-
Qwen 3.6 CoT issue? (www.reddit.com)
-
Ask HN: How do you use Local LLMs? (April 2026) (news.ycombinator.com)
-
Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context (www.reddit.com)
What would you choose if you were in my shoes? How viable is 125k for agentic coding really?
-
5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4 (www.reddit.com)
Hi guys, just want to share a Frankenstein build I put together that is surprisingly decent. I have an i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to a RX9070, then I…
-
Newbie here (www.reddit.com)
Hi guys, I'm on a 9950X with 196GB and a 4090. Are these parameters OK? My main use will be coding: llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --n-cpu-moe 20 -c 250000 --host 0.0.0.0 --port 8082 --reasoning-budget -1 --top-k 20 --top-p 0…
-
Hardware: Intel Core Ultra 7 258V, 32GB Unified Memory. Model: Qwen 3.6 35B A3B (Quant: Q3_K_S) via LM Studio.
-
Imposing my laptop to run Qwen 3.6 (www.reddit.com)
So, I am excited with the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run in my hardware.
-
Generating Logisim Evolution circuits (www.reddit.com)
Short: I want to generate with Qwen 3.6 something like this https://preview.redd.it/bd6rbgnoatvg1.png?width=960&format=png&auto=webp&s=a1c079f37c048fa2c687709465b0c830a0184a4c After many hours, I'm able to generate a working file without w…
-
I guess the model didn't feel it needed to do anything beyond proving. Not entirely sure how I got it to act so..
-
I pray there is a Qwen 3.6 122b version (4x3090 owner) (www.reddit.com)
The 3.5 122b model already is fantastic at 4-bit. Really the best model I ever ran on my 4x3090, but from what I read how 35B 3.6 is doing, the 3.6 122b model would be an absolute value banger.
-
Quick demo of KV cache compression on Qwen 3.6 at 1M context. In this run: KV cache: 10.74 GB → 6.92 GB V cache: 5.37 GB → 1.55 GB (~3.5× reduction) Still seeing near-zero PPL change in early tests (3 seeds), but focusing mainly on memory…
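The figures quoted in this summary can be sanity-checked with a few lines of arithmetic. The sketch below assumes (the post does not say this) that the total KV cache is simply the K cache plus the V cache, so the K portion can be inferred by subtraction. Variable names are illustrative.

```python
# Figures quoted in the post summary, in GB.
kv_before_gb, kv_after_gb = 10.74, 6.92
v_before_gb, v_after_gb = 5.37, 1.55

# Assumption: total KV = K + V, so K is whatever remains after removing V.
k_before_gb = kv_before_gb - v_before_gb
k_after_gb = kv_after_gb - v_after_gb

v_ratio = v_before_gb / v_after_gb        # compression ratio of the V cache alone
total_ratio = kv_before_gb / kv_after_gb  # compression ratio of the whole KV cache
saved_gb = kv_before_gb - kv_after_gb     # absolute memory saved

print(f"K cache: {k_before_gb:.2f} GB -> {k_after_gb:.2f} GB")
print(f"V cache reduction: {v_ratio:.2f}x")
print(f"Total KV reduction: {total_ratio:.2f}x, saving {saved_gb:.2f} GB")
```

Under that assumption the inferred K cache is the same size before and after, which would mean the quoted ~3.5× applies to the V cache only and the overall KV cache shrinks by a smaller factor.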
-
Qwen 3.6 35B crushes Gemma 4 26B on my tests (www.reddit.com)
I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information fr…
-
qwen3.6-35b-a3b tool calling input problem... too bad... (www.reddit.com)
Hey guys. Some people including me are having trouble on qwen3.6-35b tool calling.
-
I'll be testing the setup and try out the Hermes Agent live: https://www.youtube.com/live/q5vqvwZykRI
-
Don't ask Qwen 3.6 35b to give you aski image of Yoshi :) (www.reddit.com)
https://preview.redd.it/dfqed57qgsvg1.png?width=1706&format=png&auto=webp&s=3859209698d2e844e2731326e355d60928658f8a The most fun part was reasoning, here is a gist: https://gist.github.com/anzax/5f06716c66180013cd715f6c2e5848df There is a…
-
Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.
-
Qwen 3.6 q8 at 50t/s or q4 at 112 t/s? (www.reddit.com)
What are some ways that you would go about thinking about choosing between the two for use in a harness like pi? Did a good bit with q4 yesterday and it was so consistent and reliable I had it set to 131k context and it worked through 2 co…
-
best possible GPU setup for using qwen 3.6 ? (www.reddit.com)
Hi, I have recently been thinking of buying my own GPU for hosting open-source models. Can someone give any suggestions? Also, suppose I don't wanna remain restricted to Qwen 3.6 but want to do some math-heavy tasks too, for which I wanna deepseek or…
-
Qwen3.6 GGUF Benchmarks (www.reddit.com)
Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier.
-
Context Compaction / Summarization on Apple Silicon (www.reddit.com)
I've been very impressed with qwen3.6-35B-A3B on Apple Silicon (and actually my AMD iGPU setup with DDR5 and a 760M does well too). It can actually navigate a codebase and write useful code.
-
Qwen3.6 Fails n8n Tool Calling (www.reddit.com)
https://preview.redd.it/na4ub5yzprvg1.png?width=1654&format=png&auto=webp&s=e356e0ab0829bb275352d1035c35c645a381c3c7 I am using Kaggle to serve Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf but tool calling is not always working. I also tested it with R…
-
Qwen3.6-35B-A3B just dropped — quick thoughts after trying it (www.reddit.com)
Just gave the new Qwen3.6-35B-A3B a spin. It’s a MoE model (35B total, ~3B active), but honestly the more interesting part is how much they’re pushing agent-style coding.
-
Qwen3.6 is incredible with OpenCode! (www.reddit.com)
I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know).
-
7900XTX, Qwen 3.6 35B A3B, 150t/s that drops to 50t/s for no reason? (www.reddit.com)
MSI B650 Gaming Plus, 9800X3D, 64GB DDR5 6400 MT/s, Windows 11. When I first boot my PC and run this model, I get 155-160 t/s, and for some reason, after a few minutes, say 10 minutes, of not using AI or anything in particular, GPU temp at 40c…
- Low performance in 7900XTX in Qwen 3.6 35B A3B (www.reddit.com)
-
I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth. I've been using LLMs in my per…
-
Qwen 3.6 No think? (www.reddit.com)
I’ve been seeing a lot of good feedback about the qwen 3.6 model and its reasoning performance but has anyone tested it with reasoning off? I’ve been building a low latency app using Qwen 3 30ba3b 2507 and 3.5 no think was not an improveme…
-
Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task.
-
QWEN 3.6 35B A3B MXFP4 https://preview.redd.it/bclr8ukcoqvg1.png?width=904&format=png&auto=webp&s=853b211505ef6b9184d0571ca8fc46295437322a hey everyone this is my first post, anyways the thing is that there is this program called https://m…
-
Qwen3.6 35B A3B is THE ONE The Local LLM Champ on OpenCode benchmark dashboard [video] (www.youtube.com via hn)
-
Hey all, we’re a mid-sized company (~70 people) and currently planning to bring a lot of our workloads on-prem instead of relying on cloud APIs. The goal for the moment is to run small to mid-sized models in the range of 30B like Qwen3.6 o…
-
Benckmark Qwen 3.6-35b uncensored on Rtx3090 (www.reddit.com)
Hello, I saw the new model is out, but even with 24GB of VRAM I have too many browsers and tasks open to use it, so I downloaded and tested the HauhauCS version: https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressiv…
-
censorship in qwen3.6? (www.reddit.com)
I do not want to spread conspiracies, so please weigh my information carefully; maybe someone can hopefully prove me wrong. I installed the brand-new qwen3.6 yesterday and ran a few of my own traditional tests, not a very deep dive, just…
-
https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Base-GGUF Yes, Qwen 3.6 is out and it's a great model. However, anyone who needs an even more "uncensored but official" model can try out this one.
-
Context checkpoint erasure in llama.cpp ? (www.reddit.com)
Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…
-
Strix Halo concurrency 4 16k context 64 t/s Qwen3.6-35B-A3B-Q8_0 (www.reddit.com)
https://preview.redd.it/4906akj9dovg1.png?width=1527&format=png&auto=webp&s=c49e255ac79a3c5455f44603422f8af7ddc12594 First of all can we make https://www.youtube.com/watch?v=2lUC8Gimxz8 Angine de Poitrine this subs official band? Those guy…
-
Full JANG adaptive mixed-precision quantization sweep of Qwen3.6-35B-A3B: https://huggingface.co/collections/bearzi/qwen36-35b-a3b-jang All 15 profiles, from extreme compression to near-lossless: JANG_1L JANG_2S/2M/2L JANG_3S/3M/3L/3K JANG…
-
Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way. (www.reddit.com)
I was disappointed with Gemma 4 due to various bugs and in the end lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both mode quants and llama.cpp…
-
Is there a way to have qwen-code CLI read images? (www.reddit.com)
Basically I am asking the model to describe an image, but it says it can't process the images. The weird thing is that if I send the image encoded directly on the prompt, it works just fine, I am using llama-server with qwen3.5 (tried all…
-
Qwen3.6-35B is worse at tool use and reasoning loops than 3.5? (www.reddit.com)
Been running the new model the entire evening in different quants and coding tasks with OpenCode. Used oMLX and LM Studio.
-
TheTom's GPU-accelerated turboquant (turbo3) has unlocked high-context gains for the 35B-A3B family. I can now achieve ~40 tg/s via the following GPU-POOR compilation flags and configuration: cmake -B build -DGGML_CUDA=ON -D…
-
PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on. (www.reddit.com)
I had previously posted here about a fix to their 3.5 template to help resolve the KV cache invalidation issue from their template. A lot of you found it useful.
-
How to run MoE models without necessary RAM? (Apple Silicon) (www.reddit.com)
Hey, I have an M1 Pro 16GB machine, and I wanted to run the Qwen3.6/3.5 35A3B model. However, this model cannot fit on my system even at a 4-bit quant.
-
Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7 (simonwillison.net via hn)
16th April 2026. For anyone who has been (inadvisably) taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from…
-
Qwen3.6 local test (live) with llama.cpp. Is it going to be better than Gemma4? (www.youtube.com via reddit)
-
Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamac…
-
Note: First is Qwen3.5 35B MoE (Left) and Second is Qwen3.6 (Right) Hi Guys Just did quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE. with reasoning off using llama.cpp and same quant unsloth 4 K_XL GGUF First is Qwen3.5 outco…
-
This is my first test with this model and Qwen impressed me. I will rate it 98% usable web os compared to my previous best 70% usable result from qwen3 next coder at q2.
-
My Qwen 3.6 fails the car wash vibe check (www.reddit.com)
I configured it to the best of my abilities, even at Q8. It fails to give the correct number of tools it supports on Claude Code and it fails the car wash test.
-
Qwen 3.6: worse adherence? (www.reddit.com)
Just swapped Qwen 3.5 for the 3.6 variant (FP8, RTX 6000 Pro) using the same recommended generation settings. My stack is vLLM (v0.19.0) + Open WebUI (v0.8.12) in a RAG setup where the model has access to several document retrieval tools.
-
I was working on a simple frontend web design task earlier (styling some buttons) with Qwen3.5-35B-A3B. The end results weren't great, but at least it kept trying to change stuff and call tools properly.
-
Alibaba open-sources Qwen3.6-35B-A3B, a 35B MoE model with 3B active parameters (huggingface.co via hn)
This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…
-
Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All (qwen.ai via hn)
Qwen Studio offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
- Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (news.ycombinator.com)
-
Qwen3.6-35B-A3B released! (www.reddit.com)
Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
- What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M (www.reddit.com)
-
GLM 5.1 is dominant in almost every aspect in Design Arena, surpassing Opus 4.6 in many tasks. Although user experiences vary depending on subscription plans for both, one of them is open source.