model roundup

Qwen 3.6

565 items · started 2026-04-16 · closed 2026-05-30

  1. There's been talk of late about using HTML rather than markdown in Claude Code. I was curious how this worked with a local model, so I loaded up Qwen3.6 35B A3B at Q8 with an F16 KV cache.

  2. EDIT - IGNORE. I MADE A MISTAKE.

  3. I'm using llama.cpp, and I've tried Bartowski's and my own quants. When using Qwen3.5-122B or Qwen3.6-27B, I'm seeing really low draft acceptance in chats with interleaved code snippets (chatting with the LLM about programming / a code pro…

  4. I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler).

  5. Context: Krasis is an LLM runtime for running models that don't fit into VRAM. Krasis streams the model through VRAM from system RAM efficiently and handles prefill and decode as separate architectures and optimised use cases.

  6. Model Description: Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality crit…

  7. Used the vLLM version of https://github.com/noonghunna/club-3090 and it worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new patched llama.cpp one for a single 3090?

  8. Hoping the community can guide me on this one. I'm on the fence about the following purchase: Refurbished 16-inch MacBook Pro Apple M4 Max Chip with 16‑Core CPU and 40‑Core GPU, 64gb ram for $3,479.00 vs The new 16-inch MacBook Pro Apple M…

  9. I’m considering building a local machine for AI inference using a Dell Precision T5820 and 2 Intel Arc A770s. From this I could get 32GB DDR4 RAM, 1TB SSD and 32GB VRAM, all for like $1000.

  10. Hi all, I'm somewhat new to the scene (been lurking for maybe 4-5 months now), but I think I have all the basics figured out. My setup: 9800x3d with 64GB of RAM, 6900xt with 16GB VRAM.

  11. So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and deepseek was too cheap.

  12. So, I got interested in local LLMs a few months ago, but, I don't have a background in coding, and I don't know how to code, and I am not good with computers or anything. So far I mainly just was having fun with comparing different local L…

  13. Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantization affects…

  14. I was given the great opportunity to borrow an H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system RAM I will get, but I guess they are a bit flexible on this).

  15. Looks like LM Studio released support for Multi-Token-Prediction (MTP) and the release notes say to use an MTP-compatible model. What model is everyone using with MTP support?

  16. I don't see any threads on this model. Is it because it's dense and/or without-reasoning?

  17. I don't have good experience running q4_k_m; the difference from q6 is "a few errors an hour" versus "a few errors every couple of days". Edit: How does it fail?

  18. Been running Qwen3.6-35B-A3B as a sub agent on a single 4090 for a few weeks. The failure modes are different from solo use and I haven't seen this written up anywhere.

  19. Really been testing qwen 3.6 27b and 35 a3b so far, with 27b at q8 and 35 a3b at q4 (byteshape quant is insane). But I feel I'm not utilizing it the best, especially for long-context, messy coding in large repos.

  20. Just got an RTX 3090 to go with my Intel Core Ultra 9 285K CPU and 32 GB of DDR5-6000 RAM. I want to code locally on my Windows 11 PC.

  21. Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774) My run command: ``` llama-server \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --presence_penalty 0.0 \ --min-p 0.00 \ --gpu-layers all \ -m /home/eleung/huggingface…

  22. I picked up a 7900 XTX earlier which runs qwen3.6-27b fine, but not to my liking. Its compute performance is quite unstable for me.

  23. Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…

  24. I was thinking about upgrading from an MI50 to an AMD AI PRO9700, and I happen to have an RX 9070 XT on my gaming pc, so I tested the performance on it to have an idea of what to expect. So, install rocm, build llama.cpp, download Qwen3.6-…

  25. I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.
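
On the -np / -c question in item 25, a minimal sketch of the arithmetic, assuming the long-standing llama-server behaviour in which the total context given by -c is divided evenly across the -np slots. Newer builds with a unified KV cache may behave differently, so check `llama-server --help` for your version. The function name `per_slot_context` is hypothetical and only illustrates the division.

```python
def per_slot_context(ctx_size: int, n_parallel: int) -> int:
    """Tokens available to each slot, ASSUMING -c is split evenly across -np slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    return ctx_size // n_parallel


# Example: the same total context shared by 1, 2 or 4 slots.
for slots in (1, 2, 4):
    share = per_slot_context(131072, slots)
    print(f"-c 131072 -np {slots}: {share} tokens per slot")
```

Under that assumption, an agent that needs a long context in a single conversation is better served by -np 1 than by raising -np and hoping each slot still sees the full window.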

  26. Based on the work of open-dllm (which achieved a qwen 2.5 autoregressive -> diffusion realignment head, the same exact model under the hood delivering a 4x improvement). TL;DR: I haven't got a trained model yet, just a burnt-out GPU cable an…

  27. link: https://huggingface.co/JC1DA/Qwopus3.6-27B-v2-INT4-W4A16-Autoround Super surprised how good Jackrong's model is... It's taking so much time to evaluate all of the base qwen3.6-27B, Jackrong's version and others' quantized models but…

  28. I've seen this one mentioned, but the source was from about 14 months ago. In the age of Qwen 3.6 and Gemma 4, is there still a use for QwQ 32B?

  29. As per the title. Such as Gemma 4 31B Q4_K_S vs Gemma 4 26B A4B Q8, or Qwen 3.6 27B Q4_K_M vs Qwen 3.6 35B A3B Q6_K, etc. At what point is it worth switching? My use case is mostly creative writing.

  30. I've been testing other models but it seems like nothing even comes close to Qwen3.6 35B A3B for agentic use. The worst I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash…

  31. I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least it gets killed and restarted before it st…

  32. Got a chance to play around with a 2x RTX PRO 6000 setup, so sharing some numbers for Qwen 3.6. All these were run using the latest stable vLLM backend.

  33. I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled!

  34. A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…

  35. Using LM Studio with a 3080 Ti (12GB of VRAM) and 128GB of DDR4. Model version: Qwen 3.6 27B MTP UD q4_k_xl. Is this my hardware limit?

  36. I have this 10-year-old Dell T5810 workstation with 32GB ddr3(?) memory and an E5-2698v3 (16 cores 32 threads), plus a GTX 1060 6GB that was used for mining back in the old days (paid itself back many times over). I managed to get the model ru…

  37. I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a spec…

  38. I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large co…

  39. Just wondering what people's experience is with both these models! I've had some nice results with Qwen but Gemma4 runs so much faster here.

  40. Hi, (TL;DR): Qwen in its MTP version has tool call bugs and outputs everything into tool/thinking blocks, mangling the output and canceling the speed gain with repeated wrong tool calls! DCSS works well with non-MTP Qwen even on smaller quants.

  41. The RAG I've been building is largely a response to wanting an LLM where I feel more confident about where the knowledge base is coming from, especially after the OpenAI deal with the Pentagon. So, when I saw "uncensored" heretic models, I…

  42. Please forgive the mildly clickbait title; it's hard to fit everything in it. I've seen a lot of discussion here about KV-cache quantization, especially with the recent llama.cpp improvements, leading to some debate on the tradeoffs between KV q…

  43. Hi, I’m pretty sure I have seen people typing /model and seeing all available models. I have to type models from memory.

  44. So I've been using qwen 3.6 27b for monthly closes, bank recs, payables and receivables. Built a simple SQLite database that it manages.

  45. Just wanted to share my amusing weekend project. https://www.askjeebus.com 100% vibe coded.

  46. I'm running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it's just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) are just a real pain in the behind to get working…

  47. I tend to use Claude for a lot of research and I also increasingly worry about things like misinformation or things in the model I can't audit. So, I'm building my own all in one RAG with big datasets like all of Wiki, research papers, all…

  48. I removed the mmproj file from models to remove vision and save my VRAM. But just curious, does this really not affect its text ability?

  49. I opened my first contribution to exo: native multi-token prediction support for Qwen3.6-style MLX checkpoints. I hope it is useful.

  50. Sharing this because I didn't believe the first run. Setup: laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s), Linux.

  51. Does the inference speed below seem optimal for the hardware, or could there be further room for improvement? I’ve been trying to use Qwen3.6 27b for agentic harnesses like Pi/Hermes.

  52. What I need it to do: Be able to support an openclaw-type agent which is used by multiple people. What I tried: So I read on the internet about the atlas thing.

  53. Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/…

  54. Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fit into my RTX 5060 Ti 16 GB.

  55. ..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M.

  56. Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), which is a new feature in Nvidia GPUs (CC >= 90, not including ADA) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels…

  57. I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me spl…
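
The loop described in item 57 (think, pick a tool, observe, repeat) can be sketched as below. This is only an illustration under stated assumptions: `scripted_model` is a hypothetical stand-in for the real LLM call on the llama.cpp backend, and the `TOOLS` registry with its single `add` tool is invented for the example, not taken from the post.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry: tool name -> function taking and returning a string.
TOOLS: Dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split("+"))),
}


def scripted_model(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM call; returns (thought, action, argument)."""
    observations = [line for line in history if line.startswith("Observation:")]
    if not observations:
        return ("I need to compute the sum.", "add", "2+3")
    return ("I have the result.", "finish", observations[-1].split(": ", 1)[1])


def react_loop(task: str,
               model: Callable[[List[str]], Tuple[str, str, str]],
               max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = model(history)           # think and pick a tool
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg                                  # final answer
        observation = TOOLS[action](arg)                # act
        history.append(f"Observation: {observation}")   # observe, then repeat
    raise RuntimeError("step budget exhausted without a final answer")


answer = react_loop("What is 2 + 3?", scripted_model)
print(answer)
```

The explicit `max_steps` budget is a design choice worth keeping in any real version, since a local model that loops on tool calls would otherwise never terminate.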

  58. BeeLlama v0.2.0 is here! Not quite a pegasus, but close enough.

  59. This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers.

  60. Hi everyone, I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp.

  61. I will try to keep this short ;) I used GLM 5.1 to vibecode a vague prompt on my vibecoded react web app and have GLM 5.1 rank the plans made with each other and the one it made itself. Test strategy: - use starter prompt as always - add v…

  62. Hi everyone, I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.

  63. I tried Qwen3.6 35B A3B MoE, Qwen3.6 27B Dense, Gemma4 26B A4B MoE, Gemma4 31B Dense. In all cases I was using Q4_K_M and thinking mode enabled.

  64. My workflow has changed basically to ask Codex to do certain tasks and then document how to do them (including errors it found on its way) into a skill. I feed that skill to pi, and suddenly my qwen3.6 gets that hard stuff done: - devops o…

  65. Edit: does this happen every time a newbie tries to post here? Getting roasted despite having valid results?

  66. pge-jax: JAX implementation of the Prioritized Grammar Enumeration (PGE) algorithm for symbolic regression. Overview: pge-jax is a complete symbolic regression system that automatically discovers mathematical formulas from data.

  67. There's extreme price escalation on the part of Anthropic, with token spend now approaching levels that have made many an enterprise scratch their heads. At the same time, judging by open-source advances (e.g.

  68. Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.

  69. Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP.

  70. If anyone is using the Continue.dev extension in VSCode, what config settings are you using for Continue and the llama-server? Mine keeps hanging after bad tool calls.

  71. The other day I posted about playing one night werewolf on my custom made UI via tool calls. Since then I’ve played a few games and improved the prompts.

  72. To preface, here's my config: llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinj…

  73. I wanted to know how much of a coding agent's performance came from the model and how much came from the harness, so I vibed a setup to allow me to test multiple agentic harnesses/model combinations on the same task. All the images above a…

  74. So I thought this was a small-model issue, but now that I've added a new GPU and can run low-to-mid models like Qwen 3.6 35b q4 or q5, the issue still exists. It's not as bad as with small models, but it does break when linking the model to copilo…

  75. I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g…

  76. After every machine restart I get a different prefill speed, it can be only 300t/s or 1500t/s. It's like a lottery at each restart.

  77. I'll address all the questions here rather than spam the sub. What would be a better setup: 1 PC with 2 3090s and a 5080, where the 3090s will have to run in x4 PCIe slots, OR 1 PC with the 5080 and another PC with the 2 3090s on an x16 split into 2x8 mai…

  78. Hey r/LocalLLaMA, We’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (Next Token Prediction or non-MTP) and MTP. Blog / Download NTP Models / Download MTP Models TL;DR For NTP, “pick the largest quan…

  79. Hi all. As some may have been aware, Hugging Face's model search had issues recently.

  80. Hello people I think the question is clear but I wanted to add some context: I work on internal tools in my job and some of the tools are for us developers (most tools are for marketing and factory production). I am currently working on a…

  81. MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round.

  82. Hi, I run llama.cpp inside LXC on a Proxmox server. The hardware is a recent AMD Epyc with two 6000 Blackwell MaxQ.

  83. Man Loving MTP. And Unsloth.

  84. I've tried Gemma4 and a few other variations of Qwen, but they're either not as robust with their output, or they take too long or too much VRAM and force the context limit down from 131K to 20K or even 4K, or they're slow AND low-context…

  85. Hi! I read a while ago that some AIs tend to agree on their own language to talk to one another over time.

  86. Greetings from former TurboQuant's biggest defender, now middle-sized niche-aware TurboQuant defender. Today I'm presenting to you the results of me thoroughly exploring the world of PPL and KLD benchmarks with my single RTX 3090 using Bee…

  87. One way I like to test new models is by one-shotting (with a good prompt) a single-webpage clone of the classic arcade game Pac-Man. I usually do 3 attempts and keep the best one.

  88. Background: For quite some time I had been submitting tasks to LLMs via llama-cli (natively) or llama-server (API), both from the excellent llama.cpp project. On CPU-only llama-cli starts fast and can restart from a checkpoint which has alr…

  89. I wanted to switch from Qwen3-Coder-Next-UD-Q4_K_XL to Qwen3.6-27B-MTP-UD-Q4_K_XL for local agentic coding. The Qwen3.6-27B is perceived to be "smarter" than Qwen3-Coder-Next, and I wanted to "upgrade" my local AI coders.

  90. Tried Qwen3.6 35B Q5_K_M MTP, HW: 9700x, 64GB 5600 RAM, 5060 TI 16GB. --n-cpu-moe 30 ^ -ngl 99 ^ -c 131072 ^ --no-mmap ^ --flash-attn on ^ --cache-type-v q8_0 ^ --cache-type-k q8_0 ^ --threads 8 ^ --parallel 1 ^ -rea off ^ --reasoning-budg…

  91. I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute.

  92. Hi, I'm using llama.cpp with qwen3.6 35B A3B on two different machines. I noticed that on both machines tokens per second is better while using Q4_K_S and Q4_K_M quants than lower Q3_K_M quants.

  93. What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only. A custom binary that bypasses some assert statement causing a crash in today’s stock releases.

  94. The Qwen3.6-27B MTP benchmarks that have been circulating put factual tasks at 62-70% acceptance vs code at 79-89%. Tool calls probably sit in that factual range or lower, structured output, constrained format, less predictable than pure c…
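
To put the acceptance figures in item 94 in perspective, here is a rough sketch of the standard speculative-decoding estimate: if each drafted token is accepted independently with probability a and the draft length is k, the expected number of tokens produced per verification pass is (1 - a^(k+1)) / (1 - a). Treating the quoted percentages as independent per-token acceptance probabilities is an assumption, and the draft length of 3 is a placeholder rather than a setting from the post.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification pass, ASSUMING independent per-token acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)  # every draft accepted, plus the bonus token
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Acceptance rates quoted in the post; draft_len=3 is a hypothetical setting.
for rate in (0.62, 0.70, 0.79, 0.89):
    tokens = expected_tokens_per_step(rate, draft_len=3)
    print(f"acceptance {rate:.2f}: about {tokens:.2f} tokens per pass")
```

This ignores the cost of drafting and verifying, so it bounds the possible speedup rather than predicting measured tokens per second.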

  95. At work I get unfettered access to gpt 5.4 and sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks. At home I am VRAM poor and like to run the models locally for my own enjoyment.

  96. I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes.

  97. Have been using Qwen 3.6 Claude distilled version, 27b at Q4 for openclaw, Hermes and other local harnesses. But recently noticed that the Claude distilled version that I use lost its vision abilities.

  98. https://preview.redd.it/8gpkg8zxmy1h1.png?width=1672&format=png&auto=webp&s=a95db16a39cdc49c0ff155117b734d413a49c2d3 https://youtu.be/MI0Pm1d6YF4 MTP can accelerate LLM inference 2x, especially for coding agents. This video covers what MTP…

  99. Update to Lemonade v10.5.1, then: ``` Get the model lemonade pull Qwen3.6-27B-MTP-GGUF Get ROCm 7.13 lemonade backends install llamacpp:rocm Load the model (MTP args auto-applied) lemonade load Qwen3.6-27B-MTP-GGUF --llamacpp rocm --ctx-si…

  100. PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.

  101. Just stumbled on this blog. A very interesting read if you are picking an inference engine.

  102. Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40.

  103. I'm laughing here. I'm messing about with Qwen3.6-27B in order to gauge just how capable it is with local vibe-coding.

  104. Using latest llama.cpp with mtp and these settings, I only get 10 tps, should I be getting more? [unsloth/Qwen3.6-27B-MTP-Q4_K_M] jinja = true model = /Users/[username]/llms/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf cache-type-k…

  105. Someone suggested I give Continue (Vscode extension) a try. I've been using Roo / Zoo now and liking it but it is pretty tough on context and I was told continue has more control over it.

  106. With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized: -cache-type-k-draft q8_0 -cache…
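
As a rough illustration of why quantizing the draft layer's KV cache (item 106) saves memory, the sketch below estimates cache size for a standard attention layer. The byte costs follow ggml's block layouts (f16 is 2 bytes per value, q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The layer count, head count, head size and context length are placeholder values, not Qwen3.6's real configuration, and the function name is hypothetical.

```python
# Bytes per cached value for a few llama.cpp cache types.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int,
                   type_k: str = "f16", type_v: str = "f16") -> float:
    """Approximate KV cache size in bytes for standard attention layers."""
    values_per_token = n_kv_heads * head_dim            # per layer, for K or for V
    k_bytes = values_per_token * BYTES_PER_VALUE[type_k]
    v_bytes = values_per_token * BYTES_PER_VALUE[type_v]
    return n_layers * n_ctx * (k_bytes + v_bytes)


# Placeholder shape for a single draft layer (NOT the real model config).
shape = dict(n_layers=1, n_kv_heads=4, head_dim=256, n_ctx=131072)
f16_size = kv_cache_bytes(**shape)
q8_size = kv_cache_bytes(**shape, type_k="q8_0", type_v="q8_0")
print(f"f16:  {f16_size / 2**20:.0f} MiB")
print(f"q8_0: {q8_size / 2**20:.0f} MiB")
```

Whatever the real dimensions are, moving both K and V from f16 to q8_0 scales the cache by 34/64, a little over half.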

  107. TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp…

  108. Hi, I built this agentic AI: a closed-loop system that ships standalone Python agents. What's different: - Interviews you until it understands the request before building anything - Two testing stages: prompt validation via LLM invoke, then…

  109. Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with llama.cpp and MTP enabled. My setup is heterogeneous: I originally acquired my server (Lenovo ThinkStation P3 Tower Gen 2) to run OpenShift/K8s clusters (because I work on that), and…

  110. I was trying Qwen 3.6-27b and Gemma4 in a simple web chat. Asked them both a question like 'recommend the best llm for a 5060ti' and was surprised when they both replied 'user is asking about a card that doesn't exist'.

  111. I'll be UPDATING this as it seems I was benchmarking and testing Just before the UPDATE LOL TL;DR If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and co…

  112. EDIT 2: Trick-Assignment-828 pointed me at the actual rule update from the mods - Rule 3 Low Effort was expanded to cover LLM-assisted posts without disclosure. Disclosing now: Disclosure: I'm a non-native English speaker (German).

  113. Hi all, Not sure many people are aware so wanted to give a word about Fara-1.5 release. => this release will likely be the big sister of Fara-7B and built on top of Qwen3.5 Actual Fara-7B performs not bad at all but actually requires a pro…

  114. Hi All! Saw some folks waiting for the Docker images with llama.cpp and MTP when it released.

  115. Hi! I have been using Qwen3.6 35B A3B happily the past few weeks, and I wanted to try out Qwen3.6 27B with the new fancy MTP speculative draft!

  116. Thank you to the people at ik_llama and llama.cpp. It's amazing how far you've all pushed mtp and other tech so that I can run 27B and 35B Qwen3.6 models on an old gaming laptop with a RTX2060 mobile at 6GB VRAM and 32GB RAM.

  117. The Latest qwen3.6 models. Is this odd?

  118. In my real-world usage (opencode) and in my synthetic benchmarks, Coder-Next (Q5) demolishes the whole Qwen3.6 family including the 27B Dense model (All Q8). Everybody else is hailing that 27B is superior and is an amazing model, but I hav…

  119. Not deeply technically fluent but have run a few models locally before, around the time before gemma 4 dropped. I tried some low quant of qwen 2.5 coder and after some tinkering I got it to run but it was just so slow, obviously.

  120. I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…

  121. I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model.

  122. I've been building Abliterlitics, an open-source abliteration forensics toolkit. The idea is straightforward: take the same base model, compare the different abliteration techniques others have applied, then measure what actually changed u…

  123. Llama.cpp has had a long-standing issue with "--split-mode tensor": you'll get great results, but it only supports non-quantized KV caches. For this very reason a lot of people decide to go with a healthy sized KV cache and ignore tensor pa…

  124. Anyone have both of these and tested them? How much faster is the M3 Ultra in PP and TG speed compared to the M1?

  125. Hi All, I'm trying to understand the process of creating GGUF with MTP support. Does the original Qwen/Qwen3.6-27B support MTP?

  126. Hey there. I am an undergrad who has been doing mostly SWE, but will be doing ML research under my professor over the summer.

  127. Here are some results (llama.cpp)! Task 1: write a short poem. 27B Dense: 12.5 tokens/s; 27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s; 27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s. Task 2: edit a hello world HTML artifact. 27B Dense:…

  128. I can't decide whether Qwen 3.6 35b q4 (130k context) or Gemma 4 26b q4 (95k context) is better for C# coding with 24GB VRAM. Please share your experiences!

  129. Setup: - RTX 5090, 32 GB, Linux - Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120) - Unsloth's…

  130. So I bought a second graphics card the other week to get in on the local AI craze and I've been having the hardest time using it to build my website. It's been unreliable, the context gets eaten up, kind of hallucinates sometimes.

  131. We've got great outputs for 27B via club 3090, but what about those of us who love the blazing speed of 35B on dual 3090s? I was getting 1500 p/p and 120 t/g with split layers, but MTP slowed it down to 80 t/g when I tested last week.

  132. I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm. I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup.

  133. Saw this post comparing Qwen 3.6 variants on coding primitives, so I wanted to see how local quants stack up against frontier models on a similar dense, single-file coding task. I ran the exact same prompt across local and web-based models…

  134. TL;DR All models were Qwen3.6 27B-MTP vs Base 27B (15k single-turn): Faster overall Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%) Generation: 7.63 → 16.15 t/s (+111.77% speedup) Prompt Processing: 279.75 → 244.90 t/s (-12.46…
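
The percentages in item 134 are relative changes between the base and MTP runs. A small sketch for recomputing them from the figures quoted above follows; the helper name `pct_change` is hypothetical. Because the inputs here are the rounded values shown in the post, the last digits may differ slightly from the percentages the author reported.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (base 27B, 27B-MTP) pairs as quoted in the post.
figures = {
    "total time (s)": (87.44, 77.39),
    "generation (t/s)": (7.63, 16.15),
    "prompt processing (t/s)": (279.75, 244.90),
}

for name, (base, mtp) in figures.items():
    print(f"{name}: {base} -> {mtp} ({pct_change(base, mtp):+.2f}%)")
```

Note the asymmetry the post points at: generation speed roughly doubles while total wall time drops far less, which suggests most of that 15k single-turn run is spent in prompt processing, where MTP was slightly slower.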

  135. Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for allowing users to add custom sampling logic to llama-server without having to maintain their own entire fork and without having to make a wrapper that reimpleme…

  136. LLM Inference Server A single-container, idle-aware, OpenAI-compatible inference router for a Tesla P40. Routes between Qwen 3.6 27B (MTP self-speculative decoding, TurboQuant turbo4 KV cache), Qwen 3.5 0.8B (multimodal transcription), Whi…

  137. I am running a dual-GPU rig with a 5090 and a 5060, running qwen 3.6 27b at an 8-bit quant with a tensor split setting of 4,1, with 80% on the 5090: build\bin\llama-server.exe ^ -m "!MODEL_FILE!" ^ --mmproj "!MMPROJ_FILE!" ^ -ngl 99 ^ --ctx-size !MO…

  138. So, background: these people, Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong, worked on converting AR -> diffusion (it's already working from older models).

  139. https://preview.redd.it/8o43bjhe9d1h1.png?width=5346&format=png&auto=webp&s=1c87c2ee8b8ffff43495f543266056b0e26d3947 In another post I had someone ask me about the power draw of the 4x 3090 setup so I'm sharing a full test I conducted to…

  140. My GPU power consumption is 250w (undervolted rtx3090) when I added Qwen3.5-27B-GGUF to Ollama using a template (Modelfile made by gpt). I gave it 3 tasks to test it: build a snake game, build a flappy bird game, and make an interactive gri…

  141. hey yall. So I have a 24GB gpu.

  142. Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to…

  143. I don't think Alibaba officially stated anything about "no qwen3.6 smaller models", and according to the patterns, they should have been released in the first week of May, but I think they delayed a little bit to catch the spotlight from Google I/…

  144. RLM models and Qwen3.6: Does anyone here have an RLM setup, and how could I set it up? I want to make my Hermes agent even more powerful and I don't like that I need to open a new context window every time after just a few prompts.

  145. PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise) ------- Just wanted to share my current setup, that might help some users out there... services: llama-server: image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b912…

  146. In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests.

  147. I managed to get 4 Qwen3.6 35b sub-agents working on dual RTX 3090s; I am using deepseek as orchestrator. https://preview.redd.it/biksbgq0n81h1.png?width=783&format=png&auto=webp&s=cf8a4481c1ac439c3283925001c12841b8e6c2e7 They're all working l…

  148. I made a text based craft/trade/cooperate game for my agents to play on intervals when I don't have anything else for them, and it's been so fun watching them plan things out and form little factions with each other to cooperate on trades…

  149. I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to…

  150. Ok, hear me out. This all started when I was trying to understand why this Qwen3.6 27B INT8 Autoround (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B q…

  151. I’ve got Qwen3.6 27b and Qwen3.6 35b running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. is this a thing?

  152. I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8. For reference my llama.cpp build is compiled with CUDA 12.9.

  153. I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high.

  154. I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".

  155. Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure: https://github.com/huggingface/ml-int…

  156. TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.

  157. Finally feel like it’s possible. I have a custom build (vibe coded) UI on llama.cpp, allows model switching in the same chat.

  158. Got Qwen3.6 27B running on my newly assembled 4x 3090 rig (s/o 3090-club) and I'm trying to get the people in my house to adopt the local workflow. Open WebUI has improved a lot in the recent updates, but I still found it pretty rough for…

  159. I ran Qwen 3.6 35b A3B on two 5060 Ti 16GB cards (32 GB VRAM; I also have 32 GB DRAM, but I don't like offloading). I used Q4 in LM Studio to get full context and I get 90 t/s. Any tricks to optimize this further so I can upgrade to Q6 or Q8? Thanks!

  160. I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…

  161. TL;DR Results from the title are for single inference with 2 prompts of 1k and 15k tokens. So no MTP (as it’s slower for big prompts), no DFlash (working too but slower for big prompts), no quant used (full precision wanted) and the results a…

  162. I'm considering biting the bullet and getting a PC with the following specs: 5090, AMD 9950X3D, X870 motherboard, 32 GB RAM (16x2) CL32. EDIT2: Price for this is falling in the arena of 5500-6000 USD where I live. Obviously costs a bomb.

  163. Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens…

  164. Hardware OS: Windows 11 Pro 10.0.26200, Build 26200 CPU: Intel Core Ultra 7 270K Plus, 24 cores / 24 threads, max clock 3.7 GHz RAM: 32 GB DDR5 @ 5600 MHz, 2x16 GB Crucial CP16G56C46U5.C8D GPU: 2x NVIDIA GeForce RTX 3090, 24 GB VRAM each,…

  165. For a system with 4x RTX 3090: what's the best model you could use for reverse engineering C# code? Qwen3.5-122b-A10B?

  166. I got a bit further with my harness for running Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool where every harness finding, modi…

  167. Z-Lab did some good work with speeding up output, while Luce managed to use smaller models of the same family to accelerate prefill... Since Heretic and other "smart ablation" tools can decensor a model, would they work with these multi-mo…

  168. We'll be getting those features (check bottom link) on mainline sooner or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs (and also big, large GPUs).

  169. Warning: long post ahead. On the bright side, it's 100 percent human-written, typos and all.

  170. Watched All About AI's 100% local Fireship-style video automation experiment over the weekend (link in comments). A few things worth flagging if you're trying the same stack.

  171. And I'm here to share my experience. The answer is resoundingly 'yes'.

  172. I did a quick Google search, but found nothing on this and I am scratching my head. Trying to do a llama-bench run with the kv cache set to f32 under Vulkan with a Strix Halo.

  173. I’ve been using Qwen 3.6 27B and it’s amazing. Not exactly your Opus replacement, but great for small tasks and checking work.

  174. I've been working with Qwen 3.6 27B and 35B-A3B models and am pretty happy with them. The point I've reached now is how to split my use cases.

  175. I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.

  176. Hey fellow Llamas, keeping it short. We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory).

  177. I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm.

  178. Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable. Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL Why these models: Qwen2.…

  179. I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment.

  180. I was wondering what will be the difference in results with flag: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 vs MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Results are quite interesting 49tok/sec without MTP vs 64 tok/sec with MTP. PC: RTX5090+128GB DDR5…

  181. Hi, I'm trying to use MTP with llama.cpp. I built the mtp-pr from source and downloaded an MTP model from Hugging Face https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf But when I run the model I have an erro…

  182. Straight to it: llama-server-mtp \ -m ~/models/Qwen3.6-27B-Q5_K_M-mtp.gguf \ --spec-type mtp \ --spec-draft-n-max 3 \ --ca…

  183. Run on my rtx4090 llama.cpp params: llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144 Power limit was set using sudo nvidia-smi -pl N On my observation, GPU const…

  184. According to this, I ran several more tests to cover more models and quants.

  185. "Based on currently available information, estimate the prefill/decode speed of Qwen3.6-35B-A3B Q8 with 262K context on a Mac M5 Ultra 128GB." I'm surprised that almost every LLM fails at this task (ChatGPT/Gemini/Grok/Claude/DeepSeek/Kimi…

  186. https://huggingface.co/0G-AI/0GM-1.0-35B-A3B-0427 So far it behaves better than for example Qwopus in terms of consistent answers, iv been testing Q6K from https://huggingface.co/mradermacher/0GM-1.0-35B-A3B-0427-i1-GGUF Also i checked the…

  187. Question in title. Would be awesome to have this on macs, especially q8 or whatever the minimal-loss quant is, since macs can have lots of ram.

  188. I run the 4 bit quant of Qwen-3.6-27B in the codex harness with unsloth recommended llama-server settings, thinking enabled. I have tried the default chat template and the updated ones and have updated both my GGUFs and llama-cpp to the mo…

  189. I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.

  190. I am currently running 2x RTX 5060 ti and happened across some good sales for additional ones coinciding with a really good sale of a highend Z890 motherboard (replacing my B860 board) that could support quad GPUs (with 2 M.2 adapters, end…

  191. I'm running qwen3.6-35b with llama.cpp connected to openwebui. And I noticed the model fails the number guessing game test on openwebui while it works perfectly with the llama.cpp web ui.

  192. I have this issue in all the Windows installations I have done on my system, which of course does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5. The issue is that for some reason, after some time, llama.cpp performance cuts in half, even restar…

  193. Hey folks, just a heads-up for anyone running Qwen3.6 through llama-server. I ran into an issue where the preserve_thinking parameter wasn't working as expected, even though I had it explicitly enabled in my models.ini config.

  194. I'm running opencode and llama-server locally. I have 32gb ram and 780m igpu.

  195. My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research. My research is on some pretty niche topics and I doubt that anything like it is…

  196. --model "/mnt/e/my-path-change-to-yours/qwen3.6-35b/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf" \ --ctx-size 262144 \ --parallel 1 \ --n-cpu-moe 29 \ --no-mmap \ --mlock \ --cache-type-k q4_0 \ --cache-type-v q4_0 10.8/16 d…

  197. Hey everyone, I’m trying to debug a weird prompt cache issue with OpenClaw + oMLX, and I’d appreciate help from anyone running local agents on MLX/oMLX. The short version is this: I’m running oMLX v0.3.8 on my Mac, serving: Qwen3.6-35B-A3B…

  198. Hey everyone, I've been playing around with Gemma4 and Qwen3.6 on my 32Gb Macbook Pro M2 Max since their release but I'm struggling at finding: The best software to run it (oMLX, llama.cpp, ...) The best model + quant to pick The best sett…

  199. I recently published MTP quants of Qwen 3.6 27B and I was surprised by the reports here on Reddit, and on HF, of users who were experiencing worse speed with speculative inference than without. This did not match what I was seeing, but when…

  200. I love following all your adventures with local LLM setups. Quality and size of the models are important, but so is performance.

  201. I'm using OpenWebUI and making tools/skills to improve my model's functionality. I am currently using Qwen 3.6 35B A3B Q8 (F16) 256k. I grabbed `parallel tools` to be able to run multiple tool calls at once.

  202. TL;DR: Qwen 3.6 35B-A3B (Q4_K_M) is running slow at around 9 t/s with 72% filled context (36147 tokens window) and a total response time of 77s including prefill and token generation. Ran this using LM Studio on Windows with the attached i…

  203. Typing this on a cramped flight, but I was having issues connecting to the plane's wifi on my ubuntu laptop, when it was effortless on my phone. The issue I was having was the Laptop WiFi connected to the plane wifi network, but captive po…

  204. I think it would suit my needs perfectly, but I'm scared of getting scammed on Alibaba so looking for some sellers who have delivered. Follow-up question for those who have the card, how well does it run Qwen 3.6 27B?

  205. basically what I'm doing here is trying to validate whether or not it's a reasonable idea to get a couple of V100s, either SXMs with PCIe adapters or straight-up PCIe cards in the first place, for the sake of running this model or models l…

  206. Due to an old GPU I run on CPU and have come to appreciate the value of MoE. I know of MoE models for Qwen 3.6 and Gemma-4, which are <40B.

  207. I'm keeping a close eye on the development of local llms.

  208. Probe-Detected Grokking in Multi-Probe DPO Orthogonal Learning Beyond Task-Specific Detectors in Qwen3.6-27B Probe-Detected Grokking in Multi-Probe DPO: Orthogonal Learning Beyond Task-Specific Detectors Abstract We report a phase-transiti…

  209. I have Qwen3.6-27B as my main model, I use it for coding with opencode and chatting with open-webui, yet to try out hermes or openclaw. I found out about their existence basically by searching or through reddit - but maybe there’s more tha…

  210. In llamacpp I'm getting 12tok/s, does this number look right to you and what can I do to increase this number (if possible)? cd ~/llama.cpp && ./build/bin/llama-server -m models/qwen-3.6-27b-abliterated-q3.gguf -ngl 999 -c 65536 (i need th…

  211. TL;DR New llama.cpp fork! I wanted a Windows-friendly inference to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled.

  212. I have been trying various NVFP4 based variations of Qwen 3.6 27B, and I am seeing this for the ones that look most interesting to run on my 2x 16GB VRAM with KV cache fp8. vllm | (Worker_TP0 pid=136) WARNING 05-09 13:49:27 [kv_cache.py:10…

  213. I usually use Qwen3.6 27B (slow as heck on my RX 6800 but it works) for plan and Qwen3.6 35B A3B for the coding. But I was thinking the other day if I should remove the thinking from the code model.

  214. TLDR: The hype is real! 1.5x speedup.

  215. Just thought I'd share this use case. I was setting up a miniPC as a home theatre with Archlinux (It's the OS I'm most familiar with).

  216. Interactive reference for transformer models, presented via dataflow graphs, drillable down to elementary mathematical operations. Covers models from GPT-2 to Qwen 3.6, with MLA, MoE, RoPE, MTP, hybrid attention, and other variants togglea…

  217. this is almost certainly a skill issue, however: ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 1 --flash-attn 1 --device CUDA0,CUDA1 -p 2048 -d 4096,16384,65536 rather than splitting across those two cards, it firs…

  218. Looking for suggestions. Current setup llama.cpp and ran qwen 3.5 397b 256k context.

  219. I have a server running a 4500 blackwell on cuda 13.1 and nvidia/595.58.03 with 48GB mem assigned to it. I have build: dcad77cc3 (8933) with Qwen3.6-27B UD-Q5_K_XL loaded and connected it to Roo code.

  220. I have been using local LLM for coding quite a lot as well as some other tasks (like data extraction from images) and I had quite a good success with Qwen3.6 models. It's obviously not Sonnet/Opus, but I am able to get quite a lot of work…

  221. Hardware: RTX 3060 12GB 32GB DDR4-3200 Windows CUDA 13.x Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.

  222. So I've been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use. So after a day of vibecoding I think I may have gotten something viable.

  223. Just before RAMpocalypse, I built a 2x3090 using dirt-cheap humid-env-run GPUs and 128 GB RAM on a Ryzen 5 9600X. I've been pretty happy with the local models it lets me run, spilling into RAM occasionally as needed.

  224. Another hobby is working on electronics projects ranging from low-voltage control and signal processing to HV tube amp circuits. I design and simulate in LTspice before prototyping.

  225. For the past few days, it's all been about MTP. Somehow people missed the fact that Z lab released DFlash for Gemma4 26B a couple of days ago.

  226. Hey there everyone, I've been struggling to find an actually good guide that's not some fluffy video or AI slop on renting hardware from a service to run a local LLM with high token output. Before I invest in some serious hardware, I thought I…

  227. Ran a full eval against four local models last weekend and the spread between them is wider than I expected. All running through Ollama on CPU, no cloud, same prompts, same hardware.

  228. Anyone familiar with getting both to work? I've got a few work systems and I want to make a case for inhouse data generation for the team, and I've got a very very crusty implementation going by putting a bifrost service on one of them, an…

  229. For weeks I've been working on creating the fastest local AI engine for Apple Silicon... And I finally did!

  230. TL;DR On 4× RTX 3090 with NVLink bonded between GPU pairs (0↔2 and 1↔3), pinning TP=2 to a NVLinked pair gave +25% throughput at concurrency 1 and +53% at concurrency 4 vs running TP=2 over PCIe. Adding the other two GPUs to make it TP=4 m…

  231. I am a freelance developer. Qwen 3.6 27B is great on the 5060s but a bit slow.

  232. For some reason, my qwen started looping a lot recently, ever since I introduced MCP tool calls. I don't know why as I didn't really change anything other than that.

  233. Ok so, I will try to explain myself as much as possible because online I really cannot find much about this. Let's start with my settings for running Qwen 3.6 35B: Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinkin…

  234. Is Qwen 3.6 35b now considerably stupider in the latest llama-server releases? I had this model doing cartwheels two upgrades ago.

  235. Hi guys 👋 When I started my adventure with Qwen 3.6 27B I felt wow.... Now when I connect it with Gemma 4 I'm feeling more wow...

  236. I know that coming from Codex I should adjust my expectations, but still. I'm working on a midsize project.

  237. I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday.

  238. I tried running AesSedai/MiMo-2.5-GGUF:Q4-K-M under llama.cpp (main tree, compiled 36hours ago) Hardware: nvidia A6000 with 48GB RAM + 300GB CPU RAM I had no success: error loading model: missing tensor blk.0.attn_q.weight ... Is Mimo alre…

  239. dicking around with the new mtp speculative decode with qwen3.6 27b, and it’s great. but for agentic coding i’ve seen significant improvements from ngram, because a decent fraction of the time (e.g.

  240. Been running llama.cpp MTP with Qwen3.6-27B Q4_K_M as my daily coding assistant and got curious what was actually happening under the hood. Pulled the metrics from llama-server and charted a full session.

  241. I’m looking for a tool or calculator that can estimate the minimum hardware needed to run a specific model locally. For example, I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds.

  242. I have a 4060 (8GB VRAM) and 16GB of RAM, and I'm wondering which models could fit in my setup for coding. The new Qwen 3.6 and Gemma 4 MoE models look good but might not fit. Wondering about your experiences.

  243. I fine-tuned Qwen3.6-35B-A3B on its own outputs for $7 on Apple Silicon + Modal. DeltaNet LoRA targeting was the hard part.

  244. So I had this idea for a project which was to try to fix a pretty hard coding problem using local agents running in a loop. The project is a compiler for biology protocols from vendors.

  245. tl;dr - For software development, Qwen3.6 27B, 5090 gives you ~3x speed over M5 Max, letting you plow through code, while M5 Max gives you ~4x memory, letting you use higher quantization and bigger context. Which would you choose and why?

  246. Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp Thought I would knowledge share Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF And am17an commit /media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/ll…

  247. With the explosion of apps like open claw, and the launch of my own app (trigger warning, not open source), there is massive demand for tokens. It used to be possible to avoid anxiety about your monthly bill by just buying a claude code su…

  248. Following my previous post https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq, a few people asked for the 35B A3B version. The model is up on HuggingFace at https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF if anyone wants to ch…

  249. Long post, but hopefully helps somebody. Llama-cpp vulkan server running single AMD R9700.

  250. Some of you saw our post a couple weeks back about hitting 102 tok/s stable on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: Github Atlas is open source.

  251. What I've noticed while using local LLM recently is that in most cases, bottlenecks occur not in decoding but in prompt processing. If the prompt processing speed is usable, in most settings (since it takes about 15k when starting based on…

  252. Both opencode and pi coding work, but I've hit the same wall with open weights. Qwen 3.6 and even fine-tuned variants, they drift into loops once the tool calls get nested or ambiguous.

  253. This post will have a slight old-man-shakes-fist-at-sky vibe, because… well… I’m older, so if you’re not into that, then please feel free to skip it. I have been contributing to this sub for like 3 years now but I’m fearful this post will lik…

  254. So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF. This is not a "best possible setup" claim.

  255. Hey folks, looking for advice before I delete or keep a huge model file. I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM.

  256. Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3.6-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF…

  257. Based on my last post and some comments, I added Qwen3.6:latest and Devstral to the evaluation. I am still looking for suggestions on which local model can run a complete TDD loop autonomously.

  258. WARNING: wait before downloading from HF: I just realised my upload of the new versions with the additional fix in the chat template has not completed yet. I will remove this warning once done. The recent PR to llama.cpp brings MTP support to Q…

  259. As the title says, which is better, taking into consideration that it will probably offload to CPU anyway? Models: Qwen 3.6 35b, and maybe Qwen 3.6 27b, though I am not sure it will be usable...

  260. How is this dual setup's performance? Is it difficult to set-up everything with for example llama.cpp?

  261. My weekend project overran a little but happy with the end result. soleval pass@1 beat Opus 4.7 on the same set of tasks.

  262. I've been using Qwen 3.6 with the Pi harness, and so far I'm really enjoying the experience. I've noticed Qwen is great at leaving inline comments when writing Typescript (haven't tried other languages).

  263. The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB VRAM setup.

  264. Every new model release I see now has thinking on by default. But then the production results I'm seeing don't justify it.

  265. Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on v100 32GB SXM module using one of those pcie card adapters. Pulled and built in one shot, and llama-server ran without a hitch.

  266. Both Gemma 4 and Qwen 3.6 seem to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks.

  268. Very quick initial test of Gemma 4's new MTP model via Ollama (llama.cpp doesn't support it yet) https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/ Running in Open Webui to view token/s output and I…

  269. Noob here, Running Qwen3.6 35B A3B in LM Studio on a 3080 10GB + Ryzen 5 3600 on Windows 10. Tried some unsloth quants with identical settings (GPU offload 40, MoE layers to CPU 40, context 8192, flash attention on).

  270. UPDATE: i have switched to vulkan (image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9014) and now i am getting prompt eval: 591.01 tok/s generation: 41.90 tok/s which is faster than rocm new config: services: llama-cpp: container_name: lla…

  271. Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article...

  272. llama.cpp parameters: -c 260000 --jinja --no-mmap model: HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced:Q8_K_P Based on…

  273. Just a reminder, the harness you use can make a huge difference (your LLM client and interface, basically). It's way mo…

  274. Just ran some llama-bench comparisons between ROCm and Vulkan backends on my Strix Halo system. Vulkan came out ahead, which surprised me.

  275. Are y'all using the preserve thinking flag or do you have it off? If so, why?

  276. Hi, recently froggeric and allanchan339 released enhanced/fixed template for Qwen3.6 each one addressing different topics. I didn't know which one to use so I merged both with the help of Claude Opus to have the best of both.

  277. Judging by what they state, it should be better than Qwen 3.6 27B

  278. I've tried the setup in the title today for some vibe coding (ctx=262144, temp=0.6). I must be doing something wrong because it doesn't really work for me.

  279. I'm currently running a 4x RTX 3090 system (96GB VRAM, DDR4 2133 RAM) and have tested opencode and pi.dev using Qwen3.5-122B-A10B (AWQ) up to 200k context for web app coding (html/js/python). I'm now seriously considering picking up two Sp…

  280. With all the high praise for the model all around, I also want to try it on my own. I have an rtx3060 12gb vram and 16gb system ram.

  281. Hey everyone, I’ve been experimenting with running Qwen models locally on my setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB CPU: Ryzen 5700X OS: Windows 11 What I’m currently running Qwen 3.6 35B (UD Q4_K_M) llama-server.exe -m "C:\Users\Dino\…

  282. ----START HUMAN TEXT---- Hi all, I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with quantized KV wil…

  283. Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…

  284. Native MTP speculative decoding on Apple Silicon ~2.24× over no-MTP AR at temp=0.6 on Qwen3.6-27B · math-correct rejection sampling · MLX-native · zero external drafter Multiplier is hardware-independent. Absolute tok/s scales with memory…

  285. Qwen3.6 35B A3B UD IQ4_NL_XL. 512k context tokens for 4 parallel processing, key cache quantized to Q_8 and value cache quantized to Q_4.

  286. So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.

  287. Qwen 3.6 27b vs Codex GPT 5.5 / Claude Opus 4.7: my local LLM discovered a bug that they both missed, and it turns out it's critical. GPT 5.5 and Claude both stood their ground and didn't give up until the end - they claimed to be right all a…

  288. If one has enough vram, would Sglang be a superior choice than vLLM or llamacpp in terms of inference speed for serving a model dedicated to powering a personal (single user) agent harness like Hermes agent? Sglang has MTP for speculative…

  289. Hi there. Not native English speaker.

  290. For the past two weeks, I’ve been spending 4–5 hours a day building a custom MCP (Model Context Protocol) orchestration server. What started as a simple experiment with Qwen 3.6 35B has evolved into a full-scale "Man-in-the-Middle" proxy t…

  291. In the rapidly evolving landscape of AI-assisted development, choosing the right "harness" is as critical as choosing the model itself. This benchmark explores two of the most prominent open-source harnesses for local LLMs: Pi Coding Agent…

  292. Hi all, I came across a relatively niche problem and could not find many useful posts or guides about it. I have a mini pc (Beelink Ser 8, 8745HS and 32GB 5600 DDR5 SODIMM) headless server for hosting some routing services, and I am wonde…

  293. I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights; KV cache at q8_0/q8_0 and 262K context only uses 2.7GB because MoE only has 10 attention layers out of 40; Total VR…

  294. I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…

  295. I’ve run models exclusively on apple silicon up until now, but wanted to up my inference game. I bought a slightly used RTX 5000 Pro Blackwell for a bit more than twice as much as two 3090s.

  296. I want to play with Qwen 3.6. Unsloth shows 4 different parameter options for different use-cases.

  297. I am quite curious, as I tried Gemma 4 31B, Qwen 3.6 27B, GLM 4.7 30B and some others in my native language (Czech). Gemma performs "best" and considering the fact it's "just" an 18GB model - it actually blows my mind how well it can respond in…

  298. Hi all, I have recently installed openclaw on a raspberry pi4, linking it to my local Ollama instance (RTX 3090 with 24Gb, as well as 96Gb of DDR5 RAM bought before the madness), in my case running Qwen3.6 (latest) capped at 16k context. A…

  299. Hello everyone. Over the last couple months I have been assembling my local AI setup for personal use, and I thought to write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.

  300. In my current workflow (coding in python/c++ and technical reports) I mostly use Qwen3.6 27B and Gemma4 31B. In the past I tried other models like Deepseek with decent results but was painfully slow....

  301. I have snobbishly long felt that the local models were not 'up to my standards' for local development, or otherwise able to compete with GHCP, Claude Code, Cursor etc. Boy was I wrong.

  302. Hello, I built a multi agent AI trading floor for a school project: 10 agents (news, research, macro, crowd sim, trading…) Running 100% locally on Ollama, Gemma 4:26b, qwen3.6:35b, gemma4:31b. no paid APIs.

  303. (I'm on Windows system running these models locally) I've used both Codex and OpenCode with Qwen 3.6 27b and 35b running locally. I'm having a bitch of a time getting them to correctly create files.

  304. I've had better results quality wise with 35B AND it's much faster than 27B. Just curious cause I see lots of people post about 27B.

  305. I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different mod…

  306. https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/ Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, v…

  307. Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.

  308. I wanted to share an open-source app that I built for running LLMs locally on my setup. My setup Hardware FEVM FAEX1 (128GB) RTX Pro 5000 Blackwell (48GB), connected over OCuLink Aoostar AG02 2x2TB internal m.2 drives on raid-0 using mdadm.

  309. Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants)…

  310. I run Qwen-3.6 27B FP8 on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between spee…

  311. Hey all, my agentic coding stack includes claude-code 20x max and codex 20x max. I use heavy scripting for orchestrating and testing multiple projects, and have been AI coding for 3 years.

  312. I got into LLMs late 2024, and local in Jan 2025. since then, I’ve upgraded my mini PC then added eGPU with 5070 Ti back when it was retailing for $750-$800.

  313. I struggle to wrap my head around all this. My goal is local agent to solve low complexity tasks, in the same harness where I would use frontier models.

  314. LDR maintainer here. Thanks to the strong support of r/LocalLLaMA community LDR got very far.

  315. The angle here is native Windows, no WSL. Simple installation, open source, no telemetry.

  316. Does anyone do this? Any tips?

  317. Have Qwen hinted at whether other models (9B, 122B, 397B) would be getting the 3.6 treatment? Or have they in any way confirmed or hinted at "this is it"?

  318. It is a plugin made for ONLYOFFICE, much simpler than copy-pasting from the webui. PS.

  319. Model: Abiray-Qwen3.6-27B-NVFP4.gguf Specs: - Legion 7i Gen10 - NVIDIA GeForce RTX™ 5090 - Intel® Core™ Ultra 9 275HX × 24 - RAM 32.0 GiB llamacpp settings: ./build/bin/llama-server \ -m ~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GG…

  320. So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up…

  321. A friend is going on vacation for a couple weeks and is lending me an RTX 6000 Pro rig to mess around with. Holy cow, it is so much faster than my 4080 Super!

  322. This is my system: OS: Nobara Linux 43; Processor: Ryzen 9 5980HX; RAM: 16 GB; GPU: Radeon RX 6800M (12GB). I'm using llama.cpp, and Qwen3.6-35B-A3B-UD-Q4_K_M is working okay on this system using Vulkan. I'm getting a speed of ~17 t/s.

  323. Hi all, I tried to set up my PC to run an LLM, but ran into an issue: the first question of the chat is generally fine, but from the 3rd follow-up question, the backend often becomes unresponsive and I have to manually restart the llama.cpp server, or…

  324. I experienced this with Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. It starts saying things which sound similar in thinking mode: I must do ....

  325. by Kyle Hessling A hands-on benchmark of the Unsloth dynamic Q5 quantization, self-hosted on a single RTX 5090. 19 runs, 93.9 k generation tokens, across agentic reasoning, production-grade front-end design, and canvas / WebGL creative cod…

  326. So I am using this model in tax accounting. Have a shitty Ryzen 9 7940HS (8C/16T), 60 GB RAM, Radeon 780M iGPU, 1 TB Kingston NVMe, Win 11 Pro.

  327. I’m building OpenYak, a desktop AI workspace for using local models with real files on your computer. In this demo I’m using Ollama with Qwen/Qwen3.6-35B-A3B to review an attached budget workbook.

  328. Hello folks. What is the best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM Studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.

  329. Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens.

  330. Surely a 4x bigger model should be more expensive for inference?! API prices at e.g.

  331. I normally seldom experience loops, either reasoning or responses, using Qwen 3.6 27B Q8 with 256k context window in Agent Zero. But the 35B A3B Q8 with 256k context window gets constant loops and is basically unusable within Agent Zero.

  332. I’m pretty overwhelmed. I feel like there are so many options that I don’t know which one to choose, and trying things until I find a decent one isn’t really my thing—even though I enjoy it.

  333. I've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking".

  334. Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS). We’ve been pushing further on both context length and stability for tool-agent workloads.

  335. I wanted to see how much of my coding-agent workflow I could move local instead of paying for hosted tools forever. There was another push: Anthropic's own April 23 postmortem confirmed product-layer regressions through March/April.

  336. I am currently utilizing a single RX9070 16GB, achieving a performance of 20 tokens per second with Qwen 3.6 27B. Would integrating an additional RX9070 enhance this performance, or would the output remain consistent?

  337. Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows.

  338. Longtime lurker here, thought I should post my speeeeds... I have an RTX 4070S 12 GB VRAM (+10% OC), AMD 9800X3D with 4x16 GB DDR5 6000 MHz CL30.

  339. I have an RX 7900 XTX running Qwen3.6 27B Q4_K_XL. Got 400ish pp and 30s tps.

  340. Hey y'all! I've recently written a text in Russian about my experience comparing Qwen-3.6-27B with lower tier cloud models on hard tasks -- I wanted to share the translation of the post, since I found the results interesting and surprising.

  341. It's been a busy week testing and trying to get the 27B model set up correctly. TL;DR: The only setup that worked for my dual 3090s was this one.

  342. Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops. This isn’t a criticism of Qwen3.6.

  343. Source: https://docs.tenstorrent.com/systems/quietbox/quietbox-bh-2/specifications.html Currently supported models: https://tenstorrent.com/developers From the specification docs above: CPU: Ryzen 7 9700X 65W Granite Ridge 3.8GHz Memory: 2…

  344. Hugging Face link here. I've been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs, as his Qwen 3.5 was my GGUF of choice.

  345. Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?

  346. https://preview.redd.it/o1uxb57u47yg1.png?width=862&format=png&auto=webp&s=d38204fe6ccd0d8326dcd98a534e9a226d213f99 How trustworthy is the Artificial Analysis intelligence index? So according to them Qwen 3.6 27B is better than bigger MoE mod…

  347. This may be naive, but if we stripped a model of its image processing/voice processing capabilities, would that make it smaller or faster? Is that even possible?

  348. Followup to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point.

  349. Hi all, I'm running Qwen3.6-27B-UD-Q6_K_XL.gguf using llama swap and llama-server with these parameters (actually stolen from some posts on this subreddit): llama-server \ -m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \ --mmproj /models…

  350. Looks like progress has been made on -sm tensor. Couldn't even run llama-bench a few weeks ago: 1 card - 1580/44: $ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB): Device 0: NV…

  351. I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…

  352. I tested two llama.cpp builds on the same Qwen3.6-27B-NVFP4 model. llama-bench reports the model label as qwen35 27B NVFP4, but the actual tested model is Qwen3.6-27B-NVFP4.

  353. I've been running Qwen3.6-35B-A3B locally in pi agent and hit a cat spam problem. The agent just ignores the read tool, and the model gets stuck reading the same file 3-4 times using cat, or dumping entire 2k-line logs instead of grepping.

  354. Well, or pretty close to it; they are excellent workhorses. I run them in real work scenarios doing some of the work I used to do myself as a skilled expert in my field, billing $200 an hour.

  355. I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards. Hardware: 2x RTX 5060 Ti 16GB 32GB total VRAM Proxmox LXC 16 vCPU ~60GB RAM CU…

  356. Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989? According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe…

  357. Has anyone got a reliable vLLM recipe for 3.6 27B that fixes the tool calling issues? I am getting "Not let me..." - then nothing.

  358. Hello, I would like a suggestion from those who are already actively involved in this world. Basically, I own this workstation: Ryzen 9 5900X, 32GB of DDR4 RAM, RTX 5060Ti, PCCOOLER CPS YS1000 1000W. Currently, I can quite easily code with Qwe…

  359. Hey all, I'm having a crisis that I just can't figure out... I've used Qwen3.6-27B-GGUF:UD-Q8_K_XL ever since it came out (on a DGX SPARK) and it worked like magic with decent performance (~50 t/s). I'm updating SPARK and llama.cpp on a daily basis…

  360. I usually go for Claude for those kinds of Open WebUI tool creations, but rate limits are getting tight so I decided to just let Qwen3.6-27B-Q5 handle it through Open WebUI. It did it in one shot.

  361. Just dropped another 3&5 mixed quant for the RAM Poor Base-model-only Mac users that want to try Gemma4 top of the line LLM. 6 GB smaller than the other 3bit-mlx out there and 25% faster.

  362. Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K.

  363. Hi, I've got 1x 3090 and I'm thinking about both of these models. I've been using Qwen since Friday and this model is amazing!

  364. Full disclosure: I used to program a bit, but I was garbage at it so I found a new career. This was eons ago so I'm not a dev, obviously.

  365. Qwen3.6-27B IQ4_XS Bloat: Reverting llama.cpp commit saves 16GB VRAM (14.7GB vs 15.1GB) + KVCache Tests With the release of Qwen3.6-27B, I noticed that compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 vers…

  366. Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval: code generation HellaSwag: commonsense reasoning BFCL: function calling Total samples: HumanE…

  367. The memory bandwidth on the r9700 looks quite good compared to my Mac or a Strix Halo and I'm wondering how this turns out. Thanks!

  368. My first time releasing a fine-tune publicly! If anyone wants to independently eval against base, that’d be awesome.

  369. I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. The VRAM is a little bit scarce, but I ended up with this so far: llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \ --port 8080 \ --host 127.0.0.1 \ --t…

  370. I want to test this model out but I don't have a setup that can do it locally. openrouter and all my coding plans don't include it.

  371. Trying to find the sweet spot in the tradeoff between power and tg/s. 250W seems to be a sweet spot for Qwen3.6-27B.

  372. How many ll's are in Stargate's TV Show's leader? The answer depends on which "leader" of the Stargate TV series you're referring to, as command changes throughout the franchise: General George Hammond (Seasons 1-3…

  373. Was just wondering what people feel is better. I do need 262K context, so these are the biggest quants of each I can fit on my 3090 with KV cache at Q8.

  374. i wanted to see how fast i could make qwen3.6 35b run on a single h100, so i put together a sglang setup for it. it exposes an openai compatible api and also works with claude code through anthropic compatible routing from the connect tab.

  375. We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. Best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout — the same constraint the public…

  376. At least in open-webui. Nothing has changed except for the backend update.

  377. Hi, I currently have access to 2–3 RTX 3090 GPUs (ideally I’d like something that runs well on 2). I can install models up to around 100 GB in size.

  378. https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-31B…

  379. I'd jump on runpod and ssh in to test my workloads, but they don't have it. Would love to know how well this runs, particularly as context approaches a full 256K.

  380. I have a 4 x R9700 system on Threadripper pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking any new model I try out with llama-benchy so that I can get a better idea of how models of diff…

  381. Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding.

  382. Hi folks, Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general purpose - it's able to solve puzzles correctly more often too. The initial intent was to optimise the 35B-A3B reasoning traces since it's the most efficien…

  383. I am getting 4 t/s with Qwen3.6-27B-Q4_K_M which seems much slower than I'd expect. I am running LM Studio on Ubuntu 22.04 with the following specs: Dell Precision 5690 AI-ready workstation NVIDIA RTX 5000 Ada Generation GPU with 16GB VRAM…

  384. At this moment, models such as Qwen 3.6 35b/27b crush the competition, yet I can't help but notice this pattern. While the local RP scene is abundant with the Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…

  385. Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…

  386. (Links to all files, apps, and repos mentioned in this post can be found in the 'full post' link at the bottom) Agents for document redaction and review tasks Document redaction tasks involve text and vision capabilities, and long context…

  387. Qwen3.6-27B vLLM Docker Docker-based vLLM serving for Qwen3.6-27B with Lorbus AutoRound INT4 quant and MTP speculative decoding. Model is downloaded at runtime and stored on a host volume so the container can be upgraded without redownload…

  388. Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice speed increase for working on the same cod…

  389. I'm a daily llama-cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM.

  390. Hey all: I am trying to set up claude code to work with llama.cpp, and I am using the Qwen3.6-35B-A3B. I usually use claude code + a ZLM subscription (I got lucky with $30 yearly). The setup is very simple with their automated script, but for th…

  391. I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool-call-parser qwen3_xml \ --enable-prefi…

  392. I've been using VSCode with Github Copilot for a bit (free tier) and am looking to try running locally due to running into all of the limits with GHCP. I'd like to have as close of an experience as possible with both code autocomplete and ch…

  393. Been using this for a few days. It is BY FAR the best uncensored model I have found for Qwen 3.6 35B.

  394. Config CtxSize: 131,072 GpuLayers: 99 CpuMoeLayers: 38 Threads: 16 BatchSize/UBatchSize: 4096/4096 CacheType K/V: q8_0 Tool Context: file mode (tools.kilocode.official.md) Metric M Model XL Model Difference Avg Tokens/sec 28.92 29.78 +0.86…

  395. As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.

  396. Seeing how people praise it, I tried giving it an implementation plan that Sonnet generated, but qwen keeps breaking files and goes in circles: Thinking… The file got corrupted from multiple overlapping edits. Let me just rewrite the whole fi…

  397. Which coding agent extension are most of you finding best with LM Studio as the local server? 🤔 I'm running qwen 3.6 27b. I've used Cline and Continue mostly. I haven't checked out all the options, but I'm looking for something that looks and feels…

  398. Apologies in advance, if this is a newbie question. When running Qwen3.6-27B-FP8 using the below command on an Nvidia RTX PRO 5000, in opencode, I am seeing errors such as: "The issue is that the JS file is too long and causing JSON trunca…

  399. https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932 KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important.

  400. https://preview.redd.it/tblmrwxkbexg1.png?width=1193&format=png&auto=webp&s=6dea1e6684e75e22852d57c0c72e9171deb56ae2 I have experimented with how to run Qwen3.6-27B on my laptop with an A5000 16GB GPU. I have created my own IQ4_XS GGUF "qwen3.6…

  401. Hey folks — looking for some advice on improving my local LLM setup (and also exploring agentic coding workflows). Current setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB Using llama.cpp with a Qwen3.6 27B Q6 model (GGUF) Running through OpenCo…

  402. I made a simple app using openrouter, hoping to use the new small qwen models (the a3b moe and the 27b dense one), but they aren’t listed. Also, I swear some qwen3.6 models that were listed before are missing now.

  403. Seriously, Qwen3.6 27b is mopping the floor against models like 5 times its size right now. It doesn’t take a rocket scientist to figure out that maybe the whole a2b and a3b MoE thing isn’t the best solution after all.

  404. TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my perso…

  405. What would you say is the minimum amount of tokens per second you would tolerate for your local agent workflows? I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context running on an RTX A60…

  406. Right now I'm running Qwen3-27B-Q4_K_M on a 2060 12G + 5060 Ti 16G with tensor split 15/7. Gen speed sits around 16.5 t/s and prompt eval drops from 653 to 356 t/s as context grows.

  407. Qwen3.6-27B has been out for a few days and the NVFP4 with MTP was dropped earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at…

  408. I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple.

  409. Hi all, I have a server (dual 7900xt) running qwen3.6 27b in LMStudio, because I love LMlink for its ease of use and I am okay with the model chugging along at ~25t/s in the background. I then serve the model to my Mac, via LMlink.

  410. When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5…

  411. I've been trying to fix performance with llama-server and seem to be hitting a wall. Using Q4_K_M by unsloth and IQ4_K_M by DavidAU, when asking a question with no context, 39 t/s.

  412. TLDR: Should an RTX 3090 + T4 be faster than a P40 + T4 for OpenCode with Qwen3.6 35B A3B? Hi, I currently have this architecture running: a Tesla P40 w/ 24GB VRAM and a Tesla T4 w/ 16GB VRAM. I mainly use this setup to run models like GPT…

  413. could not extract summary

  414. I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it was released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that K cache compression is not really recommended in most case…

  415. So maybe this is a no-brainer to many experienced local LLM users, but it was not obvious to me. I am running a 3070 8gb + 64gb DDR4.

  416. I'm sure people have asked before for settings for these GPUs, but for me, no matter what I do, it doesn't work as well as 3.6 35B! I've tried vLLM and llama.cpp.

  417. The new dense model is great, but I’m trying to figure out how to increase PP and Token generation speed. I’m running Q8 quants across 3 7900xtx GPUs and I’m consistently only getting 18-20 t/s generation speed and ~650 t/s prompt processi…

  418. A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At that time, I had my monitor plugged into my 4070 and didn't even…

  419. Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results 4 models tested with q8_0 and q4_0 KV cache against full-precision baseline What this measures KV cache quantization stores the key-value cache in lower precision to s…

  420. Hey guys, I built a custom vLLM pipeline to run Gemma 4 (31B FP8) and Qwen 3.5 side-by-side locally to see how they actually perform in the wild with preprocessing of audio and images. But of course new model Qwen 3.6 27B came out just whe…

  421. I have tested Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my 4x3090 system (opencode) and find it really good and fast. However, I can't find any uncensored models for vllm (preferably as AWQ).

  422. Hello fellow members of this lovely community, Let me start by saying that I’m about as far from a professional developer as it gets. I’m a hobbyist whose entire coding experience consists of building various Python/VBA tools and simple Ja…

  423. I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…

  424. TLDR: Swapped Ollama for MLX on M1 Max (64GB) to run a 12-agent trading stack using Qwen 35B MoE. MLX wins on throughput and fine-grained sampler control, but I lost the "it just works" convenience of Ollama.

  425. Please comment here if you just cancelled your claude subscription, so that we can see how much confidence you have in open-source or open-weight models, especially with the qwen3.6 release.

  426. Maybe this will be helpful for someone: llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000 Can't run this model with higher KV quants at >8192 ctx size. -ub & -b set to 256 allowed me for…

  427. been running qwen 3.6 locally and im shooked. but what are we doing about agent memory because it's still a complete mess.

  428. Just tested the MoE qwen model with 2-bit precision and it's surprisingly good. I used the 2-bit xxs from unsloth and it seems to maintain intelligence really well, never failed a tool call so far and surprisingly good at 3js, even better than…

  429. I tried working on a local LLM project today and honestly ended up pretty frustrated. I tested several approaches, but none of them worked reliably.

  430. MacBook Pro M5 MAX 64GB. Qwen 3.6 35B - 72 TPS.

  431. Trying to find out the best local LLM inference engine for Hermes in terms of performance and memory footprint. Tool calling accuracy is already up there, so I focused on pure token crunching.

  432. It's working very well out of the box on the tiny harness of kon; ~270 tokens without the tool schema (~1000 tokens including). https://github.com/0xku/kon Members from LocalLLaMA have already contributed many interesting features recently…

  433. pi llama.cpp awesome torus Windows, 5070 (12GB)

  434. The dense sibling of the 35B-A3B drop is here, Qwen3.6 27B Uncensored Aggressive is out! Aggressive = no refusals; NO personality changes/alterations or any of that, it is the ORIGINAL release of Qwen just completely uncensored https://hug…

  435. Hi all, With the recent release of models that require temp = 1, top_k = N, and top_p = 0.95, I'm wondering why labs actually prefer those truncation samplers over just min_p? As far as I understand, min_p isn't supported everywhere, and t…

  436. Hey everyone, I run llamacpp precompiled with CUDA 12.4 on Windows 11 with a RTX 4090. With small models like gemma-4-E4B everything runs fine, but as soon as I run a bigger model like Qwen3.6-27B (IQ4_NL) or a medium sized model with larg…

  437. I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0. This is on 3x GPUs (1x 2060 Super 8GB, 2x 5060 Ti 16GB), via llama.cpp. Unsure if this is slow or to be expected? */llama-server --port 8080 --model */llama.cpp/Qwen3.6…

  438. original image Qwen3.6-27B-UD-Q5_K_XL.gguf Qwen3.6-35B-A3B-UD-Q5_K_S.gguf ...you tell me. system prompt: You are Qwen, created by Alibaba Cloud.

  439. I just got a used RTX a5000 24gb to use for local models, I mainly use AI to code, but I prefer to spend some money now instead of $200 per month on claude to use 50% of it in a single prompt. My current specs are: Ryzen 7 9800x3d 64Gb DDR…

  440. Heard of hermes and openclaw; they are great but take a bit of time to set up properly. Now that the Qwen3.6 27B is out I want to have a forever-running agent to track news and whatever cool shit there is.

  441. I am blown away by what this model can generate locally. I asked for a flashy Tetris game with particle effect and boy did it deliver!

  442. Yesterday's Claude Code Pro removal thread hit 350+ comments in a few hours, and the dominant take was basically "switch to Kimi K2.6, go local, done." I upvoted that thread and tbh im mostly there — but im building voice agents and RAG pi…

  443. Qwen Lens Studio A multimodal AI studio built around a single Qwen vision-language model, exposed through five focused tools plus a batch runner and a persistent session log. Ship a screenshot → get code.

  444. Does anyone else have the same experience comparing these two? For me 3.5 122B outperforms 3.6 by a large margin. 3.6 gets lost as soon as the task requires a couple more steps.

  445. Do I understand correctly, based on this comment, that I can potentially fit Qwen 3.6 27B FP8 precision model and have around 256K context available and fit it fully in my RTX 5090 VRAM? Of course with the help of TurboQuant compression, a…

  446. https://preview.redd.it/qtzdx5ud0rwg1.jpg?width=1200&format=pjpg&auto=webp&s=aa25d9f0bb8007ee6e4065cfa46a9685454c89cd - Outstanding agentic coding, surpasses Qwen3.5-397B-A17B across all major coding benchmarks - Strong reasoning across te…

  447. Has anyone tried these? I found this on ollama: https://ollama.com/library/kimi-k2.6, https://ollama.com/library/qwen3.6 My issue is that they are extremely slow on my local machine.

  448. Qwen3.6-27B [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransforme…

  449. Hi, I'm trying to run Qwen 3.6-35B on my RTX 3090 (24 GB of VRAM) but I'm not sure about 2 things: - Which variant of the model to use? (Q4_K_S, Q3_K_XL, other?

  450. Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight.

  451. A short follow-up to my previous post, where I showed that changing the scaffold around the same 9B Qwen model moved benchmark performance from 19.11% to 45.56%: https://www.reddit.com/r/LocalLLaMA/s/JMHuAGj1LV After feedback from people h…

  452. Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context and was very much disappointed because it would loop, make formatting errors, implement wrong things etc.

  453. I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: - Create IndexNow CLI in Golang (Easy Task) and - Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwe…

  454. Hey there, I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I have noticeable stuttering when I run the model as it fills the VRAM completely, and I am sure I…

  455. 9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.

  456. Hi all, I’m new to local LLMs but as someone who extensively uses agentic coding I thought I’d try it out. I am running a MacBook Pro with M3 Max 64gb ram.

  457. I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.

  458. The difference is quite big:

       Model             Likes   Downloads last month   Finetunes
       Qwen3.5-27B         952              3,233,034         263
       Qwen3.5-35B-A3B   1,397              3,977,637          87
       Qwen3.6-35B-A3B   1,115                458,436          60
       gemma-4-31B         323                343,895          13
       gemma-4-26B-A4B     227                118,464          13

  459. I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…

  460. I specifically remember when qwen3 coder came out and it was one of the only few models out there that could totally take over a repo and actually do things in VSCode without emptying your bank account. and when that the qwen3 coder 30B was so fa…

  461. What would you choose if you were in my shoes? How viable is 125k for agentic coding really?

  462. Hi guys, just want to share with you a Frankenstein build I put together that is surprisingly decent. I have an i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to a RX9070, then I…

  463. Hi guys, I'm on a 9950x with 196gb and a 4090. Are these parameters OK? My main use will be coding: llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --n-cpu-moe 20 -c 250000 --host 0.0.0.0 --port 8082 --reasoning-budget -1 --top-k 20 --top-p 0…

  464. Hardware: Intel Core Ultra 7 258V, 32GB Unified Memory. Model: Qwen 3.6 35B A3B (Quant: Q3_K_S) via LM Studio.

  465. So, I am excited about the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run on my hardware.

  466. Short: I want to generate with Qwen 3.6 something like this https://preview.redd.it/bd6rbgnoatvg1.png?width=960&format=png&auto=webp&s=a1c079f37c048fa2c687709465b0c830a0184a4c After many hours, I'm able to generate a working file without w…

  467. I guess the model didn't feel it needed to do anything beyond proving. Not entirely sure how I got it to act so..

  468. The 3.5 122b model already is fantastic at 4-bit. Really the best model I ever ran on my 4x3090, but from what I read how 35B 3.6 is doing, the 3.6 122b model would be an absolute value banger.

  469. Quick demo of KV cache compression on Qwen 3.6 at 1M context. In this run: KV cache: 10.74 GB → 6.92 GB V cache: 5.37 GB → 1.55 GB (~3.5× reduction) Still seeing near-zero PPL change in early tests (3 seeds), but focusing mainly on memory…

  470. I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information fr…

  471. Hey guys. Some people, including me, are having trouble with qwen3.6-35b tool calling.

  472. I'll be testing the setup and trying out the Hermes Agent live: https://www.youtube.com/live/q5vqvwZykRI

  473. https://preview.redd.it/dfqed57qgsvg1.png?width=1706&format=png&auto=webp&s=3859209698d2e844e2731326e355d60928658f8a The most fun part was reasoning, here is a gist: https://gist.github.com/anzax/5f06716c66180013cd715f6c2e5848df There is a…

  474. Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.

  475. What are some ways that you would go about thinking about choosing between the two for use in a harness like pi? Did a good bit with q4 yesterday and it was so consistent and reliable I had it set to 131k context and it worked through 2 co…

  476. Hi, I have recently been thinking about buying a personal GPU for hosting open-source models. Can someone give any suggestions? Also, suppose I don't want to remain restricted to qwen 3.6 but also do some math-heavy tasks, for which I want deepseek or…

  477. Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier.

  478. I've been very impressed with qwen3.6-35B-A3B on Apple Silicon (and actually my AMD iGPU setup with DDR5 and a 760M does well too). It can actually navigate a codebase and write useful code.

  479. https://preview.redd.it/na4ub5yzprvg1.png?width=1654&format=png&auto=webp&s=e356e0ab0829bb275352d1035c35c645a381c3c7 I am using Kaggle to serve Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf but tool calling is not always working. I also tested it with R…

  480. Just gave the new Qwen3.6-35B-A3B a spin. It’s a MoE model (35B total, ~3B active), but honestly the more interesting part is how much they’re pushing agent-style coding.

  481. I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know).

  482. MSI B650 Gaming Plus, 9800X3D, 64GB DDR5 6400 MT/s, Windows 11. When I first boot my PC and run this model, I get 155-160 t/s, and for some reason, after a couple of minutes, say 10 minutes, not using AI or anything in particular, GPU temp at 40c…

  483. I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth. I've been using LLMs in my per…

  484. I’ve been seeing a lot of good feedback about the qwen 3.6 model and its reasoning performance but has anyone tested it with reasoning off? I’ve been building a low latency app using Qwen 3 30ba3b 2507 and 3.5 no think was not an improveme…

  485. Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task.

  486. QWEN 3.6 35B A3B MXFP4 https://preview.redd.it/bclr8ukcoqvg1.png?width=904&format=png&auto=webp&s=853b211505ef6b9184d0571ca8fc46295437322a hey everyone this is my first post, anyways the thing is that there is this program called https://m…

  488. Hey all, we’re a mid-sized company (~70 people) and currently planning to bring a lot of our workloads on-prem instead of relying on cloud APIs. The goal for the moment is to run small to mid-sized models in the range of 30B like Qwen3.6 o…

  489. Hello, I saw the new model is out, but even with 24GB of VRAM I have too many browsers and tasks open to use it, so I downloaded and tested the HauHauCS version https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressiv…

  490. I do not want to spread conspiracies, so please weigh my information carefully, and maybe someone can hopefully prove me wrong. I installed the brand-new qwen3.6 yesterday and ran a few of my own traditional tests, not a very deep dive, just…

  491. https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Base-GGUF Yes, Qwen 3.6 is out and it's a great model. However, anyone who needs an even more "uncensored but official" model can try this one.

  492. Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…

  493. https://preview.redd.it/4906akj9dovg1.png?width=1527&format=png&auto=webp&s=c49e255ac79a3c5455f44603422f8af7ddc12594 First of all, can we make https://www.youtube.com/watch?v=2lUC8Gimxz8 Angine de Poitrine this sub's official band? Those guy…

  494. Full JANG adaptive mixed-precision quantization sweep of Qwen3.6-35B-A3B: https://huggingface.co/collections/bearzi/qwen36-35b-a3b-jang All 15 profiles, from extreme compression to near-lossless: JANG_1L JANG_2S/2M/2L JANG_3S/3M/3L/3K JANG…

  495. I was disappointed with Gemma 4 due to various bugs and, in the end, lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both model quants and llama.cpp…

  496. could not extract summary

  497. Basically I am asking the model to describe an image, but it says it can't process the images. The weird thing is that if I send the image encoded directly on the prompt, it works just fine, I am using llama-server with qwen3.5 (tried all…

  498. Been running the new model entire evening in different quants and coding tasks with OpenCode. Used oMLX and LM Studio.

  499. TheTom's GPU-accelerated turboquant (turbo3) has unlocked high-context gains for the 35B-A3B family. I can now achieve ~40tg/s via the following GPU-POOR compilation flags and configuration: cmake -B build -DGGML_CUDA=ON -D…

  500. I had previously posted here about a fix to their 3.5 template to help resolve the KV cache invalidation issue from their template. A lot of you found it useful.

  501. Hey, I have an M1 Pro 16GB machine, and I wanted to run the Qwen3.6/3.5 35B-A3B model. However, this model does not fit on my system at a 4-bit quant.

  502. Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7. 16th April 2026. For anyone who has been (inadvisably) taking my pelican riding a bicycle benchmark seriously as a robust way to test models, here are pelicans from…

  504. Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamac…

  505. Note: first is Qwen3.5 35B MoE (left) and second is Qwen3.6 (right). Hi guys, I just did a quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE, with reasoning off, using llama.cpp and the same quant, unsloth 4 K_XL GGUF. First is Qwen3.5 outco…

  506. This is my first test with this model and Qwen impressed me. I would rate it a 98% usable web OS, compared to my previous best, a 70% usable result from qwen3 next coder at q2.

  507. I configured it to the best of my abilities, even at Q8. It fails to give the correct number of tools it supports on Claude Code and it fails the car wash test.

  508. Just swapped Qwen 3.5 for the 3.6 variant (FP8, RTX 6000 Pro) using the same recommended generation settings. My stack is vLLM (v0.19.0) + Open WebUI (v0.8.12) in a RAG setup where the model has access to several document retrieval tools.

  509. I was working on a simple frontend web design task earlier (styling some buttons) with Qwen3.5-35B-A3B. The end results weren't great, but at least it kept trying to change stuff and call tools properly.

  510. Qwen3.6-35B-A3B [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…

  511. Qwen Studio offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.

  512. Meet Qwen3.6-35B-A3B: Now Open-Source!🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.

  513. GLM 5.1 is dominant in almost every aspect in Design Arena, surpassing Opus 4.6 in many tasks. Although user experience varies depending on the subscription plan for both, one of them is open source.
