model

Qwen3.6-35B-A3B

5,910,215 downloads · 2,016 likes · image-text-to-text · transformers

from the model card

Qwen3.6-35B-A3B

Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6 Highlights

This release delivers substantial upgrades, particularly in:

Agentic Coding: the model now handles frontend workflows and repository-level reasoning with greater fluency and precision.
Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.

For more details, please refer to our blog post Qwen3.6-35B-A3B.

Model Overview

Type: Causal Language Model with Vision Encoder
Training Stage: Pre-training & Post-training

Language Model
Number of Parameters: 35B in total and 3B activated
Hidden Dimension: 2048
Token Embedding: 248320 (Padded)
Number of Layers: 40
Hidden Layout: 10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
Gated DeltaNet: Number of Linear Attention Heads: 32 for V and 16 for QK…
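The hidden layout quoted in the excerpt can be checked against the stated layer count with a few lines of Python. This is only an illustrative sketch: it assumes each "mixer → MoE" pair counts as one layer, and the string labels and the `expand_layout` helper are hypothetical names, not identifiers from the model's configuration.

```python
# Pattern quoted in the model card excerpt:
#   10 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
# Assumption: one "mixer → MoE" pair is counted as one layer.
REPEATS = 10             # outer repetition count
DELTANET_PER_GROUP = 3   # Gated DeltaNet → MoE blocks per group
ATTENTION_PER_GROUP = 1  # Gated Attention → MoE blocks per group


def expand_layout(repeats=REPEATS,
                  n_delta=DELTANET_PER_GROUP,
                  n_attn=ATTENTION_PER_GROUP):
    """Return the flat, ordered list of layer kinds (labels are illustrative)."""
    group = ["gated_deltanet+moe"] * n_delta + ["gated_attention+moe"] * n_attn
    return group * repeats


layers = expand_layout()
print(len(layers))   # total number of layers under the assumption above
print(layers[:4])    # the first repeating group
```

Under that assumption the expansion gives 10 × (3 + 1) = 40 entries, which matches the "Number of Layers: 40" line, with every fourth layer using gated attention.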

discussions

Qwen 3.6 52 2026-06-05 – 2026-06-16

recent items

Qwen 3.6 93B with MTP on 2×RTX 3090 NVLink=187 tokens/SEC,LLM lost bleat-a-thon (github.com via hn) +1 12d

↯ Copilot ↯ Qwen 3.6 copilot qwen mcp
RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8 (imil.net via hn) +2 13d

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.

↯ Qwen 3.6 qwen
How to Setup a Local Coding Agent on macOS (ikyle.me via hn) +4110 13d

Running Gemma 4 26B-A4B and Qwen3.6 35B-A3B locally with llama.cpp, MTP speculative decoding, multimodal support, and PI as a coding agent. I'd had my internet fail a few times recently leaving me…

↯ Qwen 3.6 gemma llama
advice for dual-gpu asymmetric (www.reddit.com via reddit) 2w

Hello everyone, i had a 3080ti 12gb and added a 3080 20gb, so it has a bit less speed but more memory than my main card. I could finally get some speed with the usual suspects (i am testing gemma 4 31b/26b-a4b and qwen 3.6 27b/35b-a3b), BU…

↯ Qwen 3.6 gemma qwen llama
Running Claude Code Offline on an M3 Pro with Qwen3.6 (har-ki.github.io via hn) +71 2w

06 — Air-Gapped Claude Code: the setup, the fixes that make it work, and the hardware that sets the pace. Claude Code connects to a model running locally on the laptop. You provide a Kubernetes incident for investigation.

↯ Qwen 3.6 claude-code
Reasoning, but without actually *drafting* replies? (www.reddit.com via reddit) 2w

I've been experimenting a bit today with letting models reason for creative tasks, rationale being that it might help with keeping track of details and prompt adherence. And predictably, the wall I'm running into is that they all want to d…

↯ Qwen 3.6 gemma
Is Qwen 3.6 27B IQ4XS better than Gemma 4 31B QAT as a Hermes agent? (www.reddit.com via reddit) 2w

If Gemma 4 is better, does anyone have a link for the latest fixed template? Using LMstudio.

↯ Qwen 3.6 gemma qwen
MTP hyperparameter search (www.reddit.com via reddit) 2w

TLDR; I only got a 6% improvement on tokens/sec over naïve parameters. I was messing around and ran a hyperparameter search with optuna over the MTP and speculative decoding options of llama-server for Qwen3.6 27b on strix halo.

↯ Qwen 3.6 qwen llama
Executing a plan under context constraints (www.reddit.com via reddit) 2w

I'm running Qwen 3.6 35B-A3B via Pi harness on a 32gb unified RAM setup (Framework 13). llama.cpp, 64k context window.

↯ Qwen 3.6 qwen llama
Harnesses seem to have an issue. (www.reddit.com via reddit) 2w

There's a post i saw about Claude Fable where a user asked the model the car wash question and it sent me down a rabbit hole. I spun up qwen on llama.cpp and in the llama.cpp chat interface I asked the model and it got it right consistentl…

↯ Qwen 3.6 qwen llama opus
Need help improving speed of inference (www.reddit.com via reddit) 2w

Hello i'm running the qwen 3.6 27b in ud q5k xl, and with all the optimizations it barely fits in my 3090 vram with a 120k context, i'm sure it does not spill when context is full but i would like to improve the token generation speed. I w…

↯ Qwen 3.6 qwen llama
Deploy a Qwen 3.6 Agentic RAG — Step-by-Step Walkthrough (medium.com via reddit) 2w

Deploy an Agentic RAG powered by Alibaba’s latest Qwen 3.6, running fully on your machine.

↯ Qwen 3.6 rag qwen agentic
Qwen3.6-MTP-27B on Tesla V100 @ 55 TPS (llama.cpp) — Any way to push this higher without quality loss? (www.reddit.com via reddit) 2w

Hey everyone, I'm running Qwen3.6-MTP-27B-MTP (Q4_K_M) with llama.cpp server on a Tesla V100, and I'm currently getting around 55 tokens/sec. I'm trying to find out whether there are any configuration changes that could increase throughput…

↯ Qwen 3.6 llama
Qwen 3.6 27B AutoRound GGUF, need your feedback (www.reddit.com via reddit) 2w

I have always been a fan of the AutoRound quants of this model, for some reason, it thinks less (sort of like Qwopus models) and comes up with solutions quicker than Unsloth quants for instance. https://huggingface.co/sphaela/Qwen3.6-27B-A…

↯ Qwen 3.6 qwen
How useful is qwopus compared to qwen3.6 27b (www.reddit.com via reddit) 2w

I see a lot of conflicting comments on this sub and elsewhere on how useful qwopus is compared to, for example, unsloth quants of qwen3.6 27b. Some say it’s worse, some say it’s much better.

↯ Qwen 3.6 agentic
The 'storage tax' on cloud GPUs for short LLM runs is brutal. What's your workflow? (www.reddit.com via reddit) 2w

I’m trying to test Qwen3.6-27B for agentic coding through Cline / llama.cpp, but my local box struggles once the context gets longer. (my poor 3080 just can't keep up).

↯ Qwen 3.6 cline llama agentic
What's up on CPU inference these days? (www.reddit.com via reddit) 2w

What are the best models, quants and llama.cpp versions/forks for CPU inference these days? I have AVX2 but no AVX512 - Intel core ultra 7 165H; 64G RAM This seems to ask for massive MoE (a lot of RAM, not a lot of bandwidth/compute).

↯ Qwen 3.6 moe llama
I'm brand new to running LLMs and the sheer number of tools is overwhelming (www.reddit.com via reddit) 2w

Hey everyone. I'm brand new to running LLMs in general, even more new to running them locally, and the sheer number of tools available is absolutely overwhelming.

↯ Qwen 3.6 ollama gemma qwen
Qwen 3.6 35b A3B Speed Help (www.reddit.com via reddit) 2w

↯ Qwen 3.6 moe qwen
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) 2w

I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.

↯ Qwen 3.6 moe gemma qwen+1
Gemma 4 31B's competence surprised me (www.reddit.com via reddit) 2w

I'm just getting started using local LLMs for code. I'm not interested in vibe coding, but I am hoping to increase my productivity in the publish or perish world of academia.

↯ Qwen 3.6 gemma qwen
Qwen 3.6 for coding with 5090 - Your settings recommendations? (www.reddit.com via reddit) 2w

Hi, totally new to using LLMs for coding purposes, I am on Ubuntu and currently using LM Studio with Qwen 3.6 27B Q4 on a 5090. Finding it slow and context runs out fast.

↯ Qwen 3.6 qwen
Pursuit of performance Llama.cpp to MLX (www.reddit.com via reddit) 2w

Right now, I am running llama.cpp on a M2 ultra 64gig. Having great fun with unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL - Running opencode and finding it amazing to have such great tools running locally.

↯ Qwen 3.6 llama
2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. (github.com via reddit) 2w

Forgive the Claude summary in the readme, but the base works. I'm still working on the HIP kernel and having it combine with MTP.

↯ Qwen 3.6
Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance (www.reddit.com via reddit) 2w

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointe…

↯ Qwen 3.6
5070 Ti + 5060 Ti on vLLM hangs on GDN with Qwen3.6 (www.reddit.com via reddit) 2w

↯ Qwen 3.6 vllm
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit) 2w

These last few weeks have been godsend for 24GB (and below) gpu poor peeps. Killer models released (Gemma 4 / Qwen 3.6) Free intelligence via QAT Bonus speed via MTP We're at the tipping point where GPU poor (24gb and below) people are act…

↯ Qwen 3.6 gemma qwen
[Benchmark] DFlash Speculative Decoding + KV Cache Compression on RTX 5090 — 3.26x Speedup (www.reddit.com via reddit) 2w

Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative…

↯ Qwen 3.6
