We surgically removed half the experts from Qwen3.5-35B-A3B to create 8 memory-efficient domain specialists (coding, web, math, physics, biology, engineering, vocational, humanities). A cross-domain test shows a 96-point pass@5 gap between…
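The post itself is truncated, but mechanically, expert removal on a Transformers-style MoE checkpoint comes down to dropping expert modules and slicing the matching router rows. A hedged sketch; the attribute names (mlp.experts, mlp.gate, config.num_experts) follow Qwen3-MoE's published modeling code and are assumptions for Qwen3.5:

```python
# Hypothetical sketch of 50% expert pruning on a Qwen3-MoE-style checkpoint.
# Which experts to keep per domain is the hard part (e.g. routing statistics
# on domain data); here we keep an arbitrary half purely for illustration.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-35B-A3B", torch_dtype="auto")
keep = list(range(0, model.config.num_experts, 2))  # placeholder selection

for layer in model.model.layers:
    moe = layer.mlp
    if not hasattr(moe, "experts"):
        continue  # dense layers pass through untouched
    moe.experts = torch.nn.ModuleList(moe.experts[i] for i in keep)
    # router logits must follow the surviving experts, row for row
    moe.gate.weight.data = moe.gate.weight.data[keep]

model.config.num_experts = len(keep)
model.save_pretrained("qwen3.5-35b-a3b-half-experts")
```

Whether the resulting specialists hold up depends entirely on the selection step; the headline suggests per-domain selection, but the method isn't visible in the truncated snippet.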
model
Qwen3.5-9B
huggingface.co/Qwen/Qwen3.5-9B
5,662,081 downloads · 1,256 likes · image-text-to-text · transformers
from the model card
Qwen3.5-9B

[!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.

Qwen3.5 Highlights

Qwen3.5 features the following enhancements:
- Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuance…
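The card name-drops Gated Delta Networks without explanation. For readers unfamiliar with the term, here is a single-head reference recurrence following the Gated DeltaNet literature; this is illustrative math only, not Qwen3.5's actual kernel:

```python
# Gated delta rule: the state decays by a gate (alpha) and, before writing
# a new key/value association, erases whatever value was previously bound
# to that key (beta). A naive O(T * d_k^2) loop for clarity.
import numpy as np

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k: (T, d_k); v: (T, d_v); alpha, beta: (T,) gates in (0, 1).
    Keys are assumed L2-normalized, as in the DeltaNet literature."""
    d_k, d_v = k.shape[1], v.shape[1]
    S = np.zeros((d_v, d_k))   # fast-weight state mapping keys -> values
    I = np.eye(d_k)
    out = np.empty((len(q), d_v))
    for t in range(len(q)):
        S = alpha[t] * (S @ (I - beta[t] * np.outer(k[t], k[t]))) \
            + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]      # read the value currently bound to the query
    return out
```

The appeal for a hybrid architecture is that this state is a fixed-size matrix, so the layer runs in constant memory per token, unlike attention's growing KV cache.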
discussions
- Qwen 3.5: 57 ongoing since 2026-04-13
recent items
Qwen3.5 50% expert reduction success (news.ycombinator.com via hn)
Spring benchmark update: Gemma 4 / Qwen3.5 vs Gemma 3 / Qwen3 for chat (www.reddit.com via reddit) Google and Alibaba recently shipped Gemma 4 and Qwen3.5, so I wanted to see whether the new generations are actually better on my setup. My context is private local chat running on my own hardware, a Mac mini M4 Pro.
Local Coding Stacks (www.reddit.com via reddit) I’m trying to reduce my reliance on Claude. I have a 5090/128GB RAM.
I got it guys, I think I finally understand why you hate censored models (www.reddit.com via reddit) I was trying to do an easy task automatically with qwen-code using qwen3.5-122b. I can totally do it myself, but I wanted to try, to see whether it could just do it entirely for me. But no, it refused.
Gemma 4 26B & E4B are crazy good, and replaced Qwen for me! (www.reddit.com via reddit) My pre-Gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (my third 3090 died, RIP) and 128GB of system memory. Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…
Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s (www.reddit.com via reddit) Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to s…
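The snippet above is truncated, but the standard recipe for A3B-class MoE models on a 16 GB card is to keep the small active path and KV cache in VRAM while pushing expert weights to system RAM (llama.cpp's --n-cpu-moe flag, or an --override-tensor pattern targeting the *_exps tensors). A minimal llama-cpp-python sketch under those assumptions; the filename is the one from the post, and the offload split is a placeholder to tune:

```python
# Hedged sketch: the Python binding may not expose llama.cpp's newer MoE
# offload flags, so this falls back to plain partial layer offload; with
# the CLI you'd instead keep all layers on GPU and push experts to CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-35B-A3B-UD-Q4_K_L.gguf",  # filename from the post
    n_ctx=64 * 1024,   # the 64k context the poster targets
    n_gpu_layers=24,   # raise until VRAM is nearly full, KV cache included
    n_threads=8,       # physical cores; the i7-13700F has 8 P-cores
)
out = llm("Explain KV cache in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```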
GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) — aiming for 64K ctx, 30+ t/s (www.reddit.com via reddit) Hey all, Looking for some real-world advice on GPU choices for running the new dense models — mainly Qwen 3.5 27B and Gemma 4 31B. What I’m targeting Context: 64K+ (ideally higher later) Speed: 30+ tok/s @ tg128 minimum Power: not critical…
What's your favorite small-medium local model? (www.reddit.com via reddit) I'm now having fun with Gemma-4-E4B and Qwen3.5-9B, trying different variants like Gemopus and Qwopus, and Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0. I don't quite know other models, so what's your favorite? Why, and how are they?
How much faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com via reddit) I want to download one, and I usually do inference on CPU since I have an old GPU, so I'm concerned with speed. One link on the web (I posted with it and the post was removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…
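Rough arithmetic for that question: CPU decode is mostly memory-bandwidth-bound, so speed scales with bytes read per token. A dense 31B at ~4.5 bits/weight touches about 31e9 × 0.56 B ≈ 17 GB per token, while 26B-A4B only activates ~4B parameters ≈ 2.2 GB, so on the same machine the MoE should decode roughly 7-8x faster, provided all 26B parameters still fit in RAM so cold experts aren't paged from disk.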
Hey, has anyone here used Qwen3.5-27B-NVFP4-GGUF with llama.cpp yet? (www.reddit.com via reddit) Hey! I was wondering if anyone of you have used Qwen3.5-27B-NVFP4-GGUF on RTX5090 on llama.cpp?
Been trying to get Qwen 3.5 to stop reasoning using old methods like /no_think; it didn't work, but it said something like "too late" in its reasoning (www.reddit.com via reddit) Wait, I need to be careful about the "no_think" tag in the system prompt. The system prompt says /no_think.
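For context: on Qwen3, besides the /no_think soft switch, there was a harder template-level switch, enable_thinking=False in apply_chat_template, which pre-fills an empty think block. Whether Qwen3.5's chat template still accepts this kwarg is an assumption worth verifying; the Qwen3-era mechanism, for reference:

```python
# Qwen3-era way to disable reasoning: a chat-template argument rather
# than a tag in the prompt text. Repo id taken from the model card above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B")
text = tok.apply_chat_template(
    [{"role": "user", "content": "2+2?"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # injects an empty <think>...</think> block on Qwen3
)
print(text)
```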
Please help me pick the right Qwen3.5-27B format/quant for RTX 5090 (www.reddit.com via reddit) Hi all, first post here. I started a project in OpenClaw a month ago, and it's been a very "intense" 4 weeks to say the least...
DFlash is real: 2x tg on small context with oMLX (www.reddit.com via reddit) Right from the oven with the latest commit: DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve. oMLX (LLM inference, optimized for your Mac): https://github.com/jundot/omlx. Benchmark model: Qwen3.5-35B-A3B-MLX-MXFP4-FP16…
Minimax M2.7 on Q3_K_S or Smaller Model with greater precision? (www.reddit.com via reddit) I am currently looking for models to fit into my single DGX Spark. I have an RTX Pro 6000 and also a 5090 that I'm considering using in combination if the DGX Spark is too slow, but the intent here is to play around with Op…
Qwen3.5 35B is surely still one of the best local models (punching above its weight) - More Details (www.reddit.com via reddit) Last time I posted about how this model performed in creating a webapp based on a provided research paper. I got so much love seeing that people appreciated the post and, of course, the potential of this MoE model.
Thinking issue [Qwen3.5] (www.reddit.com via reddit) I've been testing a few models lately and I'm running into a weird issue with the bigger Qwen3.5s. Tested: Gemma 4 26B Qwen3.5 9B Qwen3.5 27B Qwen3.5 35B The 27B and 35B are driving me nuts.
Summarizing text locally, medical literature (www.reddit.com via reddit) Colleagues, I have a question: does anyone have a locally developed solution for summarizing text? Which Qwen 3.5 27B quant would be able to summarize an entire chapter of medical literature, about 25-30 A4 pages, without hallucinations?
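For scale: 25-30 A4 pages at roughly 400-500 words per page is on the order of 12-15k words, i.e. ~16-20k tokens, so the quant level matters less here than allocating at least a 32k context so the whole chapter plus the summary fits in the window; chunked map-reduce summarization is the usual fallback when it doesn't.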
Any way to work with NUMA Nodes? (www.reddit.com via reddit) I bought a dual Skylake server because of its 12 channels of memory (and 2 x 3090s), THEN found out about NUMA nodes after my poor test results. Very disappointed.
Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU+GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload (www.reddit.com via reddit) Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough.
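The linked patch isn't shown here, but the core idea is a small LRU cache keyed by (layer, expert): experts that route hot stay resident in VRAM, cold ones live in system RAM. An illustrative Python sketch of that policy (not the actual llama.cpp code):

```python
# Illustrative LRU expert cache: hot experts stay on GPU up to a VRAM
# budget; least-recently-routed experts are evicted back to CPU memory.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, vram_budget_bytes):
        self.budget = vram_budget_bytes
        self.used = 0
        self.cache = OrderedDict()  # (layer, expert_id) -> GPU-resident tensor

    def get(self, key, cpu_tensor, to_gpu):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark hot (most recently used)
            return self.cache[key]
        # evict cold experts until the new one fits (or the cache is empty)
        while self.used + cpu_tensor.nbytes > self.budget and self.cache:
            _, evicted = self.cache.popitem(last=False)
            self.used -= evicted.nbytes
        gpu = to_gpu(cpu_tensor)             # caller supplies the H2D copy
        self.cache[key] = gpu
        self.used += gpu.nbytes
        return gpu
```

The reported 27% gain over layer-based offload is plausible precisely because MoE routing is skewed: a minority of experts serve a majority of tokens, so caching by expert keeps the hot set on the GPU.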
My first impressions of Minimax M2.7 (Q5_K_M) vs Qwen 3.5 27B (Q8_0) (www.reddit.com via reddit) I'm not sure if AesSedai's Q5_K_M version of Minimax M2.7 is too lobotomized or if the model itself is kind of weak. I did a simple experiment with both models running with the recommended parameters.
Loading "stacks" of models on-demand? Does a tool like this exist? (www.reddit.com via reddit) I'd like to self-host some LLM models but a couple different ones for different usecases, and they don't all fit in VRAM at the same time. So i'm kind of looking for a tool in which i can define "profiles" or "stacks" of LLM's that get loa…
I want to run qwen3.5 27B q4_k_m on CPU, and I need help. (www.reddit.com via reddit) I am a local LLM beginner and I found this subreddit while looking for help. (Please understand that I am unfamiliar with Reddit.) (System: i5 4440 1.8GHz / B85M DS3H / DDR3 32GB / 128GB SSD / Ubuntu 25.10 Questing) I loaded Qwen3.5 27B Q4_K_M onto…
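Back-of-envelope for that machine: Q4_K_M is roughly 4.8 bits per weight, so 27B parameters is about 27e9 × 0.6 bytes ≈ 16 GB of weights, which fits in 32 GB of DDR3. Decode is bandwidth-bound, though: dual-channel DDR3 moves roughly 20-25 GB/s, and reading ~16 GB per token caps generation near 1-1.5 tok/s, so expectations should be set accordingly.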
GRaPE 2 Model Family (www.reddit.com via reddit) Today I announce the first two models I am posting on here! First off, hello all of r/LocalLLaMA, nice to join.
running models bigger than physical memory capacity (www.reddit.com via reddit) Has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM. https://unsloth.ai/docs/models/qwen3.5 Even Google's Gemma 4 models are released with about 30+ bill…
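The mechanism that makes this possible at all is mmap: llama.cpp maps the GGUF file and lets the OS page weights in from disk on demand, so a model larger than RAM will load, but decode speed collapses to SSD read speed for whatever doesn't stay in the page cache. The relevant knobs, as exposed by llama-cpp-python (both are real parameters; the path is a placeholder):

```python
# mmap pages weights from disk instead of copying them into RAM,
# which is what allows bigger-than-RAM models to start at all.
from llama_cpp import Llama

llm = Llama(
    model_path="some-huge-model.gguf",  # placeholder
    use_mmap=True,    # default: let the OS page weights in on demand
    use_mlock=False,  # mlock would force-pin pages and defeat the point
)
```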
Any magic prompt so a local LLM never turns back until everything is completed? (building a frontend application with qwen3.5-35b-a3b) (www.reddit.com via reddit) https://nestia.io/articles/well-designed-backend-fully-automated-frontend-development.html Trying to generate an entire frontend application from well-designed contexts. Succeeded in fully implementing the frontend application just by one-shot promp…
Qwen 122B is AMAZING but is my config right? (128GB M4 Max) (www.reddit.com via reddit) Hi! I hope it's okay for me to ask this here.
Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? (www.reddit.com via reddit) I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.
Can an LLM make a small change to a software program? (www.reddit.com via reddit) I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 E4B Q4 and Qwen 3.5 9B Q5 (KV is quantized to 4 bits with the new Google TurboQuant implemented in llama.cpp - I use koboldcpp and the release said it's automatically activated): the…
Been out of the loop - Will this work for EXO/MLX? (www.reddit.com via reddit) Had to sell my AI server and am down to an M4 MacBook Air 16GB. If I were to buy a used M1 Air with 16GB (run it headless) and connect the two via EXO + Thunderbolt... would it be possible to run a (19.6GB) Qwen 3.5-27B-Q5_K_M.gg…
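Quick arithmetic on the two-Mac plan: 19.6 GB of weights split evenly is ~9.8 GB per machine, and macOS typically allows roughly two-thirds of unified memory (~10-11 GB on a 16 GB Mac) for GPU working sets, so the weights alone nearly exhaust both machines before the KV cache and the OS take their share. Short contexts might squeak by; anything long likely won't, and the Thunderbolt hop adds per-token latency on top.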
Why don't Groq (with a q) and Cerebras add new models (www.reddit.com via reddit) Both Groq and Cerebras haven't really updated their provided models for a while, long enough to notice the difference between old and new models on the market. So why don't they add any new models?