Overview: As consumer computing parts become scarcer due to the AI boom, finding ways to use lower-end…
model
Qwen3.5-9B
huggingface.co/Qwen/Qwen3.5-9B ↗
5662081 downloads · 1256 likes · image-text-to-text · transformers
from the model card
Qwen3.5-9B
Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.
Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.
Qwen3.5 Highlights
Qwen3.5 features the following enhancements:
- Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with Qwen3 and outperforms Qwen3-VL models across reasoning, coding, agents, and visual understanding benchmarks.
- Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuance…
discussions
- Qwen 3.5 174 2026-04-13 – 2026-05-30
recent items
Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop (www.reddit.com)
Old Mac Pro still proving its worth (www.reddit.com) The “Trash Can” Mac Pro was once the most expensive machine you could buy from Apple; mine was just shy of £10,000 in 2016, which is £14k in today’s money. Until recently mine was just running as a Kubernetes single-node development platform,…
Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? (www.reddit.com) Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual rtx 3090 - gigabyte psu 1600 w What do y…
Long-context performance at lower quants (www.reddit.com) I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k conte…
Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode (www.reddit.com) I usually don't post about Harbor releases out of respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends.
ran qwen3.5 locally on a flight with no wifi. claude code started straight-up hallucinating (www.reddit.com) heavy travel period last month, lots of offline time, and i could not stop building. airplane wifi was unusable so we switched models inside Claude Code and fired up qwen3.5 locally on an M4 macbook.
Is a 128 GB MacBook Pro M5 Max actually too slow for large-context local LLM coding workflows? (www.reddit.com) People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM. My main concern is prompt ingestion / prefill latency and large-context handling — not raw token generation speed (which I think is OK).
Stop QwenLLama! Every other 4th post in this sub is about Qwen models in the past month (www.reddit.com) Disclaimer: I use Qwen models on a day to day basis.. You could take it as a rant or even my concern about innovation in other models.
ReAct tool-calling issue: Orchestration model computes internally instead of using tools (www.reddit.com) Built a local ReAct-style calculator agent with 6 tools: add, subtract, multiply, divide, modulo, etc. The setup is: orchestrator agent; dynamic tool selection; ReAct loop; tools exposed as functions. Problem: Even when the user asks multi-step ari…
Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP (www.reddit.com) for anyone who cares... 😄 prompt = spen a 1000 tokens unsloth MTP models strix halo llama.cpp:server-rocm-mtp \ --spec-type draft-mtp \ --spec-draft-n-max 3 Qwen3.5-122B-Q5-MTP-General n_decoded = 100 tg = 29.77 t/s n_decoded = 179 tg = 27…
Show HN: Prism Coder – Qwen3.5-14B fine-tuned for MCP tool-routing decisions (github.com via hn) 🧠 Prism Coder: Persistent memory + tool-calling intelligence for AI a…
Jackrong/Qwopus3.5-9B-Coder-GGUF · Hugging Face (huggingface.co via reddit) Qwopus3.5-9B-coder is specially optimized and fine-tuned for high-performance 🤖 Agentic Coding, complex Tool Calling, and logical reasoning. 💡 Why the 9B Dense Model?
internlm/Intern-S2-Preview · Hugging Face (huggingface.co via reddit) Introduction We introduce Intern-S2-Preview, an efficient 35B scientific multimodal foundation model. Beyond conventional parameter and data scaling, Intern-S2-Preview explores task scaling: increasing the difficulty, diversity, and covera…
Want to build a ReAct-style looping agent with small LLMs (Qwen 3.5 9B / Gemma4) + LangGraph? (www.reddit.com) Currently experimenting with building a ReAct-style looping agent system using small LLMs like Qwen 3.5 9B and Gemma 4 (E2B), and I wanted to ask if anyone here has worked on something similar. Current setup: Using LangGraph Around 5 tools…
Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers (www.reddit.com) I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.
What political censorship looks like inside an LLM's weights (Qwen 3.5) (vas-blog.pages.dev via hn) A mechanistic-interpretability study of Qwen 3.5 Disclaimer. This is a mechanistic-interpretability study of how nation-state-mandated content filtering actually gets built into a deployed LLM's weights.
I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses (www.reddit.com) RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to bui…
At wits end for optimizing settings in llama.cpp for 100k context (www.reddit.com) Long story short, I am running Qwen3.5-35B-A3B (GGUF format) and other models on macOS and getting around 1500 tokens/sec for prompt processing and around 35-50 tokens per second for token generation. I'm using the latest version of llama…
Hermes Agent issues with directory creation (www.reddit.com) I'm having issues with Hermes Agent actually processing commands through the terminal. I'm doing something simple like asking it to make a dir and it tells me it has, but it hasn't.
Very happy with Qwen 3.5 122B output. But is slowness expected? (www.reddit.com) I'm running the 122-billion Qwen 3.5, specifically Qwen3.5-122B-A10B-Q5_K_M, on DGX Spark (128 GB contiguous memory). I'm (very!) impressed with the general knowledge output.
Is there something wrong with Local LLM ability to read file? (www.reddit.com) So I've been feeding the sub file of anime episodes into Claude/ChatGPT/Deepseek and asking them to find all the full names of Japanese characters in it and put them into a Python array so I can run a script to flip the names back to the original Japa…
qwen 2B model - thinks for 600 tokens on a simple "Hi" (www.reddit.com) Using llama.cpp. Model: Q8, unsloth/Qwen3.5-2B-GGUF. Is this expected with tiny models like this one? I am trying tiny models since most of the tasks I have involve searching local files etc. and need less of the model's own knowledge.
Sort of my first venture towards finetuning on a qwen 3.5 4B heretic mode (www.reddit.com) How do you deem qwen 3.5 4B heretic variants for RP finetunes? I have been struggling to get a decent instruct based model, any tips regarding the goal would be really helpful.
Full Hermes Agent tutorial (Spanish with English auto-translation). Computer Use, MCP Blender, Hindsight memory and multi-agent setup (www.reddit.com) Spent weeks running Hermes Agent in production on my Mac Mini M4 before recording this. Wanted to show things nobody else was covering.
40+tok/s - optimized recipe for Qwen 3.5 122B Int4 on a single DGX Spark with vLLM (www.reddit.com) Hello guys, two days ago I ran the spark-arena for my Qwen 3.5 122B recipe on a single DGX Spark and got the highest speed score for any context length and concurrency across all 3.5 122B Int4 recipes. Just wanted to share if somebody…
Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates (www.reddit.com) Disclosure: I made this. Open-source, MIT, Windows + Linux.
No tg speedup with MTP on RX 6800 XT (www.reddit.com) I ran Qwen3.5 9B on my AMD RX 6800 XT with ROCm and MTP seems to actually be slowing down token generation. I'm using Unsloth's quants.
Wanna try the best coding model with my rtx 3090, not sure where to start, I believe Qwen3.5-27B-UD-Q4_K_XL would be the best? if so should I use ollama with it? (www.reddit.com) I've already searched, but information is getting updated each week, so it's really hard to get an answer, I really hope some of you guys can give me some tips. And can I use an agent with it to enhance the code?
Are the rich RAM /poor GPU people wrong here? (www.reddit.com) Hello guys, I know everyone has their own definition of local models, but I see two "reasonable" types of frontier local models: a dense one that barely fits in 32 GB or 24 GB of GPU memory for the most "reasonable" GPU-wealthy guys, and an MoE in the…
MagenticLite is here: A full-stack agentic experience powered by Small Models - Fara-1.5 4B, 9B & 27B (www.microsoft.com via reddit) What if you could run a capable AI agent without leaning on frontier-scale models? MagenticLite is the next generation of Magentic-UI, an agentic experience reimagined and optimized for small language models.