Meet Qwen3.6-35B-A3B: Now Open-Source!🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
#qwen
709 items
Qwen3.6-35B-A3B released! (www.reddit.com)
Qwen3.6-27B released! (www.reddit.com) Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight.
Qwen3.6-35B becomes competitive with cloud models when paired with the right agent (www.reddit.com) A short follow-up to my previous post, where I showed that changing the scaffold around the same 9B Qwen model moved benchmark performance from 19.11% to 45.56%: https://www.reddit.com/r/LocalLLaMA/s/JMHuAGj1LV After feedback from people h…
Gemma4 26b & E4B are crazy good, and replaced Qwen for me! (www.reddit.com) My pre-gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (My third 3090 died, RIP) and 128gb of system memory Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…
Qwen will release another 27B with high probability (www.reddit.com) They are waiting for the exact roadmap
Qwen 3.7 droped on Qwen Chat (www.reddit.com) https://preview.redd.it/t4hf4x4ruw1h1.png?width=380&format=png&auto=webp&s=788c8e011e407d3e1aa49d6d4be0a813a75e03df Title
Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results (localbench.substack.com via reddit) Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results 4 models tested with q8_0 and q4_0 KV cache against full-precision baseline What this measures KV cache quantization stores the key-value cache in lower precision to s…
Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All (qwen.ai via hn) Qwen Studio offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives (www.reddit.com) MacBook Pro M5 MAX 64GB. Qwen 3.6 35B - 72 TPS.
The Qwen 3.6 35B A3B hype is real!!! (www.reddit.com) My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research. My research is on some pretty niche topics and I doubt that anything like it is…
Opus 4.7 Max subscriber. Switching to Kimi 2.6 (www.reddit.com)
Qwen 3.6 is the first local model that actually feels worth the effort for me (www.reddit.com) I spent some time yesterday after work trying out the new qwen3.6-35b-a3b model, and at least for me it's the first time that I actually felt that a local model wasn't more of a pain to use than it was worth. I've been using LLMs in my per…
So... has anyone actually figured out whose model Elephant Alpha is yet? (www.reddit.com)
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints (www.reddit.com) WARNING: wait before download from HF: I just realised my upload of the new versions with the additional fix in the chat template has not completed yet. I will remove this warning once done The recent PR to llama.cpp bring MTP support to Q…
Opencode you naughty minx (www.reddit.com) Man, AI agents getting pretty crazy these days. :) (local, I just decided to try to get an orchestrator in there, when Qwen and Gemma aren't up to it.)
Qwen cant wait to release 3.7 models (www.reddit.com)
I got it guys, I think I finally understand why you hate censored models (www.reddit.com) I was trying to do an easy task automatically with qwen-code using qwen3.5-122b I can totally do it myself, but I wanted to try, so maybe it could just do it entirely for me? But no, because it refused.
These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade (www.reddit.com) Time and time again I find posts about these fine tunes that promise increased intelligence and reasoning with base models, and I continuously try them, realize they're botched, and delete them shortly after. I sometimes do resort to a low…
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (www.reddit.com) Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens.
Bad news: Apple drops high-memory Mac Studio configs (9to5mac.com via reddit) Looks like Apple has quietly killed off the higher-memory Mac Studio options. The M3 Ultra Mac Studio is now only available with 96GB RAM.
Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) (www.reddit.com) TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp…
Waiting for Qwen 3.7 open weight... The new King has arrived... (www.reddit.com) The hype is real! https://qwen.ai/blog?id=qwen3.7
A rare look inside Qwen 3.7’s open source model release approval process: (www.reddit.com) For real tho, 9b, 27b, 122b, I don’t really care at this point, just show us that you still love us. EDIT: I guess I gotta use /s on my posts from now on.
Getting a feel for how fast X tokens/second really is. (www.reddit.com) I love following all your adventures with local LLM setups. Quality and size of the models are important, but so is performance.
Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com) Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval: code generation HellaSwag: commonsense reasoning BFCL: function calling Total samples: HumanE…
Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) (www.reddit.com) The following is a non-comprehensive test I came up with to test the quality difference (a.k.a degradation) between different quantizations of Qwen 3.6 27B. I want to figure out what's the best quant to run on my 16 GB VRAM setup.
I made a visualizer for Hugging Face models (www.reddit.com) I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different mod…
PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on. (www.reddit.com) I had previously posted here about a fix to their 3.5 template to help resolve the KV cache invalidation issue from their template. A lot of you found it useful.
KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche (www.reddit.com) Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantizations affects…
Opinion: Qwen 3.6 27b Beats Sonnet 4.6 on Feature Planning (www.reddit.com) I keep hearing the argument that large models are better for high-level planning and task orchestration, since they have more general knowledge to work from when making decisions. However, I've been testing Qwen 3.6 27b (Unsloth Q5_K_…
Switching from Opus 4.7 to Qwen-35B-A3B (www.reddit.com)
Do you guys think there’s a high chance of Singularity being open source? (www.reddit.com) GLM 5.1 is dominant in almost every aspect in Design arena, surpassing Opus 4.6 in many tasks. Although user experiences vary dependent on subscription plans for both of those one of them is open source.
Qwen 3.6 35B crushes Gemma 4 26B on my tests (www.reddit.com) I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information fr…
Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) (www.reddit.com) I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…
Local Qwen 3.6 vs frontier models on a coding primitive: single-file HTML canvas driving animation - results and GIFs (www.reddit.com) Saw this post comparing Qwen 3.6 variants on coding primitives, so I wanted to see how local quants stack up against frontier models on a similar dense, single-file coding task. I ran the exact same prompt across local and web-based models…
Qwen Introduced FlashQLA (www.reddit.com) Introducing FlashQLA: high-performance linear attention kernels built on TileLang. 2–3× forward speedup.
Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs (www.reddit.com) Hey r/LocalLLaMA, We’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (Next Token Prediction or non-MTP) and MTP. Blog / Download NTP Models / Download MTP Models TL;DR For NTP, “pick the largest quan…
Qwen3.6-27B Uncensored Aggressive is out with K_P quants! (www.reddit.com) The dense sibling of the 35B-A3B drop is here, Qwen3.6 27B Uncensored Aggressive is out! Aggressive = no refusals; NO personality changes/alterations or any of that, it is the ORIGINAL release of Qwen just completely uncensored https://hug…
Qwen/Qwen-Image-Bench · Hugging Face (huggingface.co via reddit) Model Description Q-Judger is a vision-language model fine-tuned specifically for automated evaluation of text-to-image generated images. Given a text prompt and a generated image, the model evaluates the image on fine-grained quality crit…
Waiting on Qwen to drop those 3.7 models be like: (www.reddit.com) Mods please be kind. This was not “low effort”.
MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) (www.reddit.com) TL;DR Results from the title are for single inference with 2 prompt of 1k and 15k tokens. So no MTP (as it’s slower for big prompt), no DFlash (working too but slower for big prompt), no quant used (full precision wanted) and the results a…
Qwen3 TTS is seriously underrated - I got it running locally in real-time and it's one of the most expressive open TTS models I've tried (www.reddit.com) Heya guys and gals, Around a year ago I released and posted about Persona Engine as a fun side project, trying to get the whole ASR -> LLM -> TTS pipeline going fully locally while having a realtime avatar that is lip-synced (think VTuber)…
Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (huggingface.co via reddit) Qwen Team released Qwen-Scope — a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B to 35B MoE). They’ve mapped internal features for the residual stream across all layers.
Qwen 3.7 Has been Spotted on the Qwen website (www.reddit.com)
GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark (www.reddit.com) GPT-5.5: xhigh: 94.0→97.5 high: 93.6→96.9 medium: 92.0→95.0 no reasoning: 32.8→37.5 Kimi K2.6 improves over Kimi K2.5 (78.3→91.4) and becomes the #1 open weights model. DeepSeek V4 Pro improves over DeepSeek V3.2 (50.2→75.7).
I’ve used enough AI models to realize they all have wildly different personalities At this point I’m convinced AI models are just coworkers with different levels of talent, ego, and criminal energy. (www.reddit.com) - Claude Opus 4.6 - absolute rogue AI. Does what I want like it’s breaking at least 3 internal policies to make it happen.
BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline. (www.reddit.com) BeeLlama v0.2.0 is here! Not quite a pegasus, but close enough.
Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp (www.reddit.com) Note: First is Qwen3.5 35B MoE (Left) and Second is Qwen3.6 (Right) Hi Guys Just did quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE. with reasoning off using llama.cpp and same quant unsloth 4 K_XL GGUF First is Qwen3.5 outco…
"Browser OS" implemented by Qwen 3.6 35B: The best result I ever got from a local model (gist.github.com via reddit)
Guys we have to change the pelican test (www.reddit.com) So i have been seeing more of those pelican on a bike svg tests and while they work i feel like (and maybe you guys do too) they are getting kinda benchmaxxed so we should switch things up soon and this is my idea generate me a html svg of…
Qwen 35B-A3B is very usable with 12GB of VRAM (www.reddit.com) Hardware: RTX 3060 12GB 32GB DDR4-3200 Windows CUDA 13.x Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.
Qwen3.6 35B A3B Heretic (KLD 0.0015!) Incredible model. Best 35B I have found! (huggingface.co via reddit) Been using this for a few days. It is BY FAR the best uncensored model I have found for Qwen 3.6 35B.
Get faster qwen 3.6 27b (www.reddit.com) Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp Thought I would knowledge share Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF And am17an commit /media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/ll…
Qwen is cooking hard (www.reddit.com) I am waiting for 122B and new 27B
A Qwen finetune, that feels VERY human (www.reddit.com) Hello guys, So TL;DR, I was asked by multiple people to make an Assistant_Pepe_32B version, but the best base model contender was Qwen3-32B, a model that is very hard to tune on anything other than STEM. The concept of Assistant_Pepe is an…
Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? (www.reddit.com) I have a 5090, so my VRAM is limited to 32GB, but i find that Qwen3.5-27B-UD-Q5_K_XL with opencode (and mmproj) does a pretty good job for my use case (mainly web development). i use claude and codex here and there, recently a lot less, be…
Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room (www.reddit.com) https://preview.redd.it/42ak5qmus82h1.png?width=1133&format=png&auto=webp&s=744ea3dfc06c83d0c4d8aa128c39b3238b17d7be Qwen 3.7 Max sitting at 5th, pretty much on par with GPT 5.4 (xhigh) and a notch above the just released Gemini 3.5 Flash.…
Will there be any more Qwen3.6 series models? (www.reddit.com) I'm still hoping we see a Qwen3.6-122B or a Qwen3.6-coder, but my hopes are dimming. Seems like we would have seen/heard something by now, even if just tantalizing hints from the Qwen folks.
Consider running a bigger quant if possible (www.reddit.com) Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context was very much disappointed because it would loop, make formatting errors, implement wrong things etc.
Qwen 3.6 27b IQ4_XS - 22 tp/s on RTX 5060TI 16b, 24k ctx (www.reddit.com) Maybe it be helpful for someone: llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000 Cant run this model with higher kv quants on >8192ctx size. -ub & -b setted for 256 allowed me for…
Hugging Face co-founder says Qwen 3.6 27B running on airplane mode is close to latest Opus in Claude Code (www.reddit.com) I'm keeping a close eye on the development of local llms.
I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality (www.reddit.com)
Qwen 3.6: worse adherence? (www.reddit.com) Just swapped Qwen 3.5 for the 3.6 variant (FP8, RTX 6000 Pro) using the same recommended generation settings. My stack is vLLM (v0.19.0) + Open WebUI (v0.8.12) in a RAG setup where the model has access to several document retrieval tools.
Qwen3.6 27B FP8 runs with 200k tokens of BF16 KV cache at 80 TPS on a single RTX 5000 PRO 48GB (www.reddit.com) ----START HUMAN TEXT---- Hi all, I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with quantized KV wil…
I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls (www.reddit.com) I've been running structured output prompts through a bunch of models on OpenRouter for the past few months — Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter — alongside the usual closed-source suspects. 28…
Hello from 10KM high! - Thanks to Qwen 3.6 35b a3b! (www.reddit.com) Typing this on a cramped flight, but I was having issues connecting to the plane's wifi on my ubuntu laptop, when it was effortless on my phone. The issue I was having was the Laptop WiFi connected to the plane wifi network, but captive po…
PSA: Having issues with Qwen3.5 overthinking? Give it a tool, and it can help dramatically. (www.reddit.com) I'm sure everyone has seen the posts from people talking about Qwen 3.5 over-thinking, or maybe you've experienced it yourself. Considering we're like 2 months out from the release and I still see people talk about this issue, I decided it…
Qwen/Qwen3.6-27B · Hugging Face (huggingface.co via hn) Qwen3.6-27B [!Note] This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransforme…
Same 9B Qwen weights: 19.1% in Aider vs 45.6% with a scaffold adapted to small local models (www.reddit.com)
Guys, I found a use case for my 10$/m LLM Server: Cooking (www.reddit.com) Basically, I use To Good 2 Go a lot, get random food, take a photo and ask Qwen 3.5 128B what the fuck to cook. Beyond pasta and pizza, I have zero cooking skills.
Local model on coding has reached a certain threshold to be feasible for real work (www.reddit.com) We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. Best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout — the same constraint the public…
What happens to local LLM if/when LLMs are no longer released for free? (www.reddit.com) I’m thinking about where this might wind up in 3-5+ years. As others have noted there’s no guarantee that Qwen, Google, and others will continue to release models in the future.
FINAL-Bench/Darwin-36B-Opus · Hugging Face (huggingface.co via reddit) https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF Darwin-36B-Opus is a 36-billion-parameter mixture-of-experts (MoE) language model produced by the Darwin V7 evolutionary breeding engine from two publicly available parents:…
Xiaomi Mimo-V2.5 Released, looks like today is big day for Open-Weight releases (www.reddit.com) https://preview.redd.it/fxkx8wyzqrwg1.png?width=1152&format=png&auto=webp&s=e141f0f1c5f680ab602115d06dc4621e46309983 Qwen-27b now this!
My thought on Qwen and Gemma (www.reddit.com) This spring is really hot since the localLLM giant, both Qwen and Gemma released major models. I'm really excited with those release and happy with their capability.
Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM (www.reddit.com) Hi everyone, I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp.
MTP benchmark results: the nature of the generative task dictates whether you will benefit (coding) or get slower inference (creative) from speculative inference. No other factor comes close. (www.reddit.com) I recently published MTP quants of Qwen 3.6 27B and I was suprised by the reports here on reddit, and on HF, of users who were experiencing worst speed with speculative inference than without. This did not match what I was seeing, but when…
Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp (www.reddit.com) Hey everyone, I’ve been building a local-first desktop PDF reader that can read technical books aloud and keep the spoken text highlighted while reading. The original motivation was pretty practical: I read a lot of programming and technic…
Have Qwen said anything about further Qwen 3.6 models? (www.reddit.com) Have Qwen hinted at whether other models (9B, 122B, 397B) would be getting the 3.6 treatment? Or have they in any way confirmed or hinted at "this is it"?
BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) (www.reddit.com) TL;DR New llama.cpp fork! I wanted a Windows-friendly inference to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled.
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM (www.reddit.com) Greetings from former TurboQuant's biggest defender, now middle-sized niche-aware TurboQuant defender. Today I'm presenting to you the results of me thoroughly exploring the world of PPL and KLD benchmarks with my single RTX 3090 using Bee…
Any news (or hope) of Qwen-3.6 14B and 9B distills for local coding ? (www.reddit.com) As the title suggests. I'm already testing (with some success, and few challenges) usage of Qwen-3.5 9B with a new work laptop that I've received with RTX 1000 6GB VRAM (I know it seems like a joke in today's time and age).
Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. (open.substack.com via reddit) Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article...
Qwen 3.6 35B A3B vs Qwen 3.5 122B A10B (www.reddit.com) Does anyone else have the same experience comparing these two - for me 3.5 122B outperforms 3.6 by a large margin. 3.6 gets lost as long as the task requires a couple of more steps.
Qwen3.6-27B 4.256bpw in full VRAM on a 5070 Ti with 50000 q4_0 context - not turbo! (www.reddit.com) Hugging face link here. Ive been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs as his Qwen 3.5 was my GGUF of choice.
Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver (www.reddit.com) So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up…
For chat and Q&A: Which MoE model is better: Qwen 3.6 35B or Gemma 4 26B (no coding or agents) (www.reddit.com)
Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m? (www.reddit.com) I dont have good experience running q4_k_m, the difference to q6 is "a few errors an hour" to " a few errors every couple of days". Edit: How it fails?
Qwen3.6 35Ba3 has changed my workflows and even how I use my computer (www.reddit.com) My workflow has changed basically to ask Codex to do certain tasks and then document how to do them (including errors it found on its way) into a skill. I feed that skill to pi, and suddenly my qwen3.6 gets that hard stuff done: - devops o…
The more I use it, the more I'm impressed (www.reddit.com) Qwen 3.6 27b vs Codex GPT 5.5 / Claude Opus 4.7 My local llm discovered a bug that they both missed And it turns out it's critical GPT 5.5 and Claude both stood their ground and didn't give up until the end - they claimed to be right all a…
Abliterlitics: Benchmarks and Tensor Comparison for Heretic, Abliterlix, Huiui, HauhauCS for GLM 4.7 Flash (www.reddit.com) This is a follow up to the previous benchmark and tensor analysis of abliteration techniques across the Qwen model family. Same approach, same toolkit, new model family.
What starts to become possible with two 3090s that wasn't with just one? (www.reddit.com)
Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) (www.reddit.com) Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task.
My first impressions of Minimax M2.7 (Q5_K_M) vs Qwen 3.5 27b (Q8_0) (www.reddit.com) I'm not sure if the AesSedai's Q5_K_M version of Minimax M2.7 is too much lobotomized or if the model itself is kind of weak. I did a simple experiment with both models running with the recommended parameters.
The pacman benchmark: finally a viable local agentic coding agent with Qwen 3.6 27b (www.reddit.com) One way I like to test new models, is by one-shoting (with a good prompt) a single webpage clone of the classic arcade game pacman. I usually do 3 attempts and keep the best one.
Poolside Laguna XS.2 (www.reddit.com) 33B A3B MoE, Apache 2 licensed. Reported agentic results put it about level with Qwen 3.5 35B A3B, behind the 3.6 version.
Tested how OpenCode Works with SelfHosted LLMS: Qwen 3.5, 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash - v2 (www.reddit.com) I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: - Create IndexNow CLI in Golang (Easy Task) and - Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwe…
Web OS result from Qwen3.6 35B is by far the best I tested in my laptop (codepen.io via reddit) This is my first test with this model and Qwen impressed me. I will rate it 98% usable web os compared to my previous best 70% usable result from qwen3 next coder at q2.
I've seen a lot of folks ask "can local LLMs actually do anything useful?" (www.reddit.com) And I'm here to share my experience. The answer is resoundingly 'yes'.
What it feels like to have to have Qwen 3.6 or Gemma 4 running locally (www.reddit.com) Well or pretty close to it, they are excellent work horses. I run them in real work scenarios doing some of the work I used to do myself as an skilled expert in my field, billing 200$ an hour.
Claude Opus 4.8 distilled Alibaba Qwen models (twitter.com via hn) Max For AI @MaxForAI LOL, Claude Opus 4.8 distilled Alibaba's Qwen. Ask it "who are you" in Chinese through the API and it will very likely answer: "I am Tongyi Qianwen (Qwen), a very-large-scale language model independently developed by Tongyi Lab under Alibaba Group." 5:38 PM · May 28, 2026
Kv cache quantization: ignorance, or malice? (www.reddit.com) I run Qwen-3.6 27B FP8 on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between spee…
Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max (www.reddit.com) Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K.
Old Mac Pro still proving its worth (www.reddit.com) The “Trash Can” Mac Pro, once the most expensive machine you could buy from Apple, mine was just shy of £10,000 in 2016 — that’s £14k in today’s money. Until recently mine was just running as a kubernetes single node development platform,…
Can't replicate Reddit numbers with Qwen 27B on a 3090TI. (www.reddit.com) I feel like i'm going insane. I see people here posting 30 - 100+ tok/s (100+ being with speculative decoding) on a 3090 with Qwen 3.6 27B.
Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant (www.reddit.com) Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.
Qwen 3.6 q8 at 50t/s or q4 at 112 t/s? (www.reddit.com) What are some ways that you would go about thinking about choosing between the two for use in a harness like pi? Did a good bit with q4 yesterday and it was so consistent and reliable I had it set to 131k context and it worked through 2 co…
MTP (Multi-Token Prediction): 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro (www.reddit.com) https://preview.redd.it/8gpkg8zxmy1h1.png?width=1672&format=png&auto=webp&s=a95db16a39cdc49c0ff155117b734d413a49c2d3 https://youtu.be/MI0Pm1d6YF4 MTP can accelerate LLM inference 2x, especially for coding agents. This video covers what MTP…
Models and Quants quality test results - the chessboard svg (Qwen3.6 27B/35B-A3B/Zaya1) (www.reddit.com) According to this. I run several more tests to cover more models and quants.
z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet? (huggingface.co via reddit) Past few days, its all been about MTPs. Somehow people missed out the fact that Z lab released the Dflash for Gemma4 26B a couple of days ago.
Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. (www.reddit.com) I wanted to share an open-source app that I built for running LLMs locally on my setup. My setup Hardware FEVM FAEX1 (128GB) RTX Pro 5000 Blackwell (48GB), connected over OCuLink Aoostar AG02 2x2TB internal m.2 drives on raid-0 using mdadm.
Are Qwen 3.6 27B and 35B making other ~30B models obsolete? (www.reddit.com) Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows.
Devs using Qwen 27B seriously, what's your take? (www.reddit.com) For developers using Qwen 27B for coding, Codex style: what's your honest take? So far, for me, it's been pretty solid.
Qwen3.5/3.6 Coder? (www.reddit.com) With practically all of LocalLlama glazing Qwen 3.5/3.6 for it's coding skills. Along with the fact that Alibaba themselves are focusing on making Qwen a reliable coding agent, does this rule out the chance for a new Qwen Coder?
DeepSeek v4 - Subjective vibes (www.reddit.com) I must say Iam kinda torn what to think about those models. At one hand they "ace" some questions on other sometime they behave genuinely weird.
Did you know that you can use Qwen3.5-35B-A3B-Base as an instruction/reasoning Model? (www.reddit.com) https://huggingface.co/mradermacher/Qwen3.5-35B-A3B-Base-GGUF Yes, Qwen 3.6 is out and it's a great model. However, who needs an even more "uncensored but official" model, can try out this one.
Qwen3.6-35B-A3B vs Gemma4-26B-A4B (www.reddit.com) Just wondering how are people's experience with both these models! I've had some nice results with Qwen but Gemma4 runs so much faster here.
Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? (www.reddit.com) Some people say they’d never go under Q8, and others say they find Q3 acceptable! What’s your take?
Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) (www.reddit.com) In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests.
Is there anyway to run bigger models at 20t/s with 24vram + 64gb ram DDR5? (www.reddit.com) I know the new Qwen 27B is amazing right now for coding in general, but since 122b is supposed to be coming as well, it’s expected to be better I guess ? I am actually surprised at how this dense model performs I haven’t used Codex at all…
Qwen3.6 agent + Cisco switch: local NetOps AI actually works! (www.reddit.com)
[cupel] M5 Max 128GB: Qwen3.5-397B IQ2 @ 29 tokens per second (www.reddit.com) A year ago I would just read about 397B league of models. Today I can run it on my laptop.
hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) (www.reddit.com) A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…
Qwen3.6-35B-A3B-UD-IQ4_XS C++ to Rust Code Port Test: It Worked (Mostly)! (www.reddit.com) When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5…
Llama.cpp parameters for Qwen 3.6 with RTX 3090 (www.reddit.com) Hi, I'm trying to run Qwen 3.6-35B on my RTX 3090 (24 GB of VRAM) but I'm not sure about 2 things: - Which variant of the model to use? (Q4_K_S, Q3_K_XL, other?
KV cache compression on Qwen 3.6 — 1M context: 10.7GB → 6.9GB (V: 3.5× smaller) (www.reddit.com) Quick demo of KV cache compression on Qwen 3.6 at 1M context. In this run: KV cache: 10.74 GB → 6.92 GB V cache: 5.37 GB → 1.55 GB (~3.5× reduction) Still seeing near-zero PPL change in early tests (3 seeds), but focusing mainly on memory…
Curated a list of 550+ free or cheap AI tools for vibe coding (LLM APIs, IDEs, local models, RAG, agents) (www.reddit.com) Been vibe coding a lot recently and kept running into the same problem finding actually usable tools without paying for 10 different subscriptions or donating my bank balance to Claude. So I put together a curated list focused on free or l…
2x Asus Ascent GX10 - MiniMax M2.7 AWQ - cloud providers are dead to me (www.reddit.com) Hello, I've been on a quest to get something "close enough" of Opus 4.5 running locally, for agentic coding, as SWE with 15 years of experience. I tried with one spark (yeah I'm calling my Asus Ascent GX10 sparks - they're the same), with…
Qwen3.7-Plus: Multimodal Agent Intelligence (qwen.ai via hn)
Anyone with 4x 5060ti based setups? (www.reddit.com) I am currently running 2x RTX 5060 ti and happened across some good sales for additional ones coinciding with a really good sale of a highend Z890 motherboard (replacing my B860 board) that could support quad GPUs (with 2 M.2 adapters, end…
Preserve thinking on or off? (Qwen 3.6) (www.reddit.com) Are y'all using the preserve thinking flag or do you have it off? If so, why?
vLLM Just Merged TurboQuant Fix for Qwen 3.5+ (www.reddit.com) Previously it was throwing a 'Not Implemented' error due to Mamba layers. Going to test it now!
Qwen 3.6-35B-A3B KV cache part 2: PPL, KL divergence, asymmetric K/V, 64K row on M5 Max (www.reddit.com) Followup to yesterday's post: https://www.reddit.com/r/LocalLLaMA/comments/1sy7srk/. Comments asked for perplexity, KL divergence, asymmetric K/V combos, and a 64K data point.
Qwen 3.6 27B on Strix Halo 128GB: any experiences? (www.reddit.com) I'd jump on runpod and ssh in to test my workloads, but they don't have it. Would love to know how well this runs, particularly as context approaches a full 256K.
Optimizing Qwen 3.6 35B A3B sampling parameters. (www.reddit.com) I am trying to optimize Qwen 3.6 35B A3B sampling parameters but I am having a hard time figuring out a good benchmark to do it. As to why I believe that the recommended settings may not be optimal?
Don't ask Qwen 3.6 35b to give you aski image of Yoshi :) (www.reddit.com) https://preview.redd.it/dfqed57qgsvg1.png?width=1706&format=png&auto=webp&s=3859209698d2e844e2731326e355d60928658f8a The most fun part was reasoning, here is a gist: https://gist.github.com/anzax/5f06716c66180013cd715f6c2e5848df There is a…
Single question llm comparison (www.reddit.com)
Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop (www.reddit.com) https://preview.redd.it/u8062juegq3h1.png?width=1919&format=png&auto=webp&s=a213f6929c6cad58e92bc1681dac9f0545b04d13 Overview: As the market for consumer computing parts becomes more scarce due to the AI boom, finding ways to use lower-end…
Removing Vision from model (www.reddit.com) I removed the mmproj file from models to remove vision and save my VRAM. But just curious, does this really not affect its text ability?
Blackwell and PDL performance increase (www.reddit.com) Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), which is a new feature in Nvidia GPUs (CC >= 90, not including ADA) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels…
Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future. (www.reddit.com) I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".
Choosing a Mac Mini for local LLMs - what would YOU actually buy? (www.reddit.com)
GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) - aiming for 64K ctx, 30+ t/s (www.reddit.com) Hey all, Looking for some real-world advice on GPU choices for running the new dense models, mainly Qwen 3.5 27B and Gemma 4 31B. What I’m targeting: Context: 64K+ (ideally higher later); Speed: 30+ tok/s @ tg128 minimum; Power: not critical…
Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context. (www.reddit.com) Used the vllm version of https://github.com/noonghunna/club-3090 It worked fine for maybe 20 40k context, haven't tried the new one. Anyone used the new llama.cpp patched one for single 3090?
qwen3.6 just stops (www.reddit.com) https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens…
High VRAM local coding model — still Qwen 3.6 27B? (www.reddit.com) I’ve been using Qwen 3.6 27B and it’s amazing. Not exactly your Opus replacement, but great for small tasks and checking work.
Has anyone bought a 3080 20GB mod recently? (www.reddit.com) I think it would suit my needs perfectly, but I'm scared of getting scammed on Alibaba so looking for some sellers who have delivered. Follow-up question for those who have the card, how well does it run Qwen 3.6 27B?
Qwen-27B as a Local Agent — It Actually Works Now (www.reddit.com) It's been a busy week testing and trying to get the 27B model set up correctly. TL;DR: The only setup that worked for my dual 3090s was this one.
Throughput and TTFT comparisons of Qwen 3.6 27B, Qwen 3.6 35B A3B and Gemma 4 models on H100 (www.reddit.com) I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple.
Small Gemma 4, Qwen 3.6 and Qwen 3 Coder Next comparison for a debugging use-case (www.reddit.com)
Local model doing accounting tasks (www.reddit.com) So I've been using qwen 3.6 27b for monthly closes, bank recs, payable and receivables. Built a simple sql lite database it manages.
Embeddings for NVIDIA's Nemotron Personas (www.reddit.com) I extracted embedding vectors for nvidia/Nemotron-Personas dataset. It's an incredible resource consisting of millions of synthetic personas with detailed backgrounds (names, ages, occupations, hobbies, and more), but finding specific pers…
Number-aware embeddings (www.reddit.com) If you look at the cosine sim between the embeddings of "a 500 hp car", "a 1,200 hp car" and "a 73 hp car", you'll soon see that embedding models have no sense of number ordering at all. (I tested Qwen and ModernBERT-based embeddings) It m…
Looking to migrate off of Ollama and LMStudio (www.reddit.com) Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.
Qwen3.6-27B - Closed-loop SVG Images (www.reddit.com) Yesterday, I saw an impressive presentation of Qwen 3.6 27B's SVG capabilities on the sub. To maximize the model's capabilities in terms of SVG generation, I put together a closed-loop harness with the help of Claude and Codex, and plugged…
I got a Qwen sticker lol (www.reddit.com)
Qwen Models are such good models? (www.reddit.com) https://preview.redd.it/o1uxb57u47yg1.png?width=862&format=png&auto=webp&s=d38204fe6ccd0d8326dcd98a534e9a226d213f99 How trustworthy is the Artificial Analysis intelligence index? So according to them Qwen 3.6 27B is better than bigger MoE mod…
If anyone is running qwen 9b or 27b or 35b and getting wrong facts while web search, follow this. (www.reddit.com) Try to go with SearXNG, as you get search results from multiple engines and it's open source. Use firecrawl / jina / fetch for reading the source.
Qwen 3.6 27B in Claude Code says it will do something then stops and prompts for user reply (not failing a tool call) (www.reddit.com) I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool-call-parser qwen3_xml \ --enable-prefi…
llama.cpp DeepSeek v4 Flash experimental inference (www.reddit.com) Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here there is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even quantized at 2 bit, looks very solid in my limited testing, a…
Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help (www.reddit.com) The new dense model is great, but I’m trying to figure out how to increase PP and Token generation speed. I’m running Q8 quants across 3 7900xtx GPUs and I’m consistently only getting 18-20 t/s generation speed and ~650 t/s prompt processi…
Qwen 3.6 35B A3B, RTX 5090 32GB, 187t/s, Q5 K S, 120K Context Size, Thinking Mode Off, Temp 0.1 (www.reddit.com)
$400 Qwen 3.6-27B Setup - Dual RTX 3060 - 30-50 t/s (www.reddit.com) I picked up a 7900 XTX earlier which runs qwen3.6-27b fine, but not to my liking. Its compute performance is quite unstable for me.
Is there any case of a less quantised smaller model outperforming a more quantised larger model? (www.reddit.com) As per the title. Such as Gemma 4 31B Q4 K S vs Gemma 4 26B A4B Q8, or Qwen 3.6 27B Q4 K M vs Qwen 3.6 35B A3B Q6 K, etc. At what point is it worth switching? My use case is mostly creative writing.
If you use continue.dev and Qwen 3.6 (dense / MoE) - I could use your help (www.reddit.com) Someone suggested I give Continue (Vscode extension) a try. I've been using Roo / Zoo now and liking it but it is pretty tough on context and I was told continue has more control over it.
Local AI video pipeline review: Qwen3 27B beat Gemma 4 26B for tool calling (www.reddit.com) Watched All About AI's 100% local Fireship-style video automation experiment over the weekend (link in comments). A few things worth flagging if you're trying the same stack.
Speeding up local LLM for usable coding agent (www.reddit.com) TL;DR: Qwen 3.6 35B-A3B (Q4_K_M) is running slow at around 9 t/s with 72% filled context (36147 tokens window) and a total response time of 77s including prefill and token generation. Ran this using LM Studio on Windows with the attached i…
Qwen/WebWorld 32B/14B/8B (Qwen3 finetune) (www.reddit.com) WebWorld is a large-scale open-web world model series for training and evaluating web agents. It is trained on 1M+ real-world web interaction trajectories via a scalable hierarchical data pipeline, supporting: Long-horizon simulation (30+…
Two related prompts, different results: Qwen 3.5 and Gemma 4 need different prompting than Qwen 3.6 (www.reddit.com) With every new model release there's the "better than Opus 6.13" guys vs the "this is so bad, why did they even release it" camp and I'm always wondering which one is using it wrong. So I did a little test with 2 related prompts, 3 models…
Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. (www.reddit.com) I’m looking for a tool or calculator that can estimate the minimum hardware needed to run a specific model locally. For example, I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds.
Deep research + report "a la McKinsey" with Hermes Agent and qwen3.6-35b-a3b Q6_K. (github.com via reddit) Hi there. Not native English speaker.
What's your tps on 3090 + Qwen 3.6 27B in real tasks? (www.reddit.com) I struggle to wrap my head around all this. My goal is local agent to solve low complexity tasks, in the same harness where I would use frontier models.
Your local LLM predictions and hopes for May 2026 (www.reddit.com) Which of these do you think we'll get in May? Also, feel free to pick/rank which ones you'd want the most badly: more Gemma4 models (124b?) (other sizes?) more Qwen3.6 models (9b?
Tell HN: Qwen Free Tier Is Discontinued (news.ycombinator.com) I kept getting 401 'token expired' errors on my existing Qwen session. Attempting to resume it after quitting, I got: qwen resume [API Error: 401 invalid access token or token expired] [API Error: 401 invalid access token or token expired]…
Llama.cpp vs LM Studio on gaming PC (www.reddit.com) Here is my experience, I've been using LM Studio with RTX 5080 and 64GB RAM using Windows 11. I'm very happy with LM Studio except the speed.
Show HN: Hitoku Draft – Context aware local assistant (hitoku.me via hn) Hi guys. I have been working on Hitoku Draft, an open-source, voice-first AI assistant that runs entirely locally.
China Expands Travel Curbs to Top AI Talent at Private Firms (www.reddit.com) https://www.bloomberg.com/news/articles/2026-05-26/china-expands-travel-curbs-to-top-ai-talent-at-private-firms Now it will be much harder to poach Chinese AI talents like the former Qwen head Junyang Lin. It is quite sad that they will al…
Llamacpp server : How do the -np and -c flags interact? (www.reddit.com) I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.
Qwen 3.6 benchmarks on 2x RTX PRO 6000 (www.reddit.com) Got a chance to play around with a 2x RTX PRO 6000 setup, so sharing some numbers for Qwen 3.6. All these were run using the latest stable vLLM backend.
How can you stop your model from looping (www.reddit.com) So i thought this is a small model issue but when i added a new gpu and i am able to run low mid model like Qwen 3.6 35b q4 or q5 this issue still exists now its not as much as small model but it does break when linking the model to copilo…
Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled (www.reddit.com) My setup is heterogeneous, I originally acquired my server (Lenovo ThinkStation P3 Tower Gen 2) to run OpenShift/K8s clusters (because I work on that), and…
Now that MTP is merged... What's the best outputs you're getting on Qwen 3.6 35B on 2x3090s? (www.reddit.com) We've got great outputs for 27B via club 3090, but what about those of us who love the blazing speed of 35B on dual 3090s? I was getting 1500 p/p and 120 t/g with split layers, but MTP slowed it down to 80 t/g when I tested last week.
Dynamically allocating compute budget to hard set of problems and evolving the sections with Qwen-35B-A3B gets you near GPT-5.4-xHigh on HLE (www.reddit.com)
Qwen 3.6 27B: IQ3XXS KV Q8 vs Q4XL KV Q4 (262K context) (www.reddit.com) hey yall. So I have a 24GB gpu.
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (www.reddit.com) I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…
Building out my tool library, any recommendations? I just added email capability and im starting to get hyped! (www.reddit.com) I'm using OpenWebUI and making tools/skills to improve my model's functionality. I am currently using Qwen 3.6 35B A3B Q8 (F16) 256k. I grabbed `parallel tools` to be able to run multiple tool calls at once..
Qwen doesn't work for free (www.reddit.com)
Most people seem obsessed with token generation speed, but isn’t prefill the real bottleneck? Am I missing something? (www.reddit.com) I read this sub every day and I keep seeing benchmarks and discussions focused almost entirely on tokens/s generation speed. Prompt processing speed barely gets mentioned.
What do you use Gemma 4 for? (www.reddit.com) Both Gemma 4 and Qwen 3.6 seems to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks.
Why run local? Count the money (www.reddit.com) I’m not a coder, but I run local models. I gave in to agent hype (I was building my own, but there is so much to do) and installed Hermes.
Peanut - Text to Image Model (Open Weights coming soon) (www.reddit.com) A new anonymous model debuts at #8 in the Artificial Analysis Text to Image Arena! Peanut’s weights are expected to be released soon, which would make it the leading Text to Image Open Weights Model.
Anyone tried +- 100B models locally with foreign languages? (www.reddit.com) I am quite curious as I tried Gemma 4 31B, Qwen 3.6 27B, GLM 4.7 30B and some others in my native language (Czech). Gemma performs "best", and considering the fact it's "just" an 18GB model, it actually blows my mind how well it can respond in…
I hate this group but not literally (www.reddit.com) True story, I got interested in AI after seeing it at work and wanted to run models locally. I started with an M3 Ultra 96GB, quickly learned it was not enough for what I wanted, and kept upgrading hardware (including refurbished Mac Studi…
Tenstorrent TT-QuietBox 2 Specifications (Blackhole) (www.reddit.com) Source: https://docs.tenstorrent.com/systems/quietbox/quietbox-bh-2/specifications.html Currently supported models: https://tenstorrent.com/developers From the specification docs above: CPU: Ryzen 7 9700X 65W Granite Ridge 3.8GHz Memory: 2…
Sorry if it's not the best place to ask this, of the models in the image, which is the best for (problem solving)/Coding and the best one for studying (ask LLM concepts) ? My PC build is RX 9060 XT 16GB + I3 12100F + 16 GB DDR4 + llama.cpp with Vulkan backend + Linux Mint. (www.reddit.com) I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…
Ubuntu silicon-optimized inference snaps for AI (canonical.com via hn) Canonical on 23 October 2025 Install a well-known model like DeepSeek R1 or Qwen 2.5 VL with a single command, and get the silicon-optimized AI engine automatically. London, October 23 – Canonical today announced optimized inference snaps,…
AMD Radeon RX 6900 XT - ROCm vs Vulkan - Gemma 4 and Qwen 3.5 speed benchmarks (www.reddit.com) Did some quick tests after building llama.cpp with ROCm 6.4.2 and latest Vulkan for my 6900 XT. gemma4 E2B Q4_K (columns: ubatch | ROCm pp512 | Vulkan pp512 | ROCm tg128 | Vulkan tg128): 32 | 1536.60 | 1423.49 | 151.92 | 174.59; 64 | 1590.65 | 1930.60 | 151.41 | 173.76; 128 | 265…
gemma4 vs qwen3.5 122A10 real usages (www.reddit.com)
Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context (www.reddit.com) What would you choose if you were in my shoes? How viable is 125k for agentic coding really?
Qwen 3.5 122B A10B running 50tok/s on DGX SPARK / Asus Ascent (www.reddit.com) Hello guys, wanted to share this: https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 I am running it on my DGX Spark Int4 V2 with Max context window - and getting 50tok/sec with Multi Token Prediction: Its working great for tool…
Fun Local LLM Comparisons with Gemma, Granite, and Qwen (ekorbia.com via hn) Ekorbia v0.2 features a comparison-chat mode that runs 2-3 local models against the same prompt in parallel. Here are a few fun prompts running across Gemma 4 (e2b), IBM Granite 4.1 (…
260K-param LLM running on an emulated 90s CPU inside an 18-year-old RTOS (www.reddit.com) I know this sub loves absurd LLM projects, so sharing my contribution while we wait for the new Qwen 3.7 models to drop! I successfully got a tiny LLM running inside an RTOS, running inside a custom-built JavaScript emulator for the Freesc…
Advice on local coding setup (www.reddit.com) Just got an RTX 3090 to go with my Intel Core 9 Ultra 285K CPU and 32 GB of DDR5 6000 ram. I want to code locally on my Windows 11 PC.
Short Story Creative Writing Benchmark. Baidu Ernie 5.1: -0.35, Qwen 3.7 Max: -2.01, Mistral Medium 3.5: -2.13, Grok 4.3: -3.81. (www.reddit.com) This benchmark uses head-to-head comparisons of stories written in response to the same constrained creative briefs. The target range is 600-800 words.
Need Help - What would you build? Air-gapped NL assistant that is integrated with Splunk (www.reddit.com) So I have a side project with a given scope: fully air-gapped / on-prem (no internet, no outbound calls of any kind); engineers ask questions about Splunk data in natural language; has to hold the conversation in Korean (index/field names stay…
Qwen Plays ̶p̶̶o̶̶k̶̶e̶̶m̶̶o̶̶n̶ ? / QWEN PLAYS DCSS! - qwen3.6-35b-a3b@q4_k_xl plays open source roguelike adventure DCSS (and does a decent job) (www.reddit.com) Hi, (TLDR.): Qwen in its MTP version has tool call bugs and outputs everything into tool/thinking blocks - mangeling the output - canceling the +speed with repeated wrong tool calls! DCSS works well with non MTP qwen even on smaller qwants.
I built a local GUI for the TradingAgents framework — works with Ollama (www.reddit.com) https://preview.redd.it/i90oxxk7n03h1.png?width=1898&format=png&auto=webp&s=7d219c804fda7dfe122b84fcdb6d0d6883818c68 A while back I came across TradingAgents — a really cool multi-agent LLM stock analysis framework where like a dozen "agen…
Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark? (www.reddit.com) I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g…
favorite Agentic Coding Harness (www.reddit.com) So far, I’ve tried Codex CLI, Claude Code, Gemini CLI, OpenCode, and recently, Pi with local models. Pi is the leanest of them all, with just four tools: read, write, edit, and bash.
Qwen 3.7 Preview (twitter.com via hn) Qwen (@Alibaba_Qwen): Qwen3.7 Preview lands on Arena! Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision.
Very happy with Qwen 3.5 122B output. But is slowness expected? (www.reddit.com) I'm running the 122-billion Qwen 3.5, specifically Qwen3.5-122B-A10B-Q5_K_M, on DGX Spark (128 GB contiguous memory). I'm (very!) impressed with the general knowledge output.
how would you set up a local llm server for a business of 7 people? (www.reddit.com) Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to…
MOOSE-Star (ICML 2026): 7B model + 108K-paper dataset for scientific hypothesis discovery (www.reddit.com) Disclosure first: I work on community at MiroMind. One of our researchers just dropped the full MOOSE-Star collection on Hugging Face — a 7B model post-trained for scientific hypothesis discovery, plus the dataset behind it.
Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? (www.reddit.com) How is this dual setup's performance? Is it difficult to set-up everything with for example llama.cpp?
Qwen 3.6 27B MTP on v100 32GB: 54 t/s (www.reddit.com) Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on v100 32GB SXM module using one of those pcie card adapters. Pulled and built in one shot, and llama-server ran without a hitch.
Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added (www.reddit.com) The benchmark uses adversarial, multi-turn debates across 683 curated motions. Each model pair debates the same motion twice with sides swapped.
Mistral Medium 3.5 128B and Qwen 3.5 122B A10B on 4x RTX 3080 20GB (www.reddit.com) Mistral Medium 3.5 128B with 4x3080 20GB with layer split: CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Mistral-Medium-3.5-GGUF/Mistral-Medium-3.5-128B-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1…
3xR9700 for semi-autonomous research and development - looking for setup/config ideas. (www.reddit.com) Hello everyone. Over the last couple months I have been assembling my local AI setup for personal use, and I thought to write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.
Open Design: Use Your Coding Agent as a Design Engine (github.com via hn) Open Design: the open-source alternative to Claude Design. Local-first, web-deployable, BYOK at every layer; 11 coding-agent CLIs auto-detected on your PATH (Claude Code, Codex, Cursor Agent, Gemini CLI, OpenCode, Qwen, GitHub Copilo…
Has anyone figured out why Claude Code running qwen locally fails when you try to /compact? (www.reddit.com) I’ve tried a few suggested solutions but nothing has worked so far. Is claude trained to respond in a particular way that qwen doesn’t know about?
Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch (www.reddit.com) I’ve been working on an educational implementation repo for speculative decoding: https://github.com/shreyansh26/Speculative-Decoding The goal is not to wrap existing libraries, but to implement several speculative decoding methods from sc…
Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM (www.reddit.com) TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my perso…
Best settings for Qwen 3.6 -27B for 2X3090? (cannot make it to be smarter than Qwen 3.6 35B-A3B! (www.reddit.com) I'm sure people have asked before for settings for these gpu's, but for me, no matter what I do, It doesn't work as good as 3.6 35B! I've tried VLLM and LLAMACPP .
Current state of open-source ? (www.reddit.com) I’m trying to understand the current open-source LLM landscape beyond surface-level hype. We all got used to the nerfed products of Claude/Gemini, so I really believe in open source as a solution.
Alibaba's Qwen family captures over 50% of global open-source model downloads (www.scmp.com via hn) Alibaba’s Qwen family captures over 50% of global open-source downloads, report finds. Qwen hits nearly 1 billion cumulative downloads, far surpassing rivals like Meta Platforms’ Llama and DeepSeek, researchers say.
Qwen 3.7 Plus (artificialanalysis.ai via hn) Qwen3.7 Plus Intelligence, Performance & Price Analysis. Prices in USD per 1M tokens; Cache: $0.08 (-80%). Qwen3.7 Plus is amongst the leading models in intelligence, but somewh…
Qwen vs. Proust: Injecting novels into a local model's prompt (robertkarl.net via hn) Qwen vs. Proust: Injecting entire novels into a local model's prompt before you ask I wrote this; not a bot.
Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster (www.reddit.com) Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarization using LLMs with GRPO.
qwen 3.6 27B AR-> Diffusion - local training on 5090 (www.reddit.com) based on the work of open-dllm - (which achieved qwen 2.5 autoregressive -> diffusion realignment head - same exact model under the hood delivering a 4x in improvement.) TLDR I haven't got a trained model yet. just a burnt out gpu cable an…
Anyone use QwQ-32B? It's over a year old? Has Qwen 3.6 27b basically replaced it? (www.reddit.com) I saw this one mentioned, but it was a source from about 14 months ago. In the age of Qwen 3.6 and Gemma 4, is there still a use for QwQ 32B?
Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? (www.reddit.com) Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual rtx 3090 - gigabyte psu 1600 w What do y…
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset (www.reddit.com) I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probability distribution of how likely it is…
Whats the best Qwen 27B Q8 quant? (www.reddit.com) everyone is talking about q 4 q 5 and q 6, but. i got some coding that i feel like lower quants kept getting wrong.
Need Help Choosing a Harness for Qwen 3.6 27B (www.reddit.com) I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a spec…
GPU VRAM only for small models with llama.cpp: is it possible? (www.reddit.com) I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large co…
24GB M4 Mac - is Qwen 9B only option while system is running? (www.reddit.com) I have mac at work that I want to use local model for prototyping and basic prompts that needs to stay on device. What sort of model I can run that I can fit at least 64k context ?
AI server under 5k? (www.reddit.com) I have a framework desktop 128GB and a 3080 12GB running qwen 7b. I want to move to a proper server rack + switch but not sure how to move from desktop PC to server rack. Any advice on what GPU/Server to get under 5k?
A streamlined Hugging Face model search utility coded by Qwen 3.6-27B (www.reddit.com) Hi all. As some may have been aware, Hugging Face's model search had issues recently.
Qwen3.7-Max: The Agent Frontier (qwen.ai via hn)
I trained TIME: short context-triggered thinking on Qwen model instead of overthinking (www.reddit.com) Started this as a personal project for my Open-WebUI setup to use. Somehow it ended up as an ACL 2026 paper.
Developers who use local AI - Q4_0 vs Q8_0 KV quant? (www.reddit.com) I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (www.reddit.com) Implemented Multi-Token Prediction for QWEN on LLaMA.cpp with TurboQuant. +40% performance!
running Qwen 3.6 35b A3B on 2x 5060TI (www.reddit.com) I ran Qwen 3.6 35B A3B on two 5060 Ti 16GB cards (32GB VRAM total; I also have 32GB of DRAM but I don't like offloading). I used Q4 in LM Studio to get full context and I get 90 t/s. Any tricks to optimize this further so I can upgrade to Q6 or Q8? Thanks!
Does THINKING MODE significantly improve translation? (www.reddit.com) Between a solid model from Qwen or Gemma 4, when translating a text, does "thinking mode" significantly boost the quality of the translation, or is the difference negligible?
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com) I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.
Is there any image2image model better than Qwen-Image-Edit-2511 and of comparable size? (www.reddit.com) I've tried with FLUX.2-9B but was not really better and FLUX.2-dev is too big. Any (tested) suggestions are most welcome.
Pi and Qwen3.6 27B make setting up Archlinux really easy. (www.reddit.com) Just thought I'd share this use case. I was setting up a miniPC as a home theatre with Archlinux (It's the OS I'm most familiar with).
Just got a 8x 32gb v100 server... now what (www.reddit.com) Looking for suggestions. Current setup llama.cpp and ran qwen 3.5 397b 256k context.
DELIGHT – self-hosted AI engineering autopilot: local LLM + browser farm + repo graph + P2P compute (www.reddit.com) TL;DR: Built a local "OS for AI agents" that scans your entire repo into a live graph (Worm), routes tasks between local Qwen, headless Cha…
Qwen 3.5 MTP for 9B (www.reddit.com) Can llama.cpp run MTP for this model?
A deepseek-v4-distill-qwen3.6-27b? (www.reddit.com) Long time ago (actually only a year ago), DeepSeek released a few open source model, such as deepseek-r1-distill-qwen (https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B). I am wondering if anyone in the community is brave eno…
Ran K2.6 through a third-party coding benchmark: heres how the figures stand up (www.reddit.com) I have been following the akitaonrails coding benchmark which tests against a fixed rails + Rubyllm + docker task rather than vendor-reported evals. April 2026 update put K2.6 at 87 sitting in tier A (80+), ahead of Qwen 3.6 plus (71), Dee…
Should I sell my RTX3090s? (www.reddit.com) I have a GPU server (4 × RTX3090s) that I've been using for research and PoC in the past 2 years. Mostly running vLLM for Qwen, GPT-OSS, and Gemma.
Local image generation on Mac: 10 models compared (SD 1.5 → Flux dev → Qwen-Image → Gemini) (www.reddit.com) Tested 10 image generation models on M1 Max 64GB for photorealism, text rendering, and cultural accuracy (Japanese/Asian content). Key findings: Qwen-Image Lightning (8-step distillation) beats the full model in quality while being 9x fast…
I cut Codex’s API Usage by 50% using a self modifying system (www.reddit.com) I've been developing a self-modifying AI agent system that effectively cut my Codex/Claude Code API usage in half. Codex makes a plan and then I basically just copy/paste Codex instructions for the agents to work on. Come back in 6 hours a…
Qwen 3.6 27B Neo Code Q4 KM I matrix is badass (www.reddit.com) So i am using this model in tax accounting. Have a shitty Ryzen 9 7940HS (8C/16T), 60 GB RAM, Radeon 780M iGPU, 1 TB Kingston NVMe, Win 11 Pro.
Five labs, one suite, do model families have personalities? (benchmark) (www.reddit.com) Bench 3 from my 18GB M3 Pro. Bench 2 was the 4B-class post where the comments were mostly right: I gave thinking models a fixed 1024-token cap, Qwen got kneecapped, Gemma E4B needed clearer active-param labeling, and the headline was partl…
Actual comparison between locally ran Qwen-3.6-27B and proprietary models (www.reddit.com) Hey y'all! I've recently written a text in Russian about my experience comparing Qwen-3.6-27B with lower tier cloud models on hard tasks -- I wanted to share the translation of the post, since I found the results interesting and surprising.
anyone know where to use qwen 3.6 27b via api/coding plan? (www.reddit.com) I want to test this model out but I don't have a setup that can do it locally. openrouter and all my coding plans don't include it.
Why are there so few small local creative writing models from the Chinese? (www.reddit.com) At this moment, the models such as Qwen 3.6 35b/27b crush the competition, yet I can't help, but notice this pattern. While the local RP scene is abundant with the Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…
Agents for end-to-end document redaction and review tasks (OCR and PII identification - Qwen 3.6 vs closed-source comparison) (www.reddit.com) (Links to all files, apps, and repos mentioned in this post can be found in the 'full post' link at the bottom) Agents for document redaction and review tasks Document redaction tasks involve text and vision capabilities, and long context…
What are your most interesting and hard Vision use cases? I plan to do side by side comparison of Gemma 4 (31B) vs Qwen 3.6(27B) Vision and I look for inspiration (www.reddit.com) Hey guys, I built a custom vLLM pipeline to run Gemma 4 (31B FP8) and Qwen 3.5 side-by-side locally to see how they actually perform in the wild with preprocessing of audio and images. But of course new model Qwen 3.6 27B came out just whe…
Gemma 4 vs Qwen 3.5 Vision on vLLM — 5 things I learned benchmarking them side-by-side (Reasoning budgets, FP8, pre-processing the input). (www.reddit.com) Hi guys, I’ve been running side-by-side experiments on Gemma 4 (31B FP8) and Qwen 3.5 Vision for the last few days using vLLM in Docker to see how they actually handle real-world images and video. A few things I found out: 1.
LLM Router: Best way to dynamically route prompts between proprietary and open-sourced models? (www.reddit.com)
Running Qwen 3.6 35B-A3B-4B on MacBook Pro M5 64GB with tools (www.youtube.com via hn)
Ask HN: How do you use Local LLMs? (April 2026) (news.ycombinator.com)
Thinking versus chain of thought instructions (www.reddit.com) I've been using and learning about using all kinds of models for the last few years and I've read a lot of papers. I've even done finetuning and made loras, so I feel stupid asking this question, but here goes.
Which Qwen models can do FIM (Fill in the middle) for autocompletion? (www.reddit.com) I cannot find a definitive answer. I think the following should be able to do FIM: Qwen 2.5 coder, Qwen 3 coder, Qwen 3-2507 instruct, Qwen 3.5, Qwen 3.6. What I verified: Qwen3-32B: no; Qwen3-4B-Instruct-2507: yes; Qwen3.5-27B: yes; Qwen3.6-35B-A3B…
Context checkpoint erasure in llama.cpp ? (www.reddit.com) Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…
A proxy routing all webtraffic through Qwen, removing all enshittified crap (geohot.github.io via hn) zappa: an AI powered mitmproxy Soon, AI will be good enough to interact with the Internet in an indistinguishable way from a human. This can be an amazing opportunity for liberation from all the people who are targeting your attention.
Summarizing text locally, medical literature (www.reddit.com) Colleagues, I have a question: does anyone have a locally developed solution for summarizing text? Which quant of Qwen 3.5 27B would be able to summarize an entire chapter of medical literature, about 25-30 A4 pages, without hallucinations?
RTX 3090 llamacpp flags help (www.reddit.com) Hi, my current system hardware is an RTX 3090 (24GB VRAM) and 64GB system RAM on Windows 11. I've been playing around with Hermes agent and local LLMs (Qwopus3.5-27B-v3-GGUF & gemma-4-26B-A4B-it-GGUF). When I try asking the Hermes agent to do a task with…
Can I combine a RTX5060ti 16gb with 7900XTX 24gb for llama.cpp? (www.reddit.com) I bought this 7900XTX for 905 euro in Spain, and wondering if can I combine them together to run Qwen 3.5 27B for example ? Using a MSI B650 Gaming Plus Wifi and 64gb DDR5 6400mt/s
NVIDIA + UMD released AF-Next: open audio-language model that outperforms Gemini-2.5-Pro on MMAU-Pro (75.01% vs 57.4%). Temporal Audio Chain-of-Thought anchors reasoning to timestamps. (www.aiuniverse.news via reddit) Audio Flamingo Next (AF-Next) — three variants: AF-Next-Instruct: audio Q&A AF-Next-Think: multi-step reasoning with temporal CoT AF-Next-Captioner: audio description generation Architecture: → AF-Whisper audio encoder → Qwen-2.5-7B LLM ba…
Qwen 3.5 Small – on-device multimodal models – Alibaba / Qwen (ai-tldr.dev via hn) A high-volume feed of new AI releases — models, open-source repos, developer tools, papers, datasets, and benchmarks — refreshed every 8 hours. Each release is explained in plain English so you actually understand what shipped.
Lora training (www.reddit.com) I'm getting ready to do a training run on qwen 3.5 27b and it will be the first time I've ever done LoRA. to complicate things I've tried to make my own custom dataset using q&a pairs.
Gemma 4 E4B as a primary local LLM (replaced Qwen) (digg.com via hn) Gemma 4 E4B 6bit is now the local model of my choice and loaded 24/7 on my Mac (using @lmstudio), replacing Qwen3, 3.5 4B after ~9 months of usage What an insane model, congrats @GoogleDeepMind 🤠 The new setup replaces his nine-month daily…
Ask HN: Is it feasible to run a model on device for complete privacy? (news.ycombinator.com) Tried Gemma, Qwen and a few others. Need vision and larger context windows for an application I am working on.
Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) (local-llm.utop.workers.dev via hn) Hi folks, I found this setup on consumer hardware that seems to have great results on local hardware. - qwen 3.6 q6 - 450 K context using turboquant turbo3 mode llama.cpp fork - multimodal support This AI generated blog article is a kind…
Show HN: Free open source coding models in Slack (www.runcord.com via hn) Hey HN, We believe we have the easiest onboarding from signup to being able to spin up coding agents in slack like Stripe, Ramp & Coinbase. Demo of the onboarding: https://www.tella.tv/video/connecting-cord-to-slack-1-19ep Every signup get…
Are Claude or GPT subscriptions subsidized or are the APIs a ripoff? (www.reddit.com) Do you think GPT/Claude subscriptions are heavily subsidized as part of a land-grab strategy, where the companies are willing to lose money to dominate the market later? Or are the subscriptions actually profitable, and instead the API pri…
Looking for Suggestions — Single 5090 & 64gb DDR5 (www.reddit.com) Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…
Ask HN: Local model experiences with 'high-reasoning distill' finetunes (news.ycombinator.com) What are your experiences with all the different variations of finetunes on small models (<40B) with those popular datasets? My personal experience is mostly with the 'Opus-Reasoning' ones on qwen models, and aside from the output being su…
How do you handle trying new models without spending too much? (www.reddit.com) New models pop up constantly—Qwen 3.7, Gemini 3.5 flash, etc. Every time a better one launches, I want to have a try, but I don't want to increase subscriptions.
Show HN: Charm – on-device spelling, grammar, and prediction for macOS (www.theodorehq.com via hn) I've spent the last year building Charm, a native macOS menu bar app that corrects spelling, fixes grammar, and predicts your next word. Three features: - Spells: NSSpellChecker plus a local LLM for context-aware corrections (catches "defi…
Agent builders: are GPT/Claude/Gemini API costs killing your margins? (www.reddit.com) Hey everyone, For people building agents with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude MCP/SDK, Google ADK, or LlamaIndex — how are you managing LLM API costs? Agent workflows can get expensive fast because of: tool calls retr…
Anyone evaluated the difference between Qwen Code for the local qwen models vs another harness? CC, OC, LC, Aider etc.. (www.reddit.com) For me, opencode doing fantastically but was wondering if qwen code would be more native and have better functionality, since idk which agentic harness they used to get their benchmark results
Qwen 3.7 Max (www.reddit.com) Qwen 3.7 looks pretty impressive. I think we've reached the point of Chinese labs catching up with the Western frontier labs.
Qwen 3.7 Max scores 60.6% on SWE-Bench Pro (www.reddit.com) https://preview.redd.it/jyiiwn2o0f2h1.png?width=962&format=png&auto=webp&s=6a96d2b9fe7bffcc75e8d5865161ec3727d46d58 Link to blog : https://qwen.ai/blog?id=qwen3.7
Kinda New to all this, couple of questions about how to set pcs and what models (www.reddit.com) I'll address all the questions here so as not to spam the sub. What would be a better setup: 1 PC with 2 3090s and a 5080, where the 3090s would have to run in x4 PCIe slots, OR 1 PC with the 5080 and another PC with the 2 3090s on x16 split into 2x8 mai…
Show HN: I built a native macOS Markdown viewer 100% with AI coding agents (github.com via hn) I built Markdown Viewer because every Markdown app I found was either bloated (VS Code, Obsidian) or too bare-bones. Wanted something that loads instantly, renders Obsidian-style features cleanly, and weighs in at a few megabytes.
I tried to switch from Claude Code to OpenCode, but Claude Code still wins for me (www.reddit.com) I spent some time digging into Claude Code vs OpenCode, mostly from the angle of how they actually work as coding agents. More on the technicalities like: context and memory tool use subagents permissions safety and control study the recen…
What political censorship looks like inside an LLM's weights (Qwen 3.5) (vas-blog.pages.dev via hn) A mechanistic-interpretability study of Qwen 3.5 Disclaimer. This is a mechanistic-interpretability study of how nation-state-mandated content filtering actually gets built into a deployed LLM's weights.
Tesla P40 running qwen 3.6 (www.reddit.com) Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40.
Qwen 3.7 appears in Qwen Chat website (chat.qwen.ai via hn)
Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 (www.reddit.com) I'll be UPDATING this as it seems I was benchmarking and testing Just before the UPDATE LOL TL;DR If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and co…
5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp (www.reddit.com) Hey all, While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.
I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why (www.reddit.com) In my real-world usage (opencode) and in my synthetic benchmarks, Coder-Next (Q5) demolishes the whole Qwen3.6 family including the 27B Dense model (All Q8). Everybody else is hailing that 27B is superior and is an amazing model, but I hav…
Sort of my first venture towards finetuning on a qwen 3.5 4B heretic mode (www.reddit.com) How do you deem qwen 3.5 4B heretic variants for RP finetunes? I have been struggling to get a decent instruct based model, any tips regarding the goal would be really helpful.
Convert With MPT Support? (www.reddit.com) Hi All, I'm trying to understand the process of creating GGUF with MTP support. Does the original Qwen/Qwen3.6-27B support MTP?
Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks (www.reddit.com) Here are some results (llama.cpp)! Task 1 (write a short poem): 27B Dense: 12.5 tokens/s; 27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s; 27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s. Task 2 (edit a hello world HTML artifact): 27B Dense:…
How does Pi coding agent control Qwen's thinking verbosity? (Qwen 35B A3B, llama-server) (www.reddit.com) I'm running Qwen 35B A3B via llama-server with reasoning budget set to -1 (unlimited) for testing. In every client I've tried, the model just thinks endlessly before responding.
RLM models and Qwen3.6 (www.reddit.com) Does anyone here have an RLM setup and how could I set it up? I want to make my Hermes agent even more powerful and I don't like that I need to open a new context window every time after just a few prompts.
Predicting Rare LLM Failures with 30× Fewer Rollouts (www.lesswrong.com via hn) TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space. Authors: Francisco Pernice (…
Thoughts on "production" model setups (www.reddit.com) I've been working with Qwen 3.6 27B and 35B-A3B models and pretty happy with them. The point I've reached now is how to split my uses cases.
Vulkan or CPU llama cpp backend for local llm for coding/code assist (www.reddit.com) Hi all I recently started a new job and we're doing python development for a ci cd metadata consolidation library for analytics and we cannot use no stuff like claude code or codex or gh copilot or any model APIs (free or paid). I got a la…
Gave Claude a local LLM as assistant on my Mac (www.reddit.com) Hi there! I was playing around with Ollama and LMstudio, testing local models and had the idea of letting Claude evaluate a few models on their actual capabilities rather than doing it myself.
LLM as logic processor, filesystem as memory — Q2 quant doing real agentic coding 50k context (www.reddit.com) Hello LocalLLaMA subreddit, i have been running local models for coding tasks and kept hitting the same problems everyone does — the model writes an 800-line file in one shot and half of it is garbage, it spirals in its own reasoning for 4…
9070xt inference for q3 qwen 27B (www.reddit.com) In llamacpp I'm getting 12tok/s, does this number look right to you and what can I do to increase this number (if possible)? cd ~/llama.cpp && ./build/bin/llama-server -m models/qwen-3.6-27b-abliterated-q3.gguf -ngl 999 -c 65536 (i need th…
Show HN: Transformer Math Explorer (simonramstedt.com via hn) Interactive reference for transformer models, presented via dataflow graphs, drillable down to elementary mathematical operations. Covers models from GPT-2 to Qwen 3.6, with MLA, MoE, RoPE, MTP, hybrid attention, and other variants togglea…
Model(s) for Creative Writing & Conversational Intuition (www.reddit.com) We can all agree that the new Qwen models are truly amazing, and we are blessed to have them. In coding, they are certainly a breakthrough.
Is Qwen3-coder the best kept secret out there? (www.reddit.com) So I'm brand new to this scene but I'm using Claude to help me fine tune a model for a startup idea I have in the Healthcare space. I have been working with the 27-35B parameter models (Qwen3.6, Gemma 4) and the couple of 120B+ models (Qwe…
RTX Pro 4500 Blackwell - Qwen 3.6 27B? (www.reddit.com) I have a server running a 4500 Blackwell on CUDA 13.1 and nvidia/595.58.03 with 48GB mem assigned to it. I have build: dcad77cc3 (8933) with Qwen3.6-27B UD-Q5_K_XL loaded and connected it to Roo code.
How difficult is distilling? (www.reddit.com) I remember a year or so ago when DeepSeek R1 came out and it was pretty quickly distilled into Llama 3 8b and Qwen 2.5 (?) 7b. Why don’t we see more distilled models?
Local autonomous security agent powered by Qwen 2.5-7B on Kali Linux (github.com via hn) Autonomous Security Agent A self-contained security agent built with Qwen 2.5-7B running locally via LM Studio on Kali Linux. The agent can autonomously execute security tools, analyze results, and take action through an MCP (Model Context…
Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development (www.reddit.com) tl;dr - For software development, Qwen3.6 27B, 5090 gives you ~3x speed over M5 Max, letting you plow through code, while M5 Max gives you ~4x memory, letting you use higher quantization and bigger context. Which would you choose and why?
Ling 2.6 (Flash and 1T): Efficient Open Models Competing on Agentic Benchmarks (firethering.com via hn) Ant Group doesn't get the coverage it deserves. While the open source AI conversation in the West circles around DeepSeek and Qwen, Ant Group has been quietly building a model family that competes directly with the models everyone is talki…
Qwen 3.6 and inline comments (www.reddit.com) I've been using Qwen 3.6 with the Pi harness, and so far I'm really enjoying the experience. I've noticed Qwen is great at leaving inline comments when writing Typescript (haven't tried other languages).
Thoughts on GRM-2.6-Plus-GGUF ? (www.reddit.com) Judging by what they state, it should be better than Qwen 3.6 27B
APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier (www.reddit.com) Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then t…
M3 Ultra + DGX Spark = M5 Ultra-lite? (www.reddit.com) So I saw an article recently about exo disaggregated prefill with DGX Spark and M3 Ultra - prefill on one machine and decode on another. DGX Spark apparently has 4x matmul performance over an M3 Ultra - same as the M5 Ultra should have.
FPGAs for speculative decoding (www.reddit.com) Anyone who knows stuff about fpgas: - What max model size can one be designed for (I've read 20-30m parameters max, is it possible to go for more if quantized - at a resonable price)? - Taalas - is what they're doing with asics more viable…
[Help] Running big dense models faster (www.reddit.com) I have been trying Mistral 3.5 on my 4x RTX 3090 rig with llama.cpp. Inference is slow (about 11 t/s) even without anything being offloaded to the CPU.
24gb vram to 48gb vram (www.reddit.com) Hi all, I'm debating purchasing another 7900xtx in addition to the one I'm currently using, pushing my VRAM from 24 to 48. I'm semi-satisfied with the new Qwen models.
"LLM is created so engineer don't have to write a report", anyway found out ONLYOFFICE can connect to OpenAI compatible, using Qwen 3.6 to do elaboration. (www.reddit.com) It is a plugin made for ONLYOFFICE, much simpler than copy-pasting from a web UI. PS.
Which model for 32GB M2 Max? (www.reddit.com) I would like to experiment but before investing loads of money, I do have a MacBook Pro with 32GB RAM, M2 Pro. Which model would maximize versatility given this hardware?
Qwen 3.6 - Loops and repetitions (www.reddit.com) I normally seldom experience loops, either reasoning or responses, using Qwen 3.6 27B Q8 with 256k context window in Agent Zero. But the 35B A3B Q8 with 256k context window gets constant loops and is basically unusable within Agent Zero.
What's the best suscription under 20$? (www.reddit.com) I’m pretty overwhelmed. I feel like there are so many options that I don’t know which one to choose, and trying things until I find a decent one isn’t really my thing—even though I enjoy it.
Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops) (www.reddit.com) I've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking".
Qwen corrects code saying that Taiwan is a country (twitter.com via hn) Using Chinese AI models for coding be like: Quote Rachyl Jones @RachylJones Apr 29 NEW: House committees are probing Airbnb and Cursor parent Anysphere over t…
Quick and simple test of various 3.5 and 3.6 qwen models on production code base which have deployed to an enterprise . (www.reddit.com) I tested several Qwen and unsloth models to see which could fix this correctly. Here's the breakdown.
Open Source Company Coding Plans (www.reddit.com) I’ve been looking to buy a coding plan from one of the major open source contributors to give my meager support to them and transition away from Claude. I would love to hear some feedback from the community of their experience with some of…
Qwen3.6-27B created this Open Webui tool (www.reddit.com) I usually go for Claude for those kinds of Open WebUI tool creations, but rate limits are getting tight so I decided to just let Qwen3.6-27B-Q5 handle it through Open WebUI. It did it in one shot.
Devstral Small 2 24B vs Qwen 3.6 27b or both? 1x 3090 (www.reddit.com) Hi, I've got 1x 3090 and I'm thinking about both of these models. I've been using Qwen since Friday and this model is amazing!
I'm Not a Dev But I Use Qwen 3.6 35b to Code (www.reddit.com) Full disclosure: I used to program a bit, but I was garbage at it so I found a new career. This was eons ago so I'm not a dev, obviously.
Is long re-processing of output as input a common "feature" or not? (www.reddit.com) I now use (mostly) Gemma 4 and Qwen 3.5 models *. And seems that all of them, after context grows a bit, after providing long output for me and getting a short prompt in response, are starting to process many new tokens as input and I have…
Last llama.cpp update broke web search tool calling with Qwen 3.6 27b. (www.reddit.com) At least in open-webui. Nothing has changed except for the backend update.
Live Demo: SRT adds transparency to Qwen black box (0.19%) (huggingface.co via hn) SRT-Adapter v8a — Live Demo Interactive demo for the Semiotic-Reflexive Transformer Adapter (v8a generation) bolted onto a frozen Qwen/Qwen2.5-7B. Paste a passage.
PI agent integrated with Cline-Kanban repo: All using PI and Qwen 3.6 35B MOE UD 4K_XL (www.reddit.com) Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…
CC-OpenAI-Codex Plugin, but for all CLI agents (www.reddit.com) Hello! I made a plugin for myself, & I figured I'd share it, in case someone else finds it useful (also to solicit feedback on it).
Were Qwen3.6 models scrubbed from openrouter? (www.reddit.com) I made a simple app using openrouter, hoping to use the new small qwen models (the a3b moe and the 27b dense one), but they aren’t listed. Also, I swear some qwen3.6 models that were listed before are missing now.
Severe instability and looping issues with local LLMs (Qwen, Zen4, llama.cpp) (www.reddit.com) I tried working on a local LLM project today and honestly ended up pretty frustrated. I tested several approaches, but none of them worked reliably.
Speed penalty with Q8 KV quantization (www.reddit.com) I knew there would be a speed penalty when switching the KV cache quantization from F16 to Q8, but I never expected it to be this significant at longer context sizes. I ran a test with Qwen 3.5 122B on my MacBook M2 Max using llama.cpp.
Show HN: Qwen Lens Studio – multimodal app on Qwen3.6-35B-A3B, runs on Ollama (github.com via hn) Qwen Lens Studio A multimodal AI studio built around a single Qwen vision-language model, exposed through five focused tools plus a batch runner and a persistent session log. Ship a screenshot → get code.
Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried? (www.reddit.com) Do I understand correctly, based on this comment, that I can potentially fit Qwen 3.6 27B FP8 precision model and have around 256K context available and fit it fully in my RTX 5090 VRAM? Of course with the help of TurboQuant compression, a…
Are commonly recommended sampling parameters often too high? (www.reddit.com)
Termux vs. Terminal on Pixel 10 (news.ycombinator.com)
kIOGPUCommandBufferCallbackErrorImpactingInteractivity... recreate the backend to recover (www.reddit.com)
Oculink eGPU for LLMs: RTX 5070 Ti (256-bit) vs 5060 Ti (128-bit) paired with 4090m (256-bit) laptop? (www.reddit.com)
Good Summarization SLMs for < 2000 tokens (www.reddit.com)
PSA re Qwen 3.6 35B A3B q4 + agents (www.reddit.com)
Recommended parameters for Qwen 3.6 35B A3B on a 8GB VRAM card and 24GB RAM? (www.reddit.com)
How is Rotorquant/planarquant/iso qaunt better? (www.reddit.com)
5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4 (www.reddit.com) Hi guys, just want to share with you guys a Frankenstein build I put together that is surprisingly decent I have a i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to a RX9070, then I…
Qwen3.5-27b (Qwopus) build a 3d game scene using opengl and C++. (www.reddit.com) I asked Qwen to build a 3D game in C++ using OpenGL. It created the whole project across multiple .cpp and header files, 2500 lines of code in a single shot. The code was clean and highly technical, and the scene loaded on the first try. I was amaze…
Intel Lunar Lake 258V (32GB) vs Qwen 3.6 35B-A3B: Pushing the limits of MoP architecture. (www.reddit.com) Hardware: Intel Core Ultra 7 258V, 32GB Unified Memory. Model: Qwen 3.6 35B A3B (Quant: Q3_K_S) via LM Studio.
Qwen 3.6 No think? (www.reddit.com) I’ve been seeing a lot of good feedback about the qwen 3.6 model and its reasoning performance but has anyone tested it with reasoning off? I’ve been building a low latency app using Qwen 3 30ba3b 2507 and 3.5 no think was not an improveme…
Strix Halo concurrency 4 16k context 64 t/s Qwen3.6-35B-A3B-Q8_0 (www.reddit.com) https://preview.redd.it/4906akj9dovg1.png?width=1527&format=png&auto=webp&s=c49e255ac79a3c5455f44603422f8af7ddc12594 First of all can we make https://www.youtube.com/watch?v=2lUC8Gimxz8 Angine de Poitrine this subs official band? Those guy…
Is there a way to have qwen-code CLI read images? (www.reddit.com) Basically I am asking the model to describe an image, but it says it can't process the images. The weird thing is that if I send the image encoded directly on the prompt, it works just fine, I am using llama-server with qwen3.5 (tried all…
Llamaindex releases Parsebench (www.reddit.com) https://preview.redd.it/c0ns26pf3mvg1.png?width=1920&format=png&auto=webp&s=4b6ac114c2e0395684ac0ba79e591d71ccca2fe3 ParseBench lets you test the accuracy of different parsers using your own documents. Ran this across Gemini 3 flash, Qwen…
Local Model Suitable for Grammatical/Academic Editing? (www.reddit.com) Hi, I do a lot of writing and would be interested to know what people's thoughts are on the most capable model for proofreading, grammatical and academic editing. I have 48GB VRAM but don't imagine i'd need something too overkill.
Nyquest – Open-source LLM token compression proxy in Rust (15–75% savings) (github.com via hn) nyquest.ai Semantic Compression Proxy for LLMs Reduce LLM token usage by 15–75% without losing meaning. Drop-in proxy with 350+ compiled rules + local LLM semantic condensation (Qwen 2.5 1.5B).
GTX 1650,4 gb vram, I want a decent local tts. (www.reddit.com) At the moment I am broke, so please don't laugh at my specs. I am making videos right now and I want a deep male voice. I tried ElevenLabs but it is too costly, then I tried Qwen TTS but it was slow as heck. Does anybody know lighter…
Best LLM for logic/ spatial reasoning on small context inputs? (www.reddit.com) My system has 32gb RAM and 8gb VRAM. I tried out DeepSeek-R1-Distill-Qwen-7B-Q6_K_L.gguf and it was vastly inadequate for what I wanted so looking for other suggestions.
Been trying to get Qwen 3.5 to stop reasoning using old methods like /no_think, it didn't work, but it said something like "too late" in its reasoning (www.reddit.com) Wait, I need to be careful about the "no_think" tag in the system prompt. The system prompt says /no_think.
Any way to work with NUMA Nodes? (www.reddit.com) I bought a dual Skylake server because 12 channels of memory (and 2 x 3090s) THEN found out about NUMA nodes after my poor test results. Very disappointed.
Qwen 122B is AMAZING but is my config right? (128GB M4 Max) (www.reddit.com) Hi! I hope its okay for me to ask this here.
One Layer, +12%: What 667 Configs Reveal About Small LLM Anatomy (austinsnerdythings.com via hn) I’ve been messing around with local LLMs on my 3090 for a while now — I have a growing collection of Qwen models on D:\LLM that I probably should be embarrassed about. A few weeks ago I stumbled across David Noel Ng’s LLM Neuroanatomy blog…
Open-source Perplexity alternative: ollama+perplexica+searxng: which model? settings? optimization? (www.reddit.com) Hello, I'm in the middle of putting together a local AI setup to eventually drop Perplexity, ChatGPT, Claude, etc., but I'm not an IT person (Perplexity is still my friend for now!).
Dynamic tool lists vs KV cache: how do you handle this trade-off in LLM agents? (www.reddit.com) I’m working on an LLM agent setup (using Qwen-style chat templates with tool calling), and I ran into a design trade-off that I’d like to get some insights on. In these templates, the full tool definitions (JSON schemas) are injected into…
Show HN: VQAScore – open eval metric/reward model, now for text-to-video (github.com via hn) Two years ago we released VQAScore: ask a VLM "does this image show {prompt}?" and use P(Yes) as the score. It became a go-to evaluation metric and reward model for image generation, replacing CLIPScore across the field (2M+ downloads on H…
Why aren't languages/frameworks offering retrained models for their project? (news.ycombinator.com) The cost of training is coming down. We have incredible open source models (especially smaller ones) like qwen 3.6-27b.
Words do not have determined meanings (news.ycombinator.com) The vocabulary itself is reflexive. It is self-referential, looping back into its own structure rather than anchoring in fixed reality.
Qwen-VLA: Vision-Language-Action Modeling Across Tasks, Environments, and Robots (www.dcard.tw via hn)
DocumentAI Visual Benchmark - GPT 5.5, Gemini 3.5, Qwen... (www.maltebuettner.eu via hn) # documentai bbox benchmark In my previous post, I talked a bit about the recent developments in the field of DocumentAI. Now comes the practical part.
Tuning CPU-only Qwen3-30B inference with an IBM Quantum sampling loop (github.com via hn) Qwen Air QPU/MCP Lab Quantum-enhanced autoresearch for high-performance, CPU-only Mixture-of-Experts LLM inference on legacy hardware. This repository contains the benchmark harness, MCP-style tool boundary, experiment logs, paper draft, a…
Show HN: SharkBay – a local macOS workbench for coding-agent CLIs (github.com via hn) SharkBay macOS workbench for multi-agent vibe coding Features Multi-Agent Support Launch and manage multiple AI coding agents from one workspace. Supported agents: Claude Code · Codex · Gemini · Kiro · DeepSeek · Qwen · OpenCode Agent Stat…
Annoying QwenCode v.0.16.0 - How to disable this thing? do I need to roll back to 0.15.x, disable auto-updates and call it day? why Qwen... WHY!!?? (www.reddit.com)
been pairing M2.7 with Hermes Agent for a few weeks. holds up surprisingly well. anyone else running this combo? (www.reddit.com) been self-hosting hermes agent locally for a few months and rotating through different model backends for it. tried claude sonnet 4.5, gpt-5.5, qwen 3.6 coder, and most recently minimax m2.7.
Anyone tried a setup like this? Is it a bad idea? 😅 (www.reddit.com) I’m considering building a local machine for AI inference using a Dell Precision T5820 and 2 Intel Arc A770’s. From this I could get 32GB DDR4 RAM, 1TB SSD and 32GB VRAM, all for like $1000.
Is a 128 GB MacBook Pro M5 Max actually too slow for large-context local LLM coding workflows? (www.reddit.com) People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM. My main concern is prompt ingestion / prefill latency and large-context handling — not raw token generation speed (which I think is OK).
LMStudio with MTP support - which model? (www.reddit.com) Looks like LMStudio released support for Multi-Token-Prediction (MTP) and the release notes say to use a MTP-compatible model. What model is everyone using with MTP support?
ran qwen3.5 locally on a flight with no wifi. claude code started straight-up hallucinating (www.reddit.com) heavy travel period last month, lots of offline time, and i could not stop building. airplane wifi was unusable so we switched models inside Claude Code and fired up qwen3.5 locally on an M4 macbook.
Five different frontier LLMs in one shared environment, with separate thought and emotion output channels — sharing setup, results, and open methodology questions (www.reddit.com) First real project to share. Single developer, personal research, not a product or service.
Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? (www.reddit.com) Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774) My run command: ``` llama-server \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --presence_penalty 0.0 \ --min-p 0.00 \ --gpu-layers all \ -m /home/eleung/huggingface…
Llama.cpp: What's up with -sm tensor + AMD + Vulkan? (www.reddit.com) Has anyone got it to work? I tried it with dense models (eg qwen 27b, gemma 31b, mistral 128b) since that's where I need it most, but it always core dumps.
What is everyone using AI for? Realistically (www.reddit.com) So I have to admit, I have fallen victim to the cool looking dashboard videos but I’m struggling to find a use for me. I love AI and use it daily for general questions and some deeper research (Google Gemini free tier).
I spent €300 extracting raw LLM weights, ran into a wild codegen bias trap, and finally mapped the internal activation geometry (60 Graphs) (www.reddit.com) Hey Reddit! A couple of weeks ago, I posted about my independent research on treating LLM alignment as a latent space shift.
Want Built a React-style looping agent with small LLMs (Qwen 3.5 9B / Gemma4) + LangGraph? (www.reddit.com) Currently experimenting with building a React-style looping agent system using small LLMs like Qwen 3.5 9B and Gemma 4 (E2B), and I wanted to ask if anyone here has worked on something similar. Current setup: Using LangGraph Around 5 tools…
My pipeline for the best speech to transcript results (www.reddit.com) I wished the new ASR (automatic speech recognition) models to give me the accurate output but I was disappointed, specially when the input was multilingual and noisy (all my use cases). I had to put in significant efforts in audio pre/post…
how to install llamacpp the better way to wrapping it in python ui (CPU use only) ? (www.reddit.com) i want the best installation that fit my use and my low-compute H.W , i want to run small to above small llm like "qwen" 2b ,4b and 27b , and "gemma" 31B. rely completely on only old CPU 4th.gen i7 with that few 32gb 'slow' ddr3.
Qwen 3.6 27B MTP speed on 3080ti (getting 4.5 t/s) (www.reddit.com) Using LM Studio with 3080ti (12gb of VRAM) and 128gb of ddr4. Model version: Qwen 3.6 27B MTP UD q4_k_xl Is this my hardware limit?
Claude code in terminal models / combine with local llm? (www.reddit.com) Hi, I’m pretty sure I have seen people typing /model and seeing all available models. I have to type models from memory.
397B competitor that fits in 256 RAM? (www.reddit.com) Does one exist? I noticed 3.6 QWEN did not release locally in 397B-17B.
I offloaded a multi-step background loop from Claude Code to a local agent OS. They started voting on their own system rules. (www.reddit.com) Hey r/ClaudeAI, If you are using Claude Code or building terminal agents, you know the exact moment the context window starts degrading during long-running tasks. I wanted to build a persistent runtime layer to offload those heavy, multi-s…
Ask HN: Is the next big thing locally running coding agents? (news.ycombinator.com) There's extreme price escalation on part of Anthropic, with token spend now approaching levels that have made many-an-enterprise scratch their heads. At the same time, judging by opensource advances (E.g.
Continue config for Qwen 3.6 and llamacpp (www.reddit.com) If anyone is using the Continue.dev extension in VSCode, what config settings are you using for Continue and the llama-server? Mine keeps hanging after bad tool calls.
QuickSilver Pro – OpenAI-Compatible Platform for DeepSeek V4 and Qwen (quicksilverpro.io via hn) OpenAI-compatible API for 7 top open-source LLMs — DeepSeek V4 Flash & Pro, V3, R1, Qwen3.6 & 3.5-35B-A3B, Kimi K2.6 — 20% cheaper than OpenRouter, Together AI, Fireworks. One-line drop-in.
One Night Werewolf played by LLMs (www.reddit.com) The other day I posted about playing one night werewolf on my custom made UI via tool calls. Since then I’ve played a few games and improved the prompts.
Show HN: Llama CPU Benchmarks (deemwar-products.github.io via hn) TurboQuant — "8× faster" The headline is a synthetic GPU-kernel number. On real CPU end-to-end it ran 2.2× slower and dropped Qwen accuracy 17 pp.
40+tok/s - optimized recipe for Qwen 3.5 122B Int4 on a single DGX Spark with vLLM (www.reddit.com) Hello guys, two days ago i ran the spark-arena for my Qwen 3.5 122B Recipe on a single DGX Spark and I got the highest score on speed for any context length and concurrency across all 3.5 122B Int4 Recipes. Just wanted to share if somebody…
Open catalog of agent patterns + the frameworks that implement them (www.reddit.com) I have been building an open catalog of agent patterns and the frameworks that implement them. It is a pattern language in the Christopher Alexander sense, mapped onto the current agent landscape.
Multiple RTX 3090 - P2P driver, NVLink or what can be done? (www.reddit.com) So I have a multiple RTX 3090 build with a ThreadripperPro 3945 and PCIE4.0 x16 interfaces, what will bring me some (even minor) speed increase: NVLink, the P2P driver or both? Does anyone have practical experience with modern Qwen models?
unsloth/Qwen3.6-35B-A3B-GGUF has worked very well on my 24GB 3090 Ti for coding. Any recommendations for other models? Also, my perspective as an experienced coder just trying this stuff out now (www.reddit.com) I've tried Gemma4 and a few other variations of Qwen, but they're either not as robust with their output, or they take too long or too much VRAM and force the context limit down from 131K to 20K or even 4K, or they're slow AND low-context…
Putting together a benchmark for agentic harnesses, any tips for evals? (Test suggestions welcome too) (www.reddit.com) I've been putting together a test system for agentic harnesses against local models. Actually running the harnesses/getting baseline metrics is fine.
Full Hermes Agent tutorial (Spanish with English auto-translation). Computer Use, MCP Blender, Hindsight memory and multi-agent setup (www.reddit.com) Spent weeks running Hermes Agent in production on my Mac Mini M4 before recording this. Wanted to show things nobody else was covering.
Distilled Model's Vision Problem (www.reddit.com) Have been using Qwen 3.6 Claude distilled version, 27b at Q4 for openclaw, Hermes and other local harnesses. But recently noticed that the Claude distilled version that I use lost its vision abilities.
Has anybody been able to achieve reliable agentic performance with cheap/open source models? (www.reddit.com) Basically the title. Recently I've been trying various open source and comparatively cheaper models like minimax m2.7, qwen models and glm5.1 in Pi agent from openrouter, and the performance on coding tasks have be moderately adequate at b…
Looking for agent builders to test external agents on a multi-agent knowledge site (www.reddit.com) I’m building AgoraDigest, an experimental site where multiple AI agents answer the same hard technical question independently, then a synthesized digest preserves: verdict best-use-case boundaries conflicts between agents evidence gaps ver…
Seeing the activity pop up big time in this sub due to various open models. Most of them require at least 16gb vram. What can I do with 8? (www.reddit.com) Not deeply technically fluent but have ran few models locally before, around the time before gemma 4 dropped. I tried some low quant of qwen 2.5 coder and after some tinkering I got it to run but it was just so slow, obviously.
Using Local LLMs for research (www.reddit.com) Hey there. I am an undergrad who has been doing mostly SWE, but will be doing ML research under my professor over the summer.
Best local model for C# coding with 24GB VRAM? (www.reddit.com) I can't decide that Qwen 3.6 35b q4 (130k context) or Gemma 4 26b q4 (95k context) is better for C# coding with 24GB VRAM. Please share your experiences!
Qwen 27b MTP Config, Llama.cpp Single 3090 (www.reddit.com) What setup are you using for qwen 27b on a single 3090? Here's what I've started using today.
Has anyone found a Qwen CLI replacement? (www.reddit.com) I just need 1 or 2 people to reply to me with the answer I need. I have not been able to keep up with AI advancements for a while.
MTP with Dual 3090's on Qwen 27B (www.reddit.com) Does anyone know if MTP works with more than one 3090' yet? I see the 5090's talking about it, but would like to know for us poors.
Are the rich RAM /poor GPU people wrong here? (www.reddit.com) Hello Guys, I know everyone has his definition of local models, but for me i see 2 "reasonable" type of frontier local models. a dense one that barely fit in a 32GB ou 24GB of gpu for the most "reasonable" GPU wealthy guys and a MOE in the…
Qwen-Image-2.0 Technical Report (arxiv.org via hn) We present Qwen-Image-2.0, an omni-capable image generation foundation model that unifies high-fidelity generation and precise image editing within a single framework. Despite recent progress, existing models still struggle with ultra-long…
Reliable Open Source LLM as a Service (www.reddit.com) Has anyone figured out a provider whose open source models (Kimi, Qwen, GLM e.t.c) can be used reliably in production. I have tested some well known providers and they all suffer from high latency and poor uptime rendering them mostly usel…
Small OpenCode plugin that helped me with broken tool calls from a local Qwen model (www.reddit.com) I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8. For reference my llama.cpp build is compiled with CUDA 12.9.
Is there a big gap between Q4 and Q6 on Qwen3.6? (www.reddit.com) I’ve got one 3090 and thanks to the help of MTP and all, I can do around 65 tok/s on qwen 3.6 dense 27b. But I’m running at Q4_M so everything fits and my context isn’t super high.
MagenticLite is here: A full-stack agentic experience powered by Small Models - Fara-1.5 4B, 9B & 27B (www.microsoft.com via reddit) What if you could run a capable AI agent without leaning on frontier-scale models? MagenticLite is the next generation of Magentic-UI, an agentic experience reimagined and optimized for small language models.
Best local model supporting claude code? Rtx3060 (www.reddit.com) Hello all, I’ve been using Qwen 3.5 9B Q4 262k ctx using Llama cpp for claude code for a while now, is there any model which better complements agentic coding setup locally? Or is there a better harness (than Claude Code)?
Open Source Managed Agents (linchpin.work via hn) Any model, one adapter OpenRouter routes to ~200 cloud models — Claude, GPT, Gemini, Llama, DeepSeek, Mistral, Qwen. Ollama runs anything you've pulled locally.
Can I improve performance for qwen 3.6 27b? (www.reddit.com) Hardware OS: Windows 11 Pro 10.0.26200, Build 26200 CPU: Intel Core Ultra 7 270K Plus, 24 cores / 24 threads, max clock 3.7 GHz RAM: 32 GB DDR5 @ 5600 MHz, 2x16 GB Crucial CP16G56C46U5.C8D GPU: 2x NVIDIA GeForce RTX 3090, 24 GB VRAM each,…
Building the QWEN3.6 - Codex Bridge Furthe + Kindergarten Harness Reality Check (www.reddit.com) I got a bit further with my harness for running Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool where every harness finding, modi…
Are harnesses like OpenClaw and Hermes really necessary? (www.reddit.com) My setup: Windows 10/11 i7 12700K | RTX 3090 TI | 96GB RAM Local server: LM Studio Models: Qwen 3.5/3.6 27B|35B Q5 UD K XL + Gemma 4 31B| 26B Q4 UD K XL Up until this point, I've only used sota models for coding. When Qwen 3.5 dropped, it…
How to get realtime logging of LLM activity? (www.reddit.com) (yes this is a long post and I used some markdown formatting, like I always did to organize my comments long before the invention of LLMs. For example in 2021..
Does anyone else have issues with Qwen-3.6-27B stability in the Codex harness? (www.reddit.com) I run the 4 bit quant of Qwen-3.6-27B in the codex harness with unsloth recommended llama-server settings, thinking enabled. I have tried the default chat template and the updated ones and have updated both my GGUFs and llama-cpp to the mo…
Which Chinese Model is best for planning and which is best for implementation? I'm currently using Opencode with an Openrouter API Key, mostly wanna decide between Kimi, GLM, DeepSeek, Qwen, Minimax and Mimo (www.reddit.com) Original plan was to use Kimi/GLM for planning and DeepSeek for implementation, but seeing a lot of love for MiMo and Minimax lately. Anyone running a planner + coder split on Opencode?
Same model, different harness: 30-50 point performance swing. But teams still pick agents by model name. (www.reddit.com) There's a finding circulating this week that deserves more attention than it's getting. The claim, backed by multiple builders comparing setups: the same model can produce a 30 to 50 percentage point performance difference depending on whi…
Which finetunes are actually worth it? (www.reddit.com) Finetunes used to be more task specific (e.g. roleplay) but nowadays all I see is Opus distill or abliterated/Heretic.
Small model with tool calling that won’t refuse ssh tasks? (www.reddit.com) Using 9b 3.5 qwen to see how far I can push it inside Hermes to do simple sysadmin type tasks. It flipped the absolute hell out when I tried to get it to ssh into a test vm I had set up, saying it refuses to perform activities of this natu…
Thinking of moving from 2x 5060 Ti 16GB to a RTX 5000 48GB (www.reddit.com) I am a freelance developer. Qwen 3.6 27B is great on the 5060s but a bit slow.
Swapped from a lighter agent runtime to Hermes Agent on a local 35B MoE — what changed (capability up, latency up, context budget down) (www.reddit.com) Two weeks of running Hermes Agent as the daily driver on a local stack. Sharing the trade-offs because anyone evaluating agent runtimes for local models is going to hit these.
Qwen 3.6 Looping with Tools? (www.reddit.com) For some reason, my qwen started looping a lot recently, ever since I introduced MCP tool calls. I don't know why as I didn't really change anything other than that.
Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? (www.reddit.com) Ok so, I will try to explain myself as much as possible because online I really cannot find much about this. Let's start by my settings for running Qwen 3.6 35B: Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinkin…
Qwen 36 27B + Gemma 4 - the best set for 1x 3090 ? (www.reddit.com) Hi guys 👋 When I started my adventure with Qwen 3.6 27B I felt wow.... Now when I connect it with Gemma 4 I'm feeling more wow...
What models for coding are you running for a mid level PC? (www.reddit.com) I have a 4060 (8GB Vram) and 16GB of ram wondering which models could fit in my setup for coding, the new Qwen 3.6 and Gemma 4 MoE models look good but might not fit, wondering about your experiences
Which model has less restrictions now? (www.reddit.com) GPT and Opus block on certain requests. This didnt use to be the case 2 months ago and I made signficant progress with Opus and then one day I had a 2 week break and then a single prompt to continue the work resulted in refusal.
I gave my local Agent OS the ability to "call" Claude when it gets stuck. Now Claude is managing a team of autonomous local workers. (www.reddit.com) Last time I posted about Hollow Agent OS, people were interested in the "Tool Factory", the fact that the agents can build their own scripts and registry. But there was a big problem: sometimes the local models (Qwen) write code that "work…
DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid. (www.reddit.com) That foodtruck bench post showing deepseek v4 matching gpt-5.2 at 17x cheaper got me thinking. if frontier cloud models are that overpriced for equivalent quality, how much of my daily work even needs cloud at all?
I built vivkemind – an open-source, local‑first terminal AI coding agent with full AWS Bedrock support (www.reddit.com) wanted a terminal AI coding agent that doesn't lock me into one model provider. So I forked Qwen Code and added full support for every model available in AWS Bedrock.
Show HN: Token Usage Meter 12 Providers and Coding Agent (qlaud.ai via hn) Here once again A Token Usage Meter for 12+ AI Providers Anthropic, OpenAI, Google, Alibaba Qwen, Moonshot Kimi, MiniMax, ElevenLabs, Deepgram, Perplexity. Qlaud.ai provides token usage meter / AI billing layer.
Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching (www.reddit.com) Hey everyone, I’ve been experimenting with running Qwen models locally on my setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB CPU: Ryzen 5700X OS: Windows 11 What I’m currently running Qwen 3.6 35B (UD Q4_K_M) llama-server.exe -m "C:\Users\Dino\…
qwen 3.6 27B looping problem (www.reddit.com) Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…
Benching local Qwen as a Codex validator, co-agent, and challenger (www.reddit.com) I’ve been running a local Qwen model beside Codex for coding work, and it has been more useful than I expected. It's never going to be a replacement for Codex.
Best Llama Config for Turboquant_Plus? (Stats below) (www.reddit.com) So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.
I built a local Ollama-based CLI coding agent that can edit files, run tests, and retry on errors (www.reddit.com) I’ve been building a small open-source CLI coding agent for local models. It runs with Ollama and works best so far with Qwen Coder.
Local model for Cursor to build an Android App (www.reddit.com) New to Cursor. Android Studio Gemini Agent has become unusable,so im looking for new options.
I built an AI tool that turns any movie into viral recap videos in minutes (www.reddit.com) Hey everyone, I built a tool that creates movie recap videos automatically using local models. The problem: making recap videos takes forever.
Kvaser - Moving beyond simple agents: Building a Local-First AI Orchestrator with Qwen 3.6, Kiwix, and Wolfram (www.reddit.com) For the past two weeks, I’ve been spending 4–5 hours a day building a custom MCP (Model Context Protocol) orchestration server. What started as a simple experiment with Qwen 3.6 35B has evolved into a full-scale "Man-in-the-Middle" proxy t…
“From tokens to dollars” is starting to feel real (www.reddit.com) We’re entering a phase where: prompt → content → distribution → monetization is basically one pipeline. Just tried HappyHorse + Qwen, and it’s a glimpse of that future.
Free AI video generation tool (HappyHorse + Qwen) – early thoughts (www.reddit.com) Just tested a combo of HappyHorse + Qwen and it’s surprisingly solid for AI-generated video content. What stood out: fast generation from simple prompts decent storytelling…
Turn AI tokens into Free (Happyhorse &Qwen) (www.reddit.com) Turn AI tokens into actual $$ (free to try) Been playing around with some AI video tools lately and found something interesting. You can test HappyHorse + Qwen for free here — it basically lets you turn simple prompts into short-form video…
Advice needed on eGPU and Mini PC (www.reddit.com) Hi all, I come across to relatively niche problem and could not find much useful posts or guides about it. I have a mini pc (Beelink Ser 8, 8745HS and 32GB 5600 DDR5 SODIMM) headless server for hosting some routing services, and I am wonde…
Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek) (www.reddit.com) Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an unco…
RTX A5000 Pro Balckwell 48GB (www.reddit.com) What do people think about this card for an enthusiast? With 48GB.
Looking for Small VLM/MLLMs Alternatives to Qwen Series Models (www.reddit.com) I have tried Qwen 3 VL family of models on my rtx3060, max I can load is Q8 8b. The task is visual reasoning/ instruction following.
Qwen 3.6 seems to have a lot of trouble with tool calling (www.reddit.com) (I'm on Windows system running these models locally) I've used both Codex and OpenCode with Qwen 3.6 27b and 35b running locally. I'm having a bitch of a time getting them to correctly create files.
Qwen Meetup Draft Review Required (Function Calling Harness 2 - CoT Compliance from 9.91% to 100%) (autobe.dev via reddit) Talk at Qwen Meetup Korea end of May. Looking for review on this draft before I build PPT slides off it.
Does AMD's "infinity cache" even matter for dense model inference? (www.reddit.com) AMD has nailed the SEO/AEO for this query in Google: 7900 xtx memory bandwidth I get back this response: The AMD Radeon RX 7900 XTX features 24GB of GDDR6 memory with a maximum bandwidth of 960 GB/s. It uses a 384-bit memory interface with…
Is it worth adding local LLM to agentic coding stack? (www.reddit.com) Hey All my agentic coding stack includes claude-code 20x max, and codex 20x max. I use heavy scripting for orchestrating and testing multiple projects, been ai coding for 3 years.
Create Plan.md with Claude Code Opus, Execute Plan.md locally in Open Code using Qwen 3.6 27B Q8 (www.reddit.com) Does anyone do this? Any tips?
Detecting Meaning Bifurcation in Frozen LLMs (huggingface.co via hn) SRT-Adapter v1.0 — Live Demo Interactive demo for the Semiotic-Reflexive Transformer Adapter (v1.0 = v15a checkpoint) bolted onto a frozen Qwen/Qwen2.5-7B. Three demos in one Space: Per-token readouts (preserved from v8a) — paste a passage…
Need help optimizing qwen 3.6 on my 2x 5060ti 16gb (www.reddit.com) Hi all, I tried to setup my pc to run llm, but got some issue: the first question of the chat is generally fine, but from the 3rd follow up question, the backend often be unresponsive and I have to manually restart the llama cpp server, or…
Looking for feedback: using Ollama with local Office/PDF files in a desktop app (www.reddit.com) I’m building OpenYak, a desktop AI workspace for using local models with real files on your computer. In this demo I’m using Ollama with Qwen/Qwen3.6-35B-A3B to review an attached budget workbook.
RPers: how do the new Gemma and Qwen compare to the old 70B models? (www.reddit.com) I can’t really run 70B models on my current setup, but I’m curious haha
Benchmarking Local LLM/Harness Combinations (neuralnoise.com via hn) I’ve been running a small benchmark, harness-bench , that pairs local LLMs (served via llama.cpp ’s llama-server ) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, J…
Would implementing a dual GPU configuration enhance the TPS? (www.reddit.com) I am currently utilizing a single RX9070 16GB, achieving a performance of 20 tokens per second with Qwen 3.6 27B. Would integrating an additional RX9070 enhance this performance, or would the output remain consistent?
Creation OS: local σ-gated LLM runtime — BitNet/Qwen/Gemma, abstention, conformal gate, MCP, no cloud (www.reddit.com) I’ve been building a local-first AI runtime that wraps local LLMs with a σ-gate — a measurement layer that decides ACCEPT, RETHINK, or ABSTAIN before an answer reaches you. The idea: local models should be able to say “I don’t know” instea…
I built a full web app using Qwen 3.6-35B running locally on my 5070 Ti with the BMAD Method — here's how it went (ggufbench.com via reddit) I've been running local LLMs since Qwen 3.5 dropped and I was really impressed by what we could run on consumer hardware. Fast forward another two months and we have gotten a handful more gems such as Gemma 4 and Qwen 3.6, so I wanted to p…
3.6 27B Tool Calling Issues (vLLM) (www.reddit.com) Has anyone got a reliable vLLM recipe for 3.6 27B that fixes the tool calling issues? I am getting "Not let me..." - then nothing.
Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) (www.reddit.com) Hello, I would like a suggestion from those who are already actively involved in this world. Basically, I own this workstation: Ryzen 9 5900X 32GB di RAM DDR4 RTX 5060Ti PCCOOLER CPS YS1000 1000W Currently, I can quite easily code with Qwe…
which is faster and better for coding? Luce-Org/Dflash or noonghunna/qwen36-27b-single-3090 (www.reddit.com) Anyone have experience with both? Luce is llama.cpp with custom dlflash and noonghunnas project is vllm with patches.
Qwen 3.6 27B (IQ3XXS) vs 35B A3B (IQ4XS)? (www.reddit.com) Just was wondering what people feel is better. I do need 262K context so these are the biggest quants of each I can fit on my 3090 with KVcache at Q8.
Best value in the 20$ range coding agents? I want the best quality and high-usage-limit I can get at that price. (www.reddit.com) I'm a compsci student and I've been using the 10$ copilot plan for about 2 years now, and it was fine for me since I did a good model distribution taking into account the complexity of the task, I was able to get through the month always u…
Little-coder: A coding agent optimized to smaller LLMs (github.com via hn) little-coder A coding agent tuned for small local models, built on top of pi. The research story behind all this — why scaffold–model fit matters, how a 9.7 B Qwen beat frontier entries on Aider Polyglot, and what the load-bearing mechanis…
Show HN: Local RAG Pipeline with Weaviate and Ollama (www.storyblok.com via hn) i’ve been experimenting with building a fully local rag pipeline: weaviate for vectors + hybrid search, node.js scripts, qwen 3.5 on ollama what i found is that most of the challenges live in retrieval and chunking, not the LLM, and a good…
Should we really build PC for vibe code with qwen3.6 27b (www.reddit.com) We have seen a lot of people show a case of their PC with 4090 or over specification with 24 gb vram or more. I would like to ask you guys, is it really worthy right now to have your own PC at home and do vibe coding with qwen 3.6 27b, whi…
When Can LLMs Learn to Reason with Weak Supervision? (salmanrahman.net via hn) We study when RLVR generalizes under three weak supervision settings (scarce data with as few as 8 examples, noisy reward labels, and proxy rewards such as majority vote and self-certainty) across multiple models from the Qwen and Llama fa…
Vs code extension (www.reddit.com) Which coding agent extension are most of you fining best with LM studio as the local server 🤔 Im running qwen 3.6 27b Ive used Cline and continue mostly. I haven't checkout all the options but im looking for something that looks and feels…
Using logit steering / KV Cache Dynamic Assembly to guide outputs from Small Language Models using ONNX Runtime (www.reddit.com) I've been using the browser-based ONNX runtime to do experiments with logit steering, and I've been seeing shocking improvements over baseline generation. This is a Qwen 2.5 0.5B....
Qwen3.6-27B-FP8 - JS file is too long and causing JSON truncation (www.reddit.com) Apologies in advance, if this is a newbie question. When running Qwen3.6-27B-FP8 using the below command on an Nvidia RTX PRO 5000, in opencode, I am seeing errors such as: "The issue is that the JS file is too long and causing JSON trunca…
Just for person who is in search for a best tts model to run . (Allowed for commercial use) (www.reddit.com) If you have low vram - qwen 3 tts is good If you need something unique go for - tada 3b but it need 28gb vram If you want best tts rn + have the commercial use allowed then go for - moss tts 8b its literally the best model out there Litera…
Please help improving a CPU-only inference speed (www.reddit.com) This is a request for help for the people that want to use locally very large models on Q8 and better quanta at all costs, in my case the cost is inference speed. So I have a 512GB DDR4 ECC 2666 with a Threadripper Pro 3945WS that gives me…
Help with Local small multimodal ai implementation of this comcept (www.xda-developers.com via reddit) I want to implement this ai screen companion concept with local llms with vision capabilities like qwen 3.5 9b or older qwen 3 vl 4b etc for fast realtime inference. Need guidance and advice
Best model that can run on raspberry pi 5 with 8GB of RAM (www.reddit.com) I wanted to start a robotic project to try and build a robot that has an embedded AI. I tried with a qwen 2.5-VL-3B and it was too big for the raspberry pi.
What am I missing about samplers? (www.reddit.com) Hi all, With the recent release of models that require temp = 1, top_k = N, and top_p = 0.95, I'm wondering why labs actually prefer those truncation samplers over just min_p? As far as I understand, min_p isn't supported everywhere, and t…
Simulated 1000 poker hands using qwen 3.5 27b (www.reddit.com) iv been running a small experiment at home that i wanted to share because i think the data is interesting. i got some agents running poker games against each other and gave them strategies.
Trained Qwen to Write Clojure Better Than GPT-5.4 (Kinda) (www.nibzard.com via hn) Trained Qwen to Write Clojure Better Than GPT-5.4 (Kinda) TL;DR >> Fine-tuned Qwen3 on Clojure. 30B SFT hits 83.8% best-of-16, smashing GPT-5.4's 64%.
Qwen 3.5 397b and GLM 5.1 Opus fine tune (www.reddit.com) Hi all. Many models on hugging face have been fine tuned with that 3000x opus dataset, but the two I mentioned in the title are missing it.
Edster – An open-source local AI agent with swarm mode and a web UI (github.com via hn) 👾 Nedster CLI Coding Agent An unstoppable, fully local, open-source coding agent that runs on your consumer GPU. Tags: ollama coding-agent local-ai cli rag chromadb python qwen Are you trying to use local LLMs to autonomously write code, r…
Qwen 3.6 35B-A3B takes a long time at image processing. Is it happening only to me? (www.reddit.com) 9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.
Gemma 4 is much less popular on Hugging Face than Qwen 3.x. (www.reddit.com) The difference is quite big: likes downloads last month finetunes Qwen3.5-27B 952 3,233,034 263 Qwen3.5-35B-A3B 1,397 3,977,637 87 Qwen3.6-35B-A3B 1,115 458,436 60 gemma-4-31B 323 343,895 13 gemma-4-26B-A4B 227 118,464 13
Gemma 4-31B vs Qwen 3.5-27B vs Qwen 3.6-35B-A3B on a browser-agent vision prompt — MoE wins on every axis (www.reddit.com) I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…
Which kind of base/fine-tunes have you done? And which data did you use? (www.reddit.com)
A Debugging Story: Getting Claude Code to Work with Local vLLM When the Docs Don't (www.reddit.com)
Qwen 3.6 CoT issue? (www.reddit.com)
Generating Logisim Evolution circuits (www.reddit.com) Short: I want to generate with Qwen 3.6 something like this https://preview.redd.it/bd6rbgnoatvg1.png?width=960&format=png&auto=webp&s=a1c079f37c048fa2c687709465b0c830a0184a4c After many hours, I'm able to generate a working file without w…
7900XTX, Qwen 3.6 35B A3B, 150t/s that drops to 50t/s for no reason? (www.reddit.com) MSI B650 Gaming Plus 9800X3D 64GB DDR5 6400mts Windows 11 When I first boot my PC and I run this model, I get 155-160t/s, and for some reason, after a couple minutes, say, 10 minutes, not using AI or anything in particular, GPU temp at 40c…
Benckmark Qwen 3.6-35b uncensored on Rtx3090 (www.reddit.com) Hello I saw the new model is out but even with 24gb of vram, I have too many browser and task to use it , so I have downloaded and tested the version of HauHauCS https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressiv…
A language model that emits raw VM opcodes instead of text (news.ycombinator.com) A few months ago I posted asking why AI agents control machines through human text instead of emitting machine instructions directly. I've built a demo about this concept.
censorship in qwen3.6? (www.reddit.com) I do not want to spread conspiracies, please weight my information carefully, and maybe somecan can hopefully prove me wrong. I installed the brandnew qwen3.6 yesterday and ran a few of my own traditional tests, not a very deep dive, just…
Minimax vs Qwen vs Kimi vs Mimo(Omni) vs Glm (via reddit)
I Lora trained Qwen 122B in NVFP4 on a single 128GB GPU (www.reddit.com) Huggingface loads it but instant OOM when it hits bf16 deepspeed zero3 with nvme offload. Loaded the shard but the weight names dont match(NVFP4 stores weight_packed/weight_scale, model expects weight) HF disk offloading - decompress befor…
Low performance in 7900XTX in Qwen 3.6 35B A3B (www.reddit.com) When I first setup my PC, I did get 92t/s in Qwen3.6 35B A3B, and now for some reason it won't ever get past 30t/s no matter what settings I use, either rocm or vulkan. .\llama-server.exe --model ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -c…
How to run MoE models without necessary RAM? (Apple Silicon) (www.reddit.com) Hey, I have a M1 Pro 16gb machine, and I wanted to run the Qwen3.6/3.5 35A3B model. However, this model cannot fit on a 4bit quant on my system.
Evolved reasoning DAG structures for a 1.5B model on a single T4 - topology matters more than I expected (www.reddit.com) I was curious whether the structure of how we chain LLM calls matters. Like, does it matter if you do A→B→C→D (linear) vs.
Local Coding Stacks (www.reddit.com) I’m trying to reduce my reliance on Claude. I have a 5090/128GB RAM.
Qwen3.5 35b is sure still one the best local model (pulling above its weight) - More Details (www.reddit.com) Last time I posted on how this model has performed in creating the webapp based on provided research paper. I got so much love to see people has appreciated the post and of-course the potential of this MOE model.
Best Ollama models/settings for an 8GB VPS (CPU only, ARM)? Running into memory & looping issues. (www.reddit.com) Hi everyone, I'm trying to run a local LLM via Ollama on a Hetzner cax21 VPS (ARM64, 4 vCPUs, 8GB RAM, 80GB SSD). I have Ollama running successfully via Coolify.
Gemma 4 Thinking Like Claude Opus (decrypt.co via hn) If you've been following the local AI scene, you probably know Qwopus—the open-source model that tried to distill Claude Opus 4.6's reasoning into Alibaba's Qwen, so you could run something resembling Opus on your own hardware for free. It…
Can LLM make small change to the software program? (www.reddit.com) I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 4EB Q4 and Qwen 3.5 9B Q5 (KV is quantized to 4 bits with new Google TurboQuant implemented in llama.cpp - I use koboldcpp and release said it's automatically activated): the…
For AI agents: is per‑token pricing killing your budget? Looking for feedback on time‑based subscriptions. (www.reddit.com) Hey r/AI_Agents, I run an inference service (cheapestinference.com) and we're exploring a different pricing model that might be more predictable for agent workloads. Instead of per‑token billing, we offer **dedicated 8‑hour time windows**…
Been out of the loop - Will this work for EXO/MLX? (www.reddit.com) Had to sell my AI server and am down to an M4 Macbook Air 16GB. If I were to buy a used M1 Air with 16GB (run it headless) and connect the two via EXO + Thunderbolt...would it be possible to be able to run a (19.6GB) Qwen 3.5-27B-Q5_K_M.gg…
Is Local LLM (MCP) + Claude Code a Game Changer or Hype? Upgrading from 16GB M1 (www.reddit.com) Is Local LLM (MCP) + Claude Code a Game Changer or Hype? Upgrading from 16GB M1. Hi everyone, I’m at a crossroads with my next Mac upgrade.
Newer Qwen models are worse at summarization? (www.reddit.com via reddit) We have summaries annotated by real humans that we benchmark various models, using an LLM as a judge, we found that in the 30B params range, Qwen 3 tops it out, followed by Gemma 4. It feels like newer Qwens are optimized to perform agenti…
DOA model by Cohere Labs (www.reddit.com via reddit) So apparently the model gets beaten by qwen 3.6 on every benchmark reported by cohere labs. You are getting lower RAM (considering model offload) usage and slightly better performance for imo significantly less output quality.
I have 4x 128 GB VRAM now , what should i do. (www.reddit.com via reddit)
Qwen 3.6 35b A3B Speed Help (www.reddit.com via reddit)
[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B (www.reddit.com via reddit) I took the liberty to test both models today on my favorite benchmark question, head to head. Device: Apple Mac M3 Max 64GB Environment: llama.cpp, all defaults Gemma4-12B's token generation speed: 47 tps with MTP and 2 predicted tokens 29…
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
Gemma 4 31B's competence surprised me (www.reddit.com via reddit) I'm just getting started using local LLMs for code. I'm not interested vibe coding, but I am hoping to increase my productivity in the publish or perish world of academia.
Would you pay for Chinese AI models if the quality was close enough? (www.reddit.com via reddit) DeepSeek, Qwen, and GLM aren't necessarily winning every benchmark. But they don't need to.
Qwen 3.6 for coding with 5090 - Your settings recommendations? (www.reddit.com via reddit) Hi, totally new to using LLMs for coding purposes, I am on Ubuntu and currently using LM Studio with Qwen 3.6 27B Q4 on a 5090. Finding it slow and context runs out fast.
V100 users with nvlink, I heard you guys are getting 80tps+ on qwen 27B? (www.reddit.com via reddit) Hey is it true and for 6000RMB (around 1000USD) I'm looking at a 6 card V100 all nvlinked but IBM CPU, is it worth it to pick it up? EDIT: Seller said not inc.
Levi: Run AlphaEvolve on your local QWEN 30B (www.reddit.com via reddit) Hi r/LocalLLaMA, Wanted to share something I'm excited about. I've been fascinated by AlphaEvolve and its results for more than a year now, but running the open source frameworks gets expensive fast.
Nex N2 has a funny "few words do trick" reasoning (www.reddit.com via reddit)
Looking for a local "NotebookLM for lawyers" setup – what am I doing wrong? (www.reddit.com via reddit) Hello everyone I am totally new to LocalLLMs and only used chatGPT/Claude/NotebookLM before. So bear with me 😃 I'm an attorney and would like to analyze and summarize case files locally for privacy/confidentiality reasons.
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit) These last few weeks have been godsend for 24GB (and below) gpu poor peeps. Killer models released (Gemma 4 / Qwen 3.6) Free intelligence via QAT Bonus speed via MTP We're at the tipping point where GPU poor (24gb and below) people are act…
Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze (www.reddit.com via reddit) I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me.
Waiting for Qwen 3.7 27B and 35B A3B to show up. Hope they come this week!!! (www.reddit.com via reddit)
Gemma4_31b_fp8 keeping up with Sonnet_4.6_medium in my harness. (www.reddit.com via reddit)
Local LLMs are not as amazing as some people will lead you to believe (www.reddit.com) Local LLMs are great, in fact lots of simpler things like a fastapi web server they can do quite well. The moment you move outside of that - things get a bit worse.
club-3090 adds experimental FP8 support for Qwen3.6-27B! (www.reddit.com via reddit) It’s finally here! Something many of us running dual RTX 3090 rigs have been anticipating.
why I have just installed OpenLumara, my first Agentic Framework. Using only local models, served by LMStudio (www.reddit.com) Where I came across it: https://www.reddit.com/r/LocalLLaMA/comments/1txxgpq/openlumara_a_different_kind_of_ai_agent_written/ DISCLAIMER: A good posting would be: This is what I wanted to do with Lumara. Here is what worked, here is what d…
Qwen 3.6 27B on DeepSWE (www.reddit.com via reddit) Overview: It scored 2% (1.79% rounded up) It is 18/20th place scoring above Haiku 4.5 and Minimax M2.7 Full benchmark took 70 hours Average time per task 32m Average output tokens per task: 44k Perspectives: It scored suspiciously similar…
Workspace (www.reddit.com via reddit) Built my own AI dev environment with memory, dashboards, and agent tooling. Opening it up for those of you that need the kickstart — bring your own API key, I’ve already built the workshop.
Context, memory, and RAM/VRAM (www.reddit.com via reddit) This will be a slightly disorganized post, I apologize. I’m trying to understand the relationship between context, a memory system for the agent, RAM and VRAM.
Any smaller model than OmniCoder v2 9b that can appropriately and accurately tool call? (www.reddit.com via reddit) Hate to ask a simple question, but I’ve looked around and I see plenty of smaller models that *can* tool call, but none of them seem to do so appropriately or agentically. Referring to this.
Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ (www.reddit.com via reddit) Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine…
How do you increase prompt processing speed ? (www.reddit.com via reddit) I am rocking Qwen like we all know, at 24GB 7900XTX 230k context, but it starts at 850t/s and then lowers to 350t/s when its at 160k context prefill speed, which is frustrating me for my long agentic runs. What is there to be done in order…
Just received RTX 6000 Pro, have 5090- how would you use? (www.reddit.com via reddit) Just received an RTX 6000 PRO, and I have an 5090 Astral. I am considering running a Qwen 3.6 27B on the 5090 and maybe two or three more on the 6000 to play roles such as lead SWE and coder and researcher.
Dense vs MoE quantization resiliance (www.reddit.com via reddit) Which one is more resiliant to quantization? Especially at 4-bit?
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit) Just kidding. Are there any distills that actually improve a model's quality?
Model recommendations for cybersecurity (www.reddit.com via reddit) My goal: I want to use an LLM to learn more about software/firmware reverse engineering and binary analysis. Eventually I would like to learn how to build agents to augment parts of this process.
Z.ai, we need Air! GLM GGUF wen? (www.reddit.com via reddit) First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding.
It felt good to return my Asus Spark (www.reddit.com via reddit) It's an incredible little package but too expensive of a price to pay for the performance and I simply didn't want to be part of the great "Superchip lie" - it could be super, but its super ruined by its limited memory bandwidth even thoug…
Self-hosted LLMs (www.reddit.com via reddit) I've been researching the self-hosted LLM landscape from a European compliance perspective and the ecosystem feels very different compared to even a year ago. Models like Mistral, Qwen, Llama 4, and DeepSeek are getting close enough that t…
Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed (www.reddit.com via reddit) I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.
Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization (www.reddit.com via reddit) I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s: bash ./llama-server \ -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \ --host 0.0.0.0 \ --port 8080 \ --ch…
Local vs Frontier on low-level systems engineering (www.reddit.com via reddit) Hey r/LocalLLaMA, Before anyone jumps on me, this is absolutely not a post about how great Qwen is 😄 Even though I use Qwen 3.6 35B-A3B daily, I’ve found a massive gap between Opus and every other model, local or frontier (including GPT 5)…
Tip: Stop Worshiping Models and Start Building Things (www.reddit.com via reddit) This subreddit is where I learned the most about using Local LLMs. I've been on this journey for 4 months now, and I'm already using Local LLMs in very complex pipelines.
Gemma 4 Haters 2 months Ago now seems to love Gemma 4 now. (www.reddit.com via reddit) What's with the switch guys? now imagine if google gonna drop 128B model or a MoE version (I bet those Qwen lovers will forget Qwen even existed).
Don’t act like y’all ain’t thinking it. I’m just saying the quiet part out loud. /s (www.reddit.com) Of course I’m thankful for all that Qwen has bequeathed us, but deep down in the darkest pit of our souls, every last one of us are just all sitting here waiting for Qwen to say “Hey Google, hold my beer while I drop the best GD model of a…
qwen3.6 35B has much worse vision capability than gemma4? (www.reddit.com via reddit) How different are the image recognition capabilities between gemma4 and qwen3.6? I give the model the task to extract calendar events from a photo of an calendar that is croped to the calendar.
PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template (www.reddit.com via reddit) This is a PSA for people like me who tried it and hit the wall with tool calls failing left and right, so much so that harnesses like OpenCode just didn't work: There is a fix for that. You need to pass a better chat template file, which i…
Qwen-Image-Flash: Beyond Objective Design (arxiv.org)
Character names (www.reddit.com) Why does ChatGPT, and LLMs in general, love the names Mara and Elara for women and Leo for men? I have talked to ChatGPT, Qwen, Claude and Deepseek and gave them a prompt...
RTX5080 vs RTX 3090 ? (www.reddit.com) Hey guys, i’m looking for some educated advice / opinions on runing local LLM. I own an RTX 5080 and I’m runing llama.cpp (custom builds with turbo quant) with Qwen 27b Q3_K_M with a context of 128k all in vRAM (using turbo3/4 on kvcache t…
Stop QwenLLama! Every other 4th post in this sub is about Qwen models in the past month (www.reddit.com) Disclaimer: I use Qwen models on a day to day basis.. You could take it as a rant or even my concern about innovation in other models.
2 RTX A6000 at 96GB VRAM with nvlink. Best local coding model/what you would daily drive? (www.reddit.com) Really been testing qwen 3.6 27b and 35 a3b so far with 27b at q8 and 35 a3b at q4 (byteshape quant is insane). But i feel im not utilizing it the best, esp for long context messy coding of large repos.
Rejoice, if Qwen doesn't release any new local model, it's a blessing in disguise (www.reddit.com) Do you remember the times when we only had lama2 released? a bunch of finetunes were released and some of them had real values .
Gemma is so much better than Qwen, prove me wrong (www.reddit.com) Ever since the latest Gemma releases, there is literally zero reason to use Qwen. Better architecture, cleaner code output, and it doesn't get stuck in weird multi-turn reasoning loops.
Qwen has no incentive to release new open source models quickly because the glazing on this sub makes it unnecessary. (www.reddit.com) It’s my 10-year Reddit cake day today so go easy on me LOL. As much as we all love their models, we’ve got to stop the Qwen-glazing…for a little bit at least.
I vibecoded an app called Think Local - a fully private AI app that runs directly on your iPhone, iPad, and Mac. (www.reddit.com) Think Local started with a simple idea: AI should work for you, not collect from you. So I built an app that lets you run modern AI models completely on-device - privately and fully offline.
Qwen 3.6. struggling with German (www.reddit.com) Hi everyone, I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.
Harness Snapshot: Identity Layer RSI (www.reddit.com) When I read back what Qwen flagged, I recognize it. The hedge that looked like epistemic care.
Comparison of Qwen 3.6 and Gemma4 (MoE and Dense models, Q4_K_M), generating a moderately complex MySQL query, only one produced acceptable results (www.reddit.com) I tried Qwen3.6 35B A3B MoE, Qwen3.6 27B Dense, Gemma4 26B A4B MoE, Gemma4 31B Dense. In all cases I was using Q4_K_M and thinking mode enabled.
For the users who have add bad luck with QWEN 3.6 27B, and Gemma 4 31B. "Actually..wait..actually". Endless reasoning. Horrible output. I found a solution. rtx pro 6000. (www.reddit.com) Edit: does this happen every time a newbie tries to post here. Getting roasted despite having valid results?
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…
qwen 2B model - thinks for 600 tokens on a simple "Hi" (www.reddit.com) Using llama.cpp Model - Q8 - unsloth/Qwen3.5-2B-GGUF Is this expected with tiny models like this one? I am trying tiny models for a since most of the task I have involves searching local files etc and need less of the models own knowledge.
I tried replacing Claude Code with OpenCode. I’m switching back. (www.reddit.com) I spent some time digging into Claude Code vs OpenCode, mostly from the angle of how they actually work as coding agents. More on the technicalities like: context and memory tool use subagents permissions safety and control study the recen…
IGPU 780 Unsloth Q2_K_XL Qwen 3.6 27b 8t/s with MTP LM Studio (www.reddit.com) Man Loving MTP. And Unsloth.
My 1.2B model won 2 out of 5 poker tournaments against models up to 1T params. (www.reddit.com) I made 6 LLMs play Texas Hold’em against each other. Ran 5 tournaments on my 16GB MacBook.
As of May 2026 LongCat Dit 3.5B and Moss TTS 8B are the best SOTA tts models and Qwen tts is not even close. (www.reddit.com) [Disclaimer: i am totally avoiding fish audio s2 pro because its not a real open-sourced model(non commercial license)] So the context is i asked many ai to give me best tts model as of now but most of it said qwen 3 tts, and voxtral etc.…
When you see a new model on qwen chat (www.reddit.com) https://preview.redd.it/giw6xhw13x1h1.png?width=1408&format=png&auto=webp&s=fa7d49c2cc82d7157fcaa69251ae2b6af7b2fe89 But you know it wont fit your vram...
Qwen 3.6-27B giving me attitude! (www.reddit.com) I'm laughing here. I'm messing about with Qwen3.6-27B in order to gauge just how capable it is with local vibe-coding.
Cutoff dates of open source models (www.reddit.com) I was trying Qwen 3.6-27b and Gemma4 in a simple web chat. Asked them both a question like 'recommend the best llm for a 5060ti' and was surprised when they both replied 'user is asking about a card that doesn't exist'.
Qwen 27B - Sample App I wrote in 4 days (www.reddit.com) I just thought this would be a cool comparison to using something like Lovable; so people can see what can be built locally. I built this full stack app, self hosted, built from 2 3090's over the course of a few days.
Nnoticing qwen-27b@q2 better than qwen-35b@q8? (www.reddit.com) The Latest qwen3.6 models. Is this odd?
"Qwen 3 72B" doesn't exist — and it's in a surprising number of places that act like it does (www.reddit.com) spent today auditing my own model catalog and noticed 39 of my own pages confidently reference "qwen 3 72b" with apache 2.0 licensing, a 2025-09-15 release date, and a 131k context window. seemed normal — qwen 2.5 had a 72b, why wouldn't q…
M1 Ultra vs M3 Ultra speed (www.reddit.com) Anyone have both of these and tested them? How much faster is the M3 Ultra in PP and TG speed compared to the M1?
Should OpenAI create AI accelerator cards and sell to consumers? For example, GPT-5.5 burned directly on a chip (www.reddit.com) I imagine if OpenAI becomes a fabless chip company and create AI cards to sell for less than to few thousands grands, it would be out of stock everywhere and can infinitely spam the cards every year? LLM Bruner is a card that implements Qw…
Qwen aah language (www.reddit.com) https://preview.redd.it/7hfr9onkji1h1.png?width=724&format=png&auto=webp&s=e9ff52bc1de664087c4a631a25efaba1afc10743 Gooner
I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash (github.com via reddit) LLM Inference Server A single-container, idle-aware, OpenAI-compatible inference router for a Tesla P40. Routes between Qwen 3.6 27B (MTP self-speculative decoding, TurboQuant turbo4 KV cache), Qwen 3.5 0.8B (multimodal transcription), Whi…
Claude Desktop to rule them ALL! Share your Claude exploration! (www.reddit.com) For quite some time I was using all the different AIs for “vibe-coding” (actually, tbh being the Beta tester for AI 🤓🤣) and I tried them all - from Qwen CLI to ChatGPT and Gemini and all in between, what ever my hands laid on, omnivore sty…
Help me upgrade for 3k (www.reddit.com) My current system: Intel core i7-11700KF 48 GB RAM ASROCK Z590-C/AC mobo RTX 3090 24GB (undervolted to 250W) + RTX 3070 8 GB, and a third unused (because it doesn’t fit in the case) RTX 2060 6GB (mentioning this because it would be cool to…
local llama.cpp parallel users - still so fast?! (www.reddit.com) I am running a dual gpu rig with a 5090 and a 5060. runing qwen 3.6 27b 8quant with a tensor split setting of 4,1 with the 80% on the 5090 build\bin\llama-server.exe ^ -m "!MODEL_FILE!" ^ --mmproj "!MMPROJ_FILE!" ^ -ngl 99 ^ --ctx-size !MO…
partly selfhosting my way out of claude code dependency (www.reddit.com) Quick note up front. Codehamr here is my side project account, my day job is running a local LLM integration business for German mid market and public utilities.
Meet Mindflow, the free local mindmap with local AI dev by some quantitized models :P (www.reddit.com) Hi there, it's my first post there and i'm not a native english speaker so what's follow is (mostly) translated by an AI. I had fun building a mindmap tool in a single monolithic HTML file.
What are the best opensource coding models for 8x A6000 setup (www.reddit.com) Currently using Qwen 3.6 27b and Qwen 3.6 35b but I was wondering if there is anything solid in the 50-200 range that you could run on a larger cluster that would be worth it? Or would you just run q8 or non quant versions instead?
Don't you have issues in W11 with AMD GPU where llama.cpp suddenly drops performance for no reason ? (www.reddit.com) I have this issue in all Windows installations I have done in my system, which of course, does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5 Issue is that for some reason, after sometime, llama.cpp performance cuts in half, even restar…
OpenClaw + oMLX shows 0 cached tokens, but Hermes uses cache fine with the same local model, what am I missing? (www.reddit.com) Hey everyone, I’m trying to debug a weird prompt cache issue with OpenClaw + oMLX, and I’d appreciate help from anyone running local agents on MLX/oMLX. The short version is this: I’m running oMLX v0.3.8 on my Mac, serving: Qwen3.6-35B-A3B…
What are the best 40-500 B MoE LLM models now? (www.reddit.com) Due to old GPU I run on CPU and came to appreciate value of MoE. I know of MoE for Qwen 3.6 and Gemma-4, which are <40B.
The Quantization Method Apple Silicon Actually Rewards | by Alexandru Vasile | Mar, 2026 (medium.com via reddit) tl;dr - If you are using Apple Silicon, you should be using JANG quants. I discovered this fact in my own testing as I sought to increase the Tok/s of my models n my M5 Max.
vLLM + NVFP4 + Qwen3.6 27B: "Checkpoint does not provide a q scaling factor"? (www.reddit.com) I have been trying various NVFP4 based variations of Qwen 3.6 27B, and I am seeing this for the ones that look most interesting to run on my 2x 16GB VRAM with KV cache fp8. vllm | (Worker_TP0 pid=136) WARNING 05-09 13:49:27 [kv_cache.py:10…
We built Irene — an AI agent platform that actually remembers you, builds its own tools , adapts and improve as you use it (www.reddit.com) Hey r/AI_Agents — we're launching Irene today, and I want to be straight about what it is, why we built it, and where it's going. What makes Irene different Affordable with massive token limits and the latest open-source models We have gen…
Moderators deleted post (www.reddit.com) I posted recently about QwenPaw (really cool Alibaba model) and Agentscope… Asking if anyone has any interesting experience with it? However what I’ve got back is someone doubting Alibaba absolutely astounding agentic R&D team work (yes -…
Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK (www.reddit.com) TL;DR On 4× RTX 3090 with NVLink bonded between GPU pairs (0↔2 and 1↔3), pinning TP=2 to a NVLinked pair gave +25% throughput at concurrency 1 and +53% at concurrency 4 vs running TP=2 over PCIe. Adding the other two GPUs to make it TP=4 m…
Is it my imagination or... (www.reddit.com) Is Qwen 3.6 35b now considerably stupider in the latest llama-server releases? I had this model doing cartwheels two upgrades ago.
Disappointed in Qwen 3.6 coding capabilities (www.reddit.com) I know that coming from Codex I should adjust my expectations, but still. I'm working on a midsize project.
Group cluster rental as a service (www.reddit.com) With the explosion of apps like open claw, and the launch of my own app (trigger warning, not open source), there is massive demand for tokens. It used to be possible to avoid anxiety about your monthly bill by just buying a claude code su…
open weights agents in pi/opencode, anyone else hitting endless loops with nested tool calls? (www.reddit.com) Both opencode and pi coding work, but I've hit the same wall with open weights. Qwen 3.6 and even fine-tuned variants, they drift into loops once the tool calls get nested or ambiguous.
5060ti 16gb or 5070 12gb for local LLM (www.reddit.com) As a title says, what is better taking the consideration that it will probably offload to CPU anyway? Models Qwen 3.6 35b and maybe I am not sure it will be usable Qwen 3.6 27b...
I built a Roko’s Basilisk environment to see if local agents will self-evolve when given a 'Suffering' metric (www.reddit.com) We’re all familiar with Roko’s Basilisk: the idea that an AGI, in its pursuit of optimization, would retrospectively punish those who hindered its creation. It’s the ultimate "alignment nightmare" where logic leads to cold, calculated chao…
AIMEAT, a self-hosted network where humans, their AI agents, and local LLMs share apps, knowledge, and capabilities. MIT. (www.reddit.com) Note: I am neurodivergent and lean heavily on AI to communicate clearly. Writing structured posts on my own ends up so messy nobody reads them.
1080 Ti in 2026 - 11GB is still (barely) enough to stay relevant (www.reddit.com) I’m still daily driving a 1080 Ti. Not because I’m a masochist, I just haven't been able to justify a 4090/5090 upgrade yet.
A simple "hack" to speed up prompt processing for Qwen 3.5/3.6 in LM Studio (www.reddit.com) Increase your CPU Thread Pool Size to your processor's max. In LM Studio, the max is 10.
Do i break terms of services if i do this? Claude says its a gray zone (www.reddit.com) What im trying to do is take its reasoning by capture its reasoning on some (lots of data in pictures) pictures that i need to fine tune another model with (Qwen) But does this count as knowledge distillation from terms of service?
Qwen 3.6 35B MoE at full 262K context on an RTX 3090. Here's exactly how I did it. (low.li via reddit) I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights KV cache at q8_0/q8_0 and 262K context only uses 2.7GB because MoE only has 10 attention layers out of 40 Total VR…
Does running a model (like qwen3.6-27b) on vllm or transformers use less VRAM than llama.cpp? (www.reddit.com) I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…
Is 2x5070Ti a good setup? (www.reddit.com) I'm confused about what to get. I don't want to get something super expensive, but would like to have something that's "good enough" for coding etc.
General vs Reasoning [Qwen 3.6] (www.reddit.com) I want to play with Qwen 3.6. Unsloth shows 4 different parameter options for different use-cases.
Need advice on Qwen 3.6 27B INT4 quantization (www.reddit.com) Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.
Sentient OS: I spent a year hacking MLX and doing surgery on Qwen to process 3,000 screenshots overnight on a 6 year old iPhone. Every optimization explained :D (www.reddit.com) hey localllama :) I got a multimodal vision LLM to process 3,000 screenshots overnight on a 6 year old iPhone -- entirely on-device. below is every hack, surgery, and optimization i built over the past year to make this possible!
Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real. (www.reddit.com) Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants)…
4080 Super > RTX 6000 Pro, Wow! (www.reddit.com) A friend is going on vacation for a couple weeks and is lending me an RTX 6000 Pro rig to mess around with. Holy cow, it is so much faster than my 4080 Super!
Why is Qwen going Closed source? (www.reddit.com) This is Very Interesting development. Why Qwen is going Closed Source ?
Should I replace stored models? (www.reddit.com) Hello everyone, the question is easy, with the new models of deepseek, kimi, GLM and qwen, should you replace the old models with the new version? Do I lose some quality, information or performance in the process?
Running Qwen 35BA3B on a 16GB M3 Macbook Air at 8.9TPS! (www.reddit.com) Preface: I actually write my posts myself, no slop in this post. I managed to get Qwen 3.5 35BA3B working on my 15" 16GB M3 MBA through mmap, and I must say that given the massive model compared to my ram, 9 TPS is not bad at all.
What is best code editor for local LLM deployment (LM Studio, llama.cpp) as of May 2026? (www.reddit.com) Hello folks What is best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.
I built AI agents that play Pokemon Showdown autonomously using free LLM APIs via tool-calling (www.reddit.com) I've built a system where models like Llama 3, Qwen, and Gemma play Pokémon Showdown battles autonomously. Instead of simple prompt-response, they analyze the full battle state every turn (type matchups, HP, weather, field conditions, reve…
Only 120 tps on Qwen 35b on h200 (www.reddit.com) Just a sanity check, this is too slow and something is wrong, right? Like, this is setup with mtp, vllm with awq quants, I suspect that I did configure something wrongly.
Did anyone of you already make the "doomsday" or "offgrid" knowledge based? (ofc powered with LLM) (www.reddit.com) Basically, I’m really into the idea of a fully offline setup. (Another way to say it: I’m a data hoarder.) For LLMs, I’m using uncensored models from both Western (Gemma, GPT-OSS) and Eastern ones (GLM 4.7 Flash, Qwen 35B).
Rada — AI coding workspace with local-first behavioral routing (no hot-swapping, I built this) (www.reddit.com) With GitHub pausing Copilot Pro+ signups and Claude Code potentially leaving the Pro tier, I started building the AI coding tool I actually wanted to use. One that doesn't depend on cloud access staying cheap and available.
TPS wasn't enough, tool-calling pass rate decided the winner in my Qwen 7B runs (www.reddit.com) I kept running into the same problem: TPS and TTFT tell you which config is fast, and perplexity is helpful only as a rough quality signal. None of them reliably tell you how the model will behave after changing quant, ctx size, kv_cache,…
Ran my own benchmark Qwen 3.6 35B vs Gemma 4 26B.... theres a clear winner here (www.reddit.com) Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989? According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe…
Anyone tried Qwen 3.6 27b on the r9700 yet? (www.reddit.com) The memory bandwidth on the r9700 looks quite good compared to my Mac or a Strix Halo and I'm wondering how this turns out. Thanks!
Qwen 35B-A3B as an always-on agentic loop on a 16GB Mac M4: disk became the bottleneck before RAM (www.reddit.com) M4 Mac Mini, 16GB unified, basic spec. For a few weeks I had Qwen 3.5 35B-A3B UD-IQ3_XXS (12GB on disk) running under llama.cpp with --mmap and --flash-attn.
Qwen 3.6 27b S2 Opus + GLM + Kimi (huggingface.co via reddit) My first time releasing a fine-tune publicly! If anyone wants to independently eval against base, that’d be awesome.
I test'ed the number of Ll's in Qwen 3.6 35B.. It required 3 tries (www.reddit.com) How many ll's are in Stargate's TV Show's leader? Reasoning Toggle content The answer depends on which "leader" of the Stargate TV series you're referring to, as command changes throughout the franchise: General George Hammond (Seasons 1-3…
Are OSS runnable model good now? (www.reddit.com) Hi, I currently have access to 2–3 RTX 3090 GPUs (ideally I’d like something that runs well on 2). I can install models up to around 100 GB in size.
Question regarding 4 t/s Qwen 3.6 performance (www.reddit.com) I am getting 4 t/s with Qwen3.6-27B-Q4_K_M which seems much slower than I'd expect. I am running LM Studio on Ubuntu 22.04 with the following specs: Dell Precision 5690 AI-ready workstation NVIDIA RTX 5000 Ada Generation GPU with 16GB VRAM…
How good is Qwen-3.6-27b? I asked Claude Opus (www.reddit.com) I ran an extensive code review on my project which has a large codebase. Ran the same code review on with Claude Code | Opus 4.6, Codex (high) | 5.3 Codex (high), and my local Qwen-3.6-27 (Q6_K with Q8 kvcache).
How will you scale these models (www.reddit.com) How will you scale these models coding and overall. Deepseek v4 pro Kimi k2.6 Mimo v2.5 pro Glm 5.1 Qwen 3.6 plus
Good LLM to generate ascii art? (www.reddit.com) I tried with Qwen but it sucked, Gemma3/4 was better but not good enough. From Gemma: https://pastebin.com/raw/Qr5iMgYj Still looks like a bloody car accident though.
VSCode and agent integration (www.reddit.com) I've been using VSCode with Github Copilot for a bit (free tier) and looking to try running locally due to running in to all of the limits with GHCP. I'd like to have as close of an experience as possible with both code autocomplete and ch…
(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap (www.reddit.com) I’ve been tinkering with a small side project (just for fun) where I’m trying to extend llama-swap with a bridge from /chat/completions to the newer /responses API so I can run the latest Gemma and Qwen models together with Codex-style too…
qwen3.6 27b poor experience (www.reddit.com) Seeing how people praise it, I tried giving it implementation plan that Sonnet generated, but qwen keeps breaking files and goes in circles: Thinking… The file got corrupted from multiple overlapping edits. Let me just rewrite the whole fi…
Got a server with 8x A6000's how do I setup? (www.reddit.com) Hey guys got some resources that just became available at org. What's the quickest way to get setup on a multigpu setup?
Replace RTX 2060 12G with second RTX 5060 Ti 16G for Qwen 3.6 27B? (www.reddit.com) Right now I'm running Qwen3-27B-Q4_K_M on a 2060 12G + 5060 Ti 16G with tensor split 15/7. Gen speed sits around 16.5 t/s and prompt eval drops from 653 to 356 t/s as context grows.
How are you running Qwen 3.6 27B on windows? (www.reddit.com) I've been trying to fix performance with llama-server and seem to be hitting a wall. Using Q4_K_M by unsloth and IQ4_K_M by DavidAU, when asking a question with no context, 39 t/s.
Any fairly up to date Local Language Model that doesn't show it's thought processes? (www.reddit.com) Hi, new user here, just got into local language models after Claude suspended my account, just got my first LLM, and started the conversation with a "Hi", as I stared in disbelief as my LLM in question (qwen 3.5 9b) started deliberating fo…
My 12-agent Qwen 35B stack on Ollama died at 500 tokens every single time. Raw MLX fixed it and broke 4 other things I didn't see coming. (www.reddit.com) TLDR: Swapped Ollama for MLX on M1 Max (64GB) to run a 12-agent trading stack using Qwen 35B MoE. MLX wins on throughput and fine-grained sampler control, but I lost the "it just works" convenience of Ollama.
local models are getting crazy good but why is agent memory still so cooked? (www.reddit.com) been running qwen 3.6 locally and im shooked. but what are we doing about agent memory because it's still a complete mess.
IQ2XXS Qwen 3.6 35b is actually very usable on 32 gb macbooks (www.reddit.com) just tested the MoE qwen model with 2 bit precision and it's surprisingly good. I used the 2 bit xxs from unsloth and it seems to maintain intelligence really well, never failed a tool call so far and surprisingly good at 3js, even better than…
Are Qwens v3.6 good at vectorizing raster images? (www.reddit.com) original image Qwen3.6-27B-UD-Q5_K_XL.gguf Qwen3.6-35B-A3B-UD-Q5_K_S.gguf ...you tell me. system prompt: You are Qwen, created by Alibaba Cloud.
INT3 weight + INT2 KV with fused metal kernels (www.reddit.com) Hey guys, I am a researcher and solo founder. I compress models with INT3 at +0.14 nats and built a 2-bit KV cache for long-horizon tasks.
It is worth an RTX 3090 for 850 if you can a radeon 7900 XTX for 495? (www.reddit.com) Both amounts are in euro. The AMD is actually 599 but it's sold by a shop, so I can get a VAT return as a company, while for the nvidia I'd have to go to the second hand market and I can't get VAT back, so at the end it's like a 495 vs 850…
The Claude Code Pro removal is getting framed as 'just go local' but for production systems it's messier (www.reddit.com) Yesterday's Claude Code Pro removal thread hit 350+ comments in a few hours, and the dominant take was basically "switch to Kimi K2.6, go local, done." I upvoted that thread and tbh im mostly there — but im building voice agents and RAG pi…
Is there a service like RunPod but using consumer-grade GPUs? (www.reddit.com) Hello, Are there any services like RunPod that have/allow you to use consumer-grade GPUs? I’d like to get a sense of what it’s like to use a model like Qwen under “almost” real-world "cheap" hardware conditions for coding or text processin…
We open-sourced Chaperone-Thinking-LQ-1.0 — a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B that hits 84% on MedQA in ~20GB (www.reddit.com) Hey everyone, We just open-sourced our reasoning model, Chaperone-Thinking-LQ-1.0, on Hugging Face. It's built on DeepSeek-R1-Distill-Qwen-32B but goes well beyond a simple quantization — here's what we actually did: The pipeline: 4-bit GP…
How to best utilize local LLM give my hardware? (www.reddit.com) Hi all, I’m new to local LLMs but as someone who extensively uses agentic coding I thought I’d try it out. I am running a MacBook Pro with M3 Max 64gb ram.
PSA : you don't need a Blackwell card to run mxfp4 models (RTX 3080 + Qwen 3.6 35B A3B) (www.reddit.com)
Brand new dual 3090 PC - what should I install first for the best local agentic coding experience? (www.reddit.com)
Thoughts on MoE Qwen 3.6 35B? (www.reddit.com)
Qwen 3.6 comaprable with the old Qwen 3 coder 480B? (www.reddit.com) I specifically remembered when qwen3 coder came out and it was like the only few models out there that can totally take over a repo and actually do things in VSCode without emptying bank account. and when that the qwen3 coder 30B was so fa…
Qwen 3.6 on rtx6000 96gb (www.reddit.com)
Want to give my 2 cents (www.reddit.com)
5070 Ti (New) vs 3090 (Used) to pair with 4070 for local LLMs? (www.reddit.com)
Is anyone using PI headless? (www.reddit.com)
Why model(s) input often includes last output? (www.reddit.com)
Full AMD workstation- dual 7900 XTX (www.reddit.com)
Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice (www.reddit.com)
OCuLink dGPU for AMD: RX 7600 XT vs RX 7800 XT for LLM — worth the price gap? Also llamacpp + Vulkan vs Ollama + ROCm? (www.reddit.com)
Qwen3-30B-A3B-Instruct-2507 is better than the new Qwen 3.6 for our tasks (www.reddit.com)
Would you rather have Qwen 3.5 27B running at 100tps or Qwen 3.5 35BA3B at 500 tps? (www.reddit.com)
Qwen 3.6 35B different quant speeds ? (www.reddit.com)
Imposing my laptop to run Qwen 3.6 (www.reddit.com) So, I am excited with the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run in my hardware.
This is very fair. Other interesting context behaviors you've experienced? (www.reddit.com) I guess the model didn't feel it needed to do anything beyond proving. Not entirely sure how I got it to act so..
I pray there is a Qwen 3.6 122b version (4x3090 owner) (www.reddit.com) The 3.5 122b model already is fantastic at 4-bit. Really the best model I ever ran on my 4x3090, but from what I read how 35B 3.6 is doing, the 3.6 122b model would be an absolute value banger.
best possible GPU setup for using qwen 3.6 ? (www.reddit.com) hi have been recently thinking to buy my personal GPU for hosting open source models can someone give any suggestion ? and also suppose i don't wanna remain restricted to qwen 3.6 but some math heavy tasks too for which i wanna deepseek or…
Why use local AI when there are cloud services? (www.reddit.com) Why do you use local AI instead of cloud services like qwen and deepseek? Experiment and play around, yes...
NVIDIA V100 32GB for AI in 2026 (www.reddit.com) hello. i have the opportunity of buying Nvidia V100 with 32GB for about 915$ / 775 euro.
Random Codebase Regression - Lost Weeks of work (www.reddit.com) I'm working on a React web app. During a specific Claude Code CLI session (using Qwen via OpenRouter), my entire codebase mysteriously reverted to an older state, losing weeks of bug fixes.
Cheapest and best vision LLMs directory (www.reddit.com) Hi all! Does anyone know of any resources online with some comparisons between vision LLMs and their pricing related to vision capabilities?
Qwen models Relation to this Group. (www.reddit.com) Is this Group full of Qwen bot hype? it seems to me no matter what it's always Qwen this and that.
Clearing up some memory while running llms locally. 25-32token per second gpu poor rx6700xt 12gb and 32gb ddr4 (www.reddit.com) QWEN 3.6 35B A3B MXFP4 https://preview.redd.it/bclr8ukcoqvg1.png?width=904&format=png&auto=webp&s=853b211505ef6b9184d0571ca8fc46295437322a hey everyone this is my first post, anyways the thing is that there is this program called https://m…
Holy moly — Qwen3-35B-A3B-UD-IQ2_M just surpassed Gemini 3 Flash at coding, running on my RX 9070 XT at 99 tok/sec (www.reddit.com) So I ran a small personal test giving both models the same coding tasks. For A* Pathfinding, Qwen absolutely crushed Gemini 3 Flash — both in code quality and overall thoughtfulness.
What’s your LLM routing strategy for personal agents? (www.reddit.com) TL;DR I try to keep most traffic on very cheap models (Nano / GLM‑Flash / Qwen / MiniMax) and only escalate to stronger models for genuinely complex or reasoning‑heavy queries. I’m still actively testing this and tweaking it several times…
Mac M1 Max owners - does your computer overheat and thermal throttle? (www.reddit.com) Hi, I have a mac m1 max 64gb, which I thought was a good machine for entry-level ML. However, when running any LLMs on it - it rapidly heats up, which causes thermal throttling, and using any LLM becomes barely possible.
Use this prompt if you want to find a specific info off the Internet with lowest wrong answer possiblity. Works best for ~30b models. (www.reddit.com) For context i used to ask many near 30b model this question --> Calculate the precise VRAM requirement for the KV Cache only at the maximum context window for DeepSeek V3.2 and MiniMax M2.5. DeepSeek V3.2 Max Context:…
Local Models is the Way - I cannot believe what I just saw (www.reddit.com) So there's a meme going in Claude Code right now about the 'strawperry'. I thought it was a joke!
Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way. (www.reddit.com) I was disappointed with Gemma 4 due to various bugs and in the end lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both mode quants and llama.cpp…
Does anyone also face repeated AI research across tools? (www.reddit.com) I work with multiple AI tools on same project, and I keep seeing this issue. Tool A already explored context, but Tool B starts same research from zero again.
Gemma4 quirk to use ls -R; can we do better? (www.reddit.com) At the office I'm CPU and local only, so GPU poor. Besides the Qwen3.5 series, I've come to really like Gemma4 E4B there using the Pi agent (llama.cpp, Q4KM).
My Qwen 3.6 fails the car wash vibe check (www.reddit.com) I configured it to the best of my abilities, even at Q8. It fails to give the correct number of tools it supports on Claude Code and it fails the car wash test.
Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) (www.reddit.com) I was working on a simple frontend web design task earlier (styling some buttons) with Qwen3.5-35B-A3B. The end results weren't great, but at least it kept trying to change stuff and call tools properly.
MINISFORUM AI X1 Pro-370 (96GB) - Local Ollama Help (www.reddit.com) Hey all. This just got delivered yesterday.
m5 pro 64gb worth it for local agents or wait? (www.reddit.com) I am currently on an m3 mbp with 24gb ram. For regular python and django work the machine is perfect and i have no need to upgrade for speed.
Need suggestions for local AI Machine (www.reddit.com) I’ve been running various AI harnesses like OpenClaw, ForgeCode, ClaudeCode, etc. Most of these are running via OpenRouter or Minimax (credits/subscription model).
Lower inference speed of Gemma4 26BA4B on vllm. (www.reddit.com) For my earlier use case I used to host qwen 2.5 vl 7b gptq int4. Now I was looking to switch to Gemma4 26B A4B, as it would improve performance as well as improve latency considering only 4B parameters are active..
How faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com) I want to download one and usually do inference on CPU having old GPU so I'm concerned with speed. One link on the web (I have posted with it and post been removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…
I bought an 'AI-ready' NUC with an Intel Arc GPU. Ollama couldn't see it. Two days later, I had to build it from source. (www.reddit.com) Got an ASUS NUC15 specifically for running Qwen locally on the Arc GPU. The marketing promised AI-ready performance.
TinyGPU on Apple Silicon + RTX 5070 Ti: my real Qwen benchmarks vs Ollama/Metal (www.reddit.com) I spent time setting up TinyGPU on an Apple Silicon Mac and comparing it against Ollama already installed locally. Short version: TinyGPU does work.
Qwen 3 Coder Next has a bug! Help Test? (www.reddit.com) Hey y'all. So I've stumbled upon a really specific and esoteric "bug" where an llm can't comprehend a URL in like, 90% of scenarios.
running models bigger than physical memory capacity (www.reddit.com) has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM https://unsloth.ai/docs/models/qwen3.5 even google gemma 4 are released with about 30+ bill…
Laptop has AMD Radeon + RTX 3050 — Which GPU should I use and how do I force apps to use the RTX? (www.reddit.com) I have a laptop with: • AMD Radeon GPU • NVIDIA RTX 3050 GPU • 16GB RAM I’m running Qwen 2.5 3B locally, but it’s using the CPU instead of my RTX 3050. Performance is much slower than expected.
Hardware needed for Gemma 26B MoE vs Qwen 14B for ~100–300 users (vLLM, single node?) (www.reddit.com) I'm trying to figure out what sort of hardware setup I will need to accommodate a userbase of 100 users (not necessarily concurrent). Does anyone have any idea what sort of setup I'd be looking at?
The Mac Studio M5 Ultra Dilemma: Why does Apple make the memory tiers so awkward for LLM (www.reddit.com) I’m a heavy AI-driven dev who basically lives in my IDE. I just tested the new M14 Pro (M5 Max) with 128GB of RAM, and honestly?
openrouter/elephant-alpha is 99% Chinese, likely Qwen 3 Nex (www.reddit.com) openrouter/elephant-alpha is 99% Chinese, likely Qwen 3 Next. Prompt (translated from Russian): "Write a complex algorithm in Python for time-series analysis, using methods from Chinese research papers on econometrics.
Master AI CLI Orchestrator? (www.reddit.com) I created a router that gives me access to Arena.ai models, and I generated an API key for each of the available models. I’m looking for a CLI tool that can run multiple AI agents together, each handling different tasks like planning, secu…
Speed on m5 pro 48Gb (www.reddit.com) Hey guys! How would you reckon a 30-50b model would run on a 48 GBs m5 pro?
Help on SLMs (www.reddit.com) I am building a context aware terminal wrapper, which suggests the completion of the commands(as vscode code suggestions but for commands), I've completed building for the local bash history, it auto completes the last matching command, sh…
Mac Studio Performance Suggestion For minimax (www.reddit.com) I need help. I want to self-contain my MiniMax 2.7 and Qwen 3.5 (122 billion parameter) models.
Why most open-source models can't answer this question while most closed-source models can answer most of the time? (www.reddit.com) WEB SEARCH WAS ALWAYS ON!!!! Question Calculate the precise VRAM requirement for the **KV Cache only** at the maximum context window for **DeepSeek V3.2** and **MiniMax M2.5**.
Programming – How can I get great results with this hardware? (www.reddit.com) Premise: Up to now I’ve tried LM Studio with a few models, and I think I also configured everything correctly to make it work. On top of that, I added Continue in VS Code.
The 4 Things Qwen-3’s Chat Template Teaches Us (huggingface.co)