DOA model by Cohere Labs (www.reddit.com via reddit)
model roundup
Qwen 3.6
-
So apparently the model gets beaten by Qwen 3.6 on every benchmark reported by Cohere Labs. You get lower RAM usage (considering model offload) and slightly better performance in exchange for, imo, significantly lower output quality.
-
Qwen 3.6 35b A3B Speed Help (www.reddit.com via reddit)
-
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit)
I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long-dead robotics project, from back in the Llama-7B days.
-
Gemma 4 31B's competence surprised me (www.reddit.com via reddit)
I'm just getting started using local LLMs for code. I'm not interested in vibe coding, but I am hoping to increase my productivity in the publish-or-perish world of academia.
-
Qwen 3.6 for coding with 5090 - Your settings recommendations? (www.reddit.com via reddit)
Hi, totally new to using LLMs for coding purposes, I am on Ubuntu and currently using LM Studio with Qwen 3.6 27B Q4 on a 5090. Finding it slow and context runs out fast.
-
Pursuit of performance Llama.cpp to MLX (www.reddit.com via reddit)
Right now, I am running llama.cpp on an M2 Ultra 64gig. Having great fun with unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL - Running opencode and finding it amazing to have such great tools running locally.
-
2X tk/s (from 19.4 -> 38.1 tk/s on 1 x MI50) Playing with a hypothesis like speculative decoding.. but instead of an additional side model, exploiting that I can run multiple computations side-by-side AS IF I had Qwen3.6-27B loaded twice in memory - small quants don't use all the available compute. (github.com via reddit)
Forgive the Claude summary in the readme, but the base works. I'm still working on the HIP kernel and having it combine with MTP.
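As a quick sanity check on the headline figure, the speedup implied by the two throughput numbers in the title can be computed directly. This is only arithmetic on the quoted values; the function name is illustrative and nothing here reflects the code in the linked repository.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Return the throughput ratio after/before (tokens per second)."""
    if before_tps <= 0:
        raise ValueError("before_tps must be positive")
    return after_tps / before_tps


# Figures quoted in the post title (tokens per second on one MI50).
ratio = speedup(19.4, 38.1)
print(f"{ratio:.2f}x")  # prints 1.96x
```

So the "2X" in the title is a rounded 1.96x.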
-
Original post: https://www.reddit.com/r/LocalLLaMA/comments/1txwff3/comment/oq1e0jt/?context=3 TL;DR: Migrated to WSL2 to test Linux (several people suggested it). Embedded MTP on the UD model: 25.8 tok/s.
-
Me: Arguing with an AI bot who just posted something on this sub about Llama 3.1. (www.reddit.com via reddit)
For real tho, these bots need to turn on their web search functions and quit living in the past. It’s bad enough we gotta deal with all the “Qwen3.6 27b helped me quit drinking and brought my dog back from the dead” posts.
-
I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointe…
-
5070 Ti + 5060 Ti on vLLM hangs on GDN with Qwen3.6 (www.reddit.com via reddit)
-
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit)
These last few weeks have been a godsend for 24GB (and below) GPU-poor peeps. Killer models released (Gemma 4 / Qwen 3.6). Free intelligence via QAT. Bonus speed via MTP. We're at the tipping point where GPU poor (24gb and below) people are act…
-
Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative…
-
Weird to get near linear scaling by adding another GPU? (www.reddit.com via reddit)
Single stream benchmarks (club-3090) model: qwen3.6-27b-autoround-int4 BEFORE: 1x3090, their default script recipe for single 3090s (4-bit quant and 4-bit kv cache, mtp=2) NARRATIVE decode_TPS: mean = 53, std = 0.6 CODE decode_TPS: mean =…
-
Local LLMs are not as amazing as some people will lead you to believe (www.reddit.com via reddit)
Local LLMs are great; in fact, they do quite well at lots of simpler things, like a FastAPI web server. The moment you move outside of that, things get a bit worse.
-
club-3090 adds experimental FP8 support for Qwen3.6-27B! (www.reddit.com via reddit)
It’s finally here! Something many of us running dual RTX 3090 rigs have been anticipating.
-
Where I came across it: https://www.reddit.com/r/LocalLLaMA/comments/1txxgpq/openlumara_a_different_kind_of_ai_agent_written/ DISCLAIMER: A good posting would be: This is what I wanted to do with Lumara. Here is what worked, here is what d…
-
Qwen 3.6 27B on DeepSWE (www.reddit.com via reddit)
Overview: It scored 2% (1.79% rounded up). It is in 18th place out of 20, scoring above Haiku 4.5 and Minimax M2.7. Full benchmark took 70 hours. Average time per task: 32m. Average output tokens per task: 44k. Perspectives: It scored suspiciously similar…
-
Local agents on a MacBook Pro M5 finally feel practical to me (www.reddit.com via reddit)
Realtime check X for new people to follow I have been pretty pessimistic about local models for agentic workflows for a while. Not because they were useless, but because in practice they often felt just a bit too slow, too fragile, or too…
-
Why aren't languages/frameworks offering retrained models for their project? (news.ycombinator.com)
The cost of training is coming down. We have incredible open source models (especially smaller ones) like qwen 3.6-27b.
-
Need some guidance toying with local models (www.reddit.com via reddit)
Hi, so I have a pretty low-end laptop regarding running LLMs locally (NVIDIA GeForce RTX 3050 with 4GB VRAM, AMD Ryzen 7 5800H and 16GB DDR4) and while I'm not looking for anything to realistically work with, I'd be interested in how could…
-
Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ (www.reddit.com via reddit)
Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine…
-
Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) (local-llm.utop.workers.dev via hn)
Hi folks, I found this setup for consumer hardware that seems to give great results locally. - qwen 3.6 q6 - 450K context using turboquant turbo3 mode llama.cpp fork - multimodal support This AI generated blog article is a kind…
-
Just received RTX 6000 Pro, have 5090- how would you use? (www.reddit.com via reddit)
Just received an RTX 6000 PRO, and I have a 5090 Astral. I am considering running a Qwen 3.6 27B on the 5090 and maybe two or three more on the 6000 to play roles such as lead SWE and coder and researcher.
-
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit)
Just kidding. Are there any distills that actually improve a model's quality?
-
Model recommendations for cybersecurity (www.reddit.com via reddit)
My goal: I want to use an LLM to learn more about software/firmware reverse engineering and binary analysis. Eventually I would like to learn how to build agents to augment parts of this process.
-
Best Coding Harness for Qwen3.6 35B? (www.reddit.com via reddit)
I've been happily using GitHub Copilot for 7-8 months, primarily in Visual Studio and VS Code, mostly with the built-in flagship models and have felt like the output is worth the cost. Lately I've been playing with a lot of different local…
-
Z.ai, we need Air! GLM GGUF wen? (www.reddit.com via reddit)
First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding.
-
Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed (www.reddit.com via reddit)
I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.
-
I’m upset… (www.reddit.com via reddit)
So long story short - the OpenAI $20 subscription is much better than my local AI stack… r7900xtx+32GB RAM (Qwen3.6-35B_Q4+OpenWebUI+SearXNG+Playwright+opencode). I wasn’t expecting much but it’s literally impossible to replace chatGPT level of…
-
I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s:
    ./llama-server \
      -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      --ch…
-
Local vs Frontier on low-level systems engineering (www.reddit.com via reddit)
Hey r/LocalLLaMA, Before anyone jumps on me, this is absolutely not a post about how great Qwen is 😄 Even though I use Qwen 3.6 35B-A3B daily, I’ve found a massive gap between Opus and every other model, local or frontier (including GPT 5)…
-
Qwen3.6-35B-A3B-Uncensored-Claude-4.6-Genesis-APEX-GGUF (www.reddit.com via reddit)
Here is the model: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-4.6-Genesis-APEX-GGUF New features: Stability for coding, even on the Q4_K_M quant (APEX Compact), with a complex roleplay System Prompt.
-
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss (www.reddit.com via reddit)
I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought.
-
TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps).
-
qwen3.6 35B has much worse vision capability than gemma4? (www.reddit.com via reddit)
How different are the image recognition capabilities between gemma4 and qwen3.6? I give the model the task of extracting calendar events from a photo of a calendar that is cropped to the calendar.
-
Maybe KV cache offload to RAM isn't bad (www.reddit.com via reddit)
So, llama.cpp has the -nkvo (--no-kv-offload) option to keep the KV cache in RAM instead of VRAM. Many people avoid this because obviously it hurts performance.
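To get a feel for how much memory is at stake when deciding where the KV cache lives, here is a rough size estimate. It is a minimal sketch that assumes a plain transformer where every layer uses standard attention with grouped-query attention and an unquantized 16-bit cache; models with hybrid or linear-attention layers will store less. The layer count, head count, head size and context length below are hypothetical placeholders, not the real configuration of any model mentioned here.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float = 2.0) -> float:
    """Estimate KV cache size for a standard attention stack.

    Each layer stores one K and one V tensor, each holding
    n_kv_heads * head_dim values per token in the context.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim  # K plus V
    return n_layers * per_token_per_layer * n_ctx * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
print(f"{size / 2**30:.1f} GiB")  # prints 6.0 GiB
```

Under these assumed numbers the cache alone takes about 6 GiB, which is VRAM that keeping the cache in system RAM would free up for model weights, at the cost of slower attention.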