model roundup

Qwen 3.6

37 items · started 2026-06-05 · ongoing (last activity 2026-06-09)

  1. So apparently the model gets beaten by Qwen 3.6 on every benchmark reported by Cohere Labs. You get lower RAM usage (considering model offload) and slightly better performance in exchange for, imo, significantly lower output quality.

  2. I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.

  3. I'm just getting started using local LLMs for code. I'm not interested in vibe coding, but I am hoping to increase my productivity in the publish-or-perish world of academia.

  4. Hi, totally new to using LLMs for coding purposes, I am on Ubuntu and currently using LM Studio with Qwen 3.6 27B Q4 on a 5090. Finding it slow and context runs out fast.

  5. Right now, I am running llama.cpp on an M2 Ultra 64 GB. Having great fun with unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL, running opencode and finding it amazing to have such great tools running locally.

  6. Forgive the Claude summary in the README, but the base works. I'm still working on the HIP kernel and getting it to combine with MTP.

  7. Original post: https://www.reddit.com/r/LocalLLaMA/comments/1txwff3/comment/oq1e0jt/?context=3 TL;DR: Migrated to WSL2 to test Linux (several people suggested it). Embedded MTP on the UD model: 25.8 tok/s.

  8. For real tho, these bots need to turn on their web search functions and quit living in the past. It’s bad enough we gotta deal with all the “Qwen3.6 27b helped me quit drinking and brought my dog back from the dead” posts.

  9. I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about why models are not benchmarked on tool calling, and u/complexminded pointe…

  10. These last few weeks have been a godsend for 24GB (and below) GPU-poor peeps. Killer models released (Gemma 4 / Qwen 3.6); free intelligence via QAT; bonus speed via MTP. We're at the tipping point where GPU poor (24gb and below) people are act…

  11. Hardware: RTX 5090 | Model: Qwen3.6-27B | Framework: BeeLlama.cpp Full benchmark scripts, raw data, config, and generated artifacts are available on request — just DM or comment below. I spent the last week benchmarking DFlash speculative…

  12. Single-stream benchmarks (club-3090). Model: qwen3.6-27b-autoround-int4. BEFORE: 1x3090, their default script recipe for single 3090s (4-bit quant and 4-bit kv cache, mtp=2). NARRATIVE decode_TPS: mean = 53, std = 0.6. CODE decode_TPS: mean =…

  13. Local LLMs are great; in fact, they handle lots of simpler things, like a FastAPI web server, quite well. The moment you move outside of that, things get a bit worse.

  14. It’s finally here! Something many of us running dual RTX 3090 rigs have been anticipating.

  15. Where I came across it: https://www.reddit.com/r/LocalLLaMA/comments/1txxgpq/openlumara_a_different_kind_of_ai_agent_written/ DISCLAIMER: A good posting would be: This is what I wanted to do with Lumara. Here is what worked, here is what d…

  16. Overview: It scored 2% (1.79% rounded up). It placed 18th of 20, scoring above Haiku 4.5 and Minimax M2.7. Full benchmark took 70 hours. Average time per task: 32m. Average output tokens per task: 44k. Perspectives: It scored suspiciously similar…

  17. Realtime check X for new people to follow I have been pretty pessimistic about local models for agentic workflows for a while. Not because they were useless, but because in practice they often felt just a bit too slow, too fragile, or too…

  18. The cost of training is coming down. We have incredible open source models (especially smaller ones) like qwen 3.6-27b.

  19. Hi, so I have a pretty low-end laptop regarding running LLMs locally (NVIDIA GeForce RTX 3050 with 4GB VRAM, AMD Ryzen 7 5800H and 16GB DDR4) and while I'm not looking for anything to realistically work with, I'd be interested in how could…

  20. Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine…

  21. Hi folks, I found this setup that seems to get great results on local consumer hardware: qwen 3.6 q6; 450K context using turboquant turbo3 mode (llama.cpp fork); multimodal support. This AI generated blog article is a kind…

  22. Just received an RTX 6000 PRO, and I have a 5090 Astral. I am considering running a Qwen 3.6 27B on the 5090 and maybe two or three more on the 6000 to play roles such as lead SWE, coder, and researcher.

  23. Just kidding. Are there any distills that actually improve a model's quality?

  24. My goal: I want to use an LLM to learn more about software/firmware reverse engineering and binary analysis. Eventually I would like to learn how to build agents to augment parts of this process.

  25. I've been happily using GitHub Copilot for 7-8 months, primarily in Visual Studio and VS Code, mostly with the built-in flagship models and have felt like the output is worth the cost. Lately I've been playing with a lot of different local…

  26. First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding.

  27. I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.

  28. So long story short: OpenAI's $20 subscription is much better than my local AI stack… RX 7900 XTX + 32GB RAM (Qwen3.6-35B_Q4 + OpenWebUI + SearXNG + Playwright + opencode). I wasn’t expecting much but it’s literally impossible to replace ChatGPT level of…

  29. I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s: bash ./llama-server \ -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \ --host 0.0.0.0 \ --port 8080 \ --ch…

  30. Hey r/LocalLLaMA, Before anyone jumps on me, this is absolutely not a post about how great Qwen is 😄 Even though I use Qwen 3.6 35B-A3B daily, I’ve found a massive gap between Opus and every other model, local or frontier (including GPT 5)…

  31. Here is the model: https://huggingface.co/LuffyTheFox/Qwen3.6-35B-A3B-Uncensored-Claude-4.6-Genesis-APEX-GGUF New features: Stability for coding, even on the Q4_K_M quant (APEX Compact) with a complex roleplay system prompt.

  32. I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought.

  33. TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps).

  34. How different are the image recognition capabilities of Gemma 4 and Qwen 3.6? I give the model the task of extracting calendar events from a photo of a calendar, cropped to the calendar itself.

  35. So, llama.cpp has the -nkvo (--no-kv-offload) option, which keeps the KV cache in system RAM instead of VRAM. Many people avoid this because obviously it hurts performance.
