Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀 A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
#moe
241 items
Qwen3.6-35B-A3B released! (www.reddit.com)
So... has anyone actually figured out whose model Elephant Alpha is yet? (www.reddit.com)
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence (huggingface.co via hn) DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence Technical Report👁️ Introduction: We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models - DeepSeek-V4-Pro wi…
Re. what ever happened to Cohere’s Command-A series of models? (www.reddit.com) Hey everyone, Nick Frosst here from Cohere. A few months ago Aidan (my cofounder) left a comment in here about our Command series and how we were working on some more powerful, open-weights models behind the scenes.
DeepSeek Updated their repo DeepGEMM testing Mega MoE (www.reddit.com) https://github.com/deepseek-ai/DeepGEMM/pull/304 https://preview.redd.it/vcmqwmvzijvg1.png?width=1014&format=png&auto=webp&s=76b1739925f0699b0763aa7814614dd40329c41e https://github.com/deepseek-ai/DeepGEMM/commit/a050d09461e86eb6bba35a8c74…
Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s (www.reddit.com) Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to s…
Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) (www.reddit.com) I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…
vLLM PR: New MoE model from Cohere soon (github.com via reddit) Easy, fast, and cheap LLM serving for everyone. We have built a vLLM website to help you get started with vLLM. Please visit vllm.ai to learn more.
AMD Strix Halo refresh with 192gb! (videocardz.com via reddit) Looks like the next Strix Halo, the Gorgon Halo 495 Max, will have more than 128GB! I already bought a strix halo mini forms a couple of months ago since the 2026 refresh rumors were not interesting.
Qwen-Scope: Official Sparse Autoencoders (SAEs) for Qwen 3.5 models (huggingface.co via reddit) Qwen Team released Qwen-Scope — a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B to 35B MoE). They’ve mapped internal features for the residual stream across all layers.
LiquidAI/LFM2.5-8B-A1B · Hugging Face (huggingface.co via reddit) looks like you can run it on any potato (A1B)! https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF from LiquidAI: LFM2.5 is a new family of hybrid models designed for on-device deployment.
Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp (www.reddit.com) Note: first is Qwen3.5 35B MoE (left) and second is Qwen3.6 (right). Hi guys, just did a quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE, with reasoning off, using llama.cpp and the same quant (unsloth 4 K_XL GGUF). First is Qwen3.5 outco…
Qwen 35B-A3B is very usable with 12GB of VRAM (www.reddit.com) Hardware: RTX 3060 12GB 32GB DDR4-3200 Windows CUDA 13.x Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.
Tinygrad Driver testing! (www.reddit.com) Boutta Thrash some MoE speeds on a blackwell + m3 Ultra RDMA cluster. Theres a bit less than 2tb of ram here.
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help (www.reddit.com) MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round.
Local model on coding has reached a certain threshold to be feasible for real work (www.reddit.com) We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. Best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout — the same constraint the public…
FINAL-Bench/Darwin-36B-Opus · Hugging Face (huggingface.co via reddit) https://huggingface.co/bartowski/FINAL-Bench_Darwin-36B-Opus-GGUF Darwin-36B-Opus is a 36-billion-parameter mixture-of-experts (MoE) language model produced by the Darwin V7 evolutionary breeding engine from two publicly available parents:…
My thought on Qwen and Gemma (www.reddit.com) This spring is really hot since the localLLM giant, both Qwen and Gemma released major models. I'm really excited with those release and happy with their capability.
AIDC-AI/Ovis2.6-80B-A3B · Hugging Face (huggingface.co via reddit) We introduce Ovis2.6-80B-A3B, the latest advancement in the Ovis series of Multimodal Large Language Models (MLLMs). Building on the strong foundation of Ovis2.5, Ovis2.6 upgrades the LLM backbone to a Mixture-of-Experts (MoE) architecture…
Granite 4.1: IBM's 8B Model Matching 32B MoE (firethering.com via hn) IBM just released Granite 4.1, a family of open source language models built specifically for enterprise use. Three sizes, Apache 2.0 licensed and trained on 15 trillion tokens with a level of pipeline obsession that's worth understanding.
DeepSeek-V4 Technical Report [pdf] (huggingface.co via hn) DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence Technical Report👁️ Introduction We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro wi…
Qwen3.6 35B MoE on 8GB VRAM - working llama-server config + a max_tokens / thinking trap I ran into (www.reddit.com)
Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU - FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline (www.reddit.com) Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality).
Dense vs. MoE gap is shrinking fast with the 3.6-27B release (www.reddit.com) (27B Dense vs. 35B-A3B MoE): Dense still holds the crown: it still wins out on most tasks overall.
For chat and Q&A: Which MoE model is better: Qwen 3.6 35B or Gemma 4 26B (no coding or agents) (www.reddit.com)
[P] Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book (www.reddit.com) I wrote a book that implements modern LLM architectures from scratch. The part most relevant to this sub: Chapter 3 takes GPT-2 and swaps exactly 4 things to get Llama 3.2-3B: LayerNorm → RMSNorm; learned positional encodings → RoPE; GELU →…
Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB (www.reddit.com) I managed to get DFlash speculative decoding working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR: https://gith…
MiMo-V2.5-GGUF (preview available) (huggingface.co via reddit) Hi, AesSedai here - I've put up a PR to support the text-to-text inference of MiMo V2.5 with llama.cpp (and should also support Pro, will work on those quants after finishing V2.5): https://github.com/ggml-org/llama.cpp/pull/22493 I've als…
Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results (www.reddit.com) Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB; Runtime: vLLM; Dataset: SPEED-Bench qualitative; Prompts: 880 total, 80 prompts across ea…
Larger Gemma-4/Qwen3.6 (www.reddit.com) Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?
Poolside Laguna XS.2 (www.reddit.com) 33B A3B MoE, Apache 2 licensed. Reported agentic results put it about level with Qwen 3.5 35B A3B, behind the 3.6 version.
Tencent Hy 30B/7B/1.8B (www.reddit.com) from tencent: Hy-MT2 is a family of “fast-thinking” multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages…
llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig (www.reddit.com) PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
First direct side by side MoE vs Dense comparison. (www.reddit.com) https://arxiv.org/pdf/2507.17702
Qwen3.6 One Shot Tetris Game (www.reddit.com) I am blown away by what this model can generate locally. I asked for a flashy Tetris game with particle effect and boy did it deliver!
REAP-pruned Nemotron-3-Super (512 -> 256 experts) + GRPO fine-tune + FP8/AWQ. AIME 2026 90%+. Benchmark inside. (www.reddit.com) Hey r/LocalLLaMA, Dropping a release I've been working on during AIMO3 (Kaggle competition). Took NVIDIA's Nemotron-3-Super-120B-A12B (latent MoE + Mamba2 hybrid), REAP-pruned from 512->256 experts (removed MTP layer too), LoRA-RL fine-tun…
New Release of ROCm based MLX LLM Engine - lemon-mlx-engine (www.reddit.com) Hey everyone lemon-mlx-engine just got done integrating TheRock / ROCm 7.13 into the lemon-mlx-engine which means you get to try the latest ROCm on your local hardware with the MLX engine! This also includes various bug fixes and kernel fi…
Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant (www.reddit.com) Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.
Gemma 4 31B passed 7/8 real-world production tests — including ones I designed to make it fail. Full prompts + outputs. (www.reddit.com) I've been waiting for a capable free local LLM for a while. I think we're close — the quality is getting there fast, and Gemma 4 is the first open-weight model where I genuinely considered using it in production for simple-to-medium tasks.
Is Qwen3.6 current king for local agentic use? (www.reddit.com) I've been testing other models but it seems like nothing even come close to Qwen3.6 35B A3B for agentic use. The worse I'd get is a loop sometimes, while Gemma4 produced broken tool calls occasionally and I couldn't even get GLM 4.7 Flash…
hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) (www.reddit.com) A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…
Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark (www.reddit.com) Just ran some llama-bench comparisons between ROCm and Vulkan backends on my Strix Halo system. Vulkan came out ahead, which surprised me.
Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers (www.reddit.com) I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.
MTP experiences on 7900xtx? (www.reddit.com) Hi! I have been using Qwen3.6 35B A3B happily the past few weeks, and I wanted to try out Qwen3.6 27B with the new fancy MTP speculative draft!
MiroThinker-1.7, an open-weight deep research agent (Qwen3 MoE base) — mini is 30B/3B active, curious what tok/s people get on consumer hardware (www.reddit.com) As usual, disclosure first: I'm on the team that built this. Our MiroThinker-1.7-deepresearch and 1.7-mini-deepresearch API went live, mini is a deep research agent built on Qwen3 MoE (30B total, 3B active for mini).
GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) - aiming for 64K ctx, 30+ t/s (www.reddit.com) Hey all, looking for some real-world advice on GPU choices for running the new dense models, mainly Qwen 3.5 27B and Gemma 4 31B. What I’m targeting: Context: 64K+ (ideally higher later); Speed: 30+ tok/s @ tg128 minimum; Power: not critical…
Nemotron 3 Ultra: Open MoE Hybrid Mamba-Transformer for Agentic Reasoning [pdf] (research.nvidia.com via hn) NVIDIA, 2026-6-4. Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning. Abstract: We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybri…
Qwen3.6-35B-A3B Q4 262k context on 8GB 3070 Ti = +30tps (www.reddit.com) ..and on 8GB VRAM I can even push the context to 320K, 400K, 512K, and yes.. 1M.
Strip Qwen3.6 dense of its multimodal capabilities (www.reddit.com) This may be naive but if we stripped a model of its image processing/voice processing capabilities, can it make it smaller or faster? Is that even possible?
Alibaba open-sources Qwen3.6-35B-A3B, a 35B MoE model with 3B active parameters (huggingface.co via hn) Qwen3.6-35B-A3B. Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context (www.reddit.com) If anyone is looking for a good high-speed setup with ~190k context, this config has been working insanely well for me. I’m using my laptop as a server over Tailscale.
Qwen Models are such good models? (www.reddit.com) https://preview.redd.it/o1uxb57u47yg1.png?width=862&format=png&auto=webp&s=d38204fe6ccd0d8326dcd98a534e9a226d213f99 How trustworthy are Artificial Analysis intelligence index? so according to them Qwen 3.6 27B is better than bigger MoE mod…
qwen3.6-35b-a3b-mtp running on GTX 1060 6GB (www.reddit.com) I have this old 10-year old Dell T5810 workstation with 32GB ddr3(?) memory and a E5-2698v3 (16 cores 32 threads), a GTX 1060 6GB that's used for mining back in the old days (paid itself back many times over). I managed to get the model ru…
Experts first llama.cpp (www.reddit.com) This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers.
Gemini 3.5 Flash is twice as expensive as ChatGPT 5.5 on GitHub Copilot. Also, Gemini reasoning models are MoE (www.reddit.com) Also FYI, Gemini reasoning models (2.5 Pro, 3.0 Pro and 3.1 Pro) were MoE. I don't know why this isn't more broadly discussed.
If you use continue.dev and Qwen 3.6 (dense / MoE) - I could use your help (www.reddit.com) Someone suggested I give Continue (Vscode extension) a try. I've been using Roo / Zoo now and liking it but it is pretty tough on context and I was told continue has more control over it.
Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 (www.reddit.com) CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache.
Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models (www.reddit.com) Bigger ubatch made gpt-oss-120b prompt processing much faster on my RTX 3090 I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can massively improve prompt p…
Has anyone tried Zyphra 1 - 8B MoE? (www.reddit.com) https://x.com/ZyphraAI/status/2052103618145501459?s=20 Today we're releasing ZAYA1-8B, a reasoning MoE trained on u/AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size o…
[7900XT] Qwen3.6 27B for OpenCode (www.reddit.com) I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. The VRAM is a little bit scarce, but I ended up with this so far: llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \ --port 8080 \ --host 127.0.0.1 \ --t…
Llamacpp server : How do the -np and -c flags interact? (www.reddit.com) I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.
Command A+ (218B MoE) running on Apple Silicon — MLX port, PR open (www.reddit.com) Cohere dropped Command A+ on the 20th (218B total / 25B active, 128 experts top-8, Apache 2.0). Wrote a cohere2_moe implementation for mlx-lm to get it running on Apple Silicon.
Any reason to run dense over MOE for RAGs? (www.reddit.com) I tend to use Claude for a lot of research and I also increasingly worry about things like misinformation or things in the model I can't audit. So, I'm building my own all in one RAG with big datasets like all of Wiki, research papers, all…
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model) (www.reddit.com) I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me spl…
Is there a limit on the number of active parameters in an MoE model? (www.reddit.com) Hi. We recently had MoE models as big as 1T and 1.6T total parameters.
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (www.reddit.com) I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…
Testing MiMo-V2.5-IQ3_S with 1'048'576 context (www.reddit.com) llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --parallel…
Sorry if it's not the best place to ask this, of the models in the image, which is the best for (problem solving)/Coding and the best one for studying (ask LLM concepts) ? My PC build is RX 9060 XT 16GB + I3 12100F + 16 GB DDR4 + llama.cpp with Vulkan backend + Linux Mint. (www.reddit.com) I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…
What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M (www.reddit.com) I tried a local model a couple of weeks ago. At the beginning, I tried Ollama, but Reddit says it is better to switch to llama.cpp.
A note of warning about DFlash. (www.reddit.com) It started with claims of a 4-5x speed advantage over the usual bf16 models (tests are less optimistic, but let's assume this is true). Then it turned out the MoE gain is not that good; that figure was for dense models.
Expert Selections in MoE Transformer Models Reveal Almost as Much as Text (arxiv.org via hn) We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions l…
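The routing step mentioned in the abstract above ("each token is routed to a subset of expert subnetworks") can be illustrated with a minimal, stdlib-only sketch of top-k routing. This is not the paper's code or any particular model's router: the logits are made-up numbers, `route_token` is a hypothetical helper, and renormalizing the top-k weights is one common convention assumed here for illustration. The per-token list of expert indices it produces is the kind of "expert selection" signal the abstract refers to.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_indices, weights), where weights are the softmax
    probabilities of the chosen experts renormalized to sum to 1
    (an assumed convention, not taken from the source).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return top, [probs[i] / norm for i in top]


# Hypothetical router logits: 3 tokens, 4 experts each.
logits_per_token = [
    [2.0, 0.1, -1.0, 0.5],
    [-0.3, 1.7, 1.6, 0.0],
    [0.0, 0.0, 3.0, 2.9],
]

selections = [route_token(logits, k=2) for logits in logits_per_token]
for token_id, (experts, weights) in enumerate(selections):
    print(token_id, experts, [round(w, 3) for w in weights])
```

Only the index lists (`[0, 3]`, `[1, 2]`, `[2, 3]` for these made-up logits) would be visible to someone observing routing decisions; the sketch says nothing about how much text can be recovered from them.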
Launch HN: General Instinct (YC P26) – Frontier models on edge devices (news.ycombinator.com) Hey HN, Guanming and Bill here from General Instinct (https://general-instinct.com/). After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available.
Gemma4 26b MoE running in MLX with turboquant (and custom kernel) (www.reddit.com) TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with…
Qwen3.6:27b single-shot fixed a CSS UI bug that had Gemma4:26B doom looping uselessly for 15 minutes (www.reddit.com) Warning: long post ahead. On the bright side, it's 100 percent human-written, typos and all.
OBLITERATUS by elder_plinius anyone actually used it on a real model? Worth running over Heretic for MoE targets? (www.reddit.com) https://github.com/elder-plinius/OBLITERATUS
I ran an experiment on the 30b class of gemma4 and qwen3.5 models to try to learn about energy cost and performance tradeoffs. In other words, which models use more energy to give the same answer quality? (www.reddit.com)
llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth (www.reddit.com)
Qwen3.5 50% expert reduction success (news.ycombinator.com) We surgically removed half the experts from Qwen3.5-35B-A3B to create 8 memory efficient domain specialists (coding, web, math, physics, biology, engineering, vocational, humanities). A cross-domain test shows a 96-point pass@5 gap between…
How does MOE training ensure different experts are chosen? (www.reddit.com) I’m training a coding model that is basically a large model and a mini model built into one. Think of it like a person with two heads.
PithTrain – a compact, agent-native MoE training system (blog.mlc.ai via hn) TL;DR. PithTrain is a compact, agent-native Mixture-of-Experts (MoE) training framework, in about 11K lines of Python.
I ran GLM-5.1 on a 16GB RAM machine (github.com via hn) 🧠 MoE-on-a-Potato Running a 754-Billion Parameter LLM on a 16GB RAM Consumer PC "Saying it's impossible is not engineering. Saying we don't know how yet is science." MoE-on-a-Potato is an experimental project dedicated to testing the extre…
Server build for local inference. 128 gb 3200 or 256 gb 2133mhz RAM? (www.reddit.com) Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual rtx 3090 - gigabyte psu 1600 w What do y…
Why not dynamic active parameters (and other questions for the knowledgeable) (www.reddit.com) Why do we have to choose between MoE or Dense models? Wouldn't it be possible to have a model where the user can select the number of active parameters?
Cohere Open-Sources Command A+, a 218B MoE Model That Runs on Two H100s (firethering.com via hn) Cohere spent the past year deploying North, its enterprise AI workspace, with actual customers doing actual work. Agentic question answering over company file systems.
Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s! (www.reddit.com) Hey r/DeepSeek, Who says we need an H100 cluster or the latest expensive GPUs to run frontier MoE models? I wanted to see how far we could push a single node of consumer legacy hardware, so we spent less than $2,500 total to build a budget…
What is the point of MoE models, beyond being faster? (www.reddit.com) Hi. Besides the fact that an xByA MoE models runs as fast as a yA models but produces better results, what are other benefits of pursuing an MoE architecture and not a dense one with e.g.
LLM's on Android (Snapdragon 8 Elite) MOE Experience (www.reddit.com) So I bought a phone with Snapdragon 8 elite (gen 4) and 24GB ram (Honor magic 7 pro). My experience has been mixed but with solid potential.
Developers who use local AI - Q4_0 vs Q8_0 KV quant? (www.reddit.com) I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…
Strix Halo or GPUs? (www.reddit.com) I want to build my own AI server. I already have multiple servers at home, but none have GPUs, nor are they powerful enough to host 4B+ models. I'd like to be able to host dense 27-30B parameter models, or some MoE with 3b activated paramete…
Who is your favourite quant publisher and why? (www.reddit.com) Hey everyone, I’ve been a big fan of Unsloth for several reasons: They publish models ASAP after release. They usually offer the lowest PPL.
how i can improve inference speed (www.reddit.com) Specs: Core i5 14400F, 32GB DDR4 3200MHz RAM, RTX 4060. Current speeds: 30 tps output, 500 tps prefill. Command I currently use: .\llama-server.exe ` -m "H:\model\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" ` --host 0.…
Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM (RTX 5080)? (www.reddit.com) Hey folks, looking for advice before I delete or keep a huge model file. I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM.
SubQ - claims to be a different architecture - anyone tried? (www.reddit.com) Has anyone tried SubQ, LLM using a so called " fully sub-quadratic sparse-attention architecture (SSA)" as opposed to flash attention - https://x.com/alex_whedon/status/2051663268704636937 Without flash attention - is it just a hybrid MoE…
Smaller gguf getting way less tokens per second?? So confused! (www.reddit.com) Noob here, Running Qwen3.6 35B A3B in LM Studio on a 3080 10GB + Ryzen 5 3600 on Windows 10. Tried some unsloth quants with identical settings (GPU offload 40, MoE layers to CPU 40, context 8192, flash attention on).
Five labs, one suite, do model families have personalities? (benchmark) (www.reddit.com) Bench 3 from my 18GB M3 Pro. Bench 2 was the 4B-class post where the comments were mostly right: I gave thinking models a fixed 1024-token cap, Qwen got kneecapped, Gemma E4B needed clearer active-param labeling, and the headline was partl…
XiaomiMiMo MiMo-V2.5 (not pro) - Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters (www.reddit.com) https://huggingface.co/XiaomiMiMo/MiMo-V2.5 Interesting because unlike its bigger brother it can be run on "more human" configurations
Show HN: I ported OmniAID image detection model to Apple's Neural Engine (apps.apple.com via hn) OmniAID is a hybrid MoE detector, so the PyTorch model dynamically routes each image through top-k semantic experts plus a fixed artifact expert. For the CoreML/ANE port, I rewrote that into a static graph.
[Qwen3.6 35b a3b] Used the top config for my setup 8gb vram and 32gb ram, and found that somehow the Q4_K_XL model from Unsloth runs just slightly faster and used less tokens for output compared to Q4_K_M despite more memory usage (www.reddit.com) Config: CtxSize: 131,072; GpuLayers: 99; CpuMoeLayers: 38; Threads: 16; BatchSize/UBatchSize: 4096/4096; CacheType K/V: q8_0; Tool Context: file mode (tools.kilocode.official.md). Results (Metric | M Model | XL Model | Difference): Avg Tokens/sec | 28.92 | 29.78 | +0.86…
Better MoE model inference with warp decode (cursor.com via hn) By flipping the parallelism axis we achieve 1.8x faster and more accurate MoE model inference. Most MoE inference systems organize the token generation path around experts.
Cohere North Mini Code 30B MoE Apache 2.0 (cohere.com via hn)
A 35B MoE on a 16 GB GPU, without the offload tax (www.lucebox.com via hn)
Step 3.7 Flash – 198B-A11B MoE vision-language model (huggingface.co via hn) [ModelPage]: https://static.stepfun.com/blog/step-3.7-flash/ Introduction: Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model that combines a 196B-parameter language backbone with a 1.8B-parameter visio…
Rotary GPU: Exploring Local Execution for Large MoE Models Under Limited VRAM (arxiv.org via hn) Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environm…
DeepSeek-OCR Visualized (medium.com via hn) Dec 11, 2025. Understand SAM, Token compression, DeepSeek-MoE, Multi-Head-Latent-Attention. DeepSeek-OCR is essentially a combination of known architectures, namely SAM, CLIP and CNNs for the vision encoder and MoE decoder langua…
How Qwen3.6-35B-A3B fails differently as a sub agent compared to solo (www.reddit.com) Been running Qwen3.6-35B-A3B as a sub agent on a single 4090 for a few weeks. The failure modes are different from solo use and I haven't seen this written up anywhere.
Could someone please help explain these results? (www.reddit.com) I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled!
What nobody's measuring about dense MoE in production tool calling agents (www.reddit.com) Most of the model selection conversation I've seen focus on benchmark scores and cost (no surprise there). The question I can't find good production data on is whether dense vs MoE actually affects reliability for tool heavy agentic flows,…
Interesting paper advocates for quantized prefilling and precise decoding (arxiv.org via reddit) From other people's tests, NVFP4 decoding speed hasn't really allowed people to hit higher peaks (let's say: 85-90% memory bandwidth utilization) versus other approaches. The development leans toward a different class of optimization like…
Show HN: Modernizing my old PhD work in an evening with little Qwen3.6 MoE (github.com via hn) pge-jax JAX implementation of the Prioritized Grammar Enumeration (PGE) algorithm for symbolic regression. Overview pge-jax is a complete symbolic regression system that automatically discovers mathematical formulas from data.
LLM detector with science behind it (huggingface.co via hn) Arepo - MoE AI Text Workbench Arepo scores text with a 4D Benford/mantissa engine and a mixture of local Gaussian experts. It reports confidence levels and evidence classes instead of forcing every text into a binary Human/AI verdict.
Any good MOE ~60B models? I have 64GB vram (www.reddit.com) I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before rampocolypse for ~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while. Are there any good MOE models that are around 60B in parameters so I can ma…
Strix Halo ROCm + MTP Notes (May 2026) (www.reddit.com) With the MTP merge into mainline llama.cpp I wanted to try out some other optimizations I could think of. Ended up testing backends, MTP, and bumping to ROCm nightlies.
Efficient use of Large system RAM (www.reddit.com) For example, if I have 128 GB of system RAM but only 16 GB of VRAM, am I still limited to models that fit within GPU memory (aside from CPU offloading techniques like MoE)? Are there ways to increase context size using system ram with usab…
EMO: Pretraining mixture of experts for emergent modularity (allenai.org via hn) Today we're releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data without relying on human-defined priors. EMO lets you use a small subset of its experts – just 1…
How does llama-server pick which MoE experts go on the GPU and which stay on the CPU? (www.reddit.com) If you are using a MoE model that does not fully fit in your GPU, some of the experts must stay on the CPU. Putting the experts that you will actually need on the GPU will give you GPU inference speeds.
Show HN: Transformer Math Explorer (simonramstedt.com via hn) Interactive reference for transformer models, presented via dataflow graphs, drillable down to elementary mathematical operations. Covers models from GPT-2 to Qwen 3.6, with MLA, MoE, RoPE, MTP, hybrid attention, and other variants togglea…
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide (www.reddit.com) I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday.
Fine-tuned Qwen3.6-35B-A3B DeltaNet experiment (www.reddit.com) I fine-tuned Qwen3.6-35B-A3B on its own outputs for $7 on Apple Silicon + Modal. DeltaNet LoRA targeting was the hard part.
tested four newest open source Kimi K2.6 is the fastest, GLM 5.1 the fanciest, DeepSeek V4 is the most comprehensive, and Xiaomi MiMo is the slowest (www.reddit.com) Architecture explains the gap: MiMo's MoE runs more active params per token than Kimi K2.6's optimized routing hence slowest. DeepSeek V4's 'comprehensive' edge is partly MLA: ~75% KV-cache compression makes it far better for long agentic…
APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier (www.reddit.com) Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then t…
PI agent integrated with Cline-Kanban repo: All using PI and Qwen 3.6 35B MOE UD 4K_XL (www.reddit.com) Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…
Were Qwen3.6 models scrubbed from openrouter? (www.reddit.com) I made a simple app using openrouter, hoping to use the new small qwen models (the a3b moe and the 27b dense one), but they aren’t listed. Also, I swear some qwen3.6 models that were listed before are missing now.
Gemma 4 is not your standard transformer (idlemachines.co.uk via hn) Gemma 4 makes five quiet departures from the standard transformer recipe. QK-norm instead of 1/√d, partial RoPE on global layers, per-layer input gating, KV sharing across layers, and an MoE that sits alongside the MLP rather than replacin…
PSA re Qwen 3.6 35B A3B q4 + agents (www.reddit.com)
Recommended parameters for Qwen 3.6 35B A3B on a 8GB VRAM card and 24GB RAM? (www.reddit.com)
TPU v7x Ironwood vs Nvidia B200 (www.reddit.com) Google published Ironwood inference benchmarks in their AI-Hypercomputer/tpu-recipes repo. Nvidia has InferenceMAX numbers for B200.
Intel Lunar Lake 258V (32GB) vs Qwen 3.6 35B-A3B: Pushing the limits of MoP architecture. (www.reddit.com) Hardware: Intel Core Ultra 7 258V, 32GB Unified Memory. Model: Qwen 3.6 35B A3B (Quant: Q3_K_S) via LM Studio.
"LORAs"? (www.reddit.com) Hi. I'm curious about something.
Show HN: Ported Cerebras REAP to MLX – Prune MoE Experts on a MacBook (github.com via hn) REAP MLX: Apple Silicon REAP expert pruning for MLX-LM MoE models. REAP MLX applies Router-weighted Expert Activation Pruning (REAP) to…
Show HN: Ministry of Everything – CLI agent harness for a single operator (github.com via hn) Ministry of Everything (MoE) is a CLI-first harness for one operator directing AI agents through durable markdown work. MoE runs Claude Code or Codex against living markdown documents.
Full-Pipeline Inference Optimization for MiMo-v2.5 Series (mimo.xiaomi.com via hn) The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, combines several architectural design choices: Hybrid Sliding Window Attention (Hybrid SWA) compresses KVCache storage to roughly 1/7 that of Full Attention; sparse MoE activati…
GH200 NVL2 or 8x RTX 6000 Blackwell for running Kimi K2.6 / DeepSeek V4 locally? (5 devs, agentic coding) (www.reddit.com) Trying to figure out the right box for my team and wanted to see if anyone had any clue which would be a better fit or if it is not worth our time in our budget. Situation: 5 of us doing agentic coding (lots of long context getting re-sent…
Is a 128 GB MacBook Pro M5 Max actually too slow for large-context local LLM coding workflows? (www.reddit.com) People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM. My main concern is prompt ingestion / prefill latency and large-context handling — not raw token generation speed (which I think is OK).
Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes (www.reddit.com) I've been working on MoE inference and wrote a fused dispatch kernel entirely in Triton, no CUDA. At inference batch sizes (up to 512 tokens) it reaches 89-131% of Megablocks(Stanford's CUDA-optimized MoE lib), and the same kernel runs on…
Dense vs. Moe Model (engineersmeetai.substack.com via hn) Yesterday, I ran out of tokens in OpenAI Codex while oxidizing parts of my Python codebase into Rust. It was around 11:30 PM, and I had to wait another two hours for the limits to reset.
Micro-Expert-Router: Running Mixtral-Class Moe Models on NVMe SSDs Without a GPU (github.com via hn) Micro-Expert-Router, SSD-Streamed MoE Execution Engine A Rust execution engine for Mixture-of-Experts models that keeps the router resident in RAM and hot-swaps individual experts on demand from a PCIe-attached NVMe drive into a pool of pr…
I'm running an agentic system with kobold.cpp as my backend. Am I losing performance? (www.reddit.com) Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.
Moe inference optimizations: 15% lower expert load by request reordering (blog.doubleword.ai via hn) MoE expert co-activations: Reordering inputs yields easy throughput gains. Doubleword's batch inference offering keeps costs down by keeping throughput high, something which isn't easily done given the architecture of popular Mixture-of-Ex…
Volatile prefill speed after each reboot - llama.cpp (www.reddit.com) After every machine restart I get a different prefill speed, it can be only 300t/s or 1500t/s. It's like a lottery at each restart.
Command A+: Making sovereign agentic capabilities available to all (cohere.com via hn) Today, we’re releasing Command A+ open-source. A mixture-of-experts (MoE) model, Command A+ is an efficient, versatile, and privately deployable LLM built for high-performance agentic tasks with minimal compute overhead.
Qwen3.6 35B MTP, t/s varies on different scenario (www.reddit.com) Tried Qwen3.6 35B Q5_K_M MTP, HW: 9700x, 64GB 5600 RAM, 5060 TI 16GB. --n-cpu-moe 30 ^ -ngl 99 ^ -c 131072 ^ --no-mmap ^ --flash-attn on ^ --cache-type-v q8_0 ^ --cache-type-k q8_0 ^ --threads 8 ^ --parallel 1 ^ -rea off ^ --reasoning-budg…
Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations (www.reddit.com) Getting 10.3 tps using this prompt: CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" ./build-mimo-5090-3090/bin/llama-server -m "$MIMO" -ngl 999 --n-cpu-moe 43 --no-mmap -c 100000 -ctk q8_0 -ctv q8_0 -fa on -…
Are the rich RAM /poor GPU people wrong here? (www.reddit.com) Hello Guys, I know everyone has his definition of local models, but for me I see 2 "reasonable" types of frontier local models: a dense one that barely fits in 32GB or 24GB of GPU for the most "reasonable" GPU wealthy guys and a MOE in the…
Stratum: System-Hardware Co-Design with 3D-Stackable DRAM for Efficient Moe (dl.acm.org via hn) As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use sparse g…
Anyone else experiencing heavy hallucinations with MiMo-V2.5 (310B) quantized version? (www.reddit.com) Has anyone else run into major issues with MiMo-V2.5 (the 310B total / 15B active MoE model from Xiaomi)? I tried the UD-Q4_K_XL quant from Unsloth.
Local-first LLM context dedup: 22-71% chunk overlap measured across 22M passages (2 arXiv papers). MCP server, MIT, 250KB binary, zero telemetry. (www.reddit.com) I'm the author of this thing, disclosure up front. Been hanging around this sub lately on cache invalidation, MoE memory tradeoffs, long-session token bloat.
The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens? (www.reddit.com) Xiaomi open-sourced MiMo-V2.5-Pro. 1.02 trillion parameters, 42B active (MoE), 1M context, MIT license.
Nemotron-Cascade 2: Post-Training LLMs with Cascade RL (research.nvidia.com via hn) We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. It is the second open-weight LLM, after DeepSeek-V3.2-Speciale-671B-A37B, to achieve…
Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM (www.reddit.com) Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable. Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL Why these models: Qwen2.…
I let four MoE LLMs from different model families argue stocks to try and pick the best ones. (www.reddit.com) I built an AI trading experiment in which four local LLMs argue bull and bear cases on stocks, and a host model grades the debate and decides BUY, SELL, or HOLD. Most days it holds.
Local Context Compression: Big or Small? (www.reddit.com) What are your thoughts/what is the consensus on local context compression model size? Are you guys using small MoE models to do this quickly and move along hoping you get all the important bits, or large dense models that take forever (giv…
MoE-Hub Taming Software Complexity for Seamless MoE Overlap on Multi-GPU Systems (arxiv.org via hn) The Mixture-of-Experts (MoE) architecture is crucial for scaling large language models, but its scalability is severely limited by inter-GPU communication bottlenecks in multi-GPU systems. Although overlapping communication with computatio…
Dual gpu question (www.reddit.com) Hi, I have an RX 9060 XT and an RX 6600. 16GB and 8GB.
new MoE from ai2, EMO (www.reddit.com) new MoE release from ai2 - EMO, 1b-active/14b-total trained on 1t tokens interesting thing is document-level routing. experts cluster around domains like health, news, etc.
Swapped from a lighter agent runtime to Hermes Agent on a local 35B MoE — what changed (capability up, latency up, context budget down) (www.reddit.com) Two weeks of running Hermes Agent as the daily driver on a local stack. Sharing the trade-offs because anyone evaluating agent runtimes for local models is going to hit these.
ZAYA1-8B: An 8B Moe Model with 760M Active Params Matching DeepSeek-R1 on Math (firethering.com via hn) Who should care If you work with math, science problems, or complex coding tasks and you're looking for something small enough to run locally or cheaply via API, this is worth serious evaluation. The benchmark numbers at 760M active parame…
What models for coding are you running for a mid level PC? (www.reddit.com) I have a 4060 (8GB Vram) and 16GB of ram wondering which models could fit in my setup for coding, the new Qwen 3.6 and Gemma 4 MoE models look good but might not fit, wondering about your experiences
Zyphra releases the ZAYA1-8B MoE model optimized for intelligence density (huggingface.co via hn) ZAYA1-8B ZAYA1-8B is a small mixture of experts language model with 760M active parameters and 8.4B total parameters trained end-to-end by Zyphra. ZAYA1-8B sets a new standard of intelligence efficiency for its parameter count through a co…
Best Llama Config for Turboquant_Plus? (Stats below) (www.reddit.com) So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.
Advice needed on eGPU and Mini PC (www.reddit.com) Hi all, I come across to relatively niche problem and could not find much useful posts or guides about it. I have a mini pc (Beelink Ser 8, 8745HS and 32GB 5600 DDR5 SODIMM) headless server for hosting some routing services, and I am wonde…
127³ — Superintelligence, public. DeepSeek V4 Pro (deepseek-v4-pro-127cubed.vercel.app via hn) DeepSeek V4 Pro 127³ 127-stratum crystalline lattice on DeepSeek V4 architecture. 1.6T params · 49B activated · MoE · 1M context · MIT license.
OpenAI's Privacy Filter vs GLiNER on 600 PII samples (www.reddit.com) Both models are open weight, both run on a local CPU workstation, both detect PII in text. Quick rundown of what I found.
Show HN: Phase Router – capacity-aware routing for MoE (github.com via hn) A deterministic, capacity-aware routing kernel that reduces dropped work in load-balanced systems. Trades microseconds of routing for milliseconds of saved compute.
Project Aurelia — A 3-model architecture (80B + 13B + 9B) that physically reacts to my real-time heart rate via mmWave radar, spatial awareness via Lidar, and Vibration via Accelerometer. (www.reddit.com) Hey everyone, I’ve been building a multi-agent system in my spare time, and I just open-sourced the repository. I was getting tired of the standard text-in/text-out chat paradigm and wanted to build a genuinely situated AI—one that actuall…
Memory upgrade, is it worth it? (www.reddit.com) Hi, I need your opinion on a system upgrade, 🤔 I currently have the following AI server used for various tinkering, learning, development etc. System AMD Ryzen 7 7700 (8C16T Zen4) Corsair Vengeance RGB DDR5 5600MHz 32GB MSI B650 Gaming Plu…
Qwen 3.6 35B-A3B takes a long time at image processing. Is it happening only to me? (www.reddit.com) 9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.
I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX (www.reddit.com) I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.
Gemma 4-31B vs Qwen 3.5-27B vs Qwen 3.6-35B-A3B on a browser-agent vision prompt — MoE wins on every axis (www.reddit.com) I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…
Deploying Gemma 4 26B A4B on a single RTX 5090 — ~196 tok/s with AWQ + vLLM on RunPod Serverless (www.reddit.com)
Multi GPU setup help (www.reddit.com) Hi guys I managed to get a multi GPU setup going with a 3090 and three 3060 bringing my vram to 60gb along with 64gb ddr5. The objective is to run the largest coding model I can at a respectable token speed of over 20 tokens / second.
How is V100 32GB PCIE for LLM? (www.reddit.com) I have just brought one of these cards for non llm related reasons (new old stock), but I would enjoy the possibility of using it to run slightly larger models than currently allowed by my 4080 Super 16GB which will stay in the same box al…
Qwen3.6-35B-A3B — full JANG suite (15 profiles, 1L through 6K) for Apple Silicon (www.reddit.com) Full JANG adaptive mixed-precision quantization sweep of Qwen3.6-35B-A3B: https://huggingface.co/collections/bearzi/qwen36-35b-a3b-jang All 15 profiles, from extreme compression to near-lossless: JANG_1L JANG_2S/2M/2L JANG_3S/3M/3L/3K JANG…
I Lora trained Qwen 122B in NVFP4 on a single 128GB GPU (www.reddit.com) Huggingface loads it but instant OOM when it hits bf16 deepspeed zero3 with nvme offload. Loaded the shard but the weight names dont match(NVFP4 stores weight_packed/weight_scale, model expects weight) HF disk offloading - decompress befor…
How to run MoE models without necessary RAM? (Apple Silicon) (www.reddit.com) Hey, I have a M1 Pro 16gb machine, and I wanted to run the Qwen3.6/3.5 35A3B model. However, this model cannot fit on a 4bit quant on my system.
Who is actually behind the "Elephant-Alpha" stealth model on OpenRouter? (www.reddit.com) Has anyone else been tracking this? I just checked the OpenRouter daily rankings, and this anonymous "Elephant" (or Elephant-Alpha) model is sitting comfortably at the 8th spot.
Qwen3.5 35b is sure still one the best local model (pulling above its weight) - More Details (www.reddit.com) Last time I posted on how this model has performed in creating the webapp based on provided research paper. I got so much love to see people has appreciated the post and of-course the potential of this MOE model.
Ask HN: How do you prepare for a mid career Research Engineer role at neo Labs (news.ycombinator.com) Hey, I’m sure this question has been asked in various forms on HN. While I feel the answer might mostly stay the same, changes with various developments in AI - relevance of concepts like MoE, RL etc change - and the tools like custom Open…
Qwen 3.6 35b A3B Speed Help (www.reddit.com via reddit)
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning (arxiv.org) Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actu…
Post-Trained MoE Can Skip Half Experts via Self-Distillation (arxiv.org)
Jetbrains Mellum 2: a really good and performant model (www.reddit.com via reddit) Oh Hey Folks, I took the Mellum 2 model for a spin, so I wanted to share my impressions here. Disclaimer: the tests presented here are not scientific nor have those nice names like perplexity, etc.
I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B. The biggest model forgot a fact from earlier in the chat first. (www.reddit.com) Ran a small, focused eval on three on-device models and the result was backwards from what I expected, so sharing the method and numbers. The task: tell the model "my dog is named Pablo," then add N turns of unrelated filler (shuffled gene…
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference (arxiv.org)
Reversible Foundations: Training a 120B Sparse MoE through State-Preserving Scaling (arxiv.org)
I built a PyTorch MoE/MoD training framework with custom CUDA kernels [Apache 2.0] (www.reddit.com via reddit) PyTorch framework for training transformer LLMs with MoE and MoD architecture support, custom CUDA kernels, and DeepSpeed integration. Key things it does: - Custom CUDA kernels for RMSNorm, RoPE, SwiGLU, MoE routing.
2-bit QAT model releases (www.reddit.com via reddit) So far model releases that take advantage of Quantization Aware Training (QAT) have been focused on 4-bit. I’m curious what could be accomplished with a larger MoE model around 120b up to 400b.
Dense vs MoE quantization resiliance (www.reddit.com via reddit) Which one is more resiliant to quantization? Especially at 4-bit?
It felt good to return my Asus Spark (www.reddit.com via reddit) It's an incredible little package but too expensive of a price to pay for the performance and I simply didn't want to be part of the great "Superchip lie" - it could be super, but its super ruined by its limited memory bandwidth even thoug…
Gemma 4 QAT accuracy inconsistencies (www.reddit.com via reddit) Table from https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis I heard that MoE models are usually more susceptible to quantization error, but what happened with the 12B? I thought lower-parameter models usually quantized worse and yet…
Stuck trying to moe my folder out of the 'tabs' folder for 30minutes (www.reddit.com via reddit) Can anyone point me in the right direction, I've been going at this for over 30 minutes just trying to move my "_budgetContext.tsx" file out of the (tabs) folder but nothing seems to be working https://preview.redd.it/8on4rg6cyo5h1.png?width…
Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed (www.reddit.com via reddit) I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.
Gemma 4 Haters 2 months Ago now seems to love Gemma 4 now. (www.reddit.com via reddit) What's with the switch guys? now imagine if google gonna drop 128B model or a MoE version (I bet those Qwen lovers will forget Qwen even existed).
Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result (www.reddit.com via reddit) TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps).
What exactly is quantization aware training? (www.reddit.com via reddit) First time hearing it. I also heard about the gemma 4 qat quants and if any one of them is good for 4gb vram and 16gb ram.
Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models (arxiv.org) Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dens…
Less is MoE: Trimming Experts in Domain-Specialist Language Models (arxiv.org) Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-p…
UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing (arxiv.org)
AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE (arxiv.org)
2 RTX A6000 at 96GB VRAM with nvlink. Best local coding model/what you would daily drive? (www.reddit.com) Really been testing qwen 3.6 27b and 35 a3b so far with 27b at q8 and 35 a3b at q4 (byteshape quant is insane). But i feel im not utilizing it the best, esp for long context messy coding of large repos.
$16 refactor, 400 steps, 95% routed to open MoE (www.reddit.com) Got tired of $160 Opus bills so I spent a weekend wiring up a routing layer on vLLM 0.8 (2xA100, enable_auto_tool_choice). Getting the tool call parser to cooperate took longer than the actual routing logic.
Rejoice, if Qwen doesn't release any new local model, it's a blessing in disguise (www.reddit.com) Do you remember the times when we only had lama2 released? a bunch of finetunes were released and some of them had real values .
Comparison of Qwen 3.6 and Gemma4 (MoE and Dense models, Q4_K_M), generating a moderately complex MySQL query, only one produced acceptable results (www.reddit.com) I tried Qwen3.6 35B A3B MoE, Qwen3.6 27B Dense, Gemma4 26B A4B MoE, Gemma4 31B Dense. In all cases I was using Q4_K_M and thinking mode enabled.
Measuring Maximum Activations in Open Large Language Models (arxiv.org via reddit) The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the…
"Qwen 3 72B" doesn't exist — and it's in a surprising number of places that act like it does (www.reddit.com) spent today auditing my own model catalog and noticed 39 of my own pages confidently reference "qwen 3 72b" with apache 2.0 licensing, a 2025-09-15 release date, and a 131k context window. seemed normal — qwen 2.5 had a 72b, why wouldn't q…
Best llama.cpp launch config for Qwen3.6 27B on RX 7800 XT (16 GB VRAM) for OpenClaw? (www.reddit.com) I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm. I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup.
Estimate inference speed of local Qwen3.6-35B on Mac M5... (www.reddit.com) "Based on currently available information, estimate the prefill/decode speed of Qwen3.6-35B-A3B Q8 with 262K context on a Mac M5 Ultra 128GB." I'm surprised that almost every LLM fails at this task (ChatGPT/Gemini/Grok/Claude/DeepSeek/Kimi…
Don't you have issues in W11 with AMD GPU where llama.cpp suddenly drops performance for no reason ? (www.reddit.com) I have this issue in all Windows installations I have done in my system, which of course, does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5 Issue is that for some reason, after sometime, llama.cpp performance cuts in half, even restar…
rtx 5070ti with Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf token speed 564/41 (www.reddit.com) --model "/mnt/e/my-path-change-to-yours/qwen3.6-35b/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf" \ --ctx-size 262144 \ --parallel 1 \ --n-cpu-moe 29 \ --no-mmap \ --mlock \ --cache-type-k q4_0 \ --cache-type-v q4_0 10.8/16 d…
What are the best 40-500 B MoE LLM models now? (www.reddit.com) Due to old GPU I run on CPU and came to appreciate value of MoE. I know of MoE for Qwen 3.6 and Gemma-4, which are <40B.
Possibility of partly moe weights gpu offloading via sglang/ktransformers (www.reddit.com) I’m interested in dual Xeon setup with AMX support for ktransformers and CPU sglang backend. Let’s say I have 512gb RAM in 8x channel for each CPU and 2x RTX6000 Pro.
Qwen 3.6 35B MoE at full 262K context on an RTX 3090. Here's exactly how I did it. (low.li via reddit) I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights KV cache at q8_0/q8_0 and 262K context only uses 2.7GB because MoE only has 10 attention layers out of 40 Total VR…
Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) (www.reddit.com) Round 2: 2026-05-02 — llama.cpp b8198 → d05fe1d Rebuilt llama.cpp from b8198 (2026-03-04) to commit d05fe1d (2026-05-02), ~770 builds of progress. Same model, same hardware, same flags.
thinking of gemma 4 26B vs 31B (www.reddit.com) I see a big difference in agentic coding between gemma-4-31B-it-Q5_K_M and gemma-4-26B-A4B-it-UD-Q8_K_XL. The 26B model is much faster because of A4B and generally works well, but there is a big difference in thinking.
Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer (www.reddit.com) Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops. This isn’t a criticism of Qwen3.6.
Rada — AI coding workspace with local-first behavioral routing (no hot-swapping, I built this) (www.reddit.com) With GitHub pausing Copilot Pro+ signups and Claude Code potentially leaving the Pro tier, I started building the AI coding tool I actually wanted to use. One that doesn't depend on cloud access staying cheap and available.
Qwen 35B-A3B as an always-on agentic loop on a 16GB Mac M4: disk became the bottleneck before RAM (www.reddit.com) M4 Mac Mini, 16GB unified, basic spec. For a few weeks I had Qwen 3.5 35B-A3B UD-IQ3_XXS (12GB on disk) running under llama.cpp with --mmap and --flash-attn.
OpenMythos with Qwen2.5-1.5b weights (No recurrence atm) - looking to turn it into full OpenMythos (huggingface.co via reddit) Mythos likely isn't this architecture but I did find it pretty cool to experiment with. It has features of the HRM-27m architecture.
I like my models dense. Can model makers please bring back or update the dense models from like 2 years ago? A nice 39b or 72b maybe? (www.reddit.com) Seriously, Qwen3.6 27b is mopping the floor against models like 5 times its size right now. It doesn’t take a rocket scientist to figure out that maybe the whole a2b and a3b MoE thing isn’t the best solution after all.
My 12-agent Qwen 35B stack on Ollama died at 500 tokens every single time. Raw MLX fixed it and broke 4 other things I didn't see coming. (www.reddit.com) TLDR: Swapped Ollama for MLX on M1 Max (64GB) to run a 12-agent trading stack using Qwen 35B MoE. MLX wins on throughput and fine-grained sampler control, but I lost the "it just works" convenience of Ollama.
Ollama swap to llamacpp/llama server (www.reddit.com) So I'm a newb in certain aspects but not in others, I'm currently running an AI stack on my unraid server: CPU: AMD Threadripper 3960X (24c/48t) Motherboard: Gigabyte TRX40 AORUS PRO WIFI RAM: 256GB DDR4-3200 G.Skill Trident Z GPU: Nvidia…
IQ2XXS Qwen 3.6 35b is actually very usable on 32 gb macbooks (www.reddit.com) Just tested the MoE Qwen model with 2-bit precision and it's surprisingly good. I used the 2-bit XXS from unsloth and it seems to maintain intelligence really well, never failed a tool call so far and surprisingly good at 3js, even better than…
Trade offs for companion roleplay (www.reddit.com) Hey folks for storytelling and companion style roleplay with a local llm, what do you think is the most important? More parameters Less quantization Larger context window Dense vs MoE When looking at what can fit in RAM, I’m thinking that…
Is there a way to load huge MoE models on a computer with way too little RAM for the model's size, inferencing from the SSD, on LM Studio using the mmap/GPU/CPU layer customization thing (similar to how you can on llama.cpp)? I can't get it to load without memory spiking and going into swap. (www.reddit.com)
Proper vibe coding with local LLM for average Joe (www.reddit.com) The short answer is you don't. But let me explain a bit more.
Are we at the point where local AI isn’t a compromise anymore? (Gemma 4 experience) (medium.com via reddit)
Thoughts on MoE Qwen 3.6 35B? (www.reddit.com)
5070 Ti (New) vs 3090 (Used) to pair with 4070 for local LLMs? (www.reddit.com)
Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice (www.reddit.com)
Qwen3-30B-A3B-Instruct-2507 is better than the new Qwen 3.6 for our tasks (www.reddit.com)
Gemma4 26B MoE on Arc 140T (www.reddit.com)
Lm studio running some models very slow while others run normally. (www.reddit.com)
Newbie here (www.reddit.com) Hi guys im on 9950x 196gb and a 4090 This parameters are ok? mi main use will be coding llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --n-cpu-moe 20 -c 250000 --host 0.0.0.0 --port 8082 --reasoning-budget -1 --top-k 20 --top-p 0…
Imposing my laptop to run Qwen 3.6 (www.reddit.com) So, I am excited with the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run in my hardware.
This is very fair. Other interesting context behaviors you've experienced? (www.reddit.com) I guess the model didn't feel it needed to do anything beyond proving. Not entirely sure how I got it to act so..
Qwen3.6-35B-A3B just dropped — quick thoughts after trying it (www.reddit.com) Just gave the new Qwen3.6-35B-A3B a spin. It’s a MoE model (35B total, ~3B active), but honestly the more interesting part is how much they’re pushing agent-style coding.
Anybody else seeing Qwen3.6-35B-A3B go crazy thinking in circles? (Compared to Qwen3.5-35B-A3B) (www.reddit.com) I was working on a simple frontend web design task earlier (styling some buttons) with Qwen3.5-35B-A3B. The end results weren't great, but at least it kept trying to change stuff and call tools properly.
How faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com) I want to download one and usually do inference on CPU having old GPU so I'm concerned with speed. One link on the web (I have posted with it and post been removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…
Is Gemma 4 26B MoE or 31B good as an MCP agent for coding with Xcode? (www.reddit.com) Thanks
Hardware needed for Gemma 26B MoE vs Qwen 14B for ~100–300 users (vLLM, single node?) (www.reddit.com) I'm trying to figure out what sort of hardware setup I will need to accommodate a userbase of 100 users (not necessarily concurrent). Does anyone have any idea what sort of setup I'd be looking at?
Gemma4 vs Qwen3.5! MoE vs Dense! Sota vs Obsolete! Porque no los dos? (www.reddit.com) Every other day, there's someone posting about how the latest hotness of the month is gamechanger, but flawed in some way relative to their previous favorite. I can't help but wonder, does no one else keep their previous gen models on spee…