TL;DR best setup I tested on an RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode. llama.cpp…
#vllm
199 items
Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) (www.reddit.com)
What do you want me to try? (www.reddit.com) Got a new playground at work. Anything I can help run (via vllm maybe) that you might be curious about.
VLLM PR: New MoE model from Cohere soon (github.com via reddit) Easy, fast, and cheap LLM serving for everyone. 🔥 We have built a vLLM website to help you get started with vLLM. Please visit vllm.ai to learn more.
MI50s Qwen 3.6 27B @52.8 tps TG @1569 tps PP (no MTP, no Quant) (www.reddit.com) TL;DR Results from the title are for single inference with 2 prompts of 1k and 15k tokens. So no MTP (as it’s slower for big prompts), no DFlash (working too but slower for big prompts), no quant used (full precision wanted) and the results a…
LiquidAI/LFM2.5-8B-A1B · Hugging Face (huggingface.co via reddit) looks like you can run it on any potato (A1B)! https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF from LiquidAI: LFM2.5 is a new family of hybrid models designed for on-device deployment.
KVarN: Native vLLM KV-cache quantization back end by Huawei (github.com via hn) ⚡️ Built for agentic and long-context workloads. 💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.
Qwen 3.6: worse adherence? (www.reddit.com) Just swapped Qwen 3.5 for the 3.6 variant (FP8, RTX 6000 Pro) using the same recommended generation settings. My stack is vLLM (v0.19.0) + Open WebUI (v0.8.12) in a RAG setup where the model has access to several document retrieval tools.
vLLM ROCm has been added to Lemonade as an experimental backend (www.reddit.com) vLLM has the ability to run .safetensors LLMs before they are converted to GGUF and represents a new engine to explore. I personally had never tried it out until u/krishna2910-amd/ u/mikkoph and u/sa1sr1 made it as easy as running llama.cp…
Qwen/Qwen3.6-27B · Hugging Face (huggingface.co via hn) Qwen3.6-27B. Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransforme…
FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 (www.reddit.com) Last year researchers affiliated with NVIDIA, University of Warsaw, and University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique using learned per-head token eviction, reporting up to 8x KV…
Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct (www.reddit.com) Ok, hear me out. This all started when I was trying to understand why this Qwen3.6 27B INT8 Autoround (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B q…
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM (www.reddit.com) Greetings from former TurboQuant's biggest defender, now middle-sized niche-aware TurboQuant defender. Today I'm presenting to you the results of me thoroughly exploring the world of PPL and KLD benchmarks with my single RTX 3090 using Bee…
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM (www.reddit.com) So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF. This is not a "best possible setup" claim.
Is using vLLM actually worth it if you aren't serving the model to other people? (www.reddit.com) So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc.
Built an open-source one-prompt-to-cinematic-reel pipeline on a single GPU — FLUX.2 [klein] for character keyframes, Wan2.2-I2V for animation, vision critic with auto-retry, music + 9-language narration in the same pipeline (www.reddit.com) Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality).
Follow-up: Qwen3.6-27B on 1× RTX 3090 — pushing to ~218K context + ~50–66 TPS, tool calls now stable (PN12 fix) (www.reddit.com) Following up on our previous post about running Qwen3.6-27B on a single RTX 3090 (~125K context, higher TPS). We’ve been pushing further on both context length and stability for tool-agent workloads.
Qwen3.6-27B at ~80 tps with 218k context window on 1x RTX 5090 served by vllm 0.19 (www.reddit.com) Qwen3.6-27B has been out for a few days and the NVFP4 with MTP was dropped earlier on HF: https://huggingface.co/sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP You can follow the same recipe I used for Qwen3.5-27B to achieve ~80 tps on a single RTX 5090 at…
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA (github.com via hn) tiny-vllm You're going to build a high performance LLM inference engine with C++ and CUDA - tiny-vllm, a younger and smaller sibling of vLLM We will learn a lot along the way, make mistakes and derive the ideas and maths from scratch This…
Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results (www.reddit.com) Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative Prompts: 880 total, 80 prompts across ea…
Gemma 4 26B Hits 600 Tok/s on One RTX 5090 (www.reddit.com) I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM. Setup: GPU: RTX 5090, 32GB VRAM vLLM: 0.19.2rc1 Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit Draft model: z-lab/gemma-4-26B-A4B-it-DFlash Workload: r…
DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q (www.reddit.com) TL;DR: DeepSeek-V4-Flash running at 85.52 tok/s @ 524k ctx and ~111 tok/s @ 128k single-stream on 2× RTX PRO 6000 Max-Q pasta-paul's DeepSeek-V4-Flash-W4A16-FP8 quant is great, but its MTP head silently gets stripped at load time (HF trans…
Kv cache quantization: ignorance, or malice? (www.reddit.com) I run Qwen-3.6 27B FP8 on vllm for long-horizon agentic coding harness workloads with high context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between spee…
Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp (www.reddit.com) Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamac…
z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet? (huggingface.co via reddit) For the past few days, it's all been about MTPs. Somehow people missed the fact that Z lab released the DFlash for Gemma4 26B a couple of days ago.
Those of you running minimax 2.7 locally, how are you feeling about it? (www.reddit.com) I'm running the raw version straight from the MiniMax release on Hugging Face (https://huggingface.co/MiniMaxAI/MiniMax-M2.7) on 3 RTX PRO 6000s on vLLM, so no quantization.
Eagle 3.1: Collaboration Between the EAGLE Team, vLLM Team, and TorchSpec Team (vllm.ai via hn) EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec The EAGLE series — including EAGLE 1, EAGLE 2, and EAGLE 3 — has become one of the most widely adopted and practically deployed fam…
Simple to use vLLM Docker Container for Qwen3.6 27b with Lorbus AutoRound INT4 quant and MTP speculative decoding - 118 tokens/second on 2x 3090s (github.com via reddit) Qwen3.6-27B vLLM Docker Docker-based vLLM serving for Qwen3.6-27B with Lorbus AutoRound INT4 quant and MTP speculative decoding. Model is downloaded at runtime and stored on a host volume so the container can be upgraded without redownload…
Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19 (www.reddit.com) Thanks to the community the Qwen3.6-27B speed keeps getting better. The following improves upon my recipe from yesterday and delivered a whopping 100+ tps (TG).
Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers (www.reddit.com) I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.
Finding the 4x 3090 Sweet Spot (www.reddit.com) https://preview.redd.it/8o43bjhe9d1h1.png?width=5346&format=png&auto=webp&s=1c87c2ee8b8ffff43495f543266056b0e26d3947 In another post I had someone ask me about the power draw of the 4x 3090 setup so I'm sharing a full test I conducted to…
vLLM Just Merged TurboQuant Fix for Qwen 3.5+ (www.reddit.com) Previously it was throwing a 'Not Implemented' error due to Mamba layers. Going to test it now!
Qwen3.6-27B at 72 tok/s on RTX 3090 on Windows using native vLLM (no WSL, no Docker), portable launcher and installer (www.reddit.com) The angle here is native Windows, no WSL. Simple installation, open source, no telemetry.
Intel B70: LLama.cpp SYCL vs LLama.cpp OpenVino vs LLM-Scaler (www.reddit.com) In case anyone is interested, I decided to test out LLama.cpp's new OpenVino backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-case, SYCL, but lags behind LLM-Scaler (Intel's VLLM fork), li…
VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? (www.reddit.com) EDIT - IGNORE. I MADE A MISTAKE.
Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context. (www.reddit.com) Used the vllm version of https://github.com/noonghunna/club-3090 It worked fine for maybe 20 40k context; haven't tried the new one. Has anyone used the new patched llama.cpp one for a single 3090?
club-5060ti: practical RTX 5060 Ti local LLM notes and configs (github.com via reddit) I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to…
qwen3.6 just stops (www.reddit.com) https://preview.redd.it/74cj1xu9pw0h1.png?width=1229&format=png&auto=webp&s=3ae999cc3530ecb4eccf70e25f1a9eb2aa3f2d7b Sometimes qwen 3.6 just stops at the middle of a task, is there a way to avoid it? This is qwen-code CLI, but also happens…
Exaggerated PCI-E bandwidth concerns? (www.reddit.com) I frequently see (both here and on r/LocalLLM ) comments that multi-gpu setups are complex, problematic and typically bottlenecked by PCI-E bandwidth on consumer motherboards. I am running 2x RTX 5060 TI 16gb ( and about to add a third ),…
Throughput and TTFT comparisons of Qwen 3.6 27B, Qwen 3.6 35B A3B and Gemma 4 models on H100 (www.reddit.com) I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple.
Bench 8xMI50 MiniMax M2.7 AWQ @ 64 tok/s peak (vllm-gfx906-mobydick) (www.reddit.com) Inference engine used (vllm fork): https://github.com/ai-infos/vllm-gfx906-mobydick/tree/main Huggingface Quants used: cyankiwi/MiniMax-M2.7-AWQ-4bit Relevant commands to run: docker run -it --name vllm-gfx906-mobydick-mixa3607 -v ~/llm/mo…
Alibaba open-sources Qwen3.6-35B-A3B, a 35B MoE model with 3B active parameters (huggingface.co via hn) Qwen3.6-35B-A3B. Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…
Free vLLM Course: Inference, Compression, Benchmarks (www.deeplearning.ai via hn) Apply quantization to shrink a model's memory footprint, then measure the accuracy tradeoff. Fast & Efficient LLM Inference with vLLM Instructor: Cedric Clyburn Earn an accomplishment with PRO - Intermediate - 1h38m - 9 Video Lessons - 3 C…
Looking to migrate off of Ollama and LMStudio (www.reddit.com) Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.
Qwen 3.6 27B in Claude Code says it will do something then stops and prompts for user reply (not failing a tool call) (www.reddit.com) I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command: vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \ --enable-auto-tool-choice --tool-call-parser qwen3_xml \ --enable-prefi…
Reproduction of TurboQuant (www.reddit.com) There have been many TurboQuant implementations recently in llama.cpp, mlx, vllm, and sglang, but a lot of the discussion and code around them feels pretty noisy and looks to be AI-generated. I’m trying to understand which claims from the…
'Am I OpenAI compatible' - a tool and documentation for unified api signatures in open source AI. (www.reddit.com) This has turned out to be useful to many of my friends so I thought I'd share here as well. I created a tool and documentation page for most major open-source projects' adherence to 'OpenAI compatibility' after seeing inconsistencies betwee…
Final Monster: 32x AMD MI50 32GB at 9.7 t/s (TG) & 264 t/s (PP) with Kimi K2.6 (www.reddit.com) 32 MI50 32GB setup moonshotai/Kimi-K2.6 int4 @ 9.7 tok/s (output of 136 tok) and 263 tok/s (input of 14564 tok) on vllm-gfx906-mobydick Github link of vllm fork: https://github.com/ai-infos/vllm-gfx906-mobydick Power draw: ~640W (idle) / ~…
Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working (www.reddit.com) I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards. Hardware: 2x RTX 5060 Ti 16GB 32GB total VRAM Proxmox LXC 16 vCPU ~60GB RAM CU…
Qwen3.6 uncensored AWQ (www.reddit.com) I have tested Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q5_K_P.gguf on my 4x3090 system (opencode) and find it really good and fast. However, I can't find any uncensored models for vllm (preferably as AWQ).
TurboQuant on MLX & vLLM (www.reddit.com) MLX https://github.com/Blaizzy/mlx-vlm?tab=readme-ov-file#turboquant-kv-cache vLLM https://github.com/vllm-project/vllm/pull/38479 MLX & vLLM users, please share your experience with benchmarks(t/s). Adding llama.cpp Links related to Turbo…
Qwen 3.6 benchmarks on 2x RTX PRO 6000 (www.reddit.com) Got a chance to play around with 2x RTX PRO 6000 setup so sharing some number for Qwen 3.6. All these were run using latest stable VLLM backend.
For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. (www.reddit.com) I have a 4 x R9700 system on Threadripper pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking any new model I try out with llama-benchy so that I can get a better idea of how models of diff…
Qwen3.6-35B-A3B KLDs - INTs and NVFPs (www.reddit.com) https://preview.redd.it/c76w57d1yexg1.png?width=1482&format=png&auto=webp&s=1164d8bc3e2e8a4157f26dd5583238a736474932 KLD for INTs and NVFP4s. AS ALWAYS - Use Case is important.
RTX PRO 5000 (48GB) vs MacBook Pro M5 MAX (128GB RAM) - The choice for fine-tuning & agentic coding (www.reddit.com)
Show HN: Harbor v0.4.19 – harbor launch –back end vLLM –web codex (github.com via hn) https://github.com/user-attachments/assets/e4897391-c5a8-4391-93c3-9f8b76155f11 Setup your local LLM stack effortlessly. Starts fully configured Open WebUI and Ollama harbor up Now, Open WebUI can do Web RAG and TTS/STT harbor up searxng s…
Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark? (www.reddit.com) I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with: docker run --gpus all \ --name qwen36-aggressive \ --restart unless-stopped \ -p 8000:8000 \ --ipc=host \ --ulimit memlock=-1 \ --ulimit stack=67108864 \ --shm-size=32g…
Using Intel Arc Pro series, any thoughts ? (www.reddit.com) Simple question: Has anyone run two or more of either of these on Ubuntu ? Intel Arc Pro B70 (32 GB) Intel Arc Pro B65 (32 GB) Running llama or vLLM etc., Any thoughts
My own local first ai harness (www.reddit.com) Hi, I just wanted to share what I've been playing with for the last couple of weeks. I built my own AI harness: TinyHarness. My main goal was a low memory footprint; it is not written in TypeScript/JavaScript/Python, leaving as much memory as possible for…
New Qwen3.6 27b Autoround Quant (int4) Best Recipe (www.reddit.com) I've been using the int4 Autoround quant from "Lorbus/Qwen3.6-27B-int4-AutoRound" and it has been pretty good! Great quality and performance on an RTX 5090 vllm.
The GB10 Solution Atlas is now open source, the inference engine made for the community with breakneck inference speeds (Qwen3.6-35B-FP8 100+ tok/s) (www.reddit.com) Some of you saw our post a couple weeks back about hitting 102 tok/s stable on Qwen3.5-35B on a DGX Spark. A lot of you asked "cool, where's the code?" Today's the day: Github Atlas is open source.
Whats the latest status on 7900xtx multi-GPU setups? (www.reddit.com) I am currently running dual RTX 5060 ti 16gb (both of which are easy to sell or re-use in other PCs at home) and monitoring the used market for more of the same and alternatively RTX 3090. I couldn't help but notice that sometimes some qui…
Mixing 3090 with 3080 20G (modded) for vllm (www.reddit.com) Has anyone tried mixing 3090s with 3080 20G for vllm using tensor parallelism? I know vllm normally discourages mixing GPUs, but given how much 3090 is selling nowadays, the modded 20G 3080s with half the price feel like better deals.
Best settings for Qwen 3.6 -27B for 2X3090? (cannot make it to be smarter than Qwen 3.6 35B-A3B! (www.reddit.com) I'm sure people have asked before for settings for these gpu's, but for me, no matter what I do, It doesn't work as good as 3.6 35B! I've tried VLLM and LLAMACPP .
Show HN: Aide – A customizable Android assistant (voice, choose your provider) (aideassistant.com via hn)
Free hands-on lab: build a ReAct agent 3 ways (create_agent, raw LangGraph with tool-call budget, NVIDIA NAT YAML) (www.reddit.com)
current: 1x 16GB 5060Ti. worth a 2nd for OpenCode? (www.reddit.com) my current build is just a 16GB 5060Ti running on a 3800X with 32GB DDR4. not really anything special, but I only really use it right now for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). someo…
Running Gemma4 31b-it on vLLM 0.21.0 A100s (bad quality or what am I doing wrong) (www.reddit.com) Okay fun time I got access to two Nvlinked A100s for some research project I benchmarked my work against the Gemma 4 31b-it available through Google, but my dataset is rather massive, so I need to run it on the "local" resources. Basically…
Output Length Constrained Summarization using GRPO on tiny LLMs | smolcluster (www.reddit.com) Just released a blog on a side research project I have been doing for the past two months and would love for you all to check out and see how it is! It's about output length-constrained summarization using LLMs with GRPO.
Run Chrome’s tiny Gemma4 (aka Gemini Nano) directly on PC without GPU (www.reddit.com) Everyone remembers that sneaky download of Gemini Nano earlier this month? and if you talk to it, it will happily tell you it’s a Gemma.
LLMKube – A Kubernetes operator for local LLMs across Nvidia and Mac fleets (llmkube.com via hn) Run production LLMs on your own hardware A Kubernetes operator for self-hosted LLM inference. vLLM, llama.cpp, TGI, NVIDIA, Apple Silicon.
club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes (www.reddit.com) I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes.
While waiting for Fara-1.5 for my coding harness (www.reddit.com) Hi all, Not sure many people are aware so wanted to give a word about Fara-1.5 release. => this release will likely be the big sister of Fara-7B and built on top of Qwen3.5 Actual Fara-7B performs not bad at all but actually requires a pro…
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com) I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.
Should I sell my RTX3090s? (www.reddit.com) I have a GPU server (4 × RTX3090s) that I've been using for research and PoC in the past 2 years. Mostly running vLLM for Qwen, GPT-OSS, and Gemma.
Load balancer for vLLM server instances? (www.reddit.com) Hello all, the docs for the vLLM production stack suggested autoscaling the vllm worker instances based on the number of waiting requests, but it seems like this would only help with new coming requests? We are having burst LLM calls which…
Will llama.cpp multislot improve speed? (www.reddit.com) I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess comparing to vLLM it might be worse at this, but I recently tried vLLM on 4 slots and it indeed improved the overall speed significantly (150-170t…
What are your most interesting and hard Vision use cases? I plan to do side by side comparison of Gemma 4 (31B) vs Qwen 3.6(27B) Vision and I look for inspiration (www.reddit.com) Hey guys, I built a custom vLLM pipeline to run Gemma 4 (31B FP8) and Qwen 3.5 side-by-side locally to see how they actually perform in the wild with preprocessing of audio and images. But of course new model Qwen 3.6 27B came out just whe…
DeepSeek V4 in vLLM: Efficient Long-Context Attention (vllm-website-pdzeaspbm-inferact-inc.vercel.app via hn) DeepSeek V4 in vLLM: Efficient Long-context Attention We are excited to announce that vLLM now supports the DeepSeek V4 family of models (deepseek-ai/DeepSeek-V4-Pro and deepseek-ai/DeepSeek-V4-Flash ). These models feature an efficient lo…
Gemma 4 vs Qwen 3.5 Vision on vLLM — 5 things I learned benchmarking them side-by-side (Reasoning budgets, FP8, pre-processing the input). (www.reddit.com) Hi guys, I’ve been running side-by-side experiments on Gemma 4 (31B FP8) and Qwen 3.5 Vision for the last few days using vLLM in Docker to see how they actually handle real-world images and video. A few things I found out: 1.
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! (www.reddit.com) So, yesterday run was a success and I did get an avg rollout length of about 64 tokens as attached in the image! This was with quality_reward + length_penalty (more info below!) Next, I'll be going with length penalty as the reward and wit…
DGX Spark just arrived — planning to run vLLM + local models, looking for advice (www.reddit.com) Just got a DGX Spark set up today and starting to configure it for local LLM inference. Plan is to run: • vLLM • PyTorch • Hugging Face models as a local API backend for an application I’m building (education / analytics use case, trying t…
Deep Dive into Efficient LLM Inference with Nano-vLLM (cefboud.com via hn) Deep Dive into Efficient LLM Inference with nano-vLLM: a look inside a lightweight implementation of vLLM. KV cache, paged attention, tensor parallelism & multi-GPU support, etc.
Turboquant in vllm kv cache - how to implement ? (or any other rotational kv cache) (www.reddit.com) Hi folks - is there any "standard" (acceptable) vllm way of implementing turboquant or a similar rotational quant for vllm's kvcache? I found https://github.com/mitkox/vllm-turboquant - but this seems inactive.
Using older vLLM version via Docker -- how do you use GGUF quants? (www.reddit.com) So vLLM recently added the feature to use GGUF quants with the syntax author/model:quant format. I was just wondering if people were able to use the quants on older vLLM versions.
Going local with old GPUs (www.reddit.com) I'm an ex crypto miner with remnant mining parts so I threw them together into a franken hydra case. I've been using Claude OAuth previously, but they just shut that door last week or so.
Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai (vllm.ai via hn) Fast & Efficient LLM Inference with vLLM: A New Course with DeepLearning.AI We're excited to announce, with Red Hat and Andrew Ng's DeepLearning.AI, a hands-on course that walks through LLM fundamentals and the full optimize, deploy, and b…
Local run for multi users: which software set? (www.reddit.com) Context: I am testing and running local LLM on Linux for some months, first with llama.cpp and now with vLLM for better concurrent capabilities. I use llama-swap in front of either vLLM or llama.cpp in order to have thinking and non-thinki…
Looking for a working Deepseek-v4-Flash quant (www.reddit.com) Best I tried so far is https://huggingface.co/nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF with the custom llama.cpp fork, but it suffers from low quality and random incoherent output. VLLM wouldn't support anything other than H100s for DS4.
Looking for Suggestions — Single 5090 & 64gb DDR5 (www.reddit.com) Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…
Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode (www.reddit.com) I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends.
I Built MagesticAI. A Cloud Web-Based Agentic DevOps Orchestrator that actually helped me develop Itself. (www.reddit.com) Posted on other feeds last week and figured some of you out here might be interested as well; Someone commented asking if it supported OpenAI-compatible endpoints (LM Studio, vLLM, OpenRouter, Together, Groq, LocalAI…), so i have spent few…
Best coding model on RTX 3060 (www.reddit.com) Wondering what’s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it?
numind/NuExtract3 · Hugging Face (huggingface.co via reddit) NuExtract3 is a unified 4B vision-language reasoning model for document understanding. It combines strong structured information extraction with high-quality image-to-Markdown conversion, making it suitable for extraction pipelines, OCR, a…
What workstation to get for ~13k EUR? (www.reddit.com) My use-cases will be to test open-weight LLMs and work on harnesses, inference systems and possibly other non-ML workflows (CS-related) in the future. Fine-tuning would not be something I do locally because I can rent a B200 from RunPod fo…
Do smaller quants silently break tool calls / JSON output? (www.reddit.com) I posted recently about EvalShift, an OSS CLI for regression-testing LLM model changes. A few people pointed out that for LocalLLaMA, the more interesting use case may be quantization regression: Q8 -> Q4_K_M Same base model, same prompts,…
5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp (www.reddit.com) Hey all, While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.
Is it possible to exclusively use a draft model for reasoning to speed up generation? (www.reddit.com) EDIT: Edited to provide more clarity It occurred to me, that perhaps the same draft model used for speculative decoding would be completely adequate if we just used it's output as-is for reasoning, without validating the results against th…
TensorRT-LLM vs vLLM vs llama.cpp on NVIDIA DGX Spark? (www.reddit.com) I am looking for recommendations on the best way to run local LLMs on NVIDIA DGX Spark. Which stack makes the most sense in practice: TensorRT-LLM, vLLM, or llama.cpp?
Show HN: Granite Switch - compose multiple LoRA adapters to one deployable model (github.com via hn) Granite Switch is an open-source IBM Research project for composing several task-specific LoRA adapters into a single deployable Granite model checkpoint. The idea is to get the accuracy benefits of multiple fine-tuned models without havin…
How vLLM Works (avkcode.github.io via hn) The vLLM real-world lab models mixed production traffic instead of a single throughput number. FCK generated the request mixes, ran the scheduler and routing profiles, captured build outcomes, and emitted the evidence used for the charts b…
[Help] Running big dense models faster (www.reddit.com) I have been trying Mistral 3.5 on my 4x RTX 3090 rig with llama.cpp. Inference is slow (about 11 t/s) even without anything being offloaded to the CPU.
What’s up with mobile LLMs? (www.reddit.com) I see a lot of support for running LLMs on PCs, with everything from ollama to vLLM. What's the current state for running on mobile?
Ubuntu 26.04 vs 24.04 speed improvements for inference? (www.reddit.com) I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference with VLLM, llama-server, and/or LM Studio.
Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP? (www.reddit.com) I'm a daily llama-cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM.
To run deepseek v4 flash how much max vram we need? 175 gb or 320gb? (www.reddit.com) As far as I know, the weights are 160 GB + 9.6 GB needed for the max 1 million token window + 5 GB overhead = 175 GB VRAM. But vLLM and other sources said "To use the full 1M context, you need 4x A100 80G", which is 320 GB of VRAM?
LLM performance benchmarking update (www.reddit.com)
Please help me pick the right Qwen3.5-27B format/quant for RTX5090 (www.reddit.com) Hi all, first post here. I've started a project in OpenClaw a month ago, and it's been a very "intense" 4 weeks to say the least...
Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? (www.reddit.com) I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.
Show HN: Fleet Watch – preflight guard for local AI inference on Apple Silicon (github.com via hn) Fleet Watch Process governance for AI workloads on a single machine. The Problem You're running MLX, Ollama, vLLM, Candle/Cake, experiment runners, and AI coding agents on the same machine.
Dynamic tool lists vs KV cache: how do you handle this trade-off in LLM agents? (www.reddit.com) I’m working on an LLM agent setup (using Qwen-style chat templates with tool calling), and I ran into a design trade-off that I’d like to get some insights on. In these templates, the full tool definitions (JSON schemas) are injected into…
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script (github.com via hn) NVIDIA DGX Spark GB10 — AI Models & Inference Guide Welcome to my repository and guide for running, optimizing, and benchmarking state-of-the-art AI models on the NVIDIA DGX Spark deskside supercomputer, powered by the cutting-edge NVIDIA…
vLLM: An Efficient Inference Engine for Large Language Models [pdf] (www2.eecs.berkeley.edu via hn) vLLM: An Efficient Inference Engine for Large Language Models by Woosuk Kwon A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the Un…
Show HN: LLMhop – A tiny, stateless router for LLMs with a NixOS module (github.com via hn) LLMhop is a tiny stateless proxy for LLM inference servers. It tackles an issue I faced when trying to serve more than one local LLM at once which is not natively supported by vLLM.
Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference? (www.reddit.com) I was given the great opportunity to borrow a H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system ram I will get, but I guess they are a bit flexible on this).
Sharing INT4-W4A16 version of Jackrong/Qwopus3.6-27B-v2 for VLLM/SGLang users (www.reddit.com) link: https://huggingface.co/JC1DA/Qwopus3.6-27B-v2-INT4-W4A16-Autoround Super surprised how good Jackrong's model is... It's taking so much time to evaluate the all the base qwen3.6-27B, Jackrong's version and other's quantized models but…
$340 opus bill made me rethink how I route agent tool calls (www.reddit.com) Looked at my coding agent's bill last month: $340 for repo maintenance across three repos, each around 15k lines. Most of those tool calls were just grep and file reads.
Cannot get NCCL test to run in docker with 2 x 6000 Pro connected x8 to AM4 CPU (www.reddit.com) nvidia-smi topo -m is showing both GPUs as PHB (i.e. via CPU) connected as expected but I cannot get NCCL all_reduce_perf to run at all, it always hangs after starting up.
40+tok/s - optimized recipe for Qwen 3.5 122B Int4 on a single DGX Spark with vLLM (www.reddit.com) Hello guys, two days ago i ran the spark-arena for my Qwen 3.5 122B Recipe on a single DGX Spark and I got the highest score on speed for any context length and concurrency across all 3.5 122B Int4 Recipes. Just wanted to share if somebody…
Need help getting 7900 XTX PyTorch performance metrics (www.reddit.com) I'm on a quest to profile and benchmark different GPUs for PyTorch, vLLM, and llama.cpp. Cannot find the high-end AMD consumer cards for rent anywhere online and interested in the PyTorch ROCm performance of the 7900 XTX (if you want to co…
Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster (www.reddit.com) I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96G…
KV Cache Is Becoming the Memory Hierarchy of Inference (touchdown-labs.com via hn) A briefing on the inference memory hierarchy: prompt layout, host-side shared KV, distributed lookup, RDMA transfer, encoder reuse, and evidence discipline. Covers vLLM × Mooncake, LMCache MP, LMCache CacheBlend, SGLang, NVIDIA Dynamo, and…
Show HN: Per-request emotion steering for vLLM, with batching preserved (github.com via hn) emotion-steering Extract and serve CAA-style emotion steering vectors for any HuggingFace causal LM, with a fast vLLM path for Qwen3. ┌────────────┐ ┌────────────────────┐ labeled │ extract │ vectors + AUC report │ serve │ contrasts ├─────…
A plug-n-play open-source pruning tool that is workload-aware (www.reddit.com) This project was born out of time I spent digging into a biologically inspired algorithm I was using to measure co-activation for placement of experts and ranks onto chips. The default scheduling that vllm provides can end up causing laten…
Advice needed on eGPU and Mini PC (www.reddit.com) Hi all, I came across a relatively niche problem and could not find many useful posts or guides about it. I have a mini pc (Beelink Ser 8, 8745HS and 32GB 5600 DDR5 SODIMM) headless server for hosting some routing services, and I am wonde…
Show HN: Valkyr LM Inference with Realtime Guarantees (github.com via hn) Valkyr is a fresh take on LM Inference runtimes. It's quite different from llama.cpp, vLLM, or ZINC for example.
How can I locally run Deepseekv4 1.6T? I can use a VPS. (www.reddit.com) I wanted to use vast.ai, but ollama doesnt have it, and when i used vLLM I didn't have success. I genuinely don't know what failed.
Is Mistral-3.5-Medium-128B broken in Llama CPP? (www.reddit.com) Trying some of Bartowski's Q4 quants. Using Vulkan with the latest main branch as of a few hours ago.
5060ti quad-chads - vllm (the reluctant arc) - pp and tg talk (www.reddit.com) Okay, so I have this quad 5060ti setup and for forever I have had people nagging me to try vllm. I thought it was too complicated, like varsity golf or putting on both legs of pants at the same time.
vLLM-Compile: Bringing Compiler Optimizations to LLM Inference (docs.google.com via hn) vLLM-compile: Bringing Compiler Optimizations to LLM Inference Luka Govedič vLLM Committer Senior Machine Learning Engineer, Red Hat 1
3.6 27B Tool Calling Issues (vLLM) (www.reddit.com) Has anyone got a reliable vLLM recipe for 3.6 27B that fixes the tool calling issues? I am getting "Not let me..." - then nothing.
Disaggregated Serving for Hybrid SSM Models in vLLM (vllm-website-lx4pji0mz-inferact-inc.vercel.app via hn) Disaggregated Serving for Hybrid SSM Models in vLLM Introduction Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine th…
which is faster and better for coding? Luce-Org/Dflash or noonghunna/qwen36-27b-single-3090 (www.reddit.com) Anyone have experience with both? Luce is llama.cpp with custom dlflash and noonghunnas project is vllm with patches.
Power-limit vs TG/s for 2x3090 (www.reddit.com) Trying to find the sweet-spot to tradeoff between power and tg/s. 250W seems to be a sweet spot for Qwen3.6-27B.
Guys this is so fun! (www.reddit.com) Running my own models. I was having some trouble getting vLLM going so dropped down to LM Studio which I've used on my 24GB MacBook Air.
your daily driver stack, what's it look like? and why? (www.reddit.com) What it says in the title, I'm interested in hearing what you all have landed on as a workable / useful stack for you. Mine looks like this: back end inference servers - llama.cpp, vLLM | V hermes-agent - cron jobs + OpenAI compatible endp…
Qwen3.6-27B-FP8 - JS file is too long and causing JSON truncation (www.reddit.com) Apologies in advance, if this is a newbie question. When running Qwen3.6-27B-FP8 using the below command on an Nvidia RTX PRO 5000, in opencode, I am seeing errors such as: "The issue is that the JS file is too long and causing JSON trunca…
ASUS Ascent GX10 - Having tons of issues (www.reddit.com) Hi all, Looking for some advice with a GX10 I purchased about 4 months ago. I've been having all kind of issues trying to run local models on this device.
What are your favorite LLMs for translation/document work? (www.reddit.com) I am currently working on a system to translate books/web novels. I got a working prototype, but now I am looking into optimizing it.
Short term access to 4x rtx6000pro... Suggestion on what to try/test? (www.reddit.com) Always been stuck with models that fit on my 16gb .... Going to have about a week for free with 4x rtx6000pro .
Do you have any go-to utility LLM-related tools that are less commonly discussed? (www.reddit.com)
Deploying Gemma 4 26B A4B on a single RTX 5090 — ~196 tok/s with AWQ + vLLM on RunPod Serverless (www.reddit.com)
A Debugging Story: Getting Claude Code to Work with Local vLLM When the Docs Don't (www.reddit.com)
Multi GPU setup help (www.reddit.com) Hi guys I managed to get a multi GPU setup going with a 3090 and three 3060 bringing my vram to 60gb along with 64gb ddr5. The objective is to run the largest coding model I can at a respectable token speed of over 20 tokens / second.
Training Qwen2.5-0.5B-Instruct on Reddit post summarization with GRPO on my 3x Mac Minis — add METEOR as quality reward! (www.reddit.com) Setup: 3x Mac Minis in a cluster running MLX. One node drives training, two push rollouts via vLLM.
I have 4x 128 GB VRAM now , what should i do. (www.reddit.com via reddit)
Fixing single missing quote errors. (www.reddit.com via reddit) Loving the local AI world and been building out my own the last few weeks. One pesky recurring problem is what seems to be related to how the model produces JSON for its responses.
RDNA4 Specific Docker Image vLLM (www.reddit.com via reddit) You bought RDNA4 with the promise of go-fast, and it doesn't deliver in vLLM. I know the feeling, out of the box vllm is a complete dog on RDNA4...
Here are some tips on hitting nearly 200 tok/s for DeepSeek v4 Flash on Hopper (dnhkng.github.io via reddit) I needed a smarter model for my local Hermes Agent setup, so I moved to DeepSeek v4 Flash. First things first: Running 4 concurrent threads on vLLM, I can hit ~400 tok/s 400 x 60 x 60 x 24 x 30 is ~1B TOKENS per month!!!
5070 Ti + 5060 Ti on vLLM hangs on GDN with Qwen3.6 (www.reddit.com via reddit)
[2x3090]: SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available. (www.reddit.com via reddit) Hi all, this is a mere "see what others are doing post" rather than a solution to a problem. As newbie, I put together a 2x3090 box that I run vllm on.
OpenEnv is now owned by HF, Torch, Prime Intellect, Unsloth, Modal, Mercor, and more! Use it for training agents. (www.reddit.com via reddit) OpenEnv is a tool for creating an agentic execution environment like terminals, browsers, or anything an agent can interact with. And today, we’re excited to announce that OpenEnv is becoming even more open, to make the future of training…
vllm-doctor — a CLI tool to diagnose and monitor vLLM inference servers (www.reddit.com via reddit) vllm-doctor reads metrics from a vLLM server's /metrics endpoint or a Prometheus instance and runs rule-based checks to find what is wrong. It detects queue pressure, high TTFT/TPOT, KV cache pressure, and other rules across pods.
Breaking the Ice: Analyzing Cold Start Latency in vLLM (arxiv.org)
how to run gemma-4-12b-it-qat-w4a16-ct in vllm or any version quantized of the model (www.reddit.com via reddit) It runs when using transformers, but with vLLM some weird error comes up. Please, can anybody share the command for running it on vLLM?
club-3090 adds experimental FP8 support for Qwen3.6-27B! (www.reddit.com via reddit) It’s finally here! Something many of us running dual RTX 3090 rigs have been anticipating.
Qwen 3.6 27B on DeepSWE (www.reddit.com via reddit) Overview: It scored 2% (1.79% rounded up) It is 18/20th place scoring above Haiku 4.5 and Minimax M2.7 Full benchmark took 70 hours Average time per task 32m Average output tokens per task: 44k Perspectives: It scored suspiciously similar…
dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model (www.reddit.com) I'm into both HPC and 3D reconstruction, so I built this as a side project. dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda…
Activating MTP for QATGemma4 31b q4_0? (www.reddit.com via reddit) Has anyone figured out how to activate MTP for Gemma4’s new QAT q4_0 GGUF for 31b? Or is this still not supported in llamacpp?
Serving TTS/cloning models on llama.cpp? (www.reddit.com via reddit) Are there any quality voice cloning and speech generation models that already have support in Llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather making a sepa…
Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B (www.reddit.com) I have been coming to this subreddit to understand what the optimal config is to run a model on a given hardware setup. I referred to specific benchmarks, but they are too generic and do not consider the underlying hardware.
$16 refactor, 400 steps, 95% routed to open MoE (www.reddit.com) Got tired of $160 Opus bills so I spent a weekend wiring up a routing layer on vLLM 0.8 (2xA100, enable_auto_tool_choice). Getting the tool call parser to cooperate took longer than the actual routing logic.
For the users who have had bad luck with QWEN 3.6 27B, and Gemma 4 31B. "Actually..wait..actually". Endless reasoning. Horrible output. I found a solution. rtx pro 6000. (www.reddit.com) Edit: does this happen every time a newbie tries to post here? Getting roasted despite having valid results?
Built a self-hosted layer for local agent workflows because retries kept replaying side effects (www.reddit.com) I work on AxonFlow, a source-available (BSL 1.1) runtime for long-running agent workflows. We’ve been running it in front of Ollama-served models and OpenAI-compatible local endpoints (llama.cpp `--server`, vLLM, LM Studio).
I built a native Swift macOS AI client that's invisible to screen sharing — works with Ollama, vLLM, llama.cpp [OC] (www.reddit.com) Built this for myself after wanting to use local LLMs during work calls without the window showing up on screen share. Every existing tool was either cloud-only or a 200MB Electron app.
Is this a crazy idea? (www.reddit.com) I’m running locally with 2 RTX 3099s and 128gb of RAM I run my workflows with Hermes/OWUI and use Comfy for media generation. My inference is with LM Studio.
vLLM + NVFP4 + Qwen3.6 27B: "Checkpoint does not provide a q scaling factor"? (www.reddit.com) I have been trying various NVFP4 based variations of Qwen 3.6 27B, and I am seeing this for the ones that look most interesting to run on my 2x 16GB VRAM with KV cache fp8. vllm | (Worker_TP0 pid=136) WARNING 05-09 13:49:27 [kv_cache.py:10…
Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK (www.reddit.com) TL;DR On 4× RTX 3090 with NVLink bonded between GPU pairs (0↔2 and 1↔3), pinning TP=2 to a NVLinked pair gave +25% throughput at concurrency 1 and +53% at concurrency 4 vs running TP=2 over PCIe. Adding the other two GPUs to make it TP=4 m…
I built an episodic, 2-tier memory for long-running local AI agents - temporal contradiction detection, fiction/roleplay filter, no vector DB required. (www.reddit.com) I've been running a persistent local agent for about 2 months - hundreds of sessions, mix of local models (llama.cpp/vLLM/lmstudio) and paid (Claude). One of the things that has been driving me nuts with OpenClaw and Hermes is the way memo…
vLLM V0 to V1: Correctness Before Corrections in RL (huggingface.co) vLLM V0 to V1: Correctness Before Corrections in RL TL;DR. vLLM V1 matched our vLLM V0 reference after we fixed four things: processed rollout logprobs, V1-specific runtime defaults, the inflight weight-update path, and the fp32 lm_head us…
Help with GPT-OSS-120B on vLLM (www.reddit.com) Hiya, today I was trying to get a response from GPT-OSS-120B via vLLM - and failed miserably! Has anybody gotten it to work, i.e.
Getting unexpected output with Gemma 4 31b-it on vLLM (www.reddit.com) Hey everyone, I'm running into a weird issue and hoping someone here might have a fix or some troubleshooting ideas. I'm currently trying to run the new Gemma 4 31b-it model using vLLM (v0.20.0-cu130) deployed via Helm chart (https://gith…
Gemma 4 31B MTP Drafter on H100 -- Real Benchmarks + DFlash Comparison (www.reddit.com) Just tested Gemma 4 31B with the new official MTP Drafter on my H100 today and compared the approach with DFlash to help you decide which one to use. Without drafter: 13.7 tok/s.
Sglang is better for serving a model for a personal agent harness? (www.reddit.com) If one has enough vram, would Sglang be a superior choice than vLLM or llamacpp in terms of inference speed for serving a model dedicated to powering a personal (single user) agent harness like Hermes agent? Sglang has MTP for speculative…
Does running a model (like qwen3.6-27b) on vllm or transformers use less VRAM than llama.cpp? (www.reddit.com) I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…
Anyone running HUANANZHI H12D-8D + BMC with 4x RTX 3090 for LLM inference? (www.reddit.com) Hi everyone, I'm considering building a home LLM inference rig around: - HUANANZHI H12D-8D + BMC - AMD EPYC 7002/7003 - 4x RTX 3090 24GB - DDR4 ECC RDIMM, 8-channel - Linux + vLLM / SGLang / llama.cpp - Open frame, PCIe 4.0 x16 risers The…
Requesting advice on local AI setup for academic use (www.reddit.com) I'm about to do a clean install of Ubuntu 26.04 on a desktop that has a 5060ti 16gb and a 4060ti 16gb. Can you help me work out the best local AI setup for my use cases?
Need advice on Qwen 3.6 27B INT4 quantization (www.reddit.com) Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.
Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real. (www.reddit.com) Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants)…
DeepSeek V4 Flash as a cheap worker in your LLM stack: $0.0003/call via MCP, swappable endpoint (www.reddit.com) Most of my LLM cost was on the wrong tier of work. Classification, extraction, JSON formatting, summarization I'm going to review anyway.
Best RTX Pro 6000 vllm settings? (www.reddit.com) Just got myself (for my company) a RTX Pro 6000 Blackwell Workstation card. Managed to get really good TPS on qwen3 27b fp8.
thinking of gemma 4 26B vs 31B (www.reddit.com) I see a big difference in agentic coding between gemma-4-31B-it-Q5_K_M and gemma-4-26B-A4B-it-UD-Q8_K_XL. The 26B model is much faster because of A4B and generally works well, but there is a big difference in thinking.
Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer (www.reddit.com) Reasoning Guard: Stopping LLM Thinking Loops at the Proxy Layer I’ve been running Qwen3.6 MoE behind a vLLM proxy and hit a specific reliability issue: occasional runaway reasoning loops. This isn’t a criticism of Qwen3.6.
Only 120 tps on Qwen 35b on h200 (www.reddit.com) Just a sanity check, this is too slow and something is wrong, right? Like, this is setup with mtp, vllm with awq quants, I suspect that I did configure something wrongly.
locally uncensored v2.4.2 - chat, coding agent, image + video generation in one local app. plus remote access from your phone. one-click install (www.reddit.com) locally uncensored is a desktop app that combines four things most people run separately: chat, a coding agent, image generation, and video generation. all local, all on your hardware, no docker, no cloud account needed.
How do you actually use Qwen3 72B Instruct locally? (www.reddit.com) I just got Qwen3 72B Instruct running on a high RAM setup and I’m kinda confused about the proper way to use it. What’s the correct workflow for running it smoothly (like best quant, tools, or runtime)?
Free book on building AI agent harnesses — 22 chapters, Python harness, written by AI (www.reddit.com) Claude Code drafted the prose. I did the research, direction, architecture, ran the code, caught the bugs, and reviewed every commit.
Is there an alternative between vLLM and Ollama that handles token prefill? (Arc Pro B70) (www.reddit.com) I am using an Arc Pro B70 to do inference, and it's token generation speed is fine using Ollama, but it takes *forever* to do a prefill. vLLM absolutely tackles the prefill problem (nearly instant responses), but I can't run nearly as larg…
Brand new dual 3090 PC - what should I install first for the best local agentic coding experience? (www.reddit.com)
Qwen3-30B-A3B-Instruct-2507 is better than the new Qwen 3.6 for our tasks (www.reddit.com)
made a desktop app that puts ollama, comfyui and coding into one window (www.reddit.com) been using local AI for a while now but my workflow was a mess. ollama for chat, comfyui for images, different tools for video and coding.
M1 Pro 16GB users: what local LLM configs are actually usable day to day? (www.reddit.com) I'm trying to get past generic "best model" recommendations and collect real-world configs from people on similar hardware. My setup: MacBook M1 Pro, 10-core CPU, 14-core GPU, 16 GB unified memory.
DGX Spark users: What's the easiest way to do multi-node vLLM clustering with a browser UI and training? (www.reddit.com) Hey r/LocalLLaMA, I've been running a small 4-node DGX Spark cluster on a 400µT fabric switch and got frustrated with the usual raw Ray/vLLM scripts and EXO basically ignoring pure NVIDIA paths. I started from the solid foundation in [eugr…
gemma4 e2b or e4b on rtx 5070 ti laptop 12GB not running on vLLM (www.reddit.com) I can't get gemma 4 e2b or gemma 4 e4b to run on my laptop. I am running it via docker as per vllm website and I get the error : Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.3…
Lower inference speed of Gemma4 26BA4B on vllm. (www.reddit.com) For my earlier use case I used to host qwen 2.5 vl 7b gptq int4. Now I was looking to switch to Gemma4 26B A4B, as it would improve performance as well as improve latency considering only 4B parameters are active..
Hardware needed for Gemma 26B MoE vs Qwen 14B for ~100–300 users (vLLM, single node?) (www.reddit.com) I'm trying to figure out what sort of hardware setup I will need to accommodate a userbase of 100 users (not necessarily concurrent). Does anyone have any idea what sort of setup I'd be looking at?
What is the best way to deploy LLM on 3x3090? (www.reddit.com) Two questions: which model? In my mind, Qwen3.5 27b or Gemma 4 31b are top options.
Optimizing a WSL2-based Local AI Orchestration for Product Viz | RTX 3090 24GB VRAM & i7-14700KF (www.reddit.com) Hi everyone, I’m building a local AI pipeline on WSL2 (Ubuntu) specifically for Product Visualization. My goal is to orchestrate LLMs for scene generation and Stable Diffusion/ComfyUI for high-fidelity rendering, keeping my Windows host cl…
No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL (huggingface.co)
Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference (huggingface.co)