#vllm

21 items

Alibaba open-sources Qwen3.6-35B-A3B, a 35B MoE model with 3B active parameters (huggingface.co via hn) 8 pts· 8h

Qwen3.6-35B-A3B. Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…

↯ Qwen 3.6 moe vllm
Reproduction of TurboQuant (www.reddit.com via reddit) 7 pts·5 replies· 9h

There have been many TurboQuant implementations recently in llama.cpp, mlx, vllm, and sglang, but a lot of the discussion and code around them feels pretty noisy and looks to be AI-generated. I’m trying to understand which claims from the…

vllm llama
current: 1x 16GB 5060Ti. worth a 2nd for OpenCode? (www.reddit.com via reddit) 4 pts·9 replies· 3d

my current build is just a 16GB 5060Ti running on a 3800X with 32GB DDR4. not really anything special, but I only really use it right now for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). someo…

↯ Qwen 3.5 vllm llama
Trained a Qwen2.5-0.5B-Instruct bf16 model on Reddit post summarization task with GRPO written from scratch in PyTorch - updates! (www.reddit.com via reddit) 3 pts·2 replies· 1d

vllm
DGX Spark just arrived — planning to run vLLM + local models, looking for advice (www.reddit.com via reddit) 3 pts·5 replies· 1d

Just got a DGX Spark set up today and starting to configure it for local LLM inference. Plan is to run: • vLLM • PyTorch • Hugging Face models as a local API backend for an application I’m building (education / analytics use case, trying t…

vllm
Deep Dive into Efficient LLM Inference with Nano-vLLM (cefboud.com via hn) 3 pts· 1d

A look inside a lightweight implementation of vLLM: KV cache, paged attention, tensor parallelism & multi-GPU support, etc.

vllm
TurboQuant in vLLM KV cache - how to implement? (or any other rotational KV cache) (www.reddit.com via reddit) 3 pts·7 replies· 2d

vllm
Using older vLLM version via Docker -- how do you use GGUF quants? (www.reddit.com via reddit) 3 pts·1 replies· 2d

So vLLM recently added the feature to use GGUF quants with the syntax author/model:quant format. I was just wondering if people were able to use the quants on older vLLM versions.

vllm
Going local with old GPUs (www.reddit.com via reddit) 3 pts·11 replies· 3d

vllm ollama claude
Please help me pick the right Qwen3.5-27B format/quant for RTX5090 (www.reddit.com via reddit) 2 pts·1 replies· 1d

Hi all, first post here. I started a project in OpenClaw a month ago, and it's been a very "intense" 4 weeks to say the least...

↯ Qwen 3.5 vllm sonnet openclaw+2
Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? (www.reddit.com via reddit) 2 pts·15 replies· 1d

I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.

↯ Qwen 3.5 vllm llama
Dynamic tool lists vs KV cache: how do you handle this trade-off in LLM agents? (www.reddit.com via reddit) 2 pts·8 replies· 2d

vllm qwen mcp
DGX Spark (www.reddit.com via reddit) 1 pt·5 replies· 2d

↯ Qwen 3.5 vllm qwen llama+1
MacBook vs Strix Halo (www.reddit.com via reddit) 1 pt·21 replies· 2d

↯ Gemma 4 moe vllm agentic
gemma-4-31B-it thinking? (www.reddit.com via reddit) 2 replies· 8h

I can't get my model to think. According to the documentation, thinking should be triggered by starting the system prompt with a '<|think|>' string.

↯ Gemma 4 vllm gemma
DGX Spark users: What's the easiest way to do multi-node vLLM clustering with a browser UI and training? (www.reddit.com via reddit) 3 replies· 10h

Hey r/LocalLLaMA, I've been running a small 4-node DGX Spark cluster on a 400µT fabric switch and got frustrated with the usual raw Ray/vLLM scripts and EXO basically ignoring pure NVIDIA paths. I started from the solid foundation in [eugr…

fine-tuning vllm openai
gemma4 e2b or e4b on rtx 5070 ti laptop 12GB not running on vLLM (www.reddit.com via reddit) 3 replies· 12h

I can't get Gemma 4 e2b or Gemma 4 e4b to run on my laptop. I am running it via Docker as per the vLLM website and I get the error: Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.3…

↯ Gemma 4 vllm gemma
Lower inference speed of Gemma4 26BA4B on vllm. (www.reddit.com via reddit) 8 replies· 13h

For my earlier use case I hosted Qwen 2.5 VL 7B GPTQ INT4. Now I am looking to switch to Gemma4 26B A4B, as it should improve performance as well as latency, considering only 4B parameters are active.

↯ Qwen 2.5 vllm qwen
Hardware needed for Gemma 26B MoE vs Qwen 14B for ~100–300 users (vLLM, single node?) (www.reddit.com via reddit) 16 replies· 2d

↯ Qwen 2.5 moe vllm qwen+1
What is the best way to deploy LLM on 3x3090? (www.reddit.com via reddit) 13 replies· 2d

↯ Qwen 3.5 vllm llama gemma
Optimizing a WSL2-based Local AI Orchestration for Product Viz | RTX 3090 24GB VRAM & i7-14700KF (www.reddit.com via reddit) 7 replies· 3d

vllm ollama
