Qwen3.6-35B-A3B (huggingface.co) Note: This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format. These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransf…
#vllm
21 items
Alibaba open-sources Qwen3.6-35B-A3B, a 35B MoE model with 3B active parameters (huggingface.co via hn)

Reproduction of TurboQuant (www.reddit.com via reddit) There have been many TurboQuant implementations recently in llama.cpp, mlx, vllm, and sglang, but much of the discussion and code around them is noisy and looks AI-generated. I'm trying to understand which claims from the…
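For context on what a "rotational" quantization scheme like TurboQuant typically does: the common trick (seen in QuaRot-style methods) is to apply a random orthogonal rotation before low-bit quantization, so outlier channels get spread across dimensions instead of blowing up the per-row scale. Below is a minimal PyTorch sketch of rotate-then-int4 round-tripping; it is an illustrative stand-in, not TurboQuant's actual algorithm, and the Hadamard-free QR rotation and per-row symmetric int4 scheme are my assumptions:

```python
import torch

def random_orthogonal(d: int, seed: int = 0) -> torch.Tensor:
    # QR of a Gaussian matrix gives a random orthogonal rotation.
    g = torch.Generator().manual_seed(seed)
    a = torch.randn(d, d, generator=g)
    q, r = torch.linalg.qr(a)
    # Sign fix so the result is uniform over the orthogonal group.
    return q * torch.sign(torch.diagonal(r))

def quantize_int4(x: torch.Tensor):
    # Symmetric per-row 4-bit quantization: scale into [-7, 7] and round.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(x / scale), -7, 7)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

d = 128
rot = random_orthogonal(d)
# Fake KV rows with a few large outlier channels, the usual failure mode.
kv = torch.randn(32, d) * torch.tensor([10.0 if i % 37 == 0 else 1.0 for i in range(d)])

# Rotate, quantize, dequantize, rotate back (rotation is orthogonal, so it folds out exactly).
q, s = quantize_int4(kv @ rot)
kv_hat = dequantize(q, s) @ rot.T

plain_q, plain_s = quantize_int4(kv)
print("rotated MSE:", (kv - kv_hat).pow(2).mean().item())
print("plain   MSE:", (kv - dequantize(plain_q, plain_s)).pow(2).mean().item())
```

The MSE comparison is the whole point: with outliers concentrated in a few channels, plain per-row int4 wastes its range on them, while the rotated version spreads that energy and quantizes with visibly lower error.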
Current: 1x 16GB 5060Ti, worth a 2nd for OpenCode? (www.reddit.com via reddit) My current build is just a 16GB 5060Ti running on a 3800X with 32GB DDR4. Not really anything special, but right now I only use it for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). Someo…
Trained a Qwen2.5-0.5B-Instruct bf16 model on a Reddit post summarization task with GRPO written from scratch in PyTorch - updates! (www.reddit.com via reddit)

DGX Spark just arrived — planning to run vLLM + local models, looking for advice (www.reddit.com via reddit) Just got a DGX Spark set up today and am starting to configure it for local LLM inference. The plan is to run vLLM, PyTorch, and Hugging Face models as a local API backend for an application I'm building (education / analytics use case, trying t…
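On the GRPO-from-scratch post above: the algorithm's core step is group-relative advantage estimation. You sample several completions per prompt, then normalize each completion's reward against its own group's mean and standard deviation. A minimal PyTorch sketch of just that step (names and shapes are illustrative, not the poster's code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size), one scalar reward per sampled completion.
    Returns group-relative advantages with the same shape."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled summaries each; reward could be ROUGE or a preference score.
rewards = torch.tensor([[0.10, 0.40, 0.35, 0.15],
                        [0.80, 0.70, 0.90, 0.60]])
adv = grpo_advantages(rewards)
print(adv)  # positive for above-group-average completions, negative below

# The policy-gradient loss then weights each completion's token log-probs by its
# advantage (plus PPO-style clipping and a KL penalty in the full algorithm).
```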
Deep Dive into Efficient LLM Inference with Nano-vLLM (cefboud.com via hn) A look inside a lightweight implementation of vLLM: KV cache, paged attention, tensor parallelism & multi-GPU support, etc.
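The bookkeeping behind paged attention is essentially a per-sequence block table mapping logical KV positions to fixed-size physical blocks, so sequences don't need contiguous cache memory. A toy Python sketch of that allocator (illustrative, not nano-vLLM's actual code; the block size and class shape are my assumptions):

```python
BLOCK_SIZE = 16  # tokens per physical KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}               # seq_id -> [physical block ids]
        self.lengths = {}                    # seq_id -> tokens written so far

    def append_token(self, seq_id: int):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current block full (or first token)
            table.append(self.free.pop())    # grab a fresh physical block
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id: int, pos: int):
        # Logical position -> (physical block, offset), like a page-table lookup.
        return self.block_tables[seq_id][pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_blocks=64)
for _ in range(20):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])       # physical blocks backing seq 0 (need not be contiguous)
print(cache.physical_slot(0, 17))  # token 17 lives in the second block at offset 1
```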
TurboQuant in the vLLM KV cache: how to implement it? (or any other rotational KV cache) (www.reddit.com via reddit)

Using an older vLLM version via Docker -- how do you use GGUF quants? (www.reddit.com via reddit) vLLM recently added the feature to use GGUF quants with the author/model:quant syntax. I was just wondering whether people have been able to use these quants on older vLLM versions.
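For reference, the route that predates the author/model:quant shorthand the post mentions is vLLM's documented GGUF path: point the engine at a local .gguf file and supply the base model's tokenizer. A sketch (the file path and tokenizer repo below are placeholders):

```python
from vllm import LLM, SamplingParams

# Older vLLM versions load GGUF from a local file; the tokenizer should come
# from the original (unquantized) repo, since GGUF tokenizer conversion is limited.
llm = LLM(
    model="/models/qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder local path
    tokenizer="Qwen/Qwen2.5-7B-Instruct",             # base model's tokenizer
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```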
Going local with old GPUs (www.reddit.com via reddit)

Please help me pick the right Qwen3.5-27B format/quant for an RTX 5090 (www.reddit.com via reddit) Hi all, first post here. I started a project in OpenClaw a month ago, and it's been a very "intense" 4 weeks to say the least...
Anybody got Qwen3.5-27B working with an Intel Arc B70 (or similar) with proper optimization? (www.reddit.com via reddit) I am playing around with an Intel Arc B70, still trying to decide whether to keep it. After some battle, I got it working with a Radeon 5500 and a B550M, and now I am on to the fun part of getting the software to work.
Dynamic tool lists vs KV cache: how do you handle this trade-off in LLM agents? (www.reddit.com via reddit)

DGX Spark (www.reddit.com via reddit)

MacBook vs Strix Halo (www.reddit.com via reddit)

gemma-4-31B-it thinking? (www.reddit.com via reddit) I can't get my model to think. According to the documentation, thinking should be triggered by starting the system prompt with a '<|think|>' string.
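If the model's docs really do key thinking off a '<|think|>' prefix in the system prompt (I can't verify that tag; it's taken from the post), the check is easy to script against any local OpenAI-compatible server such as vLLM's:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible endpoint, e.g. one started with `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="gemma-4-31B-it",  # placeholder model id
    messages=[
        # Per the post's reading of the docs, this prefix should trigger thinking.
        {"role": "system", "content": "<|think|>You are a careful assistant."},
        {"role": "user", "content": "What is 17 * 23?"},
    ],
)
print(resp.choices[0].message.content)
```

If the raw output shows no thinking trace, the next things to rule out are a chat template that strips unknown special tokens and a server-side flag for reasoning content.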
DGX Spark users: What's the easiest way to do multi-node vLLM clustering with a browser UI and training? (www.reddit.com via reddit) Hey r/LocalLLaMA, I've been running a small 4-node DGX Spark cluster on a 400µT fabric switch and got frustrated with the usual raw Ray/vLLM scripts, and with EXO basically ignoring pure NVIDIA paths. I started from the solid foundation in [eugr…
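For anyone wondering what the "raw Ray/vLLM scripts" boil down to: multi-node vLLM is usually a Ray cluster plus a parallelism degree spanning the nodes. A minimal sketch, assuming 4 nodes with 1 GPU each and a placeholder model id:

```python
# On the head node first:    ray start --head --port=6379
# On each worker node:       ray start --address=<head-ip>:6379
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.5-27B",            # placeholder model id
    tensor_parallel_size=4,              # shard across all 4 GPUs in the cluster
    distributed_executor_backend="ray",  # attach to the running Ray cluster
)
print(llm.generate(["ping"])[0].outputs[0].text)
```

Tensor parallelism across nodes is bandwidth-hungry, which is presumably why the fabric switch matters; pipeline parallelism across nodes is the usual alternative when the interconnect is the bottleneck.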
Gemma 4 E2B or E4B on an RTX 5070 Ti laptop (12GB) not running on vLLM (www.reddit.com via reddit) I can't get Gemma 4 E2B or Gemma 4 E4B to run on my laptop. I am running it via Docker as per the vLLM website and I get the error: Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.3…
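That startup error means vLLM's default gpu_memory_utilization of 0.9 wants about 10.3 GiB (0.9 x 11.5 GiB), but only 9.71 GiB is free because the desktop and driver hold the rest. The usual fixes are lowering the utilization target and capping the context length to shrink the KV cache. A sketch, with values that are guesses for a 12GB card and a placeholder model id:

```python
from vllm import LLM

llm = LLM(
    model="google/gemma-4-e4b",   # placeholder id from the post
    gpu_memory_utilization=0.80,  # request ~9.2 GiB instead of 10.3 GiB
    max_model_len=8192,           # smaller context -> smaller KV cache reservation
)
```

Via Docker, the same knobs are the --gpu-memory-utilization and --max-model-len flags appended to the container's serve command.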
Lower inference speed of Gemma4 26B A4B on vLLM (www.reddit.com via reddit) For my earlier use case I used to host Qwen2.5-VL-7B GPTQ INT4. Now I am looking to switch to Gemma4 26B A4B, as it should improve performance as well as latency, considering only 4B parameters are active..
Hardware needed for Gemma 26B MoE vs Qwen 14B for ~100–300 users (vLLM, single node?) (www.reddit.com via reddit)

What is the best way to deploy an LLM on 3x 3090s? (www.reddit.com via reddit)

Optimizing a WSL2-based Local AI Orchestration for Product Viz | RTX 3090 24GB VRAM & i7-14700KF (www.reddit.com via reddit)