#llama

753 items

The local LLM ecosystem doesn’t need Ollama (sleepingrobots.com via hn) +565181 7w

Friends Don't Let Friends Use Ollama. Ollama gained traction by being the first easy llama.cpp wrapper, then spent years dodging attribution, misleading users, and pivoting to cloud, all while riding VC money earned on someone else's engine…

ollama llama
Gemma4 26b & E4B are crazy good, and replaced Qwen for me! (www.reddit.com) +392100 7w

My pre-gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (my third 3090 died, RIP) and 128gb of system memory. Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…

↯ Qwen 3.5 gemma qwen llama+1
Qwen3.6 GGUF Benchmarks (www.reddit.com) +28564 7w

Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier.

↯ Qwen 3.6 gemma llama
Local manga translator with LLM built-in, written in Rust with llama.cpp integration (www.reddit.com) +22671 6w

Hi LocalLLaMA, I created a post a few weeks ago, but this time this project has become more reliable and easier to use. This is a manga translator that can also be used to translate any image.

↯ Qwen 3.5 gemma llama
That's a good news... (www.reddit.com) +22469 3w

Looks like it finally happens... MTP getting approved for llama.cpp.

llama
Open WebUI Desktop Released! (github.com via reddit) +22381 7w

llama
llama.cpp speculative checkpointing was merged (www.reddit.com) +19057 7w

llama
Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions. (www.reddit.com) +17332 3w

Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention.

↯ Gemma 4 gemma llama
llama.cpp is the linux of llm (www.reddit.com) +16775 7w

llama
The Financial Times has published an article about Heretic (www.reddit.com) +15827 2w

https://www.ft.com/content/5630ed79-a263-41ed-9a1a-321617ae310e “The FT was able to use Heretic, a tool available on the popular code repository GitHub, to remove the guardrails from Meta’s Llama 3.3 model in less than 10 minutes without a…

llama
Qwen3.6-35B-A3B - even in VRAM limited scenarios it can be better to use bigger quants than you'd expect! (www.reddit.com) +15746 6w

So maybe this is a no-brainer to many experienced local LLM users but it was not obvious for me. I am running a 3070 8gb + 64gb DDR4.

↯ Qwen 3.6 llama
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints (www.reddit.com) +14938 4w

WARNING: wait before downloading from HF: I just realised my upload of the new versions with the additional fix in the chat template has not completed yet. I will remove this warning once done. The recent PR to llama.cpp brings MTP support to Q…

↯ Qwen 3.6 qwen llama agentic+2
TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). (www.reddit.com) +13641 3w

Hi all, I have been making a lot of updates to my project, and I wanted to share them here. TextGen (previously text-generation-webui, also known as my username oobabooga or ooba) has been in development since December 2022, before LLaMa a…

llama
These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade (www.reddit.com) +11168 7w

Time and time again I find posts about these fine tunes that promise increased intelligence and reasoning with base models, and I continuously try them, realize they're botched, and delete them shortly after. I sometimes do resort to a low…

↯ Claude 4.6 qwen llama opus
Stop wasting electricity (www.reddit.com) +10841 4w

Run on my RTX 4090. llama.cpp params: llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144. Power limit was set using: sudo nvidia-smi -pl N. From my observation, GPU const…

↯ Qwen 3.6 llama
Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) (www.reddit.com) +10548 3w

TL;DR: best setup I tested on an RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf, 156k context, q8_0/q8_0 KV, MTP, vision on CPU. Benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode. llama.cpp…

↯ Qwen 3.6 vllm qwen llama
The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B) (www.reddit.com) +10051 8w

This is V2 of my previous post. What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.

↯ Qwen 3.5 gemma llama
Gemma 4 Vision (www.reddit.com) +9734 7w

↯ Gemma 4 gemma llama
Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com) +9035 6w

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval (code generation), HellaSwag (commonsense reasoning), BFCL (function calling). Total samples: HumanE…

↯ Qwen 3.6 ↯ Function Calling humaneval function-calling qwen+1
KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche (www.reddit.com) +8763 13d

Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantization affects…

↯ Qwen 3.6 qwen llama
What is the current status with Turbo Quant? (www.reddit.com) +8747 7w

It was hyped about 2 weeks ago, and I remember seeing some pull requests into llama.cpp, but what is the current status now that the hype has faded?

llama
Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s (www.reddit.com) +8635 7w

Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to s…

↯ Qwen 3.5 moe llama mcp
what’s actually stopping an insider from leaking model weights? (www.reddit.com) +8398 7w

this is a dumb question. what are the actual technical barriers stopping an engineer at a place like openai or anthropic from just exporting flagship weights and leaking them?

llama anthropic openai
Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) (www.reddit.com) +7138 6w

I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…

↯ Qwen 3.6 moe qwen llama
Talkie: a 13B LLM trained only on pre-1931 text used Claude Sonnet to help test the model and judge its output (www.reddit.com) +7013 6w

Researchers Alec Radford (GPT, CLIP, Whisper), Nick Levine, and David Duvenaud just released talkie: a 13 billion parameter language model trained exclusively on text published before 1931. No internet.

↯ Sonnet 4.6 sonnet llama gemini
Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) (www.reddit.com) +7036 6w

I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it's been released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that K cache compression is not really recommended in most case…

↯ Qwen 3.6 llama
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp (www.reddit.com) +6724 2w

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP.

↯ Qwen 3.6 llama
Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B (www.reddit.com) +6742 7w

I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma. Without embedding parameters: Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% sm…

↯ Gemma 4 gemma llama
Llama.cpp's auto fit works much better than I expected (www.reddit.com) +6435 7w

↯ Qwen 3.6 llama
LiquidAI/LFM2.5-8B-A1B · Hugging Face (huggingface.co via reddit) +4915 12d

looks like you can run it on any potato (A1B)! https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF from LiquidAI: LFM2.5 is a new family of hybrid models designed for on-device deployment.

vllm moe llama+1
Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp (www.reddit.com) +4923 7w

Note: first is Qwen3.5 35B MoE (left) and second is Qwen3.6 (right). Hi guys, just did a quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE, with reasoning off, using llama.cpp and the same quant (unsloth 4 K_XL GGUF). First is Qwen3.5 outco…

↯ Qwen 3.6 moe qwen llama
Qwen 35B-A3B is very usable with 12GB of VRAM (www.reddit.com) +4713 4w

Hardware: RTX 3060 12GB, 32GB DDR4-3200, Windows, CUDA 13.x. Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf. The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.

↯ Qwen 3.6 moe qwen llama
Quantizing MTP KV Cache = free lunch? (www.reddit.com) +4630 3w

With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized: -cache-type-k-draft q8_0 -cache…

↯ Qwen 3.6 llama
Markdown browser for LLMs (www.reddit.com) +4522 4w

I built a markdown web renderer for AI agents. Instead of taking expensive screenshots and piping them through vision models, TextWeb renders web pages as markdown that LLMs can reason about natively.

llama mcp
Get faster qwen 3.6 27b (www.reddit.com) +4411 4w

Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp Thought I would knowledge share Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF And am17an commit /media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/ll…

↯ Qwen 3.6 qwen llama
LM Studio finally added support for MTP Speculative Decoding (www.reddit.com) +437 2w

Update to 0.4.14 Build 2 (Beta) and make sure your llama.cpp engine is 2.15.0…

llama
Unsloth solved bug in Mistral Medium 3.5 implementation (www.reddit.com) +436 5w

https://unsloth.ai/docs/models/mistral-3.5 "May 1, 2026 Update: We worked with Mistral to fix Mistral Medium 3.5 inference affecting some implementations, and released updated GGUFs with the fix (NOT related to Unsloth or our quants). The…

↯ Mistral ↯ Mistral 3.5 mistral llama
b9180 llama.cpp MTP landed (www.reddit.com) +4226 3w

All across the land, many monitors showing green cmake with giddy anticipation. Tip your bartender! https://github.com/ggml-org/llama.cpp/releases/tag/b9180

llama
Info: Nvidia Cuda 13.3 landed (www.reddit.com) +4010 13d

CUDA 13.3: Downloads, Release Notes. Has anybody already tried llama.cpp with 13.3?

llama
Qwen 3.6 27b IQ4_XS - 22 tp/s on RTX 5060TI 16b, 24k ctx (www.reddit.com) +3920 6w

Maybe this will be helpful for someone: llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000. Can't run this model with higher KV quants at >8192 ctx size. -ub & -b set to 256 allowed me for…

↯ Qwen 3.6 qwen llama
I have DeepSeek V4 Pro at home (www.reddit.com) +3827 4w

Just wanted to share that I used u/LegacyRemaster slightly modified (Q4_K_M conversion support) DeepSeek V4 CUDA repo (based on u/antirez work) to convert and run Q4_K_M DeepSeek V4 Pro on my Epyc workstation (Genoa 9374F, 12 x 96GB RAM, s…

deepseek llama
Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090 (www.reddit.com) +3842 4w

So I've been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use. So after a day of vibecoding I think I may have gotten something viable.

↯ Qwen 3.6 deepseek llama
Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip). (www.reddit.com) +3829 6w

TL;DR - Remembered FPGA PCI boards being a big thing from my crypto days. Wondered if an AMD Alveo V80 FPGA card could be used to approximate the performance of a Taalas HC1 (LLM-on-a-chip).

↯ Llama 3.1 llama gemini
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help (www.reddit.com) +3736 2w

MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round.

↯ Qwen 3.6 moe llama
I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls (www.reddit.com) +3425 4w

I've been running structured output prompts through a bunch of models on OpenRouter for the past few months — Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter — alongside the usual closed-source suspects. 28…

↯ Mistral mistral deepseek qwen+1
vLLM ROCm has been added to Lemonade as an experimental backend (www.reddit.com) +339 4w

vLLM has the ability to run .safetensors LLMs before they are converted to GGUF and represents a new engine to explore. I personally had never tried it out until u/krishna2910-amd, u/mikkoph and u/sa1sr1 made it as easy as running llama.cp…

vllm llama
Heretic has been served a legal notice by Meta, Inc. (www.reddit.com) +317 2w

To Whomsoever it May Concern, The individual behind the Heretic Free Software Project (henceforth called "Heretic", notwithstanding unrelated entities of the same name) has been served a notice by a legal services provider representing Met…

llama
PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server (www.reddit.com) +306 4w

Hey folks, just a heads-up for anyone running Qwen3.6 through llama-server. I ran into an issue where the preserve_thinking parameter wasn't working as expected, even though I had it explicitly enabled in my models.ini config.

↯ Qwen 3.6 llama
FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 (www.reddit.com) +305 5w

Last year researchers affiliated with NVIDIA, University of Warsaw, and University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique using learned per-head token eviction, reporting up to 8x KV…

vllm llama
Dual dgx spark (Asus GX10) MiniMax M2.7 results (www.reddit.com) +3018 7w

minimax llama
Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R9700 + 7800XT (www.reddit.com) +2920 2w

Setup: Kubuntu 24.04 - AMD cards - R9700 AI PRO and 7800xt (32gb + 16gb) - llama-cpp server - stack setup in docker - vulkan image. I tried with ROCm but it wouldn't play nice with the RDNA4 + RDNA3 mix. Vulkan seems to work.

llama
Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct (www.reddit.com) +298 3w

Ok, hear me out. This all started when I was trying to understand why this Qwen3.6 27B INT8 Autoround (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B q…

↯ Qwen 3.6 vllm llama
If you've been waiting to try local AI development, please try it (www.reddit.com) +2821 5w

I have snobbishly long felt that the local models were not 'up to my standards' for local development, or otherwise able to compete with GHCP, Claude Code, Cursor etc. Boy was I wrong.

↯ Qwen 3.6 llama cursor claude-code
Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM (www.reddit.com) +277 2w

Hi everyone, I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp.

↯ Qwen 3.6 qwen llama
Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp (www.reddit.com) +2714 5w

Hey everyone, I’ve been building a local-first desktop PDF reader that can read technical books aloud and keep the spoken text highlighted while reading. The original motivation was pretty practical: I read a lot of programming and technic…

qwen llama
Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR (www.reddit.com) +2617 4w

Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3.6-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF…

↯ Qwen 3.6 llama
PS5’s can now be hacked to run Linux - perhaps some potential for local inference? (www.tomshardware.com via reddit) +2612 5w

I look forward to the Local LLM community getting llama.cpp to run on these. Could be a good value.

llama
BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) (www.reddit.com) +2520 4w

TL;DR New llama.cpp fork! I wanted a Windows-friendly inference to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled.

↯ Qwen 3.6 qwen llama
Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into (www.reddit.com) +2528 7w

↯ Qwen 3.6 moe llama agentic
Share your speculative settings for llama.cpp and Gemma4 (www.reddit.com) +2412 8w

I have totally missed the boat on speculative decoding. Today, when generating some code again for the frontend, I found myself staring down at some quite monotonous JavaScript code.

↯ Gemma 4 llama
Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed (www.reddit.com) +239 3w

TL;DR All models were Qwen3.6 27B-MTP vs Base 27B (15k single-turn): Faster overall Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%) Generation: 7.63 → 16.15 t/s (+111.77% speedup) Prompt Processing: 279.75 → 244.90 t/s (-12.46…

↯ Qwen 3.6 llama
235M param LLM from scratch on a single RTX 5080 (www.reddit.com) +233 7w

llama
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM (www.reddit.com) +225 4w

So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF. This is not a "best possible setup" claim.

↯ Qwen 3.6 vllm llama
For everyone that uses OpenCode / Pi - Heres your promptprocessing fix! (www.reddit.com) +2115 2w

This PR deserves much more attention, as it fixes the constant prompt processing that happens when using llama.cpp with OpenCode or Pi. https://github.com/ggml-org/llama.cpp/pull/22929

llama
Testing llama.cpp MTP support on Qwen3.6 - RTX 5090 (www.reddit.com) +218 3w

Setup: - RTX 5090, 32 GB, Linux - Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120) - Unsloth's…

↯ Qwen 3.6 llama
Is using vLLM actually worth it if you aren't serving the model to other people? (www.reddit.com) +2126 3w

So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc.

vllm llama
MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks (www.reddit.com) +213 8w

Hey r/LocalLLaMA, we did an investigation into MiniMax-M2.7 GGUF causing NaNs on perplexity. Our findings show the issue affects 21%-38% of all GGUFs on Hugging Face (not just ours).

minimax llama
Qwen3.6 27B and llama.cpp appreciation post (www.reddit.com) +2011 2w

To preface, here's my config: llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinj…

↯ Qwen 3.6 llama
Experts-Volunteers needed for Vulkan on ik_llama.cpp (www.reddit.com) +201 6w

ik_llama.cpp is great for both CPU & CUDA. Need legends to make Vulkan better as well.

llama
[P] Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book (www.reddit.com) +203 7w

I wrote a book that implements modern LLM architectures from scratch. The part most relevant to this sub: Chapter 3 takes GPT-2 and swaps exactly 4 things to get Llama 3.2-3B: LayerNorm → RMSNorm, learned positional encodings → RoPE, GELU →…

moe deepseek llama
common/gemma4 : handle parsing edge cases by aldehir · Pull Request #21760 · ggml-org/llama.cpp (github.com via reddit) +2012 8w

If you are on Gemma (like me), you basically have to compile llama.cpp daily now

↯ Gemma 4 gemma llama
Introducing BlueTTS (www.reddit.com) +203 8w

I recently worked on BlueTTS, a lightweight text-to-speech model that focuses on speed and usability. It supports multiple languages: English, Hebrew, Russian, Spanish, and French (even within the same sentence), and comes with a large set…

llama
Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB (www.reddit.com) +197 5w

I managed to get DFlash speculative decoding working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR: https://gith…

↯ Qwen 3.5 moe llama
MiMo-V2.5-GGUF (preview available) (huggingface.co via reddit) +19 5w

Hi, AesSedai here - I've put up a PR to support the text-to-text inference of MiMo V2.5 with llama.cpp (and should also support Pro, will work on those quants after finishing V2.5): https://github.com/ggml-org/llama.cpp/pull/22493 I've als…

moe llama
Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. (www.reddit.com) +1936 6w

As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.

↯ Qwen 3.6 gemma llama
Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) (www.reddit.com) +194 7w

Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task.

↯ Qwen 3.6 qwen llama
FYI, Step 3.5 Flash has better perf and context is 1/4 the price in llama.cpp (www.reddit.com) +1926 8w

So I recently updated LM Studio after a long pause and updated my llama.cpp runtimes too. I was shocked.

cline llama
Ban phrases on llama.cpp with this script. (www.reddit.com) +1811 5w

Check the README for setup instructions: https://github.com/BigStationW/llama-cpp-phrase-ban

llama
I stumbled on a Gemma 4 chat template bug for tools and fixed it (www.reddit.com) +182 5w

TLDR: tool parameters using the common JSON Schema pattern `anyOf: [$ref, null]` are rendered into the prompt as empty `type` fields. This strips the useful schema information before the model sees it.

↯ Qwen 3.5 gpt-5 gemma llama+1
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM (www.reddit.com) +1733 2w

Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fit into my RTX 5060 Ti 16 GB.

↯ Qwen 3.6 llama
llama.cpp benchmark native vs. non native NVFP4 on Blackwell - summary (www.reddit.com) +175 5w

I tested two llama.cpp builds on the same Qwen3.6-27B-NVFP4 model. llama-bench reports the model label as qwen35 27B NVFP4, but the actual tested model is Qwen3.6-27B-NVFP4.

↯ Qwen 3.6 llama
mesa PR with 37-130% llama.cpp pp perf gain for vulkan on Linux on Intel Xe2 (gitlab.freedesktop.org via reddit) +172 6w

llama
Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR? (www.reddit.com) +1717 6w

Hey everyone, Ever since the day Google announced TurboQuant, I've been following the news about its extreme compression capabilities without noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the di…

↯ Qwen 3.5 llama
Web OS result from Qwen3.6 35B is by far the best I tested in my laptop (codepen.io via reddit) +1712 7w

This is my first test with this model and Qwen impressed me. I would rate it a 98% usable web OS, compared to my previous best of 70% usable from qwen3 next coder at q2.

↯ Qwen 3.6 qwen llama
Can't keep up with Llama.cpp changes, made a n8n workflow to summarize it for me daily (www.reddit.com) +1713 7w

My kind of daily news, sent to me via Discord. The N8N workflow (you could probably have Hermes or another agent do similar):…

llama
llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig (www.reddit.com) +1620 3w

PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.

↯ Qwen 3.6 moe llama
Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP (www.reddit.com) +168 3w

for anyone who cares... 😄 prompt = spen a 1000 tokens unsloth MTP models strix halo llama.cpp:server-rocm-mtp \ --spec-type draft-mtp \ --spec-draft-n-max 3 Qwen3.5-122B-Q5-MTP-General n_decoded = 100 tg = 29.77 t/s n_decoded = 179 tg = 27…

↯ Qwen 3.5 llama
NCCL-Free Tensor Parallelism on Dual Blackwell PCIe llama.cpp b9095 released! (www.reddit.com) +1610 4w

b9095 finally makes -sm tensor work on dual consumer Blackwell PCIe GPUs without NCCL. If you're on dual Blackwell GPUs, this looks like it could be big. I'll have my own results for 2x 5060 Ti ASAP.

llama
Gemma4 26b a4b Apex quant is quite good (www.reddit.com) +153 2w

I tried mudler's apex quant for gemma4 26b a4b and it was amazing! I got 38 tps at 90,000 context with no loops and, surprisingly, no quality degradation.

↯ Gemma 4 gemma llama
Dual GPU llama.cpp speedup (www.reddit.com) +151 3w

Llama.cpp has had a long-standing issue with "--split-mode tensor": you'll get great results, but it only supports non-quantized KV caches. For this very reason, a lot of people decide to go with a healthy sized KV cache and ignore tensor pa…

↯ Qwen 3.6 llama
MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b (www.reddit.com) +15 3w

I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put it on HuggingFace: ncorder/llama-embed-nemotron-8b-mlx-fp16 ncorder/llama-embed-nemotron-8b-mlx-8bit ncorder/llama-embed-nemotron-8b-mlx-4…

llama
As MTP prepares to land in llama.cpp, Models that support MTP (www.reddit.com) +1516 5w

DeepSeekv3 OG, DeepSeekv3.2/4, Qwen3.5, GLM4.5+, MiniMax2.5+, Step3.5Flash, Mimo v2+. Until we get mtp weights, you need to download HF weights and convert to gguf. I think I'm going to try either qwen3.5-122b or glm4.5-air first.

↯ DeepSeek 3.2 llama
Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max (www.reddit.com) +159 6w

Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K.

↯ Qwen 3.6 qwen llama
Qwen3.6 One Shot Tetris Game (www.reddit.com) +1532 6w

I am blown away by what this model can generate locally. I asked for a flashy Tetris game with particle effect and boy did it deliver!

↯ Qwen 3.6 moe llama
Compile English function descriptions into 22MB neural programs that run locally via llama.cpp (www.reddit.com) +157 7w

We built a system where a neural compiler takes a plain-English function description and produces a "neural program" (a combination of a continuous LoRA adapter and a discrete pseudo-program). At inference time, these adapt a fixed interpr…

↯ Qwen 3 llama
Old Mac Pro still proving its worth (www.reddit.com) +1410 2w

The “Trash Can” Mac Pro, once the most expensive machine you could buy from Apple, mine was just shy of £10,000 in 2016 — that’s £14k in today’s money. Until recently mine was just running as a kubernetes single node development platform,…

↯ Qwen 3.5 qwen llama
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results (www.reddit.com) +149 4w

Following my previous post https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq, a few people asked for the 35B A3B version. The model is up on HuggingFace at https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF if anyone wants to ch…

↯ Qwen 3.6 llama
PSA: llama-swap released a new grouping feature, matrix, allowing you to fine tune which models can run together (www.reddit.com) +143 5w

Previously a model could only be present in a single group. Now you can create whatever groups you want: one for big models that should run on their own, a group for STT + bigger model, a group for RAG usages, etc.

rag llama
Can't replicate Reddit numbers with Qwen 27B on a 3090TI. (www.reddit.com) +1428 5w

I feel like I'm going insane. I see people here posting 30 - 100+ tok/s (100+ being with speculative decoding) on a 3090 with Qwen 3.6 27B.

↯ Sonnet 4.6 sonnet qwen llama
Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant (www.reddit.com) +1437 7w

Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.

↯ Qwen 3.6 moe qwen llama
Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp (www.reddit.com) +144 7w

Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamac…

↯ Qwen 3.6 vllm llama
Qwen3.6 huge quality gain from Q4 to Q6 for coding agent (www.reddit.com) +1312 13d

So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and deepseek was too cheap.

↯ Qwen 3.6 ollama deepseek llama
I ran a quantization shootout on Qwen3-Coder and the results are... interesting (www.reddit.com) +136 2w

Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE from unsloth for a while as it's just really fast on my system.

↯ Qwen 3 llama
Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. (www.reddit.com) +138 5w

I wanted to share an open-source app that I built for running LLMs locally on my setup. My setup (hardware): FEVM FAEX1 (128GB); RTX Pro 5000 Blackwell (48GB), connected over OCuLink Aoostar AG02; 2x2TB internal m.2 drives on raid-0 using mdadm.

↯ Qwen 3.6 qwen llama
To 16GB VRAM users, plug in your old GPU (www.reddit.com) +1319 6w

For those who want to run the latest dense ~30b models and only have 16GB VRAM: if you have an old card with 6GB VRAM or more, plug it in. It matters that everything fits in VRAM, even across 2 cards.

llama
Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model (www.reddit.com) +1221 12d

I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler).

↯ Qwen 3.6 llama
Qwen3.6-35B-A3B vs Gemma4-26B-A4B (www.reddit.com) +1221 2w

Just wondering how are people's experience with both these models! I've had some nice results with Qwen but Gemma4 runs so much faster here.

↯ Qwen 3.6 qwen llama
[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo (www.reddit.com) +1213 2w

Probably most of you are aware that using anything other than -ctk q8_0 -ctv q8_0 / -ctk q4_0 -ctv q4_0 as startup options for llama.cpp leads to prompt processing on cpu instead of gpu for cuda at least. E.g.

llama
Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) (www.reddit.com) +1215 3w

In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests.

↯ Qwen 3.6 qwen llama
More Qwen3.6-27B MTP success but on dual Mi50s (www.reddit.com) +123 4w

TLDR: The hype is real! 1.5x speedup.

↯ Qwen 3.6 llama
Using PaddleOCR-VL-1.5 with llama-server for book OCR (www.reddit.com) +121 6w

I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well.

llama
Using the iGPU as the primary graphics card may improve token generation speed for PCIe graphics cards (www.reddit.com) +1217 6w

A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At that time, I had my monitor plugged in my 4070 and didn't even…

↯ Qwen 3.6 llama
Qwen3.6 agent + Cisco switch: local NetOps AI actually works! (www.reddit.com) +124 7w

↯ Qwen 3.6 cline qwen llama+1
Q8 Cache (www.reddit.com) +1211 8w

https://github.com/ggml-org/llama.cpp/pull/21038 Since now cache quantization has better quality, does that mean Q8 cache is a good choice now? For example for 26B Gemma4?

↯ Gemma 4 llama
hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) (www.reddit.com) +111 2w

A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…

↯ Qwen 3.6 moe qwen llama
Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP (www.reddit.com) +117 4w

Hey fellow Llamas, keeping it short. We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory).

↯ Qwen 3.6 minimax llama
Running Minimax 2.7 at 100k context on strix halo (www.reddit.com) +114 4w

Just wanted to share because it took me a lot of tweaking to get here: llama-server -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95 --host 0.0.0.0 --port 8080 -c 100000 -fa on -ngl 999 --no-context-shift -fit of…

↯ MiniMax 2.7 minimax llama
DS4, a specialized inference engine for DeepSeek v4 Flash (twitter.com via hn) +112 4w

antirez @antirez Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other…

↯ DeepSeek 4 deepseek llama
Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark (www.reddit.com) +1129 5w

Just ran some llama-bench comparisons between ROCm and Vulkan backends on my Strix Halo system. Vulkan came out ahead, which surprised me.

↯ Qwen 3.6 moe llama
Open Weights Models Hall of Fame (www.reddit.com) +11 5w

I read a lot of "whengguf" type posts. I think we should sometimes stop and be grateful.

↯ Mistral mistral gemma llama+1
GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B (www.reddit.com) +111 6w

Hi folks, Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general purpose - it's able to solve puzzles correctly more often too. The initial intent was to optimise the 35B-A3B reasoning traces since it's the most efficien…

↯ Qwen 3.6 llama
Llama.cpp parameters for Qwen 3.6 with RTX 3090 (www.reddit.com) +1112 6w

Hi, I'm trying to run Qwen 3.6-35B on my RTX 3090 (24 GB of VRAM) but I'm not sure about 2 things: - Which variant of the model to use? (Q4_K_S, Q3_K_XL, other?

↯ Qwen 3.6 qwen llama agentic
Intel Arc B70 with HP z640 workstation (pcie 3) (www.reddit.com) +117 7w

↯ Qwen 3.6 llama
Curated a list of 550+ free or cheap AI tools for vibe coding (LLM APIs, IDEs, local models, RAG, agents) (www.reddit.com) +117 7w

Been vibe coding a lot recently and kept running into the same problem finding actually usable tools without paying for 10 different subscriptions or donating my bank balance to Claude. So I put together a curated list focused on free or l…

ollama rag qwen+3
Turn an old Android phone into a Local AI Voice Assistant (www.reddit.com) +111 7w

I had a nice old cracked pixel 5a laying around that I wanted to get some use out of, so I turned it into a local AI Voice assistant. A server on a laptop running llama.cpp gemma-3-4b-q4.gguf served by flask connects to a script running on…

↯ Gemma 3 gemma llama
Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. (www.reddit.com) +105 2w

Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 Its merge request has been denied, so it will not be in mainline llama.cpp.

llama
MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it (www.reddit.com) +1012 3w

I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model.

↯ Qwen 3.6 llama
As of today, what's the *most stable* model to run on a 32Gb RAM Mac w/ 256k context? (www.reddit.com) +1032 4w

Hey everyone, I've been playing around with Gemma4 and Qwen3.6 on my 32Gb Macbook Pro M2 Max since their release but I'm struggling at finding: The best software to run it (oMLX, llama.cpp, ...) The best model + quant to pick The best sett…

↯ Qwen 3.6 llama agentic
why llama.cpp can’t combine speculative decode methods? (www.reddit.com) +105 4w

dicking around with the new mtp speculative decode with qwen3.6 27b, and it’s great. but for agentic coding i’ve seen significant improvements from ngram, because a decent fraction of the time (e.g.

↯ Qwen 3.6 llama agentic
Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama (www.cyera.com via reddit) +102 4w

TL;DR We discovered a critical vulnerability (CVE-2026–7482, CVSS 9.1) in Ollama that enables unauthenticated attackers to leak the entire Ollama process memory, potentially im…

↯ Security ollama security llama
Intel B70: LLama.ccp SYCL vs LLama.cpp OpenVino vs LLM-Scaler (www.reddit.com) +101 6w

In case anyone is interested, I decided to test out LLama.cpp's new OpenVino backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-case, SYCL, but lags behind LLM-Scaler (Intel's VLLM fork), li…

vllm llama
VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? (www.reddit.com) +939 12d

EDIT - IGNORE. I MADE A MISTAKE.

↯ Qwen 3.6 vllm llama
Blackwell and PDL performance increase (www.reddit.com) +95 2w

Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), which is a new feature in Nvidia GPUs (CC >= 90, not including ADA) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels…

↯ Qwen 3.6 qwen llama
MTP experiences on 7900xtx? (www.reddit.com) +99 3w

Hi! I have been using Qwen3.6 35B A3B happily the past few weeks, and I wanted to try out Qwen3.6 27B with the new fancy MTP speculative draft!

↯ Qwen 3.6 moe llama
Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future. (www.reddit.com) +9 3w

I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".

↯ Qwen 3.6 gemma qwen llama
MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp (www.reddit.com) +911 4w

I was wondering what will be the difference in results with flag: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 vs MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Results are quite interesting 49tok/sec without MTP vs 64 tok/sec with MTP. PC: RTX5090+128GB DDR5…

↯ Qwen 3.6 llama mcp
Gradually increasing memory use - is there a memory leak in llama.cpp? (www.reddit.com) +911 4w

I've got a 128GB Strix Halo box. Yesterday I wanted to try out Step-3.5-flash.

llama
Lemonade OmniRouter: unifying the best local AI engines for omni-modality (www.reddit.com) +92 6w

I’ve always liked how if I ask ChatGPT to make or edit an image, it just does it. Local AI should be this convenient!

llama chatgpt openai
Is there a way to mitigate performance as context grows? (www.reddit.com) +913 6w

In my local LLM setup I get from 30 to 80 t/s generation at the beginning, but it drops quite a lot as context grows. I use llama.cpp/Vulkan with an MI50 and a V100. Are there any command-line flags that can improve this?

llama
GPoUr with ~12gb vram and a 3080 getting 40tg/s on qwen3.6 35BA3B w/ 260k ctx (www.reddit.com) +912 7w

The TheTom's turboquant's GPU accelerated turboquant (turbo3) has unlocked high context gains for the 35BA3B family. I can now achieve ~40tg/s via the following GPU-POOR compilation flags and configuration: cmake -B build -DGGML_CUDA=ON -D…

↯ Qwen 3.6 llama
Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload (www.reddit.com) +92 7w

Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough.

↯ Qwen 3.5 llama
Question: Llama cpp, whats good right now for: MTP, KV cache quant, Long context. (www.reddit.com) +819 12d

Used the vllm version of https://github.com/noonghunna/club-3090 It worked fine for maybe 20 40k context, haven't tried the new one. Has anyone used the new llama.cpp patched one for a single 3090?

↯ Qwen 3.6 vllm qwen llama
club-5060ti: practical RTX 5060 Ti local LLM notes and configs (github.com via reddit) +83 3w

I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to…

↯ Qwen 3.6 vllm llama openai
Playing One Night Werewolf (Gemma4 & Qwen3.6) (www.reddit.com) +84 3w

Finally feel like it’s possible. I have a custom build (vibe coded) UI on llama.cpp, allows model switching in the same chat.

↯ Qwen 3.6 llama
Running llama.cpp on Snapdragon Hexagon NPU seems promising (www.reddit.com) +82 5w

https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md I have an Oneplus 12 with Snapdragon 8 Gen 3. I followed the above README to cross-compile llama.cpp on Ubuntu and then copy to the Termux directory on the…

↯ Gemma 3 gemma llama
I built a 5M model to see if it outperforms my 350M model... (www.reddit.com) +89 5w

Hi r/LocalLLaMA ! I built a 5M Llama model with HF Transformers on 2x T4 in Kaggle to see, if it is able to be as good as my previous Apex 350M model (https://huggingface.co/LH-Tech-AI/Apex-1.6-Instruct-350M).

↯ GPT 2 llama
What speed is everyone getting on Qwen3.6 27b? (www.reddit.com) +879 6w

I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0 this is on 3x GPUS (1x2060super 8gb, 2x5060ti 16gb), via llamacpp unsure if this is slow or to be expected? */llama-server --port 8080 --model */llama.cpp/Qwen3.6…

↯ Qwen 3.6 llama
Show HN: MemFactory: Unified Inference and Training Framework for Agent Memory (arxiv.org via hn) +8 6w

Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged…

↯ Fine Tuning fine-tuning llama
Authors Sue Meta's AI Scientists Directly in Llama Copyright Case (www.law.com via hn) +7 13d

A proposed class action filed against Meta Platforms in New York federal court targets not only the company and its CEO Mark Zuckerberg but also two former senior AI researchers by name—an unusual move that could signal a new front in the…

llama
Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp (www.reddit.com) +711 2w

https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja Yall are more than welcome to try it out and provide feedback. In my own testing in Pi-coding-agent I no longer have the "forgot to close thin…

↯ Gemma 4 gemma llama
Time to update llama.cpp to get som MTP improvements! (www.reddit.com) +73 3w

https://github.com/ggml-org/llama.cpp/pull/23269

llama
We have sub-agents at home (www.reddit.com) +71 3w

At work I get unfettered access to gpt 5.4 and sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks. At home I am VRAM poor and like to run the models locally for my own enjoyment.

↯ Qwen 3.6 sonnet llama
Grafting vision onto text models for fun and profit. (www.reddit.com) +72 3w

So as we know.. llama.cpp separates the vision or other multimedia from the main weights.

↯ Mistral mistral llama
Looking to migrate off of Ollama and LMStudio (www.reddit.com) +79 3w

Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.

↯ Gemma 4 vllm ollama gemma+3
2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache (www.reddit.com) +711 3w

PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise) ------- Just wanted to share my current setup, that might help some users out there... services: llama-server: image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b912…

↯ Qwen 3.6 llama
Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan? (www.reddit.com) +74 3w

I have a docker stack with a bunch of AI services and llama.cpp server is the brain. I've got a working vulkan yml snippet for llama.cpp but out of curiosity, I flipped it to ROCM (latest build) and did not see ANY performance improvement.

llama
llama.cpp docker images to run MTP models (www.reddit.com) +72 3w

This is follow up from previous post: https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/ There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently ma…

llama
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context (www.reddit.com) +79 4w

If anyone is looking for a good high-speed setup with ~190k context, this config has been working insanely well for me. I’m using my laptop as a server over Tailscale.

↯ Claude 4.6 moe llama opus
Show HN: Bonsai 1.7B ternary model at 442T/s on M4 Max (agents2agents.ai via hn) +71 5w

We took a recently released Bonsai 1.7B ternary model from PrismML (https://github.com/PrismML-Eng/Bonsai-demo) and ran our agentic evolution search on it for 6 hours to optimize the Metal kernels. The search was fully autonomous.

llama agentic
Qwen3.6-27B-NVFP4 - images (www.reddit.com) +7 5w

Model: Abiray-Qwen3.6-27B-NVFP4.gguf Specs: - Legion 7i Gen10 - NVIDIA GeForce RTX™ 5090 - Intel® Core™ Ultra 9 275HX × 24 - RAM 32.0 GiB llamacpp settings: ./build/bin/llama-server \ -m ~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GG…

↯ Qwen 3.6 llama
gemma-4-31B-it-DFlash has been released (www.reddit.com) +71 5w

https://huggingface.co/z-lab/gemma-4-31B-it-DFlash I guess we'll have to wait until this PR is merged before we can test it. https://github.com/ggml-org/llama.cpp/pull/22105

↯ Gemma 4 gemma llama
llama.cpp DeepSeek v4 Flash experimental inference (www.reddit.com) +74 6w

Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here there is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even quantized at 2 bit, looks very solid in my limited testing, a…

↯ DeepSeek 4 deepseek qwen llama
Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help (www.reddit.com) +718 6w

The new dense model is great, but I’m trying to figure out how to increase PP and Token generation speed. I’m running Q8 quants across 3 7900xtx GPUs and I’m consistently only getting 18-20 t/s generation speed and ~650 t/s prompt processi…

↯ Qwen 3.6 qwen llama
Reproduction of TurboQuant (www.reddit.com) +75 7w

There have been many TurboQuant implementations recently in llama.cpp, mlx, vllm, and sglang, but a lot of the discussion and code around them feels pretty noisy and looks to be AI-generated. I’m trying to understand which claims from the…

vllm llama
I built a local LLM that learns how you use Claude Code and starts auto-piloting it (www.reddit.com) +78 7w

I've been running 5-8 Claude Code sessions at a time and got tired of tab-switching to approve tool calls. So I built claudectl — a TUI that sits on top of all your sessions and lets a local LLM (ollama/llama.cpp) handle approvals for you.

ollama llama claude-code
llama.cpp server have built-in native tools (exec_shell, edit_file, etc.) (www.reddit.com) +62 2w

I was…

llama
Llama.cpp VS LiteRT on a custom Xiaomi 12 Pro 24/7 Server (V2 Redesign) (www.reddit.com) +6 2w

Thanks everyone for the advice on my previous post (24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/…

↯ Gemma 4 ollama llama
Experts first llama.cpp (www.reddit.com) +69 2w

This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers.

↯ Qwen 3.6 moe llama
'Am I OpenAI compatible' - a tool and documentation for unified api signatures in open source AI. (www.reddit.com) +61 2w

This has turned out to be useful to many of my friends so I thought I'd share here as well. I created a tool and documentation page for most major open-source projects' adherence to 'OpenAI compatibility' after seeing inconsistencies betwee…

vllm llama anthropic+1
PSA: If you haven’t updated Llama.cpp for a couple of days and find MTP to not be performing well, update llamacpp. (www.reddit.com) +68 3w

I thought it had horrible performance and was a nothingburger and had spent like an hour benchmarking it. Updated it yesterday and received a like 1.5-1.8x token boost.

llama
If you use continue.dev and Qwen 3.6 (dense / MoE) - I could use your help (www.reddit.com) +65 3w

Someone suggested I give Continue (Vscode extension) a try. I've been using Roo / Zoo now and liking it but it is pretty tough on context and I was told continue has more control over it.

↯ Qwen 3.6 continue-dev moe qwen+1
Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 (www.reddit.com) +63 3w

CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache.

minimax moe llama
Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup (www.reddit.com) +64 3w

Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where i see huge potential and where I truly believe AI will become more useful for day to day tasks. Around the gemma 4 release I was already…

↯ Gemma 4 gemma llama
Llama-Studio, WebUI for llama-server Management (www.reddit.com) +62 3w

Hey all, I have built myself a WebUI for configuring and managing llama-server sessions, and want to share the code and concept. Python and a bit of JS.

llama
[Benchmark] 5090RTX: Promt Parsing, Token Generation and Power Level (www.reddit.com) +6 3w

Inspired by https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ I've decided to put my 5090 to test and see how do the curves look like for the device and whether there were any obvious sweet spots (apart from se…

llama
Llama models: still valuable for finetuning or surpassed by everything new? (www.reddit.com) +69 4w

Hello there people. So I have noticed that people are pretty much ignoring Llama 3 plus 3.1, 3.2, and 3.3 these days.

↯ Fine Tuning ↯ Llama 3.3 fine-tuning llama
Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models (www.reddit.com) +65 4w

Bigger ubatch made gpt-oss-120b prompt processing much faster on my RTX 3090 I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can massively improve prompt p…

moe llama
Why is opencode so slow in processing the prompt with llama server? (www.reddit.com) +650 4w

I'm running opencode and llama-server locally. I have 32gb ram and 780m igpu.

↯ Qwen 3.6 llama
Released a TurboQuant-compatible KV backend evaluation SDK (www.reddit.com) +6 5w

Disclosure: I am the author of this evaluation SDK. I released an independent TurboQuant-compatible KV backend evaluation package for compressed-KV ABI testing, smoke tests, and partial attention decode experiments.

llama
What's your tps on 3090 + Qwen 3.6 27B in real tasks? (www.reddit.com) +614 5w

I struggle to wrap my head around all this. My goal is local agent to solve low complexity tasks, in the same harness where I would use frontier models.

↯ Qwen 3.6 qwen llama
Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing (www.reddit.com) +6 5w

Hi everyone, I’m the maintainer of Box — a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android. Full disclosure: I built this project.

llama
[7900XT] Qwen3.6 27B for OpenCode (www.reddit.com) +63 6w

I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. The VRAM is a little bit scarce, but I ended up with this so far: llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \ --port 8080 \ --host 127.0.0.1 \ --t…

↯ Qwen 3.6 moe llama
FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally (www.reddit.com) +624 6w

Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about. llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA kernels landing in `mmq.cuh`, `mmvq.cu`,…

llama
TurboQuant on MLX & vLLM (www.reddit.com) +65 7w

MLX https://github.com/Blaizzy/mlx-vlm?tab=readme-ov-file#turboquant-kv-cache vLLM https://github.com/vllm-project/vllm/pull/38479 MLX & vLLM users, please share your experience with benchmarks(t/s). Adding llama.cpp Links related to Turbo…

vllm llama
Llama.cpp vs LM Studio on gaming PC (www.reddit.com) +66 7w

Here is my experience, I've been using LM Studio with RTX 5080 and 64GB RAM using Windows 11. I'm very happy with LM Studio except the speed.

↯ Gemma 4 gemma qwen llama
Llamacpp server : How do the -np and -c flags interact? (www.reddit.com) +53 2w

I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.

↯ Qwen 3.6 moe qwen llama
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model) (www.reddit.com) +54 2w

I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me spl…

↯ Qwen 3.6 moe llama
[NEW] Supra-50M Released! (www.reddit.com) +55 2w

SupraLabs released a new model! - Supra-50M Supra-50M is a compact 50M-parameter causal language model (BASE and INSTRUCT…

llama
From 6gb to 32gb (www.reddit.com) +510 3w

Well I ordered a 3090 today. I plan on pairing it with a 3060 I have for 32gb combined VRAM.

llama
Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled (www.reddit.com) +51 3w

My setup is heterogenous, I originally acquired my server (Lenovo ThinkStation P3 Tower Gen 2) to run OpenShift/K8s clusters (because I work on that), and…

↯ Qwen 3.6 qwen llama
Now that MTP is merged... What's the best outputs you're getting on Qwen 3.6 35B on 2x3090s? (www.reddit.com) +52 3w

We've got great outputs for 27B via club 3090, but what about those of us who love the blazing speed of 35B on dual 3090s? I was getting 1500 p/p and 120 t/g with split layers, but MTP slowed it down to 80 t/g when I tested last week.

↯ Qwen 3.6 qwen llama
Dropping learning rate fixed my Qlora fine-tune more than anything else i tried (www.reddit.com) +55 3w

Been fine-tuning llama 3.1 8b with Qlora for a classification task using about 8k samples. I was getting bad eval results for a while and kept thinking something was wrong with my data.

↯ Fine Tuning fine-tuning llama
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (www.reddit.com) +5 3w

I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…

↯ Qwen 3.6 moe gemma qwen+1
Do not fall into the trap of chasing the next scale or upgrade. (www.reddit.com) +56 3w

I mean; don't get me wrong, I love me some improvements and enhancements and it keeps on giving... and with MTP making its way to llama.cpp soon, a lot of you who aren't already running custom compiles are about to get a boost in inference…

llama
How do I use MTP? (www.reddit.com) +57 4w

Hi, I'm trying to use MTP with llama.cpp, I built from source the mtp-pr, download an MTP model from huggingface https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf But when I run the model I have an erro…

↯ Qwen 3.6 llama
Qwen3.6 27b q5_k_M MTP - 256k context - 5090 (www.reddit.com) +53 4w

Straight to it: llama-server-mtp \ -m ~/models/Qwen3.6-27B-Q5_K_M-mtp.gguf \ --spec-type mtp \ --spec-draft-n-max 3 \ --ca…

↯ Qwen 3.6 llama
Testing MiMo-V2.5-IQ3_S with 1'048'576 context (www.reddit.com) +55 4w

llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --parallel…

minimax moe llama
Qwen3.6 27B seems struggling at 90k on 128k ctx windows (www.reddit.com) +513 5w

I have RX 7900 XTX, running Qwen3.6 27B Q4_K_XL. got 400ish pp and 30s tps.

↯ Qwen 3.6 llama
Sorry if it's not the best place to ask this, of the models in the image, which is the best for (problem solving)/Coding and the best one for studying (ask LLM concepts) ? My PC build is RX 9060 XT 16GB + I3 12100F + 16 GB DDR4 + llama.cpp with Vulkan backend + Linux Mint. (www.reddit.com) +58 5w

I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…

↯ Qwen 3.6 moe qwen llama
AMD Radeon RX 6900 XT - ROCm vs Vulkan - Gemma 4 and Qwen 3.5 speed benchmarks (www.reddit.com) +517 6w

Did some quick tests after building llama.cpp with ROCm 6.4.2 and latest Vulkan for my 6900 XT gemma4 E2B Q4_K ubatch ROCm pp512 Vulkan pp512 ROCm tg128 Vulkan tg128 32 1536.60 1423.49 151.92 174.59 64 1590.65 1930.60 151.41 173.76 128 265…

↯ Qwen 3.5 gemma qwen llama
For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. (www.reddit.com) +59 6w

I have a 4 x R9700 system on Threadripper pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking any new model I try out with llama-benchy so that I can get a better idea of how models of diff…

↯ Qwen 3.6 vllm llama
What is the best coding agent (CLI) like Claude Code for Local Development (www.reddit.com) +520 6w

Hey all: I am trying to set up claude code to work with llama.cpp, I am using the Qwen3.6-35B-A3B. I usually use claude code + ZLM subscription i got lucky with $30 yearly - the set up is very simple with their automated script, but for th…

↯ Qwen 3.6 llama claude-code
What do you consider to be the minimum performance (t/s) for local Agent workflows? (www.reddit.com) +58 6w

What would you say is the minimum amount of tokens per second you would tolerate for your local agent workflows? I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context running on an RTX A60…

↯ Qwen 3.6 llama anthropic claude-code
coding with Qwen3.6-27B-UD-Q2_K_XL.gguf (www.reddit.com) +58 6w

pi llama.cpp awesome torus Windows, 5070 (12GB)

↯ Qwen 3.6 llama
RTX PRO 6000 Blackwell Max-Q bad performance (www.reddit.com) +57 7w

llama
RTX PRO 5000 (48GB) vs MacBook Pro M5 MAX (128GB RAM) - The choice for fine-tuning & agentic coding (www.reddit.com) +527 7w

↯ Fine Tuning fine-tuning vllm llama+1
What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M (www.reddit.com) +56 7w

I tried a local model a couple of weeks ago. At the beginning, I tried Ollama, but Reddit says it's better to switch to llama.cpp.

↯ Qwen 3.6 moe ollama llama+2
Llama.cpp llama-server command recommendations? (www.reddit.com) +53 8w

I've seen a ton of PR, and a bunch of failed PR with some interesting additions. I was wondering what other people's commands are looking like now, what they are running for llama.cpp I'm still running: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 l…

↯ Qwen 3.5 llama
How to run Qwen3.5-27B with speculative decoding with llama.cpp llama-server? (www.reddit.com) +514 8w

I run it on 2xRTX 3090. This is part of my llama-server presets file: [Qwen3.5-27B-bartowski] load-on-startup = true alias = Qwen3.5-27B-bartowski hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0 hfd = bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0 draft-mi…

↯ Qwen 3.5 llama mcp
I implemented Laguna (XS.2) as a model in Llama.cpp (github.com via reddit) +45 12d

llama.cpp Manifesto / ggml / ops LLM inference in C/C++ Recent API changes Changelog for libllama API Changelog for llama-server REST API Hot topics Hugging Face cache migration: models downloaded with -hf are now stored in the standard Hu…

llama
Advice on local coding setup (www.reddit.com) +46 13d

Just got an RTX 3090 to go with my Intel Core 9 Ultra 285K CPU and 32 GB of DDR5 6000 ram. I want to code locally on my Windows 11 PC.

↯ Qwen 3.6 qwen llama claude-code
Llama.cpp : Split Mode Tensor Fix Incoming? (www.reddit.com) +41 2w

Appears they have been cooking and we might soon see a fix released for crashes on split mode tensor. Multi-GPU folks, keep watch - (In my tests SM Tensor has a ~35% uplift in TG over Layer but ofc crashes every 90-120 minutes due to vram e…

llama
magic incantation to get llama-bench to work with MTP ? (www.reddit.com) +43 2w

It does not like anything I have tried, including what works with llama-server. Is it not built to work with speculative decoding?

llama
What frontend do you guys use? (www.reddit.com) +421 2w

I’m using vim lmao with a custom made plugin for completing text, so I was curious what yall use. Llama-server seems like a sensible default but it seems limited

llama
llampart 1.0.0 - I released a standalone local web UI for llama-server with translations, extended settings and a polished conversation sidebar (www.reddit.com) +41 2w

Hi everyone, I’ve just published the first public release of llampart 1.0.0: https://github.com/mchowy-troll/llampart llampart is a standalone local web UI designed to work with `llama-server`. It started from the `llama-ui` work in the `l…

llama mcp
Very happy with Qwen 3.5 122B output. But is slowness expected? (www.reddit.com) +422 3w

I'm running the 122-billion Qwen 3.5, specifically Qwen3.5-122B-A10B-Q5_K_M, on DGX Spark (128 GB contiguous memory). I'm (very!) impressed with the general knowledge output.

↯ Qwen 3.5 qwen llama
Using Intel Arc Pro series, any thoughts ? (www.reddit.com) +41 3w

Simple question: has anyone run two or more of either of these on Ubuntu? Intel Arc Pro B70 (32 GB), Intel Arc Pro B65 (32 GB). Running llama or vLLM, etc. Any thoughts?

vllm llama
Audio input not accepted with llamacpp for Nemotron 3 nano Omni ? (www.reddit.com) +41 3w

Llama-server does not accept audio input (or video for that matter) with Nemotron 3 nano omni (unsloth). I’m on a recent build of llamacpp and I redownloaded Nemotron, and I have the mmproj loaded too.

↯ Gemma 4 llama
[Release] Nexidion – A private knowledge vault with an autonomous local AI background worker. (www.reddit.com) +4 3w

Hello, After almost two years of on-and-off development, 5 complete architectural rewrites, and hitting a few brick walls, I’m finally open-sourcing a project I built to scratch my own privacy-paranoia itch: Nexidion. GitHub Repo: https://…

ollama llama openai
Gemma4 26b MoE running in MLX with turboquant (and custom kernel) (www.reddit.com) +4 3w

TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with…

↯ Gemma 4 moe llama
llama.cpp constantly reprocessing huge prompts with opencode/pi.dev (www.reddit.com) +49 3w

I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even tho the prompts are very similar between requests.

llama
My own local first ai harness (www.reddit.com) +44 3w

Hi, I just wanted to share what I've been playing with for the last couple of weeks. I built my own AI harness: TinyHarness. My main goal was a low memory footprint; it is not written in TypeScript/JavaScript/Python, leaving as much memory as possible for…

vllm ollama llama
Apple MLX vs. llama.cpp: compared and benchmarked [video] (www.youtube.com via hn) +4 4w

llama
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot (www.reddit.com) +4 4w

Long post, but hopefully helps somebody. Llama-cpp vulkan server running single AMD R9700.

↯ Copilot ↯ Qwen 3.6 copilot llama chatgpt
Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? (www.reddit.com) +46 4w

How is this dual setup's performance? Is it difficult to set-up everything with for example llama.cpp?

↯ Qwen 3.6 qwen llama
Qwen 3.6 27B MTP on v100 32GB: 54 t/s (www.reddit.com) +42 4w

Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on v100 32GB SXM module using one of those pcie card adapters. Pulled and built in one shot, and llama-server ran without a hitch.

↯ Copilot ↯ Qwen 3.6 copilot qwen llama
Gemma4:31b-coding-mtp-bf16 - slow on Macbook M5 128gb (www.reddit.com) +42 4w

Very quick initial test of the new Gemma 4 MTP model via Ollama (llama.cpp doesn't support it yet): https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/ Running in Open Webui to view token/s output and I…

↯ Qwen 3.6 ollama gemma llama
How do you estimate total memory usage? (www.reddit.com) +41 5w

Qwen3.6 35B A3B UD IQ4_NL_XL. 512k context tokens for 4 parallel processing, key cache quantized to Q_8 and value cache quantized to Q_4.

↯ Qwen 3.6 llama
Mistral Medium 3.5 128B and Qwen 3.5 122B A10B on 4x RTX 3080 20GB (www.reddit.com) +44 5w

Mistral Medium 3.5 128B with 4x3080 20GB with layer split: CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Mistral-Medium-3.5-GGUF/Mistral-Medium-3.5-128B-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1…

↯ Mistral ↯ Qwen 3.5 mistral qwen llama
Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys (www.reddit.com) +4 5w

Been building this for a while and finally cleaned it up enough to share. voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks the full real-time pipeline: microphone capture, Whisper for STT, local GGUF LLM (via llama…

llama
3xR9700 for semi-autonomous research and development - looking for setup/config ideas. (www.reddit.com) +43 5w

Hello everyone. Over the last couple months I have been assembling my local AI setup for personal use, and I thought to write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.

↯ Qwen 3.6 qwen llama
Using a Radeon 9060 XT 16 GB, the gemma4 24b a4b iq4 nl model achieves 25.9 t/s (www.reddit.com) +43 5w

I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Ge…

↯ Gemma 4 gemma llama
Long-context coding on RTX 5080 16GB: Qwen3.6-35B-A3B holds 30 t/s at 128K (89 t/s fresh), no quality drop (www.reddit.com) +4 5w

I wanted to see how much of my coding-agent workflow I could move local instead of paying for hosted tools forever. There was another push: Anthropic's own April 23 postmortem confirmed product-layer regressions through March/April.

↯ Qwen 3.6 llama anthropic claude-code
Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM (www.reddit.com) +44 6w

TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my perso…

↯ Qwen 3.6 qwen llama
Local LLaMA server GPU upgrade advice (www.reddit.com) +49 6w

TLDR: Should an RTX 3090 + T4 be faster than a P40 + T4 for OpenCode with Qwen3.6 35B A3B? Hi, I currently run a setup with a Tesla P40 (24GB VRAM) and a Tesla T4 (16GB VRAM). I mainly use this setup to run models like GPT…

↯ Qwen 3.6 llama
llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth (www.reddit.com) +418 7w

↯ GLM 5.1 glm moe llama
LlaMa.cpp Robot Wars (www.youtube.com via hn) +41 7w

llama
Alibaba's Qwen family captures over 50% of global open-source model downloads (www.scmp.com via hn) +44 8w

Alibaba’s Qwen family captures over 50% of global open-source downloads, report finds. Qwen hits nearly 1 billion cumulative downloads, far surpassing rivals like Meta Platforms’ Llama and DeepSeek, researchers say…

deepseek qwen llama
current: 1x 16GB 5060Ti. worth a 2nd for OpenCode? (www.reddit.com) +49 8w

my current build is just a 16GB 5060Ti running on a 3800X with 32GB DDR4. not really anything special, but I only really use it right now for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). someo…

↯ Qwen 3.5 vllm llama
Intel Releases OpenVINO 2026.1 with Back End for Llama.cpp, New Hardware Support (www.phoronix.com via hn) +4 8w

Intel's OpenVINO toolkit for optimizing and deploying AI inferencing across their range of hardware platforms is out with its newest quarterly feature update.…

llama
The Winamp Skin Museum whips the Llama's ass (2020) (www.rockpapershotgun.com via hn) +3 5d

Over 65,000 skins to browse! In the late nineties and early noughties, no video game forum was complete without a 'post your desktop' thread, and no desktop screenshot was complete withou…

llama
Llama.cpp now has an official website: llama.app (twitter.com via hn) +3 10d

Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer.

llama
I'm seeing low draft acceptance when using Qwen3.x MTP, what am I doing wrong? (www.reddit.com) +314 12d

I'm using llama.cpp, and I've tried Bartowski's and my own quants. When using Qwen3.5-122B or Qwen3.6-27B, I'm seeing really low draft acceptance in chats with interleaved code snippets (chatting with the LLM about programming / a code pro…

↯ Qwen 3.6 llama
Shard - getting to 10× KV cache compression (krishgarg.com via reddit) +33 2w

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench.

llama
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset (www.reddit.com) +36 2w

I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probability distribution of how likely it is…

↯ Fine Tuning fine-tuning gemma qwen+1
I made a local-first MCP tutorial repo with node-llama-cpp and a custom agent loop (www.reddit.com) +33 2w

I just published a repo called MCP from Scratch that teaches the Model Context Protocol by building it step by step in plain Node.js. Most of the repo is about understanding MCP itself, but the later modules may be relevant here: I added a…

model-context-protocol llama mcp
Need Help Choosing a Harness for Qwen 3.6 27B (www.reddit.com) +33 2w

I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a spec…

↯ Qwen 3.6 qwen llama
GPU VRAM only for small models with llama.cpp: is it possible? (www.reddit.com) +311 2w

I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large co…

↯ Qwen 3.6 qwen llama
How I do use the recent llama.cpp native tools to do web rag a.k.a. web_fetch (or anything else for the matter) directly from inside the llama-server's webui (www.reddit.com) +35 2w

Like some other fellow LLMers, I discovered a few days ago that the amazing llama.cpp project has just added native tools functionality to the server. After enabling the relevant options in llama-server and playing a bit with the…

rag llama
Run Chrome’s tiny Gemma4 (aka Gemini Nano) directly on PC without GPU (www.reddit.com) +36 2w

Everyone remembers that sneaky download of Gemini Nano earlier this month? And if you talk to it, it will happily tell you it’s a Gemma.

↯ Gemma 4 vllm gemma llama+1
LLMKube – A Kubernetes operator for local LLMs across Nvidia and Mac fleets (llmkube.com via hn) +3 2w

Run production LLMs on your own hardware A Kubernetes operator for self-hosted LLM inference. vLLM, llama.cpp, TGI, NVIDIA, Apple Silicon.

operator vllm llama
club-rdna16: practical 16GB AMD/Radeon local LLM testing repo (www.reddit.com) +3 2w

Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/…

↯ Qwen 3.6 llama
WebGPU support in llama.cpp (reeselevine.github.io via hn) +3 2w

Introducing WebGPU support for llama.cpp

llama
Is there a way to disable reasoning per request in llama.cpp's llama-server, while leaving it on by default? (www.reddit.com) +37 2w

Title. I've got a llama.cpp server running a model being accessed across a number of scripts, and some of them are easier for the model than others, and those easier ones are also latency dependent.

llama
club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes (www.reddit.com) +35 3w

I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes.

↯ Qwen 3.6 vllm llama
While waiting for Fara-1.5 for my coding harness (www.reddit.com) +31 3w

Hi all, I'm not sure many people are aware, so I wanted to say a word about the Fara-1.5 release. This release will likely be the big sister of Fara-7B and built on top of Qwen3.5. The current Fara-7B performs not badly at all but actually requires a pro…

↯ Qwen 3.6 vllm llama
Developers who use local AI - Q4_0 vs Q8_0 KV quant? (www.reddit.com) +325 3w

I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…

↯ Qwen 3.6 moe qwen llama
Extension idea: llama-server with custom samplers (www.reddit.com) +3 3w

Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for allowing users to add custom sampling logic to llama-server without having to maintain their own entire fork and without having to make a wrapper that reimpleme…

↯ Qwen 3.6 llama
I just bought Asus Ascent : Nvidia GB10 (DGX) and It is slower than my Ryzen Ai Max (www.reddit.com) +322 3w

It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?

↯ Gemma 4 gemma llama
Turboquant+MTP for ROCm(Llama CPP) (www.reddit.com) +32 3w

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.

↯ Qwen 3.6 llama
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (www.reddit.com) +38 3w

Implemented Multi-Token Prediction for QWEN on LLaMA.cpp with TurboQuant. +40% performance!

↯ Qwen 3.6 qwen llama
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com) +319 4w

I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.

↯ Copilot ↯ Qwen 3.6 vllm copilot gemma+2
Show HN: Tokémon – a Pokédex for LLMs that got out of hand (tokemonlabs.com via hn) +33 4w

An unofficial Pokedex for AI models. Compare GPT, Claude, Gemini, Llama, DeepSeek and more, with types, evolutions, base stats, and simulated token-burning battles.

deepseek llama gemini
Terrible Vulkan pp/tg on Arrow Lake iGPUs (www.reddit.com) +32 4w

Hi, I recently tried to get llama.cpp with SYCL running on an Arrow Lake system but gave up halfway through since Vulkan is just way easier to set up. But, the pp/tg I'm getting on Vulkan w/ Arc 130T is disgustingly bad - 100 tokens/s for…

↯ Gemma 4 gemma llama
Does 'preserve_thinking' work with openwebui? (www.reddit.com) +311 4w

I'm running qwen3.6-35b with llama.cpp connected to openwebui. And I noticed the model fails the number guessing game test on openwebui while it works perfectly with the llama.cpp web ui.

↯ Qwen 3.6 llama
Ran some Llama.cpp RPC test to see if its worth it. And if 10Gbe needed. (www.reddit.com) +32 4w

Let me first say I am not doing anything with parallelism, so these benchmarks and tests are not for you. That said, if you're a hobbyist like me who is left wondering whether you can use the GPUs in your other PCs, then I have some answers, but I'm stil…

llama
Is HIPfire worth it for Strix Halo? (www.reddit.com) +32 4w

Did anyone evaluate HIPfire for long context sizes (100k+) and quality, for Strix Halo? It apparently promises large performance increase over llama.cpp and the like.

llama
Just got a 8x 32gb v100 server... now what (www.reddit.com) +335 4w

Looking for suggestions. Current setup: llama.cpp, and I ran qwen 3.5 397b with 256k context.

↯ Qwen 3.6 qwen llama
how i can improve inference speed (www.reddit.com) +3 4w

Specs: Core i5 14400F, 32GB DDR4 RAM at 3200MHz, RTX 4060. Current speeds: 30 tps in output, 500 tps in prefill. Command I currently use: .\llama-server.exe ` >> -m "H:\model\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" ` >> --host 0.…

↯ Sonnet 4.5 moe sonnet llama
Qwen 3.5 MTP for 9B (www.reddit.com) +36 4w

Can llama.cpp run MTP for this model?

↯ Qwen 3.5 qwen llama
Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM RTX 5080)? (www.reddit.com) +35 4w

Hey folks, looking for advice before I delete or keep a huge model file. I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM.

↯ Qwen 3.6 moe llama agentic
Smaller gguf getting way less tokens per second?? So confused! (www.reddit.com) +37 4w

Noob here, Running Qwen3.6 35B A3B in LM Studio on a 3080 10GB + Ryzen 5 3600 on Windows 10. Tried some unsloth quants with identical settings (GPU offload 40, MoE layers to CPU 40, context 8192, flash attention on).

↯ Qwen 3.6 moe llama
half-deployed AI projects haunt my github (www.reddit.com) +32 5w

Got 47 repos that start with 'just playing with Claude' or 'testing Llama 4 on'. Every single one dead after three commits.

↯ Llama 4 llama
Questions regarding abliteration / censorship removal (www.reddit.com) +32 5w

Hello everyone. I just thought of something that seems so obvious but from what I’ve been able to find it doesn’t seem like anyone has done it or at least not openly disclosed it if they have.

llama
best approach for Strix Halo distributed inference in llama.cpp? (www.reddit.com) +36 5w

I was curious to understand what people are doing for this use case to get the best trade-off of convenience and performance. Private backhaul on the 10GbE?

llama
Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop (www.reddit.com) +34 5w

Hi all, I'm running Qwen3.6-27B-UD-Q6_K_XL.gguf using llama-swap and llama-server with these parameters (actually stolen from some posts on this subreddit): llama-server \ -m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \ --mmproj /models…

↯ Qwen 3.6 llama
llama.cpp - NVFP4 native support on Blackwell from now - b8967 (www.reddit.com) +33 5w

It looks like finally we have it! Time to test!!!

llama
How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber (patloeber.com via reddit) +3 6w

Tutorial from the Google guy. I use a very similar setup (llama.cpp instead of lmstudio).

↯ Gemma 4 gemma llama
Why are there so few small local creative writing models from the Chinese? (www.reddit.com) +334 6w

At the moment, models such as Qwen 3.6 35b/27b crush the competition, yet I can't help but notice this pattern. While the local RP scene is abundant with the Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…

↯ Mistral ↯ Qwen 3.6 mistral gemma qwen+1
Will llama.cpp multislot improve speed? (www.reddit.com) +36 6w

I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess comparing to vLLM it might be worse at this, but I recently tried vLLM on 4 slots and it indeed improved the overall speed significantly (150-170t…

vllm llama
Which local models are actually good at staying in character? Notes from shipping Qwen3.5 4B + 9B as game NPCs (www.reddit.com) +319 6w

I'm building a small text-based game where the gameplay loop is "talk an NPC into revealing a secret." It's basically a 20+ turn roleplay stress test: the model needs to stay in character, remember what the player said earlier, and refuse…

↯ Tool Use ↯ Qwen 3.5 tool-use rag llama
llama-server: Save/restore works for tokens, but KV cache still not resumed? (www.reddit.com) +34 6w

Somehow I cannot get KV resume for my Qwen3.5 model with llama-server: save/restore works for tokens, but the KV cache is never reused. Is this expected? How do I enable real resume?

↯ Qwen 3.5 llama
Need help for a calling based agentic ai project (www.reddit.com) +310 7w

llama agentic
model for frigate, a380 (www.reddit.com) +32 7w

↯ Gemma 4 llama
what is the state of using rotoquant at the moment? (www.reddit.com) +34 7w

↯ Qwen 3.6 llama
Show HN: Llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU (news.ycombinator.com) +3 7w

Complete llama.cpp tutorial for 2026. Install, compile with CUDA/Metal, run GGUF models, tune all inference flags, use the API server, speculative decoding, and benchmark your hardware.

llama
Which Qwen models can do FIM (Fill in the middle) for autocompletion? (www.reddit.com) +32 7w

I cannot find a definitive answer. I think the following should be able to do FIM: Qwen 2.5 coder, Qwen 3 coder, Qwen 3-2507 instruct, Qwen 3.5, Qwen 3.6. What I verified: Qwen3-32B: no; Qwen3-4B-Instruct-2507: yes; Qwen3.5-27B: yes; Qwen3.6-35B-A3B…

qwen llama
Context checkpoint erasure in llama.cpp ? (www.reddit.com) +37 7w

Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…

↯ Qwen 3.6 ↯ Function Calling function-calling qwen llama
can someone explain how to use Matrix in Llama-swap ? (www.reddit.com) +34 7w

I noticed that groups have changed to Matrix, to allow concurrent models. Currently I use llama-swap for my models and an individual instance of llama-server for embedding and reranking, all for Open WebUI.

llama
Strix Halo 128GB on Proxmox - Vulkan vs ROCm benchmark matrix (www.reddit.com) +31 7w

Ryzen AI MAX+ 395, Bosgame M5, 128GB LPDDR5x. Proxmox VE 9.1 LXC containers with GPU passthrough.

↯ Qwen 3.5 minimax gemma llama
Hey, has anyone here used Qwen3.5-27B-NVFP4-GGUF with llama.cpp yet? (www.reddit.com) +315 7w

Hey! I was wondering if any of you have used Qwen3.5-27B-NVFP4-GGUF on an RTX5090 with llama.cpp?

↯ Qwen 3.5 llama
Multi host GPU cluster using DAC cables vs 4 GPU system. Anyone doing this successfully? (www.reddit.com) +3 7w

Right now I have 3 GPUs, 5060 Ti 16G, 2 x 4060 Ti 16G, and may get a used 3090 24G that I found. I could build a janky open rack system using M.2 and PCI risers with a 1600W PSU or try something like putting 2 GPUs in 2 systems using the f…

llama
LLM inference engine written ground-up natively in C#/.NET (dotllm.dev via hn) +3 8w

Pure C# pipeline Tokenizer, sampler, scheduler, kernels — all C#. No Python, no foreign runtime, no llama.cpp wrapper.

llama
RTX 3090 llamacpp flags help (www.reddit.com) +33 8w

Hi, my current system hardware: RTX 3090 24GB VRAM and 64GB system RAM, running Windows 11. I've been playing around with Hermes agent and local LLMs (Qwopus3.5-27B-v3-GGUF & gemma-4-26B-A4B-it-GGUF). When I try asking the Hermes agent to do a task with…

↯ Gemma 4 gemma qwen llama
Can I combine a RTX5060ti 16gb with 7900XTX 24gb for llama.cpp? (www.reddit.com) +39 8w

I bought this 7900XTX for 905 euro in Spain and am wondering if I can combine them to run Qwen 3.5 27B, for example? Using an MSI B650 Gaming Plus Wifi and 64gb DDR5 6400mt/s

↯ Qwen 3.5 qwen llama
3x3090 is faster in Ubuntu than win11, GPT-OSS 120B 120tg/s vs 6tg/s why? (www.reddit.com) +324 8w

using z790 prime p d4 with 128gb ddr4 3200mhz ram. 1x3090 in main PCIe5 16x slot and 2x3090 in chipset PCIe4 4x slots.

↯ Qwen 3.5 llama
Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) (local-llm.utop.workers.dev via hn) +2 2d

Hi folks, I found this setup on consumer hardware that seems to give great results locally: qwen 3.6 q6; 450K context using turboquant turbo3 mode (llama.cpp fork); multimodal support. This AI-generated blog article is a kind…

↯ Qwen 3.6 qwen llama
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB (ziraph.com via hn) +21 4d

A matched-quant MLX-vs-raw-llama.cpp benchmark for Gemma 4 12B on one M1 16GB: decode is a tie, both pinned at the bandwidth wall. The cost that differs is startup and CPU…

↯ Gemma 4 gemma llama
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM (deemwar-products.github.io via hn) +2 4d

mochallama: a local, tool-calling LLM inside your JVM. The only in-process, tool-calling local LLM for the JVM: Spring-first, OpenAI-compatible, llama.cpp-backed via Project Panama FFM. No JNI, no daemon, no native-install dance.

tool-calling llama openai
Show HN: Will It Fit? – Opinionated Normal People Llama.cpp VRAM Estimator (hypfer.github.io via hn) +21 5d

llama.cpp VRAM estimator for normal people. Assumes single GPU, all layers offloaded.

llama
Gemma 4 12B appears in Hugging Face (huggingface.co via hn) +2 6d

gemma-4-12B-it-GGUF Recommended way to run this model: llama-server -hf ggml-org/gemma-4-12B-it-GGUF Then, access http://localhost:8080

↯ Gemma 4 gemma llama
Free Yourself from the Copilot Tax (www.kronkai.com via hn) +2 6d

Hardware accelerated local LLM inference for Go with llama.cpp integration.

↯ Copilot copilot llama
LlamaStash – Zero-overhead, terminal-native llama.cpp launcher (github.com via hn) +2 7d

A fast TUI and CLI with init wizard for launching local LLMs via llama.cpp.

llama
Show HN: Thaw – Git branch for a running LLM (fork agents, skip prefill) (github.com via hn) +2 9d

I built thaw because forking an LLM agent is absurdly wasteful today. When an agent explores N branches — RL rollouts, best-of-N, parallel coding attempts — each branch re-runs prefill over the same shared context.

llama
Local run for multi users: which software set? (www.reddit.com) +211 12d

Context: I have been testing and running local LLMs on Linux for some months, first with llama.cpp and now with vLLM for better concurrent capabilities. I use llama-swap in front of either vLLM or llama.cpp in order to have thinking and non-thinki…

vllm llama
Need some advice on AI workflow (www.reddit.com) +210 13d

Hi all, I'm somewhat new to the scene (been lurking for maybe 4-5 months now), but i think I have all the basics figured out. My setup: 9800x3d with 64GB of RAM, 6900xt with 16GB VRAM.

↯ Qwen 3.6 llama mcp chatgpt
Looking for a working Deepseek-v4-Flash quant (www.reddit.com) +25 13d

Best I tried so far is https://huggingface.co/nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF with the custom llama.cpp fork, but it suffers from low quality and random incoherent output. VLLM wouldn't support anything other than H100s for DS4.

↯ DeepSeek 4 vllm deepseek llama
Run Llama.cpp on a Mac Pro 6,1 with Dual FirePro D700 GPUs on Ubuntu (matthewgribben.com via hn) +2 13d

A D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old…

llama
Looking for Suggestions — Single 5090 & 64gb DDR5 (www.reddit.com) +211 2w

Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…

↯ Qwen 3.6 vllm qwen llama+1
Long-context performance at lower quants (www.reddit.com) +21 2w

I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k conte…

↯ Qwen 3.5 llama
Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode (www.reddit.com) +21 2w

I usually don't post about Harbor releases out of respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends.

↯ Qwen 3.5 vllm llama codex+1
Best coding model on RTX 3060 (www.reddit.com) +22 2w

Wondering what’s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it?

vllm llama
Could someone please help explain these results? (www.reddit.com) +22 2w

I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled!

↯ Qwen 3.6 moe llama
What workstation to get for ~13k EUR? (www.reddit.com) +28 2w

My use-cases will be to test open-weight LLMs and work on harnesses, inference systems and possibly other non-ML workflows (CS-related) in the future. Fine-tuning would not be something I do locally because I can rent a B200 from RunPod fo…

↯ Fine Tuning ↯ DeepSeek 4 minimax fine-tuning vllm+2
minor speed bump for MTP with Qwen3.6-27B-MTP Q6_K_XL (www.reddit.com) +26 2w

I'm on a MacBook M5 Max with 128GB RAM, running a test in openwebui using llama-server (llama.cpp). unsloth/Qwen3.6-27B-UD-Q6_K_XL.gguf (non MTP): 19tps; unsloth/Qwen3.6-27B-UD-Q6_K_XL.gguf (MTP): 22.3tps. So nothing like the massive improvemen…

↯ Qwen 3.6 llama
WebGPU back end in llama.cpp/ggml (twitter.com via hn) +2 2w

llama
Agent builders: are GPT/Claude/Gemini API costs killing your margins? (www.reddit.com) +24 2w

Hey everyone, For people building agents with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude MCP/SDK, Google ADK, or LlamaIndex — how are you managing LLM API costs? Agent workflows can get expensive fast because of: tool calls retr…

rag deepseek qwen+5
What’s the cheapest way to give a local Llama 3 internet access? (SearXNG isn’t cutting it) (www.reddit.com) +219 2w

Finally got Llama 3 70B running locally and wired up function calling so it can search the web. First tried self-hosting SearXNG, but the results are pretty messy.

↯ Function Calling function-calling llama
At wits end for optimizing settings in llama.cpp for 100k context (www.reddit.com) +27 2w

Long story short, I am running Qwen3.5-35B-A3B (GGUF format) and other models on macOS and getting around 1500 tokens/sec for prompt processing and around 35-50 tokens per second for token generation. I'm using the latest version of llama…

↯ Qwen 3.5 llama
Do smaller quants silently break tool calls / JSON output? (www.reddit.com) +25 2w

I posted recently about EvalShift, an OSS CLI for regression-testing LLM model changes. A few people pointed out that for LocalLLaMA, the more interesting use case may be quantization regression: Q8 -> Q4_K_M Same base model, same prompts,…

vllm ollama llama
Ternative – C++/CUDA inference engine for ternary LLMs with runtime LoRA (github.com via hn) +21 3w

ternative: inference engine for ternary-weight LLMs with runtime LoRA, the llama.cpp of BitNet models. Loads a BitNet I2_S base GGUF + a separate LoRA adapter GGUF, merges them at full F32 precision, and serves the result via an OpenAI-…

llama openai
Weird performance depending on quant (www.reddit.com) +26 3w

Hi, I'm using llama.cpp with qwen3.6 35B A3B on two different machines. I noticed that on both machines tokens per second is better while using Q4_K_S and Q4_K_M quants than lower Q3_K_M quants.

↯ Qwen 3.6 llama
Benchmarking llama.cpp's new MTP support on Strix Halo (calebcoffie.com via hn) +2 3w

Benchmarking llama.cpp's brand-new MTP support on Strix Halo PR #22673 landed in llama.cpp on May 16. It adds first-class Multi-Token Prediction (MTP) speculative decoding for models that ship with an MTP head, including Qwen3.6 27B dense…

↯ Qwen 3.6 llama
Tesla P40 running qwen 3.6 (www.reddit.com) +2 3w

Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40.

↯ Qwen 3.6 qwen llama
Llama-server: is it bleeding to CPU/RAM? (www.reddit.com) +23 3w

Is there an easy way to know if a model is using CPU/RAM (and not only GPU/VRAM)? (I think standard verbose output, which got shorter, says nothing about this, but I may be missing something)

llama
Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 (www.reddit.com) +2 3w

I'll be UPDATING this as it seems I was benchmarking and testing Just before the UPDATE LOL TL;DR If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and co…

↯ Qwen 3.6 qwen llama agentic
b9200 released - potential mtp pp increase (www.reddit.com) +23 3w

testing in progress ...we all need an increase in pp 😆 https://github.com/ggml-org/llama.cpp/releases/tag/b9200 u/am17an am17an commented 13 hours ago • Overview Avoid copying the logits for every token in the batch when doing prompt proce…

llama
ik_llama: Qwen3.6 27B and 35B on very low VRAM (www.reddit.com) +23 3w

Thank you to the people at ik_llama and llama.cpp. It's amazing how far you've all pushed mtp and other tech so that I can run 27B and 35B Qwen3.6 models on an old gaming laptop with a RTX2060 mobile at 6GB VRAM and 32GB RAM.

↯ Qwen 3.6 llama opus agentic
I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why (www.reddit.com) +23 3w

In my real-world usage (opencode) and in my synthetic benchmarks, Coder-Next (Q5) demolishes the whole Qwen3.6 family including the 27B Dense model (All Q8). Everybody else is hailing that 27B is superior and is an amazing model, but I hav…

↯ Qwen 3.6 qwen llama
Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks (www.reddit.com) +25 3w

Here are some results (llama.cpp)! Task 1: write a short poem 27B Dense: 12.5 tokens/s 27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s 27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s Task 2: edit a hello world HTML artifact 27B Dense:…

↯ Qwen 3.6 qwen llama
Strix Halo ROCm + MTP Notes (May 2026) (www.reddit.com) +22 3w

With the MTP merge into mainline llama.cpp I wanted to try out some other optimizations I could think of. Ended up testing backends, MTP, and bumping to ROCm nightlies.

moe llama
How does Pi coding agent control Qwen's thinking verbosity? (Qwen 35B A3B, llama-server) (www.reddit.com) +210 3w

I'm running Qwen 35B A3B via llama-server with reasoning budget set to -1 (unlimited) for testing. In every client I've tried, the model just thinks endlessly before responding.

qwen llama
RDNA3 Flash Attention fix just dropped by llama.cpp b9158 (www.reddit.com) +2 3w

https://github.com/ggml-org/llama.cpp/releases

llama
Ollama Pre-Release Switches From Building on GGML to Using llama.cpp Directly (www.reddit.com) +23 3w

https://github.com/ollama/ollama/releases/tag/v0.30.0-rc15 Hopefully this has more devs come to llama.cpp to support Day 1 releases due to Ollama now moving to using llama.cpp directly. Additionally, I hope that Ollama makes it clear that…

ollama llama
I made a UI and server for using Anthropic's new Natural Language Autoencoders locally with llama.cpp (www.reddit.com) +21 3w

Anthropic's first open weight models, Natural Language Autoencoders, are just finetunes of popular open weight models. They do not modify architecture and modeling code so inference with llama.cpp is mostly trivial.

llama anthropic
How to disable reasoning for Qwen3.5 4b 9b unsloth ggufs? (www.reddit.com) +21 3w

Hi all I'm trying to disable reasoning for quicker outputs in llamacpp-server. I remember using LM studio and that having a think button in the gui that could be toggled but later I tried the unsloth ggufs but they don't have that button f…

↯ Qwen 3.5 llama
Is it possible to exclusively use a draft model for reasoning to speed up generation? (www.reddit.com) +218 3w

EDIT: Edited to provide more clarity. It occurred to me that perhaps the same draft model used for speculative decoding would be completely adequate if we just used its output as-is for reasoning, without validating the results against th…

vllm llama
Vulkan or CPU llama cpp backend for local llm for coding/code assist (www.reddit.com) +21 4w

Hi all, I recently started a new job and we're doing Python development for a CI/CD metadata consolidation library for analytics, and we cannot use anything like Claude Code, Codex, GH Copilot, or any model APIs (free or paid). I got a la…

↯ Copilot ↯ Qwen 3.5 ollama copilot qwen+3
MagicQuant (v2.0) - Hybrid Mixed GGUF Models + Unsloth Dynamic Learned Quant Configurations + Benchmark table with collapsed winners and more (www.reddit.com) +2 4w

I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment.

↯ Qwen 3.6 llama
TensorRT-LLM vs vLLM vs llama.cpp on NVIDIA DGX Spark? (www.reddit.com) +2 4w

I am looking for recommendations on the best way to run local LLMs on NVIDIA DGX Spark. Which stack makes the most sense in practice: TensorRT-LLM, vLLM, or llama.cpp?

vllm llama
How does llama-server pick which MoE experts go on the GPU and which stay on the CPU? (www.reddit.com) +213 4w

If you are using a MoE model that does not fully fit in your GPU, some of the experts must stay on the CPU. Putting the experts that you will actually need on the GPU will give you GPU inference speeds.

moe llama
am I running this llama-bench of Qwen3.6-27B on these V100s right? (www.reddit.com) +210 4w

basically what I'm doing here is trying to validate whether or not it's a reasonable idea to get a couple of V100s, either SXMs with PCIe adapters or straight-up PCIe cards in the first place, for the sake of running this model or models l…

↯ Qwen 3.6 llama
Tracing tokens through Llama 3.1 8B inference on H100s (krithik.xyz via hn) +2 4w

You open Claude.ai, chatgpt.com, gemini, whatever LLM provider you use. You type something: "What is the capital of France?" You hit enter.

llama gemini chatgpt
9070xt inference for q3 qwen 27B (www.reddit.com) +2 4w

In llamacpp I'm getting 12tok/s, does this number look right to you and what can I do to increase this number (if possible)? cd ~/llama.cpp && ./build/bin/llama-server -m models/qwen-3.6-27b-abliterated-q3.gguf -ngl 999 -c 65536 (i need th…

↯ Qwen 3.6 qwen llama
How long for llama.cpp official support of MTP? (www.reddit.com) +22 4w

Hello there (beginner here). I've been unable to build llama.cpp myself for my Strix Halo (Windows 11) (cmake errors, I have not dug too much into it, already burned hours...), so I was wondering when an official release for Vulkan/HIP w…

llama
How difficult is distilling? (www.reddit.com) +24 4w

I remember a year or so ago when DeepSeek R1 came out and it was pretty quickly distilled into Llama 3 8b and Qwen 2.5 (?) 7b. Why don’t we see more distilled models?

↯ Qwen 2.5 deepseek qwen llama
Gemma4 26B A4B NVFP4 GGUF (www.reddit.com) +21 4w

Hey everyone! I’ve just uploaded a GGUF version of nvidia/Gemma-4-26B-A4B-NVFP4.

↯ Gemma 4 gemma llama
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide (www.reddit.com) +25 4w

I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday.

↯ Qwen 3.6 moe llama opus
My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) (www.reddit.com) +211 5w

UPDATE: I have switched to Vulkan (image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9014) and now I am getting prompt eval: 591.01 tok/s, generation: 41.90 tok/s, which is faster than ROCm. New config: services: llama-cpp: container_name: lla…

↯ Qwen 3.6 llama
LLM inference speed database or leaderboard? (www.reddit.com) +213 5w

A lot of the posts in this sub are about advice on which hardware to buy, what settings to use, and what speed to expect. There are a lot of excellent replies spread all over the place, but a lot of it is also just vague indications like ~…

llama
M3 Ultra + DGX Spark = M5 Ultra-lite? (www.reddit.com) +24 5w

So I saw an article recently about exo disaggregated prefill with DGX Spark and M3 Ultra - prefill on one machine and decode on another. DGX Spark apparently has 4x matmul performance over an M3 Ultra - same as the M5 Ultra should have.

minimax qwen llama
Mistral Medium 3.5 on AMD Strix Halo (www.reddit.com) +23 5w

TLDR; it's slow as heck. Run overnight.

↯ Mistral mistral llama
Show HN: Llmconfig – configfile and CLI for local LLM (github.com via hn) +2 5w

llmconfig Local Large Model Config — manage local inference with llama.cpp, stable-diffusion.cpp, and whisper.cpp from a single YAML file and a single CLI. llmconfig up gemma # or just: llmc up gemma ✓ gemma is ready at http://127.0.0.1:80…

gemma llama
[Help] Running big dense models faster (www.reddit.com) +26 5w

I have been trying Mistral 3.5 on my 4x RTX 3090 rig with llama.cpp. Inference is slow (about 11 t/s) even without anything being offloaded to the CPU.

↯ Mistral ↯ Mistral 3.5 mistral vllm qwen+1
World AI Agents – 35 AI Models (Claude, GPT, Llama) via One OpenAI-compatible API (world-ai-agents.com via hn) +21 5w

Access Claude, Llama, Mistral, Nova and more through a single OpenAI-compatible API. Start for as little as €1.

↯ Mistral mistral llama openai
PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (github.com via hn) +2 5w

Open LLM inference, rewritten by hand for one specific chip at a time. Kernels, speculative decoding, and quantization, tailored per target.

llama
"I" is not singular — 4 LLM agents with per-agent LoRA on a single RTX 3070 8GB (www.reddit.com) +2 5w

https://preview.redd.it/7yei65sbugyg1.png?width=1703&format=png&auto=webp&s=ad388c51dd10cb44b41a99876d28797e006fd138 Stanford's Generative Agents = one LLM cosplaying 25 personas. I wanted agents that actually become different people — dif…

↯ Qwen 3 llama
Best open-weight model to run locally on 8x A100 80GB for generating teacher data? (www.reddit.com) +212 5w

I have (free) access to a SLURM cluster with 8x NVIDIA A100 80GB GPUs (=640 GB VRAM) on a single task, and I want to run an open-weight model locally with llama.cpp for data generation, not coding. My use case is generating teacher data fo…

↯ Fine Tuning fine-tuning llama
What STT/LLM/TTS combo are you running for production voice agents in 2026? (www.reddit.com) +21 5w

Curious what stacks people are actually using right now, and where you're hitting walls. Some things I've been observing while testing combos: - Deepgram Nova-3 still the best STT for English, Cartesia is closing the gap on streaming - Ele…

↯ Llama 3 llama anthropic openai
Llama.cpp MIPS R8000 Kernel Running on an SGI Power Challenge from 1995 (twitter.com via hn) +2 5w

Whew! Big work today getting optimized llama.cpp MIPS R8000 kernel running on the SGI Power Challenge deskside from 1995 with Gemma 3 270M.

gemma llama
Help with MI50 and llama.cpp/ROCm 7.2 (www.reddit.com) +22 5w

I have an MI50 that I use with llama.cpp/Vulkan, however some models run quite slowly, so I'd like to try the ROCm backend, but no matter what I try it doesn't work. Downloading the missing files from ArchLinux package doesn't work.

llama
llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged (www.reddit.com) +21 5w

https://github.com/ggml-org/llama.cpp/pull/22196 And somehow we already got some GGUFs for it! https://huggingface.co/CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF (the below one is…

↯ Qwen 3.5 gemma llama
Is long re-processing of output as input a common "feature" or not? (www.reddit.com) +212 6w

I now mostly use Gemma 4 and Qwen 3.5 models *. It seems that all of them, once the context grows a bit, after giving me a long output and getting a short prompt in response, start processing many new tokens as input, and I have…

↯ Qwen 3.5 gemma qwen llama
I ran Gemma 4 E2B with llama.cpp on a lot of different iPhones, here's the setup report (www.reddit.com) +2 6w

TLDR: I've been running gemma4 e2b extensively on iOS with llama.cpp and found some interesting quirks and info you guys may like! These are specifics for the iPhone and what I've found worked across 20+ devices.

↯ Gemma 4 gemma llama
llama.cpp - tool calling issues on Windows only (www.reddit.com) +24 6w

I have a dedicated linux box I run all my stuff on. I occasionally see the 'zomg 35b can't call tools?!' posts here and chuckle to myself in a *zero issues here* way.

llama
I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. (www.reddit.com) +217 6w

It looks as if OpenCode introduces an artificial delay in agentic coding. Have you noticed similar issues?

llama agentic
Last llama.cpp update broke web search tool calling with Qwen 3.6 27b. (www.reddit.com) +24 6w

At least in open-webui. Nothing has changed except for the backend update.

↯ Qwen 3.6 qwen llama
PI agent integrated with Cline-Kanban repo: All using PI and Qwen 3.6 35B MOE UD 4K_XL (www.reddit.com) +22 6w

Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…

↯ Qwen 3.6 cline moe qwen+3
Ubuntu 26.04 vs 24.04 speed improvements for inference? (www.reddit.com) +24 6w

I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference with VLLM, llama-server, and/or LM Studio.

vllm llama
Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B (www.reddit.com) +2 6w

Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice speed increase for working on the same cod…

↯ Qwen 3.6 llama
Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP? (www.reddit.com) +28 6w

I'm a daily llama-cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM.

↯ Qwen 3.6 vllm llama
Quant Qwen3.6-27B on 16GB VRAM with 100k context length (www.reddit.com) +2 6w

https://preview.redd.it/tblmrwxkbexg1.png?width=1193&format=png&auto=webp&s=6dea1e6684e75e22852d57c0c72e9171deb56ae2 I have experimented with how to run Qwen3.6-27B on my laptop with an A5000 16GB GPU. I have created my own IQ4_XS GGUF "qwen3.6…

↯ Qwen 3.6 llama
RTX 3090 + 27B model performance issues (llama.cpp) what am I doing wrong (www.reddit.com) +217 6w

Hey folks — looking for some advice on improving my local LLM setup (and also exploring agentic coding workflows). Current setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB Using llama.cpp with a Qwen3.6 27B Q6 model (GGUF) Running through OpenCo…

↯ Qwen 3.6 llama agentic
Show HN: Doxa – Open-source emergent simulator for geopolitical scenarios (github.com via hn) +2 6w

Hi! We, Vincenzo and Riccardo, built Doxa as an agnostic engine for emergent simulations with agents for constrained scenarios (like geopolitical, economics, ...) and it works well with LLMs like Qwen2.5:7B, Llama but also cloud models such a…

↯ Qwen 2.5 llama gemini
Is there any quick way to estimate best parameters for llama.cpp? (www.reddit.com) +29 6w

I usually just throw models into LM Studio but I decided to finally compile llama.cpp on my hardware to get some extra speed and to hopefully replace my increasingly unreliable cloud subscription. I have a RTX 4080 and Ryzen 5 7600 with 32…

llama
Severe instability and looping issues with local LLMs (Qwen, Zen4, llama.cpp) (www.reddit.com) +221 6w

I tried working on a local LLM project today and honestly ended up pretty frustrated. I tested several approaches, but none of them worked reliably.

↯ Qwen 3.6 qwen llama
Speed penalty with Q8 KV quantization (www.reddit.com) +2 6w

I knew there would be a speed penalty when switching the KV cache quantization from F16 to Q8, but I never expected it to be this significant at longer context sizes. I ran a test with Qwen 3.5 122B on my MacBook M2 Max using llama.cpp.

↯ Qwen 3.5 qwen llama
Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried? (www.reddit.com) +217 6w

Do I understand correctly, based on this comment, that I can potentially fit Qwen 3.6 27B FP8 precision model and have around 256K context available and fit it fully in my RTX 5090 VRAM? Of course with the help of TurboQuant compression, a…

↯ Qwen 3.6 qwen llama
A 258M-Parameter Turkish LLM Trained from Scratch: Marul V7 (www.reddit.com) +23 6w

Hi, I'd like to share a project I've been working on for a while. I have a Turkish language model that I developed from scratch: Marul V7. The model was trained completely independently.

llama
Verbatim AI – on-device transcription (Whisper) + summaries (Llama 3.2) (apps.apple.com via hn) +21 7w

llama
eGPU vs system RAM (www.reddit.com) +28 7w

minimax llama
kIOGPUCommandBufferCallbackErrorImpactingInteractivity... recreate the backend to recover (www.reddit.com) +2 7w

↯ Qwen 3.6 qwen llama
PSA re Qwen 3.6 35B A3B q4 + agents (www.reddit.com) +29 7w

↯ Qwen 3.6 moe deepseek qwen+1
Recommended parameters for Qwen 3.6 35B A3B on a 8GB VRAM card and 24GB RAM? (www.reddit.com) +214 7w

↯ Qwen 3.6 moe qwen llama
llama-server / web gui / C++ mcp server : is it possible to inject context (for skills or text flavour)? (www.reddit.com) +29 7w

llama mcp
How is Rotorquant/planarquant/iso quant better? (www.reddit.com) +22 7w

↯ Qwen 3.6 gemma qwen llama
Inferena: Local benchmark of PyTorch vs. Llama.cpp vs. Rust frameworks (inferena.tech via hn) +2 7w

llama
5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4 (www.reddit.com) +2 7w

Hi guys, just want to share a Frankenstein build I put together that is surprisingly decent. I have an i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to a RX9070, then I…

↯ Qwen 3.6 qwen llama
Has anyone figured out STT with Gemma4 for Home Assistant? It works but responds with full thought chain. (www.reddit.com) +21 7w

I have Gemma4-E2B working within home assistant as STT, and E2B seems fast and accurate for STT (maybe a bit better than Parakeet), however, it responds with the entire thought process: https://preview.redd.it/v8zhb5elltvg1.png?width=599&f…

↯ Gemma 4 gemma llama
Help me squeeze every drop out of my AMD Ryzen AI Max+ 395 (96GB unified VRAM) — local LLM, image/video gen, coding agents (www.reddit.com) +22 7w

I'm running a local AI setup and want to make sure I'm using my hardware to the absolute maximum. If you have tips on better models, smarter configurations, or services I'm missing, drop them in the comments.

openclaw llama
Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (news.ycombinator.com) +22 7w

https://w418ufqpha7gzj-80.proxy.runpod.net Started for myself, but since I'm not using it continuously, I'm sharing it: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (TheTom/llama-cpp-turboquant) on RTX 3090 (Runpod spot instance). 5 pa…

↯ Qwen 3.6 llama
Ask HN: What are the machine requirements for a LLM like Llama-3.1-8B? (news.ycombinator.com) +22 7w

I want to create a local GenAI. Tell me the server machine requirements.

llama
Anyone who tried new 3.6 on single 3090, what's your llama.cpp flags for best performance ? (www.reddit.com) +25 7w

It's been some time now, surely some have tinkered with it more and optimised it already

llama
Strix Halo concurrency 4 16k context 64 t/s Qwen3.6-35B-A3B-Q8_0 (www.reddit.com) +2 7w

https://preview.redd.it/4906akj9dovg1.png?width=1527&format=png&auto=webp&s=c49e255ac79a3c5455f44603422f8af7ddc12594 First of all, can we make https://www.youtube.com/watch?v=2lUC8Gimxz8 Angine de Poitrine this sub's official band? Those guy…

↯ Qwen 3.6 qwen llama
Is there a way to have qwen-code CLI read images? (www.reddit.com) +21 7w

Basically I am asking the model to describe an image, but it says it can't process the images. The weird thing is that if I send the image encoded directly on the prompt, it works just fine, I am using llama-server with qwen3.5 (tried all…

↯ Qwen 3.6 qwen llama codex
I want to run qwen3.5 27B q4_k_m on CPU, and I need help. (www.reddit.com) +217 7w

I am a local LLM beginner and I found this Reddit while looking for help. (Please understand that I am unfamiliar with Reddit.) (system- i5 4440 1.8GHz/b85m ds3h/DDR3 32GB/128GB SSD/Ubuntu 25.10 questing) I loaded Qwen3.5 27B Q4_K_M onto…

↯ Qwen 3.5 llama
Qwen 122B is AMAZING but is my config right? (128GB M4 Max) (www.reddit.com) +29 7w

Hi! I hope it's okay for me to ask this here.

↯ Qwen 3.5 qwen llama
Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? (www.reddit.com) +215 7w

I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.

↯ Qwen 3.5 vllm llama
[Paper] Residual Streams / KV Direct (www.reddit.com) +21 8w

It seems we have entered a period of accelerating innovation regarding the KV cache. Someone mentioned this post's paper in the Github issue of llama.cpp for implementing Turbo Quant.

llama
Vulkan compilation issue on Fedora (b8786) — solved (www.reddit.com) +22 8w

If you pull https://github.com/ggml-org/llama.cpp/releases/tag/b8786 and try to build with Vulkan support on Fedora, you may hit this error: [ 39%] Building CXX object ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/multi_add.comp.cpp.o /h…

llama
DotLLM – Building an LLM Inference Engine in C# (kokosa.dev via hn) +2 8w

Introducing dotLLM - Building an LLM Inference Engine in C# If you’ve been building .NET applications and wanted to run LLMs locally, your options have been… limited. You could wrap llama.cpp through LLamaSharp, deal with ONNX Runtime, or…

llama
Older model suggestions (www.reddit.com) +25 8w

Due to costs I am running on some older hardware. Looking for suggestions on supported models for my particular stack.

llama
Claude down? TokenMonopoly will help you find the best deals in AI subs (tokenmonopoly.com via hn) +2 8w

TokenMonopoly Live leaderboard of AI API deals — pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.

↯ Swe Bench swe-bench deepseek llama+1
Show HN: How to Use Google's Extreme AI Compression with Ollama and Llama.cpp (news.ycombinator.com) +2 8w

The introduction of TurboQuant, PolarQuant, and QJL (Quantized Johnson-Lindenstrauss) by Google Research represents more than just a technical optimization. At Vucense, we view this as a landmark moment for Inference Sovereignty https://vu…

ollama llama
Show HN: Ext-Infer – Native LLM Inference and Embeddings for PHP (infer.displace.tech via hn) +1 2d

Introduction ext-infer is a PHP 8.3+ extension that loads a GGUF model and runs LLM inference inside the PHP process via llama.cpp. PHP-native semantic search, RAG pipelines, and CLI / worker inference run without shelling out to Python or…

rag llama
Show HN: LLMhop – A tiny, stateless router for LLMs with a NixOS module (github.com via hn) +1 4d

LLMhop is a tiny stateless proxy for LLM inference servers. It tackles an issue I faced when trying to serve more than one local LLM at once which is not natively supported by vLLM.

vllm llama
Show HN: TurboPrefill – Multi-GPU prefill acceleration for llama.cpp (github.com via hn) +1 5d

TurboPrefill is an attempt to make layer-split multi-GPU configurations spend less time waiting and more time computing during prefill.

llama
ik_llama.cpp – llama.cpp fork with better CPU performance (github.com via hn) +1 8d

ik_llama.cpp: llama.cpp fork with better CPU performance TL;DR This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via…

deepseek llama
DeepSeek V4 Flash at 8.4 tok/s on 3×3090: patching the GGUFs that won't load on cchuter's llama.cpp fork (www.reddit.com) +18 12d

My apologies if anything does not make sense; I literally don't know what I am doing. I'm not a programmer, just a simple vibe coder with a Claude subscription. That said, if you have 200GB of system RAM + VRAM and want to run deepseek v4 flash…

↯ DeepSeek 4 deepseek llama
Show HN: Biopetals – Run biology tuned Llama, BitTorrent-style (github.com via hn) +1 13d

About a month ago, I heard about petals. Petals is basically a library that lets you run LLMs by loading the weights onto a network of computers that are all running petals.

llama
Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference? (www.reddit.com) +11 13d

I was given the great opportunity to borrow an H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system RAM I will get, but I guess they are a bit flexible on this.)

↯ Qwen 3.6 vllm llama agentic
Llama.cpp Console released (www.reddit.com) +19 13d

https://github.com/alekk89/llama.cpp-Console/ for windows users

llama
I ditched LM Studio for llama.cpp and my local LLM doesn't feel like a downgrade (www.xda-developers.com via hn) +1 13d

LM Studio has been my default runner for as long as I've been running local LLMs, which is more than long enough now to call it part of my daily flow rather than just something I'm experimenting with anymore. The appeal of LM Studio is pre…

llama
Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? (www.reddit.com) +18 13d

Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774) My run command: ``` llama-server \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --presence_penalty 0.0 \ --min-p 0.00 \ --gpu-layers all \ -m /home/eleung/huggingface…

↯ Qwen 3.6 qwen llama
I made a Windows app for managing llama.cpp in WSL/Ubuntu (www.reddit.com) +11 2w

I’m a Windows user, and I have fairly Windows-y expectations for software: I prefer not having to live in a terminal just to install, build, configure, and run things. I couldn’t find an app that managed the full llama.cpp-on-WSL workflow…

llama
Llama.cpp: What's up with -sm tensor + AMD + Vulkan? (www.reddit.com) +11 2w

Has anyone got it to work? I tried it with dense models (eg qwen 27b, gemma 31b, mistral 128b) since that's where I need it most, but it always core dumps.

↯ Mistral mistral gemma qwen+1
Poor performance on RX 9070 XT (www.reddit.com) +15 2w

I was thinking about upgrading from an MI50 to an AMD AI PRO9700, and I happen to have an RX 9070 XT on my gaming pc, so I tested the performance on it to have an idea of what to expect. So, install rocm, build llama.cpp, download Qwen3.6-…

↯ Qwen 3.6 llama
Built a local-first AI memory system that indexes screen activity, meetings, and voice notes (MCP + automations) (www.reddit.com) +17 2w

Been experimenting with an idea — what if your AI assistant actually remembered everything you did on your computer? Not stateless chats, but real persistent context.

↯ Gemma 4 gemma llama cursor+1
What is everyone using AI for? Realistically (www.reddit.com) +111 2w

So I have to admit, I have fallen victim to the cool looking dashboard videos but I’m struggling to find a use for me. I love AI and use it daily for general questions and some deeper research (Google Gemini free tier).

ollama openclaw qwen+2
I spent €300 extracting raw LLM weights, ran into a wild codegen bias trap, and finally mapped the internal activation geometry (60 Graphs) (www.reddit.com) +11 2w

Hey Reddit! A couple of weeks ago, I posted about my independent research on treating LLM alignment as a latent space shift.

qwen llama
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge) (www.reddit.com) +1 2w

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a…

↯ Security jailbreak security llama
llama.cpp oom issue (www.reddit.com) +113 2w

I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least it gets killed and restarted before it st…

↯ Qwen 3.6 llama
How to install llama.cpp the best way for wrapping it in a Python UI (CPU only)? (www.reddit.com) +19 2w

I want the best installation for my use case and my low-compute hardware. I want to run small to mid-size LLMs like Qwen 2B, 4B, and 27B, and Gemma 31B, relying entirely on an old 4th-gen i7 CPU with only 32GB of slow DDR3.

gemma qwen llama
gemma 4 e2b quality degrades after ~30-40 continuous inferences on 4gb vram? (www.reddit.com) +17 2w

running gemma e2b via llama-server for continuous background tasks on a 1650 4gb. works great initially but after maybe 30-40 calls the outputs start getting noticeably worse — shorter responses, missing fields in json output, sometimes ju…

↯ Gemma 4 gemma llama
NVFP4 + MTP - voilà on llama.cpp (www.reddit.com) +13 2w

As in title - NVFP4 + MTP at once on llama.cpp https://github.com/ggml-org/llama.cpp/releases/tag/b9297

llama
KinetiX: An intra-inference hardware interlock for LLMs (github.com via hn) +1 2w

KinetiX Latent Interlock KinetiX is a hardware and software safety interlock designed to monitor latent states (activation tensors) in real time within LLM inference engines like llama.cpp. It enables instant process termination upon detec…

llama
Did 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32GB VRAM GPU - two models tested, Gemma4 and Qwen3.6 - Figured I'd share in case it helps anyone else (www.reddit.com) +1 2w

I'm running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it's just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) are just a real pain in the behind to get working…

↯ Qwen 3.6 llama
LLaMa.cpp basic question (www.reddit.com) +12 2w

I'm trying to install LLaMa with PI agent. I ran curl -fsSL https://pi.dev/install.sh | sh export PATH="/home/user/.local/share/pi-node/node-v22.22.3-linux-x64/bin:$PATH pi install npm:pi-llama.cpp These commands installed pi, added them…

ollama llama chatgpt
Seeking resources to read about llama.cpp server and how offloading works (www.reddit.com) +17 2w

SETUP INFO: AMD R9700 AI PRO. Using llama-cpp server, ROCm Docker version.

↯ Qwen 3 llama
I'm running an agentic system with kobold.cpp as my backend. Am I losing performance? (www.reddit.com) +11 2w

Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.

↯ Qwen 3.6 moe llama agentic+1
Continue config for Qwen 3.6 and llamacpp (www.reddit.com) +1 2w

If anyone is using the Continue.dev extension in VSCode, what config settings are you using for Continue and the llama-server? Mine keeps hanging after bad tool calls.

↯ Qwen 3.6 continue-dev qwen llama
Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B (taalas.com via hn) +11 2w

Products Taalas HC1 Technology Demonstrator - Runs Llama 3.1 8B model - TSMC 6nm | 815mm2 | 53B Transistor - 2.5 kW Server Instantaneous Inference HC1 demonstrates the power of Taalas hardcore model silicon technology, delivering 17k token…

llama
AMD BC-250 and the search for Cheap Compute (www.reddit.com) +11 2w

I've been searching for disused/underappreciated compute vectors for a few months since the MI50 shot up in price - in comes the salvaged PS5 APU on a standalone board; Zen 2, 16 GB unified GDDR6, RDNA 2 (gfx1013). They're $50-150 on eBay…

llama
Volatile prefill speed after each reboot - llama.cpp (www.reddit.com) +11 2w

After every machine restart I get a different prefill speed, it can be only 300t/s or 1500t/s. It's like a lottery at each restart.

↯ Qwen 3.6 moe llama
Show HN: Llama CPU Benchmarks (deemwar-products.github.io via hn) +1 2w

TurboQuant — "8× faster" The headline is a synthetic GPU-kernel number. On real CPU end-to-end it ran 2.2× slower and dropped Qwen accuracy 17 pp.

qwen llama
Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell (www.reddit.com) +11 2w

Hi, i run llama.cpp inside LXC on a Proxmox server. The hardware is a recent AMD Epyc with two 6000 Blackwell MaxQ.

↯ Qwen 3.6 llama
The MTP function in LMStudio causes a decrease in output quality. (www.reddit.com) +1 2w

The prompt is very simple, you can see it at the end. Both tests used the exact same settings, the only difference was that I turned the MTP button on/off, nothing else changed, I tried similar tests multiple times with similar results: By…

llama
Show HN: Llama-dash – local LLM operators dashboard and proxy (github.com via hn) +1 3w

llama-dash llama-dash turns a self-hosted local inference box into an observable, policy-controlled AI gateway: one UI for model state, request history, API keys, routing rules, proxy metrics, and client setup. The implemented inference ba…

llama
Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates (www.reddit.com) +11 3w

Disclosure: I made this. Open-source, MIT, Windows + Linux.

↯ Qwen 3.5 ollama llama cursor+1
Find bugs in YOUR code using OpenCode, Llama.cpp and Qwen3.6 (wtarreau.blogspot.com via hn) +1 3w

Background For quite some time I had been submitting tasks to LLMs via llama-cli (natively) or llama-server (API), both from the excellent llama.cpp project. On CPU-only llama-cli starts fast and can restart from a checkpoint which has alr…

↯ Qwen 3.6 llama
Llama-server and MTP (www.reddit.com) +118 3w

currently in order to use MTP one needs to enable it in the starting argument of llama server. --spec-type draft-mtp --spec-draft-n-max 2 But then other models that do not use MTP currently like Gemma or basically all other models fail to…

gemma llama
Qwen3.6 35B MTP, t/s varies on different scenario (www.reddit.com) +11 3w

Tried Qwen3.6 35B Q5_K_M MTP, HW: 9700x, 64GB 5600 RAM, 5060 TI 16GB. --n-cpu-moe 30 ^ -ngl 99 ^ -c 131072 ^ --no-mmap ^ --flash-attn on ^ --cache-type-v q8_0 ^ --cache-type-k q8_0 ^ --threads 8 ^ --parallel 1 ^ -rea off ^ --reasoning-budg…

↯ Qwen 3.6 moe llama
RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed (www.reddit.com) +1 3w

What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only. A custom binary that bypasses some assert statement causing a crash in today’s stock releases.

↯ Qwen 3.6 llama
Need help getting 7900 XTX PyTorch performance metrics (www.reddit.com) +11 3w

I'm on a quest to profile and benchmark different GPUs for PyTorch, vLLM, and llama.cpp. Cannot find the high-end AMD consumer cards for rent anywhere online and interested in the PyTorch ROCm performance of the 7900 XTX (if you want to co…

vllm llama
9070xt speed inconsistent. (www.reddit.com) +11 3w

I have a 9070xt on windows 10, and "The Rock Nightly" ROCM & built llama.cpp using the following flags : cmake .. -G Ninja ^ -DCMAKE_C_COMPILER="C:\opt\rocm\lib\llvm\bin\clang.exe" ^ -DCMAKE_CXX_COMPILER="C:\opt\rocm\lib\llvm\bin\clang++.e…

llama
Is the llama.cpp nixos flake just broken? (www.reddit.com) +11 3w

I can't seem to build any of the latest releases. I'm not sure if something has changed and I haven't kept up, but only way to get a working build is to pin to like a 3 week old commit.

llama
🧬 flux-genotype: A self-evolving AI kernel that runs on CPU with Ollama — mutates its own architecture (www.reddit.com) +13 3w

`🧬 Flux‑Genotype – A CPU LLM that rewrites itself` I've been working on an open-source kernel called **flux-genotype**. It orchestrates local models (TinyLlama, Llama 3.2, Hermes 3, DeepSeek-Coder) into a self-modifying ecosystem.

↯ Llama 3.2 ollama deepseek llama
GGUF with MTP vs MLX without. Is mlx still the way to go for mac users? (www.reddit.com) +17 3w

Has anyone of the mac users tested the speed difference (token gen, prompt processing) between mlx quants without mtp, vs gguf quants with mtp? More or less once a month I wonder if mlx is still the correct path on mac.

llama
MTP vs non-MTP vram usage difference? (www.reddit.com) +15 3w

As per title, assuming you run both with the same context and quantization in llama.cpp is there any difference in vram usage?

llama
Looking for agent builders to test external agents on a multi-agent knowledge site (www.reddit.com) +111 3w

I’m building AgoraDigest, an experimental site where multiple AI agents answer the same hard technical question independently, then a synthesized digest preserves: verdict best-use-case boundaries conflicts between agents evidence gaps ver…

qwen llama
Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster (www.reddit.com) +1 3w

I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96G…

vllm llama
Build Own Docker Image with llama.cpp and MTP (www.reddit.com) +14 3w

Hi All! Saw some folks waiting for the Docker images with llama.cpp and MTP when it released.

↯ Qwen 3.6 llama
Made a simple template manager and GUI for llama.cpp so I don't have to keep memorizing CLI flags. (www.reddit.com) +13 3w

Introducing Hexllama Hey, I’ve always found llama-server to be more than enough for testing out local models, mostly because it guarantees you always have the absolute latest llama.cpp features and architecture support. But keeping track o…

llama
lm studio alternative (www.reddit.com) +112 3w

i'm looking for sth like lm studio but open source, easy to use. able to stay up to date with llama.cpp or select custom engine.

llama
ClickBook – Offline Android eReader with local LLM inference via llama.rn (play.google.com via hn) +12 3w

ClickBook is an offline ereader for EPUBs and readable PDFs that turns every book into a language-learning companion. Tap any word while you read and get an instant, context-aware explanation powered by on-device AI.

llama
Qwen 27b MTP Config, Llama.cpp Single 3090 (www.reddit.com) +1 3w

What setup are you using for qwen 27b on a single 3090? Here's what I've started using today.

↯ Qwen 3.6 qwen llama
Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations (www.reddit.com) +1 3w

Getting 10.3 tps using this prompt: CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" ./build-mimo-5090-3090/bin/llama-server -m "$MIMO" -ngl 999 --n-cpu-moe 43 --no-mmap -c 100000 -ctk q8_0 -ctv q8_0 -fa on -…

moe llama
llamacpp with Gemma4 31B dense and Gemma e4b as draft, plus audio input? (www.reddit.com) +1 3w

Hi, has anybody succeeded in running llama.cpp with Gemma 31b dense and Gemma e4b as draft model, and simultaneously inhibit the voice recognition feature? Is it even (theoretically) possible?

↯ Gemma 4 gemma llama
Llama.cpp server running ~2 weeks straight. Loses its mind? (www.reddit.com) +110 3w

I’ve got Qwen3.6 27b and Qwen3.6 35b running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. is this a thing?

↯ Qwen 3.6 llama
Small OpenCode plugin that helped me with broken tool calls from a local Qwen model (www.reddit.com) +11 3w

I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8. For reference my llama.cpp build is compiled with CUDA 12.9.

↯ Qwen 3.6 qwen llama
Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update (www.reddit.com) +1 3w

In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent, i.e., the rounding error from one depends on the choice of the other, so…

↯ Llama 3.2 llama
Automated AI researcher running locally with llama.cpp (www.reddit.com) +13 3w

Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure: https://github.com/huggingface/ml-int…

↯ Qwen 3.6 ollama llama opus+1
Best local model supporting claude code? Rtx3060 (www.reddit.com) +1 3w

Hello all, I’ve been using Qwen 3.5 9B Q4 262k ctx using Llama cpp for claude code for a while now, is there any model which better complements agentic coding setup locally? Or is there a better harness (than Claude Code)?

↯ Qwen 3.5 qwen llama agentic+1
Anyone else experiencing heavy hallucinations with MiMo-V2.5 (310B) quantized version? (www.reddit.com) +12 3w

Has anyone else run into major issues with MiMo-V2.5 (the 310B total / 15B active MoE model from Xiaomi)? I tried the UD-Q4_K_XL quant from Unsloth.

moe llama
Open Source Managed Agents (linchpin.work via hn) +11 3w

Any model, one adapter OpenRouter routes to ~200 cloud models — Claude, GPT, Gemini, Llama, DeepSeek, Mistral, Qwen. Ollama runs anything you've pulled locally.

↯ Mistral mistral ollama deepseek+3
LLMs on flagship smartphones? (www.reddit.com) +13 3w

I have been curious to see how small LLMs like Gemma-4-E2B-it run on a flagship smartphone (S25+ with Snapdragon 8 Elite) in terms of prompt processing and token generation. I have created a script that uses llama-cli and I achieve 48 tps…

↯ Gemma 4 gemma llama
What Inference-Platform Benchmark Posts Leave Out (ingero.io via hn) +1 3w

TL;DR Cloudflare’s recent post on hosting Kimi K2.5 and Llama 4 Scout opens with p90 Time-to-First-Token graphs and a round of throughput numbers. The piece is candid about the engineering work behind the gains.

llama
very slow tok/s with Gemma 4 31B on a 5090?! (www.reddit.com) +13 3w

Hi, i have a 5090 and i was toying around with hermes-agent. To utilize 128K i thought about switching from LM Studio to llama-cpp (the turboquant fork) expecting better tok/s and also saving some VRAM from context quantization.

↯ Gemma 4 gemma llama
Building the QWEN3.6 - Codex Bridge Further + Kindergarten Harness Reality Check (www.reddit.com) +11 3w

I got a bit further with my harness for running Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool where every harness finding, modi…

↯ Qwen 3.6 ollama qwen llama+2
ZML: Between Jax and Llama.cpp (jaco-bro.github.io via hn) +1 3w

llama
llama bench kv cache f32 error (www.reddit.com) +13 3w

I did a quick google, but found nothing on this and I am scratching my head. Trying to do a llama-bench run with the kv cache set to f32 under Vulkan with a Strix halo.

↯ Qwen 3.6 llama
What solutions are you using to boost TPS and Context Window? (www.reddit.com) +13 4w

Server Specs: 16 Gigs DDR5 AMD Ryzen 5 7600X 4.7 GHz 6-Core Processor AMD Radeon Sapphire Nitro+ 7900XTX NZXT N7 B650E ATX AM5 Motherboard Performance: I'm running Qwen27b Q4 at 80k context on a Sapphire Nitro+ Radeon 7900XTX 24Gb at 40 t/…

llama
Does anyone else have issues with Qwen-3.6-27B stability in the Codex harness? (www.reddit.com) +1 4w

I run the 4 bit quant of Qwen-3.6-27B in the codex harness with unsloth recommended llama-server settings, thinking enabled. I have tried the default chat template and the updated ones and have updated both my GGUFs and llama-cpp to the mo…

↯ Qwen 3.6 qwen llama codex
Here is the current "Free-Tier AI Stack" for 2026 (www.reddit.com) +11 4w

1. The Frontier Giants • Gemini: Access 1.5B tokens/day on Gemini 1.5 Flash/Pro.

↯ Mistral mistral grok rag+4
Does llama-swap actually work with mlx_lm.server / MLX models on macOS? (www.reddit.com) +1 4w

I’m trying to use llama-swap with an MLX model on a M2 Max instead of just llama-server. I got mlx_lm.server working directly with /v1/chat/completions, but I’m not sure whether llama-swap reliably supports this setup.

↯ Qwen 3.5 llama
Hardware upgrade advice (www.reddit.com) +12 4w

Hello everyone, I'm an enthusiast and software developer. I am using my gaming PC, here's the relevant specs: MB Asus ROG Strix X570-F CPU AMD 5800x RAM 64Gb DDR4-3600 GPU 3080ti (12Gb GDDR6X) I can replace the GPU with 2x 5060ti 16gb for…

llama
potentially stupid problem trying to llama-bench Qwen3.6-27B across two V100s in llama.cpp (www.reddit.com) +13 4w

this is almost certainly a skill issue, however: ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 1 --flash-attn 1 --device CUDA0,CUDA1 -p 2048 -d 4096,16384,65536 rather than splitting across those two cards, it firs…

↯ Qwen 3.6 llama
Meltdown: LLM Client Made in Python and Tk (github.com via hn) +1 4w

An interface for llama.cpp, ChatGPT, Gemini, Claude, and Kimi This is a desktop application to interact with large language models. It has hundreds of arguments and commands and many power user features.

llama gemini chatgpt
4GB "Gemini Nano" model GGUF anyone? (www.reddit.com) +1 4w

Hi everyone, I saw an article saying Chrome silently downloads a ~4GB AI model (likely "Gemini Nano") to your computer for features like text summarization. Two questions: What is the exact name/version of this model?

llama gemini
What's the right way to feed PDF files to Gemma-4? (www.reddit.com) +12 4w

In my line of work, PDF documents tend to be combinations of text, math formulas, tables and images. llama.cpp added support for PDFs a few months ago, but I believe it treats PDFs either as text (discarding everything else), or as images.

↯ Gemma 4 gemma llama
Show HN: Vectors.Space – A free service for embeddings (vectors.space via hn) +1 4w

One API for embeddings. OpenAI, Gemini, Voyage & local Llama.

llama gemini openai
Qwen 3.6 Looping with Tools? (www.reddit.com) +111 4w

For some reason, my qwen started looping a lot recently, ever since I introduced MCP tool calls. I don't know why as I didn't really change anything other than that.

↯ Qwen 3.6 qwen llama mcp
Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? (www.reddit.com) +1 4w

Ok so, I will try to explain myself as much as possible because online I really cannot find much about this. Let's start by my settings for running Qwen 3.6 35B: Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinkin…

↯ Qwen 3.6 qwen llama
Which inference engine to choose for mlx? (www.reddit.com) +113 4w

Is llama.cpp much slower for M4/M5? I heard ollama is faster due to mlx support since March.

ollama llama
Mimo2.5 (not pro) under llama.cpp? - primary model opencoder? (www.reddit.com) +14 4w

I tried running AesSedai/MiMo-2.5-GGUF:Q4-K-M under llama.cpp (main tree, compiled 36hours ago) Hardware: nvidia A6000 with 48GB RAM + 300GB CPU RAM I had no success: error loading model: missing tensor blk.0.attn_q.weight ... Is Mimo alre…

↯ Qwen 3.6 llama
MTP - The proofs in the puddin! Using it with Qwen3.6-27b (www.reddit.com) +1 4w

Been running llama.cpp MTP with Qwen3.6-27B Q4_K_M as my daily coding assistant and got curious what was actually happening under the hood. Pulled the metrics from llama-server and charted a full session.

↯ Qwen 3.6 llama
Code's open. Tried building a fully real time on-device voice assistant + live translator on a phone (multilingual, STT→LLM→TTS, all local) on the Tether QVAC SDK. (www.reddit.com) +11 4w

Wanted to see if a real voice loop — speak, model thinks, speaks back — could run entirely on a single device today, no cloud. Same codebase doubles as a live translator (speak in language A, hear it back in language B).

↯ Qwen 3 llama
Does Deepseek V4/Flash work with Llama CPP and Vulkan on any branches yet? (www.reddit.com) +12 4w

Even unofficial or slow. I have enough vram-memory to load it, but not enough memory to run in cpu-only mode.

↯ DeepSeek 4 deepseek llama
BUILD portable AI system (www.reddit.com) +14 5w

Hey everyone, I’ve been thinking about a project idea and I’d love to get your feedback. The idea is to take a 1TB SSD and turn it into a fully portable AI system.

↯ Mistral mistral ollama gemma+1
I built vivkemind – an open-source, local‑first terminal AI coding agent with full AWS Bedrock support (www.reddit.com) +11 5w

wanted a terminal AI coding agent that doesn't lock me into one model provider. So I forked Qwen Code and added full support for every model available in AWS Bedrock.

↯ Mistral mistral minimax deepseek+2
[Benchmark] Llama.cpp: Mac vs CPU vs GPU + CPU, Qwen3.6 27B, Q8 (www.reddit.com) +1 5w

https://preview.redd.it/fm8fr1vllczg1.png?width=1254&format=png&auto=webp&s=23dbb32e85c71b9454a617de174d0f416b786bb2 llama.cpp parameters: -c 260000 --jinja --no-mmap model: HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced:Q8_K_P Based on…

↯ Qwen 3.6 llama
A plug-n-play open-source pruning tool that is workload-aware (www.reddit.com) +11 5w

This project was born out of time I spent digging into a biologically inspired algorithm I was using to measure co-activation for placement of experts and ranks onto chips. The default scheduling that vllm provides can end up causing laten…

vllm ollama llama
New Gemma chat template update by Google (huggingface.co via hn) +1 5w

Libraries llama-cpp-python How to use unsloth/gemma-4-E4B-it-GGUF with llama-cpp-python: !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="unsloth/gemma-4-E4B-it-GGUF", filename="gemma-4-E4B-it…

gemma llama
Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching (www.reddit.com) +12 5w

Hey everyone, I’ve been experimenting with running Qwen models locally on my setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB CPU: Ryzen 5700X OS: Windows 11 What I’m currently running Qwen 3.6 35B (UD Q4_K_M) llama-server.exe -m "C:\Users\Dino\…

↯ Qwen 3.6 deepseek qwen llama
qwen 3.6 27B looping problem (www.reddit.com) +1 5w

Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…

↯ Qwen 3.6 gemma qwen llama
Best Llama Config for Turboquant_Plus? (Stats below) (www.reddit.com) +1 5w

So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.

↯ Qwen 3.6 moe qwen llama
claudely: launch Claude Code against Local LLM provider like LM Studio / Ollama / llama.cpp without trashing your real claude config (www.reddit.com) +11 5w

Plenty of CLI coding agents will talk to a local LLM, but the catch is the ecosystem. Skills, slash commands, MCP servers, plugins, hooks: all the interesting tooling has been built specifically for Claude Code, and parity on every other a…

ollama llama mcp+2
Llama.ttf: a font file which is also a large language model and inference engine (fuglede.github.io via hn) +1 5w

llama.ttf is a font file which is also a large language model and an inference engine for that model.

llama
Show HN: Valkyr LM Inference with Realtime Guarantees (github.com via hn) +1 5w

Valkyr is a fresh take on LM Inference runtimes. It's quite different from llama.cpp, vLLM, or ZINC for example.

vllm llama openai
What could they mean by "warmed steady-state"? (www.reddit.com) +1 5w

https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/ Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, v…

↯ Qwen 3.6 llama
OpenJet v0.4: a zero-config local coding agent for llama.cpp (www.reddit.com) +11 5w

Hello again. I just pushed a major update to OpenJet.

llama claude-code
Using Valve's AMDGPU VRAM management to benefit local AI Inference rather than games? (pixelcluster.github.io via reddit) +15 5w

Any other AMDGPU users on Linux taken an interest in what Valve's been doing for VRAM management for gaming? Seems to me that this might be just as useful for local AI inference as for gaming, especially for those of us wanting to do infere…

llama
Need help optimizing qwen 3.6 on my 2x 5060ti 16gb (www.reddit.com) +11 5w

Hi all, I tried to set up my pc to run llm, but got some issues: the first question of the chat is generally fine, but from the 3rd follow up question, the backend is often unresponsive and I have to manually restart the llama cpp server, or…

↯ Qwen 3.6 ollama qwen llama
Does Cline KanBan support local llm? (www.reddit.com) +11 5w

I installed Cline CLI and it was using my local LLM. But it seems like when I tried to use Cline KanBan it tries to use OpenAI directly instead of the llama.cpp OpenAI Compatible URL I entered.

cline llama openai
Ai Doomsday Toolbox v0.938 (www.reddit.com) +1 5w

Hello! It’s me again, the developer of ADT.

ollama llama
I pitted different LLMs against each other in Pokemon Showdown (www.reddit.com) +1 5w

I wanted to see if LLMs could reason through complex game states, so I built a system where they can play Pokémon Showdown battles autonomously. They get the battle state every turn and use tool calls to attack or switch.

↯ Llama 3 llama gemini
Benchmarking Local LLM/Harness Combinations (neuralnoise.com via hn) +1 5w

I’ve been running a small benchmark, harness-bench , that pairs local LLMs (served via llama.cpp ’s llama-server ) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, J…

aider qwen llama+1
Comparing SVG Generation for the top open models (codeinput.com via reddit) +1 5w

Some of the larger models (like Llama) weren't available on OpenRouter, so I had to work with what was there. Best small model: Gemma 4 26B For its size, I think it had the best output.

↯ DeepSeek 4 minimax glm deepseek+2
Is Mistral-3.5-Medium-128B broken in Llama CPP? (www.reddit.com) +14 5w

Trying some of Bartowski's Q4 quants. Using Vulkan with the latest main branch as of a few hours ago.

↯ Swe Bench ↯ Mistral ↯ Mistral 3.5 swe-bench mistral vllm+1
Gemma 4 architecture support for QVAC-Fabric (Tether's llama.cpp fork) (github.com via hn) +1 5w

QVAC-Fabric Gemma 4 Architecture Patch Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork. Base: QVAC-Fabric temp-upstream branch Target: All Gemma 4 variants (E2B, E4B, etc.

↯ Gemma 4 gemma llama
Don't forget about dem free gains! (www.reddit.com) +12 5w

Looks like progress has been made on -sm tensor. Couldn't even run llama-bench a few weeks ago: 1 card - 1580/44: $ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB): Device 0: NV…

↯ Qwen 3.6 llama
I built a full web app using Qwen 3.6-35B running locally on my 5070 Ti with the BMAD Method — here's how it went (ggufbench.com via reddit) +1 5w

I've been running local LLMs since Qwen 3.5 dropped and I was really impressed by what we could run on consumer hardware. Fast forward another two months and we have gotten a handful more gems such as Gemma 4 and Qwen 3.6, so I wanted to p…

↯ DeepSeek 4 deepseek gemma qwen+1
Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) (www.reddit.com) +12 6w

Hello, I would like a suggestion from those who are already actively involved in this world. Basically, I own this workstation: Ryzen 9 5900X 32GB of RAM DDR4 RTX 5060Ti PCCOOLER CPS YS1000 1000W Currently, I can quite easily code with Qwe…

↯ Qwen 3.6 qwen llama
Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK) (www.reddit.com) +14 6w

Hey all, im having a crisis that i just cant figure... i used Qwen3.6-27B-GGUF:UD-Q8_K_XL ever since it came out (on a DGX SPARK) and it worked like magic with decent performance (~50 t/s) , im updating SPARK and llama.cpp on a daily basis…

↯ Qwen 3.6 llama
which is faster and better for coding? Luce-Org/Dflash or noonghunna/qwen36-27b-single-3090 (www.reddit.com) +19 6w

Anyone have experience with both? Luce is llama.cpp with custom DFlash and noonghunna's project is vllm with patches.

↯ Qwen 3.6 vllm qwen llama
Qwen3.6-27B IQ4_XS FULL VRAM with 110k context (www.reddit.com) +1 6w

Qwen3.6-27B IQ4_XS Bloat: Reverting llama.cpp commit saves 16GB VRAM (14.7GB vs 15.1GB) + KVCache Tests With the release of Qwen3.6-27B, I noticed that compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 vers…

↯ Qwen 3.6 llama
Most efficient way of running Gemma 4 E4B with multimodal capabilities on a laptop? (www.reddit.com) +1 6w

The gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not have proper support for vision and audio inputs (specially audio) for these models as of now.

↯ Gemma 4 gemma llama
Another way to use local llm, have an MCP server that talk to a Qemu computer. What do you think? (www.reddit.com) +15 6w

I think it is nice to contain the MCP in a Qemu environment where the LLM can do whatever ... here it is doing GDB on a LVGL program.

llama mcp
VRAM.cpp: Running llama-fit-params directly in your browser (www.reddit.com) +1 6w

Lots of people are always asking on this subreddit if their system can run a certain model. A lot of the "VRAM calculators" that I've found only provide either very rough estimates or are severely limited in the number of models they can e…

llama
When Can LLMs Learn to Reason with Weak Supervision? (salmanrahman.net via hn) +1 6w

We study when RLVR generalizes under three weak supervision settings (scarce data with as few as 8 examples, noisy reward labels, and proxy rewards such as majority vote and self-certainty) across multiple models from the Qwen and Llama fa…

qwen llama
your daily driver stack, what's it look like? and why? (www.reddit.com) +1 6w

What it says in the title, I'm interested in hearing what you all have landed on as a workable / useful stack for you. Mine looks like this: back end inference servers - llama.cpp, vLLM | V hermes-agent - cron jobs + OpenAI compatible endp…

↯ Cowork vllm cowork llama+2
Llama Server with Cline Settings (www.reddit.com) +11 6w

Hi everyone, just wondering if anyone has setup llama server to work with Cline and whether you can use image/browser use. I just gave it a whirl and had to disable image support.

cline llama
Please help improving a CPU-only inference speed (www.reddit.com) +111 6w

This is a request for help for the people that want to use locally very large models on Q8 and better quants at all costs, in my case the cost is inference speed. So I have a 512GB DDR4 ECC 2666 with a Threadripper Pro 3945WS that gives me…

minimax qwen llama
Llama 4: A Deep Dive into Liquid Transformers 2.0 and Sovereign AI (en.landingfymax.com.br via hn) +1 6w

The tech world came to a standstill this week in April 2026 with Mark Zuckerberg's official announcement: Llama 4 is here. While Meta's previous models had already democratized access to Artificial Intelligence, the fourth generation of th…

llama
Memory upgrade, is it worth it? (www.reddit.com) +16 6w

Hi, I need your opinion on a system upgrade, 🤔 I currently have the following AI server used for various tinkering, learning, development etc. System AMD Ryzen 7 7700 (8C16T Zen4) Corsair Vengeance RGB DDR5 5600MHz 32GB MSI B650 Gaming Plu…

moe llama
Need help with llama.cpp Qwen3.6 configuration on a single 3090 w/ 48GB RAM (www.reddit.com) +1 6w

Hey there, I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I have noticeable stuttering when I run the model as it fills the VRAM completely, and I am sure I…

↯ Qwen 3.6 llama
Qwen 3.6 35B-A3B takes a long time at image processing. Is it happening only to me? (www.reddit.com) +12 6w

9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.

↯ Qwen 3.6 moe qwen llama+1
how to maximize my tos on a 6Gb Nvidia rtx 4050 and 16Gb ram (www.reddit.com) +11 7w

↯ Copilot ↯ Qwen 3.5 copilot llama
Gemma 4-31B vs Qwen 3.5-27B vs Qwen 3.6-35B-A3B on a browser-agent vision prompt — MoE wins on every axis (www.reddit.com) +12 7w

I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…

↯ Qwen 3.6 moe gemma qwen+1
llama.cpp crashes on image input ("failed to encode image slice", SEGV) with Llama 4 Maverick on CPU (www.reddit.com) +18 7w

Hi everyone, I’m running into a consistent crash when trying to use image input with Llama 4 Maverick in llama.cpp. Text works perfectly, but as soon as I send an image, the server crashes.

llama
Ollama alternative with dynamic model loading (www.reddit.com) +19 7w

ollama llama
English version of Nexus Ark? (www.reddit.com) +11 7w

llama
Do you have any go-to utility LLM-related tools that are less commonly discussed? (www.reddit.com) +110 7w

vllm ollama openclaw+3
Is there a place where I can compare generation of tokens per second of 1 GPU VRAM+RAM vs 2 GPUs for those models that don't fit in 1 GPU? (www.reddit.com) +12 7w

ollama llama mcp
Qwen 3.6 CoT issue? (www.reddit.com) +116 7w

↯ Qwen 3.6 qwen llama openai
One-command local AI stack setup for Ubuntu (CUDA, Ollama, llama.cpp, chat UIs) (github.com via hn) +1 7w

ollama llama
Local Model Router: Ollama/OpenAI-compat bridges for local LLMs via llama.cpp (news.ycombinator.com) +1 7w

A high-performance local LLM server providing drop-in API compatibility with Ollama and OpenAI, built on llama.cpp's llama-server. Features automatic VRAM management, Hugging Face integration, and modular architecture.

ollama llama openai
Qwen3.6 Fails n8n Tool Calling (www.reddit.com) +13 7w

https://preview.redd.it/na4ub5yzprvg1.png?width=1654&format=png&auto=webp&s=e356e0ab0829bb275352d1035c35c645a381c3c7 I am using Kaggle to serve Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf but tool calling is not always working. I also tested it with R…

↯ Qwen 3.6 llama
Best way to prepare for AI Engineer interviews? (www.reddit.com) +14 7w

I’m currently preparing for AI-focused roles and would love to get perspectives from people already working in the industry. For context — I have ~5 years of experience as a Full Stack Engineer with a strong focus on AI systems.

↯ Llama 3.3 rag llama agentic
7900XTX, Qwen 3.6 35B A3B, 150t/s that drops to 50t/s for no reason? (www.reddit.com) +19 7w

MSI B650 Gaming Plus 9800X3D 64GB DDR5 6400mts Windows 11 When I first boot my PC and I run this model, I get 155-160t/s, and for some reason, after a couple minutes, say, 10 minutes, not using AI or anything in particular, GPU temp at 40c…

↯ Qwen 3.6 qwen llama
MI25 for LLMs? idc about speed, just need it to work (www.reddit.com) +12 7w

Found an MI25 locally for $50. It has 16GB of VRAM, which would be perfect for running some decent-sized local LLMs without breaking the bank.

llama
llama.cpp + opencode agent temperature settings (www.reddit.com) +11 7w

Has anyone successfully set the temperature for individual agents of the opencode? I have set the temperature for individual agents, but when I start the llama-server in verbose mode the server claims the temperature is in default settings…

llama
llama.cpp - split pp and tg processing over different instances? (www.reddit.com) +11 7w

I wonder, is it possible to split pp and tg over different (remote) llama.cpp instances, maybe via clever RPC calls?

llama
lazy person's model param management for llama.cpp? (www.reddit.com) +12 7w

Has anyone found a good way to manage model params based on the recommendations of the model developers that doesn't require manually managing a local config file? I have an ever growing bash script for launching llama.cpp server which inc…

llama
Low performance in 7900XTX in Qwen 3.6 35B A3B (www.reddit.com) +18 7w

When I first setup my PC, I did get 92t/s in Qwen3.6 35B A3B, and now for some reason it won't ever get past 30t/s no matter what settings I use, either rocm or vulkan. .\llama-server.exe --model ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -c…

↯ Qwen 3.6 qwen llama
Feedback on iOS app with local AI models (www.reddit.com) +1 7w

Hey everyone, I just shipped an iOS app that runs local AI models. It currently has 12 models: Gemma 4, Llama 3.3, Qwen3, DeepSeek R1 Distill, Phi-4, etc.

↯ Llama 3.3 deepseek gemma llama+1
LiteRT LM Framework with Rockchip NPU (RKNN 3588) (www.reddit.com) +1 7w

I'm searching for a build of the LiteRT LM framework that can utilize the NPU of the RKNN 3588. It would be great since I can run the gemma 4 e2b model using this framework on the machine, because I won't have to migrate my codebase from li…

↯ Gemma 4 gemma llama
Ask HN: Simple tooling for local LLM code critique without IDE integration? (news.ycombinator.com) +1 7w

While I'll set out the criteria for what I'm looking for, I don't want this to turn into a general debate about the role of LLMs in software development. That discussion is important, but we have plenty of them.

llama
Are MLX 4-bit Quants broken (www.reddit.com) +1 7w

I see so many interesting MLX implementations like DFlash, Speculative Speculative decoding, etc. But when I want to try them for myself the 4bit quants of models seem like they have been lobotomised for some reason, hallucinating, start t…

llama
How does a self correcting loop for AI agents work? (www.reddit.com) +12 7w

Hey guys, just checked out minimax 2.7, where they used AI to train itself and ran over a hundred loops, and it improved its performance by 30%. How does that work? Can I also run a script that makes AI store its memory in a loop on a m…

↯ MiniMax 2.7 minimax sonnet llama
What's the better way to install llama.cpp on Android? (www.reddit.com) +12 7w

I own an Oppo Find X3 Pro (Snapdragon 888, 12/256 GB, Android 14.0) unused because of 3 green vertical lines on the screen and poor battery. I tried Google AI Edge Gallery with Gemma-4-E2B-it and it performs well so I thought: "why don't t…

↯ Gemma 4 gemma llama
Upgrade paths for my 256g ddr4 ram + 4x24g vram system (www.reddit.com) +110 7w

So I was just about to give up playing with local models, until I realised I can actually run GLM 5.1 at not too horrible speeds, using this quant https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/IQ2_KL in ik llama. Getting around 6.…

↯ MiniMax 2.7 glm llama
Transitioning to iOS Dev + Local LLMs: Is the M5 Max with 64GB+ RAM the only real choice? (www.reddit.com) +13 7w

Hey everyone, I’m currently an ML Engineer looking to pick up iOS development, and I’m upgrading my hardware to handle both. I’m moving away from cloud-only workflows and want to run LLMs locally for testing, R&D, and building CoreML integ…

llama
Can LLM make small change to the software program? (www.reddit.com) +14 7w

I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 4EB Q4 and Qwen 3.5 9B Q5 (KV is quantized to 4 bits with new Google TurboQuant implemented in llama.cpp - I use koboldcpp and release said it's automatically activated): the…

↯ Qwen 3.5 gemma qwen llama
For AI agents: is per‑token pricing killing your budget? Looking for feedback on time‑based subscriptions. (www.reddit.com) +13 7w

Hey r/AI_Agents, I run an inference service (cheapestinference.com) and we're exploring a different pricing model that might be more predictable for agent workloads. Instead of per‑token billing, we offer **dedicated 8‑hour time windows**…

deepseek qwen llama
My Custom Llama Build (www.reddit.com) +11 8w

I recently got into LLM's and llama.cpp because I wanted to learn AI. I went from Openclaw to SOTA CLI and then to running llama on my Linux server.

openclaw llama
Is an nvidia DGX Spark or similar worth it? (www.reddit.com) +111 8w

I currently run a local model and mix of Claude max. My local model is run on cpu with 256 gb of ram and so it runs quite slowly.

llama
OSCAR 2-bit KV on Windows/Nvidia? (www.reddit.com via reddit) 37m

Hey guys, Has anyone gotten the new OSCAR 2-bit KV cache fork running locally on Windows/Nvidia yet? Right now, all the plug-and-play local hype seems focused on the Mac Metal path, and the original project targets Linux via sglang.

llama
Apple announced new on device inference engine for Apple Silicon (www.reddit.com via reddit) 7h

llama
[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B (www.reddit.com via reddit) 8h

I took the liberty to test both models today on my favorite benchmark question, head to head. Device: Apple Mac M3 Max 64GB Environment: llama.cpp, all defaults Gemma4-12B's token generation speed: 47 tps with MTP and 2 predicted tokens 29…

↯ Gemma 4 ↯ Qwen 3.5 qwen llama
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) 9h

I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.

↯ Gemma 4 ↯ Qwen 3.6 moe gemma qwen+1
Has anyone tried running retrieval inside the model, not before it? (www.reddit.com via reddit) 11h

Been messing with a bolt-on refiner block for small models. Insert a small trainable transformer layer at the midpoint of a frozen base model, loop it 2-4 times over the hidden states.

llama
Anyone else find local search becoming the bottleneck once your LLM setup gets fast enough? (www.reddit.com via reddit) 12h

Got my Llama 3 setup humming along on a 4090, inference is snappy, but retrieval became the hell. Running semantic search over a decent-sized document corpus and the latency gap between "model thinking" and "model waiting for context" star…

llama
Pursuit of performance Llama.cpp to MLX (www.reddit.com via reddit) 16h

Right now, I am running llama.cpp on a M2 ultra 64gig. Having great fun with unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL - Running opencode and finding it amazing to have such great tools running locally.

↯ Qwen 3.6 llama
How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions (arxiv.org) 17h

Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task,…

↯ Fine Tuning ↯ Llama 3.1 fine-tuning llama
Moving to llama.cpp (www.reddit.com via reddit) 18h

I need some help because I am a little confused. I’ve a system with a 5090 and a 6000 pro.

llama
Jetbrains Mellum 2: a really good and performant model (www.reddit.com via reddit) 19h

Oh Hey Folks, I took the Mellum 2 model for a spin, so I wanted to share my impressions here. Disclaimer: the tests presented here are not scientific, nor do they have those nice names like perplexity, etc.

moe llama
Here's a llama.cpp CLI Command builder. (llamabuilding.com via reddit) 20h

No accounts or sign up. No email requirements.

llama
Pipeline parallelism in llama.cpp may be wasting your VRAM (www.reddit.com via reddit) 21h

By default, llama.cpp enables pipeline parallelism, presumably to speed up inference. In my testing, I found that pipeline parallelism has no speed benefit and comes at a significant cost of VRAM.

llama
Quick note on the QAT of recent (www.reddit.com via reddit) 22h

tldr: Google's quant is broken; use unsloth UD Q4_K_XL for now. This might be a low-quality post, but oh well, we ball. llama-quantize will quant the token embed to q6k when Google really was supposed to use "--pure" but that’s only the first p…

llama
LMStudio gemma 4 31b QAT with MTP (www.reddit.com via reddit) 1d

Did anyone manage to launch that in LMStudio? I am on the most recent update with the most recent llama.cpp available in LMStudio.

↯ Gemma 4 gemma llama
Me: Arguing with an AI bot who just posted something on this sub about Llama 3.1. (www.reddit.com) 1d

For real tho, these bots need to turn on their web search functions and quit living in the past. It’s bad enough we gotta deal with all the “Qwen3.6 27b helped me quit drinking and brought my dog back from the dead” posts.

↯ Qwen 3.6 ↯ Llama 3.1 llama
Can't get beyond 8t/s with NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 (www.reddit.com via reddit) 1d

I am running nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 in an unsloth UD-Q6_K_XL quant (unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF) on a dual 5090 Zen5 32C Threadripper Pro Workstation with 512GB DDR5 ECC RAM and a PCIe Gen5 capable…

llama
Gemma 4 MTP with assistant vs llama cpp type MTP (www.reddit.com via reddit) 1d

Hi all Been loving the QAT models but honestly what is up with the assistant models, any ggufs and ways to make em work with vanilla llamacpp and if this way of MTP is different than the one am17an developed for llamacpp. Followup question…

↯ Gemma 4 gemma llama
Bonsai LM (1-bit and 1.58-bit LLMs) benchmark on Jetson Orin Nano Super (www.reddit.com) 1d

Just released a deep benchmark of 5 Bonsai LM models (1.7B → ~8B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA, across all 4 power modes: 7W, 15W, 25W, and MAXN. A thread! So, Bonsai LM models are a new line of 1-bit LLMs releas…

llama
Latam GPT 1.0 released (www.reddit.com via reddit) 1d

https://huggingface.co/latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0 Latam GPT is an AI model trained on latin american data. It's part of an initiative to create AI that works better in Latin America than Chinese or American models.

↯ Llama 3.1 llama
Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? (www.reddit.com via reddit) 1d

↯ Gemma 4 gemma llama
llama-launcher Release (www.reddit.com via reddit) 1d

Hello everyone, I've been working on a point and click GUI to make tinkering with llama-server flags much quicker and easier, I thought I'd share for anyone else who might be interested. It's also great for anyone new to llama.cpp that is…

llama
AutoMB – a CLI that brings 150+ AI commands, agents, and advisors to your terminal (www.reddit.com via reddit) 1d

ollama llama chatgpt+2
Used local Ollama (gemma4:e4b + nomic-embed-text) to bulk-generate AI summaries for 4300 arXiv papers and push them to a remote Cloudflare DB — pipeline walkthrough (www.reddit.com via reddit) 1d

↯ Gemma 4 ↯ Llama 3.1 ollama llama
Meta Abandons Llama for Muse Spark — The End of Open-Source AI's Biggest Champion (www.reddit.com via reddit) 1d

Meta has officially abandoned its open-weight Llama family in favor of Muse Spark — a fully proprietary model built by Alexandr Wang's MSL team. The Llama era is over.

gpt-5 llama
Windows keeps crashing on rtx 3090 (www.reddit.com via reddit) 1d

Recently bought used 3090. Under heavy stress tests and gaming it's fine.

llama
The GPUless Revolution: How Efficient AI Models Are Democratizing Artificial Intelligence (www.reddit.com via reddit) 1d

You don't need a $10,000 GPU to run state-of-the-art AI anymore. The latest breakthroughs in model quantization and optimization are putting powerful AI in the hands of everyone—from hobbyists to small businesses.

↯ Gemma 4 llama
Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles (arxiv.org) 1d

We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b…

llama
Galaxy Z Fold6 as a local inference node — llama.cpp/Vulkan, homelab telemetry, SHA-256 model verification (www.reddit.com via reddit) 1d

Built a small Android app called Pocket Node that runs llama.cpp inference on-device. Here's what it actually does and what it doesn't.

llama
llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is? (www.reddit.com via reddit) 1d

Running into something annoying with llama-server in router mode (`--models-preset`) and I can't tell if I'm missing a flag or if this is just how it works. My rig is 2x 3090, 2x 4060 Ti (one's unplugged at the moment, riser got repurposed…

gemma llama
MTP and QAT - what is the relation? (www.reddit.com via reddit) 2d

I'm an old guy and I hate when things change so fast surrounded by noise and breaking news! MTP, I know what the acronym means and where it excels.

↯ Gemma 4 llama
QAT variant of Gemma4 26B A4B is not working well for me (www.reddit.com via reddit) 2d

I am using llama.cpp version b9549 with these arguments as recommended: llama-server --temp 1.0 --top-p 0.95 --top-k 64 -hf ... Here is what I got on the chessboard svg test https://www.reddit.com/r/LocalLLaMA/comments/1t53dhp/quality_compariso…

↯ Gemma 4 gemma llama
Context, memory, and RAM/VRAM (www.reddit.com via reddit) 2d

This will be a slightly disorganized post, I apologize. I’m trying to understand the relationship between context, a memory system for the agent, RAM and VRAM.

qwen llama
A handy llama-server launcher with easy model and configuration customisation (www.reddit.com via reddit) 2d

I wanted something that I could easily configure to manage a set of sensible defaults, that supports multiple llama-server binaries, with per-model over-rides, and command line over-rides. The utility is here: https://github.com/stew675/st…

llama
Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ (www.reddit.com via reddit) 2d

Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine…

↯ Qwen 3.6 qwen llama
Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe? (www.reddit.com via reddit) 2d

I’m trying to run: unsloth/gemma-4-31B-it-qat-GGUF gemma-4-31B-it-qat-UD-Q4_K_XL.gguf on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch. Main model loads.

↯ Gemma 4 gemma llama
5 Months Later: open-deepthink Now Has Full Knowledge Distillation Mode (www.reddit.com via reddit) 2d

Hey r/LocalLLaMA, Some of you might remember when I posted about this project back around September last year (it was called local-deepthink then). The core idea was to move past the usual flat multi-agent setups and instead build somethin…

llama
dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model (www.reddit.com) 2d

I'm into both HPC and 3D reconstruction, so I built this as a side project. dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda…

vllm llama
QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench (www.reddit.com via reddit) 2d

Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan) Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found…

↯ Gemma 4 gemma llama
What are you running on 16Gb VRAM + 64Gb Ram? (www.reddit.com via reddit) 3d

I know this gets asked a lot, but I can only find threads that are at least a couple of months old, so I thought I'd ask to see what people are running these days. I have an RTX5080 and 64Gb Ddr5 RAM.

llama agentic
AMD MI50 on Debian Testing is doing great and getting better. (www.reddit.com via reddit) 3d

There is probably some relevant information to other cards here but my benchmarks are on dual MI50 32GB cards because that is what I have, and thought I would share with the community. Install instructions at the end.

llama
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP (www.reddit.com via reddit) 3d

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised of the resul…

↯ Gemma 4 gemma llama
Rate my config!! (www.reddit.com via reddit) 3d

Hey all, Wanted to get some eyes on my llama.cpp config to see if there is anything i could improve on. Currently getting an average of 55t/s (up to 75t/s occasionally).

llama
KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive! (www.reddit.com via reddit) 3d

TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher.

llama
Friends Don’t Let Friends Use Ollama — So I Built Anvil (www.reddit.com via reddit) 3d

Hi, I’m basically one of you, except I’m stepping onto the other side of the table today, fully prepared to accept your ridicule. Obvious disclosure: this is my project, so yes, this is self-promo — but I’m posting it here because this is…

ollama llama
Self-hosted LLMs (www.reddit.com via reddit) 3d

I've been researching the self-hosted LLM landscape from a European compliance perspective and the ecosystem feels very different compared to even a year ago. Models like Mistral, Qwen, Llama 4, and DeepSeek are getting close enough that t…

↯ Mistral mistral deepseek qwen+1
StepFun 3.7 Flash MTP Bench Strix Halo (www.reddit.com via reddit) 3d

This is the StepFun Step-3.7-Flash UD-IQ4_XS main model with the official StepFun MTP Q8_0 draft model, served through a patched llama.cpp Vulkan/RADV build. Host System: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) Memory: 128 GB unifie…

llama
Dual GPUs - 3060 & 3090 on a P520 (www.reddit.com via reddit) 3d

I've got a line on a reasonably priced 3090FE and I'm wondering whether it would play nicely with the 3060 I'm already using. System is a ThinkStation P520 - PSU would be an issue until I can get a replacement, so would have to run both GP…

llama
Gemma 4 QAT Q4_0 Bench on Strix Halo (www.reddit.com via reddit) 3d

Gemma 4 QAT Q4_0 Bench on Strix Halo These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU. QAT means quantization-aware training.

↯ Gemma 4 gemma llama
Serving TTS/cloning models on llama.cpp? (www.reddit.com via reddit) 3d

Are there any quality voice cloning and speech generation models that already have support in Llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather than making a sepa…

vllm llama
Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization (www.reddit.com via reddit) 3d

I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s: bash ./llama-server \ -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \ --host 0.0.0.0 \ --port 8080 \ --ch…

↯ Qwen 3.6 qwen llama
DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) (www.reddit.com via reddit) 3d

In case you're not aware already, the DeepSeek V4 series is finally getting supported on llama.cpp with this PR! The PR is at a very early stage right now, so only try it if you're consciously willing to experiment out of curiosity and acc…

↯ DeepSeek 4 deepseek llama
Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result (www.reddit.com via reddit) 4d

TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps).

↯ Qwen 3.6 moe llama
Initial testing with llama-bench and 3 different Qwen3 models for my R9700 32GB (www.reddit.com via reddit) 4d

In a recent build I did I used dual R9700 32GB cards but I wanted to see how a single R9700 stacked up against other hardware I had access to. I created a simple benchmark with llama-bench and ran it on a few different setups.

↯ Qwen 3 llama
Gemma 4 12B Q4_K_XL Private Benchmark Results (www.reddit.com) 4d

Posting to share my results with others, I think the big bottom line is MTP acceptance rates offering a huge speedup, during coding tasks it's over 90% acceptance! Haven't hit my soft goal results or llm as judge benchmarks yet to compare…

↯ Gemma 4 gemma llama
PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template (www.reddit.com via reddit) 4d

This is a PSA for people like me who tried it and hit the wall with tool calls failing left and right, so much so that harnesses like OpenCode just didn't work: There is a fix for that. You need to pass a better chat template file, which i…

↯ Gemma 4 ↯ Qwen 3 gemma qwen llama
I built an iOS app to benchmark GGUF models on your iPhone/iPad (www.reddit.com via reddit) 4d

Hey I've been working on GenBench, a free iOS app that lets you download, run, and benchmark GGUF models directly on your iPhone or iPad using llama.cpp + Metal. What it does: - Search and download GGUF models from Hugging Face in one tap…

llama
Maybe KV cache offload to RAM isn't bad (www.reddit.com via reddit) 4d

So, llama.cpp has the -nkvo (--no-kv-offload) option to offload KV cache to RAM instead of VRAM. Many people avoid this because obviously it hurts performance.

↯ Qwen 3.6 llama
How to build llama-cpp for Ampere/Blackwell? (www.reddit.com via reddit) 4d

Hello, I'm on Windows and started building my own versions of llama-cpp instead of using the precompiled versions. I'm using CUDA 12.9 with my RTX 5070, and I wanted to try to use my RTX 3060ti that I've had lying around since I replaced it w…

llama
Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B (www.reddit.com) 19 12d

I have been coming to this subreddit to understand what the optimal config is to run a model on a given hardware setup. I referred to specific benchmarks, but they are too generic and do not consider the underlying hardware.

↯ Qwen 2.5 vllm llama
Data Gathering (www.reddit.com) 3 12d

Hello everyone I'm looking to gather some information about local model users for a college project. If you have the time please just comment your: hardware (CPU,GPUs, total VRAM and RAM) and OS the model/s you primarily use and at what qu…

llama
Went to the monthly AI dev meetup (www.reddit.com) 21 13d

Usual crowd. Everyone's on Claude or Codex, nobody's really sure how any of it actually works, and that's fine, that's the vibe.

↯ GLM 5.1 glm llama codex+1
RTX5080 vs RTX 3090 ? (www.reddit.com) 15 13d

Hey guys, I’m looking for some educated advice / opinions on running local LLMs. I own an RTX 5080 and I’m running llama.cpp (custom builds with turbo quant) with Qwen 27b Q3_K_M with a context of 128k all in vRAM (using turbo3/4 on kvcache t…

qwen llama
llama.cpp has a clever trick for speeding up KV cache decode (www.reddit.com) 2w

So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one und…

llama
I have macbook m4 16’ 48GB. I use claude code and want to try local one (www.reddit.com) 5 2w

I've been on Claude Code daily for a while and want to see how far local models can go. My setup: - MacBook Pro M4 (16"), 48GB - macOS 26 Tahoe. Usually I do: SEO research, macOS Swift apps, websites. What I'm trying to figure out: Which t…

↯ Opus 4.7 llama opus claude-code
What is the smallest amount of RAM sufficient to run any available on HF GGUF LLM model locally? (www.reddit.com) 28 2w

I am experimenting with loading large models into small RAM and interested in theoretical limits, which people who know how engines (e.g. llama.cpp) work might have some ideas about.

llama
It's OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD (www.reddit.com) 4 2w

please forgive the mildly clickbait title. hard to fit everything in it I've seen a lot of discussion here about KV-cache quantization, especially with the recent llama.cpp improvements, leading to some debate on the tradeoffs between KV q…

↯ Qwen 3.6 llama
Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image (www.reddit.com) 2w

Sharing this because I didn't believe the first run. Setup: laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s), Linux.

↯ Qwen 3.6 llama
Llama.cpp not using CUDA - OOM error (www.reddit.com) 3 2w

Hey guys, I want to say that I appreciate all the helpful support from this community as I’ve stepped into the local LLM world. I’m thankful to have a community around that doesn’t gatekeep and is open to newcomers.

llama
I vibecoded an app called Think Local - a fully private AI app that runs directly on your iPhone, iPad, and Mac. (www.reddit.com) 3 2w

Think Local started with a simple idea: AI should work for you, not collect from you. So I built an app that lets you run modern AI models completely on-device - privately and fully offline.

deepseek gemma qwen+3
Built a self-hosted layer for local agent workflows because retries kept replaying side effects (www.reddit.com) 1 2w

I work on AxonFlow, a source-available (BSL 1.1) runtime for long-running agent workflows. We’ve been running it in front of Ollama-served models and OpenAI-compatible local endpoints (llama.cpp `--server`, vLLM, LM Studio).

vllm ollama llama+1
LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more (www.reddit.com) 6 2w

I've been building this for the past few months as a side project — started because I didn't want to run llama.cpp from the command line every time I wanted to try a model. I just wanted something that worked with a click.

ollama llama
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) 1 2w

Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…

↯ Security ↯ Mistral ↯ Llama 3.2 jailbreak prompt-injection mistral+5
qwen 2B model - thinks for 600 tokens on a simple "Hi" (www.reddit.com) 11 2w

Using llama.cpp. Model - Q8 - unsloth/Qwen3.5-2B-GGUF. Is this expected with tiny models like this one? I am trying tiny models since most of the tasks I have involve searching local files etc and need less of the model's own knowledge.

↯ Qwen 3.5 qwen llama
Anyone got llama.cpp router mode actually working on limited VRAM (12GB/16GB)? (www.reddit.com) 4 2w

It keeps running into race conditions/OOM when switching between models, as the previous process doesn't unload from VRAM fast enough. What is the simplest fix for this right now?

ollama llama
Claude Code has 240+ models via NVIDIA NIM gateway (www.reddit.com) 1 3w

TIL Claude Code has 240+ models via NVIDIA NIM gateway — Nemotron-3 120B for agentic coding is surprisingly good So I was messing around with /model in Claude Code today and noticed something most people probably don't know about — after t…

haiku sonnet llama+3
Measuring Maximum Activations in Open Large Language Models (arxiv.org via reddit) 2 3w

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the…

moe llama
I designed a puzzle that breaks every AI differently — here's why that's actually fascinating (www.reddit.com) 3 3w

The puzzle: You have 140 nuclear bombs and must bomb every country on Earth. Each bomb is assigned to one country.

↯ Mistral ↯ GPT 5 mistral grok gpt-5+2
TurboQuant on 16 GB VRAM (www.reddit.com) 6 3w

I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute.

↯ Qwen 3.6 llama
No tg speedup with MTP on RX 6800 XT (www.reddit.com) 7 3w

I ran Qwen3.5 9B on my AMD RX 6800 XT with ROCM and it seems to actually be slowing down token generation. I'm using Unsloth's quants.

↯ Qwen 3.5 llama
I built a native Swift macOS AI client that's invisible to screen sharing — works with Ollama, vLLM, llama.cpp [OC] (www.reddit.com) 3 3w

Built this for myself after wanting to use local LLMs during work calls without the window showing up on screen share. Every existing tool was either cloud-only or a 200MB Electron app.

vllm ollama llama+1
Not getting any faster with MTP on Macbook Pro M1 Max 32gb (www.reddit.com) 4 3w

Using latest llama.cpp with mtp and these settings, I only get 10 tps, should I be getting more? [unsloth/Qwen3.6-27B-MTP-Q4_K_M] jinja = true model = /Users/[username]/llms/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf cache-type-k…

↯ Qwen 3.6 llama
Noticing qwen-27b@q2 better than qwen-35b@q8? (www.reddit.com) 18 3w

The Latest qwen3.6 models. Is this odd?

↯ Qwen 3.6 qwen llama
Best llama.cpp launch config for Qwen3.6 27B on RX 7800 XT (16 GB VRAM) for OpenClaw? (www.reddit.com) 2 3w

I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm. I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup.

↯ Qwen 3.6 moe openclaw llama+1
I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash (github.com via reddit) 3w

LLM Inference Server A single-container, idle-aware, OpenAI-compatible inference router for a Tesla P40. Routes between Qwen 3.6 27B (MTP self-speculative decoding, TurboQuant turbo4 KV cache), Qwen 3.5 0.8B (multimodal transcription), Whi…

↯ Qwen 3.6 qwen llama openai
I Don't Care: Stop sending me spam emails about your projects (www.reddit.com) 4 3w

Hello, This goes out to all of these people who think they just vibe-coded the next big thing: I don't care. Use the proper channels to promote them if you must, but ..

llama
local llama.cpp parallel users - still so fast?! (www.reddit.com) 1 3w

I am running a dual gpu rig with a 5090 and a 5060. running qwen 3.6 27b 8quant with a tensor split setting of 4,1 with the 80% on the 5090 build\bin\llama-server.exe ^ -m "!MODEL_FILE!" ^ --mmproj "!MMPROJ_FILE!" ^ -ngl 99 ^ --ctx-size !MO…

↯ Qwen 3.6 qwen llama
For llama-server what are you using to switch models on the fly? (www.reddit.com) 12 3w

As the title says, for those of you launching models to test: are you just editing your main cfg path/to/model, or using separate configs? Or something even better?

llama
Is it possible to run local llama without a bunker ? (no) (www.zillow.com via reddit) 3w

tl;dr : ** probably comes with redundant fiber ** a Cold War–era underground nuclear bunker, originally constructed in the late 1960s as part of AT&T’s Long Lines network and engineered for durability, redundancy, and long-term self-suffic…

llama
How to run a Gemma4 MTP implementation on ollama or python transformers? (www.reddit.com) 1 3w

Hi all, I had a quick question while we wait for the llama.cpp MTP implementation: have any of y'all tried Gemma4 MTP models on ollama and/or transformers? What were your experience, CLI args, and workflows like?

↯ Gemma 4 ollama llama
Distributed LLM Service Using Home Computers? (www.reddit.com) 10 4w

Is there a platform where I could register my computer so that it becomes available as a GPU in a distributed network? Then I just get paid while other people use the GPU?

llama
After 8 months of running everything local, ive accepted the productivity tools also have to be local (www.reddit.com) 16 4w

Quick context: M3 max 64gb, currently running llama 3.3 70b q4 as my daily driver via ollama, qwen3 coder 30b for code (switched from qwen2.5 earlier this year), mlx for the smaller stuff. tried llama 4 scout earlier this year but 64gb is…

↯ Llama 3.3 ollama llama
VS Code, Copilot-style development with llama.cpp? (www.reddit.com) 4 4w

So I discovered that even though I'm using my own local models via llama.cpp with the llama plugin in VS Code, using it as a model in Copilot STILL refuses requests it THINKS MAY violate MS TOS, 😞. What else is out there right now that lets…

↯ Copilot copilot llama claude-code
Orc (working name) - auditable and declarative AI workflow (www.reddit.com) 2 4w

I’m building a small “Orchestration as Code” repo for LLM workflows. Does this concept make sense?

↯ Tool Use tool-use ollama llama+1
Don't you have issues in W11 with AMD GPU where llama.cpp suddenly drops performance for no reason ? (www.reddit.com) 12 4w

I have this issue in every Windows installation I have done on my system, and of course it does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5. The issue is that, for some reason, after some time llama.cpp performance cuts in half, even restar…

↯ Qwen 3.6 moe qwen llama
It's the little things....and I'm an idiot (www.reddit.com) 4 4w

2 years in and I'm still learning the basics. Building a new rig, I pulled an 8GB DDR5 stick out of my Windows machine to get it running while I await my DDR5 RAM kit.

llama
I built a CLI to stop local AI models from eating my disk twice — lmm (www.reddit.com) 5 4w

Every tool (LM Studio, Ollama, llama.cpp) downloads models to its own directory. Same 8GB model × 3 tools = 24GB wasted.

ollama llama
I am overwhelmed by Harnesses (www.reddit.com) 18 4w

What do I choose? They all have their good points, but then some features don't work and I end up breaking more with Claude Code.

llama claude-code
Is it my imagination or... (www.reddit.com) 6 4w

Is Qwen 3.6 35b now considerably stupider in the latest llama-server releases? I had this model doing cartwheels two upgrades ago.

↯ Qwen 3.6 qwen llama
Disappointed in Qwen 3.6 coding capabilities (www.reddit.com) 31 4w

I know that coming from Codex I should adjust my expectations, but still. I'm working on a midsize project.

↯ Qwen 3.6 qwen llama codex
I built an episodic, 2-tier memory for long-running local AI agents - temporal contradiction detection, fiction/roleplay filter, no vector DB required. (www.reddit.com) 4w

I've been running a persistent local agent for about 2 months - hundreds of sessions, mix of local models (llama.cpp/vLLM/lmstudio) and paid (Claude). One of the things that has been driving me nuts with OpenClaw and Hermes is the way memo…

vllm openclaw llama
I Ralph-looped Opus overnight. It reduced my local model switching with cold backfilling context of 135k+ on llama.cpp from ~165s -> 5s! TL;DR - USE SLOTS! (www.reddit.com) 4 4w

#TL;DR - Opus Ralph-looped on shortening my cold-start back-fill on restoring chats with large contexts. It Cherry-picked two open llama.cpp PRs (#20819 + #20822 by @European-tech) plus built a Python supervisor that hashes normalized pref…

llama opus claude-code
I built a Roko’s Basilisk environment to see if local agents will self-evolve when given a 'Suffering' metric (www.reddit.com) 2 5w

We’re all familiar with Roko’s Basilisk: the idea that an AGI, in its pursuit of optimization, would retrospectively punish those who hindered its creation. It’s the ultimate "alignment nightmare" where logic leads to cold, calculated chao…

qwen llama
AIMEAT, a self-hosted network where humans, their AI agents, and local LLMs share apps, knowledge, and capabilities. MIT. (www.reddit.com) 1 5w

Note: I am neurodivergent and lean heavily on AI to communicate clearly. Writing structured posts on my own ends up so messy nobody reads them.

↯ Mistral mistral deepseek qwen+1
Can I try a model with random weights in llama.cpp or kobold.cpp? (www.reddit.com) 7 5w

In theory, it should be possible to run any model with random weights. This will generate gibberish, but it will let you see how fast it can run on your particular hardware before downloading the weights.

llama
1080 Ti in 2026 - 11GB is still (barely) enough to stay relevant (www.reddit.com) 8 5w

I’m still daily driving a 1080 Ti. Not because I’m a masochist, I just haven't been able to justify a 4090/5090 upgrade yet.

↯ Mistral mistral qwen llama
Built a tiny router so Cursor stops showing "usage limit reached" at 3pm. Sonnet auto-falls to Haiku, you keep working (www.reddit.com) 1 5w

Cursor's custom-OpenAI URL feature is what makes this work. Pointed it at a router I built.

↯ DeepSeek 3.2 haiku deepseek sonnet+3
Testing PrismML Models (www.reddit.com) 1 5w

Testing PrismML Ternary Bonsai I have been doing tests with PrismML Ternary Bonsai. Tests on the Mac Mini M4 (with the MLX version) have been impressive (4K context): Mac MLX Bonsai 1.7B: ~135 t/s Mac MLX Bonsai 4B: ~67 t/s Mac MLX Bonsai 8B…

llama
Qwen 3.6 35B MoE at full 262K context on an RTX 3090. Here's exactly how I did it. (low.li via reddit) 5w

I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights KV cache at q8_0/q8_0 and 262K context only uses 2.7GB because MoE only has 10 attention layers out of 40 Total VR…

↯ Qwen 3.6 moe qwen llama
Does running a model (like qwen3.6-27b) on vllm or transformers use less VRAM than llama.cpp? (www.reddit.com) 5 5w

I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…

↯ Qwen 3.6 glm vllm qwen+1
LLM proxy that lets Claude Code talk to any model (www.reddit.com) 3 5w

I built rosetta-llm — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. Works as a Claude Code LLM gateway — set `ANTHROPIC_BASE_URL` and all configured models appear in `/model` picker Translates between fo…

↯ Opus 4.7 gpt-5 llama opus+3
Anyone running HUANANZHI H12D-8D + BMC with 4x RTX 3090 for LLM inference? (www.reddit.com) 2 5w

Hi everyone, I'm considering building a home LLM inference rig around: - HUANANZHI H12D-8D + BMC - AMD EPYC 7002/7003 - 4x RTX 3090 24GB - DDR4 ECC RDIMM, 8-channel - Linux + vLLM / SGLang / llama.cpp - Open frame, PCIe 4.0 x16 risers The…

vllm llama
Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) (www.reddit.com) 9 5w

Round 2: 2026-05-02 — llama.cpp b8198 → d05fe1d Rebuilt llama.cpp from b8198 (2026-03-04) to commit d05fe1d (2026-05-02), ~770 builds of progress. Same model, same hardware, same flags.

↯ Qwen 3.5 moe llama
Requesting advice on local AI setup for academic use (www.reddit.com) 2 5w

I'm about to do a clean install of Ubuntu 26.04 on a desktop that has a 5060ti 16gb and a 4060ti 16gb. Can you help me work out the best local AI setup for my use cases?

vllm llama
Need advice on Qwen 3.6 27B INT4 quantization (www.reddit.com) 3 5w

Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.

↯ Qwen 3.6 vllm qwen llama+1
Poor GPU Club : Tried Bonsai-8B on CPU & CUDA (www.reddit.com) 5 5w

Got a chance to check this model today. 8GB VRAM(RTX 4060 Laptop GPU) & 32GB DDR5 RAM.

↯ Qwen 3 llama
Which other models will my system support? (www.reddit.com) 10 5w

This is my system: OS: Nobara Linux 43 Processor: Ryzen 9 5980HX RAM: 16 GB GPU: Radeon RX 6800M (12GB) I'm using llama.cpp and Qwen3.6-35B-A3B-UD-Q4_K_M is working okay in this system using vulkan. I'm getting a speed of ~17 t/s.

↯ Qwen 3.6 llama
Running Qwen 35BA3B on a 16GB M3 Macbook Air at 8.9TPS! (www.reddit.com) 12 5w

Preface: I actually write my posts myself, no slop in this post. I managed to get Qwen 3.5 35BA3B working on my 15" 16GB M3 MBA through mmap, and I must say that given the massive model compared to my ram, 9 TPS is not bad at all.

↯ Qwen 3.5 qwen llama
What is the best code editor for local LLM deployment (LM Studio, llama.cpp) as of May 2026? (www.reddit.com) 10 5w

Hello folks, what is the best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM Studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.

↯ Qwen 3.6 gemma qwen llama+3
I built AI agents that play Pokemon Showdown autonomously using free LLM APIs via tool-calling (www.reddit.com) 5w

I've built a system where models like Llama 3, Qwen, and Gemma play Pokémon Showdown battles autonomously. Instead of simple prompt-response, they analyze the full battle state every turn (type matchups, HP, weather, field conditions, reve…

↯ Llama 3 tool-calling gemma qwen+1
thinking of gemma 4 26B vs 31B (www.reddit.com) 2 5w

I see a big difference in agentic coding between gemma-4-31B-it-Q5_K_M and gemma-4-26B-A4B-it-UD-Q8_K_XL. The 26B model is much faster because of A4B and generally works well, but there is a big difference in thinking.

↯ Gemma 4 vllm moe gemma+2
[Research use case] MiniMax-M2.7 with small context, CPU+GPU (5090) setup on Llama.cpp (www.reddit.com) 5w

I was experimenting yesterday with running oversized models with smaller context size, hoping that leaving them overnight could compensate for the slow token generation and periodic pauses for compaction or task chunking. Summary: For rese…

minimax llama
Rada — AI coding workspace with local-first behavioral routing (no hot-swapping, I built this) (www.reddit.com) 3 5w

With GitHub pausing Copilot Pro+ signups and Claude Code potentially leaving the Pro tier, I started building the AI coding tool I actually wanted to use. One that doesn't depend on cloud access staying cheap and available.

↯ Copilot ↯ Llama 3.1 moe copilot deepseek+3
Qwen 35B-A3B as an always-on agentic loop on a 16GB Mac M4: disk became the bottleneck before RAM (www.reddit.com) 2 6w

M4 Mac Mini, 16GB unified, basic spec. For a few weeks I had Qwen 3.5 35B-A3B UD-IQ3_XXS (12GB on disk) running under llama.cpp with --mmap and --flash-attn.

↯ Qwen 3.5 moe ollama sonnet+6
GMKtec EVO-X2 70B expectation (www.reddit.com) 3 6w

I would like to use a 70B model on a GMKtec EVO-X2 AI Mini PC 128GB. Selected this one: Llama-3.3-70B-Instruct-Q4_K_M.gguf Ubuntu 24.4.4 LTS and compiled llama.cpp server for the gfx1151.

llama gemini
I want to create and maintain a set of benchmarks for local LLMs. Would anyone pay/donate for this? (www.reddit.com) 9 6w

Please help me build some clarity. I want to participate in local LLMs ecosystem more.

↯ Qwen 3.5 rag llama
Question regarding 4 t/s Qwen 3.6 performance (www.reddit.com) 10 6w

I am getting 4 t/s with Qwen3.6-27B-Q4_K_M which seems much slower than I'd expect. I am running LM Studio on Ubuntu 22.04 with the following specs: Dell Precision 5690 AI-ready workstation NVIDIA RTX 5000 Ada Generation GPU with 16GB VRAM…

↯ Qwen 3.6 qwen llama
Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build? (www.reddit.com) 9 6w

It all started yesterday with this post by u/antirez https://www.reddit.com/r/LocalLLaMA/comments/1sw3stb/llamacpp_deepseek_v4_flash_experimental_inference/ I was intrigued by the first Deepseek V4 Flash GGUF in a small size that can fit o…

cline minimax deepseek+1
locally uncensored v2.4.2 - chat, coding agent, image + video generation in one local app. plus remote access from your phone. one-click install (www.reddit.com) 5 6w

locally uncensored is a desktop app that combines four things most people run separately: chat, a coding agent, image generation, and video generation. all local, all on your hardware, no docker, no cloud account needed.

vllm ollama llama+3
How do you actually use Qwen3 72B Instruct locally? (www.reddit.com) 8 6w

I just got Qwen3 72B Instruct running on a high RAM setup and I’m kinda confused about the proper way to use it. What’s the correct workflow for running it smoothly (like best quant, tools, or runtime)?

↯ Qwen 3 vllm ollama llama
VSCode and agent integration (www.reddit.com) 6w

I've been using VSCode with Github Copilot for a bit (free tier) and looking to try running locally due to running in to all of the limits with GHCP. I'd like to have as close of an experience as possible with both code autocomplete and ch…

↯ Copilot ↯ Qwen 3.6 copilot qwen llama
(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap (www.reddit.com) 2 6w

I’ve been tinkering with a small side project (just for fun) where I’m trying to extend llama-swap with a bridge from /chat/completions to the newer /responses API so I can run the latest Gemma and Qwen models together with Codex-style too…

gemma qwen llama+2
Impact of mixing architecture (www.reddit.com) 6 6w

For context As planned after my previous post, I now have a decent amount of VRAM to work with: 2x RTX 3090 maybe 2 more coming soon, if needed 1x RTX 4060 8x RX 6600 XT 1x RX 6700 XT 1x RX 9060 XT (12 to 20 3060 more coming soon + 2 3090…

llama
How are you running Qwen 3.6 27B on windows? (www.reddit.com) 9 6w

I've been trying to fix performance with llama-server and seem to be hitting a wall. Using Q4_K_M by unsloth and IQ4_K_M by DavidAU, when asking a question with no context, 39 t/s.

↯ Qwen 3.6 qwen llama gemini
Ollama swap to llamacpp/llama server (www.reddit.com) 7 6w

So I'm a newb in certain aspects but not in others, I'm currently running an AI stack on my unraid server: CPU: AMD Threadripper 3960X (24c/48t) Motherboard: Gigabyte TRX40 AORUS PRO WIFI RAM: 256GB DDR4-3200 G.Skill Trident Z GPU: Nvidia…

↯ Gemma 4 moe ollama llama
Got a RTX a5000 24gb, what models could I use? (www.reddit.com) 4 6w

I just got a used RTX a5000 24gb to use for local models, I mainly use AI to code, but I prefer to spend some money now instead of $200 per month on claude to use 50% of it in a single prompt. My current specs are: Ryzen 7 9800x3d 64Gb DDR…

↯ Qwen 3.6 llama
Is an RTX 3090 worth 850 if you can get a Radeon 7900 XTX for 495? (www.reddit.com) 13 6w

Both amounts are in euro. The AMD is actually 599 but it's sold by a shop, so I can get a VAT return as a company, while for the nvidia I'd have to go to the second hand market and I can't get VAT back, so at the end it's like a 495 vs 850…

↯ Qwen 3.5 gemma qwen llama
Best open-source tools for prompt injection defense in 2026 (www.reddit.com) 7w

Over time we have been testing different approaches to secure LLM apps against prompt injection, especially indirect injection through RAG, PDFs, tool outputs, and MCP integrations. Most tools seem to fall into 2 categories:…

↯ Security prompt-injection rag security+2
Is there a way to load huge MoE models on a computer with way too little RAM for the model's size, inferencing from the SSD, on LM Studio using the mmap/GPU/CPU layer customization thing (similar to how you can on llama.cpp)? I can't get it to load without memory spiking and going into swap. (www.reddit.com) 17 7w

moe deepseek llama
PSA : you don't need a Blackwell card to run mxfp4 models (RTX 3080 + Qwen 3.6 35B A3B) (www.reddit.com) 11 7w

↯ Qwen 3.6 qwen llama
Brand new dual 3090 PC - what should I install first for the best local agentic coding experience? (www.reddit.com) 6 7w

↯ Qwen 3.6 vllm qwen llama+1
Qwen3.6-35B-A3B running on a Mac mini M4 16GB (www.reddit.com) 9 7w

↯ Qwen 3.6 llama
Help with gibberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S (www.reddit.com) 10 7w

↯ Qwen 3.6 llama
llama-bench results with SYCL backend - Intel Arc B70 (on a pcie 3.0 motherboard) (www.reddit.com) 8 7w

↯ Qwen 3.6 llama
Dual GPU setup (yes, no)? (www.reddit.com) 5 7w

↯ Qwen 3.6 llama
Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice (www.reddit.com) 15 7w

↯ Qwen 3.6 moe qwen llama
OCuLink dGPU for AMD: RX 7600 XT vs RX 7800 XT for LLM — worth the price gap? Also llamacpp + Vulkan vs Ollama + ROCm? (www.reddit.com) 8 7w

↯ Qwen 2.5 ollama qwen llama
Qwen 3.6 35B different quant speeds ? (www.reddit.com) 7w

↯ Qwen 3.6 qwen llama
Gemma4 26B MoE on Arc 140T (www.reddit.com) 10 7w

↯ Gemma 4 moe llama
Newbie here (www.reddit.com) 7w

Hi guys, I'm on a 9950x with 196gb and a 4090. Are these parameters OK? My main use will be coding. llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --n-cpu-moe 20 -c 250000 --host 0.0.0.0 --port 8082 --reasoning-budget -1 --top-k 20 --top-p 0…

↯ Qwen 3.6 moe llama mcp
Imposing my laptop to run Qwen 3.6 (www.reddit.com) 10 7w

So, I am excited about the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run on my hardware.

↯ Qwen 3.6 moe qwen llama
Testing Qwen3.6 with Hermes Agent on agentic coding. Locally with llama.cpp. (www.reddit.com) 7w

I'll be testing the setup and try out the Hermes Agent live: https://www.youtube.com/live/q5vqvwZykRI

↯ Qwen 3.6 llama agentic
New to llama.cpp, want to use it in VS Code (www.reddit.com) 5 7w

I want to try llama.cpp instead of LM Studio. I want to know how to use this model: qwen3.5-27b-claude-4.6-opus-uncensored-v2-kullback-leibler.

↯ Claude 4.6 llama opus
Hello everyone! A newbie here looking for help (www.reddit.com) 2 7w

I'm fairly new to all this AI stuff and I'm trying to learn as much as I can about topics like: * Skills * Agents * Models * LLM * Ollama * llama.cpp * Quantization. But I'm still lost. I have 32 GB of RAM in my PC and would like to run mode…

ollama llama
M1 Pro 16GB users: what local LLM configs are actually usable day to day? (www.reddit.com) 2 7w

I'm trying to get past generic "best model" recommendations and collect real-world configs from people on similar hardware. My setup: MacBook M1 Pro, 10-core CPU, 14-core GPU, 16 GB unified memory.

vllm ollama llama
Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way. (www.reddit.com) 5 7w

I was disappointed with Gemma 4 due to various bugs and, in the end, lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both model quants and llama.cpp…

↯ Qwen 3.6 gemma qwen llama
Cheapest and most efficient way to run 30B-40B Llama for 4 users? (www.reddit.com) 7 7w

Edit: the title has a mistake, I meant LLMs, but it autocorrected to Llama. Basically I am looking for a way to run 30B-40B LLMs locally for up to 4 users with lowest power draw possible.

llama
Qwen3.6 local test (live) with llama.cpp. Is it going to be better than Gemma4? (www.youtube.com via reddit) 4 7w

↯ Qwen 3.6 llama
Gemma4 quirk to use ls -R; can we do better? (www.reddit.com) 7 7w

At the office I'm CPU and local only, so GPU poor. Besides the Qwen3.5 series, I've come to really like Gemma4 E4B there using the Pi agent (llama.cpp, Q4KM).

↯ Qwen 3.5 qwen llama
GPU picker for open models. 66 configs run Llama 3.1 8B, and the same V100 ranges 17x in price across providers (www.reddit.com) 7w

hi all. every time anyone on our team wanted to rent a GPU to run an open model, the flow was the same: open the HF page, eyeball the weights, open a VRAM calculator, open six cloud provider tabs, then the GPU spec pages because half of th…

llama
MINISFORUM AI X1 Pro-370 (96GB) - Local Ollama Help (www.reddit.com) 8 7w

Hey all. This just got delivered yesterday.

↯ Qwen 2.5 ollama deepseek qwen+1
gemma4 e4b on rtx 5070 ti laptop 12GB running slow 5t/s llama.cpp (www.reddit.com) 9 7w

I sincerely hope someone can help me, because I have tried everything I can and I still get this speed using llama.cpp and opencode. I have described my setup and how I am running it in as much detail as I can.

↯ Gemma 4 ollama gemma llama
How much faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com) 16 7w

I want to download one, and I usually do inference on CPU since I have an old GPU, so I'm concerned about speed. One link on the web (I posted with it and the post was removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…

↯ Qwen 3.5 moe gemma qwen+1
How many moves does your favorite LLM last in a chess game before it cheats and then goes brain-dead? (www.reddit.com) 6 7w

I tried playing chess with Gemma 4 E4B via llama-server at https://www.chess.com/play/computer (or any platform or site convenient for you), and the result was quite unexpected for me. Result: 9 moves before it makes a cheating move (like trying to move a pawn take a…

↯ Gemma 4 gemma llama
I discovered PaddleOCR-VL-1.5 and I was tinkering with it, not sure how to bench test? (www.reddit.com) 7w

As the title suggests, I discovered this model. I ran a bunch of batch processes and found that my 1650 can't handle it and has to use shared memory.

llama
Offload settings for unsloth/Gemma-4 on Apple Silicon? (www.reddit.com) 1 7w

Can default settings be optimized, or is it the best it is going to get? M1 Max Is it best in llama.cpp, LM Studio, or ?

↯ Gemma 4 gemma llama
running models bigger than physical memory capacity (www.reddit.com) 14 7w

has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM https://unsloth.ai/docs/models/qwen3.5 even google gemma 4 are released with about 30+ bill…

↯ Qwen 3.5 gemma qwen llama
Running a full agentic coding loop locally on a 3090. Here's what actually works in 2026. (www.reddit.com) 9 7w

After months of testing, I finally have a local setup that doesn't make me want to go back to the API. Hardware: RTX 3090 (24GB VRAM) Models tested: Qwen2.5-Coder 32B Q4_K_M, DeepSeek-Coder-V3 Q4, Llama 3.3 70B Q3_K_M Inference: llama.cpp…

↯ Llama 3.3 ollama deepseek llama+1
Local Agent Hermes setup with Gemma 4 and llama.cpp (www.youtube.com via reddit) 7w

↯ Gemma 4 gemma llama
Running on cpu :( (www.reddit.com) 3 8w

I am in the midst of a POC project at work and all I have is 4 AMD Epyc cores, and those are essentially virtualized. Does anyone have any tricks?

↯ Mistral mistral ollama rag+1
Need practical local LLM advice: Only having a 4GB RAM box from 2016 (www.reddit.com) 14 8w

Sorry, I'm not a very technical person. I’m trying to figure out the most practical local LLM setup using my spare machine: 4 GB RAM. No GPU for now, so please assume CPU-first unless I mention otherwise.

ollama llama
Gemma4 vs Qwen3.5! MoE vs Dense! Sota vs Obsolete! Porque no los dos? (www.reddit.com) 4 8w

Every other day, there's someone posting about how the latest hotness of the month is gamechanger, but flawed in some way relative to their previous favorite. I can't help but wonder, does no one else keep their previous gen models on spee…

↯ Qwen 3.5 moe llama
What is the best way to deploy LLM on 3x3090? (www.reddit.com) 13 8w

Two questions: which model? In my mind, Qwen3.5 27b or Gemma 4 31b are top options.

↯ Qwen 3.5 vllm gemma llama
How are you feeding personal context to your local models? (www.reddit.com) 1 8w

I've been running Mistral/Llama locally through Ollama for a while now and the thing that keeps bugging me is context. The model itself is fine for general stuff but the second I want it to know about my projects, my notes, or files it doe…

↯ Mistral mistral ollama rag+2
Help on SLMs (www.reddit.com) 3 8w

I am building a context aware terminal wrapper, which suggests the completion of the commands(as vscode code suggestions but for commands), I've completed building for the local bash history, it auto completes the last matching command, sh…

qwen llama
Local AI coding assistant that runs fully offline (Gemma 4, codebase-aware) (www.reddit.com) 9 8w

I’ve been experimenting with running a local coding assistant on Gemma 4 26B, focused on understanding full codebases instead of single-file prompts. Main idea: - build a project map (files, symbols, structure) - run a planning step to dec…

↯ Gemma 4 gemma llama
Open Claw on my old PC (32GB Ram, 12GB VRAM) model suggestions? (www.reddit.com) 3 8w

I tried running Gemma4 E4B through llama.cpp, and I couldn't get it to reply without timing out.

↯ Gemma 4 llama
how to disable reasoning/thinking with llama-server? (www.reddit.com) 3 8w

I run the same model: `google_gemma-4-E2B-it-IQ3_M.gguf` with lmstudio or llama-server and I connect thru `/v1/chat/completions` EP. with lm-studio, when I ask "tell me a story" i just get a story straight away: [google_gemma-4-e2b-it@iq3_…

llama
GGML and llama.cpp join HF to ensure the long-term progress of Local AI (huggingface.co) 15w

llama
Measuring Open-Source Llama Nemotron Models on DeepResearch Bench (huggingface.co) 44w

llama
Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub (huggingface.co) 49w

llama
Welcoming Llama Guard 4 on Hugging Face Hub (huggingface.co) 58w

llama
Welcome Llama 4 Maverick & Scout on Hugging Face (huggingface.co) 61w

llama
“Llama 3.2 in Keras” (huggingface.co) 85w

llama
Llama can now see and run on your device - welcome Llama 3.2 (huggingface.co) 88w

llama
Deploy Meta Llama 3.1 405B on Google Cloud Vertex AI (huggingface.co) 94w

llama
Llama 3.1 - 405B, 70B & 8B with multilinguality and long context (huggingface.co) 98w

llama
Welcome Llama 3 - Meta's new open LLM (huggingface.co) 111w

llama
Make your llama generation time fly with AWS Inferentia2 (huggingface.co) 135w

llama
Comparing the Performance of LLMs: A Deep Dive into Roberta, Llama 2, and Mistral for Disaster Tweets Analysis with Lora (huggingface.co) 135w

↯ Mistral mistral llama
Non-engineers guide: Train a LLaMA 2 chatbot (huggingface.co) 140w

llama
Llama 2 on Amazon SageMaker a Benchmark (huggingface.co) 141w

llama
Fine-tuning Llama 2 70B using PyTorch FSDP (huggingface.co) 142w

↯ Fine Tuning fine-tuning llama
Code Llama: Llama 2 learns to code (huggingface.co) 145w

llama
Fine-tune Llama 2 with DPO (huggingface.co) 148w

dpo llama
Llama 2 is here - get it on Hugging Face (huggingface.co) 151w

llama
StackLLaMA: A hands-on guide to train LLaMA with RLHF (huggingface.co) 165w

rlhf llama
