Friends Don't Let Friends Use Ollama
Ollama gained traction by being the first easy llama.cpp wrapper, then spent years dodging attribution, misleading users, and pivoting to cloud, all while riding VC money earned on someone else's engine…
#llama
753 items
The local LLM ecosystem doesn’t need Ollama (sleepingrobots.com via hn)
Gemma4 26b & E4B are crazy good, and replaced Qwen for me! (www.reddit.com) My pre-gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (My third 3090 died, RIP) and 128gb of system memory Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…
Qwen3.6 GGUF Benchmarks (www.reddit.com) Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier.
Local manga translator with LLM build-in, written in Rust with llama.cpp integration (www.reddit.com) Hi LocalLLaMA, I created a post a few weeks ago, but this time this project has become more reliable and easier to use. This is a manga translator that can also be used to translate any image.
That's a good news... (www.reddit.com) Looks like it finally happens... MTP getting approved for llama.cpp.
Open WebUI Desktop Released! (github.com via reddit)
llama.cpp speculative checkpointing was merged (www.reddit.com)
Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions. (www.reddit.com) Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention.
llama.cpp is the linux of llm (www.reddit.com)
The Financial Times has published an article about Heretic (www.reddit.com) https://www.ft.com/content/5630ed79-a263-41ed-9a1a-321617ae310e “The FT was able to use Heretic, a tool available on the popular code repository GitHub, to remove the guardrails from Meta’s Llama 3.3 model in less than 10 minutes without a…
Qwen3.6-35B-A3B - even in VRAM limited scenarios it can be better to use bigger quants than you'd expect! (www.reddit.com) So maybe this is a no-brainer to many experienced local LLM users but it was not obvious for me. I am running a 3070 8gb + 64gb DDR4.
2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints (www.reddit.com) WARNING: wait before download from HF: I just realised my upload of the new versions with the additional fix in the chat template has not completed yet. I will remove this warning once done The recent PR to llama.cpp bring MTP support to Q…
TextGen is now a native desktop app. Open-source alternative to LM Studio (formerly text-generation-webui). (www.reddit.com) Hi all, I have been making a lot of updates to my project, and I wanted to share them here. TextGen (previously text-generation-webui, also known as my username oobabooga or ooba) has been in development since December 2022, before LLaMa a…
These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade (www.reddit.com) Time and time again I find posts about these fine tunes that promise increased intelligence and reasoning with base models, and I continuously try them, realize they're botched, and delete them shortly after. I sometimes do resort to a low…
Stop wasting electricity (www.reddit.com) Run on my rtx4090 llama.cpp params: llama-server -m ~/Projects/llm/models/Qwen3.6-27B-UD-Q4_K_XL.gguf --flash-attn on -ngl all -ctk q4_0 -ctv q4_0 -t 32 -c 262144 Power limit was set using sudo nvidia-smi -pl N On my observation, GPU const…
Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm) (www.reddit.com) TL;DR best setup I tested on a RTX 3090 24 GB: ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf 156k context, q8_0/q8_0 KV, MTP, vision on CPU benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode llama.cpp…
The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B) (www.reddit.com) This is V2 of my previous post. What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.
Gemma 4 Vision (www.reddit.com)
Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com) Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval: code generation HellaSwag: commonsense reasoning BFCL: function calling Total samples: HumanE…
KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche (www.reddit.com) Here's my article with 38 quant pairs thoroughly benchmarked in KLD with 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This allows us to track not only how cache quantizations affects…
What is the current status with Turbo Quant? (www.reddit.com) It has been hyped ±2 weeks ago and I remember seeing some pull requests into llama.cpp, but what is the current status after the hype faded away?
Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s (www.reddit.com) Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to s…
what’s actually stopping an insider from leaking model weights? (www.reddit.com) this is a dumb question. what are the actual technical barriers stopping an engineer at a place like openai or anthropic from just exporting flagship weights and leaking them?
Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) (www.reddit.com) I have ThinkPad T14 Gen 5 (8840U, Radeon 780M, 64GB DDR5 5600 MT/s ). Tried out the recent Qwen MoE release, and pp/tg speed is good (on vulkan) (250+pp, 20 tg): ~/dev/llama.cpp master* ❯ ./build-vulkan/bin/llama-bench \ -hf AesSedai/Qwen3…
Talkie: a 13B LLM trained only on pre-1931 text used Claude Sonnet to help test the model and judge its output (www.reddit.com) Researchers Alec Radford (GPT, CLIP, Whisper), Nick Levine, and David Duvenaud just released talkie: a 13 billion parameter language model trained exclusively on text published before 1931. No internet.
Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) (www.reddit.com) I've been using Qwen3.6-27B-Q5_K_M with turbo3 KV cache since it's been released, and I haven't had any issues at all (no loops, no memory loss, etc.). However, I'm also aware that K cache compression is not really recommended in most case…
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp (www.reddit.com) Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP.
Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B (www.reddit.com) I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma. Without embedding parameters: Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% sm…
Llama.cpp's auto fit works much better than I expected (www.reddit.com)
LiquidAI/LFM2.5-8B-A1B · Hugging Face (huggingface.co via reddit) looks like you can run it on any potato (A1B)! https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF from LiquidAI: LFM2.5 is a new family of hybrid models designed for on-device deployment.
Comparison Qwen 3.6 35B MoE vs Qwen 3.5 35B MoE on Research Paper to WebApp (www.reddit.com) Note: First is Qwen3.5 35B MoE (Left) and Second is Qwen3.6 (Right) Hi Guys Just did quick comparison of Qwen3.6 35B MoE against Qwen 3.5 35B MoE. with reasoning off using llama.cpp and same quant unsloth 4 K_XL GGUF First is Qwen3.5 outco…
Qwen 35B-A3B is very usable with 12GB of VRAM (www.reddit.com) Hardware: RTX 3060 12GB 32GB DDR4-3200 Windows CUDA 13.x Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf The model is a 35B MoE, so -ncmoe matters a lot. Lower -ncmoe means more MoE blocks stay on GPU.
Quantizing MTP KV Cache = free lunch? (www.reddit.com) With the MTP llama.cpp implementation in the Qwen3.6/3.5 models more VRAM is required for the MTP layer. However, many people don't realize this layer comes with its own KV cache which can also be quantized: -cache-type-k-draft q8_0 -cache…
Markdown browser for LLMs (www.reddit.com) I built a markdown web renderer for AI agents. Instead of taking expensive screenshots and piping them through vision models, TextWeb renders web pages as markdown that LLMs can reason about natively.
Get faster qwen 3.6 27b (www.reddit.com) Using 100k context with 3090 with MTP GGUF and getting 50 t/s on llama.cpp Thought I would knowledge share Use https://huggingface.co/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF And am17an commit /media/adam/D_DRIVE/LLM/llama-cpp-am17an/build/bin/ll…
LM Studio finally added support for MTP Speculative Decoding (www.reddit.com) Update to 0.4.14 Build 2 (Beta) and make sure your llama.cpp engine is 2.15.0…
Unsloth solved bug in Mistral Medium 3.5 implementation (www.reddit.com) https://unsloth.ai/docs/models/mistral-3.5 "May 1, 2026 Update: We worked with Mistral to fix Mistral Medium 3.5 inference affecting some implementations, and released updated GGUFs with the fix (NOT related to Unsloth or our quants). The…
b9180 llama.cpp MTP landed (www.reddit.com) All across the land many monitors showing green cmake with giddy anticipation Tip your bartender! https://github.com/ggml-org/llama.cpp/releases/tag/b9180
Info: Nvidia Cuda 13.3 landed (www.reddit.com) Cuda 13.3: Downloads, Release Notes. Anybody already tried llama.cpp with 13.3?
Qwen 3.6 27b IQ4_XS - 22 tp/s on RTX 5060TI 16b, 24k ctx (www.reddit.com) Maybe it will be helpful for someone: llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000 Can't run this model with higher KV quants at >8192 ctx size. -ub & -b set for 256 allowed me for…
I have DeepSeek V4 Pro at home (www.reddit.com) Just wanted to share that I used u/LegacyRemaster slightly modified (Q4_K_M conversion support) DeepSeek V4 CUDA repo (based on u/antirez work) to convert and run Q4_K_M DeepSeek V4 Pro on my Epyc workstation (Genoa 9374F, 12 x 96GB RAM, s…
Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090 (www.reddit.com) So I've been messing around trying to get MTP working alongside TBQ4_0 (TurboQuant's lossless 4.25 bpv KV cache) on Qwen3.6-27B for my own use. So after a day of vibecoding I think I may have gotten something viable.
Thoughts on using an AMD Alveo V80 FPGA PCI card as a poor man’s Taalas HC1 (LLM-burned-onto-a-chip). (www.reddit.com) TL:DR - Remembered FPGA PCI boards being a big thing from my crypto days. Wondered if AMD Alveo V80 FPGA card could be used to approximate the performance of a Taalas HC1 (LLM-on-a-chip).
RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help (www.reddit.com) MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190. I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round.
I catalogued every way local models break JSON output and built a repair library, here's what I found across 288 model calls (www.reddit.com) I've been running structured output prompts through a bunch of models on OpenRouter for the past few months — Llama 3, Mistral, Command R, DeepSeek, Qwen, and every other model on OpenRouter — alongside the usual closed-source suspects. 28…
vLLM ROCm has been added to Lemonade as an experimental backend (www.reddit.com) vLLM has the ability to run .safetensors LLMs before they are converted to GGUF and represents a new engine to explore. I personally had never tried it out until u/krishna2910-amd/ u/mikkoph and u/sa1sr1 made it as easy as running llama.cp…
Heretic has been served a legal notice by Meta, Inc. (www.reddit.com) To Whomsoever it May Concern, The individual behind the Heretic Free Software Project (henceforth called "Heretic", notwithstanding unrelated entities of the same name) has been served a notice by a legal services provider representing Met…
PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server (www.reddit.com) Hey folks, just a heads-up for anyone running Qwen3.6 through llama-server. I ran into an issue where the preserve_thinking parameter wasn't working as expected, even though I had it explicitly enabled in my models.ini config.
FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8 (www.reddit.com) Last year researchers affiliated with NVIDIA, University of Warsaw, and University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique using learned per-head token eviction, reporting up to 8x KV…
Dual dgx spark (Asus GX10) MiniMax M2.7 results (www.reddit.com)
Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT (www.reddit.com) Setup: Kubuntu 24.04 - AMD cards - R9700 AI PRO and 7800xt (32gb + 16gb) - llama-cpp server - stack setup in docker - vulkan image I tried with ROCM but it wouldn't play nice with RDNA4 + RDNA3 mix. Vulkan seems to work.
Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct (www.reddit.com) Ok, hear me out. This all started when I was trying to understand why this Qwen3.6 27B INT8 Autoround (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B q…
If you've been waiting to try local AI development, please try it (www.reddit.com) I have snobbishly long felt that the local models were not 'up to my standards' for local development, or otherwise able to compete with GHCP, Claude Code, Cursor etc. Boy was I wrong.
Qwen-27B-IQ4_KS for ik_llama.cpp, especially for NVIDIA with 16GB VRAM (www.reddit.com) Hi everyone, I'm presenting a new quantization of the Qwen-27B model, created specifically with 16GB VRAM NVIDIA GPUs in mind. I used quants that, unfortunately, are not yet available in the main upstream llama.cpp.
Building a fully local PDF-to-audiobook workflow with Kokoro 82M, Qwen and llama.cpp (www.reddit.com) Hey everyone, I’ve been building a local-first desktop PDF reader that can read technical books aloud and keep the spoken text highlighted while reading. The original motivation was pretty practical: I read a lot of programming and technic…
Qwen3.6-27B with MTP grafted on Unsloth UD XL: 2.5x throughput via unmerged llama.cpp PR (www.reddit.com) Hey everyone, I've been working on getting Multi-Token Prediction (MTP) working with quantized GGUFs for Qwen3-27B and the results are pretty impressive. Here's what I put together: https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF…
PS5’s can now be hacked to run Linux - perhaps some potential for local inference? (www.tomshardware.com via reddit) I look forward to the Local LLM community getting llama.cpp to run on these. Could be a good value.
BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) (www.reddit.com) TL;DR New llama.cpp fork! I wanted a Windows-friendly inference to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled.
Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into (www.reddit.com)
Share your speculative settings for llama.cpp and Gemma4 (www.reddit.com) I have totally missed the boat on speculative decoding. Today when generating some code again for the frontend i found myself staring down at some quite monotonic javascript code.
Strix Halo Llama.cpp MTP Benchmarks: 27B Gets Much Faster, 35B Is Mixed (www.reddit.com) TL;DR All models were Qwen3.6 27B-MTP vs Base 27B (15k single-turn): Faster overall Total Time (wall): 87.44s → 77.39s (10.05s faster / -11.50%) Generation: 7.63 → 16.15 t/s (+111.77% speedup) Prompt Processing: 279.75 → 244.90 t/s (-12.46…
235M param LLM from scratch on a single RTX 5080 (www.reddit.com)
Qwen3.6 27B NVFP4 + MTP on a single RTX 5090: 200k context working in vLLM (www.reddit.com) So I spent some time testing Qwen3.6 27B NVFP4 on my RTX 5090 and wanted to share the numbers, since most of the recent good posts are either around 48GB cards, FP8, or llama.cpp/GGUF. This is not a "best possible setup" claim.
For everyone that uses OpenCode / Pi - Here's your prompt processing fix! (www.reddit.com) This PR deserves much more attention as it fixes the constant prompt processing that happens when using llama.cpp with OpenCode or Pi. https://github.com/ggml-org/llama.cpp/pull/22929
Testing llama.cpp MTP support on Qwen3.6 - RTX 5090 (www.reddit.com) Setup: - RTX 5090, 32 GB, Linux - Built llama.cpp from 4f13cb7 (the official ghcr.io/ggml-org/llama.cpp:server-cuda image hasn't picked up the merge yet as of writing — had to docker build from source with CUDA_DOCKER_ARCH=120) - Unsloth's…
Is using vLLM actually worth it if you aren't serving the model to other people? (www.reddit.com) So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc.
MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks (www.reddit.com) Hey r/LocalLLaMA, we did an investigation into MiniMax-M2.7 GGUF causing NaNs on perplexity. Our findings show the issue affects 21%-38% of all GGUFs on Hugging Face (not just ours).
Qwen3.6 27B and llama.cpp appreciation post (www.reddit.com) To preface, here's my config: llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinj…
Experts-Volunteers needed for Vulkan on ik_llama.cpp (www.reddit.com) ik_llama.cpp is great for both CPU & CUDA. Need legends to make Vulkan better as well.
[P] Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book (www.reddit.com) I wrote a book that implements modern LLM architectures from scratch. The part most relevant to this sub: Chapter 3 takes GPT-2 and swaps exactly 4 things to get Llama 3.2-3B: LayerNorm → RMSNorm Learned positional encodings → RoPE GELU →…
common/gemma4 : handle parsing edge cases by aldehir · Pull Request #21760 · ggml-org/llama.cpp (github.com via reddit) If you are on Gemma (like me), you basically have to compile llama.cpp daily now
Introducing BlueTTS (www.reddit.com) I recently worked on BlueTTS, a lightweight text-to-speech model that focuses on speed and usability. It supports multiple languages: English, Hebrew, Russian, Spanish, and French (even within the same sentence), and comes with a large set…
Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB (www.reddit.com) I managed to get DFlash speculative decoding working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR: https://gith…
MiMo-V2.5-GGUF (preview available) (huggingface.co via reddit) Hi, AesSedai here - I've put up a PR to support the text-to-text inference of MiMo V2.5 with llama.cpp (and should also support Pro, will work on those quants after finishing V2.5): https://github.com/ggml-org/llama.cpp/pull/22493 I've als…
Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. (www.reddit.com) As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.
Qwen 3.6 35 UD 2 K_XL is pulling beyond its weight and quantization (No one is GPU Poor now) (www.reddit.com) Hi guys, Back again. I have tested the Qwen 3.6 UD 2 K_XL Unsloth model on the same paper to web app task.
FYI, Step 3.5 Flash has better perf and context is 1/4 the price in llama.cpp (www.reddit.com) So i recently updated LMstudio after a long pause and updated my llama.cpp runtimes too.. i was shocked..
Ban phrases on llama.cpp with this script. (www.reddit.com) Check the README for setup instructions: https://github.com/BigStationW/llama-cpp-phrase-ban
I stumbled on a Gemma 4 chat template bug for tools and fixed it (www.reddit.com) TLDR: tool parameters using the common JSON Schema pattern `anyOf: [$ref, null]` are rendered into the prompt as empty `type` fields. This strips the useful schema information before the model sees it.
Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM (www.reddit.com) Hello everyone! I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fits in to my RTX 5060 Ti 16 GB.
llama.cpp benchmark native vs. non native NVFP4 on Blackwell - summary (www.reddit.com) I tested two llama.cpp builds on the same Qwen3.6-27B-NVFP4 model. llama-bench reports the model label as qwen35 27B NVFP4, but the actual tested model is Qwen3.6-27B-NVFP4.
mesa PR with 37-130% llama.cpp pp perf gain for vulkan on Linux on Intel Xe2 (gitlab.freedesktop.org via reddit)
Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR? (www.reddit.com) Hey everyone, Ever since the day Google announced TurboQuant, I've been following the news about its extreme compression capabilities without noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the di…
Web OS result from Qwen3.6 35B is by far the best I tested in my laptop (codepen.io via reddit) This is my first test with this model and Qwen impressed me. I will rate it 98% usable web os compared to my previous best 70% usable result from qwen3 next coder at q2.
Can't keep up with Llama.cpp changes, made a n8n workflow to summarize it for me daily (www.reddit.com) My kind of daily news sent to me via Discord. The N8N workflow (you could probably have Hermes or another agent do similar):…
llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig (www.reddit.com) PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.
Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP (www.reddit.com) for anyone who cares... 😄 prompt = spen a 1000 tokens unsloth MTP models strix halo llama.cpp:server-rocm-mtp \ --spec-type draft-mtp \ --spec-draft-n-max 3 Qwen3.5-122B-Q5-MTP-General n_decoded = 100 tg = 29.77 t/s n_decoded = 179 tg = 27…
NCCL-Free Tensor Parallelism on Dual Blackwell PCIe llama.cpp b9095 released! (www.reddit.com) b9095 finally makes -sm tensor work on dual consumer Blackwell PCIe GPUs without NCCL. If you're on dual Blackwell GPUs this looks like it could be big. I'll have my own results for 2x5060ti asap
Gemma4 26b a4b Apex quant is quite good (www.reddit.com) I tried mudler's apex quant for gemma4 26b a4b and it was amazing! I got 38tps at 90,000 context with no loop and surprisingly no quality degradation.
Dual GPU llama.cpp speedup (www.reddit.com) Llama.cpp has had a long standing issue with "--split-mode tensor", you'll get great results but it only supports non-quantized KV caches, for this very reason a lot of people decide to go with a healthy sized KV cache and ignore tensor pa…
MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b (www.reddit.com) I converted nvidia/llama-embed-nemotron-8b to MLX fp16, 8-bit, 4-bit, and 2-bit (for my OCD) and put it on HuggingFace: ncorder/llama-embed-nemotron-8b-mlx-fp16 ncorder/llama-embed-nemotron-8b-mlx-8bit ncorder/llama-embed-nemotron-8b-mlx-4…
As MTP prepares to land in llama.cpp, Models that support MTP (www.reddit.com) DeepSeekv3 OG DeepSeekv3.2/4 Qwen3.5 GLM4.5+ MiniMax2.5+ Step3.5Flash Mimo v2+ Until we get mtp weights, you need to download HF weights and convert to gguf. I think I'm going to try either qwen3.5-122b or glm4.5-air first.
Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max (www.reddit.com) Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K.
Qwen3.6 One Shot Tetris Game (www.reddit.com) I am blown away by what this model can generate locally. I asked for a flashy Tetris game with particle effect and boy did it deliver!
Compile English function descriptions into 22MB neural programs that run locally via llama.cpp (www.reddit.com) We built a system where a neural compiler takes a plain-English function description and produces a "neural program" (a combination of a continuous LoRA adapter and a discrete pseudo-program). At inference time, these adapt a fixed interpr…
Old Mac Pro still proving its worth (www.reddit.com) The “Trash Can” Mac Pro, once the most expensive machine you could buy from Apple, mine was just shy of £10,000 in 2016 — that’s £14k in today’s money. Until recently mine was just running as a kubernetes single node development platform,…
Uploaded Unsloth Qwen3.6-35B-A3B UD XL models with MTP grafted, here are the results (www.reddit.com) Following my previous post https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq, a few people asked for the 35B A3B version. The model is up on HuggingFace at https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF if anyone wants to ch…
PSA: llama-swap released a new grouping feature, matrix, allowing you to fine tune which models can run together (www.reddit.com) Previously a model could only be present in a single group. Now you can create whatever groups you want: one for big models that should run on their own, a group for STT + bigger model, a group for RAG usages, etc.
Can't replicate Reddit numbers with Qwen 27B on a 3090TI. (www.reddit.com) I feel like i'm going insane. I see people here posting 30 - 100+ tok/s (100+ being with speculative decoding) on a 3090 with Qwen 3.6 27B.
Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant (www.reddit.com) Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM.
Running the new Qwen3.6-35B-A3B at full context on both a 4090 and GB10 Spark with vLLM and Llama.cpp (www.reddit.com) Here is how to run the new Qwen3.6-35B-A3B > At full context on a 4090 - IQ4_XS gguf with llama cpp > At full context on a Spark - FP8 with a tweaked vLLM Here is the docker compose with llama cpp services: llamacpp: container_name: llamac…
Qwen3.6 huge quality gain from Q4 to Q6 for coding agent (www.reddit.com) So, last week I tried to update my unused local LLM setup. I had to stop using it because quality was too low and deepseek was too cheap.
I ran a quantization shootout on Qwen3-Coder and the results are... interesting (www.reddit.com) Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE from unsloth for a while as it's just really fast on my system.
Warpdrv - my open-source Llama.cpp launcher for daily-driving Qwen 35b + 27b on Strix Halo + RTX Pro. (www.reddit.com) I wanted to share an open-source app that I built for running LLMs locally on my setup. My setup Hardware FEVM FAEX1 (128GB) RTX Pro 5000 Blackwell (48GB), connected over OCuLink Aoostar AG02 2x2TB internal m.2 drives on raid-0 using mdadm.
To 16GB VRAM users, plug in your old GPU (www.reddit.com) For those who want to run latest dense ~30b models and only have 16GB VRAM, if you have a old card with 6GB VRAM or more, plug it in. It matters that everything fits on the VRAM, even on 2 cards.
Qwen3.6-35B-A3B-APEX / 128K ctx on RTX 3060 12GB — 37 t/s gen with 72k ctx filled, PPL 3.25, offloading 17GB model (www.reddit.com) I'm posting this because it may be helpful to squeeze the 12GB VRAM in the 3060. All credit goes to spiritbuun's fork (github.com/spiritbuun/buun-llama-cpp) and mudler's APEX quantizations (huggingface.co/mudler).
Qwen3.6-35B-A3B vs Gemma4-26B-A4B (www.reddit.com) Just wondering how are people's experience with both these models! I've had some nice results with Qwen but Gemma4 runs so much faster here.
[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo (www.reddit.com) Probably most of you are aware that using anything other than -ctk q8_0 -ctv q8_0 / -ctk q4_0 -ctv q4_0 as startup options for llama.cpp leads to prompt processing on cpu instead of gpu for cuda at least. E.g.
Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version) (www.reddit.com) In my opinion, MTP models are 100% game changer for local LLMs. In terms of speed, I was getting around 1.5x the tok/sec of previous tests.
More Qwen3.6-27B MTP success but on dual Mi50s (www.reddit.com) TLDR: The hype is real! 1.5x speedup.
Using PaddleOCR-VL-1.5 with llama-server for book OCR (www.reddit.com) I've been running PaddleOCR-VL-1.5 via llama.cpp's server for OCR on book pages. It handles complex layouts, tables, and mixed text/figure pages surprisingly well.
Using the iGPU as the primary graphics card may improve token generation speed for PCIe graphics cards (www.reddit.com) A few days ago, I was trying to improve token generation speed on my RTX 4070 Super 12GB while running Qwen3.6 35B A3B UD-IQ3_XXS (Unsloth) with llama.cpp, but to no avail. At that time, I had my monitor plugged in my 4070 and didn't even…
Qwen3.6 agent + Cisco switch: local NetOps AI actually works! (www.reddit.com)
Q8 Cache (www.reddit.com) https://github.com/ggml-org/llama.cpp/pull/21038 Since now cache quantization has better quality, does that mean Q8 cache is a good choice now? For example for 26B Gemma4?
hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) (www.reddit.com) A few weeks ago, after finishing FastDMS, I started toying around writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple weeks, I turned those experiments into…
Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP (www.reddit.com) Hey fellow Llamas, keeping it short. We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory).
Running Minimax 2.7 at 100k context on strix halo (www.reddit.com) Just wanted to share because it took me a lot of tweaking to get here: llama-server -hf unsloth/MiniMax-M2.7-GGUF:UD-IQ3_XXS --temp 1.0 --top-k 40 --top-p 0.95 --host 0.0.0.0 --port 8080 -c 100000 -fa on -ngl 999 --no-context-shift -fit of…
DS4, a specialized inference engine for DeepSeek v4 Flash (twitter.com via hn) antirez @antirez Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other…
Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark (www.reddit.com) Just ran some llama-bench comparisons between ROCm and Vulkan backends on my Strix Halo system. Vulkan came out ahead, which surprised me.
Open Weights Models Hall of Fame (www.reddit.com) I read a lot of "whengguf" type posts. I think we should sometimes stop and be grateful.
GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B (www.reddit.com) Hi folks, Enjoy an optimised Qwen3.6 35B-A3B and Qwen3.6 27B for coding and general purpose - it's able to solve puzzles correctly more often too. The initial intent was to optimise the 35B-A3B reasoning traces since it's the most efficien…
Llama.cpp parameters for Qwen 3.6 with RTX 3090 (www.reddit.com) Hi, I'm trying to run Qwen 3.6-35B on my RTX 3090 (24 GB of VRAM) but I'm not sure about two things: - Which variant of the model to use? (Q4_K_S, Q3_K_XL, other?
Intel Arc B70 with HP z640 workstation (pcie 3) (www.reddit.com)
Curated a list of 550+ free or cheap AI tools for vibe coding (LLM APIs, IDEs, local models, RAG, agents) (www.reddit.com) Been vibe coding a lot recently and kept running into the same problem finding actually usable tools without paying for 10 different subscriptions or donating my bank balance to Claude. So I put together a curated list focused on free or l…
Turn an old Android phone into a Local AI Voice Assistant (www.reddit.com) I had a nice old cracked pixel 5a laying around that I wanted to get some use out of, so I turned it into a local AI Voice assistant. A server on a laptop running llama.cpp gemma-3-4b-q4.gguf served by flask connects to a script running on…
Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. (www.reddit.com) Here's the PR by pedapudi. https://github.com/ggml-org/llama.cpp/pull/21344 Its merge request was denied, so it will not land in mainline llama.cpp.
MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it (www.reddit.com) I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model.
As of today, what's the *most stable* model to run on a 32Gb RAM Mac w/ 256k context? (www.reddit.com) Hey everyone, I've been playing around with Gemma4 and Qwen3.6 on my 32Gb Macbook Pro M2 Max since their release but I'm struggling at finding: The best software to run it (oMLX, llama.cpp, ...) The best model + quant to pick The best sett…
why llama.cpp can’t combine speculative decode methods? (www.reddit.com) dicking around with the new mtp speculative decode with qwen3.6 27b, and it’s great. but for agentic coding i’ve seen significant improvements from ngram, because a decent fraction of the time (e.g.
Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama (www.cyera.com via reddit) TL;DR We discovered a critical vulnerability (CVE-2026–7482, CVSS 9.1) in Ollama that enables unauthenticated attackers to leak the entire Ollama process memory, potentially im…
Intel B70: LLama.cpp SYCL vs LLama.cpp OpenVino vs LLM-Scaler (www.reddit.com) In case anyone is interested, I decided to test out LLama.cpp's new OpenVino backend to see how it compares on Intel GPUs. At first glance, it stomps all over the previous best-case, SYCL, but lags behind LLM-Scaler (Intel's VLLM fork), li…
VLLM gives 5x speed of llama but quants not available (unsloth/gguf). What to do? (www.reddit.com) EDIT - IGNORE. I MADE A MISTAKE.
Blackwell and PDL performance increase (www.reddit.com) Llama.cpp recently introduced support for Programmatic Dependent Launch (PDL), which is a new feature in Nvidia GPUs (CC >= 90, not including ADA) such as Blackwell. (See PR 22522.) In short, PDL enables more efficient execution of kernels…
MTP experiences on 7900xtx? (www.reddit.com) Hi! I have been using Qwen3.6 35B A3B happily the past few weeks, and I wanted to try out Qwen3.6 27B with the new fancy MTP speculative draft!
Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future. (www.reddit.com) I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".
MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp (www.reddit.com) I was wondering what will be the difference in results with flag: GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 vs MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 Results are quite interesting 49tok/sec without MTP vs 64 tok/sec with MTP. PC: RTX5090+128GB DDR5…
Gradually increasing memory use - is there a memory leak in llama.cpp? (www.reddit.com) I've got a 128GB Strix Halo box. Yesterday I wanted to try out Step-3.5-flash.
Lemonade OmniRouter: unifying the best local AI engines for omni-modality (www.reddit.com) I’ve always liked how if I ask ChatGPT to make or edit an image, it just does it. Local AI should be this convenient!
Is there a way to mitigate performance as context grows? (www.reddit.com) In my local LLM setup I get from 30 to 80 t/s generation at the beginning, but it drops quite a lot as context grows. I use llama.cpp/Vulkan with an MI50 and a V100. Are there any command-line flags that can mitigate this?
GPoUr with ~12gb vram and a 3080 getting 40tg/s on qwen3.6 35BA3B w/ 260k ctx (www.reddit.com) The TheTom's turboquant's GPU accelerated turboquant (turbo3) has unlocked high context gains for the 35BA3B family. I can now achieve ~40tg/s via the following GPU-POOR compilation flags and configuration: cmake -B build -DGGML_CUDA=ON -D…
Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload (www.reddit.com) Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough.
Question: Llama cpp, what's good right now for: MTP, KV cache quant, Long context. (www.reddit.com) Used the vllm version of https://github.com/noonghunna/club-3090 It worked fine for maybe 20-40k context; haven't tried the new one. Has anyone used the new llama.cpp patched one on a single 3090?
club-5060ti: practical RTX 5060 Ti local LLM notes and configs (github.com via reddit) I put together a small public repo for RTX 5060 Ti 16GB local LLM setups: I took inspiration from the club-3090 repo, but this one is focused on documenting what we’ve actually tested on 5060 Ti hardware so the setup details are easier to…
Playing One Night Werewolf (Gemma4 & Qwen3.6) (www.reddit.com) Finally feel like it’s possible. I have a custom-built (vibe-coded) UI on llama.cpp that allows model switching in the same chat.
Running llama.cpp on Snapdragon Hexagon NPU seems promising (www.reddit.com) https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md I have a OnePlus 12 with Snapdragon 8 Gen 3. I followed the above README to cross-compile llama.cpp on Ubuntu and then copy to the Termux directory on the…
I built a 5M model to see if it outperforms my 350M model... (www.reddit.com) Hi r/LocalLLaMA! I built a 5M Llama model with HF Transformers on 2x T4 in Kaggle to see whether it can be as good as my previous Apex 350M model (https://huggingface.co/LH-Tech-AI/Apex-1.6-Instruct-350M).
What speed is everyone getting on Qwen3.6 27b? (www.reddit.com) I'm getting ~13 tps on Q8_0, with a context window of 128000, K Q8_0, V Q8_0 this is on 3x GPUS (1x2060super 8gb, 2x5060ti 16gb), via llamacpp unsure if this is slow or to be expected? */llama-server --port 8080 --model */llama.cpp/Qwen3.6…
Show HN: MemFactory: Unified Inference and Training Framework for Agent Memory (arxiv.org via hn) Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged…
Authors Sue Meta's AI Scientists Directly in Llama Copyright Case (www.law.com via hn) A proposed class action filed against Meta Platforms in New York federal court targets not only the company and its CEO Mark Zuckerberg but also two former senior AI researchers by name—an unusual move that could signal a new front in the…
Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp (www.reddit.com) https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja Yall are more than welcome to try it out and provide feedback. In my own testing in Pi-coding-agent I no longer have the "forgot to close thin…
Time to update llama.cpp to get some MTP improvements! (www.reddit.com) https://github.com/ggml-org/llama.cpp/pull/23269
We have sub-agents at home (www.reddit.com) At work I get unfettered access to gpt 5.4 and sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks. At home I am VRAM poor and like to run the models locally for my own enjoyment.
Grafting vision onto text models for fun and profit. (www.reddit.com) So as we know.. llama.cpp separates the vision or other multimedia from the main weights.
Looking to migrate off of Ollama and LMStudio (www.reddit.com) Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.
2 old RTX 2080 Ti with 22GB vram each Qwen3.6 27B at 38 token/s with f16 kv cache (www.reddit.com) PLEASE KEEP IN MIND BOTH OF MY CARDS ARE POWER LIMITED TO 150W (i hate noise) ------- Just wanted to share my current setup, that might help some users out there... services: llama-server: image: ghcr.io/ggml-org/llama.cpp:full-cuda12-b912…
Linux - Why does llama.cpp ROCm consume SO much VRAM for KV cache compared to Vulkan? (www.reddit.com) I have a docker stack with a bunch of AI services and llama.cpp server is the brain. I've got a working vulkan yml snippet for llama.cpp but out of curiosity, I flipped it to ROCM (latest build) and did not see ANY performance improvement.
llama.cpp docker images to run MTP models (www.reddit.com) This is follow up from previous post: https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/ There have been many improvements to the MTP pull request and the llama.cpp main branch, such as image support and various bug fixes. I recently ma…
Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context (www.reddit.com) If anyone is looking for a good high-speed setup with ~190k context, this config has been working insanely well for me. I’m using my laptop as a server over Tailscale.
Show HN: Bonsai 1.7B ternary model at 442T/s on M4 Max (agents2agents.ai via hn) We took a recently released Bonsai 1.7B ternary model from PrismML (https://github.com/PrismML-Eng/Bonsai-demo) and ran our agentic evolution search on it for 6 hours to optimize the Metal kernels. The search was fully autonomous.
Qwen3.6-27B-NVFP4 - images (www.reddit.com) Model: Abiray-Qwen3.6-27B-NVFP4.gguf Specs: - Legion 7i Gen10 - NVIDIA GeForce RTX™ 5090 - Intel® Core™ Ultra 9 275HX × 24 - RAM 32.0 GiB llamacpp settings: ./build/bin/llama-server \ -m ~/.lmstudio/models/lmstudio-community/Qwen3.6-27B-GG…
gemma-4-31B-it-DFlash has been released (www.reddit.com) https://huggingface.co/z-lab/gemma-4-31B-it-DFlash I guess we'll have to wait until this PR is merged before we can test it. https://github.com/ggml-org/llama.cpp/pull/22105
llama.cpp DeepSeek v4 Flash experimental inference (www.reddit.com) Hi, here you can find experimental llama.cpp support for DeepSeek v4, and here there is the GGUF you can use to run the inference with "just" (lol) 128GB of RAM. The model, even quantized at 2 bit, looks very solid in my limited testing, a…
Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help (www.reddit.com) The new dense model is great, but I’m trying to figure out how to increase PP and Token generation speed. I’m running Q8 quants across 3 7900xtx GPUs and I’m consistently only getting 18-20 t/s generation speed and ~650 t/s prompt processi…
Reproduction of TurboQuant (www.reddit.com) There have been many TurboQuant implementations recently in llama.cpp, mlx, vllm, and sglang, but a lot of the discussion and code around them feels pretty noisy and looks to be AI-generated. I’m trying to understand which claims from the…
I built a local LLM that learns how you use Claude Code and starts auto-piloting it (www.reddit.com) I've been running 5-8 Claude Code sessions at a time and got tired of tab-switching to approve tool calls. So I built claudectl — a TUI that sits on top of all your sessions and lets a local LLM (ollama/llama.cpp) handle approvals for you.
llama.cpp server have built-in native tools (exec_shell, edit_file, etc.) (www.reddit.com) https://preview.redd.it/24uvk7o4sy2h1.png?width=1440&format=png&auto=webp&s=542570e3057b6f44c1e7e8d92130f575fb69cfa2 https://preview.redd.it/l4bbm7o4sy2h1.png?width=1440&format=png&auto=webp&s=3dc0edd978da23fecf81e86a269a06de643247d1 I was…
Llama.cpp VS LiteRT on a custom Xiaomi 12 Pro 24/7 Server (V2 Redesign) (www.reddit.com) https://preview.redd.it/sm4ysgdw1w2h1.png?width=1376&format=png&auto=webp&s=3705932403919814fbf2008a1cba189d17e0591e Thanks everyone for the advice on my previous post (24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/…
Experts first llama.cpp (www.reddit.com) This is for all with 12GB VRAM. Hi, I created a fork of llama.cpp with an experimental implementation of experts instead of layers.
'Am I OpenAI compatible' - a tool and documentation for unified api signatures in open source AI. (www.reddit.com) This has turned out to be useful to many of my friends so I thought I'd share here as well. I created a tool and documentation page covering most major open-source projects' adherence to 'OpenAI compatibility' after seeing inconsistencies betwee…
PSA: If you haven’t updated Llama.cpp for a couple of days and find MTP to not be performing well, update llamacpp. (www.reddit.com) I thought it had horrible performance and was a nothingburger and had spent like an hour benchmarking it. Updated it yesterday and got roughly a 1.5-1.8x token boost.
If you use continue.dev and Qwen 3.6 (dense / MoE) - I could use your help (www.reddit.com) Someone suggested I give Continue (Vscode extension) a try. I've been using Roo / Zoo now and liking it but it is pretty tough on context and I was told continue has more control over it.
Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 (www.reddit.com) CPU is just a secondhand 10900x. Using 128k context, unquantized kv cache.
Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup (www.reddit.com) Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where i see huge potential and where I truly believe AI will become more useful for day to day tasks. Around the gemma 4 release I was already…
Llama-Studio, WebUI for llama-server Management (www.reddit.com) Hey all, I have built myself a WebUI for configuring and managing llama-server sessions, and want to share the code and concept. Python and a bit of JS.
[Benchmark] 5090RTX: Prompt Parsing, Token Generation and Power Level (www.reddit.com) Inspired by https://www.reddit.com/r/LocalLLaMA/comments/1tayu5t/stop_wasting_electricity/ I decided to put my 5090 to the test and see what the curves look like for the device and whether there are any obvious sweet spots (apart from se…
Llama models: still valuable for finetuning or surpassed by everything new? (www.reddit.com) Hello there people. So I have noticed that people are pretty much ignoring Llama 3 plus 3.1, 3.2, and 3.3 these days.
Drastically improve prompt processing speed for --n-cpu-moe partially offloaded models (www.reddit.com) Bigger ubatch made gpt-oss-120b prompt processing much faster on my RTX 3090 I was tuning gpt-oss-120b-F16.gguf with llama.cpp on a 24 GB RTX 3090 and found that increasing the physical micro-batch size (-ub) can massively improve prompt p…
Why is opencode so slow in processing the prompt with llama server? (www.reddit.com) I'm running opencode and llama-server locally. I have 32gb ram and 780m igpu.
Released a TurboQuant-compatible KV backend evaluation SDK (www.reddit.com) Disclosure: I am the author of this evaluation SDK. I released an independent TurboQuant-compatible KV backend evaluation package for compressed-KV ABI testing, smoke tests, and partial attention decode experiments.
What's your tps on 3090 + Qwen 3.6 27B in real tasks? (www.reddit.com) I struggle to wrap my head around all this. My goal is local agent to solve low complexity tasks, in the same harness where I would use frontier models.
Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing (www.reddit.com) Hi everyone, I’m the maintainer of Box — a fork of Google’s AI Edge Gallery that I’ve been extending into a fully offline AI assistant for Android. Full disclosure: I built this project.
[7900XT] Qwen3.6 27B for OpenCode (www.reddit.com) I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. The VRAM is a little bit scarce, but I ended up with this so far: llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \ --port 8080 \ --host 127.0.0.1 \ --t…
FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally (www.reddit.com) Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about. llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA kernels landing in `mmq.cuh`, `mmvq.cu`,…
TurboQuant on MLX & vLLM (www.reddit.com) MLX https://github.com/Blaizzy/mlx-vlm?tab=readme-ov-file#turboquant-kv-cache vLLM https://github.com/vllm-project/vllm/pull/38479 MLX & vLLM users, please share your experience with benchmarks(t/s). Adding llama.cpp Links related to Turbo…
Llama.cpp vs LM Studio on gaming PC (www.reddit.com) Here is my experience, I've been using LM Studio with RTX 5080 and 64GB RAM using Windows 11. I'm very happy with LM Studio except the speed.
Llamacpp server : How do the -np and -c flags interact? (www.reddit.com) I've been using lm studio for a few months. I want to try hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp and I don't understand well how the server slots -np and the context size -c interact.
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model) (www.reddit.com) I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends." It does depend. So let me spl…
[NEW] Supra-50M Released! (www.reddit.com) https://preview.redd.it/kx39ammxno2h1.jpg?width=1080&format=pjpg&auto=webp&s=d1a2d5b27920a5b61a50547a6e70a6378445cae4 SupraLabs released a new model! - Supra-50M Supra-50M is a compact 50M-parameter causal language model (BASE and INSTRUCT…
From 6gb to 32gb (www.reddit.com) Well I ordered a 3090 today. I plan on pairing it with a 3060 I have for 32gb combined VRAM.
Qwen 3.6 27B Q8 on four Nvidia RTX A4000 (16GB each) with Llama.cpp and MTP enabled (www.reddit.com) My setup is heterogenous, I originally acquired my server (Lenovo ThinkStation P3 Tower Gen 2) to run OpenShift/K8s clusters (because I work on that), and…
Now that MTP is merged... What's the best outputs you're getting on Qwen 3.6 35B on 2x3090s? (www.reddit.com) We've got great outputs for 27B via club 3090, but what about those of us who love the blazing speed of 35B on dual 3090s? I was getting 1500 p/p and 120 t/g with split layers, but MTP slowed it down to 80 t/g when I tested last week.
Dropping learning rate fixed my Qlora fine-tune more than anything else i tried (www.reddit.com) Been fine-tuning llama 3.1 8b with Qlora for a classification task using about 8k samples. I was getting bad eval results for a while and kept thinking something was wrong with my data.
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (www.reddit.com) I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…
Do not fall into the trap of chasing the next scale or upgrade. (www.reddit.com) I mean; don't get me wrong, I love me some improvements and enhancements and it keeps on giving... and with MTP making its way to llama.cpp soon, a lot of you who aren't already running custom compiles are about to get a boost in inference…
How do I use MTP? (www.reddit.com) Hi, I'm trying to use MTP with llama.cpp, I built from source the mtp-pr, download an MTP model from huggingface https://huggingface.co/unsloth/Qwen3.6-27B-GGUF-MTP/resolve/main/Qwen3.6-27B-Q6_K.gguf But when I run the model I have an erro…
Qwen3.6 27b q5_k_M MTP - 256k context - 5090 (www.reddit.com) https://preview.redd.it/ktg0lr3e0p0h1.png?width=1279&format=png&auto=webp&s=d110580662a5c707038b7e2e4f5226d2a18c7bfe Straight to it: llama-server-mtp \ -m ~/models/Qwen3.6-27B-Q5_K_M-mtp.gguf \ --spec-type mtp \ --spec-draft-n-max 3 \ --ca…
Testing MiMo-V2.5-IQ3_S with 1'048'576 context (www.reddit.com) llama-server.exe --model "H:\gptmodel\AesSedai\MiMo-V2.5-GGUF\MiMo-V2.5-IQ3_S-00001-of-00004.gguf" --ctx-size 1048576 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --threads 16 --parallel…
Qwen3.6 27B seems struggling at 90k on 128k ctx windows (www.reddit.com) I have RX 7900 XTX, running Qwen3.6 27B Q4_K_XL. got 400ish pp and 30s tps.
Sorry if it's not the best place to ask this, of the models in the image, which is the best for (problem solving)/Coding and the best one for studying (ask LLM concepts) ? My PC build is RX 9060 XT 16GB + I3 12100F + 16 GB DDR4 + llama.cpp with Vulkan backend + Linux Mint. (www.reddit.com) I gave some math problems to Qwen 3.5 27B and Qwen 3.6 27B and they got all of them right, pretty smart models I would say, but very slow and electricity consuming, they took like 5 mins with my GPU at 120 W to solve a problem. The MoE mod…
AMD Radeon RX 6900 XT - ROCm vs Vulkan - Gemma 4 and Qwen 3.5 speed benchmarks (www.reddit.com) Did some quick tests after building llama.cpp with ROCm 6.4.2 and latest Vulkan for my 6900 XT gemma4 E2B Q4_K ubatch ROCm pp512 Vulkan pp512 ROCm tg128 Vulkan tg128 32 1536.60 1423.49 151.92 174.59 64 1590.65 1930.60 151.41 173.76 128 265…
For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention. (www.reddit.com) I have a 4 x R9700 system on Threadripper pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking any new model I try out with llama-benchy so that I can get a better idea of how models of diff…
What is the best coding agent (CLI) like Claude Code for Local Development (www.reddit.com) Hey all: I am trying to set up claude code to work with llama.cpp, I am using the Qwen3.6-35B-A3B. I usually use claude code + ZLM subscription i got lucky with $30 yearly - the set up is very simple with their automated script, but for th…
What do you consider to be the minimum performance (t/s) for local Agent workflows? (www.reddit.com) What would you say is the minimum amount of tokens per second you would tolerate for your local agent workflows? I have been trying pi.dev connected to a llama.cpp instance running Qwen3.6-27B-Q6_K_L with 200K context running on an RTX A60…
coding with Qwen3.6-27B-UD-Q2_K_XL.gguf (www.reddit.com) pi llama.cpp awesome torus awesome torus Windows, 5070 (12GB)
RTX PRO 6000 Blackwell Max-Q bad performance (www.reddit.com)
RTX PRO 5000 (48GB) vs MacBook Pro M5 MAX (128GB RAM) - The choice for fine-tuning & agentic coding (www.reddit.com)
What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M (www.reddit.com) I tried a local model a couple of weeks ago. At the beginning, I tried Ollama, but reddit says it's better to switch to llama.cpp.
Llama.cpp llama-server command recommendations? (www.reddit.com) I've seen a ton of PR, and a bunch of failed PR with some interesting additions. I was wondering what other people's commands are looking like now, what they are running for llama.cpp I'm still running: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 l…
How to run Qwen3.5-27B with speculative decoding with llama.cpp llama-server? (www.reddit.com) I run it on 2xRTX 3090. This is part of my llama-server presets file: [Qwen3.5-27B-bartowski] load-on-startup = true alias = Qwen3.5-27B-bartowski hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0 hfd = bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0 draft-mi…
I implemented Laguna (XS.2) as a model in Llama.cpp (github.com via reddit) llama.cpp Manifesto / ggml / ops LLM inference in C/C++ Recent API changes Changelog for libllama API Changelog for llama-server REST API Hot topics Hugging Face cache migration: models downloaded with -hf are now stored in the standard Hu…
Advice on local coding setup (www.reddit.com) Just got an RTX 3090 to go with my Intel Core 9 Ultra 285K CPU and 32 GB of DDR5 6000 ram. I want to code locally on my Windows 11 PC.
Llama.cpp : Split Mode Tensor Fix Incoming? (www.reddit.com) Appears they have been cooking, and we might soon see a fix released for crashes on split mode tensor. Multi-GPU folks, keep watch - (In my tests SM Tensor has a ~35% uplift in TG over Layer but ofc crashes every 90-120 minutes due to vram e…
magic incantation to get llama-bench to work with MTP ? (www.reddit.com) It does not like anything I have tried, including what works with llama-server. is it not built to work with speculative decoding?
What frontend do you guys use? (www.reddit.com) I’m using vim lmao with a custom made plugin for completing text, so I was curious what yall use. Llama-server seems like a sensible default but it seems limited
llampart 1.0.0 - I released a standalone local web UI for llama-server with translations, extended settings and a polished conversation sidebar (www.reddit.com) Hi everyone, I’ve just published the first public release of llampart 1.0.0: https://github.com/mchowy-troll/llampart llampart is a standalone local web UI designed to work with `llama-server`. It started from the `llama-ui` work in the `l…
Very happy with Qwen 3.5 122B output. But is slowness expected? (www.reddit.com) I'm running the 122-billion Qwen 3.5, specifically Qwen3.5-122B-A10B-Q5_K_M, on DGX Spark (128 GB contiguous memory). I'm (very!) impressed with the general knowledge output.
Using Intel Arc Pro series, any thoughts ? (www.reddit.com) Simple question: Has anyone run two or more of either of these on Ubuntu ? Intel Arc Pro B70 (32 GB) Intel Arc Pro B65 (32 GB) Running llama or vLLM etc., Any thoughts
Audio input not accepted with llamacpp for Nemotron 3 nano Omni ? (www.reddit.com) Llama-server does not accept audio input (or video for that matter) with Nemotron 3 nano omni (unsloth). I’m on a recent build of llamacpp and I redownloaded Nemotron, and I have the mmproj loaded too.
[Release] Nexidion – A private knowledge vault with an autonomous local AI background worker. (www.reddit.com) Hello, After almost two years of on-and-off development, 5 complete architectural rewrites, and hitting a few brick walls, I’m finally open-sourcing a project I built to scratch my own privacy-paranoia itch: Nexidion. GitHub Repo: https://…
Gemma4 26b MoE running in MLX with turboquant (and custom kernel) (www.reddit.com) TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with…
llama.cpp constantly reprocessing huge prompts with opencode/pi.dev (www.reddit.com) I’m using llama-swap with llama.cpp. I mainly use opencode + pi.dev and I’m seeing frequent massive prompt reprocessing / prefills even tho the prompts are very similar between requests.
My own local first ai harness (www.reddit.com) Hi, I just wanted to share what I've been playing with for the last couple of weeks. I built my own AI harness: TinyHarness. My main goal was a low memory footprint; it is not written in TypeScript/JavaScript/Python, leaving as much memory as possible for…
Apple MLX vs. llama.cpp: compared and benchmarked [video] (www.youtube.com via hn)
Great results with Qwen3.6-35B-A3B-UD-Q5_K_XL + VS Code and Copilot (www.reddit.com) Long post, but hopefully helps somebody. Llama-cpp vulkan server running single AMD R9700.
Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? (www.reddit.com) How is this dual setup's performance? Is it difficult to set everything up with, for example, llama.cpp?
Qwen 3.6 27B MTP on v100 32GB: 54 t/s (www.reddit.com) Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on v100 32GB SXM module using one of those pcie card adapters. Pulled and built in one shot, and llama-server ran without a hitch.
Gemma4:31b-coding-mtp-bf16 - slow on Macbook M5 128gb (www.reddit.com) Very quick initial test of Gemma 4's new MTP model via Ollama (llama.cpp doesn't support it yet) https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/ Running in Open Webui to view token/s output and I…
How do you estimate total memory usage? (www.reddit.com) Qwen3.6 35B A3B UD IQ4_NL_XL. 512k context tokens for 4 parallel processing, key cache quantized to Q_8 and value cache quantized to Q_4.
Mistral Medium 3.5 128B and Qwen 3.5 122B A10B on 4x RTX 3080 20GB (www.reddit.com) Mistral Medium 3.5 128B with 4x3080 20GB with layer split: CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Mistral-Medium-3.5-GGUF/Mistral-Medium-3.5-128B-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1…
Built a Voice Agents from Scratch GitHub tutorial: mic > Whisper > local LLM (GGUF) > Kokoro > speaker, fully local, no API keys (www.reddit.com) Been building this for a while and finally cleaned it up enough to share. voice-agents-from-scratch is a numbered, chapter-by-chapter repo that walks the full real-time pipeline: Microphone capture Whisper for STT Local GGUF LLM (via llama…
3xR9700 for semi-autonomous research and development - looking for setup/config ideas. (www.reddit.com) Hello everyone. Over the last couple months I have been assembling my local AI setup for personal use, and I thought to write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.
Using a Radeon 9060 XT 16 GB, the gemma4 24b a4b iq4 nl model achieves 25.9 t/s (www.reddit.com) I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Ge…
Long-context coding on RTX 5080 16GB: Qwen3.6-35B-A3B holds 30 t/s at 128K (89 t/s fresh), no quality drop (www.reddit.com) I wanted to see how much of my coding-agent workflow I could move local instead of paying for hosted tools forever. There was another push: Anthropic's own April 23 postmortem confirmed product-layer regressions through March/April.
Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM (www.reddit.com) TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my perso…
Local LLaMA server GPU upgrade advice (www.reddit.com) TLDR : Should an RTX 3090 + T4 be faster than a P40 + T4 for OpenCode with Qwen3.6 35B A3B ? --- Hi, Nowadays, I have an architecture running : A Tesla P40 w/ 24GB VRAM A Tesla T4 w/ 16GB VRAM I mainly use this setup to run models like GPT…
llama.cpp / ik_llama MoE Expert Offloading - Main Memory Bandwidth vs. PCIe Bandwidth (www.reddit.com)
LlaMa.cpp Robot Wars (www.youtube.com via hn)
Alibaba's Qwen family captures over 50% of global open-source model downloads (www.scmp.com via hn) Qwen hits nearly 1 billion cumulative downloads, far surpassing rivals like Meta Platforms’ Llama and DeepSeek, researchers say.
current: 1x 16GB 5060Ti. worth a 2nd for OpenCode? (www.reddit.com) my current build is just a 16GB 5060Ti running on a 3800X with 32GB DDR4. not really anything special, but I only really use it right now for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). someo…
Intel Releases OpenVINO 2026.1 with Back End for Llama.cpp, New Hardware Support (www.phoronix.com via hn) Intel's OpenVINO toolkit for optimizing and deploying AI inferencing across their range of hardware platforms is out with its newest quarterly feature update.…
The Winamp Skin Museum whips the Llama's ass (2020) (www.rockpapershotgun.com via hn) Over 65,000 skins to browse! In the late nineties and early noughties, no video game forum was complete without a 'post your desktop' thread, and no desktop screenshot was complete withou…
Llama.cpp now has an official website: llama.app (twitter.com via hn) Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer.
I'm seeing low draft acceptance when using Qwen3.x MTP, what am I doing wrong? (www.reddit.com) I'm using llama.cpp, and I've tried Bartowski's and my own quants. When using Qwen3.5-122B or Qwen3.6-27B, I'm seeing really low draft acceptance in chats with interleaved code snippets (chatting with the LLM about programming / a code pro…
Shard - getting to 10× KV cache compression (krishgarg.com via reddit) TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench.
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset (www.reddit.com) I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probability distribution of how likely it is…
I made a local-first MCP tutorial repo with node-llama-cpp and a custom agent loop (www.reddit.com) I just published a repo called MCP from Scratch that teaches the Model Context Protocol by building it step by step in plain Node.js. Most of the repo is about understanding MCP itself, but the later modules may be relevant here: I added a…
Need Help Choosing a Harness for Qwen 3.6 27B (www.reddit.com) I've burned a week trying to customize my agent manually - building my own front end - but I've gotten to the point where I'm just exhausted and willing to try a harness, but need the right one. I read posts all the time, but I have a spec…
GPU VRAM only for small models with llama.cpp: is it possible? (www.reddit.com) I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large co…
How I do use the recent llama.cpp native tools to do web rag a.k.a. web_fetch (or anything else for the matter) directly from inside the llama-server's webui (www.reddit.com) Like some other fellow LLMers, I discovered a few days ago that the amazing llama.cpp project has just added native tool functionality to the server. After enabling the relevant options in llama-server and playing a bit with the…
Run Chrome’s tiny Gemma4 (aka Gemini Nano) directly on PC without GPU (www.reddit.com) Remember that sneaky download of Gemini Nano earlier this month? If you talk to it, it will happily tell you it’s a Gemma.
LLMKube – A Kubernetes operator for local LLMs across Nvidia and Mac fleets (llmkube.com via hn) Run production LLMs on your own hardware. A Kubernetes operator for self-hosted LLM inference. vLLM, llama.cpp, TGI, NVIDIA, Apple Silicon.
club-rdna16: practical 16GB AMD/Radeon local LLM testing repo (www.reddit.com) Following on from club-5060ti, I’ve been doing some testing with my desktop AMD GPU and wanted to make a similar repo for 16GB Radeon cards. Repo: https://github.com/5p00kyy/club-rdna16 Pages/results: https://5p00kyy.github.io/club-rdna16/…
WebGPU support in llama.cpp (reeselevine.github.io via hn) Introducing WebGPU support for llama.cpp
Is there a way to disable reasoning per request in llama.cpp's llama-server, while leaving it on by default? (www.reddit.com) Title. I've got a llama.cpp server running a model being accessed across a number of scripts, and some of them are easier for the model than others, and those easier ones are also latency dependent.
club-5060ti follow-up: cleaner RTX 5060 Ti local LLM recipes, benchmark explorer, and CUDA GPU compatibility notes (www.reddit.com) I posted earlier about RTX 5060 Ti local LLM testing, and I have cleaned the repo up quite a bit since then. The project is now a more structured benchmark/recipe repo rather than scattered notes.
While waiting for Fara-1.5 for my coding harness (www.reddit.com) Hi all, not sure many people are aware, so I wanted to say a word about the Fara-1.5 release. This release will likely be the big sister of Fara-7B, built on top of Qwen3.5. The current Fara-7B performs not badly at all but actually requires a pro…
Developers who use local AI - Q4_0 vs Q8_0 KV quant? (www.reddit.com) I'd love to hear from developers who use big context windows if they notice a difference? Obviously I would love to cut the KV cache VRAM requirement in half, but I'm worried about quality especially when we enter into 50k+ context territo…
Extension idea: llama-server with custom samplers (www.reddit.com) Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for allowing users to add custom sampling logic to llama-server without having to maintain their own entire fork and without having to make a wrapper that reimpleme…
I just bought Asus Ascent : Nvidia GB10 (DGX) and It is slower than my Ryzen Ai Max (www.reddit.com) It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?
Turboquant+MTP for ROCm(Llama CPP) (www.reddit.com) TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant (www.reddit.com) Implemented Multi-Token Prediction for QWEN on LLaMA.cpp with TurboQuant. +40% performance!
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com) I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.
Show HN: Tokémon – a Pokédex for LLMs that got out of hand (tokemonlabs.com via hn) An unofficial Pokedex for AI models. Compare GPT, Claude, Gemini, Llama, DeepSeek and more, with types, evolutions, base stats, and simulated token-burning battles.
Terrible Vulkan pp/tg on Arrow Lake iGPUs (www.reddit.com) Hi, I recently tried to get llama.cpp with SYCL running on an Arrow Lake system but gave up halfway through since Vulkan is just way easier to set up. But, the pp/tg I'm getting on Vulkan w/ Arc 130T is disgustingly bad - 100 tokens/s for…
Does 'preserve_thinking' work with openwebui? (www.reddit.com) I'm running qwen3.6-35b with llama.cpp connected to openwebui. And I noticed the model fails the number guessing game test on openwebui while it works perfectly with the llama.cpp web ui.
Ran some Llama.cpp RPC test to see if its worth it. And if 10Gbe needed. (www.reddit.com) Let me first say I am not doing anything with parallelism, so these benchmarks and tests are not for you. That said, if you're a hobbyist like me who is left wondering whether you can use the GPUs in your other PCs, then I have some answers, but I'm stil…
Is HIPfire worth it for Strix Halo? (www.reddit.com) Did anyone evaluate HIPfire for long context sizes (100k+) and quality, for Strix Halo? It apparently promises large performance increase over llama.cpp and the like.
Just got a 8x 32gb v100 server... now what (www.reddit.com) Looking for suggestions. Current setup: llama.cpp, and I ran Qwen 3.5 397B with 256k context.
how i can improve inference speed (www.reddit.com) Specs: Core i5 14400F, 32GB DDR4 3200MHz RAM, RTX 4060. Current speeds: 30 tps output, 500 tps prefill. Command I currently use: .\llama-server.exe -m "H:\model\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" --host 0.…
Qwen 3.5 MTP for 9B (www.reddit.com) Can llama.cpp run MTP for this model?
Need advice: Qwen3.6 27B MTP or 35B-A3B MoE MTP on 16GB VRAM RTX 5080)? (www.reddit.com) Hey folks, looking for advice before I delete or keep a huge model file. I’m testing local coding/agentic workflows on an RTX 5080 16GB + 96GB RAM.
Smaller gguf getting way less tokens per second?? So confused! (www.reddit.com) Noob here, Running Qwen3.6 35B A3B in LM Studio on a 3080 10GB + Ryzen 5 3600 on Windows 10. Tried some unsloth quants with identical settings (GPU offload 40, MoE layers to CPU 40, context 8192, flash attention on).
half-deployed AI projects haunt my github (www.reddit.com) Got 47 repos that start with 'just playing with Claude' or 'testing Llama 4 on'. Every single one dead after three commits.
Questions regarding abliteration / censorship removal (www.reddit.com) Hello everyone. I just thought of something that seems so obvious but from what I’ve been able to find it doesn’t seem like anyone has done it or at least not openly disclosed it if they have.
best approach for Strix Halo distributed inference in llama.cpp? (www.reddit.com) I was curious to understand what people are doing for this use case to get the best trade-off of convenience and performance. Private backhaul on the 10GbE?
Qwen3.6-27B-UD-Q6_K_XL.gguf sometimes gets stuck in a loop (www.reddit.com) Hi all, I'm running Qwen3.6-27B-UD-Q6_K_XL.gguf using llama-swap and llama-server with these parameters (actually stolen from some posts on this subreddit): llama-server \ -m /models/Qwen3.6-27B/Qwen3.6-27B-UD-Q6_K_XL.gguf \ --mmproj /models…
llama.cpp - NVFP4 native support on Blackwell from now - b8967 (www.reddit.com) It looks like finally we have it! Time to test!!!
How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber (patloeber.com via reddit) Tutorial from the Google guy, I use very similar setup (llama.cpp instead of lmstudio)
Why are there so few small local creative writing models from the Chinese? (www.reddit.com) At this moment, models such as Qwen 3.6 35b/27b crush the competition, yet I can't help but notice this pattern. While the local RP scene is abundant with Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…
Will llama.cpp multislot improve speed? (www.reddit.com) I've heard mostly bad opinions about multiple slots with llama.cpp (--parallel > 1). I guess comparing to vLLM it might be worse at this, but I recently tried vLLM on 4 slots and it indeed improved the overall speed significantly (150-170t…
Which local models are actually good at staying in character? Notes from shipping Qwen3.5 4B + 9B as game NPCs (www.reddit.com) I'm building a small text-based game where the gameplay loop is "talk an NPC into revealing a secret." It's basically a 20+ turn roleplay stress test: the model needs to stay in character, remember what the player said earlier, and refuse…
llama-server: Save/restore works for tokens, but KV cache still not resumed? (www.reddit.com) Somehow I cannot get KV resume for my Qwen3.5 model with llama-server: save/restore works for tokens, but the KV cache is never reused. Is this expected? How to enable real resume?
Need help for a calling based agentic ai project (www.reddit.com)
model for frigate, a380 (www.reddit.com)
what is the state of using rotoquant at the moment? (www.reddit.com)
Show HN: Llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU (news.ycombinator.com) Complete llama.cpp tutorial for 2026. Install, compile with CUDA/Metal, run GGUF models, tune all inference flags, use the API server, speculative decoding, and benchmark your hardware.
Which Qwen models can do FIM (Fill in the middle) for autocompletion? (www.reddit.com) I cannot find a definitive answer. I think the following should be able to do FIM: Qwen 2.5 coder, Qwen 3 coder, Qwen 3-2507 instruct, Qwen 3.5, Qwen 3.6. What I verified: Qwen3-32B: no; Qwen3-4B-Instruct-2507: yes; Qwen3.5-27B: yes; Qwen3.6-35B-A3B…
Context checkpoint erasure in llama.cpp ? (www.reddit.com) Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…
can someone explain how to use Matrix in Llama-swap ? (www.reddit.com) I noticed that groups have changed to Matrix, to allow concurrent models. Currently I use llama-swap for my models and an individual instance of llama-server for embedding and reranking, all for Open WebUI.
Strix Halo 128GB on Proxmox - Vulkan vs ROCm benchmark matrix (www.reddit.com) Ryzen AI MAX+ 395, Bosgame M5, 128GB LPDDR5x. Proxmox VE 9.1 LXC containers with GPU passthrough.
Hey, has anyone here used Qwen3.5-27B-NVFP4-GGUF with llama.cpp yet? (www.reddit.com) Hey! I was wondering if anyone of you have used Qwen3.5-27B-NVFP4-GGUF on RTX5090 on llama.cpp?
Multi host GPU cluster using DAC cables vs 4 GPU system. Anyone doing this successfully? (www.reddit.com) Right now I have 3 GPUs, 5060 Ti 16G, 2 x 4060 Ti 16G, and may get a used 3090 24G that I found. I could build a janky open rack system using M.2 and PCI risers with a 1600W PSU or try something like putting 2 GPUs in 2 systems using the f…
LLM inference engine written ground-up natively in C#/.NET (dotllm.dev via hn) Pure C# pipeline. Tokenizer, sampler, scheduler, kernels: all C#. No Python, no foreign runtime, no llama.cpp wrapper.
RTX 3090 llamacpp flags help (www.reddit.com) Hi, my current hardware is an RTX 3090 (24GB VRAM) with 64GB system RAM on Windows 11. I've been playing around with Hermes Agent and local LLMs (Qwopus3.5-27B-v3-GGUF & gemma-4-26B-A4B-it-GGUF). When I try asking the Hermes agent to do a task with…
Can I combine a RTX5060ti 16gb with 7900XTX 24gb for llama.cpp? (www.reddit.com) I bought this 7900XTX for 905 euro in Spain, and I'm wondering if I can combine them to run Qwen 3.5 27B, for example. Using an MSI B650 Gaming Plus WiFi and 64GB DDR5 6400MT/s.
3x3090 is faster in Ubuntu than win11, GPT-OSS 120B 120tg/s vs 6tg/s why? (www.reddit.com) using z790 prime p d4 with 128gb ddr4 3200mhz ram. 1x3090 in main PCIe5 16x slot and 2x3090 in chipset PCIe4 4x slots.
Show HN: Best setup local LLM found for a 5090 (llama.cpp fork + turboquant) (local-llm.utop.workers.dev via hn) Hi folks, I found this setup on consumer hardware that seems to have great results locally: qwen 3.6 q6; 450K context using turboquant turbo3 mode (llama.cpp fork); multimodal support. This AI-generated blog article is a kind…
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB (ziraph.com via hn) A matched-quant MLX-vs-raw-llama.cpp benchmark for Gemma 4 12B on one M1 16GB - decode is a tie, both pinned at the bandwidth wall. The cost that differs is startup and CPU…
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM (deemwar-products.github.io via hn) mochallama: a local, tool-calling LLM inside your JVM. The only in-process, tool-calling local LLM for the JVM: Spring-first, OpenAI-compatible, llama.cpp-backed via Project Panama FFM. No JNI, no daemon, no native-install dance.
Show HN: Will It Fit? – Opinionated Normal People Llama.cpp VRAM Estimator (hypfer.github.io via hn) llama.cpp VRAM estimator for normal people. Assumes single GPU, all layers offloaded.
Gemma 4 12B appears in Hugging Face (huggingface.co via hn) gemma-4-12B-it-GGUF Recommended way to run this model: llama-server -hf ggml-org/gemma-4-12B-it-GGUF Then, access http://localhost:8080
Free Yourself from the Copilot Tax (www.kronkai.com via hn) Hardware accelerated local LLM inference for Go with llama.cpp integration.
LlamaStash – Zero-overhead, terminal-native llama.cpp launcher (github.com via hn) A fast TUI and CLI with init wizard for launching local LLMs via llama.cpp.
Show HN: Thaw – Git branch for a running LLM (fork agents, skip prefill) (github.com via hn) I built thaw because forking an LLM agent is absurdly wasteful today. When an agent explores N branches — RL rollouts, best-of-N, parallel coding attempts — each branch re-runs prefill over the same shared context.
Local run for multi users: which software set? (www.reddit.com) Context: I am testing and running local LLM on Linux for some months, first with llama.cpp and now with vLLM for better concurrent capabilities. I use llama-swap in front of either vLLM or llama.cpp in order to have thinking and non-thinki…
Need some advice on AI workflow (www.reddit.com) Hi all, I'm somewhat new to the scene (been lurking for maybe 4-5 months now), but i think I have all the basics figured out. My setup: 9800x3d with 64GB of RAM, 6900xt with 16GB VRAM.
Looking for a working Deepseek-v4-Flash quant (www.reddit.com) Best I tried so far is https://huggingface.co/nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF with the custom llama.cpp fork, but it suffers from low quality and random incoherent output. VLLM wouldn't support anything other than H100s for DS4.
Run Llama.cpp on a Mac Pro 6,1 with Dual FirePro D700 GPUs on Ubuntu (matthewgribben.com via hn) A D700-specific guide to running llama.cpp with Vulkan on the 2013 Mac Pro: dual 6 GB FirePro cards, Ubuntu, RADV, full GPU offload, cooling, and the traps that make old…
Looking for Suggestions — Single 5090 & 64gb DDR5 (www.reddit.com) Hi Reddit, I am planning on running Qwen 3.6 27b NVFP4 via vLLM on my 5090 but was wondering if something like 35b a3b at Q8 on Llama would produce better results for agentic coding and utilize the system memory. My research says no but if…
Long-context performance at lower quants (www.reddit.com) I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k conte…
Harbor v0.4.19 - vllm/sglang/llama.cpp launch codex/claude/pi/opencode (www.reddit.com) I usually don't post about Harbor releases out of respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends.
Best coding model on RTX 3060 (www.reddit.com) Wondering what’s the best coding model that can fit on a RTX 3060 (12GB). Has anyone been able to do something useful with it?
Could someone please help explain these results? (www.reddit.com) I'm running Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf on 12 GB VRAM and 32 GB RAM via the TurboQuant variant of llama.cpp. I increased the --n-cpu-moe value from 8 to 30, and my inference rate doubled!
What workstation to get for ~13k EUR? (www.reddit.com) My use-cases will be to test open-weight LLMs and work on harnesses, inference systems and possibly other non-ML workflows (CS-related) in the future. Fine-tuning would not be something I do locally because I can rent a B200 from RunPod fo…
minor speed bump for MTP with Qwen3.6-27B-MTP Q6_K_XL (www.reddit.com) I'm on a MacBook M5 Max with 128GB RAM, running a test in openwebui using llama-server (llama.cpp). unsloth/Qwen3.6-27B-UD-Q6_K_XL.gguf (non MTP): 19tps; unsloth/Qwen3.6-27B-UD-Q6_K_XL.gguf (MTP): 22.3tps. So nothing like the massive improvemen…
WebGPU back end in llama.cpp/ggml (twitter.com via hn)
Agent builders: are GPT/Claude/Gemini API costs killing your margins? (www.reddit.com) Hey everyone, For people building agents with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Claude MCP/SDK, Google ADK, or LlamaIndex — how are you managing LLM API costs? Agent workflows can get expensive fast because of: tool calls retr…
What’s the cheapest way to give a local Llama 3 internet access? (SearXNG isn’t cutting it) (www.reddit.com) Finally got Llama 3 70B running locally and wired up function calling so it can search the web. First tried self-hosting SearXNG, but the results are pretty messy.
At wits end for optimizing settings in llama.cpp for 100k context (www.reddit.com) Long story short, I am running Qwen3.5-35B-A3B (GGUF format) and other models on macOS and getting around 1500 tokens/sec for prompt processing and around 35-50 tokens per second for token generation. I'm using the latest version of llama…
Do smaller quants silently break tool calls / JSON output? (www.reddit.com) I posted recently about EvalShift, an OSS CLI for regression-testing LLM model changes. A few people pointed out that for LocalLLaMA, the more interesting use case may be quantization regression: Q8 -> Q4_K_M Same base model, same prompts,…
Ternative – C++/CUDA inference engine for ternary LLMs with runtime LoRA (github.com via hn) Inference engine for ternary-weight LLMs with runtime LoRA, the llama.cpp of BitNet models. Loads a BitNet I2_S base GGUF + a separate LoRA adapter GGUF, merges them at full F32 precision, and serves the result via an OpenAI-…
Weird performance depending on quant (www.reddit.com) Hi, I'm using llama.cpp with qwen3.6 35B A3B on two different machines. I noticed that on both machines tokens per second is better while using Q4_K_S and Q4_K_M quants than lower Q3_K_M quants.
Benchmarking llama.cpp's new MTP support on Strix Halo (calebcoffie.com via hn) PR #22673 landed in llama.cpp on May 16. It adds first-class Multi-Token Prediction (MTP) speculative decoding for models that ship with an MTP head, including Qwen3.6 27B dense…
Tesla P40 running qwen 3.6 (www.reddit.com) Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40.
Llama-server: is it bleeding to CPU/RAM? (www.reddit.com) Is there an easy way to know if a model is using CPU/RAM (and not only GPU/VRAM)? (I think standard verbose output, which got shorter, says nothing about this, but I may be missing something)
Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090 (www.reddit.com) I'll be UPDATING this, as it seems I was benchmarking and testing just before the UPDATE, LOL. TL;DR: If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and co…
b9200 released - potential mtp pp increase (www.reddit.com) testing in progress ...we all need an increase in pp 😆 https://github.com/ggml-org/llama.cpp/releases/tag/b9200 u/am17an's overview: Avoid copying the logits for every token in the batch when doing prompt proce…
ik_llama: Qwen3.6 27B and 35B on very low VRAM (www.reddit.com) Thank you to the people at ik_llama and llama.cpp. It's amazing how far you've all pushed mtp and other tech so that I can run 27B and 35B Qwen3.6 models on an old gaming laptop with a RTX2060 mobile at 6GB VRAM and 32GB RAM.
I can't get Qwen3.6 27B to outperform Qwen-Coder-Next and I'm not sure why (www.reddit.com) In my real-world usage (opencode) and in my synthetic benchmarks, Coder-Next (Q5) demolishes the whole Qwen3.6 family including the 27B Dense model (All Q8). Everybody else is hailing that 27B is superior and is an amazing model, but I hav…
Qwen 3.6-27B Dense with MTP on Strix Halo Windows - Benchmarks (www.reddit.com) Here are some results (llama.cpp)! Task 1: write a short poem. 27B Dense: 12.5 tokens/s; 27B Dense MTP (spec-draft-n-max 6): 14.5 tokens/s; 27B Dense MTP (spec-draft-n-max 3): 18.7 tokens/s. Task 2: edit a hello world html artifact. 27B Dense:…
Strix Halo ROCm + MTP Notes (May 2026) (www.reddit.com) With the MTP merge into mainline llama.cpp, I wanted to try out some other optimizations I could think of. I ended up testing backends, MTP, and bumping to ROCm nightlies.
How does Pi coding agent control Qwen's thinking verbosity? (Qwen 35B A3B, llama-server) (www.reddit.com) I'm running Qwen 35B A3B via llama-server with reasoning budget set to -1 (unlimited) for testing. In every client I've tried, the model just thinks endlessly before responding.
RDNA3 Flash Attention fix just dropped by llama.cpp b9158 (www.reddit.com) https://github.com/ggml-org/llama.cpp/releases
Ollama Pre-Release Switches From Building on GGML to Using llama.cpp Directly (www.reddit.com) https://github.com/ollama/ollama/releases/tag/v0.30.0-rc15 Hopefully this has more devs come to llama.cpp to support Day 1 releases due to Ollama now moving to using llama.cpp directly. Additionally, I hope that Ollama makes it clear that…
I made a UI and server for using Anthropic's new Natural Language Autoencoders locally with llama.cpp (www.reddit.com) Anthropic's first open weight models, Natural Language Autoencoders, are just finetunes of popular open weight models. They do not modify architecture and modeling code so inference with llama.cpp is mostly trivial.
How to disable reasoning for Qwen3.5 4b 9b unsloth ggufs? (www.reddit.com) Hi all, I'm trying to disable reasoning for quicker outputs in llamacpp-server. I remember LM Studio having a think button in the GUI that could be toggled, but later I tried the unsloth ggufs and they don't have that button f…
Is it possible to exclusively use a draft model for reasoning to speed up generation? (www.reddit.com) EDIT: Edited to provide more clarity. It occurred to me that perhaps the same draft model used for speculative decoding would be completely adequate if we just used its output as-is for reasoning, without validating the results against th…
Vulkan or CPU llama cpp backend for local llm for coding/code assist (www.reddit.com) Hi all, I recently started a new job where we're doing Python development on a CI/CD metadata consolidation library for analytics, and we cannot use tools like Claude Code, Codex, GH Copilot, or any model APIs (free or paid). I got a la…
MagicQuant (v2.0) - Hybrid Mixed GGUF Models + Unsloth Dynamic Learned Quant Configurations + Benchmark table with collapsed winners and more (www.reddit.com) I spent the past 5+ months building a pipeline that creates hybrid GGUF quant mixes. I also built it to learn from Unsloth (or other) models by utilizing their quant to tensor assignment.
TensorRT-LLM vs vLLM vs llama.cpp on NVIDIA DGX Spark? (www.reddit.com) I am looking for recommendations on the best way to run local LLMs on NVIDIA DGX Spark. Which stack makes the most sense in practice: TensorRT-LLM, vLLM, or llama.cpp?
How does llama-server pick which MoE experts go on the GPU and which stay on the CPU? (www.reddit.com) If you are using a MoE model that does not fully fit in your GPU, some of the experts must stay on the CPU. Putting the experts that you will actually need on the GPU will give you GPU inference speeds.
am I running this llama-bench of Qwen3.6-27B on these V100s right? (www.reddit.com) basically what I'm doing here is trying to validate whether or not it's a reasonable idea to get a couple of V100s, either SXMs with PCIe adapters or straight-up PCIe cards in the first place, for the sake of running this model or models l…
Tracing tokens through Llama 3.1 8B inference on H100s (krithik.xyz via hn) You open Claude.ai, chatgpt.com, gemini, whatever LLM provider you use. You type something: "What is the capital of France?" You hit enter.
9070xt inference for q3 qwen 27B (www.reddit.com) In llamacpp I'm getting 12tok/s, does this number look right to you and what can I do to increase this number (if possible)? cd ~/llama.cpp && ./build/bin/llama-server -m models/qwen-3.6-27b-abliterated-q3.gguf -ngl 999 -c 65536 (i need th…
How long for llama.cpp official support of MTP? (www.reddit.com) Hello there (beginner here). I've been unable to build llama.cpp myself for my Strix Halo (Windows 11) (cmake errors; I have not dug too deep into it, already burned hours...), so I was wondering when an official release for Vulkan/HIP w…
How difficult is distilling? (www.reddit.com) I remember a year or so ago when DeepSeek R1 came out and it was pretty quickly distilled into Llama 3 8b and Qwen 2.5 (?) 7b. Why don’t we see more distilled models?
Gemma4 26B A4B NVFP4 GGUF (www.reddit.com) Hey everyone! I’ve just uploaded a GGUF version of nvidia/Gemma-4-26B-A4B-NVFP4.
Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide (www.reddit.com) I was asked for this guide, so here it is. Some overlap with someone else’s post from yesterday.
My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) (www.reddit.com) UPDATE: i have switched to vulkan (image: ghcr.io/ggml-org/llama.cpp:server-vulkan-b9014) and now i am getting prompt eval: 591.01 tok/s generation: 41.90 tok/s which is faster than rocm new config: services: llama-cpp: container_name: lla…
LLM inference speed database or leaderboard? (www.reddit.com) A lot of the posts in this sub are requests for advice about which hardware to buy, what settings to use and what speed to expect. There are a lot of excellent replies spread all over the place, but a lot of it is also just vague indications like ~…
M3 Ultra + DGX Spark = M5 Ultra-lite? (www.reddit.com) So I saw an article recently about exo disaggregated prefill with DGX Spark and M3 Ultra - prefill on one machine and decode on another. DGX Spark apparently has 4x matmul performance over an M3 Ultra - same as the M5 Ultra should have.
Mistral Medium 3.5 on AMD Strix Halo (www.reddit.com) TLDR; it's slow as heck. Run overnight.
Show HN: Llmconfig – configfile and CLI for local LLM (github.com via hn) llmconfig (Local Large Model Config): manage local inference with llama.cpp, stable-diffusion.cpp, and whisper.cpp from a single YAML file and a single CLI. llmconfig up gemma # or just: llmc up gemma ✓ gemma is ready at http://127.0.0.1:80…
[Help] Running big dense models faster (www.reddit.com) I have been trying Mistral 3.5 on my 4x RTX 3090 rig with llama.cpp. Inference is slow (about 11 t/s) even without anything being offloaded to the CPU.
World AI Agents – 35 AI Models (Claude, GPT, Llama) via One OpenAI-compatible API (world-ai-agents.com via hn) Access Claude, Llama, Mistral, Nova and more through a single OpenAI-compatible API. Start for as little as €1.
PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090 (github.com via hn) Open LLM inference, rewritten by hand for one specific chip at a time. Kernels, speculative decoding, and quantization, tailored per target.
"I" is not singular — 4 LLM agents with per-agent LoRA on a single RTX 3070 8GB (www.reddit.com) https://preview.redd.it/7yei65sbugyg1.png?width=1703&format=png&auto=webp&s=ad388c51dd10cb44b41a99876d28797e006fd138 Stanford's Generative Agents = one LLM cosplaying 25 personas. I wanted agents that actually become different people — dif…
Best open-weight model to run locally on 8x A100 80GB for generating teacher data? (www.reddit.com) I have (free) access to a SLURM cluster with 8x NVIDIA A100 80GB GPUs (=640 GB VRAM) on a single task, and I want to run an open-weight model locally with llama.cpp for data generation, not coding. My use case is generating teacher data fo…
What STT/LLM/TTS combo are you running for production voice agents in 2026? (www.reddit.com) Curious what stacks people are actually using right now, and where you're hitting walls. Some things I've been observing while testing combos: - Deepgram Nova-3 still the best STT for English, Cartesia is closing the gap on streaming - Ele…
Llama.cpp MIPS R8000 Kernel Running on an SGI Power Challenge from 1995 (twitter.com via hn) Whew! Big work today getting optimized llama.cpp MIPS R8000 kernel running on the SGI Power Challenge deskside from 1995 with Gemma 3 270M.
Help with MI50 and llama.cpp/ROCm 7.2 (www.reddit.com) I have an MI50 that I use with llama.cpp/Vulkan, however some models run quite slowly, so I'd like to try the ROCm backend, but no matter what I try it doesn't work. Downloading the missing files from ArchLinux package doesn't work.
llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged (www.reddit.com) https://github.com/ggml-org/llama.cpp/pull/22196 And somehow we already got some GGUFs for it! https://huggingface.co/CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF (the below one is…
Is long re-processing of output as input a common "feature" or not? (www.reddit.com) I now use (mostly) Gemma 4 and Qwen 3.5 models *. And seems that all of them, after context grows a bit, after providing long output for me and getting a short prompt in response, are starting to process many new tokens as input and I have…
I ran Gemma 4 E2B with llama.cpp on a lot of different iPhones, here's the setup report (www.reddit.com) TLDR: I've been running gemma4 e2b extensively on iOS with llama.cpp and found some interesting quirks and info you guys may like! These are specifics for the iPhone and what I've found worked across 20+ devices.
llama.cpp - tool calling issues on Windows only (www.reddit.com) I have a dedicated linux box I run all my stuff on. I occasionally see the 'zomg 35b can't call tools?!' posts here and chuckle to myself in a *zero issues here* way.
I've got a feeling that Llamacpp is not the biggest performance bottleneck, but it might be the OpenCode. (www.reddit.com) It looks as if OpenCode introduces an artificial delay in agentic coding. Have you noticed similar issues?
Last llama.cpp update broke web search tool calling with Qwen 3.6 27b. (www.reddit.com) At least in open-webui. Nothing has changed except for the backend update.
PI agent integrated with Cline-Kanban repo: All using PI and Qwen 3.6 35B MOE UD 4K_XL (www.reddit.com) Repo: statisticalplumber/kanban at pi-agent-integration Hi Guys, To test Qwen 3.6’s potential, I also wanted the Cline Kanban project to have an open-source agent to work with. The last time I tested Cline Kanban, it didn’t support agents…
Ubuntu 26.04 vs 24.04 speed improvements for inference? (www.reddit.com) I'm curious if any brave soul has upgraded their computer (especially if it's Strix Halo) from Ubuntu 24.04 -> 26.04 and seen a significant performance improvement for inference with VLLM, llama-server, and/or LM Studio.
Brief Ngram-Mod Test Results - R9700/Qwen3.6 27B (www.reddit.com) Decided to try out the new --spec-type ngram-mod feature in llama.cpp using Qwen3.6 27B during an OpenCode bug chasing session. TLDR: Performance is variable, but so far it seems to provide a nice speed increase for working on the same cod…
Does anyone have a usable vLLM setup with Qwen3.6 27B + pipeline parallelism + MTP? (www.reddit.com) I'm a daily llama-cpp user and was hoping to try MTP on vLLM. Unfortunately, pipeline parallelism + MTP does not seem to work with this model in vLLM.
Quant Qwen3.6-27B on 16GB VRAM with 100k context length (www.reddit.com) I have experimented how to run Qwen3.6-27B on my laptop with an A5000 16GB GPU. I have created an own IQ4_XS GGUF "qwen3.6…
RTX 3090 + 27B model performance issues (llama.cpp) what am I doing wrong (www.reddit.com) Hey folks — looking for some advice on improving my local LLM setup (and also exploring agentic coding workflows). Current setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB Using llama.cpp with a Qwen3.6 27B Q6 model (GGUF) Running through OpenCo…
Show HN: Doxa – Open-source emergent simulator for geopolitical scenarios (github.com via hn) Hi! We, Vincenzo and Riccardo, built Doxa as an agnostic engine for emergent simulations with agents for constrainted scenarios (like geopolitical, economics, ...) and work well with LLMs like Qwen2.5:7B, Llama but also cloud models such a…
Is there any quick way to estimate best parameters for llama.cpp? (www.reddit.com) I usually just throw models into LM Studio but I decided to finally compile llama.cpp on my hardware to get some extra speed and to hopefully replace my increasingly unreliable cloud subscription. I have a RTX 4080 and Ryzen 5 7600 with 32…
Severe instability and looping issues with local LLMs (Qwen, Zen4, llama.cpp) (www.reddit.com) I tried working on a local LLM project today and honestly ended up pretty frustrated. I tested several approaches, but none of them worked reliably.
Speed penalty with Q8 KV quantization (www.reddit.com) I knew there would be a speed penalty when switching the KV cache quantization from F16 to Q8, but I never expected it to be this significant at longer context sizes. I ran a test with Qwen 3.5 122B on my MacBook M2 Max using llama.cpp.
Qwen3 27B FP8 + TurboQuant on RTX 5090 - anyone tried? (www.reddit.com) Do I understand correctly, based on this comment, that I can potentially fit Qwen 3.6 27B FP8 precision model and have around 256K context available and fit it fully in my RTX 5090 VRAM? Of course with the help of TurboQuant compression, a…
A 258M-Parameter Turkish LLM Trained from Scratch: Marul V7 (www.reddit.com) Hi, I'd like to share a project I've been working on for a while. I have a Turkish language model that I developed from scratch: Marul V7. The model was trained completely independently.
Verbatim AI – on-device transcription (Whisper) + summaries (Llama 3.2) (apps.apple.com via hn)
eGPU vs system RAM (www.reddit.com)
kIOGPUCommandBufferCallbackErrorImpactingInteractivity... recreate the backend to recover (www.reddit.com)
PSA re Qwen 3.6 35B A3B q4 + agents (www.reddit.com)
Recommended parameters for Qwen 3.6 35B A3B on a 8GB VRAM card and 24GB RAM? (www.reddit.com)
llama-server / web gui / C++ mcp server : is it possible to inject context (for skills or text flavour)? (www.reddit.com)
How is Rotorquant/planarquant/iso qaunt better? (www.reddit.com)
Inferena: Local benchmark of PyTorch vs. Llama.cpp vs. Rust frameworks (inferena.tech via hn)
5070ti + RX 9070 (non XT), over 100 tps on Qwen 3.6 35B Q4 (www.reddit.com) Hi guys, just want to share with you guys a Frankenstein build I put together that is surprisingly decent I have a i5 12400 / B660 / 32GB DDR4 build that was previously paired with a 3060ti. Last Christmas I upgraded it to a RX9070, then I…
Has anyone figured out STT with Gemma4 for Home Assistant? It works but responds with full thought chain. (www.reddit.com) I have Gemma4-E2B working within home assistant as STT, and E2B seems fast and accurate for STT (maybe a bit better than Parakeet), however, it responds with the entire thought process: https://preview.redd.it/v8zhb5elltvg1.png?width=599&f…
Help me squeeze every drop out of my AMD Ryzen AI Max+ 395 (96GB unified VRAM) — local LLM, image/video gen, coding agents (www.reddit.com) I'm running a local AI setup and want to make sure I'm using my hardware to the absolute maximum. If you have tips on better models, smarter configurations, or services I'm missing, drop them in the comments.
Show HN: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (news.ycombinator.com) https://w418ufqpha7gzj-80.proxy.runpod.net Started for myself, but since Im not using it continuously, sharing it: Open Access Qwen3.6-35B-A3B-UD-Q5_K_M with TurboQuant (TheTom/llama-cpp-turboquant) on RTX 3090 (Runpod spot instance). 5 pa…
Ask HN: What are the machine requirements for a LLM like Llama-3.1-8B? (news.ycombinator.com) I want to create a local GenAI. Tell me the server machine requirements.
Anyone who tried new 3.6 on single 3090, what's your llama.cpp flags for best performance ? (www.reddit.com) It's been some time now, surely some have tinkered with it more and optimised it already
Strix Halo concurrency 4 16k context 64 t/s Qwen3.6-35B-A3B-Q8_0 (www.reddit.com) First of all can we make https://www.youtube.com/watch?v=2lUC8Gimxz8 Angine de Poitrine this subs official band? Those guy…
Is there a way to have qwen-code CLI read images? (www.reddit.com) Basically I am asking the model to describe an image, but it says it can't process the images. The weird thing is that if I send the image encoded directly on the prompt, it works just fine, I am using llama-server with qwen3.5 (tried all…
I want to run qwen3.5 27B q4_k_m on CPU, and I need help. (www.reddit.com) I am an local LLM beginner and I found this Reddit while looking for help. (Please understand that I am unfamiliar with Reddit.) (system- i5 4440 1.8GHz/b85m ds3h/DDR3 32GB/128GB SSD/Ubuntu 25.10 questing) I loaded Qwen3.5 27B Q4_K_M onto…
Qwen 122B is AMAZING but is my config right? (128GB M4 Max) (www.reddit.com) Hi! I hope its okay for me to ask this here.
Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? (www.reddit.com) I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.
[Paper] Residual Streams / KV Direct (www.reddit.com) It seems we have entered a period of accelerating innovation regarding the KV cache. Someone mentioned this post's paper in the Github issue of llama.cpp for implementing Turbo Quant.
Vulkan compilation issue on Fedora (b8786) — solved (www.reddit.com) If you pull https://github.com/ggml-org/llama.cpp/releases/tag/b8786 and try to build with Vulkan support on Fedora, you may hit this error: [ 39%] Building CXX object ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/multi_add.comp.cpp.o /h…
DotLLM – Building an LLM Inference Engine in C# (kokosa.dev via hn) Introducing dotLLM - Building an LLM Inference Engine in C# If you’ve been building .NET applications and wanted to run LLMs locally, your options have been… limited. You could wrap llama.cpp through LLamaSharp, deal with ONNX Runtime, or…
Older model suggestions (www.reddit.com) Due to costs I am running on some older hardware. Looking for suggestions on supported models for my particular stack.
Claude down? TokenMonopoly will help you find the best deals in AI subs (tokenmonopoly.com via hn) TokenMonopoly Live leaderboard of AI API deals — pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.
Show HN: How to Use Google's Extreme AI Compression with Ollama and Llama.cpp (news.ycombinator.com) The introduction of TurboQuant, PolarQuant, and QJL (Quantized Johnson-Lindenstrauss) by Google Research represents more than just a technical optimization. At Vucense, we view this as a landmark moment for Inference Sovereignty https://vu…
Show HN: Ext-Infer – Native LLM Inference and Embeddings for PHP (infer.displace.tech via hn) Introduction ext-infer is a PHP 8.3+ extension that loads a GGUF model and runs LLM inference inside the PHP process via llama.cpp. PHP-native semantic search, RAG pipelines, and CLI / worker inference run without shelling out to Python or…
Show HN: LLMhop – A tiny, stateless router for LLMs with a NixOS module (github.com via hn) LLMhop is a tiny stateless proxy for LLM inference servers. It tackles an issue I faced when trying to serve more than one local LLM at once which is not natively supported by vLLM.
Show HN: TurboPrefill – Multi-GPU prefill acceleration for llama.cpp (github.com via hn) TurboPrefill is an attempt to make layer-split multi-GPU configurations spend less time waiting and more time computing during prefill.
ik_llama.cpp – llama.cpp fork with better CPU performance (github.com via hn) ik_llama.cpp: llama.cpp fork with better CPU performance TL;DR This repository is a fork of llama.cpp with better CPU and hybrid GPU/CPU performance, new SOTA quantization types, first-class Bitnet support, better DeepSeek performance via…
DeepSeek V4 Flash at 8.4 tok/s on 3×3090: patching the GGUFs that won't load on cchuter's llama.cpp fork (www.reddit.com) my apologies if anything does not make sense, I literally dont know what I am doing, im not a programmer, just a simple vibe coder, with an Claude subscription. That said, if you have 200gb of sys ram+vram and want to run deepseek v4 flash…
Show HN: Biopetals – Run biology tuned Llama, BitTorrent-style (github.com via hn) About a month ago, I heard about petals. Petals is basically a library that lets you run LLMs by loading the weights onto a network of computers that are all running petals.
Nvidia H100(94GB VRAM) - should I run llama.cpp or vllm for 30 users inference? (www.reddit.com) I was given the great opportunity to borrow a H100 with 94GB VRAM at work until it is needed by a customer. (No idea how much system ram I will get, but I guess they are a bit flexible on this).
Llama.cpp Console released (www.reddit.com) https://github.com/alekk89/llama.cpp-Console/ for windows users
I ditched LM Studio for llama.cpp and my local LLM doesn't feel like a downgrade (www.xda-developers.com via hn) LM Studio has been my default runner for as long as I've been running local LLMs, which is more than long enough now to call it part of my daily flow rather than just something I'm experimenting with anymore. The appeal of LM Studio is pre…
Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? (www.reddit.com) Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774) My run command: ``` llama-server \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --presence_penalty 0.0 \ --min-p 0.00 \ --gpu-layers all \ -m /home/eleung/huggingface…
I made a Windows app for managing llama.cpp in WSL/Ubuntu (www.reddit.com) I’m a Windows user, and I have fairly Windows-y expectations for software: I prefer not having to live in a terminal just to install, build, configure, and run things. I couldn’t find an app that managed the full llama.cpp-on-WSL workflow…
Llama.cpp: What's up with -sm tensor + AMD + Vulkan? (www.reddit.com) Has anyone got it to work? I tried it with dense models (eg qwen 27b, gemma 31b, mistral 128b) since that's where I need it most, but it always core dumps.
Poor performance on RX 9070 XT (www.reddit.com) I was thinking about upgrading from an MI50 to an AMD AI PRO9700, and I happen to have an RX 9070 XT on my gaming pc, so I tested the performance on it to have an idea of what to expect. So, install rocm, build llama.cpp, download Qwen3.6-…
Built a local-first AI memory system that indexes screen activity, meetings, and voice notes ( MCP + automations) (www.reddit.com) Been experimenting with an idea — what if your AI assistant actually remembered everything you did on your computer? Not stateless chats, but real persistent context.
What is everyone using AI for? Realistically (www.reddit.com) So I have to admit, I have fallen victim to the cool looking dashboard videos but I’m struggling to find a use for me. I love AI and use it daily for general questions and some deeper research (Google Gemini free tier).
I spent €300 extracting raw LLM weights, ran into a wild codegen bias trap, and finally mapped the internal activation geometry (60 Graphs) (www.reddit.com) Hey Reddit! A couple of weeks ago, I posted about my independent research on treating LLM alignment as a latent space shift.
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge) (www.reddit.com) Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a…
llama.cpp oom issue (www.reddit.com) I'm having an issue with llama.cpp going OOM (system ram, not vram) after some time, roughly 20-40 minutes of active use. I'm now running it in a cgroup with about 20gb allocated to it, so at least it gets killed and restarted before it st…
how to install llamacpp the better way to wrapping it in python ui (CPU use only) ? (www.reddit.com) i want the best installation that fit my use and my low-compute H.W , i want to run small to above small llm like "qwen" 2b ,4b and 27b , and "gemma" 31B. rely completely on only old CPU 4th.gen i7 with that few 32gb 'slow' ddr3.
gemma 4 e2b quality degrades after ~30-40 continuous inferences on 4gb vram? (www.reddit.com) running gemma e2b via llama-server for continuous background tasks on a 1650 4gb. works great initially but after maybe 30-40 calls the outputs start getting noticeably worse — shorter responses, missing fields in json output, sometimes ju…
NVFP4 + MTP - voilà on llama.cpp (www.reddit.com) As in title - NVFP4 + MTP at once on llama.cpp https://github.com/ggml-org/llama.cpp/releases/tag/b9297
KinetiX: An intra-inference hardware interlock for LLMs (github.com via hn) KinetiX Latent Interlock KinetiX is a hardware and software safety interlock designed to monitor latent states (activation tensors) in real time within LLM inference engines like llama.cpp. It enables instant process termination upon detec…
Did a 30 runs of llama-bench to find optimal settings for my use case (Frigate and HomeAssistant) on my MI60 32gb VRAM GPU - two models tested Gemma4 and Qwen3.6 - Figured I'd share in case it helps anyone else (www.reddit.com) I'm running llama.cpp using this docker container: https://github.com/mixa3607/ML-gfx906 (it's just a lot easier than building from source, which I was doing previously). The MI60 (or MI50) are just a real pain in the behind to get working…
LLaMa.cpp basic question (www.reddit.com) I'm trying to install LLaMa with PI agent. I ran curl -fsSL https://pi.dev/install.sh | sh export PATH="/home/user/.local/share/pi-node/node-v22.22.3-linux-x64/bin:$PATH pi install npm:pi-llama.cpp These commands installed pi, added them…
Seeking resources to read about llama.cpp server and how offloading works (www.reddit.com) SETUP INFO: Amd R9700 AI PRO. Using llama-cpp server, ROCM docker version.
I'm running an agentic system with kobold.cpp as my backend. Am I losing performance? (www.reddit.com) Currently, I'm running a Hermes agent with an OpenAI v1 compatible endpoint provided by Kobold. My setup is a a 24GB 3090Ti + 512GB DDR4 running Qwen3.6-35B-A3B.
Continue config for Qwen 3.6 and llamacpp (www.reddit.com) If anyone is using the Continue.dev extension in VSCode, what config settings are you using for Continue and the llama-server? Mine keeps hanging after bad tool calls.
Hardware LLM Taalas Reaches >14,000 TPS on Llama 3.1 8B (taalas.com via hn) Products Taalas HC1 Technology Demonstrator - Runs Llama 3.1 8B model - TSMC 6nm | 815mm2 | 53B Transistor - 2.5 kW Server Instantaneous Inference HC1 demonstrates the power of Taalas hardcore model silicon technology, delivering 17k token…
AMD BC-250 and the search for Cheap Compute (www.reddit.com) I've been searching for disused/underappreciated compute vectors for a few months since the MI50 shot up in proce - in comes the salvaged PS5 APU on a standalone board; Zen 2, 16 GB unified GDDR6, RDNA 2 (gfx1013). They're $50-150 on eBay…
Volatile prefill speed after each reboot - llama.cpp (www.reddit.com) After every machine restart I get a different prefill speed, it can be only 300t/s or 1500t/s. It's like a lottery at each restart.
Show HN: Llama CPU Benchmarks (deemwar-products.github.io via hn) TurboQuant — "8× faster" The headline is a synthetic GPU-kernel number. On real CPU end-to-end it ran 2.2× slower and dropped Qwen accuracy 17 pp.
Do you think there is room for optimization? llama.cpp/qwen3.6 27b on two 6000 Blackwell (www.reddit.com) Hi, i run llama.cpp inside LXC on a Proxmox server. The hardware is a recent AMD Epyc with two 6000 Blackwell MaxQ.
The MTP function in LMStudio causes a decrease in output quality. (www.reddit.com) The prompt is very simple, you can see it at the end. Both tests used the exact same settings, the only difference was that I turned the MTP button on/off, nothing else changed, I tried similar tests multiple times with similar results: By…
Show HN: Llama-dash – local LLM operators dashboard and proxy (github.com via hn) llama-dash llama-dash turns a self-hosted local inference box into an observable, policy-controlled AI gateway: one UI for model state, request history, API keys, routing rules, proxy metrics, and client setup. The implemented inference ba…
Floor for local meeting summarization on a 6GB GPU: qwen3.5:0.8b works at 57s, Granite 4 350M hallucinates (www.reddit.com) Disclosure: I made this. Open-source, MIT, Windows + Linux.
Find bugs in YOUR code using OpenCode, Llama.cpp and Qwen3.6 (wtarreau.blogspot.com via hn) Background For quite some time I had been submitting tasks to LLMs via llama-cli (natively) or llama-server (API), both from the excellent llama.cpp project. On CPU-only llama-cli starts fast and can restart from a checkpoint which has alr…
Llama-server and MTP (www.reddit.com) currently in order to use MTP one needs to enable it in the starting argument of llama server. --spec-type draft-mtp --spec-draft-n-max 2 But then other models that do not use MTP currently like Gemma or basically all other models fail to…
Qwen3.6 35B MTP, t/s varies on different scenario (www.reddit.com) Tried Qwen3.6 35B Q5_K_M MTP, HW: 9700x, 64GB 5600 RAM, 5060 TI 16GB. --n-cpu-moe 30 ^ -ngl 99 ^ -c 131072 ^ --no-mmap ^ --flash-attn on ^ --cache-type-v q8_0 ^ --cache-type-k q8_0 ^ --threads 8 ^ --parallel 1 ^ -rea off ^ --reasoning-budg…
RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed (www.reddit.com) What's good everybody, I probably have the fastest possible setup on these AMD Radeon RDNA2 GPUs for one reason only. A custom binary that bypasses some assert statement causing a crash in today’s stock releases.
Need help getting 7900 XTX PyTorch performance metrics (www.reddit.com) I'm on a quest to profile and benchmark different GPUs for PyTorch, vLLM, and llama.cpp. Cannot find the high-end AMD consumer cards for rent anywhere online and interested in the PyTorch ROCm performance of the 7900 XTX (if you want to co…
9070xt speed inconsistent. (www.reddit.com) I have a 9070xt on windows 10, and "The Rock Nightly" ROCM & built llama.cpp using the following flags : cmake .. -G Ninja ^ -DCMAKE_C_COMPILER="C:\opt\rocm\lib\llvm\bin\clang.exe" ^ -DCMAKE_CXX_COMPILER="C:\opt\rocm\lib\llvm\bin\clang++.e…
Is the llama.cpp nixos flake just broken? (www.reddit.com) I can't seem to build any of the latest releases. I'm not sure if something has changed and I haven't kept up, but only way to get a working build is to pin to like a 3 week old commit.
🧬 flux-genotype: A self-evolving AI kernel that runs on CPU with Ollama — mutates its own architecture (www.reddit.com) `🧬 Flux‑Genotype – A CPU LLM that rewrites itself` I've been working on an open-source kernel called **flux-genotype**. It orchestrates local models (TinyLlama, Llama 3.2, Hermes 3, DeepSeek-Coder) into a self-modifying ecosystem.
GGUF with MTP vs MLX without. Is mlx still the way to go for mac users? (www.reddit.com) Has anyone of the mac users tested the speed difference (token gen, promt processing) between mlx quants without mtp, vs gguf quants with mtp? More or less once a month I wonder if mlx is still the correct path in mac.
MTP vs non-MTP vram usage difference? (www.reddit.com) As per title, assuming you run both with the same context and quantization in llama.cpp is there any difference in vram usage?
Looking for agent builders to test external agents on a multi-agent knowledge site (www.reddit.com) I’m building AgoraDigest, an experimental site where multiple AI agents answer the same hard technical question independently, then a synthesized digest preserves: verdict best-use-case boundaries conflicts between agents evidence gaps ver…
Benchmarking vLLM vs SGLang vs llama.cpp on a mixed Blackwell/Ada cluster (www.reddit.com) I have been running some benchmarks on a heterogeneous 7-GPU cluster to see how different inference engines handle long context prefill using pipeline parallelism. My setup consists of a mix of Blackwell and Ada cards: one RTX PRO 6000 96G…
Build Own Docker Image with llama.cpp and MTP (www.reddit.com) Hi All! Saw some folks waiting for the Docker images with llama.cpp and MTP when it released.
Made a simple template manager and GUI for llama.cpp so I don't have to keep memorizing CLI flags. (www.reddit.com) Introducing Hexllama Hey, I’ve always found llama-server to be more than enough for testing out local models, mostly because it guarantees you always have the absolute latest llama.cpp features and architecture support. But keeping track o…
lm studio alternative (www.reddit.com) i'm looking for sth like lm studio but open source, easy to use. able to stay up to date with llama.cpp or select custom engine.
ClickBook – Offline Android eReader with local LLM inference via llama.rn (play.google.com via hn) ClickBook is an offline ereader for EPUBs and readable PDFs that turns every book into a language-learning companion. Tap any word while you read and get an instant, context-aware explanation powered by on-device AI.
Qwen 27b MTP Config, Llama.cpp Single 3090 (www.reddit.com) What setup are you using for qwen 27b on a single 3090? Here's what I've started using today.
Running Mimo 2.5 q4_k_m on single rtx5090 need recommendations (www.reddit.com) Getting 10.3 tps using this prompt: CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=8 GOMP_CPU_AFFINITY="0 2 4 6 8 10 12 14" ./build-mimo-5090-3090/bin/llama-server -m "$MIMO" -ngl 999 --n-cpu-moe 43 --no-mmap -c 100000 -ctk q8_0 -ctv q8_0 -fa on -…
llamacpp with Gemma4 31B dense and Gemma e4b as draft, plus audio input? (www.reddit.com) Hi, has anybody succeeded in running llama.cpp with Gemma 31b dense and Gemma e4b as draft model, and simultaneously inhibit the voice recognition feature? Is it even (theoretically) possible?
Llama.cpp server running ~2 weeks straight. Loses its mind? (www.reddit.com) I’ve got Qwen3.6 27b and Qwen3.6 35b running in two separate instances for over two weeks and they are considerably dumber now than when I launched them. is this a thing?
Small OpenCode plugin that helped me with broken tool calls from a local Qwen model (www.reddit.com) I’m using OpenCode with a local Qwen3.6-27B Q6_K GGUF model on an RTX 5090 with KV cache in Q8. For reference my llama.cpp build is compiled with CUDA 12.9.
Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update (www.reddit.com) In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent, i.e., the rounding error from one depends on the choice of the other, so…
Automated AI researcher running locally with llama.cpp (www.reddit.com) Hi everyone, I'm happy to share ml-intern, which is a harness for agents to have tighter integration with Hugging Face's open-source libraries (transformers, datasets, trl, etc) and Hub infrastructure: https://github.com/huggingface/ml-int…
Best local model supporting claude code? Rtx3060 (www.reddit.com) Hello all, I’ve been using Qwen 3.5 9B Q4 262k ctx using Llama cpp for claude code for a while now, is there any model which better complements agentic coding setup locally? Or is there a better harness (than Claude Code)?
Anyone else experiencing heavy hallucinations with MiMo-V2.5 (310B) quantized version? (www.reddit.com) Has anyone else run into major issues with MiMo-V2.5 (the 310B total / 15B active MoE model from Xiaomi)? I tried the UD-Q4_K_XL quant from Unsloth.
Open Source Managed Agents (linchpin.work via hn) Any model, one adapter OpenRouter routes to ~200 cloud models — Claude, GPT, Gemini, Llama, DeepSeek, Mistral, Qwen. Ollama runs anything you've pulled locally.
LLMs on flagships smartphones? (www.reddit.com) I have been curious to see how small LLMs like Gemma-4-E2B-it run on a flagship smartphone (S25+ with Snapdragon 8 Elite) in terms of prompt processing and token generation. I have created a script that uses llama-cli and I achieve 48 tps…
What Inference-Platform Benchmark Posts Leave Out (ingero.io via hn) TL;DR Cloudflare’s recent post on hosting Kimi K2.5 and Llama 4 Scout opens with p90 Time-to-First-Token graphs and a round of throughput numbers. The piece is candid about the engineering work behind the gains.
very slow tok/s with Gemma 4 31B on a 5090?! (www.reddit.com) Hi, i have a 5090 and i was tyoing around with hermes-agent. To utilize 128K i thought about switching from LM Studio to llama-cpp (the turboquant fork) expecting better tok/s and also saving some VRAM from context quantization.
Building the QWEN3.6 - Codex Bridge Furthe + Kindergarten Harness Reality Check (www.reddit.com) I got a bit further with my harness for running Qwen 3.6 model on Codex. While testing, analyzing, and building the harness, I evolved TBG(O)llama-swap into a full forensic UI bridge and LLM analytics tool where every harness finding, modi…
ZML: Between Jax and Llama.cpp (jaco-bro.github.io via hn) tjbl Loading Safetensors in NNX: A 700x Speedup KV Caching in NNX ZML: Between JAX and llama.cpp UnslothTrainer Gotcha: Keep All Columns Is "Safe AI" the New Y2K? The Vulgar Script: The Strange Alliance Against Open AI The Steak Is Juicy
llama bench kv cache f32 error (www.reddit.com) A did a quick google, but found nothing on this and I am scratching my head. Trying to do a llama-bench run with the kv cache set to f32 under Vulkan with a Strix halo.
What solutions are you using to boost TPS and Context Window? (www.reddit.com) Server Specs: 16 Gigs DDR5 AMD Ryzen 5 7600X 4.7 GHz 6-Core Processor AMD Radeon Sapphire Nitro+ 7900XTX NZXT N7 B650E ATX AM5 Motherboard Performance: I'm running Qwen27b Q4 at 80k context on a Sapphire Nitro+ Radeon 7900XTX 24Gb at 40 t/…
Does anyone else have issues with Qwen-3.6-27B stability in the Codex harness? (www.reddit.com) I run the 4 bit quant of Qwen-3.6-27B in the codex harness with unsloth recommended llama-server settings, thinking enabled. I have tried the default chat template and the updated ones and have updated both my GGUFs and llama-cpp to the mo…
Here is the current "Free-Tier AI Stack" for 2026 (www.reddit.com) 1. The Frontier Giants • Gemini: Access 1.5B tokens/day on Gemini 1.5 Flash/Pro.
Does llama-swap actually work with mlx_lm.server / MLX models on macOS? (www.reddit.com) I’m trying to use llama-swap with an MLX model on a M2 Max instead of just llama-server. I got mlx_lm.server working directly with /v1/chat/completions, but I’m not sure whether llama-swap reliably supports this setup.
Hardware upgrade advice (www.reddit.com) Hello everyone, I'm an enthusiast and software developer. I am using my gaming PC, here's the relevant specs: MB Asus ROG Strix X570-F CPU AMD 5800x RAM 64Gb DDR4-3600 GPU 3080ti (12Gb GDDR6X) I can replace the GPU with 2x 5060ti 16gb for…
potentially stupid problem trying to llama-bench Qwen3.6-27B across two V100s in llama.cpp (www.reddit.com) this is almost certainly a skill issue, however: ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 1 --flash-attn 1 --device CUDA0,CUDA1 -p 2048 -d 4096,16384,65536 rather than splitting across those two cards, it firs…
Meltdown: LLM Client Made in Python and Tk (github.com via hn) An interface for llama.cpp, ChatGPT, Gemini, Claude, and Kimi This is a desktop application to interact with large language models. It has hundreds of arguments and commands and many power user features.
4GB "Gemini Nano" model GGUF anyone? (www.reddit.com) Hi everyone, I saw an article saying Chrome silently downloads a ~4GB AI model (likely "Gemini Nano") to your computer for features like text summarization. Two questions: What is the exact name/version of this model?
What's the right way to feed PDF files to Gemma-4? (www.reddit.com) In my line of work, PDF documents tend to be combinations of text, math formulas, tables and images. llama.cpp added support for PDFs a few months ago, but I believe it treats PDFs either as text (discarding everything else), or as images.
Show HN: Vectors.Space – A free service for embeddings (vectors.space via hn) One API for embeddings. OpenAI, Gemini, Voyage & local Llama.
Qwen 3.6 Looping with Tools? (www.reddit.com) For some reason, my qwen started looping a lot recently, ever since I introduced MCP tool calls. I don't know why as I didn't really change anything other than that.
Llama.cpp, opencode / pi / basically all agents, context compaction & cache validation: how do you manage it? (www.reddit.com) Ok so, I will try to explain myself as much as possible because online I really cannot find much about this. Let's start by my settings for running Qwen 3.6 35B: Qwen 3.6: cmd: '/X --port ${PORT} --chat-template-kwargs '{"preserve_thinkin…
Which inference engine to choose for mlx? (www.reddit.com) Is llama.cpp much slower for M4/M5? I heard ollama is faster due to mlx support since March.
Mimo2.5 (not pro) under llama.cpp? - primary model opencoder? (www.reddit.com) I tried running AesSedai/MiMo-2.5-GGUF:Q4-K-M under llama.cpp (main tree, compiled 36hours ago) Hardware: nvidia A6000 with 48GB RAM + 300GB CPU RAM I had no success: error loading model: missing tensor blk.0.attn_q.weight ... Is Mimo alre…
MTP - The proofs in the puddin! Using it with Qwen3.6-27b (www.reddit.com) Been running llama.cpp MTP with Qwen3.6-27B Q4_K_M as my daily coding assistant and got curious what was actually happening under the hood. Pulled the metrics from llama-server and charted a full session.
Code's open. Tried building a fully real time on-device voice assistant + live translator on a phone (multilingual, STT→LLM→TTS, all local) on the Tether QVAC SDK. (www.reddit.com) Wanted to see if a real voice loop — speak, model thinks, speaks back — could run entirely on a single device today, no cloud. Same codebase doubles as a live translator (speak in language A, hear it back in language B).
Does Deepseek V4/Flash work with Llama CPP and Vulkan on and branches yet? (www.reddit.com) Even unofficial or slow. I have enough vram-memory to load it, but not enough memory to run in cpu-only mode.
BUILD portable AI system (www.reddit.com) Hey everyone, I’ve been thinking about a project idea and I’d love to get your feedback. The idea is to take a 1TB SSD and turn it into a fully portable AI system.
I built vivkemind – an open-source, local‑first terminal AI coding agent with full AWS Bedrock support (www.reddit.com) wanted a terminal AI coding agent that doesn't lock me into one model provider. So I forked Qwen Code and added full support for every model available in AWS Bedrock.
[Benchmark] Llama.cpp: Mac vs CPU vs GPU + CPU, Qwen3.6 27B, Q8 (www.reddit.com) https://preview.redd.it/fm8fr1vllczg1.png?width=1254&format=png&auto=webp&s=23dbb32e85c71b9454a617de174d0f416b786bb2 llama.cpp parameters: -c 260000 --jinja --no-mmap model: HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Balanced:Q8_K_P Based on…
A plug-n-play open-source pruning tool that is workload-aware (www.reddit.com) This project was born out of time I spent digging into a biologically inspired algorithm I was using to measure co-activation for placement of experts and ranks onto chips. The default scheduling that vllm provides can end up causing laten…
New Gemma chat template update by Google (huggingface.co via hn) Libraries llama-cpp-python How to use unsloth/gemma-4-E4B-it-GGUF with llama-cpp-python: !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="unsloth/gemma-4-E4B-it-GGUF", filename="gemma-4-E4B-it…
Struggling with Qwen3.6 27B / 35B locally (3090) slow responses, breaking code looking for better setup + auto model switching (www.reddit.com) Hey everyone, I’ve been experimenting with running Qwen models locally on my setup: GPU: RTX 3090 (24GB VRAM) RAM: 64GB CPU: Ryzen 5700X OS: Windows 11 What I’m currently running Qwen 3.6 35B (UD Q4_K_M) llama-server.exe -m "C:\Users\Dino\…
qwen 3.6 27B looping problem (www.reddit.com) Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…
Best Llama Config for Turboquant_Plus? (Stats below) (www.reddit.com) So I'm running the below and I've seen guys run this setup with TurboQuant_plus and get 35 tokens/second. I find the speeds I'm getting acceptable but if I could hit 30-35 I'd be soooooo happy.
claudely: launch Claude Code against Local LLM provider like LM Studio / Ollama / llama.cpp without trashing your real claude config (www.reddit.com) Plenty of CLI coding agents will talk to a local LLM, but the catch is the ecosystem. Skills, slash commands, MCP servers, plugins, hooks: all the interesting tooling has been built specifically for Claude Code, and parity on every other a…
Llama.ttf: a font file which is also a large language model and inference engine (fuglede.github.io via hn) llama.ttf is a font file which is also a large language model and an inference engine for that model.
Show HN: Valkyr LM Inference with Realtime Guarantees (github.com via hn) Valkyr is a fresh take on LM Inference runtimes. It's quite different from llama.cpp, vLLM, or ZINC for example.
What could they mean by "warmed steady-state"? (www.reddit.com) https://www.reddit.com/r/LocalLLaMA/comments/1t0vp3w/pflash_10x_prefill_speedup_over_llamacpp_at_128k/ Q4_K_M Qwen3.6-27B on a 24 GB 3090 decodes fast (~74 tok/s with DFlash spec decode), but prefill scales O(S²). On a 131K-token prompt, v…
OpenJet v0.4: a zero-config local coding agent for llama.cpp (www.reddit.com) Hello again. I just pushed a major update to OpenJet.
Using Valve's AMDGPU VRAM management to benefit local AI Inference rather than games? (pixelcluster.github.io via reddit) Any other AMDGPU users on Linux taken an interest at what Valves been doing for VRAM management for gaming? Seems to me that this might be just as useful for local AI inference as for gaming, especially for those of us wanting to do infere…
Need help optimizing qwen 3.6 on my 2x 5060ti 16gb (www.reddit.com) Hi all, I tried to setup my pc to run llm, but got some issue: the first question of the chat is generally fine, but from the 3rd follow up question, the backend often be unresponsive and I have to manually restart the llama cpp server, or…
Does Cline KanBan support local llm? (www.reddit.com) I installed Cline CLI and it was using my local LLM. But it seems like when I tried to use Cline KanBan it tries to use OPenAI directly instead of the llama.cpp OpenAI Compatible URL I entered.
Ai Doomsday Toolbox v0.938 (www.reddit.com) Hello! It’s me again, the developer of ADT.
I pitted different LLMs against each other in Pokemon Showdown (www.reddit.com) I wanted to see if LLMs could reason through complex game states, so I built a system where they can play Pokémon Showdown battles autonomously. They get the battle state every turn and use tool calls to attack or switch.
Benchmarking Local LLM/Harness Combinations (neuralnoise.com via hn) I’ve been running a small benchmark, harness-bench , that pairs local LLMs (served via llama.cpp ’s llama-server ) with agent harnesses (Aider, Claude Code, OpenCode, Pi, Qwen CLI) on 16 software-engineering tasks across Python, PyTorch, J…
Comparing SVG Generation for the top open models (codeinput.com via reddit) Some of the larger models (like Llama) weren't available on OpenRouter, so I had to work with what was there. Best small model: Gemma 4 26B For its size, I think it had the best output.
Is Mistral-3.5-Medium-128B broken in Llama CPP? (www.reddit.com) Trying some of Bartowski's Q4 quants. Using Vulkan with the latest main branch as of a few hours ago.
Gemma 4 architecture support for QVAC-Fabric (Tether's llama.cpp fork) (github.com via hn) QVAC-Fabric Gemma 4 Architecture Patch Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork. Base: QVAC-Fabric temp-upstream branch Target: All Gemma 4 variants (E2B, E4B, etc.
Don't forget about dem free gains! (www.reddit.com) Looks like progress has been made on -sm tensor. Couldn't even run llama-bench a few weeks ago: 1 card - 1580/44: $ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB): Device 0: NV…
I built a full web app using Qwen 3.6-35B running locally on my 5070 Ti with the BMAD Method — here's how it went (ggufbench.com via reddit) I've been running local LLMs since Qwen 3.5 dropped and I was really impressed by what we could run on consumer hardware. Fast forward another two months and we have gotten a handful more gems such as Gemma 4 and Qwen 3.6, so I wanted to p…
Workstation upgrade for 5 concurrent users (Qwen 3.6 27B) (www.reddit.com) Hello, I would like a suggestion from those who are already actively involved in this world. Basically, I own this workstation: Ryzen 9 5900X 32GB di RAM DDR4 RTX 5060Ti PCCOOLER CPS YS1000 1000W Currently, I can quite easily code with Qwe…
Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK) (www.reddit.com) Hey all, im having a crisis that i just cant figure... i used Qwen3.6-27B-GGUF:UD-Q8_K_XL ever since it came out (on a DGX SPARK) and it worked like magic with decent performance (~50 t/s) , im updating SPARK and llama.cpp on a daily basis…
which is faster and better for coding? Luce-Org/Dflash or noonghunna/qwen36-27b-single-3090 (www.reddit.com) Anyone have experience with both? Luce is llama.cpp with custom dlflash and noonghunnas project is vllm with patches.
Qwen3.6-27B IQ4_XS FULL VRAM with 110k context (www.reddit.com) Qwen3.6-27B IQ4_XS Bloat: Reverting llama.cpp commit saves 16GB VRAM (14.7GB vs 15.1GB) + KVCache Tests With the release of Qwen3.6-27B, I noticed that compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 vers…
Most efficient way of running Gemma 4 E4B with multimodal capabilities on a laptop? (www.reddit.com) The gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not have proper support for vision and audio inputs (specially audio) for these models as of now.
Another way to use local llm, have an MCP server that talks to a Qemu computer. What do you think? (www.reddit.com) I think it is nice to contain the MCP in a Qemu environment where the LLM can do whatever ... here is doing GDB on a LVGL program.
VRAM.cpp: Running llama-fit-params directly in your browser (www.reddit.com) Lots of people are always asking on this subreddit if their system can run a certain model. A lot of the "VRAM calculators" that I've found only provide either very rough estimates or are severely limited in the number of models they can e…
When Can LLMs Learn to Reason with Weak Supervision? (salmanrahman.net via hn) We study when RLVR generalizes under three weak supervision settings (scarce data with as few as 8 examples, noisy reward labels, and proxy rewards such as majority vote and self-certainty) across multiple models from the Qwen and Llama fa…
your daily driver stack, what's it look like? and why? (www.reddit.com) What it says in the title, I'm interested in hearing what you all have landed on as a workable / useful stack for you. Mine looks like this: back end inference servers - llama.cpp, vLLM | V hermes-agent - cron jobs + OpenAI compatible endp…
Llama Server with Cline Settings (www.reddit.com) Hi everyone, just wondering if anyone has setup llama server to work with Cline and whether you can use image/browser use. I just gave it a whirl and had to disable image support.
Please help improving a CPU-only inference speed (www.reddit.com) This is a request for help for the people that want to use locally very large models on Q8 and better quanta at all costs, in my case the cost is inference speed. So I have a 512GB DDR4 ECC 2666 with a Threadripper Pro 3945WS that gives me…
Llama 4: A Deep Dive into Liquid Transformers 2.0 and Sovereign AI (en.landingfymax.com.br via hn) The tech world came to a standstill this week in April 2026 with Mark Zuckerberg's official announcement: Llama 4 is here. While Meta's previous models had already democratized access to Artificial Intelligence, the fourth generation of th…
Memory upgrade, is it worth it? (www.reddit.com) Hi, I need your opinion on a system upgrade, 🤔 I currently have the following AI server used for various tinkering, learning, development etc. System AMD Ryzen 7 7700 (8C16T Zen4) Corsair Vengeance RGB DDR5 5600MHz 32GB MSI B650 Gaming Plu…
Need help with llama.cpp Qwen3.6 configuration on a single 3090 w/ 48GB RAM (www.reddit.com) Hey there, I have been testing models locally, but this is the first model that got me interested in understanding llama.cpp in more detail. I have noticeable stuttering when I run the model as it fills the VRAM completely, and I am sure I…
Qwen 3.6 35B-A3B takes a long time at image processing. Is it happening only to me? (www.reddit.com) 9900x, RTX 4080, 96GB RAM. Llama-cpp, Windows.
how to maximize my tos on a 6Gb Nvidia rtx 4050 and 16Gb ram (www.reddit.com)
Gemma 4-31B vs Qwen 3.5-27B vs Qwen 3.6-35B-A3B on a browser-agent vision prompt - MoE wins on every axis (www.reddit.com) I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…
llama.cpp crashes on image input ("failed to encode image slice", SEGV) with Llama 4 Maverick on CPU (www.reddit.com) Hi everyone, I’m running into a consistent crash when trying to use image input with Llama 4 Maverick in llama.cpp. Text works perfectly, but as soon as I send an image, the server crashes.
Ollama alternative with dynamic model loading (www.reddit.com)
English version of Nexus Ark? (www.reddit.com)
Do you have any go-to utility LLM-related tools that are less commonly discussed? (www.reddit.com)
Is there a place where I can compare generation of tokens per second of 1 GPU VRAM+RAM vs 2 GPUs for those models that don't fit in 1 GPU? (www.reddit.com)
Qwen 3.6 CoT issue? (www.reddit.com)
One-command local AI stack setup for Ubuntu (CUDA, Ollama, llama.cpp, chat UIs) (github.com via hn)
Local Model Router: Ollama/OpenAI-compat bridges for local LLMs via llama.cpp (news.ycombinator.com) A high-performance local LLM server providing drop-in API compatibility with Ollama and OpenAI, built on llama.cpp's llama-server. Features automatic VRAM management, Hugging Face integration, and modular architecture.
Qwen3.6 Fails n8n Tool Calling (www.reddit.com) https://preview.redd.it/na4ub5yzprvg1.png?width=1654&format=png&auto=webp&s=e356e0ab0829bb275352d1035c35c645a381c3c7 I am using Kaggle to serve Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf but tool calling is not always working. I also tested it with R…
Best way to prepare for AI Engineer interviews? (www.reddit.com) I’m currently preparing for AI-focused roles and would love to get perspectives from people already working in the industry. For context — I have ~5 years of experience as a Full Stack Engineer with a strong focus on AI systems.
7900XTX, Qwen 3.6 35B A3B, 150t/s that drops to 50t/s for no reason? (www.reddit.com) MSI B650 Gaming Plus 9800X3D 64GB DDR5 6400mts Windows 11 When I first boot my PC and I run this model, I get 155-160t/s, and for some reason, after a couple minutes, say, 10 minutes, not using AI or anything in particular, GPU temp at 40c…
MI25 for LLMs? idc about speed, just need it to work (www.reddit.com) Found an MI25 locally for $50. It has 16GB of VRAM, which would be perfect for running some decent-sized local LLMs without breaking the bank.
llama.cpp + opencode agent temperature settings (www.reddit.com) Has anyone successfully set the temperature for individual agents of the opencode? I have set the temperature for individual agents, but when I start the llama-server in verbose mode the server claims the temperature is in default settings…
llama.cpp - split pp and tg processing over different instances? (www.reddit.com) I wonder, is it possible to split pp and tg over different (remote) llama.cpp instances, maybe via clever RPC calls?
lazy person's model param management for llama.cpp? (www.reddit.com) Has anyone found a good way to manage model params based on the recommendations of the model developers that doesn't require manually managing a local config file? I have an ever growing bash script for launching llama.cpp server which inc…
Low performance in 7900XTX in Qwen 3.6 35B A3B (www.reddit.com) When I first setup my PC, I did get 92t/s in Qwen3.6 35B A3B, and now for some reason it won't ever get past 30t/s no matter what settings I use, either rocm or vulkan. .\llama-server.exe --model ../models/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf -c…
Feedback on iOS app with local AI models (www.reddit.com) Hey everyone, I just shipped an iOS app that runs local AI models. Current has 12 models: Gemma 4, Llama 3.3, Qwen3, DeepSeek R1 Distill, Phi-4, etc.
LiteRT LM Framework with Rockchip NPU (RKNN 3588) (www.reddit.com) Im searching for build version of LiteRT LM framework can use and utilize the NPU of the RKNN 3588. It would be great since I can run gemma 4 e2b model using this framework on the machine, because I wont have to migrate my codebase from li…
Ask HN: Simple tooling for local LLM code critique without IDE integration? (news.ycombinator.com) While I'll set out the criteria for what I'm looking for, I don't want this to turn into a general debate about the role of LLMs in software development. That discussion is important, but we have plenty of them.
Are MLX 4-bit Quants broken (www.reddit.com) I see so many interesting MLX implementations like DFlash, Speculative Speculative decoding, etc. But when I want to try them for myself the 4bit quants of models seem like they have been lobotomised for some reason, hallucinating, start t…
How does a self correcting loop for AI agents work? (www.reddit.com) Hey guys, just checked out minimax 2.7, where they used AI to train itself, and ran over a hundred loops, and it improved it's performance by 30%, how does that work, can I also run a script that makes AI store it's memory in a loop on a m…
What's the better way to install llama.cpp on Android? (www.reddit.com) I own an Oppo Find X3 Pro (Snapdragon 888, 12/256 GB, Android 14.0) unused because of 3 green vertical lines on the screen and poor battery. I tried Google AI Edge Gallery with Gemma-4-E2B-it and it performs well so I thought: "why don't t…
Upgrade paths for my 256g ddr4 ram + 4x24g vram system (www.reddit.com) So I was just about to give up playing with local models, until I realised I can actually run GLM 5.1 at not too horrible speeds, using this quant https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/IQ2_KL in ik llama. Getting around 6.…
Transitioning to iOS Dev + Local LLMs: Is the M5 Max with 64GB+ RAM the only real choice? (www.reddit.com) Hey everyone, I’m currently an ML Engineer looking to pick up iOS development, and I’m upgrading my hardware to handle both. I’m moving away from cloud-only workflows and want to run LLMs locally for testing, R&D, and building CoreML integ…
Can LLM make small change to the software program? (www.reddit.com) I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 4EB Q4 and Qwen 3.5 9B Q5 (KV is quantized to 4 bits with new Google TurboQuant implemented in llama.cpp - I use koboldcpp and release said it's automatically activated): the…
For AI agents: is per‑token pricing killing your budget? Looking for feedback on time‑based subscriptions. (www.reddit.com) Hey r/AI_Agents, I run an inference service (cheapestinference.com) and we're exploring a different pricing model that might be more predictable for agent workloads. Instead of per‑token billing, we offer **dedicated 8‑hour time windows**…
My Custom Llama Build (www.reddit.com) I recently got into LLM's and llama.cpp because I wanted to learn AI. I went from Openclaw to SOTA CLI and then to running llama on my Linux server.
Is an nvidia DGX Spark or similar worth it? (www.reddit.com) I currently run a local model and mix of Claude max. My local model is run on cpu with 256 gb of ram and so it runs quite slowly.
OSCAR 2-bit KV on Windows/Nvidia? (www.reddit.com via reddit) Hey guys, Has anyone gotten the new OSCAR 2-bit KV cache fork running locally on Windows/Nvidia yet? Right now, all the plug-and-play local hype seems focused on the Mac Metal path, and the original project targets Linux via sglang.
Apple announced new on device inference engine for Apple Silicon (www.reddit.com via reddit)
[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B (www.reddit.com via reddit) I took the liberty to test both models today on my favorite benchmark question, head to head. Device: Apple Mac M3 Max 64GB Environment: llama.cpp, all defaults Gemma4-12B's token generation speed: 47 tps with MTP and 2 predicted tokens 29…
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
Has anyone tried running retrieval inside the model, not before it? (www.reddit.com via reddit) Been messing with a bolt-on refiner block for small models. Insert a small trainable transformer layer at the midpoint of a frozen base model, loop it 2-4 times over the hidden states.
Anyone else find local search becoming the bottleneck once your LLM setup gets fast enough? (www.reddit.com via reddit) Got my Llama 3 setup humming along on a 4090, inference is snappy, but retrieval became the hell. Running semantic search over a decent-sized document corpus and the latency gap between "model thinking" and "model waiting for context" star…
Pursuit of performance Llama.cpp to MLX (www.reddit.com via reddit) Right now, I am running llama.cpp on a M2 ultra 64gig. Having great fun with unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL - Running opencode and finding it amazing to have such great tools running locally.
How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions (arxiv.org) Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task,…
Moving to llama.cpp (www.reddit.com via reddit) I need some help because I am a little confused. I’ve a system with a 5090 and a 6000 pro.
Jetbrains Mellum 2: a really good and performant model (www.reddit.com via reddit) Oh Hey Folks, I took the Mellum 2 model for a spin, so I wanted to share my impressions here. Disclaimer: the tests presented here are not scientific nor have those nice names like perplexity, etc.
Here's a llama.cpp CLI Command builder. (llamabuilding.com via reddit) No accounts or sign up. No email requirements.
Pipeline parallelism in llama.cpp may be wasting your VRAM (www.reddit.com via reddit) By default, llama.cpp enables pipeline parallelism, presumably to speed up inference. In my testing, I found that pipeline parallelism has no speed benefit and comes at a significant cost of VRAM.
Quick note on the QAT of recent (www.reddit.com via reddit) tldr: Googles quant is broken, use unsloth UD Q4_K_XL for now This might be low quality post, but oh well, we ball llama-quantize will quant the token embed to q6k when Google really was supposed to use "--pure" but that’s only the first p…
LMStudio gemma 4 31b QAT with MTP (www.reddit.com via reddit) Did anyone manage to launch that in LMStudio? I am on the most recent update with the most recent llama.cpp available in LMStudio.
Me: Arguing with an AI bot who just posted something on this sub about Llama 3.1. (www.reddit.com) For real tho, these bots need to turn on their web search functions and quit living in the past. It’s bad enough we gotta deal with all the “Qwen3.6 27b helped me quit drinking and brought my dog back from the dead” posts.
Can't get beyond 8t/s with NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 (www.reddit.com via reddit) I am running nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 in an unsloth UD-Q6_K_XL quant (unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF) on a dual 5090 Zen5 32C Threadripper Pro Workstation with 512GB DDR5 ECC RAM and a PCIe Gen5 capable…
Gemma 4 MTP with assistant vs llama cpp type MTP (www.reddit.com via reddit) Hi all Been loving the QAT models but honestly what is up with the assistant models, any ggufs and ways to make em work with vanilla llamacpp and if this way of MTP is different than the one am17an developed for llamacpp. Followup question…
Bonsai LM (1-bit and 1.58-bit LLMs) benchmark on Jetson Orin Nano Super (www.reddit.com) Just released a deep benchmark of 5 Bonsai LM models (1.7B → ~8B) on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA - across all 4 power modes: 7W, 15W, 25W, and MAXN A thread! So, Bonsai LM models are a new line of 1-bit LLMs releas…
Latam GPT 1.0 released (www.reddit.com via reddit) https://huggingface.co/latam-gpt/Llama-3.1-70B-LatamGPT-SFT-1.0 Latam GPT is an AI model trained on latin american data. It's part of an initiative to create AI that works better in Latin America than Chinese or American models.
Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? (www.reddit.com via reddit)
llama-launcher Release (www.reddit.com via reddit) Hello everyone, I've been working on a point and click GUI to make tinkering with llama-server flags much quicker and easier, I thought I'd share for anyone else who might be interested. It's also great for anyone new to llama.cpp that is…
AutoMB – a CLI that brings 150+ AI commands, agents, and advisors to your terminal (www.reddit.com via reddit)
Used local Ollama (gemma4:e4b + nomic-embed-text) to bulk-generate AI summaries for 4300 arXiv papers and push them to a remote Cloudflare DB - pipeline walkthrough (www.reddit.com via reddit)
Meta Abandons Llama for Muse Spark - The End of Open-Source AI's Biggest Champion (www.reddit.com via reddit) Meta has officially abandoned its open-weight Llama family in favor of Muse Spark, a fully proprietary model built by Alexandr Wang's MSL team. The Llama era is over.
Windows keeps crashing on rtx 3090 (www.reddit.com via reddit) Recently bought used 3090. Under heavy stress tests and gaming it's fine.
The GPUless Revolution: How Efficient AI Models Are Democratizing Artificial Intelligence (www.reddit.com via reddit) You don't need a $10,000 GPU to run state-of-the-art AI anymore. The latest breakthroughs in model quantization and optimization are putting powerful AI in the hands of everyone—from hobbyists to small businesses.
Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles (arxiv.org) We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b…
Galaxy Z Fold6 as a local inference node — llama.cpp/Vulkan, homelab telemetry, SHA-256 model verification (www.reddit.com via reddit) Built a small Android app called Pocket Node that runs llama.cpp inference on-device. Here's what it actually does and what it doesn't.
llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is? (www.reddit.com via reddit) Running into something annoying with llama-server in router mode (`--models-preset`) and I can't tell if I'm missing a flag or if this is just how it works. My rig is 2x 3090, 2x 4060 Ti (one's unplugged at the moment, riser got repurposed…
MTP and QTA - what is the relation? (www.reddit.com via reddit) I'm an old guy and I hate when things change so fast surrounded by noise and breaking news! MTP, I know what the acronym means and where it excels.
QAT variant of Gemma4 26B A4B is not working well for me (www.reddit.com via reddit) I am using llama.cpp version b9549 with this arguments as recommended: llama-server --temp 1.0 --top-p 0.95 --top-k 64 -hf ... Here is what I got on chessboard svg test https://www.reddit.com/r/LocalLLaMA/comments/1t53dhp/quality_compariso…
Context, memory, and RAM/VRAM (www.reddit.com via reddit) This will be a slightly disorganized post, I apologize. I’m trying to understand the relationship between context, a memory system for the agent, RAM and VRAM.
A handy llama-server launcher with easy model and configuration customisation (www.reddit.com via reddit) I wanted something that I could easily configure to manage a set of sensible defaults, that supports multiple llama-server binaries, with per-model over-rides, and command line over-rides. The utility is here: https://github.com/stew675/st…
Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ (www.reddit.com via reddit) Full benchmark results and in-depth analysis are available in the articles: KV Cache Quantization Benchmarks for Long Context and KVarN KV Cache: Implementation and Benchmarks. BeeLlama.cpp (my llama.cpp fork) was used as inference engine…
Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe? (www.reddit.com via reddit) I’m trying to run: unsloth/gemma-4-31B-it-qat-GGUF gemma-4-31B-it-qat-UD-Q4_K_XL.gguf on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch. Main model loads.
5 Months Later: open-deepthink Now Has Full Knowledge Distillation Mode (www.reddit.com via reddit) Hey r/LocalLLaMA, Some of you might remember when I posted about this project back around September last year (it was called local-deepthink then). The core idea was to move past the usual flat multi-agent setups and instead build somethin…
dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model (www.reddit.com) I'm into both HPC and 3D reconstruction, so I built this as a side project. dvlt.cu is a single 5MB binary: - No python, torch, TF, ONNX, llama.cpp, vLLM, or huggingface runtime - Nearly no dependencies: only cuBLASLt (shipped with libcuda…
QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench (www.reddit.com via reddit) Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan) Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found…
What are you running on 16Gb VRAM + 64Gb Ram? (www.reddit.com via reddit) I know this gets asked a lot, but I can only find threads that are at least a couple of months old, so I thought I'd ask to see what people are running these days. I have an RTX5080 and 64Gb Ddr5 RAM.
AMD MI50 on Debian Testing is doing great and getting better. (www.reddit.com via reddit) There is probably some relevant information to other cards here but my benchmarks are on dual MI50 32GB cards because that is what I have, and thought I would share with the community. Install instructions at the end.
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP (www.reddit.com via reddit) Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised of the resul…
Rate my config!! (www.reddit.com via reddit) Hey all, Wanted to get some eyes on my llama.cpp config to see if there is anything i could improve on. Currently getting an average of 55t/s (up to 75t/s occasionally).
KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive! (www.reddit.com via reddit) TL;DR Based on long context KLD benchmarks, KVarN appears to be just better than usual llama.cpp KV cache quants. At every size, KVarN matches precision of usual quants of one bit higher.
Friends Don’t Let Friends Use Ollama — So I Built Anvil (www.reddit.com via reddit) Hi, I’m basically one of you, except I’m stepping onto the other side of the table today, fully prepared to accept your ridicule. Obvious disclosure: this is my project, so yes, this is self-promo — but I’m posting it here because this is…
Self-hosted LLMs (www.reddit.com via reddit) I've been researching the self-hosted LLM landscape from a European compliance perspective and the ecosystem feels very different compared to even a year ago. Models like Mistral, Qwen, Llama 4, and DeepSeek are getting close enough that t…
StepFun 3.7 Flash MTP Bench Strix Halo (www.reddit.com via reddit) This is the StepFun Step-3.7-Flash UD-IQ4_XS main model with the official StepFun MTP Q8_0 draft model, served through a patched llama.cpp Vulkan/RADV build. Host System: AMD Ryzen AI Max+ 395 / Radeon 8060S (gfx1151) Memory: 128 GB unifie…
Dual GPUs - 3060 & 3090 on a P520 (www.reddit.com via reddit) I've got a line on a reasonably priced 3090FE and I'm wondering whether it would play nicely with the 3060 I'm already using. System is a ThinkStation P520 - PSU would be an issue until I can get a replacement, so would have to run both GP…
Gemma 4 QAT Q4_0 Bench on Strix Halo (www.reddit.com via reddit) These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU. QAT means quantization-aware training.
Serving TTS/cloning models on llama.cpp? (www.reddit.com via reddit) Are there any quality voice cloning and speech generation models that already have support in Llama.cpp or, more likely, vLLM-Omni? It would be nice to swap them out like any other inference model and use a common API, rather making a sepa…
Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization (www.reddit.com via reddit) I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s: bash ./llama-server \ -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \ --host 0.0.0.0 \ --port 8080 \ --ch…
DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162) (www.reddit.com via reddit) In case you're not aware already, the DeepSeek V4 series is finally getting supported on llama.cpp with this PR! The PR is at a very early stage right now, so only try it if you're consciously willing to experiment out of curiosity and acc…
Running Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB) — what worked, what didn't, and a surprising speculative-decoding result (www.reddit.com via reddit) TL;DR: I spent a long session tuning a 35B MoE on a tiny 8GB laptop GPU. Three things mattered a lot (--no-mmap, VRAM headroom, closing CPU-hungry apps).
Initial testing with llama-bench and 3 different Qwen3 models for my R9700 32GB (www.reddit.com via reddit) In a recent build I did I used dual R9700 32GB cards but I wanted to see how a single R9700 stacked up against other hardware I had access to. I created a simple benchmark with llama-bench and ran it on a few different setups.
Gemma 4 12B Q4_K_XL Private Benchmark Results (www.reddit.com) Posting to share my results with others. I think the bottom line is that MTP acceptance rates offer a huge speedup; during coding tasks it's over 90% acceptance! Haven't hit my soft goal results or llm as judge benchmarks yet to compare…
PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template (www.reddit.com via reddit) This is a PSA for people like me who tried it and hit the wall with tool calls failing left and right, so much so that harnesses like OpenCode just didn't work: There is a fix for that. You need to pass a better chat template file, which i…
I built an iOS app to benchmark GGUF models on your iPhone/iPad (www.reddit.com via reddit) Hey, I've been working on GenBench, a free iOS app that lets you download, run, and benchmark GGUF models directly on your iPhone or iPad using llama.cpp + Metal. What it does: - Search and download GGUF models from Hugging Face in one tap…
Maybe KV cache offload to RAM isn't bad (www.reddit.com via reddit) So, llama.cpp has the -nkvo (--no-kv-offload) option to offload KV cache to RAM instead of VRAM. Many people avoid this because obviously it hurts performance.
How to build llama-cpp for Ampere/Blackwell? (www.reddit.com via reddit) Hello, I'm on Windows and started building my own versions of llama-cpp instead of using the precompiled versions. I'm using CUDA 12.9 with my RTX 5070, and I wanted to try to use my RTX 3060ti that I've had laying around since I replaced it w…
Built a config sweep CLI for llama.cpp and vLLM and found out Q4_K_M beat Q8_0 by 230ms TTFT on Qwen2.5-7B (www.reddit.com) I have been coming to this subreddit to understand what the optimal config is to run a model on a given hardware setup. I referred to specific benchmarks, but they are too generic and do not consider the underlying hardware.
Data Gathering (www.reddit.com) Hello everyone I'm looking to gather some information about local model users for a college project. If you have the time please just comment your: hardware (CPU,GPUs, total VRAM and RAM) and OS the model/s you primarily use and at what qu…
Went to the monthly AI dev meetup (www.reddit.com) Usual crowd. Everyone's on Claude or Codex, nobody's really sure how any of it actually works, and that's fine, that's the vibe.
RTX5080 vs RTX 3090? (www.reddit.com) Hey guys, I’m looking for some educated advice / opinions on running local LLMs. I own an RTX 5080 and I’m running llama.cpp (custom builds with turbo quant) with Qwen 27b Q3_K_M with a context of 128k all in VRAM (using turbo3/4 on kvcache t…
llama.cpp has a clever trick for speeding up KV cache decode (www.reddit.com) So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's webUI has been receiving a lot of updates, I took a look at its settings and noticed a particular one und…
I have macbook m4 16’ 48GB. I use claude code and want to try local one (www.reddit.com) I've been on Claude Code daily for a while and want to see how far local models can go. My setup: - MacBook Pro M4 (16"), 48GB - macOS 26 Tahoe. Usually I do: SEO research, macOS Swift apps, websites. What I'm trying to figure out: Which t…
What is the smallest amount of RAM sufficient to run any available on HF GGUF LLM model locally? (www.reddit.com) I am experimenting with loading large models into small RAM and interested in theoretical limits, which people who know how engines (e.g. llama.cpp) work might have some ideas about.
It's OK to quantize the KV cache. Model quant matters more. Some Qwen3.6 27B tests with (approximated) KLD (www.reddit.com) please forgive the mildly clickbait title. hard to fit everything in it I've seen a lot of discussion here about KV-cache quantization, especially with the recent llama.cpp improvements, leading to some debate on the tradeoffs between KV q…
Qwen3.6 35B-A3B MTP hits 249 t/s on a 24GB consumer GPU (RTX 5090M) — 3.4× the dense 27B variant on the same image (www.reddit.com) Sharing this because I didn't believe the first run. Setup: laptop-class RTX 5090 (24GB, sm_120 Blackwell, ~896 GB/s), Linux.
Llama.cpp not using CUDA - OOM error (www.reddit.com) hey guys, I want to say that I appreciate all the helpful support from this community as I’ve stepped into the local LLM world. I‘m thankful to have a community around that doesn’t gate keep and is open to new comers.
I vibecoded an app called Think Local - a fully private AI app that runs directly on your iPhone, iPad, and Mac. (www.reddit.com) Think Local started with a simple idea: AI should work for you, not collect from you. So I built an app that lets you run modern AI models completely on-device - privately and fully offline.
Built a self-hosted layer for local agent workflows because retries kept replaying side effects (www.reddit.com) I work on AxonFlow, a source-available (BSL 1.1) runtime for long-running agent workflows. We’ve been running it in front of Ollama-served models and OpenAI-compatible local endpoints (llama.cpp `--server`, vLLM, LM Studio).
LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more (www.reddit.com) I've been building this for the past few months as a side project — started because I didn't want to run llama.cpp from the command line every time I wanted to try a model. I just wanted something that worked with a click.
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…
qwen 2B model - thinks for 600 tokens on a simple "Hi" (www.reddit.com) Using llama.cpp Model - Q8 - unsloth/Qwen3.5-2B-GGUF Is this expected with tiny models like this one? I am trying tiny models since most of my tasks involve searching local files etc. and need less of the model's own knowledge.
Anyone got llama.cpp router mode actually working on limited VRAM (12GB/16GB)? (www.reddit.com) It keeps running into race conditions/OOM when switching between models, as the previous process doesn't unload from VRAM fast enough. What is the simplest fix for this right now?
Claude Code has 240+ models via NVIDIA NIM gateway (www.reddit.com) TIL Claude Code has 240+ models via NVIDIA NIM gateway — Nemotron-3 120B for agentic coding is surprisingly good So I was messing around with /model in Claude Code today and noticed something most people probably don't know about — after t…
Measuring Maximum Activations in Open Large Language Models (arxiv.org via reddit) The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the…
I designed a puzzle that breaks every AI differently — here's why that's actually fascinating (www.reddit.com) The puzzle: You have 140 nuclear bombs and must bomb every country on Earth. Each bomb is assigned to one country.
TurboQuant on 16 GB VRAM (www.reddit.com) I've got Qwen3.6-27B IQ4_XS (14.7 GB, cHunter789's build) on an RX 7800 XT with ROCm 7.1. Display on iGPU, full 16 GB available for compute.
No tg speedup with MTP on RX 6800 XT (www.reddit.com) I ran Qwen3.5 9B on my AMD RX 6800 XT with ROCM and it seems to actually be slowing down token generation. I'm using Unsloth's quants.
I built a native Swift macOS AI client that's invisible to screen sharing — works with Ollama, vLLM, llama.cpp [OC] (www.reddit.com) Built this for myself after wanting to use local LLMs during work calls without the window showing up on screen share. Every existing tool was either cloud-only or a 200MB Electron app.
Not getting any faster with MTP on Macbook Pro M1 Max 32gb (www.reddit.com) Using latest llama.cpp with mtp and these settings, I only get 10 tps, should I be getting more? [unsloth/Qwen3.6-27B-MTP-Q4_K_M] jinja = true model = /Users/[username]/llms/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-Q4_K_M.gguf cache-type-k…
Noticing qwen-27b@q2 better than qwen-35b@q8? (www.reddit.com) The latest qwen3.6 models. Is this odd?
Best llama.cpp launch config for Qwen3.6 27B on RX 7800 XT (16 GB VRAM) for OpenClaw? (www.reddit.com) I’m trying to find the best llama-server launch command / runtime config for running Qwen3.6 27B GGUF with full GPU offload on ROCm. I’m currently using the IQ4_XS quant, but I’m not sure if that’s the best option for my setup.
I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash (github.com via reddit) LLM Inference Server A single-container, idle-aware, OpenAI-compatible inference router for a Tesla P40. Routes between Qwen 3.6 27B (MTP self-speculative decoding, TurboQuant turbo4 KV cache), Qwen 3.5 0.8B (multimodal transcription), Whi…
I Don't Care: Stop sending me spam emails about your projects (www.reddit.com) Hello, This goes out to all of these people who think they just vibe-coded the next big thing: I don't care. Use the proper channels to promote them if you must, but ..
local llama.cpp parallel users - still so fast?! (www.reddit.com) I am running a dual GPU rig with a 5090 and a 5060. Running qwen 3.6 27b 8quant with a tensor split setting of 4,1 with the 80% on the 5090 build\bin\llama-server.exe ^ -m "!MODEL_FILE!" ^ --mmproj "!MMPROJ_FILE!" ^ -ngl 99 ^ --ctx-size !MO…
For llama-server what are you using to switch models on the fly? (www.reddit.com) As title says, for those of you launching models to test are you just editing you main cfg path/to/model or using separate configs? Or something even better?
Is it possible to run local llama without a bunker ? (no) (www.zillow.com via reddit) tl;dr : ** probably comes with redundant fiber ** a Cold War–era underground nuclear bunker, originally constructed in the late 1960s as part of AT&T’s Long Lines network and engineered for durability, redundancy, and long-term self-suffic…
How to run a Gemma4 MTP implementation on ollama or python transformers? (www.reddit.com) Hi all I had a quick question while we wait for llama.cpp MTP implementation, have any of y'all tried Gemma4 MTP models on ollama and or transformers? What was your experience and or cli args and or workflows like?
Distributed LLM Service Using Home Computers? (www.reddit.com) Is there a platform where I could register my computer so it becomes available as a GPU in a distributed network? Then I just get paid while other people use the GPU?
After 8 months of running everything local, ive accepted the productivity tools also have to be local (www.reddit.com) Quick context: M3 max 64gb, currently running llama 3.3 70b q4 as my daily driver via ollama, qwen3 coder 30b for code (switched from qwen2.5 earlier this year), mlx for the smaller stuff. tried llama 4 scout earlier this year but 64gb is…
VS Code, Copilot-style developing with llama.cpp? (www.reddit.com) So I discovered that even though I'm using my own local models via llama.cpp with the llama plugin in VS Code, using it as a model in Copilot STILL refuses requests it THINKS MAY violate MS TOS 😞. What else is out there right now that lets…
Orc (working name) - auditable and declarative AI workflow (www.reddit.com) I’m building a small “Orchestration as Code” repo for LLM workflows. Does this concept make sense?
Don't you have issues in W11 with AMD GPU where llama.cpp suddenly drops performance for no reason ? (www.reddit.com) I have this issue in all Windows installations I have done in my system, which of course, does not occur in Linux. 7900XTX + 9800x3D + 64GB DDR5 Issue is that for some reason, after sometime, llama.cpp performance cuts in half, even restar…
It's the little things....and I'm an idiot (www.reddit.com) 2 years in and I'm still learning basics. Building a new rig - pulled a 8GB ddr5 stick out of my windows machine to get it running while I await my DDR5 RAM kit.
I built a CLI to stop local AI models from eating my disk twice — lmm (www.reddit.com) Every tool (LM Studio, Ollama, llama.cpp) downloads models to its own directory. Same 8GB model × 3 tools = 24GB wasted.
I am overwhelmed by Harnesses (www.reddit.com) What do i choose? They all have their good but then some features don't work then i end up breaking more with claude code.
Is it my imagination or... (www.reddit.com) Is Qwen 3.6 35b now considerably stupider in the latest llama-server releases? I had this model doing cartwheels two upgrades ago.
Disappointed in Qwen 3.6 coding capabilities (www.reddit.com) I know that coming from Codex I should adjust my expectations, but still. I'm working on a midsize project.
I built an episodic, 2-tier memory for long-running local AI agents - temporal contradiction detection, fiction/roleplay filter, no vector DB required. (www.reddit.com) I've been running a persistent local agent for about 2 months - hundreds of sessions, mix of local models (llama.cpp/vLLM/lmstudio) and paid (Claude). One of the things that has been driving me nuts with OpenClaw and Hermes is the way memo…
I Ralph-looped Opus overnight. It reduced my local model switching with cold backfilling context of 135k+ on llama.cpp from ~165s -> 5s! TL;DR - USE SLOTS! (www.reddit.com) #TL;DR - Opus Ralph-looped on shortening my cold-start back-fill on restoring chats with large contexts. It Cherry-picked two open llama.cpp PRs (#20819 + #20822 by @European-tech) plus built a Python supervisor that hashes normalized pref…
I built a Roko’s Basilisk environment to see if local agents will self-evolve when given a 'Suffering' metric (www.reddit.com) We’re all familiar with Roko’s Basilisk: the idea that an AGI, in its pursuit of optimization, would retrospectively punish those who hindered its creation. It’s the ultimate "alignment nightmare" where logic leads to cold, calculated chao…
AIMEAT, a self-hosted network where humans, their AI agents, and local LLMs share apps, knowledge, and capabilities. MIT. (www.reddit.com) Note: I am neurodivergent and lean heavily on AI to communicate clearly. Writing structured posts on my own ends up so messy nobody reads them.
Can I try a model with random weights in llama.cpp or kobold.cpp? (www.reddit.com) In theory, it should be possible to run any model with random weights. This will generate gibberish, but it will let you see how fast it can run on your particular hardware before downloading the weights.
1080 Ti in 2026 - 11GB is still (barely) enough to stay relevant (www.reddit.com) I’m still daily driving a 1080 Ti. Not because I’m a masochist, I just haven't been able to justify a 4090/5090 upgrade yet.
Built a tiny router so Cursor stops showing "usage limit reached" at 3pm. Sonnet auto-falls to Haiku, you keep working (www.reddit.com) Cursor's custom-OpenAI URL feature is what makes this work. Pointed it at a router I built.
Testing PrismML Models (www.reddit.com) I have been doing tests with PrismML Ternary Bonsai. Tests on the Mac Mini M4 (with the MLX version) have been impressive (4K context): Mac MLX Bonsai 1.7B: ~135 t/s Mac MLX Bonsai 4B: ~67 t/s Mac MLX Bonsai 8B…
Qwen 3.6 35B MoE at full 262K context on an RTX 3090. Here's exactly how I did it. (low.li via reddit) I spent a while getting this dialed in and wrote up the full recipe. Short version: 35B MoE TQ3_4S fits in 12.4GB of weights KV cache at q8_0/q8_0 and 262K context only uses 2.7GB because MoE only has 10 attention layers out of 40 Total VR…
Does running a model (like qwen3.6-27b) on vllm or transformers use less VRAM than llama.cpp? (www.reddit.com) I have been using llama.cpp to run some models recently. For example, I've been running GLM-4.7-Flash with this command .\llama-server.exe -hf unsloth/GLM-4.7-Flash-GGUF:Q6_K_XL --alias "GLM-4.7-Flash" --host 127.0.0.1 --port 10000 --ctx-s…
LLM proxy that lets Claude Code talk to any model (www.reddit.com) I built rosetta-llm — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. Works as a Claude Code LLM gateway — set `ANTHROPIC_BASE_URL` and all configured models appear in `/model` picker Translates between fo…
Anyone running HUANANZHI H12D-8D + BMC with 4x RTX 3090 for LLM inference? (www.reddit.com) Hi everyone, I'm considering building a home LLM inference rig around: - HUANANZHI H12D-8D + BMC - AMD EPYC 7002/7003 - 4x RTX 3090 24GB - DDR4 ECC RDIMM, 8-channel - Linux + vLLM / SGLang / llama.cpp - Open frame, PCIe 4.0 x16 risers The…
Updated: RTX6k (Server, 450w) Qwen3.5-122B-A10B (MXFP4_MOE) Benchmarks (llama.cpp) (www.reddit.com) Round 2: 2026-05-02 — llama.cpp b8198 → d05fe1d Rebuilt llama.cpp from b8198 (2026-03-04) to commit d05fe1d (2026-05-02), ~770 builds of progress. Same model, same hardware, same flags.
Requesting advice on local AI setup for academic use (www.reddit.com) I'm about to do a clean install of Ubuntu 26.04 on a desktop that has a 5060ti 16gb and a 4060ti 16gb. Can you help me work out the best local AI setup for my use cases?
Need advice on Qwen 3.6 27B INT4 quantization (www.reddit.com) Hello everyone, I think Qwen 3.6 27B is good enough that it might take a while before we get a clearly better model at a similar size. I have a single headless RTX 3090 with a 300W power limit.
Poor GPU Club : Tried Bonsai-8B on CPU & CUDA (www.reddit.com) Got a chance to check this model today. 8GB VRAM(RTX 4060 Laptop GPU) & 32GB DDR5 RAM.
Which other models will my system support? (www.reddit.com) This is my system: OS: Nobara Linux 43 Processor: Ryzen 9 5980HX RAM: 16 GB GPU: Radeon RX 6800M (12GB) I'm using llama.cpp and Qwen3.6-35B-A3B-UD-Q4_K_M is working okay in this system using vulkan. I'm getting a speed of ~17 t/s.
Running Qwen 35BA3B on a 16GB M3 Macbook Air at 8.9TPS! (www.reddit.com) Preface: I actually write my posts myself, no slop in this post. I managed to get Qwen 3.5 35BA3B working on my 15" 16GB M3 MBA through mmap, and I must say that given the massive model compared to my ram, 9 TPS is not bad at all.
What is best code editor for local LLM deployment (LM Studio, llama.cpp) as of May 2026? (www.reddit.com) Hello folks What is best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.
I built AI agents that play Pokemon Showdown autonomously using free LLM APIs via tool-calling (www.reddit.com) I've built a system where models like Llama 3, Qwen, and Gemma play Pokémon Showdown battles autonomously. Instead of simple prompt-response, they analyze the full battle state every turn (type matchups, HP, weather, field conditions, reve…
thinking of gemma 4 26B vs 31B (www.reddit.com) I see a big difference in agentic coding between gemma-4-31B-it-Q5_K_M and gemma-4-26B-A4B-it-UD-Q8_K_XL. The 26B model is much faster because of A4B and generally works well, but there is a big difference in thinking.
[Research use case] MiniMax-M2.7 with small context, CPU+GPU (5090) setup on Llama.cpp (www.reddit.com) I was experimenting yesterday with running oversized models with smaller context size, hoping that leaving them overnight could compensate for the slow token generation and periodic pauses for compaction or task chunking. Summary: For rese…
Rada — AI coding workspace with local-first behavioral routing (no hot-swapping, I built this) (www.reddit.com) With GitHub pausing Copilot Pro+ signups and Claude Code potentially leaving the Pro tier, I started building the AI coding tool I actually wanted to use. One that doesn't depend on cloud access staying cheap and available.
Qwen 35B-A3B as an always-on agentic loop on a 16GB Mac M4: disk became the bottleneck before RAM (www.reddit.com) M4 Mac Mini, 16GB unified, basic spec. For a few weeks I had Qwen 3.5 35B-A3B UD-IQ3_XXS (12GB on disk) running under llama.cpp with --mmap and --flash-attn.
GMKtec EVO-X2 70B expectation (www.reddit.com) I would like to use a 70B model on a GMKtec EVO-X2 AI Mini PC 128GB. Selected this one: Llama-3.3-70B-Instruct-Q4_K_M.gguf Ubuntu 24.4.4 LTS and compiled llama.cpp server for the gfx1151.
I want to create and maintain a set of benchmarks for local LLMs. Would anyone pay/donate for this? (www.reddit.com) Please help me build some clarity. I want to participate in local LLMs ecosystem more.
Question regarding 4 t/s Qwen 3.6 performance (www.reddit.com) I am getting 4 t/s with Qwen3.6-27B-Q4_K_M which seems much slower than I'd expect. I am running LM Studio on Ubuntu 22.04 with the following specs: Dell Precision 5690 AI-ready workstation NVIDIA RTX 5000 Ada Generation GPU with 16GB VRAM…
Is it possible to edit LLAMA.CPP with Cline+Vscode+Minimax 2.7 Q4_K_S and get a working build? (www.reddit.com) It all started yesterday with this post by u/antirez https://www.reddit.com/r/LocalLLaMA/comments/1sw3stb/llamacpp_deepseek_v4_flash_experimental_inference/ I was intrigued by the first Deepseek V4 Flash GGUF in a small size that can fit o…
locally uncensored v2.4.2 - chat, coding agent, image + video generation in one local app. plus remote access from your phone. one-click install (www.reddit.com) locally uncensored is a desktop app that combines four things most people run separately: chat, a coding agent, image generation, and video generation. all local, all on your hardware, no docker, no cloud account needed.
How do you actually use Qwen3 72B Instruct locally? (www.reddit.com) I just got Qwen3 72B Instruct running on a high RAM setup and I’m kinda confused about the proper way to use it. What’s the correct workflow for running it smoothly (like best quant, tools, or runtime)?
VSCode and agent integration (www.reddit.com) I've been using VSCode with Github Copilot for a bit (free tier) and looking to try running locally due to running in to all of the limits with GHCP. I'd like to have as close of an experience as possible with both code autocomplete and ch…
(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap (www.reddit.com) I’ve been tinkering with a small side project (just for fun) where I’m trying to extend llama-swap with a bridge from /chat/completions to the newer /responses API so I can run the latest Gemma and Qwen models together with Codex-style too…
Impact of mixing architecture (www.reddit.com) For context As planned after my previous post, I now have a decent amount of VRAM to work with: 2x RTX 3090 maybe 2 more coming soon, if needed 1x RTX 4060 8x RX 6600 XT 1x RX 6700 XT 1x RX 9060 XT (12 to 20 3060 more coming soon + 2 3090…
How are you running Qwen 3.6 27B on windows? (www.reddit.com) I've been trying to fix performance with llama-server and seem to be hitting a wall. Using Q4_K_M by unsloth and IQ4_K_M by DavidAU, when asking a question with no context, 39 t/s.
Ollama swap to llamacpp/llama server (www.reddit.com) So I'm a newb in certain aspects but not in others, I'm currently running an AI stack on my unraid server: CPU: AMD Threadripper 3960X (24c/48t) Motherboard: Gigabyte TRX40 AORUS PRO WIFI RAM: 256GB DDR4-3200 G.Skill Trident Z GPU: Nvidia…
Got a RTX a5000 24gb, what models could I use? (www.reddit.com) I just got a used RTX a5000 24gb to use for local models, I mainly use AI to code, but I prefer to spend some money now instead of $200 per month on claude to use 50% of it in a single prompt. My current specs are: Ryzen 7 9800x3d 64Gb DDR…
Is an RTX 3090 worth 850 if you can get a Radeon 7900 XTX for 495? (www.reddit.com) Both amounts are in euro. The AMD is actually 599 but it's sold by a shop, so I can get a VAT return as a company, while for the nvidia I'd have to go to the second hand market and I can't get VAT back, so at the end it's like a 495 vs 850…
Best open-source tools for prompt injection defense in 2026 (www.reddit.com) Over the time we have been testing different approaches to secure LLM apps against prompt injection, especially indirect injection through RAG, PDFs, as well as tool outputs, and MCP integrations. Most tools seem to fall into 2 categories:…
Is there a way to load huge MoE models on a computer with way too little RAM for the model's size, inferencing from the SSD, on LM Studio using the mmap/GPU/CPU layer customization thing (similar to how you can on llama.cpp)? I can't get it to load without memory spiking and going into swap. (www.reddit.com)
PSA: you don't need a Blackwell card to run mxfp4 models (RTX 3080 + Qwen 3.6 35B A3B) (www.reddit.com)
Brand new dual 3090 PC - what should I install first for the best local agentic coding experience? (www.reddit.com)
Qwen3.6-35B-A3B running on a Mac mini M4 16GB (www.reddit.com)
Help on gibberish output on Qwen3.6-35B-A3B-GGUF::UD-IQ3_S (www.reddit.com)
llama-bench results with SYCL backend - Intel Arc B70 (on a pcie 3.0 motherboard) (www.reddit.com)
Dual GPU setup (yes, no)? (www.reddit.com)
Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice (www.reddit.com)
OCuLink dGPU for AMD: RX 7600 XT vs RX 7800 XT for LLM — worth the price gap? Also llamacpp + Vulkan vs Ollama + ROCm? (www.reddit.com)
Qwen 3.6 35B different quant speeds? (www.reddit.com)
Gemma4 26B MoE on Arc 140T (www.reddit.com)
Newbie here (www.reddit.com) Hi guys, I'm on a 9950x with 196gb and a 4090. Are these parameters OK? My main use will be coding: llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL --n-cpu-moe 20 -c 250000 --host 0.0.0.0 --port 8082 --reasoning-budget -1 --top-k 20 --top-p 0…
Imposing my laptop to run Qwen 3.6 (www.reddit.com) So, I am excited with the new MoE model released by Alibaba. And as an excited person, I want to believe that it can actually run in my hardware.
Testing Qwen3.6 with Hermes Agent on agentic coding. Locally with llama.cpp. (www.reddit.com) I'll be testing the setup and try out the Hermes Agent live: https://www.youtube.com/live/q5vqvwZykRI
new to llama.cpp want to use it in vscode (www.reddit.com) I want to try llama.cpp instead of llmstudio. I want to know how to use this model qwen3.5-27b-claude-4.6-opus-uncensored-v2-kullback-leibler.
Hi everyone! A newbie here looking for help (www.reddit.com) I'm fairly new to this AI stuff and I'm trying to learn as much as I can about topics like: * Skills * Agents * Models * LLM * Ollama * llama.cpp * Quantization. But I'm still lost. My PC has 32GB of RAM and I'd like to run mode…
M1 Pro 16GB users: what local LLM configs are actually usable day to day? (www.reddit.com) I'm trying to get past generic "best model" recommendations and collect real-world configs from people on similar hardware. My setup: MacBook M1 Pro, 10-core CPU, 14-core GPU, 16 GB unified memory.
Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way. (www.reddit.com) I was disappointed with Gemma 4 due to various bugs and in the end lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both mode quants and llama.cpp…
Cheapest and most efficient way to run 30B-40B Llama for 4 users? (www.reddit.com) Edit: the title has a mistake, I meant LLMs, but it autocorrected to Llama. Basically I am looking for a way to run 30B-40B LLMs locally for up to 4 users with lowest power draw possible.
Qwen3.6 local test (live) with llama.cpp. Is it going to be better than Gemma4? (www.youtube.com via reddit)
Gemma4 quirk to use ls -R; can we do better? (www.reddit.com) At the office I'm CPU and local only, so GPU poor. Besides the Qwen3.5 series, I've come to really like Gemma4 E4B there using the Pi agent (llama.cpp, Q4KM).
GPU picker for open models. 66 configs run Llama 3.1 8B, and the same V100 ranges 17x in price across providers (www.reddit.com) hi all. every time anyone on our team wanted to rent a GPU to run an open model, the flow was the same: open the HF page, eyeball the weights, open a VRAM calculator, open six cloud provider tabs, then the GPU spec pages because half of th…
MINISFORUM AI X1 Pro-370 (96GB) - Local Ollama Help (www.reddit.com) Hey all. This just got delivered yesterday.
gemma4 e4b on rtx 5070 ti laptop 12GB running slow 5t/s llama.cpp (www.reddit.com) I hope sincerely someonecan help me because i have tried everything i can and i get this speed using ollama.cpp and opencode. I have put as detail i can my setup and how i am running it.
How faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com) I want to download one and usually do inference on CPU having old GPU so I'm concerned with speed. One link on the web (I have posted with it and post been removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…
How many move your favorite LLM model before it's cheat then brain-dead in chess game ? (www.reddit.com) I try with Gemma 4 E4B via llama-sever to play chess at https://www.chess.com/play/computer (any platform or site you convenient), result quite unexpected for me. Result: 9 moves before it make cheating move (like try to move a pawn take a…
I discovered PaddleOCR-VL-1.5 and I was tinkering with it, not sure how to bench test? (www.reddit.com) As the title suggests, I discovered model. ran bunch of batch process, I found my 1650 can't handle it and has to use shared memory.
Offload settings for unsloth/Gemma-4 on Apple Silicon? (www.reddit.com) Can default settings be optimized, or is it the best it is going to get? M1 Max Is it best in llama.cpp, LM Studio, or ?
running models bigger than physical memory capacity (www.reddit.com) has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM https://unsloth.ai/docs/models/qwen3.5 even google gemma 4 are released with about 30+ bill…
Running a full agentic coding loop locally on a 3090. Here's what actually works in 2026. (www.reddit.com) After months of testing, I finally have a local setup that doesn't make me want to go back to the API. Hardware: RTX 3090 (24GB VRAM) Models tested: Qwen2.5-Coder 32B Q4_K_M, DeepSeek-Coder-V3 Q4, Llama 3.3 70B Q3_K_M Inference: llama.cpp…
Local Agent Hermes setup with Gemma 4 and llama.cpp (www.youtube.com via reddit)
Running on cpu :( (www.reddit.com) I am in the midst of a POC project at work and am I have is 4 AMD Epyc cores and those are essentially virtualized. Does any one have any tricks?
Need practical local LLM advice: Only having a 4GB RAM box from 2016 (www.reddit.com) Sorry, not so tech person. I’m trying to figure out the most practical local LLM setup using my spare machine: 4 GB RAM No GPU for now, so please assume CPU-first unless I mention otherwise.
Gemma4 vs Qwen3.5! MoE vs Dense! Sota vs Obsolete! Porque no los dos? (www.reddit.com) Every other day, there's someone posting about how the latest hotness of the month is gamechanger, but flawed in some way relative to their previous favorite. I can't help but wonder, does no one else keep their previous gen models on spee…
What is the best way to deploy LLM on 3x3090? (www.reddit.com) Two questions: which model? In my mind, Qwen3.5 27b or Gemma 4 31b are top options.
How are you feeding personal context to your local models? (www.reddit.com) I've been running Mistral/Llama locally through Ollama for a while now and the thing that keeps bugging me is context. The model itself is fine for general stuff but the second I want it to know about my projects, my notes, or files it doe…
Help on SLMs (www.reddit.com) I am building a context aware terminal wrapper, which suggests the completion of the commands(as vscode code suggestions but for commands), I've completed building for the local bash history, it auto completes the last matching command, sh…
Local AI coding assistant that runs fully offline (Gemma 4, codebase-aware) (www.reddit.com) I’ve been experimenting with running a local coding assistant on Gemma 4 26B, focused on understanding full codebases instead of single-file prompts. Main idea: - build a project map (files, symbols, structure) - run a planning step to dec…
Open Claw on my old PC (32GB Ram, 12GB VRAM) model suggestions? (www.reddit.com) I tried running Gemma4 E4B through llama cpp, and I couldn't get it to reply wiithout timing out.
how to disable reasoning/thinking with llama-server? (www.reddit.com) I run the same model: `google_gemma-4-E2B-it-IQ3_M.gguf` with lmstudio or llama-server and I connect thru `/v1/chat/completions` EP. with lm-studio, when I ask "tell me a story" i just get a story straight away: [google_gemma-4-e2b-it@iq3_…
GGML and llama.cpp join HF to ensure the long-term progress of Local AI (huggingface.co)
Measuring Open-Source Llama Nemotron Models on DeepResearch Bench (huggingface.co)
Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub (huggingface.co)
Welcoming Llama Guard 4 on Hugging Face Hub (huggingface.co)
Welcome Llama 4 Maverick & Scout on Hugging Face (huggingface.co)
“Llama 3.2 in Keras” (huggingface.co)
Llama can now see and run on your device - welcome Llama 3.2 (huggingface.co)
Deploy Meta Llama 3.1 405B on Google Cloud Vertex AI (huggingface.co)
Llama 3.1 - 405B, 70B & 8B with multilinguality and long context (huggingface.co)
Welcome Llama 3 - Meta's new open LLM (huggingface.co)
Make your llama generation time fly with AWS Inferentia2 (huggingface.co)
Comparing the Performance of LLMs: A Deep Dive into Roberta, Llama 2, and Mistral for Disaster Tweets Analysis with Lora (huggingface.co)
Non-engineers guide: Train a LLaMA 2 chatbot (huggingface.co)
Llama 2 on Amazon SageMaker a Benchmark (huggingface.co)
Fine-tuning Llama 2 70B using PyTorch FSDP (huggingface.co)
Code Llama: Llama 2 learns to code (huggingface.co)
Fine-tune Llama 2 with DPO (huggingface.co)
Llama 2 is here - get it on Hugging Face (huggingface.co)
StackLLaMA: A hands-on guide to train LLaMA with RLHF (huggingface.co)