Friends Don't Let Friends Use Ollama
Ollama gained traction by being the first easy llama.cpp wrapper, then spent years dodging attribution, misleading users, and pivoting to cloud, all while riding VC money earned on someone else's engine…
#llama
71 items
The local LLM ecosystem doesn’t need Ollama (sleepingrobots.com via hn)
Gemma4 26b & E4B are crazy good, and replaced Qwen for me! (www.reddit.com via reddit) My pre-Gemma 4 setup was as follows: llama-swap, open-webui, and Claude Code Router on 2 RTX 3090s + 1 P40 (my third 3090 died, RIP) and 128GB of system memory. Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…
Local AI is the best (www.reddit.com via reddit) Funny image, but I'd also like to add that I love how much freedom I have to fine-tune the model for honesty. No glazing, no censorship, no data harvesting.
These "Claude-4.6-Opus" Fine Tunes of Local Models Are Usually A Downgrade (www.reddit.com via reddit) Time and time again I find posts about these fine tunes that promise increased intelligence and reasoning with base models, and I continuously try them, realize they're botched, and delete them shortly after. I sometimes do resort to a low…
The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B) (www.reddit.com via reddit)
What is the current status with Turbo Quant? (www.reddit.com via reddit) It was hyped about two weeks ago and I remember seeing some pull requests into llama.cpp, but what is the current status now that the hype has faded?
Qwen3.5-35B running well on RTX4060 Ti 16GB at 60 tok/s (www.reddit.com via reddit) Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to s…
Share your speculative settings for llama.cpp and Gemma4 (www.reddit.com via reddit)
MiniMax M2.7 GGUF Investigation, Fixes, Benchmarks (www.reddit.com via reddit) Hey r/LocalLLaMA, we did an investigation into MiniMax-M2.7 GGUFs causing NaNs on perplexity. Our findings show the issue affects 21%-38% of all GGUFs on Hugging Face (not just ours).
[P] Built GPT-2, Llama 3, and DeepSeek from scratch in PyTorch - open source code + book (www.reddit.com via reddit) I wrote a book that implements modern LLM architectures from scratch. The part most relevant to this sub: Chapter 3 takes GPT-2 and swaps exactly 4 things to get Llama 3.2-3B: LayerNorm → RMSNorm Learned positional encodings → RoPE GELU →…
common/gemma4 : handle parsing edge cases by aldehir · Pull Request #21760 · ggml-org/llama.cpp (github.com via reddit) If you are on Gemma (like me), you basically have to compile llama.cpp daily now
Introducing BlueTTS (www.reddit.com via reddit)
FYI, Step 3.5 Flash has better perf and context is 1/4 the price in llama.cpp (www.reddit.com via reddit)
Can't keep up with llama.cpp changes, made an n8n workflow to summarize it for me daily (www.reddit.com via reddit) My kind of daily news, sent to me via Discord. The n8n workflow (you could probably have Hermes or another agent do similar):…
Compile English function descriptions into 22MB neural programs that run locally via llama.cpp (www.reddit.com via reddit) We built a system where a neural compiler takes a plain-English function description and produces a "neural program" (a combination of a continuous LoRA adapter and a discrete pseudo-program). At inference time, these adapt a fixed interpr…
Q8 Cache (www.reddit.com via reddit)
Turn an old Android phone into a Local AI Voice Assistant (www.reddit.com via reddit) I had a nice old cracked Pixel 5a laying around that I wanted to get some use out of, so I turned it into a local AI voice assistant. A server on a laptop running llama.cpp with gemma-3-4b-q4.gguf, served by Flask, connects to a script running on…
(llama.cpp) Possible to disable reasoning for some requests (while leaving reasoning on by default)? (www.reddit.com via reddit) I am running unsloth/gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf with llama-server (with reasoning enabled). Is it possible to disable reasoning for some requests only?
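Recent llama-server builds accept per-request template overrides on the OpenAI-compatible /v1/chat/completions endpoint, which is one way to leave reasoning on by default and switch it off for individual requests. A minimal sketch in Python; the `chat_template_kwargs` field and the `enable_thinking` key are assumptions that depend on your llama.cpp build and on whether the model's chat template actually reads that variable:

```python
import json

# Sketch: per-request reasoning toggle against llama-server's
# OpenAI-compatible /v1/chat/completions endpoint. The
# "chat_template_kwargs" field and "enable_thinking" key are
# assumptions -- check your llama.cpp build and chat template.
def build_request(prompt: str, thinking: bool) -> str:
    payload = {
        "model": "gemma-4-26B-A4B-it",  # name is illustrative
        "messages": [{"role": "user", "content": prompt}],
        # Passed through to the Jinja chat template at render time.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
    return json.dumps(payload)

# Reasoning stays on at the server by default; this request opts out.
body = build_request("Summarize this diff.", thinking=False)
```

POST the body to the server (e.g. http://localhost:8080/v1/chat/completions); requests that omit the override keep the server's default behavior.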
Hot Experts in your VRAM! Dynamic expert cache in llama.cpp for 27% faster CPU +GPU token generation with Qwen3.5-122B-A10B compared to layer-based single-GPU partial offload (www.reddit.com via reddit) Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough.
Reproduction of TurboQuant (www.reddit.com via reddit) There have been many TurboQuant implementations recently in llama.cpp, mlx, vllm, and sglang, but a lot of the discussion and code around them feels pretty noisy and looks to be AI-generated. I’m trying to understand which claims from the…
I built a local LLM that learns how you use Claude Code and starts auto-piloting it (www.reddit.com via reddit) I've been running 5-8 Claude Code sessions at a time and got tired of tab-switching to approve tool calls. So I built claudectl — a TUI that sits on top of all your sessions and lets a local LLM (ollama/llama.cpp) handle approvals for you.
Llama.cpp vs LM Studio on gaming PC (www.reddit.com via reddit) Here is my experience: I've been using LM Studio with an RTX 5080 and 64GB RAM on Windows 11. I'm very happy with LM Studio except for the speed.
Llama.cpp llama-server command recommendations? (www.reddit.com via reddit)
How to run Qwen3.5-27B with speculative decoding with llama.cpp llama-server? (www.reddit.com via reddit)
current: 1x 16GB 5060Ti. worth a 2nd for OpenCode? (www.reddit.com via reddit) my current build is just a 16GB 5060 Ti running on a 3800X with 32GB DDR4. not really anything special, but I only really use it right now for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). someo…
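For the speculative-decoding question above, here is a sketch of the flags involved, expressed as a Python argv list so each piece is easy to check off. The flag names match what recent llama.cpp builds have shipped, but verify against `llama-server --help` on your build; the model paths and values are placeholders:

```python
# Sketch of a llama-server invocation with speculative decoding.
# Flag names (-md, --draft-max, --draft-min, -ngld) are from recent
# llama.cpp builds -- confirm with `llama-server --help`.
target = "Qwen3.5-27B-Q4_K_M.gguf"  # main model (placeholder path)
draft = "Qwen3.5-1B-Q8_0.gguf"      # small same-family draft model

argv = [
    "llama-server",
    "-m", target,
    "-md", draft,          # draft model for speculative decoding
    "--draft-max", "16",   # max tokens drafted per step
    "--draft-min", "4",    # skip verification of tiny drafts
    "-ngl", "99",          # offload all target layers to GPU
    "-ngld", "99",         # offload all draft layers too
    "-c", "32768",         # context size
]
print(" ".join(argv))
```

The draft model must share the target's tokenizer (same model family), or the drafted tokens can never be accepted and speed drops instead of rising.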
Intel Releases OpenVINO 2026.1 with Back End for Llama.cpp, New Hardware Support (www.phoronix.com via hn) Intel's OpenVINO toolkit for optimizing and deploying AI inferencing across their range of hardware platforms is out with its newest quarterly feature update.…
Hey, has anyone here used Qwen3.5-27B-NVFP4-GGUF with llama.cpp yet? (www.reddit.com via reddit) Hey! I was wondering if any of you have used Qwen3.5-27B-NVFP4-GGUF on an RTX 5090 with llama.cpp?
Multi host GPU cluster using DAC cables vs 4 GPU system. Anyone doing this successfully? (www.reddit.com via reddit) Right now I have 3 GPUs, 5060 Ti 16G, 2 x 4060 Ti 16G, and may get a used 3090 24G that I found. I could build a janky open rack system using M.2 and PCI risers with a 1600W PSU or try something like putting 2 GPUs in 2 systems using the f…
RTX 3090 llama.cpp flags help (www.reddit.com via reddit)
Can I combine a RTX 5060 Ti 16GB with 7900XTX 24GB for llama.cpp? (www.reddit.com via reddit)
3x3090 is faster in Ubuntu than Win11, GPT-OSS 120B 120 tg/s vs 6 tg/s, why? (www.reddit.com via reddit)
Show HN: A book that builds GPT-2, Llama 3, DeepSeek from scratch in PyTorch (news.ycombinator.com via hn) I'm a software engineer who works with LLMs professionally (Forward Deployed Engineer at TrueFoundry). Over the past year I built up implementations of five LLM architectures from scratch and wrote a book around them.
I want to run qwen3.5 27B q4_k_m on CPU, and I need help. (www.reddit.com via reddit) I am a local LLM beginner and I found this subreddit while looking for help. (Please understand that I am unfamiliar with Reddit.) (System: i5 4440 1.8GHz / B85M DS3H / DDR3 32GB / 128GB SSD / Ubuntu 25.10 Questing) I loaded Qwen3.5 27B Q4_K_M onto…
Qwen 122B is AMAZING but is my config right? (128GB M4 Max) (www.reddit.com via reddit) Hi! I hope it's okay for me to ask this here.
Anybody got Qwen3.5-27B working with Intel Arc B70 (or similar) and proper optimization? (www.reddit.com via reddit) I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.
[Paper] Residual Streams / KV Direct (www.reddit.com via reddit) It seems we have entered a period of accelerating innovation around the KV cache. Someone mentioned this post's paper in the llama.cpp GitHub issue for implementing Turbo Quant.
Vulkan compilation issue on Fedora (b8786) — solved (www.reddit.com via reddit) If you pull https://github.com/ggml-org/llama.cpp/releases/tag/b8786 and try to build with Vulkan support on Fedora, you may hit this error: [ 39%] Building CXX object ggml/src/ggml-vulkan/CMakeFiles/ggml-vulkan.dir/multi_add.comp.cpp.o /h…
Older model suggestions (www.reddit.com via reddit) Due to costs I am running on some older hardware. Looking for suggestions on supported models for my particular stack.
Claude down? TokenMonopoly will help you find the best deals in AI subs (tokenmonopoly.com via hn) TokenMonopoly Live leaderboard of AI API deals — pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.
Show HN: How to Use Google's Extreme AI Compression with Ollama and Llama.cpp (news.ycombinator.com via hn)
Feedback on iOS app with local AI models (www.reddit.com via reddit) Hey everyone, I just shipped an iOS app that runs local AI models. It currently has 12 models: Gemma 4, Llama 3.3, Qwen3, DeepSeek R1 Distill, Phi-4, etc.
LiteRT LM Framework with Rockchip NPU (RKNN 3588) (www.reddit.com via reddit) I'm searching for a build of the LiteRT LM framework that can use the NPU of the RK3588. It would be great, since I could run the Gemma 4 E2B model with this framework on the machine, because I won't have to migrate my codebase from li…
Ask HN: Simple tooling for local LLM code critique without IDE integration? (news.ycombinator.com via hn) While I'll set out the criteria for what I'm looking for, I don't want this to turn into a general debate about the role of LLMs in software development. That discussion is important, but we have plenty of them.
Are MLX 4-bit Quants broken? (www.reddit.com via reddit) I see so many interesting MLX implementations like DFlash, speculative decoding, etc. But when I want to try them for myself, the 4-bit quants of models seem like they have been lobotomised for some reason: hallucinating, start t…
How does a self-correcting loop for AI agents work? (www.reddit.com via reddit) Hey guys, I just checked out MiniMax 2.7, where they used AI to train itself; they ran over a hundred loops and it improved its performance by 30%. How does that work? Can I also run a script that makes AI store its memory in a loop on a m…
What's the best way to install llama.cpp on Android? (www.reddit.com via reddit) I own an Oppo Find X3 Pro (Snapdragon 888, 12/256 GB, Android 14.0), unused because of 3 green vertical lines on the screen and a poor battery. I tried Google AI Edge Gallery with Gemma-4-E2B-it and it performs well, so I thought: "why don't t…
Upgrade paths for my 256g ddr4 ram + 4x24g vram system (www.reddit.com via reddit) So I was just about to give up playing with local models, until I realised I can actually run GLM 5.1 at not too horrible speeds, using this quant https://huggingface.co/ubergarm/GLM-5.1-GGUF/tree/main/IQ2_KL in ik llama. Getting around 6.…
Transitioning to iOS Dev + Local LLMs: Is the M5 Max with 64GB+ RAM the only real choice? (www.reddit.com via reddit) Hey everyone, I’m currently an ML Engineer looking to pick up iOS development, and I’m upgrading my hardware to handle both. I’m moving away from cloud-only workflows and want to run LLMs locally for testing, R&D, and building CoreML integ…
Can an LLM make a small change to a software program? (www.reddit.com via reddit) I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 E4B Q4 and Qwen 3.5 9B Q5 (KV cache quantized to 4 bits with the new Google TurboQuant implemented in llama.cpp; I use koboldcpp, and the release notes said it's automatically activated): the…
For AI agents: is per‑token pricing killing your budget? Looking for feedback on time‑based subscriptions. (www.reddit.com via reddit) Hey r/AI_Agents, I run an inference service (cheapestinference.com) and we're exploring a different pricing model that might be more predictable for agent workloads. Instead of per‑token billing, we offer **dedicated 8‑hour time windows**…
DGX spark (www.reddit.com via reddit)
My Custom Llama Build (www.reddit.com via reddit) I recently got into LLMs and llama.cpp because I wanted to learn AI. I went from Openclaw to SOTA CLI and then to running llama on my Linux server.
Is an NVIDIA DGX Spark or similar worth it? (www.reddit.com via reddit)
MINISFORUM AI X1 Pro-370 (96GB) - Local Ollama Help (www.reddit.com via reddit) Hey all. This just got delivered yesterday.
gemma4 e4b on rtx 5070 ti laptop 12GB running slow 5t/s llama.cpp (www.reddit.com via reddit) I sincerely hope someone can help me, because I have tried everything I can and I still get this speed using llama.cpp and opencode. I have put in as much detail as I can about my setup and how I am running it.
How much faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com via reddit) I want to download one, and I usually do inference on CPU since I have an old GPU, so I'm concerned with speed. One link on the web (I posted it once and the post was removed): multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…
How many moves does your favorite LLM make before it cheats and then goes brain-dead in a chess game? (www.reddit.com via reddit) I tried Gemma 4 E4B via llama-server playing chess at https://www.chess.com/play/computer (or any platform or site you find convenient); the result was quite unexpected for me. Result: 9 moves before it made a cheating move (like trying to move a pawn to take a…
I discovered PaddleOCR-VL-1.5 and I was tinkering with it, not sure how to bench test? (www.reddit.com via reddit) As the title suggests, I discovered the model and ran a bunch of batch processes. I found my 1650 can't handle it and has to use shared memory.
Offload settings for unsloth/Gemma-4 on Apple Silicon? (www.reddit.com via reddit) Can the default settings be optimized, or is this the best it is going to get? M1 Max. Is it best in llama.cpp, LM Studio, or something else?
running models bigger than physical memory capacity (www.reddit.com via reddit) has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM. https://unsloth.ai/docs/models/qwen3.5 Even the Google Gemma 4 models are released with about 30+ bill…
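It can work, because llama.cpp memory-maps GGUF files by default: weights are paged in from disk on first touch rather than read up front, so loading succeeds even when the file exceeds RAM, although generation slows to disk speed once the per-token working set no longer fits. A tiny self-contained illustration of the mechanism, using a plain sparse file rather than a real GGUF:

```python
import mmap
import os
import tempfile

# llama.cpp mmap()s model files by default, so the OS pages weights
# in lazily. Same mechanism, demonstrated with an ordinary file:
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.truncate(16 * 1024 * 1024)  # 16 MiB of sparse zero bytes

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Only the pages actually touched here get faulted into memory;
    # the rest of the file never becomes resident.
    offset = 8 * 1024 * 1024
    chunk = mm[offset : offset + 4]
    mm.close()
```

With mmap, resident memory grows only with the pages you touch; llama.cpp's `--no-mmap` flag disables this and reads the whole file into memory up front instead.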
Running a full agentic coding loop locally on a 3090. Here's what actually works in 2026. (www.reddit.com via reddit) After months of testing, I finally have a local setup that doesn't make me want to go back to the API. Hardware: RTX 3090 (24GB VRAM) Models tested: Qwen2.5-Coder 32B Q4_K_M, DeepSeek-Coder-V3 Q4, Llama 3.3 70B Q3_K_M Inference: llama.cpp…
Local Agent Hermes setup with Gemma 4 and llama.cpp (www.youtube.com via reddit)
Running on cpu :( (www.reddit.com via reddit) I am in the midst of a POC project at work, and all I have is 4 AMD Epyc cores, and those are essentially virtualized. Does anyone have any tricks?
Need practical local LLM advice: Only having a 4GB RAM box from 2016 (www.reddit.com via reddit) Sorry, I'm not a very technical person. I’m trying to figure out the most practical local LLM setup using my spare machine: 4 GB RAM, no GPU for now, so please assume CPU-first unless I mention otherwise.
Gemma4 vs Qwen3.5! MoE vs Dense! Sota vs Obsolete! Porque no los dos? (www.reddit.com via reddit)
What is the best way to deploy LLM on 3x3090? (www.reddit.com via reddit)
How are you feeding personal context to your local models? (www.reddit.com via reddit) I've been running Mistral/Llama locally through Ollama for a while now, and the thing that keeps bugging me is context. The model itself is fine for general stuff, but the second I want it to know about my projects, my notes, or files it doe…
Help on SLMs (www.reddit.com via reddit)
Local AI coding assistant that runs fully offline (Gemma 4, codebase-aware) (www.reddit.com via reddit)
Open Claw on my old PC (32GB Ram, 12GB VRAM) model suggestions? (www.reddit.com via reddit)
how to disable reasoning/thinking with llama-server? (www.reddit.com via reddit)