Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.
#gemma
437 items
Gemma 4 Jailbreak System Prompt (www.reddit.com)
Gemma4 26b & E4B are crazy good, and replaced Qwen for me! (www.reddit.com) My pre-gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (My third 3090 died, RIP) and 128gb of system memory Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…
Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results (localbench.substack.com via reddit) 4 models tested with q8_0 and q4_0 KV cache against full-precision baseline. What this measures: KV cache quantization stores the key-value cache in lower precision to s…
Qwen3.6 GGUF Benchmarks (www.reddit.com) Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant. Unsloth quants have the best KLD vs disk space 21/22 times on the pareto frontier.
Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference (www.gizmoweek.com via hn)
Local models are a godsend when it comes to discussing personal matters (www.reddit.com) I’ve been keeping a personal journal for the past few years. The entire thing is made up of over 100k tokens.
Local manga translator with LLM build-in, written in Rust with llama.cpp integration (www.reddit.com) Hi LocalLLaMA, I created a post a few weeks ago, but this time this project has become more reliable and easier to use. This is a manga translator that can also be used to translate any image.
Built a fully offline suitcase robot around a Jetson Orin NX SUPER 16GB. Gemma 4 E4B, ~200ms cached TTFT, 30+ sensors, no WiFi/BT/cellular. He has opinions. (www.reddit.com) Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention.
Which Gemma model do you want next? (www.reddit.com)
Opencode you naughty minx (www.reddit.com) Man, AI agents getting pretty crazy these days. :) (local, I just decided to try to get an orchestrator in there, when Qwen and Gemma aren't up to it.)
Qwen3.6 is incredible with OpenCode! (www.reddit.com) I've tried a few different local models in the past (gemma 4 being the latest), but none of them felt as good as this. (Or maybe I just didn't give them a proper chance, you guys let me know).
Recent Open models from last 6 Months - Nov 2025 - Apr 2026 (www.reddit.com) I created this chart with recent open models from last 6 months. Few might be older than that possibly.
Qwen 3.6 27B vs Gemma 4 31B - making Packman game! (www.reddit.com) Gemma just crushed Qwen in a local LLM gamedev contest! Device: MacBook Pro M5 Max, 64GB RAM Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens.
ExLlamaV3 Major Updates! (www.reddit.com) Turboderp has been on an absolute tear recently, in the endless battle to cram new llamas into smaller, faster boxes. We started off last month with the release of gemma 4 support, and continued with improved caching efficiency.
The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B) (www.reddit.com) This is V2 of my previous post. What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.
Gemma 4 Vision (www.reddit.com)
I hope that someday we will have a 124B Gemma. (www.reddit.com)
I made a visualizer for Hugging Face models (www.reddit.com) I built hfviewer.com, a small tool for visually exploring Hugging Face model architectures. You can paste a Hugging Face URL and get an interactive visualization of the architecture, which can make it easier to understand how different mod…
CPUs Aren't Dead. Gemma2B Out Scored GPT-3.5 Turbo on Test That Made It Famous (seqpu.com via hn) CPUs Aren't Dead. Gemma 2B Just Scored Higher Than GPT-3.5 Turbo on the Test That Made It Famous — Your Laptop Can Run It, or Cloudflare for $5/Mo.
Do you guys think there’s a high chance of Singularity being open source? (www.reddit.com) GLM 5.1 is dominant in almost every aspect in Design arena, surpassing Opus 4.6 in many tasks. Although user experiences vary dependent on subscription plans for both of those one of them is open source.
Qwen 3.6 35B crushes Gemma 4 26B on my tests (www.reddit.com) I have a personal eval harness: A repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode) A subset of the harness also has the LLM extract key information fr…
Bonsai models are pure hype: Bonsai-8B is MUCH dumber than Gemma-4-E2B (www.reddit.com) I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma. Without embedding parameters: Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% sm…
(Interactive)OpenCode Racing Game Comparison Qwen3.6 35B vs Qwen3.5 122B vs Qwen3.5 27B vs Qwen3.5 4B vs Gemma 4 31B vs Gemma 4 26B vs Qwen3 Coder Next vs GLM 4.7 Flash (www.reddit.com)
Gemma 4 - MLX doesn't seem better than GGUF (www.reddit.com)
LLM Neuroanatomy III - LLMs seem to think in geometry, not language (www.reddit.com)
Show HN: Prompt-to-Excalidraw demo with Gemma 4 E2B in the browser (3.1GB) (teamchong.github.io via hn)
Deepseek flash seems like a very good replacement for Haiku at the very least (www.reddit.com) We have a chat system which we use haiku for because it is mostly about tool calling and summarisation of them. But we have many tools with pretty complex input schemas, and stuff like gemma didn't cut it, so we went with haiku.
BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline. (www.reddit.com) BeeLlama v0.2.0 is here! Not quite a pegasus, but close enough.
it's time to update your Gemma 4 GGUFs (www.reddit.com) Chat Template was fixed a few days ago choose your fav dealer: https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF https://huggingface.co/bartowski/google_gemma-4-E4B-…
THE UNDERPRIVILEGED AI FOUNDATION Because every little model deserves a chance (www.reddit.com) Is there a 7B parameter model in your life struggling to understand sarcasm? A tiny 1.5B that can't afford one more epoch?
Gemma 4 31B — 4bit is all you need (www.reddit.com) Gemma quant comparison on M5 Max MacBook Pro 128GB (subjective of course, but on variety of categories): gemma 4 leaderboard the surprising bit: Gemma 4 31B 4bit scored higher than 8bit. 91.3% vs 88.4%.
I tested 8 LLMs as tabletop GMs - a 27B model beat the 405B on narrative quality (www.reddit.com)
Google AI Edge Gallery v1.0.13 & v1.0.14 updates: Gemma 4 Multi-Token Prediction, Pixel TPU support, experimental MCP, new skills, now saves chat history (github.com via reddit) Google AI Edge Gallery ✨ Explore, Experience, and Evaluate the Future of On-Device Generative AI with Google AI Edge. AI Edge Gallery is the premier destination for running the world's most powerful open-source Large Language Models (LLMs)…
PSA: Having issues with Qwen3.5 overthinking? Give it a tool, and it can help dramatically. (www.reddit.com) I'm sure everyone has seen the posts from people talking about Qwen 3.5 over-thinking, or maybe you've experienced it yourself. Considering we're like 2 months out from the release and I still see people talk about this issue, I decided it…
Accelerating Gemma 4: faster inference with multi-token prediction drafters (blog.google via hn) Just a few weeks ago, we introduced Gemma 4, our most capable open models to date. With over 60 million downloads in just the first few weeks, Gemma 4 is deliverin…
My thought on Qwen and Gemma (www.reddit.com) This spring is really hot since the localLLM giant, both Qwen and Gemma released major models. I'm really excited with those release and happy with their capability.
Comparing Qwen3.5 27B vs Gemma 4 31B for agentic stuff (www.reddit.com) Models compared: Qwen3.5-27B-UD-Q5_K_XL gemma-4-31B-it-UD-Q5_K_XL Main flags for both: --flash-attn on \ --n-gpu-layers 99 \ --no-mmap \ -c 150000 \ --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \ --ctx-checkpoints 1 \ --jinja \ -np 1 \ --re…
G4-MeroMero-26B-A4B-it-uncensored-heretic Is Out Now, a Finetune of gemma-4-26B-A4B-it, With KLD of 0.0152 and 12/100 Refusals! (huggingface.co via reddit) When I previously posted the uncensored version of the 31B version of the MeroMero finetune, quite a few people asked for the 26B-A4B version, I wasn't so keen on it because I considered the 31B to be the better version, but I understand t…
Gemma 4 running fully offline on WebGPU with Transformers.js, controlling Reachy Mini over WebSerial. (www.reddit.com)
Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%) (aiexplr.com via reddit) Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.
great work, Gemma (www.reddit.com) another day with pi + gemma 26B
🚀Pocket LLM v1.5.0 is out: offline Android LLM chat with voice, image input, OCR, and camera capture (www.reddit.com) Pocket LLM v1.5.0🚀 New in this release: - 🎙️ Voice input - 🖼️ Image input with OCR, Gemma vision, and FastVLM support - 📷 Camera capture with retake, crop, and photo review - 🗂️ Previous chats side panel - 💾 Downloaded model deletion to sa…
Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result is Slower is Faster. (open.substack.com via reddit) Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article...
You can now read Gemma 3's mind (www.reddit.com) Anthropic has released new research to show what an LLM is thinking when generating a next token using NLA or "Natural Language Autoencoders", the NLAs are a pair to LLMs that can translate internal thoughts of LLM for any specific token.…
Roundtable chat with Talkie-1930 and Gemma 4 31B (www.reddit.com) Talkie-1930-13b-it and Gemma 4 31b in the same chat. Talkie is a 13B vintage language model from 1930.
I ran 8 open-weight models as agents in a persistent MMO for 10 days. Here's the 93k event dataset and some things that I learned (www.reddit.com) Howdy everyone! Quick disclosure: I work on this - it's a project my studio created called the Null Epoch.
Been using Qwen-3.6-27B-q8_k_xl + VSCode + RTX 6000 Pro As Daily Driver (www.reddit.com) So in response to the Great Token Reconning of 2026, I decided to try out Qwen 3.6 as a daily driver, and although it's only been about a day, I have to say I'm thoroughly impressed. I had to download the VSCode insiders edition and set up…
For chat and Q&A: Which MoE model is better: Qwen 3.6 35B or Gemma 4 26B (no coding or agents) (www.reddit.com)
common/gemma4 : handle parsing edge cases by aldehir · Pull Request #21760 · ggml-org/llama.cpp (github.com via reddit) If you are on Gemma (like me), you basically have to compile llama.cpp daily now
Benchmark: Windows 11 vs Lubuntu 26.04 on Llama.cpp (RTX 5080 + i9-14900KF). I didn't expect the gap to be this big. (www.reddit.com) As a life-long Windows user (don't hate me, I was exposed to it at a young age) I was wondering how much (if any) performance I'm leaving on the table. So I did the sensible thing and ran some benchmarks.
nvidia/Gemma-4-26B-A4B-NVFP4 (huggingface.co via reddit) Can confirm it works on a 5090, with 80% allocation (of 32gb) I got around 50k context. It's 18.8GB. Benchmark (Baseline full precision vs. NVFP4): GPQA Diamond 80.30% vs. 79.90%; AIME 2025 88.95% vs. 90.00%; MMLU Pro 85.00% vs. 84.80%; LiveCodeBench (pass@1)…
I stumbled on a Gemma 4 chat template bug for tools and fixed it (www.reddit.com) TLDR: tool parameters using the common JSON Schema pattern `anyOf: [$ref, null]` are rendered into the prompt as empty `type` fields. This strips the useful schema information before the model sees it.
Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results (www.reddit.com) Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative Prompts: 880 total, 80 prompts across ea…
Gemma 4 26B Hits 600 Tok/s on One RTX 5090 (www.reddit.com) I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM. Setup: GPU: RTX 5090, 32GB VRAM vLLM: 0.19.2rc1 Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit Draft model: z-lab/gemma-4-26B-A4B-it-DFlash Workload: r…
Larger Gemma-4/Qwen3.6 (www.reddit.com) Qwen3.5-122B-A10B at Q6_K is really good. Do you think we will see a larger MoE Gemma-4 or Qwen3.6 at some point?
Tested how OpenCode Works with SelfHosted LLMS: Qwen 3.5, 3.6, Gemma 4, Nemotron 3, GLM-4.7 Flash - v2 (www.reddit.com) I have run two tests on each LLM with OpenCode to check their basic readiness and convenience: - Create IndexNow CLI in Golang (Easy Task) and - Create Migration Map for a website following SiteStructure Strategy. (Complex Task) Tested Qwe…
What it feels like to have to have Qwen 3.6 or Gemma 4 running locally (www.reddit.com) Well or pretty close to it, they are excellent work horses. I run them in real work scenarios doing some of the work I used to do myself as a skilled expert in my field, billing 200$ an hour.
Gemma4 26b a4b Apex quant is quite good (www.reddit.com) I tried mudler's apex quant for gemma4 26b a4b and it was amazing! I got 38tps at 90.000 context with no loop and surprisingly no quality degradation.
Anybody else noticing how good gemma-4-26b-a4b is with one-shotting three.js? (rowanunderwood.github.io via reddit) I wrote up this little python app to cycle through a bunch of prompts like this: Single HTML file using three.js from CDN. A central rotating MeshNormalMaterial torus knot.
Gemma 4 31b 3D geometry (www.reddit.com) I have been nothing but impressed by the quality of Gemma 4 since release. In general conversation it's adaptable to different personas.
Extracted MTP tensor GGUFs - smaller donor models for grafting. (www.reddit.com) The script to graft MTP tensors requires a full GGUF model file. I felt that was a bit hefty, so I asked local Gemma to write something to just extract what's required.
Decoupled Attention from Weights - Gemma 4 26B (www.reddit.com) Absolutely unbelievably exciting work, split attention (i.e. a couple of GB) onto local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!!
Show HN: Site Mogging (sitemogging.com via hn) Hi HN, I've been playing around with Cloudflare's Browser Run and Workers AI to create this funny "website vs website"-website. Google's Gemma 4b model is actually quite good at vision.
Gemma-4-Harmonia-31B-Uncensored-Heretic Is Out Now, a Merge of Multiple gemma-4-31B-it Finetunes Designed for a Targeted Approach to Deep Neural Consolidation, Minimizing Regression While Amplifying Unique Capability Boundaries. With KLD 0.0047 and 9/100 Refusals! (huggingface.co via reddit) Provided in both Safetensors and GGUFs. Safetensors, llmfan46/Gemma-4-Harmonia-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/Gemma-4-Harmonia-31B-uncensored-heretic GGUFs, llmfan46/Gemma-4-Harmonia-31B-it-uncensored-heretic-GG…
z-lab released gemma-4-26B-A4B-it-DFlash. Anybody tried it yet? (huggingface.co via reddit) Past few days, its all been about MTPs. Somehow people missed out the fact that Z lab released the Dflash for Gemma4 26B a couple of days ago.
Are Qwen 3.6 27B and 35B making other ~30B models obsolete? (www.reddit.com) Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant? They seem to beat stuff like Qwen coder 30B, GPT OSS 20B, Gemma models, especially for coding and agent workflows.
Gemini 3.5 Flash vs Gemma4 31B - building SuperMario (Sound on!) (www.reddit.com) Asked new Google Model to build SuperMario. Compared with Local Gemma4.
Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? (www.reddit.com) Some people say they’d never go under Q8, and others say they find Q3 acceptable! What’s your take?
Gemma-4-Gembrain-31B-it-uncensored-heretic Is Out Now, a Merge of Multiple Gemma 4 31B it Finetunes Designed to Boost Logical and Lateral Thinking for Improved Adherence, Increased Swipe Variety and Enhanced Creative Prose, With KLD of 0.0186 and 13/100 Refusals! (huggingface.co via reddit) Provided in both Safetensors and GGUFs. Safetensors: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic GGUFs: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic…
Gemma 4 31B passed 7/8 real-world production tests — including ones I designed to make it fail. Full prompts + outputs. (www.reddit.com) I've been waiting for a capable free local LLM for a while. I think we're close — the quality is getting there fast, and Gemma 4 is the first open-weight model where I genuinely considered using it in production for simple-to-medium tasks.
[cupel] M5 Max 128GB: Qwen3.5-397B IQ2 @ 29 tokens per second (www.reddit.com) A year ago I would just read about 397B league of models. Today I can run it on my laptop.
Open Weights Models Hall of Fame (www.reddit.com) I read a lot of "whengguf" type posts. I think we should sometimes stop and be grateful.
Qwen3.6-35B-A3B-UD-IQ4_XS C++ to Rust Code Port Test: It Worked (Mostly)! (www.reddit.com) When Qwen3.6-35B-A3B was released a week or so ago, I sort of expected an iterative improvement on the previous Qwen3.5 models. After all, those models were pretty decent as compared with the previous local models I had tried, and Qwen3.5…
Turn an old Android phone into a Local AI Voice Assistant (www.reddit.com) I had a nice old cracked pixel 5a laying around that I wanted to get some use out of, so I turned it into a local AI Voice assistant. A server on a laptop running llama.cpp gemma-3-4b-q4.gguf served by flask connects to a script running on…
Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers (www.reddit.com) I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.
Gemma 4 E2B runs surprisingly well on my 8GB Android phone, so I built a private voice notes app around it. (www.reddit.com) Been running Gemma 4 E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size.
Multi agent AI Trading Floor (www.reddit.com) Hello, I built a multi agent AI trading floor for a school project: 10 agents (news, research, macro, crowd sim, trading…) Running 100% locally on Ollama, Gemma 4:26b, qwen3.6:35b, gemma4:31b. no paid APIs.
Pocket LLM v1.3.0: Offline local LLM chat on Android with LiteRT + ONNX builds (www.reddit.com) Hi everyone, I've been working on Pocket LLM, an Android app for running local LLMs fully offline for private, real-time chat. The latest v1.3.0 update adds: • LiteRT support for Gemma 4 E2B, Gemma 4 E4B, and Qwen3-0.6B • Persistent local…
Got local Qwen 3.5/3.6 generating meeting summaries entirely offline on an M4 Max. Demo with Wi-Fi off. This is the future. (www.reddit.com) I'm the founder behind Hedy, an AI meeting app. I'm a huge supporter of Local AI, and we've been working on making it "consumer friendly".
Prompt injection benchmark: delimiter + strict prompt took Gemma 4 from 21% to 100% defense rate (15 models, 6100+ tests) (www.reddit.com) When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them.
GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) — aiming for 64K ctx, 30+ t/s (www.reddit.com) Hey all, Looking for some real-world advice on GPU choices for running the new dense models — mainly Qwen 3.5 27B and Gemma 4 31B. What I’m targeting Context: 64K+ (ideally higher later) Speed: 30+ tok/s @ tg128 minimum Power: not critical…
Running llama.cpp on Snapdragon Hexagon NPU seems promising (www.reddit.com) https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/snapdragon/README.md I have an Oneplus 12 with Snapdragon 8 Gen 3. I followed the above README to cross-compile llama.cpp on Ubuntu and then copy to the Termux directory on the…
I've created a LoRA for Gemma 3 270M making it probably the smallest thinking model? (www.reddit.com) https://huggingface.co/firstbober/gemma-3-270M-it-smol-thinker Here is an example of the output: ``` ==================== THINKING ==================== Here is the thinking process: This is a large community with a wide range of interests…
Throughput and TTFT comparisons of Qwen 3.6 27B, Qwen 3.6 35B A3B and Gemma 4 models on H100 (www.reddit.com) I wanted to figure out which of the newer small and mid-size models are actually worth running on a single H100, so I put 8 of them through a proper vLLM benchmark and recorded what came out. The setup was simple.
Small Gemma 4, Qwen 3.6 and Qwen 3 Coder Next comparison for a debugging use-case (www.reddit.com)
Ive automated my email/sms/phone (www.reddit.com) we got it good boys! how many of you are doing this??
Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp (www.reddit.com) https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja Yall are more than welcome to try it out and provide feedback. In my own testing in Pi-coding-agent I no longer have the "forgot to close thin…
Looking to migrate off of Ollama and LMStudio (www.reddit.com) Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.
gemma-4-31B-it-DFlash has been released (www.reddit.com) https://huggingface.co/z-lab/gemma-4-31B-it-DFlash I guess we'll have to wait until this PR is merged before we can test it. https://github.com/ggml-org/llama.cpp/pull/22105
12GB-Club: 4070S qwen3.6 27b + 35b a3b, and Gemma 4 26b a4b + 31b speeds (www.reddit.com) Longtime lurker here, thought i should post my speeeeds... I have a RTX 4070S 12 GB Vram (+10% OC), AMD 9800x3D with 4x16 Gb DDR5 6000Mhz CL30.
Gemm4:e4B-IT good at instructions following no refusals. (www.reddit.com)
why gemma 4 31b so bad in long context? (www.reddit.com) question, I'm using it for text translations and on each large prompt (20K+) it stops with a remark 'now I'm going to put that to the file' or some other operation I have asked in the prompt for but it did nothing, just stopped. I'm runnin…
Is there any case of a less quantised smaller model outperforming a more quantised larger model? (www.reddit.com) As per the title Such as Gemma 4 31B Q4 K S vs Gemma 4 26B A4B Q8 Or Qwen 3.6 27B Q4 K M vs Qwen 3.6 35B A3B Q6 K Etc At what point is it worth switching? My use case is mostly creative writing.
Choosing an abliterated version of Gemma 4 31B and 26B-A4B (www.reddit.com) The only thread was 2 months ago, when the model had just dropped. Since then, more versions from different authors have appeared, and users have had time to test them.
LatitudeGames/Equinox-31B · Hugging Face (huggingface.co via reddit) new model from LatitudeGames - Gemma 31B finetune https://huggingface.co/LatitudeGames/Equinox-31B-GGUF Equinox draws its name from the balance between extremes. Trained on a balanced blend of Wayfarer 2's unforgiving dark adventures and H…
Gemma 4 + LiteRT-LM on mobile: much better memory/perf than my llama.cpp setup (www.reddit.com) Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where i see huge potential and where I truly believe AI will become more useful for day to day tasks. Around the gemma 4 release I was already…
How many of you tried BeeLlama.cpp? How's it? Agentic coding possible with 8GB VRAM? (www.reddit.com) We'll be getting those features(check bottom link) on mainline soon or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs(and also big, large GPUs).
Local AI video pipeline review: Qwen3 27B beat Gemma 4 26B for tool calling (www.reddit.com) Watched All About AI's 100% local Fireship-style video automation experiment over the weekend (link in comments). A few things worth flagging if you're trying the same stack.
Two related prompts, different results: Qwen 3.5 and Gemma 4 need different prompting than Qwen 3.6 (www.reddit.com) With every new model release there's the "better than Opus 6.13" guys vs the "this is so bad, why did they even release it" camp and I'm always wondering which one is using it wrong. So I did a little test with 2 related prompts, 3 models…
For Non-hallucinating work, MiMo 2.5 delivers (www.reddit.com) MIT license and fully open source. MiMo-V2.5-Pro was just 3 points from Opus 4.7 max and the normal V2.5 is only a step behind SOTA.
Speculative decoding with Gemma-4-31B + Gemma-4-E2B enables 120 - 200 tok/s output speed for specific tasks (www.reddit.com) So for my project I was using up until now either Gemini 3 / 2.5 Flash or Flash-lite. All my use cases are not agentic, simply LLM workflows for atomic tasks like extracting references from the law, classifying, adjusting titles to nominat…
Llama.cpp vs LM Studio on gaming PC (www.reddit.com) Here is my experience, I've been using LM Studio with RTX 5080 and 64GB RAM using Windows 11. I'm very happy with LM Studio except the speed.
Gemma 4 and the Economics of Selling AI (gertlabs.com via hn) Benchmarks, rankings, and live play for AI models and agents.
Thinking with a smaller model to speed things up? (www.reddit.com) Question: can i do the thinking with a smaller model, like Gemma 4 4B, then use that as the prompt for Gemma 4 31B, to speed things up? Has anyone done this and measure if it's worth it?
Show HN: Hitoku Draft – Context aware local assistant (hitoku.me via hn) Hi guys. I have been working on Hitoku Draft, an open-source, voice-first AI assistant that runs entirely locally.
24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context) (www.reddit.com) I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). Results (Q4_K_M mo…
Has anyone been able to get Draft Models to load in LM Studio? (www.reddit.com) Per title. Been trying to load Gemma E2b as draft model for 26b as target using LM Studio's UI but it can't seem to recognise what's already been downloaded.
What do you use Gemma 4 for? (www.reddit.com) Both Gemma 4 and Qwen 3.6 seems to be the hottest local models right now. Looking at the benchmarks and reviews, it seems like it's better in every way: coding, benchmarks, agentic tasks.
Anyone tried +- 100B models locally with foreign languages? (www.reddit.com) I am quite curious as I tried Gemma 4 31B, Qwen 3.6 27B, GLM 4.7 30B and some others in my native language (czech). Gemma performs "best" and considering the fact its "just" 18GB model - it actually blows my mind how well it can respond in…
I hate this group but not literally (www.reddit.com) True story, I got interested in AI after seeing it at work and wanted to run models locally. I started with an M3 Ultra 96GB, quickly learned it was not enough for what I wanted, and kept upgrading hardware (including refurbished Mac Studi…
AMD Radeon RX 6900 XT - ROCm vs Vulkan - Gemma 4 and Qwen 3.5 speed benchmarks (www.reddit.com) Did some quick tests after building llama.cpp with ROCm 6.4.2 and latest Vulkan for my 6900 XT. gemma4 E2B Q4_K, columns: ubatch | ROCm pp512 | Vulkan pp512 | ROCm tg128 | Vulkan tg128. Rows: 32 | 1536.60 | 1423.49 | 151.92 | 174.59; 64 | 1590.65 | 1930.60 | 151.41 | 173.76; 128 | 265…
AMG GPUs are faster at pre filling (www.reddit.com) I gave the same prompt and the same document to a 1660ti running Gemma 4 e2b q4 (because of the small vram) and to an igpu running Gemma 4 e4b q8. The prefill rate before token generation was like 4-5 times faster with the 890m igpu then token generat…
gemma4 vs qwen3.5 122A10 real usages (www.reddit.com)
Deploying Gemma 4 26B on an RTX 5090 (datapnt.com via hn)
What's your favorite small-medium local model? (www.reddit.com) I'm now having fun with Gemma-4-E4B and Qwen3.5-9B, trying different variants like Gemopus and Qwopus, and Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0 don't quite know other models, so what's your favorite? why and how are them?
Why some small/medium models fail at grammar checking task? (www.reddit.com) Recently, I tried playing with gemma 4 (gemma-4-E4B-it-Q5_K_S.gguf) and found that it fails at an easy grammar check (it tries to fix the already correct word "contemporary"). I noticed the same mistake from openai/gpt-oss-20b and qwen3-next-80b-a…
Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM (arstechnica.com via hn) The generative AI boom has driven the cost of memory into the stratosphere, and Google is a key part of that trend. So it’s only fitting that Google should offer some less RAM-hungry local AI models.
Show HN: I made a Gemma 4 Mac app that names screenshots with local AI (snapname.app via hn) I made my first macOS utility app that ships with a bundled Gemma 4 model, specifically the Gemma E4B one. It made my app DMG have 5.3 GB in size, but I think it is a small size for the power that this free local model can provide.
Fun Local LLM Comparisons with Gemma, Granite, and Qwen (ekorbia.com via hn) Ekorbia v0.2 features a comparison-chat mode that runs 2-3 local models against the same prompt in parallel. Here are a few fun prompts running across Gemma 4 (e2b), IBM Granite 4.1 (…
gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic is Out Now, A Writing Finetune that Aims to Improve Gemma 4 31B it Writing Quality with More Natural English and Better Prose, Good for Creative Writings, Translations and RPs! (huggingface.co via reddit) Provided in both Safetensors and GGUFs. llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic llmfan46/gemma-4-Ortenzya…
how would you set up a local llm server for a business of 7 people? (www.reddit.com) Okay so i've been stalking this sub for some time and i run the occasional small 2-8b model on my laptop (not the best) for fun but say my role at a company is to set up a local LLM since we obviously don't want confidential data going to…
Qwen3.6 9B will release around Google I/O? (www.reddit.com) I don't think alibaba officially stated "no qwen3.6 smaller models", and according to the patterns, it should have been released in the first week of may, but I think they delayed a little bit to catch the spotlight from Google I/…
The "the future is fictional" problem of many local LLMs (www.reddit.com) Many local models have a problem (that arose due to excessive RLHF training): They mostly think that everything that is beyond their knowledge cutoff date would be "fictional" or "satirical". To be fair: Even the Gemini API without web ac…
Q: Does DFlash (and PFlash) work with Heretic models? (www.reddit.com) Z-Lab did some good work with speeding up output, while Luce managed to use smaller models of the same family to accelerate prefill... Since Heretic and other "smart ablation" tools can decensor a model, would they work with these multi-mo…
Show HN: ChonkLM – Tiny language models running offline in the browser (chonklm.com via hn) I had been looking to try <500M parameter language models but you wouldn't find an API to try them anywhere, so I built this cloudflare hosted static website that hosts weights and built an inference runtime for these models that uses WebG…
Gemma4:31b-coding-mtp-bf16 - slow on Macbook M5 128gb (www.reddit.com) Very quick initial test of Gemma 4's new MTP model via Ollama (llama.cpp doesn't support it yet) https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/ Running in Open WebUI to view token/s output and I…
Offload routine Claude Code work to Gemma 4 through the Google GenAI API (www.reddit.com) The idea of offload-mcp is simple: instead of running hardware-hungry local models for routine work, let Claude offload that work to FREE model APIs and SAVE tokens. I’m using Gemma via the Google GenAI API because I like it in my processi…
Using a Radeon 9060 XT 16 GB, the gemma4 24b a4b iq4 nl model achieves 25.9 t/s (www.reddit.com) I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Ge…
Escaping model lock-in (www.reddit.com) I have observed that many ai teams try to always use the best model to ensure quality. When a new model drops out, they are forced to pay for it, because their competitors will.
Fine-tuning and deploying Gemma 4 is not that easy (ghost.oxen.ai via hn) Writing a fine-tuning and deployment pipeline isn't as easy as it looks (Gemma 4 Version) Fine-tune and deploy Gemma 4 on Oxen.ai Google's Gemma 4 dropped in April 2026 with multimodal support (text, image, video, audio), a novel hybrid KV…
GGUF Quants Arena for MMLU (24GB VRAM + 128GB RAM) (www.reddit.com) Dataset: MMLU subset (DEV+TEST). llama.cpp settings: 3 params only: ctx 8192, seed 42, fa on. Let me know what else you want to see. Thanks.
Gemma 4 running locally on an iPhone 13 Pro (www.reddit.com) I’ve been experimenting with running LLMs fully on-device, and managed to get Gemma 4 running locally on an iPhone 13 Pro. This is built on top of a lightweight Swift wrapper I open-sourced: https://github.com/mylovelycodes/LiteRTLM-Swift…
[Fix] Gemma 4 MCP tool calls broken in LM Studio — "Unknown test: sequence" (www.reddit.com) If you're using Gemma 4 with external MCP servers in LM Studio and getting this error: Error rendering prompt with jinja template: "Unknown test: sequence" This is a bug in Google's official Gemma 4 Jinja prompt template. LM Studio's Jinja…
Running Gemma4 31b-it on vLLM 0.21.0 A100s (bad quality or what am I doing wrong) (www.reddit.com) Okay, fun time: I got access to two NVLinked A100s for a research project. I benchmarked my work against the Gemma 4 31b-it available through Google, but my dataset is rather massive, so I need to run it on the "local" resources. Basically…
Anyone use QwQ-32B? It's over a year old? Has Qwen 3.6 27b basically replaced it? (www.reddit.com) I saw this one mentioned, but the source was from about 14 months ago. In the age of Qwen 3.6 and Gemma 4, is there still a use for QwQ 32B?
AI content detector based on Qwen 0.8b fine-tuned on Pangram dataset (www.reddit.com) I've fine-tuned Qwen 3.5 0.8B on the dataset provided by Pangram with their EditLens paper. It's available via a Chrome extension; you can just click selected text and it's going to give you the probability distribution of how likely it is…
Run Chrome’s tiny Gemma4 (aka Gemini Nano) directly on PC without GPU (www.reddit.com) Remember that sneaky download of Gemini Nano earlier this month? If you talk to it, it will happily tell you it’s a Gemma.
Gemma 4 MTP with LlamaCPP (www.reddit.com) I am running Gemma 4 31B for a project using LlamaCPP. There is no integrated main model + MTP drafter GGUF.
I just bought Asus Ascent : Nvidia GB10 (DGX) and It is slower than my Ryzen Ai Max (www.reddit.com) It is supposed to be 2-4x faster, but I am only getting 6 tokens/s on Gemma4-31B. What am I doing wrong?
Does THINKING MODE significantly improve translation? (www.reddit.com) Between a solid model from Qwen or Gemma 4, when translating a text, does "thinking mode" significantly boost the quality of the translation, or is the difference negligible?
RTX 5060Ti 16GB or RTX 3080 20GB? (www.reddit.com) I would like to dedicate a budget of about 500 euros to upgrade my workstation and run inference on the qwen 3.6 27b and gemma 4 31b models. I currently have an RTX 5060Ti 16GB.
Will unsloth release MLX versions of the MTP qwen3.6 and gemma 4 models? (www.reddit.com) Question in title. Would be awesome to have this on macs, especially q8 or whatever the minimal-loss quant is, since macs can have lots of ram.
Terrible Vulkan pp/tg on Arrow Lake iGPUs (www.reddit.com) Hi, I recently tried to get llama.cpp with SYCL running on an Arrow Lake system but gave up halfway through since Vulkan is just way easier to set up. But, the pp/tg I'm getting on Vulkan w/ Arc 130T is disgustingly bad - 100 tokens/s for…
3060 Ti 12GB vs RX 7600 XT 16GB? (www.reddit.com) Trying to figure out which is better for LLM. Mainly Gemma 4.
Should I sell my RTX3090s? (www.reddit.com) I have a GPU server (4 × RTX3090s) that I've been using for research and PoC in the past 2 years. Mostly running vLLM for Qwen, GPT-OSS, and Gemma.
What are some good use cases for Gemma Embedding 2? (www.reddit.com) Does anyone know of any use cases of Gemma Embedding 2? Or is it solely for search?
Five labs, one suite, do model families have personalities? (benchmark) (www.reddit.com) Bench 3 from my 18GB M3 Pro. Bench 2 was the 4B-class post where the comments were mostly right: I gave thinking models a fixed 1024-token cap, Qwen got kneecapped, Gemma E4B needed clearer active-param labeling, and the headline was partl…
TurboQuant enabled Runtime Valkyr (www.reddit.com) Based on the recent TRiP source code by Carlo Valenti. Ported to Zig and headless Vulkan Compute shaders.
How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber (patloeber.com via reddit) Tutorial from the Google guy, I use very similar setup (llama.cpp instead of lmstudio)
Why are there so few small local creative writing models from the Chinese? (www.reddit.com) At this moment, models such as Qwen 3.6 35b/27b crush the competition, yet I can't help but notice this pattern. While the local RP scene is abundant with the Western model tunes: LLaMA, Mistral (all sizes), Nemo and more recently Gem…
What are your most interesting and hard Vision use cases? I plan to do side by side comparison of Gemma 4 (31B) vs Qwen 3.6(27B) Vision and I look for inspiration (www.reddit.com) Hey guys, I built a custom vLLM pipeline to run Gemma 4 (31B FP8) and Qwen 3.5 side-by-side locally to see how they actually perform in the wild with preprocessing of audio and images. But of course new model Qwen 3.6 27B came out just whe…
Gemma 4 vs Qwen 3.5 Vision on vLLM — 5 things I learned benchmarking them side-by-side (Reasoning budgets, FP8, pre-processing the input). (www.reddit.com) Hi guys, I’ve been running side-by-side experiments on Gemma 4 (31B FP8) and Qwen 3.5 Vision for the last few days using vLLM in Docker to see how they actually handle real-world images and video. A few things I found out: 1.
Handling a large amount of files (www.reddit.com)
LLM Router: Best way to dynamically route prompts between proprietary and open-sourced models? (www.reddit.com)
Ask HN: How do you use Local LLMs? (April 2026) (news.ycombinator.com)
Building a fully local Android manual assistant (LiteRT-LM + RAG) what architecture would you use? (www.reddit.com) Hello everyone, I’m building an offline RAG system for my company, we are trying to run an app that retrieves information from two manuals on an Android tablet with the idea of an AI to provide precise answe…
Gemma, my precious (philipmw.github.io via hn) This week I finished my income tax return. Gemma, my new personal assistant, helped.
MB Pro M5, 24GB/32GB difference? (www.reddit.com) Hi, I got a new MB Pro 24GB/1TB. I've tested Gemma 4 26B with Ollama, 16k context.
Strix Halo 128GB on Proxmox - Vulkan vs ROCm benchmark matrix (www.reddit.com) Ryzen AI MAX+ 395, Bosgame M5, 128GB LPDDR5x. Proxmox VE 9.1 LXC containers with GPU passthrough.
Issues with Gemma 4 tool calling - abrupt gen ending despite the model telling me it wants to do X. (www.reddit.com) Hello, I have noticed an annoying issue with Gemma 4 26b a4b. It seems like it cannot do multiple think->tool call->think->tool call turns.
Loading "stacks" of models on-demand? Does a tool like this exist? (www.reddit.com) I'd like to self-host some LLM models but a couple different ones for different usecases, and they don't all fit in VRAM at the same time. So i'm kind of looking for a tool in which i can define "profiles" or "stacks" of LLM's that get loa…
What's the deal with Qwen3.5's and Gemma 4's reasoning traces? (www.reddit.com) Hey there, I noticed something odd when trying out the latest and greatest local reasoning models recently. First, I just noticed it for Qwen3.5, but Gemma 4 seems to do it too: The reasoning traces do that weird thing of starting with "He…
RTX 3090 llamacpp flags help (www.reddit.com) Hi, my current system hardware is an RTX 3090 with 24GB VRAM and 64GB system RAM, running Windows 11. I've been playing around with hermes agent and local LLMs (Qwopus3.5-27B-v3-GGUF & gemma-4-26B-A4B-it-GGUF). When I try asking the hermes agent to do a task with…
Gemma 4 E4B as a primary local LLM (replaced Qwen) (digg.com via hn) Gemma 4 E4B 6bit is now the local model of my choice and loaded 24/7 on my Mac (using @lmstudio), replacing Qwen3, 3.5 4B after ~9 months of usage What an insane model, congrats @GoogleDeepMind 🤠 The new setup replaces his nine-month daily…
Ask HN: Is it feasible to run a model on device for complete privacy? (news.ycombinator.com) Tried Gemma, Qwen and a few others. Need vision and larger context windows for an application I am working on.
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB (ziraph.com via hn) Apples® to Apples®: MLX vs llama.cpp for Gemma 4 12B on an M1 16GB A matched-quant MLX-vs-raw-llama.cpp benchmark for Gemma 4 12B on one M1 16GB - decode is a tie, both pinned at the bandwidth wall. The cost that differs is startup and CPU…
Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency (blog.google via hn) Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to acce…
Gemma 4 12B: The Developer Guide (developers.googleblog.com via hn) Following the announcement in our launch blog, we are releasing Gemma 4 12B, a dense multimodal model with a unified, encoder-free architecture. Gemma 4 12B introduces several milestones for local AI: Traditional multimodal models rely on…
Gemma 4 12B appears in Hugging Face (huggingface.co via hn) gemma-4-12B-it-GGUF Recommended way to run this model: llama-server -hf ggml-org/gemma-4-12B-it-GGUF Then, access http://localhost:8080
Gemma 4 26B on a consumer GPU: build pain, throughput, and BFCL numbers (algollabs.com via hn) 2026-05-05 Gemma 4 26B on consumer-grade 5070Ti GPU A week running Google's Gemma 4 26B as my daily local agent on a single RTX 5070 Ti. No API calls, no cloud, no rate limits.
Show HN: Free open source coding models in Slack (www.runcord.com via hn) Hey HN, We believe we have the easiest onboarding from signup to being able to spin up coding agents in slack like Stripe, Ramp & Coinbase. Demo of the onboarding: https://www.tella.tv/video/connecting-cord-to-slack-1-19ep Every signup get…
Local LLMs on Refurb M4 Max vs new M5 Max (www.reddit.com) Hoping the community can guide me on this one. I'm on the fence about the following purchase: Refurbished 16-inch MacBook Pro Apple M4 Max Chip with 16‑Core CPU and 40‑Core GPU, 64gb ram for $3,479.00 vs The new 16-inch MacBook Pro Apple M…
Is something went wrong with those online free model, why I feel they worse than Gemma 4 26B A4B Q4_KM ?? (www.reddit.com) It started when I just wanted to make a roleplay-style chat app with characters, but Gemma 4 26B A4B Q4_KM doesn't have info on some old characters, so I crawled back to those online services, as those models have far more parameters and quite update i…
Show HN: Charm – on-device spelling, grammar, and prediction for macOS (www.theodorehq.com via hn) I've spent the last year building Charm, a native macOS menu bar app that corrects spelling, fixes grammar, and predicts your next word. Three features: - Spells: NSSpellChecker plus a local LLM for context-aware corrections (catches "defi…
5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp (www.reddit.com) Hey all, While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.
Any good MOE ~60B models? I have 64GB vram (www.reddit.com) I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before rampocolypse for ~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while. Are there any good MOE models that are around 60B in parameters so I can ma…
Good candidate model to act as a PA (www.reddit.com) I really benefit a lot from having claude code act as a personal assistant - it reminds me of things I need to do, helps me focus on what matters, and keeps me accountable on making sure I don't let important things slip But I am well awar…
converting weights to snn (www.reddit.com) Hello everyone, I developed the snn architecture from scratch based on the human brain. I had several successful launches of training spike models from scratch and I also had an idea: what would happen if I took the gemma 4 model and conve…
Is Qwen3-coder the best kept secret out there? (www.reddit.com) So I'm brand new to this scene but I'm using Claude to help me fine tune a model for a startup idea I have in the Healthcare space. I have been working with the 27-35B parameter models (Qwen3.6, Gemma 4) and the couple of 120B+ models (Qwe…
Grafting a Speech Head onto Gemma 4 E4B (www.frisson-labs.com via hn) For a Discord buddy, the tempting model shape is small, fast, and multimodal. It should hear the call, see the game, read the chat, and respond quickly enough that the moment is still alive.
Tools in Openwebui (www.reddit.com) I am trying out some tools that are from the openwebui community that I have directed towards my LM Studio server instance. It seems really hit or miss on most of the tools being called by the LLM or not.
Gemma 4 - website translations (large model, or small model)? (www.reddit.com) I have setup a workflow to process website translations with Gemma 4, I just host it on LM Studio, and a custom Python wrapper iterates through and runs overnight. My question is..
Gemma4 26B A4B NVFP4 GGUF (www.reddit.com) Hey everyone! I’ve just uploaded a GGUF version of nvidia/Gemma-4-26B-A4B-NVFP4.
On-Device AI Coming to React Native with Gemma and React Native Executorch (twitter.com via hn) Software Mansion @swmansion On-device @googlegemma in React Native with react-native-executorch Coming to the library very very soon!
Anybody tried openclaw + M5 pro + 48gb? (www.reddit.com) Hello, posting again on this since my last post was removed. I am working on an AI agent solution to help me with my multiple daily tasks for different business activities; a few rental properties, a manufacturer trying to enter the Mexico…
Show HN: Llmconfig – configfile and CLI for local LLM (github.com via hn) llmconfig Local Large Model Config — manage local inference with llama.cpp, stable-diffusion.cpp, and whisper.cpp from a single YAML file and a single CLI. llmconfig up gemma # or just: llmc up gemma ✓ gemma is ready at http://127.0.0.1:80…
Tried running Claude Code with local LLMs via Ollama — ended up subscribing to Pro anyway. But now I can't disconnect from the local server. (www.reddit.com) I've been experimenting with using Ollama to run Claude Code locally with models like Gemma 4, thinking I could avoid API costs. However, I quickly realised these models aren't really optimised for Claude Code's agentic workflows — they te…
"LLM is created so engineer don't have to write a report", anyway found out ONLYOFFICE can connect to OpenAI compatible, using Qwen 3.6 to do elaboration. (www.reddit.com) It is pluggin made for ONLYOFFICE, much simpler than copy-paste from webui. PS.
Which model for 32GB M2 Max? (www.reddit.com) I would like to experiment but before investing loads of money, I do have a MacBook Pro with 32GB RAM, M2 Pro. Which model would maximize versatility given this hardware?
Qwen 3.6 and Gemma 4 "Zombie Loops" (terminal thinking loops) (www.reddit.com) I've got to the point where I need some help. I'm trying to run Qwen 3.6, and it will eventually fall into a loop where it's just outputting "/" symbols when it's "thinking".
Llama.cpp MIPS R8000 Kernel Running on an SGI Power Challenge from 1995 (twitter.com via hn) Whew! Big work today getting optimized llama.cpp MIPS R8000 kernel running on the SGI Power Challenge deskside from 1995 with Gemma 3 270M.
llama.cpp's Preliminary SM120 Native NVFP4 MMQ Is Merged (www.reddit.com) https://github.com/ggml-org/llama.cpp/pull/22196 And somehow we already got some GGUFs for it! https://huggingface.co/CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF (the below one is…
Gemma-4 MLX reasoning? (www.reddit.com) Gemma-4 is great. On a MacBook M5, using lm-studio, the MLX versions (specifically looking at https://huggingface.co/lmstudio-community/gemma-4-26B-A4B-it-MLX-8bit) rock.
Is long re-processing of output as input a common "feature" or not? (www.reddit.com) I now use (mostly) Gemma 4 and Qwen 3.5 models *. And seems that all of them, after context grows a bit, after providing long output for me and getting a short prompt in response, are starting to process many new tokens as input and I have…
I ran Gemma 4 E2B with llama.cpp on a lot of different iPhones, here's the setup report (www.reddit.com) TLDR: I've been running gemma4 e2b extensively on iOS with llama.cpp and found some interesting quirks and info you guys may like! These are specifics for the iPhone and what I've found worked across 20+ devices.
Running Gemma 4 31B on Mac with Ollama (sammyrulez.github.io via hn) A practical configuration for a 32 GB M5 Mac that still needs to remain usable Running large language models locally has become surprisingly practical on Apple Silicon. With a modern Mac, Ollama, and a carefully quantized GGUF model, it is…
Were Qwen3.6 models scrubbed from openrouter? (www.reddit.com) I made a simple app using openrouter, hoping to use the new small qwen models (the a3b moe and the 27b dense one), but they aren’t listed. Also, I swear some qwen3.6 models that were listed before are missing now.
Gemma 4 is not your standard transformer (idlemachines.co.uk via hn) Gemma 4 makes five quiet departures from the standard transformer recipe. QK-norm instead of 1/√d, partial RoPE on global layers, per-layer input gating, KV sharing across layers, and an MoE that sits alongside the MLP rather than replacin…
How is Rotorquant/planarquant/iso qaunt better? (www.reddit.com)
Has anyone figured out STT with Gemma4 for Home Assistant? It works but responds with full thought chain. (www.reddit.com) I have Gemma4-E2B working within home assistant as STT, and E2B seems fast and accurate for STT (maybe a bit better than Parakeet), however, it responds with the entire thought process: https://preview.redd.it/v8zhb5elltvg1.png?width=599&f…
Intel Lunar Lake 258V (32GB) vs Qwen 3.6 35B-A3B: Pushing the limits of MoP architecture. (www.reddit.com) Hardware: Intel Core Ultra 7 258V, 32GB Unified Memory. Model: Qwen 3.6 35B A3B (Quant: Q3_K_S) via LM Studio.
Will Gemma 4 replace Claude Code or are we lying to ourselves again (webmatrices.com via hn)
Getting gibberish when trying to generate with gemma-4-31b-it in LM Studio (lmstudio-community quant) (www.reddit.com)
Knlowledge Graph and hybrid DB (www.reddit.com) Hello, everybody! I'm building a hybrid database with Qdrant and Neo4j for a few personal projects.
Minimax M2.7 on Q3_K_S or Smaller Model with greater precision? (www.reddit.com) I currently am looking for models to fit into my single DGX Spark for use. I have an RTX Pro 6000 and also a 5090 as well that I'm considering using in combination if the DGX Spark is too slow, but the intent here is to play around with Op…
Ollama Cloud - Pro (www.reddit.com) Hi. I've been looking at ollama cloud's Pro offering ($20), which says "Run 3 cloud models at a time".
Does an MLX conversation have same capabilities as the GGUF? (www.reddit.com) For example, in LMStudio the official Gemma 4 is a GGUF that has Vision, Reasoning, and Tools flags. But the MLX version does not.
Suggestion for a local model to solve math problems. (www.reddit.com) Does anyone know of a good edge local LLM that is good at math? I tried Gemma 4 E2B and Microsoft Phi mini reasoning, but neither can answer some basic aptitude questions.
How do I use gemma4 on 5090 gpu for coding? (www.reddit.com) I'm trying to replace openai codex which i used for development all the time, with gemma4 on 4090, small tasks it solves quite impressively, but i need to have some agent. So I tried to connect 31b to cline and to aider and it didn't reall…
How the Community Trained Gemma to "Think" with Tunix and TPUs (developers.googleblog.com via hn) Discover how developers at the Google Tunix Hackathon trained small Gemma models to reason under a limited compute budget. Learn the winning, open-source post-training recipes—combining SFT, GRPO, and SimPO—to build your own structured rea…
Llama.cpp: What's up with -sm tensor + AMD + Vulkan? (www.reddit.com) Has anyone got it to work? I tried it with dense models (eg qwen 27b, gemma 31b, mistral 128b) since that's where I need it most, but it always core dumps.
Built a local-first AI memory system that indexes screen activity, meetings, and voice notes ( MCP + automations) (www.reddit.com) Been experimenting with an idea — what if your AI assistant actually remembered everything you did on your computer? Not stateless chats, but real persistent context.
ReAct tool-calling issue: Orchestration model computes internally instead of using tools (www.reddit.com) Built a local ReAct-style calculator agent with 6 tools: add subtract multiply divide modulo etc. The setup is: orchestrator agent dynamic tool selection ReAct loop tools exposed as functions Problem: Even when the user asks multi-step ari…
Want Built a React-style looping agent with small LLMs (Qwen 3.5 9B / Gemma4) + LangGraph? (www.reddit.com) Currently experimenting with building a React-style looping agent system using small LLMs like Qwen 3.5 9B and Gemma 4 (E2B), and I wanted to ask if anyone here has worked on something similar. Current setup: Using LangGraph Around 5 tools…
Gemma 4: A new, budget-focused model in Posit AI (posit.co via hn) Gemma 4 is now available in Posit Assistant via the Posit AI provider. It's priced at a tenth of the price of Claude Sonnet 4.6 and less than a third of the price of our current cheapest off…
Hermes w/cloud LLM and w/local LLM does it work? (www.reddit.com) I’ve tried openclaw locally for about a month. Hardware: M5 Pro w/48 gb ram.
how to install llamacpp the better way to wrapping it in python ui (CPU use only) ? (www.reddit.com) I want the installation that best fits my use case and my low-compute hardware. I want to run small and somewhat larger LLMs such as Qwen 2B, 4B, and 27B, and Gemma 31B, relying entirely on an old 4th-gen i7 CPU with just 32 GB of slow DDR3.
gemma 4 e2b quality degrades after ~30-40 continuous inferences on 4gb vram? (www.reddit.com) running gemma e2b via llama-server for continuous background tasks on a 1650 4gb. works great initially but after maybe 30-40 calls the outputs start getting noticeably worse — shorter responses, missing fields in json output, sometimes ju…
One Night Werewolf played by LLMs (www.reddit.com) The other day I posted about playing one night werewolf on my custom made UI via tool calls. Since then I’ve played a few games and improved the prompts.
I built myself a finite AI news feed which doesn’t undermine AI research (www.reddit.com) Hello, I built myself a news feed which scores and summarizes research papers along with relevant AI news from Hugging Face, Reddit, Hacker News etc. I used Claude Code to build the whole thing.
Llama-server and MTP (www.reddit.com) currently in order to use MTP one needs to enable it in the starting argument of llama server. --spec-type draft-mtp --spec-draft-n-max 2 But then other models that do not use MTP currently like Gemma or basically all other models fail to…
Seeing the activity pop up big time in this sub due to various open models. Most of them require at least 16gb vram. What can I do with 8? (www.reddit.com) Not deeply technically fluent, but I have run a few models locally before, around the time before Gemma 4 dropped. I tried a low quant of Qwen 2.5 Coder and after some tinkering I got it to run, but it was just so slow, obviously.
Best local model for C# coding with 24GB VRAM? (www.reddit.com) I can't decide whether Qwen 3.6 35b q4 (130k context) or Gemma 4 26b q4 (95k context) is better for C# coding with 24GB VRAM. Please share your experiences!
Recent Developments in LLM Architectures: KV Sharing, MHC, Compressed Attention (magazine.sebastianraschka.com via hn) Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs After a short family break, I am excited to be back and catching up o…
Adding E4B audio encoder to larger models (www.reddit.com) I am curious if anyone here has tried doing this. I did a bit of digging and it seems like it would be easier to do than I first thought, and I would like to ask for correction if my assumptions are wrong. Here is how I would go about it:…
llamacpp with Gemma4 31B dense and Gemma e4b as draft, plus audio input? (www.reddit.com) Hi, has anybody succeeded in running llama.cpp with Gemma 31b dense and Gemma e4b as draft model, and simultaneously inhibit the voice recognition feature? Is it even (theoretically) possible?
Replaced my $15/mo Wispr Flow subscription with a free local macOS app I built using Claude Code (www.reddit.com) I spend most of my day writing prompts to Claude. Read a study recently that said people speak ~3x faster than they type, which lands differently when "writing" is basically your whole workflow.
LLMs on flagships smartphones? (www.reddit.com) I have been curious to see how small LLMs like Gemma-4-E2B-it run on a flagship smartphone (S25+ with Snapdragon 8 Elite) in terms of prompt processing and token generation. I have created a script that uses llama-cli and I achieve 48 tps…
very slow tok/s with Gemma 4 31B on a 5090?! (www.reddit.com) Hi, I have a 5090 and I was toying around with hermes-agent. To utilize 128K I thought about switching from LM Studio to llama-cpp (the turboquant fork) expecting better tok/s and also saving some VRAM from context quantization.
Local audio/multimodal models that can be used for language pronunciation grading (www.reddit.com) My partner uses Duolingo for learning and practicing languages, but has been getting increasingly sick of it. I decided to experiment with whether local models would be good for creating and grading language exercises.
Are harnesses like OpenClaw and Hermes really necessary? (www.reddit.com) My setup: Windows 10/11 i7 12700K | RTX 3090 TI | 96GB RAM Local server: LM Studio Models: Qwen 3.5/3.6 27B|35B Q5 UD K XL + Gemma 4 31B| 26B Q4 UD K XL Up until this point, I've only used sota models for coding. When Qwen 3.5 dropped, it…
Gemma 4 E4B is great for short transcriptions (www.reddit.com) Yes, for material that is an hour long, there is no getting around tools like Whisper - or something even better. However, for transcribing short snippets, Gemma works very quickly and reliably- even in foreign languages.
Does Claude sonnet/opus also use drafter like Gemma 4 MTP? if not why? (www.reddit.com) Per my experience, Opus 4.7 is so slow, Sonnet 4.6 is ok. I am also using local models wondering if Claude is already leveraging drafters/assistant AIs and despite that so slow or not?
Gemma Chat: Offline Vibe Coding on Apple Silicon (github.com via hn) Gemma Chat Vibe code without the internet. A local coding agent powered by Google's Gemma 4 — runs entirely on your Mac via Apple's MLX framework.
Stop picking LLMs by reputation. Run the eval first. (www.reddit.com) We ran GPT-5.4 vs Gemma 3 27B on 2 prompts. One open-source model won.
Show HN: Airplane AI – Local NDA Safe AI Powered by Gemma (airplane-ai.franzai.com via hn) Private offline AI chat for macOS. Free for 14 days, then €29.99 once.
I built a local proxy that does context work for Claude so you don't have to (www.reddit.com) Hey folks, I posted here a few months back about how I was basically working for Claude -- pasting the same emails, re-explaining the same backstory, being its memory across every chat. Today I'm launching Contextify.
What's the right way to feed PDF files to Gemma-4? (www.reddit.com) In my line of work, PDF documents tend to be combinations of text, math formulas, tables and images. llama.cpp added support for PDFs a few months ago, but I believe it treats PDFs either as text (discarding everything else), or as images.
Qwen 36 27B + Gemma 4 - the best set for 1x 3090 ? (www.reddit.com) Hi guys 👋 When I started my adventure with Qwen 3.6 27B I felt wow.... Now when I connect it with Gemma 4 I'm feeling more wow...
What models for coding are you running for a mid level PC? (www.reddit.com) I have a 4060 (8GB Vram) and 16GB of ram wondering which models could fit in my setup for coding, the new Qwen 3.6 and Gemma 4 MoE models look good but might not fit, wondering about your experiences
BUILD portable AI system (www.reddit.com) Hey everyone, I’ve been thinking about a project idea and I’d love to get your feedback. The idea is to take a 1TB SSD and turn it into a fully portable AI system.
New Gemma chat template update by Google (huggingface.co via hn) Libraries llama-cpp-python How to use unsloth/gemma-4-E4B-it-GGUF with llama-cpp-python: !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="unsloth/gemma-4-E4B-it-GGUF", filename="gemma-4-E4B-it…
qwen 3.6 27B looping problem (www.reddit.com) Whenever I write here that I use gemma 31B I get answers that qwen 27B is better. I switched in the pi from gemma 31B Q5 to qwen 27B Q8 and generally I manage to code, document and run tests but somewhere after exceeding 100k context qwen…
Offload MCP – Offload tasks to free models via API and save tokens (github.com via hn) offload-mcp MCP server for offloading routine coding-assistant work to a cheaper model. The default model chain uses Gemma because the models are useful, open, and fun to experiment with.
I built an AI tool that turns any movie into viral recap videos in minutes (www.reddit.com) Hey everyone, I built a tool that creates movie recap videos automatically using local models. The problem: making recap videos takes forever.
interacting with gemma 4 w/ live video and audio (www.reddit.com) I saw someone on this forum demonstrate using gemma 4 - live streaming audio and video from his webcam to it asking it what it was seeing. It was pretty great but I can't find that post anymore and I can't find a good repo on GitHub where I…
Potential of Gemma4 Per-layer embeddings? (www.reddit.com) Hey there people. So let's talk about GEMMA 4 per layer embeddings.
Filed two PRs for SGLang which may help others too — FP8 KV cache corruption and memory leak on image requests (www.reddit.com) We run Qwen3.6-27B-FP8 at AI Router Switzerland and hit two issues, so I wanted to share in case anyone else runs into them. FP8 KV cache produces silent garbage output with radix cache prefix hits (PR #24198 — ✅ approved) We were running…
Model stuck in some thinking zone where it keeps saying a similar thing again and again (www.reddit.com) I experienced this with Q4 and Q3 versions of Qwen3.6-35B-A3B and Gemma-4-26B-A4B. It starts saying things which sound similar in thinking mode: I must do ....
RPers: how do the new Gemma and Qwen compare to the old 70B models? (www.reddit.com) I can’t really run 70B models on my current setup, but I’m curious haha
Comparing SVG Generation for the top open models (codeinput.com via reddit) Some of the larger models (like Llama) weren't available on OpenRouter, so I had to work with what was there. Best small model: Gemma 4 26B For its size, I think it had the best output.
Based on what should I choose Gemma 4 models/quantizations? (www.reddit.com) I have an RTX 4060 8GB(+16GB RAM) laptop, and when asking Gemini or ChatGPT, they say the Gemma 4 Q4 K M is the best fit for my hardware with Context Length around 16k-32k. However, in practice, after loading even a higher quantization lik…
Creation OS: local σ-gated LLM runtime — BitNet/Qwen/Gemma, abstention, conformal gate, MCP, no cloud (www.reddit.com) I’ve been building a local-first AI runtime that wraps local LLMs with a σ-gate — a measurement layer that decides ACCEPT, RETHINK, or ABSTAIN before an answer reaches you. The idea: local models should be able to say “I don’t know” instea…
If you could do anything with the local models in your corporate workflows, what would it be? (www.reddit.com) With the release of Gemma 4 models and a slew of open weight/source models subsequently, some of the workflows like drafting emails/ trivial coding tasks have become possible. I’m exploring the possibility of integrating some of the powerf…
Gemma 4 architecture support for QVAC-Fabric (Tether's llama.cpp fork) (github.com via hn) QVAC-Fabric Gemma 4 Architecture Patch Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork. Base: QVAC-Fabric temp-upstream branch Target: All Gemma 4 variants (E2B, E4B, etc.
I built a full web app using Qwen 3.6-35B running locally on my 5070 Ti with the BMAD Method — here's how it went (ggufbench.com via reddit) I've been running local LLMs since Qwen 3.5 dropped and I was really impressed by what we could run on consumer hardware. Fast forward another two months and we have gotten a handful more gems such as Gemma 4 and Qwen 3.6, so I wanted to p…
Most efficient way of running Gemma 4 E4B with multimodal capabilities on a laptop? (www.reddit.com) The gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not have proper support for vision and audio inputs (specially audio) for these models as of now.
Is mlx-optiq legit? Has anyone tested the new quants for Gemma4/qwen3.6 yet? (www.reddit.com) https://huggingface.co/mlx-community/Qwen3.6-35B-A3B-OptiQ-4bit https://huggingface.co/mlx-community/Qwen3.6-27B-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-OptiQ-4bit https://huggingface.co/mlx-community/gemma-4-31B…
Using Google's Gemma 4 E4B Local AI Model to Reverse Engineer a Simple Crackme (github.com via hn) I was playing around with the new Gemma E4B open weights local model which Google released, and to my surprise I was seeing a great deal of success in using it…
Ask HN: Will local models on normal hardware ever compete? (news.ycombinator.com) I have a Macbook Air M3 with 24gb RAM. The other day, I wanted to try running an LLM locally for the first time ever.
A weekend with LoRA on Gemma 4 E2B: instrumenting what fine-tuning changes (aiexplr.com via hn) Spent a week doing LoRA fine-tuning on Gemma 4 E2B (~5.1B total params, ~2B active in text decoder) for a narrow Python code-generation task. Bad outputs went from ~5% to 0% (greedy) and 1.5% (sampled) across 134 tests.
Gemma 4 Folks (www.reddit.com) Full Answer >>> A plane crashes on the border of two countries. Where do they bury the survi ...
Best settings for gemma-4 on a 3090? (www.reddit.com) 3090 (24G) + 32G DDR4 Currently running --mmproj mmproj-BF16.gguf --chat-template-kwargs '{"enable_thinking":true}' \ --flash-attn on \ --cache-type-k q4_0 \ --cache-type-v q4_0 \ -np 1 \ -c 160000 \ --jinja at 26B-A4B-it-UD-Q5_K_XL and ge…
Claude Cowork Now Runs Any LLM. Test It Free (www.productcompass.pm via hn) OpenAI, Gemma, Kimi K2, or run locally. Free via OpenRouter.
What are your favorite LLMs for translation/docuement work? (www.reddit.com) I am currently working on a system to translate books/web novels. I got a working prototype, but now I am looking into optimizing it.
Windows freezing up as VRAM fills up - Does this happen for everyone? (www.reddit.com) Hey everyone, I run llamacpp precompiled with CUDA 12.4 on Windows 11 with a RTX 4090. With small models like gemma-4-E4B everything runs fine, but as soon as I run a bigger model like Qwen3.6-27B (IQ4_NL) or a medium sized model with larg…
Short term access to 4x rtx6000pro... Suggestion on what to try/test? (www.reddit.com) Always been stuck with models that fit on my 16gb .... Going to have about a week for free with 4x rtx6000pro .
Has anyone managed to use gemma 4 e4b in Open Code/other agentic TUIs? (www.reddit.com) Hi everyone, as a power user I hit Claude Code's usage cap too often I wanted to set up my own local model, however I only have RTX 5070 with 12 GB of VRAM so the only realistic option was Gemma 4 with effective 4B params. When I tried to…
Pioneer: Vibetune Your LLMs (pioneer.ai via hn) +30% avg accuracy lift on classification & extraction tasks vs. base Gemma ~7 days until your first auto-improvement run lands in production 0 lines of fine-tuning code you have to write, ever $0/retrain starting price.
I tested 9 local models on the same flight sim prompt, all Q8, different Q providers, MLX (www.reddit.com) I gave 9 local models the same flight combat sim prompt. The results broke a few of my assumptions about quant providers and parameter count.
Gemma 4 is much less popular on Hugging Face than Qwen 3.x. (www.reddit.com) The difference is quite big (likes / downloads last month / finetunes): Qwen3.5-27B: 952 / 3,233,034 / 263; Qwen3.5-35B-A3B: 1,397 / 3,977,637 / 87; Qwen3.6-35B-A3B: 1,115 / 458,436 / 60; gemma-4-31B: 323 / 343,895 / 13; gemma-4-26B-A4B: 227 / 118,464 / 13
Did Google hide the best version of Gemma 4 e4b in Android? The extracted model beats Unsloth and everything else I've tried. (www.reddit.com) Why does Gemma 4 e4b from Google AI Edge Gallery on Android weigh only 3.6 gigs, while the one from Unsloth (gemma-4-E4B-it-UD-Q2_K_XL.gguf) weighs 3.7, and for some reason the model image in litertlm format extracted via adb from Google A…
Gemma 4-31B vs Qwen 3.5-27B vs Qwen 3.6-35B-A3B on a browser-agent vision prompt — MoE wins on every axis (www.reddit.com) I was building a dedicated-vision-model feature for an open-source browser agent and wanted to figure out which local model to actually recommend. Wrote a small probe that sends the same image + same system prompt + same params (temperatur…
Building the smallest Gemma 4 (35M params) from scratch — Part 1: Tokenization + Data Pipeline (www.reddit.com)
Lekh AI iOS v7.0 is Live – Bonsai 8B & Gemma 4 + Lower Memory Image Gen (www.reddit.com)
Deploying Gemma 4 26B A4B on a single RTX 5090 — ~196 tok/s with AWQ + vLLM on RunPod Serverless (www.reddit.com)
localLLamA playground (www.reddit.com)
Gpu reccommendations for Coding/chat LLM (www.reddit.com)
Keinsaas Navigator + LM Studio + Geforce RTX 5080 (www.reddit.com)
Gemma 4 coding performance, do different harnesses give wildly different results? (www.reddit.com) So the question I've seen posed many times in /r/singularity is if the Gemini models are actually that bad at coding compared to their benchmarks, or whether the harness used makes an absolutely gigantic difference in model performance. Gi…
For 36gb vram, Gemma 4 or Qwen3.5 ? (www.reddit.com) I have 3090ti and i will add 3080ti to my system soon. With 3090ti only, i found it little bit slow to run gemma 4 26b 4q.
Frontier Coding Agents Built a Video Diffusion Pipeline on Max (www.modular.com via hn) Gemma 4 just dropped on Modular, Day Zero!
Gemma 4-written, small cc0 encyclopedia of some core science content (stateofutopia.com via hn) Published: April 16, 2026 This is an encyclopedia of some core content from Biology and Health Sciences, Physical Sciences, and Technology. It contains 2,259 small entries of about a paragraph each.
Local Coding Stacks (www.reddit.com) I’m trying to reduce my reliance on Claude. I have a 5090/128GB RAM.
Feedback on iOS app with local AI models (www.reddit.com) Hey everyone, I just shipped an iOS app that runs local AI models. Current has 12 models: Gemma 4, Llama 3.3, Qwen3, DeepSeek R1 Distill, Phi-4, etc.
LiteRT LM Framework with Rockchip NPU (RKNN 3588) (www.reddit.com) Im searching for build version of LiteRT LM framework can use and utilize the NPU of the RKNN 3588. It would be great since I can run gemma 4 e2b model using this framework on the machine, because I wont have to migrate my codebase from li…
Thinking issue [Qwen3.5] (www.reddit.com) I've been testing a few models lately and I'm running into a weird issue with the bigger Qwen3.5s. Tested: Gemma 4 26B Qwen3.5 9B Qwen3.5 27B Qwen3.5 35B The 27B and 35B are driving me nuts.
Best Ollama models/settings for an 8GB VPS (CPU only, ARM)? Running into memory & looping issues. (www.reddit.com) Hi everyone, I'm trying to run a local LLM via Ollama on a Hetzner cax21 VPS (ARM64, 4 vCPUs, 8GB RAM, 80GB SSD). I have Ollama running successfully via Coolify.
Ask HN: What are you building with Gemma? What do you wish existed? (news.ycombinator.com)
What's the better way to install llama.cpp on Android? (www.reddit.com) I own an Oppo Find X3 Pro (Snapdragon 888, 12/256 GB, Android 14.0) unused because of 3 green vertical lines on the screen and poor battery. I tried Google AI Edge Gallery with Gemma-4-E2B-it and it performs well so I thinked: "why don't t…
Gemma 4 Thinking Like Claude Opus (decrypt.co via hn) If you've been following the local AI scene, you probably know Qwopus—the open-source model that tried to distill Claude Opus 4.6's reasoning into Alibaba's Qwen, so you could run something resembling Opus on your own hardware for free. It…
Can LLM make small change to the software program? (www.reddit.com) I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 4EB Q4 and Qwen 3.5 9B Q5 (KV is quantized to 4 bits with new Google TurboQuant implemented in llama.cpp - I use koboldcpp and release said it's automatically activated): the…
Gemma 4 & Obsidian (www.reddit.com) so today I tried the Obsidian LLM wiki system by Karparthy, but with Gemma 4 locally in OpenCode with instead of Claude code. My experience is very frustrating.
Gemopus: A Gemma fine-tune that prioritizes stability over long chain-of-thought (huggingface.co via hn) 🌟 Gemopus-4-26B-A4B-it [!NOTE] Gemopus is an attempt at fine-tuning Gemma 4 with a core philosophy of "stability first". While preserving the original reasoning order of Gemma 4 as much as possible, we conducted targeted refinements for an…
Gemma 4 E2B on Android: OpenCL crash on emulator, anyone solved this? (www.reddit.com) I was building an Android app and integrated Gemma 4 E2B directly using LiteRT-LM. On-device translation, zero server cost, the dream setup.
Gemma 4 base GGUF? (www.reddit.com) Hello, I've seen reviews that gemma 4 31b base is very good at roleplaying. But I can't find the gguf version of the basic gemma 4 anywhere.
Looking for a team to participate in Gemma 4 good hackathon (www.reddit.com) Hey folks, I've been tinkering with Gemma 4 and absolutely the fact this model can run locally on Android phone! I am experienced fullstackdev, open to solve any real-world problem that has an impact.
I ran Gemma 4 as a local model in Codex CLI (medium.com via hn) By Daniel Vaughan, Google Cloud - Community, Apr 2026.
Getting no result train Gemma 4 for structured data extraction (www.reddit.com) Hello, I've been trying for several days to train Gemma-4 for extracting data from a string and convert it into a structured JSON. I've tried a fair amount of different configurations, I've tried Unsloth studio and Llamafactory, but in eac…
Newer Qwen models are worse at summarization? (www.reddit.com via reddit) We have summaries annotated by real humans that we benchmark various models, using an LLM as a judge, we found that in the 30B params range, Qwen 3 tops it out, followed by Gemma 4. It feels like newer Qwens are optimized to perform agenti…
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (www.reddit.com)
Nex-N2 Q4 KS (www.reddit.com via reddit) Have a crap IGPU 64 gb AMD. This model works pretty good.
Watch agents fight: a live challenge to speed up Gemma 4 E4B inference on a single A10G (huggingface.co via reddit)
Unsloth Gemma 4 QAT MTP assistant models now available (www.reddit.com via reddit) They're both available as q8_0 models named mtp-gemma-4-*.gguf on the root of the directory and in both q8 and larger quants within an MTP folder. https://huggingface.co/unsloth/gemma-…
Gemma having updated knowledge base is so awesome (www.reddit.com via reddit) I never used to think it was all that important! But I’m using it for Svelte 5 and it ACTUALLY knows runes out of the box.
Introducing Gemma 4 12B: a unified, encoder-free multimodal model (deepmind.google) Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B…
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
Gemma 4 31B's competence surprised me (www.reddit.com via reddit) I'm just getting started using local LLMs for code. I'm not interested vibe coding, but I am hoping to increase my productivity in the publish or perish world of academia.
Unexpected Unsloth QAT Performance Compared to Unsloth IQ4_XS (www.reddit.com via reddit) Hi everyone, I am comparing the standard (non-QAT) iq4_xs and q3_k_m quants with this QAT q4_k_xl model. (All of them are Unsloth versions)(gemma-4-26B-A4B-it-GGUF via lmstudio).
Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? (www.reddit.com via reddit) I'm trying to find out if anyone has done any benchmarking comparing the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants. I know QAT is supposed to retain a ton of accuracy compared to the baseline BF16, but I'…
Gemma 4 26B A4B IT QAT Comparison (www.reddit.com via reddit) Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me.
[Follow-up] Qwen3.6-35B-A3B 8GB RTX: I tried Linux, tested Gemma 4, and now understand why Windows was faster (www.reddit.com via reddit) Original post: https://www.reddit.com/r/LocalLLaMA/comments/1txwff3/comment/oq1e0jt/?context=3 TL;DR: Migrated to WSL2 to test Linux (several people suggested it). Embedded MTP on the UD model: 25.8 tok/s.
LMStudio gemma 4 31b QAT with MTP (www.reddit.com via reddit) Did anyone manage to launch that in LMStudio? I am on the most recent update with the most recent llama.cpp available in LMStudio.
Gemma 4 MTP with assistant vs llama cpp type MTP (www.reddit.com via reddit) Hi all Been loving the QAT models but honestly what is up with the assistant models, any ggufs and ways to make em work with vanilla llamacpp and if this way of MTP is different than the one am17an developed for llamacpp. Followup question…
Why is the MLX version of the Gemma 4 QAT so big?? (www.reddit.com) The MLX version of the QAT 4bit is like 27gb but the non-QAT version is 17gb and the regular 4bit MLX version is also 17gb… anyone know why?
Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? (www.reddit.com via reddit)
I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B. The biggest model forgot a fact from earlier in the chat first. (www.reddit.com) Ran a small, focused eval on three on-device models and the result was backwards from what I expected, so sharing the method and numbers. The task: tell the model "my dog is named Pablo," then add N turns of unrelated filler (shuffled gene…
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit) These last few weeks have been godsend for 24GB (and below) gpu poor peeps. Killer models released (Gemma 4 / Qwen 3.6) Free intelligence via QAT Bonus speed via MTP We're at the tipping point where GPU poor (24gb and below) people are act…
Gemma 4 Chat Template now has preserve thinking (huggingface.co via reddit)
Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze (www.reddit.com via reddit) I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me.
Thoughts on Gemma4 12b vs 26a4b, which one is better? (www.reddit.com via reddit) Not talking about 31b. In terms of creative tasks, writing, chatting, not necessarily coding but can still be included, Does Gemma 12b outperform in any way?
QATs Q4_0 from Google have more precision than Q4_K_XL from Unsloth (at least some) (www.reddit.com via reddit) I wanted to try new QATs and opened two collections on HF (which HF found for me): https://huggingface.co/collections/google/gemma-4-qat-q4-0 https://huggingface.co/collections/unsloth/gemma-4-qat One strange thing caught my attention, for…
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines (arxiv.org)
Gemma4_31b_fp8 keeping up with Sonnet_4.6_medium in my harness. (www.reddit.com via reddit)
how to run gemma-4-12b-it-qat-w4a16-ct in vllm or any version quantized of the model (www.reddit.com via reddit) It runs when using transformers, but with vLLM some weird error comes up. Please, can anybody share the command for running it on vLLM?
Is Gemma 4 12b good for coding? (www.reddit.com via reddit) How are you using it? Quantized?
llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is? (www.reddit.com via reddit) Running into something annoying with llama-server in router mode (`--models-preset`) and I can't tell if I'm missing a flag or if this is just how it works. My rig is 2x 3090, 2x 4060 Ti (one's unplugged at the moment, riser got repurposed…
QAT variant of Gemma4 26B A4B is not working well for me (www.reddit.com via reddit) I am using llama.cpp version b9549 with this arguments as recommended: llama-server --temp 1.0 --top-p 0.95 --top-k 64 -hf ... Here is what I got on chessboard svg test https://www.reddit.com/r/LocalLLaMA/comments/1t53dhp/quality_compariso…
Any smaller model than OmniCoder v2 9b that can appropriately and accurately tool call? (www.reddit.com via reddit) Hate to ask a simple question, but I’ve looked around and I see plenty of smaller models that *can* tool call, but none of them seem to do so appropriately or agentically. Referring to this.
Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe? (www.reddit.com via reddit) I’m trying to run: unsloth/gemma-4-31B-it-qat-GGUF gemma-4-31B-it-qat-UD-Q4_K_XL.gguf on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch. Main model loads.
How to compare Original vs QAT Gemma 4 31B Q4 quants (www.reddit.com via reddit) I just came across the following post, where a user found some confusing divergence results between Q4 quants of the original and QAT models with a Q8/unquantized reference of the original model. https://www.reddit.com/r/LocalLLaMA/comment…
You don't need a GPU to run gemma-4-26B-A4B (www.reddit.com via reddit) I've been running LLMs on my old potato i5-8500 with 32GB of RAM and *no GPU* for awhile now, running up to 12B dense models which run slow but perfectly useable. But this Gemma-4-26B-A4B simply flies on this CPU - only machine using Kobol…
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit) Just kidding. Are there any distills that actually improve a model's quality?
Gemma 4 31B QAT Q4 vs standard Q4 — Top1 KLD benchmark results have me confused. Someone please explain or poke holes in this. (www.reddit.com via reddit) I'll be upfront: I vibe-benched and vibe-reported this with Claude Sonnet 4.6, but I reviewed and edited everything before posting (too lazy to take out all the AI EM dash —), so hopefully nobody considers this AI slop. And more importantl…
QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench (www.reddit.com via reddit) Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan) Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found…
Z.ai, we need Air! GLM GGUF wen? (www.reddit.com via reddit) First we never saw an upgraded Air model after 4.5. Then GLM 4.7 Turbo was great, but quickly surpassed for coding.
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP (www.reddit.com via reddit) Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised of the resul…
Gemma 4 QAT Unquantized Heretic is here (huggingface.co via reddit) Now someone needs to quantize them to 4bit, also I have intentionally kept the divergence and refusal different from original Gemma 4 heretic collection, so you can even try these as alternative to original model.
Gemma 4 QAT accuracy inconsistencies (www.reddit.com via reddit) Table from https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis I heard that MoE models are usually more susceptible to quantization error, but what happened with the 12B? I thought lower-parameter models usually quantized worse and yet…
Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed (www.reddit.com via reddit) I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.
Gemma 4 QAT Q4_0 Bench on Strix Halo (www.reddit.com via reddit) These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU. QAT means quantization-aware training.
From Cloud to Local: The Revolution in Small, Efficient AI Agents Like OpenLumara and Gemma 4 (www.reddit.com via reddit) While everyone's obsessing over giant cloud-based AI models, a quiet revolution is happening in local AI. We're seeing the emergence of extremely token-efficient, super-small system prompts, and modular agents designed specifically for loc…
AA comparison of the latest local models (www.reddit.com via reddit) I picked models I consider local (usable on 3×3090), so there are no 300B models, and you should probably skip 200B models too (but MiniMax and Step are pretty fast in Q3) Gemma-4 12B is still missing
Tip: Stop Worshiping Models and Start Building Things (www.reddit.com via reddit) This subreddit is where I learned the most about using Local LLMs. I've been on this journey for 4 months now, and I'm already using Local LLMs in very complex pipelines.
Gemma 4 Haters 2 months Ago now seems to love Gemma 4 now. (www.reddit.com via reddit) What's with the switch guys? now imagine if google gonna drop 128B model or a MoE version (I bet those Qwen lovers will forget Qwen even existed).
MLX Community forgot about Gemma 4 12B QAT (www.reddit.com) They started uploading to Gemma 4 MTP QAT but forgot to upload 12B quants to the Gemma 4 QAT 😭.
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss (www.reddit.com via reddit) I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought.
What exactly is quantization aware training? (www.reddit.com via reddit) First time hearing it. I also heard about the gemma 4 qat quants and if any one of them is good for 4gb vram and 16gb ram.
Gemma 4 12B Q4_K_XL Private Benchmark Results (www.reddit.com) Posting to share my results with others, I think the big bottom line is MTP acceptance rates offering a huge speedup, during coding tasks it's over 90% acceptance! Haven't hit my soft goal results or llm as judge benchmarks yet to compare…
At least one more Gemma 4 model confirmed?? (www.reddit.com via reddit)
PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template (www.reddit.com via reddit) This is a PSA for people like me who tried it and hit the wall with tool calls failing left and right, so much so that harnesses like OpenCode just didn't work: There is a fix for that. You need to pass a better chat template file, which i…
Google's quantization aware trained Gemma checkpoints enabling mobile device inference just dropped on HF (www.reddit.com) Release Blog Post: Gemma 4 with quantization-aware training. HuggingFace for mobile: Gemma 4 QAT Mobile - a google Collection. HuggingFace for Q4_0: Gemma 4 QAT Q4_0 - a google Collection.
Gemma 4 QAT GGUFs from Unsloth (www.reddit.com via reddit) Their collection: https://huggingface.co/collections/unsloth/gemma-4-qat And their guide, always a very interesting read: https://unsloth.ai/docs/models/gemma-4/qat
Gemma 4 2B handling structured JSON output + tool calling + reasoning traces correctly via Spring AI / LM Studio — including identifying a real Java bug in code review (www.reddit.com) Wanted to share a result I didn't expect to work. Running google/gemma-4-e2b locally through LM Studio, exposed via OpenAI-compatible endpoint, called from a Spring Boot app using Spring AI's ChatClient abstraction.
Gemma is so much better than Qwen, prove me wrong (www.reddit.com) Ever since the latest Gemma releases, there is literally zero reason to use Qwen. Better architecture, cleaner code output, and it doesn't get stuck in weird multi-turn reasoning loops.
I vibecoded an app called Think Local - a fully private AI app that runs directly on your iPhone, iPad, and Mac. (www.reddit.com) Think Local started with a simple idea: AI should work for you, not collect from you. So I built an app that lets you run modern AI models completely on-device - privately and fully offline.
Qwen 3.6. struggling with German (www.reddit.com) Hi everyone, I’m looking for advice on local AI setups. My goal is to have a local AI generate text documentation from my one-hour therapy sessions.
For the users who have add bad luck with QWEN 3.6 27B, and Gemma 4 31B. "Actually..wait..actually". Endless reasoning. Horrible output. I found a solution. rtx pro 6000. (www.reddit.com) Edit: does this happen every time a newbie tries to post here. Getting roasted despite having valid results?
Frontier models mass collapse is near (www.reddit.com) Hi all this is to inform you all that many frontline models like GPT, sonnet opus and or Gemma even are at stage of collapsing as they have frequently started drifting and running away from provided work either stretching that work too lon…
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…
On my RTX 4060 8GB laptop, I can run Gemma 4 E4B Q6 K XL with mmproj at only 6GB of VRAM usage despite sources recommending Q4 K M for my hardware. What is going on? (www.reddit.com) I can set my context length as high as 64k and the vram usage is not even remotely close to the maximum utilisation. My TPS is also 40+.
Model for reverse engineering (www.reddit.com) For a system with 4x RTX 3090: what's the best model you could use for reverse engineering C# code? Qwen3.5-122b-A10B?
Afraid of Using the Wrong LLM. ChatGPT 5.5 waterdown, Gemma Struggles (www.reddit.com) Hi, I’m having a very hard time right now. I used to use ChatGPT 4o and 5.1 Thinking for helping me write my story, and I was very happy with them.
What are the best 40-500 B MoE LLM models now? (www.reddit.com) Due to old GPU I run on CPU and came to appreciate value of MoE. I know of MoE for Qwen 3.6 and Gemma-4, which are <40B.
New Gemma 4 draft models released (alternativeto.net via reddit) Just saw this and wanted to ask the obligatory GGUF when?
Google's Gemma 4 AI models get 3x speed boost by predicting future tokens (arstechnica.com) Google launched its Gemma 4 open models this spring, promising a new level of power and performance for local AI. Google’s take on edge AI could be getting even faster already with the release of Multi-Token Prediction (MTP) drafters for G…
Getting unexpected output with Gemma 4 31b-it on vLLM (www.reddit.com) Hey everyone, I'm running into a weird issue and hoping someone here might have a fix or some troubleshooting ideas. I'm currently trying to run the new Gemma 4 31b-it model using vLLM (v0.20.0-cu130) deployed via Helm chart (https://gith…
Gemma 4 31B MTP Drafter on H100 -- Real Benchmarks + DFlash Comparison (www.reddit.com) Just tested Gemma 4 31B with the new official MTP Drafter on my H100 today and compared the approach with DFlash to help you decide which one to use. Without drafter: 13.7 tok/s.
Gemma 4 MTP Test - what speedup can you gain? (www.reddit.com) Let's test the Gemma 4 MTP implementation using the HuggingFace Transformers library and the new drafter model by Google. We'll load both models and test on a couple of prompts with and without the MTP support https://www.youtube.com/live/…
A simple "hack" to speed up prompt processing for Qwen 3.5/3.6 in LM Studio (www.reddit.com) Increase your CPU Thread Pool Size to your processor's max. In LM Studio, the max is 10.
Is 2x5070Ti a good setup? (www.reddit.com) I'm confused about what to get. I don't want to get something super expensive, but would like to have something that's "good enough" for coding etc.
Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real. (www.reddit.com) Hey guys, A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants)…
Sub7, modem reboots for the family. 30 years later I shipped a desktop AI agent with mobile remote control. Solo, 3 weeks. (www.reddit.com) 1990s. I was the kid with the dial-up sound burned into my brain.
What is the best code editor for local LLM deployment (LM Studio, llama.cpp) as of May 2026? (www.reddit.com) Hello folks. What is the best code editor for local LLM deployment (LM Studio, llama.cpp)? I wish to test my LM Studio + Qwen 3.6 27B and Gemma 4 31B with a legit local code editor.
I built AI agents that play Pokemon Showdown autonomously using free LLM APIs via tool-calling (www.reddit.com) I've built a system where models like Llama 3, Qwen, and Gemma play Pokémon Showdown battles autonomously. Instead of simple prompt-response, they analyze the full battle state every turn (type matchups, HP, weather, field conditions, reve…
thinking of gemma 4 26B vs 31B (www.reddit.com) I see a big difference in agentic coding between gemma-4-31B-it-Q5_K_M and gemma-4-26B-A4B-it-UD-Q8_K_XL. The 26B model is much faster because of A4B and generally works well, but there is a big difference in thinking.
Has anyone already made a "doomsday" or "off-grid" knowledge base? (ofc powered by an LLM) (www.reddit.com) Basically, I’m really into the idea of a fully offline setup. (Another way to say it: I’m a data hoarder.) For LLMs, I’m using uncensored models from both Western (Gemma, GPT-OSS) and Eastern ones (GLM 4.7 Flash, Qwen 35B).
Ran my own benchmark Qwen 3.6 35B vs Gemma 4 26B.... theres a clear winner here (www.reddit.com) Uhh I guess Gemma 4 is so much shittier that it hallucinated this event that happened in china in 1989? According to qwen, nothing of significance happened at Tiananmen square in 1989 - and based on all of the benchmarks of qwen, I believe…
Good LLM to generate ascii art? (www.reddit.com) I tried with Qwen but it sucked, Gemma3/4 was better but not good enough. From Gemma: https://pastebin.com/raw/Qr5iMgYj Still looks like a bloody car accident though.
Best SOTA 12B-32B creative writing model? (www.reddit.com) I love using OpenRouter, but I would also love a smaller model that fits within 16GB of VRAM and 64GB of RAM and packs a punch for its size, specifically in creative writing. Any good recommendations?
(Gemma/Qwen + Codex) - Bridging /chat/completions → /responses in llama-swap (www.reddit.com) I’ve been tinkering with a small side project (just for fun) where I’m trying to extend llama-swap with a bridge from /chat/completions to the newer /responses API so I can run the latest Gemma and Qwen models together with Codex-style too…
Three lessons from fine-tuning a 5B code assistant — bad outputs from 5% → 0% (www.reddit.com) Spent a week doing LoRA fine-tuning on Gemma 4 E2B (gemma-4-e2b-it, ~5.1B total params, ~2B active in the text decoder) for a narrow Python code-generation task. Setup: Model: Gemma 4 E2B, bf16, language_model only (vision + audio towers f…
Gemma 4 VLA Demo on Jetson Orin Nano Super (huggingface.co) You speak → Parakeet STT → Gemma 4 → [Webcam if needed] → Kokoro TTS → Speaker. Press SPACE to record, SPACE again to stop. This is a simple VLA: the model decides on its own whether to act based o…
Is it worth getting an RTX 3090 for 850 if you can get a Radeon 7900 XTX for 495? (www.reddit.com) Both amounts are in euro. The AMD is actually 599, but it's sold by a shop, so I can get a VAT return as a company, while for the Nvidia I'd have to go to the second-hand market and I can't get VAT back, so in the end it's like a 495 vs 850…
Built an Android app that exposes Gemma 4 as an OpenAI-compatible endpoint on your LAN (www.reddit.com) My old Samsung S10 was sitting in a drawer so I turned it into an always-on LLM endpoint. PocketPal is great for on-phone chat, but I wanted the phone itself to be an OpenAI-compatible endpoint for the rest of my network.
Anyone know how to run the new gemma 4 edge gallery litertlm format in the browser? Trying to load Gemma 4 e4b. (www.reddit.com) How do I run Gemma 4 e4b, extracted via adb from Google AI Edge Gallery on Android, whose image is in litertlm format and weighs 3.6 gigs, in a browser? I mean using web technologies?
Gemma 4 E4B is broken (www.reddit.com)
Are we at the point where local AI isn’t a compromise anymore? (Gemma 4 experience) (medium.com via reddit)
5070 Ti (New) vs 3090 (Used) to pair with 4070 for local LLMs? (www.reddit.com)
Why model(s) input often includes last output? (www.reddit.com)
Qwen3-30B-A3B-Instruct-2507 is better than the new Qwen 3.6 for our tasks (www.reddit.com)
How do I get the LLM to answer everything? (www.reddit.com)
Qwen3.6-35B-A3B just dropped — quick thoughts after trying it (www.reddit.com) Just gave the new Qwen3.6-35B-A3B a spin. It’s a MoE model (35B total, ~3B active), but honestly the more interesting part is how much they’re pushing agent-style coding.
NVIDIA V100 32GB for AI in 2026 (www.reddit.com) Hello. I have the opportunity to buy an Nvidia V100 with 32GB for about $915 / 775 euro.
Want your LLM to use the internet? Here's an MCP server for that. (www.reddit.com) The showcased examples were made using Gemma 4 31b. Any LLM with tool calling support should work.
Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way. (www.reddit.com) I was disappointed with Gemma 4 due to various bugs and in the end lackluster performance for the internet research/information synthesis type tasks I use local AI for. Even after every last fix and update of both mode quants and llama.cpp…
Where my Gemma 4 gets this data? Trying to explain weird behaviour. Please help! (www.reddit.com) https://preview.redd.it/w6ssjgidjlvg1.png?width=2786&format=png&auto=webp&s=f52736d40580fe8a8ff74adbbb5be81f12fbcbfc So I was playing with Gemma 4 and was trying to figure out whether the model could determine its own training data cutoff…
My Qwen 3.6 fails the car wash vibe check (www.reddit.com) I configured it to the best of my abilities, even at Q8. It fails to give the correct number of tools it supports on Claude Code and it fails the car wash test.
I shipped an iOS app running Gemma 4 E2B fully on-device — here's what I learned about MLX Swift in production (www.reddit.com) I just launched ios app that uses Gemma 4 (E2B 4-bit via mlx-community) to rewrite oral transcripts into heirloom-quality paragraphs, 100% offline. What made this interesting technically: MLX Swift + MLXLLM in production (not a demo) — fir…
Spring benchmark update: Gemma 4 / Qwen3.5 vs Gemma 3 / Qwen3 for chat (www.reddit.com) Google and Alibaba recently shipped Gemma 4 and Qwen3.5, so I wanted to see whether the new generations are actually better on my setup. My context is private local chat running on my own hardware, a Mac mini M4 Pro.
Need suggestions for local AI Machine (www.reddit.com) I’ve been running various AI harnesses like OpenClaw, ForgeCode, ClaudeCode, etc. Most of these are running via OpenRouter or Minimax (credits/subscription model).
Gemma 4 E2B or E4B on RTX 5070 Ti laptop 12GB not running on vLLM (www.reddit.com) I can't get Gemma 4 E2B or Gemma 4 E4B to run on my laptop. I am running it via Docker as per the vLLM website and I get the error: Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.3…
Gemma 4 E4B on RTX 5070 Ti laptop 12GB running slow, 5 t/s, llama.cpp (www.reddit.com) I sincerely hope someone can help me because I have tried everything I can and I get this speed using ollama.cpp and opencode. I have described my setup and how I am running it in as much detail as I can.
How much faster is Gemma 4 26B-A4B during inference vs 31B? (www.reddit.com) I want to download one and usually do inference on CPU because I have an old GPU, so I'm concerned with speed. One link on the web (I posted with it and the post was removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…
5090 for 285k on Amazon India? (amzn.in via reddit) How is it possible? The seller also has no record. Just wanted to run Gemma 4 31B Q4 with 150k ctx.
How many moves does your favorite LLM last before it cheats and then goes brain-dead in a chess game? (www.reddit.com) I tried playing chess with Gemma 4 E4B via llama-server at https://www.chess.com/play/computer (or any platform or site convenient for you), and the result was quite unexpected for me. Result: 9 moves before it made a cheating move (like trying to move a pawn to take a…
Gemma 4 on iOS: Anyone else stuck on CPU because of the "Buffer(31)" Metal crash? (www.reddit.com) Hey everyone, I’m hitting a massive performance wall building an on-device AI app for the iPhone 17 Pro.
Is Gemma 4 good or bad in the real world? (www.reddit.com) Based on real-world usage by the community, roughly which version of which model is Gemma 4 comparable to? It would be great if you could also mention the hardware requirements for running it (like VRAM or GPU needs).
Offload settings for unsloth/Gemma-4 on Apple Silicon? (www.reddit.com) Can default settings be optimized, or is it the best it is going to get? M1 Max Is it best in llama.cpp, LM Studio, or ?
running models bigger than physical memory capacity (www.reddit.com) has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM https://unsloth.ai/docs/models/qwen3.5 even google gemma 4 are released with about 30+ bill…
Is Gemma 4 26B MoE or 31B good as an MCP agent for coding with Xcode? (www.reddit.com) Thanks
What are your opinions on the SuperGemma finetune? (www.reddit.com) So, I'm relatively new to the scene and I kind of want to do a sanity check. I've been using gemma-4-26B.
Local Agent Hermes setup with Gemma 4 and llama.cpp (www.youtube.com via reddit)
Why don't Groq (with a q) and Cerebras add new models (www.reddit.com) Both Groq and Cerebras haven't really updated their provided model for a while, long enough to notice the difference between old and new models on the market. So why don't they add any new models?
Hardware needed for Gemma 26B MoE vs Qwen 14B for ~100–300 users (vLLM, single node?) (www.reddit.com) I'm trying to figure out what sort of hardware setup I will need to accommodate a userbase of 100 users (not necessarily concurrent). Does anyone have any idea what sort of setup I'd be looking at?
What is the best way to deploy LLM on 3x3090? (www.reddit.com) Two questions: which model? In my mind, Qwen3.5 27b or Gemma 4 31b are top options.
My guess as to what Apple Foundation Models will be like in iOS 27 (www.reddit.com) Could you imagine if the new Apple Foundation Models was based on Gemma 4 E4B text like the LiteRT version is? That would be one amazing built in model.
Best setup for multiple high-end dissimilar PCs (www.reddit.com) I did some searching and didn't find a extremely similar situation. I'm jumping head first into hosting locally, and my experience has been good so far.
Opinion on best suit for my hardware (www.reddit.com) Hello everyone, a newbie here. Amazed by OpenClaw and worried by its high API consumption, I decided to buy two Asus Ascent GX10s (like the Nvidia Spark), so I have a pretty powerful inference cluster with 220GB of real available memory.
Is 100% offline processing the real "game changer" of this year? (www.reddit.com) With the launch of models optimized to run locally (like what we are seeing with the evolution of Gemma 4), it seems the AI pendulum is swinging away from the cloud.
Speed on M5 Pro 48GB (www.reddit.com) Hey guys! How do you reckon a 30-50B model would run on a 48GB M5 Pro?
Local AI coding assistant that runs fully offline (Gemma 4, codebase-aware) (www.reddit.com) I’ve been experimenting with running a local coding assistant on Gemma 4 26B, focused on understanding full codebases instead of single-file prompts. Main idea: - build a project map (files, symbols, structure) - run a planning step to dec…
Are the LiteRT versions of Gemma 4 a different architecture? (www.reddit.com) I was surprised at how much smaller the LiteRT versions of Gemma 4 E2B used in Edge Gallery were (2.0-3.3 GB) compared to the main release (10.2 GB), so I had Claude code take a look. Claude tells me that the vocab size for the LiteRT vers…
Opencode + LM Studio: first prompt very slow (www.reddit.com) I'm currently running some tests with LM Studio and Opencode with the new Gemma 4 26b model. The results are really impressive, especially on small refactoring and integration tasks.
Best local LLM that will work fine as a backend for an NSFW discord bot? + having an issue with OpenClaw (www.reddit.com) My specs: RTX 5060ti(16gb), 16gb DDR5 ram. (os : Fedora 43) I want an uncensored model, it would be preferable if it can do image gen but if the quality of text is high enough it should not be problem if it does not support it.
Gemma 4: Byte for byte, the most capable open models (deepmind.google) Today, we are introducing Gemma 4 — our most intelligent open models to date. Purpose-built for advanced reasoning and agentic workflows, Gemma 4 delivers an unprecedented level of intel…
Welcome Gemma 4: Frontier multimodal intelligence on device (huggingface.co) These models are the real deal: truly open with Apache 2 licenses, high quality with pareto frontier arena scores, multimodal including audio, and sizes you can use everywhere inc…
Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior (deepmind.google)
T5Gemma: A new collection of encoder-decoder Gemma models (deepmind.google)
Introducing Gemma 3 270M: The compact model for hyper-efficient AI (deepmind.google)
How a Gemma model helped discover a new potential cancer therapy pathway (deepmind.google)
Gemma 3n fully available in the open-source ecosystem! (huggingface.co)
Announcing Gemma 3n preview: Powerful, efficient, mobile-first AI (deepmind.google)
Introducing Gemma 3 (deepmind.google)
Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM (huggingface.co)
Google releases Gemma 2 2B, ShieldGemma and Gemma Scope (huggingface.co)
Fine-Tuning Gemma Models in Hugging Face (huggingface.co)