Newer Qwen models are worse at summarization? (www.reddit.com via reddit)
model roundup
Gemma 4
-
We have summaries annotated by real humans that we benchmark various models, using an LLM as a judge, we found that in the 30B params range, Qwen 3 tops it out, followed by Gemma 4. It feels like newer Qwens are optimized to perform agenti…
-
-
-
Unsloth Gemma 4 QAT MTP assistant models now available (www.reddit.com via reddit)
Unsloth Gemma 4 QAT MTP assistant models now available They're both available as q8_0 models named mtp-gemma-4-*.gguf on the root of the directory and in both q8 and larger quants within an MTP folder. https://huggingface.co/unsloth/gemma-…
-
Introducing Gemma 4 12B: a unified, encoder-free multimodal model (deepmind.google)
Introducing Gemma 4 12B: a unified, encoder-free multimodal model Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B…
-
Bonsai: Human->LLM->Web with LLM interface using Gemma4 12B locally on Windows (drive.google.com via hn)
JavaScript must be enabled to use Google Drive Learn more Skip to main content Keyboard shortcuts Accessibility feedback This browser version is no longer supported. Please upgrade to a supported browser.
-
I took the liberty to test both models today on my favorite benchmark question, head to head. Device: Apple Mac M3 Max 64GB Environment: llama.cpp, all defaults Gemma4-12B's token generation speed: 47 tps with MTP and 2 predicted tokens 29…
-
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit)
I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
-
Gemma 4 31B's competence surprised me (www.reddit.com via reddit)
I'm just getting started using local LLMs for code. I'm not interested vibe coding, but I am hoping to increase my productivity in the publish or perish world of academia.
-
Unexpected Unsloth QAT Performance Compared to Unsloth IQ4_XS (www.reddit.com via reddit)
Hi everyone, I am comparing the standard (non-QAT) iq4_xs and q3_k_m quants with this QAT q4_k_xl model. (All of them are Unsloth versions)(gemma-4-26B-A4B-it-GGUF via lmstudio).
-
Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? (www.reddit.com via reddit)
I'm trying to find out if anyone has done any benchmarking comparing the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants. I know QAT is supposed to retain a ton of accuracy compared to the baseline BF16, but I'…
-
Gemma 4 26B A4B IT QAT Comparison (www.reddit.com via reddit)
Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me.
-
Original post: https://www.reddit.com/r/LocalLLaMA/comments/1txwff3/comment/oq1e0jt/?context=3 TL;DR: Migrated to WSL2 to test Linux (several people suggested it). Embedded MTP on the UD model: 25.8 tok/s.
-
LMStudio gemma 4 31b QAT with MTP (www.reddit.com via reddit)
Did anyone manage to launch that in LMStudio? I am on the most recent update with the most recent llama.cpp available in LMStudio.
-
Gemma 4 MTP with assistant vs llama cpp type MTP (www.reddit.com via reddit)
Hi all Been loving the QAT models but honestly what is up with the assistant models, any ggufs and ways to make em work with vanilla llamacpp and if this way of MTP is different than the one am17an developed for llamacpp. Followup question…
-
Why is the MLX version of the Gemma 4 QAT so big?? (www.reddit.comhttps)
the MLX version of the QAT 4bit is like 27gb but the none QAT version is 17gb and the regular 4bit MLX version is also 17gb… anyone know why?
-
Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? (www.reddit.com via reddit)
-
Ran a small, focused eval on three on-device models and the result was backwards from what I expected, so sharing the method and numbers. The task: tell the model "my dog is named Pablo," then add N turns of unrelated filler (shuffled gene…
-
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit)
These last few weeks have been godsend for 24GB (and below) gpu poor peeps. Killer models released (Gemma 4 / Qwen 3.6) Free intelligence via QAT Bonus speed via MTP We're at the tipping point where GPU poor (24gb and below) people are act…
-
Gemma 4 Chat Template now has preserve thinking (huggingface.co via reddit)
-
-
You don't need a $10,000 GPU to run state-of-the-art AI anymore. The latest breakthroughs in model quantization and optimization are putting powerful AI in the hands of everyone—from hobbyists to small businesses.
-
I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me.
-
Thoughts on Gemma4 12b vs 26a4b, which one is better? (www.reddit.com via reddit)
Not talking about 31b. In terms of creative tasks, writing, chatting, not necessarily coding but can still be included, Does Gemma 12b outperform in any way?
-
I wanted to try new QATs and opened two collections on HF (which HF found for me): https://huggingface.co/collections/google/gemma-4-qat-q4-0 https://huggingface.co/collections/unsloth/gemma-4-qat One strange thing caught my attention, for…
-
-
when running by using transformers it runs by using vllm some weird error come up plese can any body share the command of running it on vllm ?
-
Is Gemma 4 12b good for coding? (www.reddit.com via reddit)
How are you using it? Quantized?
-
MTP and QTA - what is the relation? (www.reddit.com via reddit)
I'm an old guy and I hate when things change so fast surrounded by noise and breaking news! MTP, I know what the acronym means and where it excels.
-
Gemma 4 E4B as a primary local LLM (replaced Qwen) (digg.com via hn)
Gemma 4 E4B 6bit is now the local model of my choice and loaded 24/7 on my Mac (using @lmstudio), replacing Qwen3, 3.5 4B after ~9 months of usage What an insane model, congrats @GoogleDeepMind 🤠 The new setup replaces his nine-month daily…
-
QAT variant of Gemma4 26B A4B is not working well for me (www.reddit.com via reddit)
I am using llama.cpp version b9549 with this arguments as recommended: llama-server --temp 1.0 --top-p 0.95 --top-k 64 -hf ... Here is what I got on chessboard svg test https://www.reddit.com/r/LocalLLaMA/comments/1t53dhp/quality_compariso…
-
Need some guidance toying with local models (www.reddit.com via reddit)
Hi, so I have a pretty low-end laptop regarding running LLMs locally (NVIDIA GeForce RTX 3050 with 4GB VRAM, AMD Ryzen 7 5800H and 16GB DDR4) and while I'm not looking for anything to realistically work with, I'd be interested in how could…
-
I’m trying to run: unsloth/gemma-4-31B-it-qat-GGUF gemma-4-31B-it-qat-UD-Q4_K_XL.gguf on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch. Main model loads.
-
How to compare Original vs QAT Gemma 4 31B Q4 quants (www.reddit.com via reddit)
I just came across the following post, where a user found some confusing divergence results between Q4 quants of the original and QAT models with a Q8/unquantized reference of the original model. https://www.reddit.com/r/LocalLLaMA/comment…
-
You don't need a GPU to run gemma-4-26B-A4B (www.reddit.com via reddit)
I've been running LLMs on my old potato i5-8500 with 32GB of RAM and *no GPU* for awhile now, running up to 12B dense models which run slow but perfectly useable. But this Gemma-4-26B-A4B simply flies on this CPU - only machine using Kobol…
-
Dense vs MoE quantization resiliance (www.reddit.com via reddit)
Which one is more resiliant to quantization? Especially at 4-bit?
-
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit)
Just kidding. Are there any distills that actually improve a model's quality?
-
Gemma4 12B - Experiences? (www.reddit.com via reddit)
Anyone check out the new Gemma4 12B that dropped 3 days ago? Integrated vision and audio recognition, no mmpro needed plus tool use.
-
I'll be upfront: I vibe-benched and vibe-reported this with Claude Sonnet 4.6, but I reviewed and edited everything before posting (too lazy to take out all the AI EM dash —), so hopefully nobody considers this AI slop. And more importantl…
-
QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench (www.reddit.com via reddit)
Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan) Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found…
-
Best local model for Xcode with 64GB MBP using LMStudio as the MCP server (www.reddit.com via reddit)
Gemma4?
-
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP (www.reddit.com via reddit)
Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised of the resul…
-
Gemma 4 QAT Unquantized Heretic is here (huggingface.co via reddit)
Now someone needs to quantize them to 4bit, also I have intentionally kept the divergence and refusal different from original Gemma 4 heretic collection, so you can even try these as alternative to original model.
-
Gemma 4 QAT accuracy inconsistencies (www.reddit.com via reddit)
Table from https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis I heard that MoE models are usually more susceptible to quantization error, but what happened with the 12B? I thought lower-parameter models usually quantized worse and yet…
-
Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed (www.reddit.com via reddit)
I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful.
-
Gemma 4 QAT Q4_0 Bench on Strix Halo (www.reddit.com via reddit)
Gemma 4 QAT Q4_0 Bench on Strix Halo These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU. QAT means quantization-aware training.
-
Activating MTP for QATGemma4 31b q4_0? (www.reddit.com via reddit)
Has anyone figured out how to activate MTP for Gemma4’s new QAT q4_0 GGUF for 31b? Or is this still not supported in llamacpp?
-
While everyone's obsessing over giant cloud-based AI models, a quiet revolution is happening in local AI. We're seeing the emergence of extremely token-efficient, super-small system prompts, and modular agents designed specifically for loc…
-
AA comparison of the latest local models (www.reddit.com via reddit)
I picked models I consider local (usable on 3×3090), so there are no 300B models, and you should probably skip 200B models too (but MiniMax and Step are pretty fast in Q3) Gemma-4 12B is still missing
-
A quick Gemma4 31B comparison (Q4_k_M, QAT, heretic) (www.reddit.com via reddit)
No numbers. Not sure if anybody cares… I’ve run the UD version of Q4_k_m for a month.
-
Gemma 4 Haters 2 months Ago now seems to love Gemma 4 now. (www.reddit.com via reddit)
What's with the switch guys? now imagine if google gonna drop 128B model or a MoE version (I bet those Qwen lovers will forget Qwen even existed).
-
MLX Community forgot about Gemma 4 12B QAT (www.reddit.comhttps)
They started uploading to Gemma 4 MTP QAT but forgot to upload 12B quants to the Gemma 4 QAT 😭.
-
Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss (www.reddit.com via reddit)
I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought.
-
What exactly is quantization aware training? (www.reddit.com via reddit)
First time hearing it. I also heard about the gemma 4 qat quants and if any one of them is good for 4gb vram and 16gb ram.
-
Gemma 4 12B Q4_K_XL Private Benchmark Results (www.reddit.comhttps)
Posting to share my results with others, I think the big bottom line is MTP acceptance rates offering a huge speedup, during coding tasks it's over 90% acceptance! Haven't hit my soft goal results or llm as judge benchmarks yet to compare…
-
At least one more Gemma 4 model confirmed?? (www.reddit.com via reddit)
could not extract summary
-
qwen3.6 35B has much worse vision capability than gemma4? (www.reddit.com via reddit)
How different are the image recognition capabilities between gemma4 and qwen3.6? I give the model the task to extract calendar events from a photo of an calendar that is croped to the calendar.
-
This is a PSA for people like me who tried it and hit the wall with tool calls failing left and right, so much so that harnesses like OpenCode just didn't work: There is a fix for that. You need to pass a better chat template file, which i…
-
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB (ziraph.com via hn)
Apples® to Apples®: MLX vs llama.cpp for Gemma 4 12B on an M1 16GB A matched-quant MLX-vs-raw-llama.cpp benchmark for Gemma 4 12B on one M1 16GB - decode is a tie, both pinned at the bandwidth wall. The cost that differs is startup and CPU…
-
Release Blog Post: Gemma 4 with quantization-aware training HuggingFace for mobile: Gemma 4 QAT Mobile - a google Collection HuggingFace for Q4_0: Gemma 4 QAT Q4_0 - a google Collection
-
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to acce…
-
is there a way to take few gigabytes from the final GGUF, instead of usual Q8 size we can get that Q8 but lower 2gb in size ? say 27B Q8 model is like 30Gb , is there way to reduce this by removing layers!?
-
Gemma 4 QAT GGUFs from Unsloth (www.reddit.com via reddit)
Their collection: https://huggingface.co/collections/unsloth/gemma-4-qat And their guide, always a very interesting read: https://unsloth.ai/docs/models/gemma-4/qat
-
Gemma 4 12B: The Developer Guide (developers.googleblog.com via hn)
Following the announcement in our launch blog, we are releasing Gemma 4 12B, a dense multimodal model with a unified, encoder-free architecture. Gemma 4 12B introduces several milestones for local AI: Traditional multimodal models rely on…
-
Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM (arstechnica.com via hn)
The generative AI boom has driven the cost of memory into the stratosphere, and Google is a key part of that trend. So it’s only fitting that Google should offer some less RAM-hungry local AI models.
-
Gemma 4 12B appears in Hugging Face (huggingface.co via hn)
gemma-4-12B-it-GGUF Recommended way to run this model: llama-server -hf ggml-org/gemma-4-12B-it-GGUF Then, access http://localhost:8080
-
Gemma 4 26B on a consumer GPU: build pain, throughput, and BFCL numbers (algollabs.com via hn)
2026-05-05 Gemma 4 26B on consumer-grade 5070Ti GPU A week running Google's Gemma 4 26B as my daily local agent on a single RTX 5070 Ti. No API calls, no cloud, no rate limits.
-
Show HN: I made a Gemma 4 Mac app that names screenshots with local AI (snapname.app via hn)
I made my first macOS utility app that ships with a bundled Gemma 4 model, specifically the Gemma E4B one. It made my app DMG have 5.3 GB in size, but I think it is a small size for the power that this free local model can provide.