model roundup

Gemma 4

66 items · started 2026-06-01 · closed 2026-06-16

Gemma 4 for Telephony: From Two AI Models to One – Until I Switched to Chinese (medium.com via hn)

+3 11d gemma

Building a phone agent on a multimodal LLM: dropping faster-whisper and letting Gemma 4 hear the caller directly, with a response-time and reply-accuracy benchmark across English, French, and Mandarin.
Show HN: Ciris – an open-source AI agent in 29 languages on iOS and Android (ciris.ai via hn)

+2 2w gemma

On your phone: a small open model, like Gemma 4, runs on the device, completely offline.
I have finally tested it: large models can be run on low RAM / no VRAM (www.reddit.com via reddit)

2w moe gemma

I was not sure myself, seeing a lot of statements here and around like "you need XXX VRAM / Unified Memory to run this model". So today I finally tested it.
Any chances for a 12B diffusion Gemma? (www.reddit.com via reddit)

2w gemma llama

Currently recompiling my llama.cpp with support for diffusion Gemma, but I know on my hardware it won't likely be all that viable. I feel like if the goal was to take better advantage of consumer GPUs for fast, intelligent generation, build…
nvidia/diffusiongemma-26B-A4B-it-NVFP4 · Hugging Face (huggingface.co via reddit)

2w function-calling deepmind moe+1

Model Overview Description: DiffusionGemma 26B A4B IT is an open-weights multimodal generative model developed by Google DeepMind that processes text, image, and video inputs to produce text output via discrete diffusion. Built on the Gemm…
Monitor your screen using local LLMs with only one sentence! Free, Open Source and Local. (youtu.be via reddit)

2w gemma llama mcp

TLDR: I just added an MCP to the Observer framework making it 10x easier to use, so you can create micro-agents that monitor your screen autonomously, literally one sentence and you're done! So just typing "Monitor my Steam download and se…
LLMs and tabletop games (www.reddit.com via reddit)

2w

Hey everyone, Recently I bought S.T.A.L.K.E.R. The Board Game.
Are these quants of QAT better than non-QAT? What do I use? (www.reddit.com via reddit)

2w gemma

https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-i1-GGUF/tree/main https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-GGUF/tree/main I waited a bit before asking this. I have 3060 12GB and 32GB d…
Gemma-4-31B at 256K context on a $1,400 AMD GPU – measured, with patches (github.com via hn)

+2 2w gemma

Running Gemma-4-31B-it with a TurboQuant KV cache and HIP graphs together on AMD RDNA4 (gfx1201), a combination that crashes out of the box and, to our knowl…
I installed: HONCHO local hosted no docker (TUTORIAL) (www.reddit.com via reddit)

2w gemini

...So you don't have to. For those curious about honcho but overwhelmed by the lack of clarity unless you want to use docker...
Anyone gotten Gemma 4 12B (unified audio) to actually attend to speech with a large system prompt? (www.reddit.com via reddit)

2w vllm gemma

I'm trying to use Gemma 4 12B — the new encoder-free unified model (audio/vision/text in one) — for a one-pass audio → response voice assistant: feed the recorded WAV + system prompt and get the reply back as text directly, collapsing the…
I wired up Agentic Coding with Code Context Graphs, results are interesting (www.reddit.com via reddit)

2w gemma gemini mcp+1

I have been curious about how an infrastructure that lets agents explore codebases as relations, rather than as text, would change the performance of AI agents. So, for the last few weeks, I have been buildi…
Newer Qwen models are worse at summarization? (www.reddit.com via reddit)

2w gemma qwen agentic

We have summaries annotated by real humans that we use to benchmark various models. Using an LLM as a judge, we found that in the 30B-parameter range Qwen 3 comes out on top, followed by Gemma 4. It feels like newer Qwens are optimized to perform agenti…
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (www.reddit.com via reddit)

2w gemma
Watch agents fight: a live challenge to speed up Gemma 4 E4B inference on a single A10G (huggingface.co via reddit)

2w gemma
Unsloth Gemma 4 QAT MTP assistant models now available (www.reddit.com via reddit)

2w gemma

They're both available as q8_0 models named mtp-gemma-4-*.gguf on the root of the directory and in both q8 and larger quants within an MTP folder. https://huggingface.co/unsloth/gemma-…
Introducing Gemma 4 12B: a unified, encoder-free multimodal model (deepmind.google)

2w gemma agentic

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B…
Bonsai: Human->LLM->Web with LLM interface using Gemma4 12B locally on Windows (drive.google.com via hn)

+21 2w

Unexpected Unsloth QAT Performance Compared to Unsloth IQ4_XS (www.reddit.com via reddit)

2w gemma

Hi everyone, I am comparing the standard (non-QAT) iq4_xs and q3_k_m quants with this QAT q4_k_xl model. (All of them are Unsloth versions of gemma-4-26B-A4B-it-GGUF, via lmstudio.)
Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? (www.reddit.com via reddit)

2w gemma

I'm trying to find out if anyone has done any benchmarking comparing the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants. I know QAT is supposed to retain a ton of accuracy compared to the baseline BF16, but I'…
LMStudio gemma 4 31b QAT with MTP (www.reddit.com via reddit)

2w gemma llama

Did anyone manage to launch that in LMStudio? I am on the most recent update with the most recent llama.cpp available in LMStudio.
Gemma 4 MTP with assistant vs llama cpp type MTP (www.reddit.com via reddit)

2w gemma llama

Hi all. Been loving the QAT models, but honestly, what is up with the assistant models? Are there any GGUFs and ways to make them work with vanilla llamacpp, and is this way of MTP different from the one am17an developed for llamacpp? Followup question…
Why is the MLX version of the Gemma 4 QAT so big?? (www.reddit.com via reddit)

2w gemma

The MLX version of the QAT 4-bit is about 27 GB, but the non-QAT version is 17 GB and the regular 4-bit MLX version is also 17 GB… anyone know why?
Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? (www.reddit.com via reddit)

2w gemma llama
I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B. The biggest model forgot a fact from earlier in the chat first. (www.reddit.com via reddit)

2w moe gemma

Ran a small, focused eval on three on-device models and the result was backwards from what I expected, so sharing the method and numbers. The task: tell the model "my dog is named Pablo," then add N turns of unrelated filler (shuffled gene…
Gemma 4 Chat Template now has preserve thinking (huggingface.co via reddit)

2w gemma
The GPUless Revolution: How Efficient AI Models Are Democratizing Artificial Intelligence (www.reddit.com via reddit)

2w llama

You don't need a $10,000 GPU to run state-of-the-art AI anymore. The latest breakthroughs in model quantization and optimization are putting powerful AI in the hands of everyone—from hobbyists to small businesses.
Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze (www.reddit.com via reddit)

2w gemma qwen

I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me.
Thoughts on Gemma4 12b vs 26a4b, which one is better? (www.reddit.com via reddit)

2w gemma

Not talking about 31b. In terms of creative tasks, writing, and chatting (not necessarily coding, though it can be included), does Gemma 12b outperform in any way?
QATs Q4_0 from Google have more precision than Q4_K_XL from Unsloth (at least some) (www.reddit.com via reddit)

2w gemma

I wanted to try new QATs and opened two collections on HF (which HF found for me): https://huggingface.co/collections/google/gemma-4-qat-q4-0 https://huggingface.co/collections/unsloth/gemma-4-qat One strange thing caught my attention, for…
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines (arxiv.org)

2w fine-tuning gemma
how to run gemma-4-12b-it-qat-w4a16-ct in vllm or any version quantized of the model (www.reddit.com via reddit)

2w vllm gemma

When I run it using transformers it works, but with vLLM some weird error comes up. Can anybody please share the command for running it on vLLM?
Is Gemma 4 12b good for coding? (www.reddit.com via reddit)

2w gemma

How are you using it? Quantized?
MTP and QTA - what is the relation? (www.reddit.com via reddit)

2w llama

I'm an old guy and I hate when things change so fast surrounded by noise and breaking news! MTP, I know what the acronym means and where it excels.
Gemma 4 E4B as a primary local LLM (replaced Qwen) (digg.com via hn)

+2 2w gemma qwen

Gemma 4 E4B 6bit is now the local model of my choice and loaded 24/7 on my Mac (using @lmstudio), replacing Qwen3, 3.5 4B after ~9 months of usage. What an insane model, congrats @GoogleDeepMind 🤠 The new setup replaces his nine-month daily…
QAT variant of Gemma4 26B A4B is not working well for me (www.reddit.com via reddit)

2w gemma llama

I am using llama.cpp version b9549 with these arguments as recommended: llama-server --temp 1.0 --top-p 0.95 --top-k 64 -hf ... Here is what I got on chessboard svg test https://www.reddit.com/r/LocalLLaMA/comments/1t53dhp/quality_compariso…
Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe? (www.reddit.com via reddit)

2w gemma llama

I’m trying to run: unsloth/gemma-4-31B-it-qat-GGUF gemma-4-31B-it-qat-UD-Q4_K_XL.gguf on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch. Main model loads.
How to compare Original vs QAT Gemma 4 31B Q4 quants (www.reddit.com via reddit)

2w gemma

I just came across the following post, where a user found some confusing divergence results between Q4 quants of the original and QAT models with a Q8/unquantized reference of the original model. https://www.reddit.com/r/LocalLLaMA/comment…
You don't need a GPU to run gemma-4-26B-A4B (www.reddit.com via reddit)

2w gemma

I've been running LLMs on my old potato i5-8500 with 32GB of RAM and *no GPU* for a while now, running up to 12B dense models, which run slowly but are perfectly usable. But this Gemma-4-26B-A4B simply flies on this CPU - only machine using Kobol…
Gemma4 12B - Experiences? (www.reddit.com via reddit)

2w tool-use

Anyone check out the new Gemma4 12B that dropped 3 days ago? Integrated vision and audio recognition, no mmproj needed, plus tool use.
QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench (www.reddit.com via reddit)

2w gemma llama

Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan). Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found…
Best local model for Xcode with 64GB MBP using LMStudio as the MCP server (www.reddit.com via reddit)

2w mcp

Gemma4?
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP (www.reddit.com via reddit)

2w gemma llama

Google just released the QAT (Quantization-Aware Training) variant of their Gemma 4 models, including 12B, so it was only natural for me to benchmark it on my 12GB GPU since it fits entirely in VRAM. I was pleasantly surprised by the resul…
Gemma 4 QAT Unquantized Heretic is here (huggingface.co via reddit)

2w gemma

Now someone needs to quantize them to 4bit, also I have intentionally kept the divergence and refusal different from original Gemma 4 heretic collection, so you can even try these as alternative to original model.
Gemma 4 QAT accuracy inconsistencies (www.reddit.com via reddit)

2w moe gemma

Table from https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis I heard that MoE models are usually more susceptible to quantization error, but what happened with the 12B? I thought lower-parameter models usually quantized worse and yet…
Gemma 4 QAT Q4_0 Bench on Strix Halo (www.reddit.com via reddit)

2w gemma llama

These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU. QAT means quantization-aware training.
Activating MTP for QATGemma4 31b q4_0? (www.reddit.com via reddit)

2w vllm

Has anyone figured out how to activate MTP for Gemma4’s new QAT q4_0 GGUF for 31b? Or is this still not supported in llamacpp?
From Cloud to Local: The Revolution in Small, Efficient AI Agents Like OpenLumara and Gemma 4 (www.reddit.com via reddit)

2w gemma

While everyone's obsessing over giant cloud-based AI models, a quiet revolution is happening in local AI. We're seeing the emergence of extremely token-efficient, super-small system prompts, and modular agents designed specifically for loc…
AA comparison of the latest local models (www.reddit.com via reddit)

2w minimax gemma

I picked models I consider local (usable on 3×3090), so there are no 300B models, and you should probably skip 200B models too (but MiniMax and Step are pretty fast in Q3). Gemma-4 12B is still missing.
A quick Gemma4 31B comparison (Q4_k_M, QAT, heretic) (www.reddit.com via reddit)

2w

No numbers. Not sure if anybody cares… I’ve run the UD version of Q4_k_m for a month.
Gemma 4 Haters 2 months Ago now seems to love Gemma 4 now. (www.reddit.com via reddit)

2w moe gemma qwen

What's with the switch, guys? Now imagine if Google drops a 128B model or a MoE version (I bet those Qwen lovers will forget Qwen even existed).
MLX Community forgot about Gemma 4 12B QAT (www.reddit.com via reddit)

2w gemma

They started uploading to Gemma 4 MTP QAT but forgot to upload 12B quants to the Gemma 4 QAT 😭.
What exactly is quantization aware training? (www.reddit.com via reddit)

2w moe gemma

First time hearing of it. I also heard about the Gemma 4 QAT quants and wonder whether any of them is good for 4 GB VRAM and 16 GB RAM.
Gemma 4 12B Q4_K_XL Private Benchmark Results (www.reddit.com via reddit)

2w gemma llama

Posting to share my results with others. I think the big bottom line is that MTP acceptance rates offer a huge speedup: during coding tasks it's over 90% acceptance! Haven't hit my soft goal results or llm as judge benchmarks yet to compare…
At least one more Gemma 4 model confirmed?? (www.reddit.com via reddit)

2w gemma

PSA: Gemma 4 12B is NOT completely broken for coding and tool calling, you need a special chat template (www.reddit.com via reddit)

2w gemma qwen llama

This is a PSA for people like me who tried it and hit the wall with tool calls failing left and right, so much so that harnesses like OpenCode just didn't work: There is a fix for that. You need to pass a better chat template file, which i…
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB (ziraph.com via hn)

+21 2w gemma llama

A matched-quant MLX-vs-raw-llama.cpp benchmark for Gemma 4 12B on one M1 16GB - decode is a tie, both pinned at the bandwidth wall. The cost that differs is startup and CPU…
Google's quantization aware trained Gemma checkpoints enabling mobile device inference just dropped on HF (www.reddit.com via reddit)

2w gemma

Release blog post: Gemma 4 with quantization-aware training; HuggingFace for mobile: Gemma 4 QAT Mobile (a Google collection); HuggingFace for Q4_0: Gemma 4 QAT Q4_0 (a Google collection)
Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency (blog.google via hn)

+2 2w gemma

Since releasing Gemma 4 two months ago, we've been continuously working to expand its capabilities. First, we introduced Multi-Token Prediction (MTP) to acce…
is there possible way to shrink 2GB or 4GB from a 27B llm to produce a bit lower size Q8 GGUF ? (www.reddit.com via reddit)

2w

Is there a way to shave a few gigabytes off the final GGUF, so that instead of the usual Q8 size we get a Q8 that is about 2 GB smaller? Say a 27B Q8 model is around 30 GB: is there a way to reduce this by removing layers?
Gemma 4 QAT GGUFs from Unsloth (www.reddit.com via reddit)

2w gemma

Their collection: https://huggingface.co/collections/unsloth/gemma-4-qat And their guide, always a very interesting read: https://unsloth.ai/docs/models/gemma-4/qat
Gemma 4 12B: The Developer Guide (developers.googleblog.com via hn)

+2 3w gemma

Following the announcement in our launch blog, we are releasing Gemma 4 12B, a dense multimodal model with a unified, encoder-free architecture. Gemma 4 12B introduces several milestones for local AI: Traditional multimodal models rely on…
Google's new Gemma 4 12B model is designed to run on any laptop with 16GB of RAM (arstechnica.com via hn)

+4 3w gemma

The generative AI boom has driven the cost of memory into the stratosphere, and Google is a key part of that trend. So it’s only fitting that Google should offer some less RAM-hungry local AI models.
Gemma 4 12B appears in Hugging Face (huggingface.co via hn)

+2 3w gemma llama

gemma-4-12B-it-GGUF: the recommended way to run this model is llama-server -hf ggml-org/gemma-4-12B-it-GGUF, then access http://localhost:8080
Gemma 4 26B on a consumer GPU: build pain, throughput, and BFCL numbers (algollabs.com via hn)

+2 3w gemma

2026-05-05. A week running Google's Gemma 4 26B as my daily local agent on a single RTX 5070 Ti. No API calls, no cloud, no rate limits.
Show HN: I made a Gemma 4 Mac app that names screenshots with local AI (snapname.app via hn)

+42 3w gemma

I made my first macOS utility app that ships with a bundled Gemma 4 model, specifically the Gemma E4B one. It made my app's DMG 5.3 GB in size, but I think that is a small size for the power this free local model can provide.
