We have summaries annotated by real humans that we benchmark various models, using an LLM as a judge, we found that in the 30B params range, Qwen 3 tops it out, followed by Gemma 4. It feels like newer Qwens are optimized to perform agenti…
model
gemma-4-26B-A4B-it
huggingface.co/google/gemma-4-26B-A4B-it ↗
11696495 downloads1064 likesimage-text-to-texttransformers
from the model card
Hugging Face | GitHub | Launch Blog | Documentation License: Apache 2.0 | Authors: Google DeepMind Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI. Gemma 4 introduces key capability and architectural advancements: Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes. Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models). Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment. Optimized for On-Device – Smaller models are …
discussions
- Gemma 4 79 ongoing since 2026-06-01
recent items
Newer Qwen models are worse at summarization? (www.reddit.com via reddit) OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization (www.reddit.comhttps) Watch agents fight: a live challenge to speed up Gemma 4 E4B inference on a single A10G (huggingface.co via reddit) Bonsai: Human->LLM->Web with LLM interface using Gemma4 12B locally on Windows (drive.google.com via hn) JavaScript must be enabled to use Google Drive Learn more Skip to main content Keyboard shortcuts Accessibility feedback This browser version is no longer supported. Please upgrade to a supported browser.
Unsloth Gemma 4 QAT MTP assistant models now available (www.reddit.com via reddit) Unsloth Gemma 4 QAT MTP assistant models now available They're both available as q8_0 models named mtp-gemma-4-*.gguf on the root of the directory and in both q8 and larger quants within an MTP folder. https://huggingface.co/unsloth/gemma-…
Introducing Gemma 4 12B: a unified, encoder-free multimodal model (deepmind.google) Introducing Gemma 4 12B: a unified, encoder-free multimodal model Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B…
[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B (www.reddit.com via reddit) I took the liberty to test both models today on my favorite benchmark question, head to head. Device: Apple Mac M3 Max 64GB Environment: llama.cpp, all defaults Gemma4-12B's token generation speed: 47 tps with MTP and 2 predicted tokens 29…
Jetson Orin NX Build for Hermes Agent + Benchmarking (www.reddit.com via reddit) I had a huge LLM server, and now I have a tiny one! I had a Jetson Orin NX gathering dust from a long dead robotics project, from back in the Llama-7B days.
Gemma 4 31B's competence surprised me (www.reddit.com via reddit) I'm just getting started using local LLMs for code. I'm not interested vibe coding, but I am hoping to increase my productivity in the publish or perish world of academia.
Unexpected Unsloth QAT Performance Compared to Unsloth IQ4_XS (www.reddit.com via reddit) Hi everyone, I am comparing the standard (non-QAT) iq4_xs and q3_k_m quants with this QAT q4_k_xl model. (All of them are Unsloth versions)(gemma-4-26B-A4B-it-GGUF via lmstudio).
Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants? (www.reddit.com via reddit) I'm trying to find out if anyone has done any benchmarking comparing the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants. I know QAT is supposed to retain a ton of accuracy compared to the baseline BF16, but I'…
Gemma 4 26B A4B IT QAT Comparison (www.reddit.com via reddit) Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me.
[Follow-up] Qwen3.6-35B-A3B 8GB RTX: I tried Linux, tested Gemma 4, and now understand why Windows was faster (www.reddit.com via reddit) Original post: https://www.reddit.com/r/LocalLLaMA/comments/1txwff3/comment/oq1e0jt/?context=3 TL;DR: Migrated to WSL2 to test Linux (several people suggested it). Embedded MTP on the UD model: 25.8 tok/s.
LMStudio gemma 4 31b QAT with MTP (www.reddit.com via reddit) Did anyone manage to launch that in LMStudio? I am on the most recent update with the most recent llama.cpp available in LMStudio.
Gemma 4 MTP with assistant vs llama cpp type MTP (www.reddit.com via reddit) Hi all Been loving the QAT models but honestly what is up with the assistant models, any ggufs and ways to make em work with vanilla llamacpp and if this way of MTP is different than the one am17an developed for llamacpp. Followup question…
Why is the MLX version of the Gemma 4 QAT so big?? (www.reddit.comhttps) the MLX version of the QAT 4bit is like 27gb but the none QAT version is 17gb and the regular 4bit MLX version is also 17gb… anyone know why?
Gemma 4 QAT + MTP: max 33% speed increase in token generation, any ideas? (www.reddit.com via reddit) I tested in-conversation memory on LFM2.5, Gemma 4 E2B and E4B. The biggest model forgot a fact from earlier in the chat first. (www.reddit.comhttps) Ran a small, focused eval on three on-device models and the result was backwards from what I expected, so sharing the method and numbers. The task: tell the model "my dog is named Pablo," then add N turns of unrelated filler (shuffled gene…
[3090] Gemma4 QAT + MTP quick TPS numbers [TLDR 1.2-1.8x better] (www.reddit.com via reddit) These last few weeks have been godsend for 24GB (and below) gpu poor peeps. Killer models released (Gemma 4 / Qwen 3.6) Free intelligence via QAT Bonus speed via MTP We're at the tipping point where GPU poor (24gb and below) people are act…
Gemma 4 Chat Template now has preserve thinking (huggingface.co via reddit) Used local Ollama (gemma4:e4b + nomic-embed-text) to bulk-generate AI summaries for 4300 arXiv papers and push them to a remote Cloudflare DB — pipeline walkthrough (www.reddit.com via reddit) The GPUless Revolution: How Efficient AI Models Are Democratizing Artificial Intelligence (www.reddit.com via reddit) You don't need a $10,000 GPU to run state-of-the-art AI anymore. The latest breakthroughs in model quantization and optimization are putting powerful AI in the hands of everyone—from hobbyists to small businesses.
Gemma 4 12b QAT is a regression for my use case, despite all the hype.. Not my main Squeeze (www.reddit.com via reddit) I spent the last few days trying to get consistent tool calling out of the new Gemma 4 12b QAT model and had to give up. When the model actually works, it works great, but for my specific use case and workflows it is just not for me.
Thoughts on Gemma4 12b vs 26a4b, which one is better? (www.reddit.com via reddit) Not talking about 31b. In terms of creative tasks, writing, chatting, not necessarily coding but can still be included, Does Gemma 12b outperform in any way?
QATs Q4_0 from Google have more precision than Q4_K_XL from Unsloth (at least some) (www.reddit.com via reddit) I wanted to try new QATs and opened two collections on HF (which HF found for me): https://huggingface.co/collections/google/gemma-4-qat-q4-0 https://huggingface.co/collections/unsloth/gemma-4-qat One strange thing caught my attention, for…
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines (arxiv.org) how to run gemma-4-12b-it-qat-w4a16-ct in vllm or any version quantized of the model (www.reddit.com via reddit) when running by using transformers it runs by using vllm some weird error come up plese can any body share the command of running it on vllm ?
Is Gemma 4 12b good for coding? (www.reddit.com via reddit) How are you using it? Quantized?
MTP and QTA - what is the relation? (www.reddit.com via reddit) I'm an old guy and I hate when things change so fast surrounded by noise and breaking news! MTP, I know what the acronym means and where it excels.