2026-05-05
Gemma 4 26B on consumer-grade 5070 Ti GPU
A week running Google's Gemma 4 26B as my daily local agent on a single RTX 5070 Ti. No API calls, no cloud, no rate limits.
model
gemma-4-31B-it
huggingface.co/google/gemma-4-31B-it ↗
2,640,636 downloads | 1,903 likes | image-text-to-text | transformers
from the model card
License: Apache 2.0 | Authors: Google DeepMind
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on small models) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages. Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in four distinct sizes: E2B, E4B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
Gemma 4 introduces key capability and architectural advancements:
Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).
Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
Optimized for On-Device – Smaller models are …
discussions
recent items
Gemma 4 26B on a consumer GPU: build pain, throughput, and BFCL numbers (algollabs.com via hn)
Show HN: I made a Gemma 4 Mac app that names screenshots with local AI (snapname.app via hn) I made my first macOS utility app that ships with a bundled Gemma 4 model, specifically the Gemma E4B one. That brings the app DMG to 5.3 GB, which I think is small for the power this free local model can provide.
Fun Local LLM Comparisons with Gemma, Granite, and Qwen (ekorbia.com via hn) Ekorbia v0.2 features a comparison-chat mode that runs 2-3 local models against the same prompt in parallel. Here are a few fun prompts running across Gemma 4 (e2b), IBM Granite 4.1 (…
Gemma-4-Harmonia-31B-Uncensored-Heretic Is Out Now, a Merge of Multiple gemma-4-31B-it Finetunes Designed for a Targeted Approach to Deep Neural Consolidation, Minimizing Regression While Amplifying Unique Capability Boundaries. With KLD 0.0047 and 9/100 Refusals! (huggingface.co via reddit) Provided in both Safetensors and GGUFs. Safetensors, llmfan46/Gemma-4-Harmonia-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/Gemma-4-Harmonia-31B-uncensored-heretic GGUFs, llmfan46/Gemma-4-Harmonia-31B-it-uncensored-heretic-GG…
Running Gemma4 31b-it on vLLM 0.21.0 A100s (bad quality or what am I doing wrong) (www.reddit.com) Okay fun time I got access to two Nvlinked A100s for some research project I benchmarked my work against the Gemma 4 31b-it available through Google, but my dataset is rather massive, so I need to run it on the "local" resources. Basically…
G4-MeroMero-26B-A4B-it-uncensored-heretic Is Out Now, a Finetune of gemma-4-26B-A4B-it, With KLD of 0.0152 and 12/100 Refusals! (huggingface.co via reddit) When I previously posted the uncensored version of the 31B version of the MeroMero finetune, quite a few people asked for the 26B-A4B version, I wasn't so keen on it because I considered the 31B to be the better version, but I understand t…
Gemma4 26b a4b Apex quant is quite good (www.reddit.com) I tried mudler's apex quant for gemma4 26b a4b and it was amazing! I got 38 tps at 90,000 context with no looping and, surprisingly, no quality degradation.
Need Help - What would you build? Air-gapped NL assistant that is integrated with Splunk (www.reddit.com) So I have a side project with given scope: Fully air-gapped / on-prem - no internet, no outbound calls of any kind Engineers ask questions about Splunk data in natural language Has to hold the conversation in Korean (index/field names stay…
Choosing an abliterated version of Gemma 4 31B and 26B-A4B (www.reddit.com) The only thread was 2 months ago, when the model had just dropped. Since then, more versions from different authors have appeared, and users have had time to test them.
Google AI Edge Gallery v1.0.13 & v1.0.14 updates: Gemma 4 Multi-Token Prediction, Pixel TPU support, experimental MCP, new skills, now saves chat history (github.com via reddit) Google AI Edge Gallery ✨ Explore, Experience, and Evaluate the Future of On-Device Generative AI with Google AI Edge. AI Edge Gallery is the premier destination for running the world's most powerful open-source Large Language Models (LLMs)…
Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp (www.reddit.com) https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja Yall are more than welcome to try it out and provide feedback. In my own testing in Pi-coding-agent I no longer have the "forgot to close thin…
Llama.cpp VS LiteRT on a custom Xiaomi 12 Pro 24/7 Server (V2 Redesign) (www.reddit.com) Thanks everyone for the advice on my previous post (24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/…
Is something went wrong with those online free model, why I feel they worse than Gemma 4 26B A4B Q4_KM ?? (www.reddit.com) It started when I just wanted to make a roleplay-style chat app with characters, but Gemma 4 26B A4B Q4_KM doesn't have info on some old characters, so I crawled back to those online services, as those models have many more parameters and are quite up to date i…
Run Chrome’s tiny Gemma4 (aka Gemini Nano) directly on PC without GPU (www.reddit.com) Everyone remembers that sneaky download of Gemini Nano earlier this month? and if you talk to it, it will happily tell you it’s a Gemma.
Built a local-first AI memory system that indexes screen activity, meetings, and voice notes ( MCP + automations) (www.reddit.com) Been experimenting with an idea — what if your AI assistant actually remembered everything you did on your computer? Not stateless chats, but real persistent context.
Gemma-4-Gembrain-31B-it-uncensored-heretic Is Out Now, a Merge of Multiple Gemma 4 31B it Finetunes Designed to Boost Logical and Lateral Thinking for Improved Adherence, Increased Swipe Variety and Enhanced Creative Prose, With KLD of 0.0186 and 13/100 Refusals! (huggingface.co via reddit) Provided in both Safetensors and GGUFs. Safetensors: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic GGUFs: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic…
Gemma 4 MTP with LlamaCPP (www.reddit.com) I am running Gemma 4 31B for a project using LlamaCPP. There is no integrated main model + MTP drafter GGUF.
Looking to migrate off of Ollama and LMStudio (www.reddit.com) Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.
Best AI (agent?) for coding locally? (www.reddit.com) Ryzen 5, 7500F RX 9070 XT 32 GB DDR5 I want to code a website and an app for something and I was wondering, whats the best AI I can run with my hardware, and should I use a tool like Claude Code or Pi agent to run them? I tried Gemma4 on P…
gemma 4 e2b quality degrades after ~30-40 continuous inferences on 4gb vram? (www.reddit.com) running gemma e2b via llama-server for continuous background tasks on a 1650 4gb. works great initially but after maybe 30-40 calls the outputs start getting noticeably worse — shorter responses, missing fields in json output, sometimes ju…
gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic is Out Now, A Writing Finetune that Aims to Improve Gemma 4 31B it Writing Quality with More Natural English and Better Prose, Good for Creative Writings, Translations and RPs! (huggingface.co via reddit) Provided in both Safetensors and GGUFs. llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic llmfan46/gemma-4-Ortenzya…
Audio input not accepted with llamacpp for Nemotron 3 nano Omni ? (www.reddit.com) Llama-server does not accept audio input (or video for that matter) with Nemotron 3 nano omni (unsloth). I’m on a recent build of llamacpp and I redownloaded Nemotron, and I have the mmproj loaded too.
Gemma4 26b MoE running in MLX with turboquant (and custom kernel) (www.reddit.com) TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with…
5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp (www.reddit.com) Hey all, While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.
Any good MOE ~60B models? I have 64GB vram (www.reddit.com) I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before rampocolypse for ~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while. Are there any good MOE models that are around 60B in parameters so I can ma…
I just bought Asus Ascent : Nvidia GB10 (DGX) and It is slower than my Ryzen Ai Max (www.reddit.com) It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?
Translate long subtitle files (www.reddit.com) I'm struggling to find a good system to translate a movie length subtitle .srt file. My current setup is to run Kobold with Gemma4 into Subtitle Edit, which then sends a request to the LLM to translate every line, but it does a bad job bec…
Why use token/s as a metric when perplexity and time to first token feel more important (www.reddit.com) I have been doing Local LLM to solve problems like mass classification of images, code generation, etc as opposed to generating text. In my experience, tokens per second aren't as descriptive of the quality of the model as is the time to f…