Fun Local LLM Comparisons with Gemma, Granite, and Qwen (ekorbia.com via hn)
model roundup
Gemma 4
-
Fun local LLM comparisons with Gemma, Granite, and Qwen Ekorbia v0.2 features a comparison-chat mode that runs 2-3 local models against the same prompt in parallel. Here are a few fun prompts running across Gemma 4 (e2b), IBM Granite 4.1 (…
-
Provided in both Safetensors and GGUFs. Safetensors, llmfan46/Gemma-4-Harmonia-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/Gemma-4-Harmonia-31B-uncensored-heretic GGUFs, llmfan46/Gemma-4-Harmonia-31B-it-uncensored-heretic-GG…
-
Okay fun time I got access to two Nvlinked A100s for some research project I benchmarked my work against the Gemma 4 31b-it available through Google, but my dataset is rather massive, so I need to run it on the "local" resources. Basically…
-
Been experimenting with an idea — what if your AI assistant actually remembered everything you did on your computer? Not stateless chats, but real persistent context.
-
It started with I just want to make a chat app like roleplay with characters but Gemma 4 26B A4B Q4_KM doesn't have info some old character so I crawl back to those online services as those model is much bigger parameter and quite update i…
-
So I have a side project with given scope: Fully air-gapped / on-prem - no internet, no outbound calls of any kind Engineers ask questions about Splunk data in natural language Has to hold the conversation in Korean (index/field names stay…
-
Best AI (agent?) for coding locally? (www.reddit.com)
Ryzen 5, 7500F RX 9070 XT 32 GB DDR5 I want to code a website and an app for something and I was wondering, whats the best AI I can run with my hardware, and should I use a tool like Claude Code or Pi agent to run them? I tried Gemma4 on P…
-
running gemma e2b via llama-server for continuous background tasks on a 1650 4gb. works great initially but after maybe 30-40 calls the outputs start getting noticeably worse — shorter responses, missing fields in json output, sometimes ju…
-
Choosing an abliterated version of Gemma 4 31B and 26B-A4B (www.reddit.com)
The only thread was 2 months ago, when the model had just dropped. Since then, more versions from different authors have appeared, and users have had time to test them.
-
Everyone remembers that sneaky download of Gemini Nano earlier this month? and if you talk to it, it will happily tell you it’s a Gemma.
-
https://preview.redd.it/sm4ysgdw1w2h1.png?width=1376&format=png&auto=webp&s=3705932403919814fbf2008a1cba189d17e0591e Thanks everyone for the advice on my previous post (24/7 Headless AI Server on Xiaomi 12 Pro (Snapdragon 8 Gen 1 + Ollama/…
-
Gemma4 26b a4b Apex quant is quite good (www.reddit.com)
I tried mudler's apex quant for gemma4 26b a4b and it was amazing! I got 38tps at 90.000 context with no loop and suprisingly no quality degradation.
-
https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF/blob/main/gemma4-improved.jinja Yall are more than welcome to try it out and provide feedback. In my own testing in Pi-coding-agent I no longer have the "forgot to close thin…
-
When I previously posted the uncensored version of the 31B version of the MeroMero finetune, quite a few people asked for the 26B-A4B version, I wasn't so keen on it because I considered the 31B to be the better version, but I understand t…
-
Translate long subtitle files (www.reddit.com)
I'm struggling to find a good system to translate a movie length subtitle .srt file. My current setup is to run Kobold with Gemma4 into Subtitle Edit, which then sends a request to the LLM to translate every line, but it does a bad job bec…
-
Gemma 4 MTP with LlamaCPP (www.reddit.com)
I am running Gemma 4 31B for a project using LlamaCPP. There is no integrated main model + MTP drafter GGUF.
-
Google AI Edge Gallery ✨ Explore, Experience, and Evaluate the Future of On-Device Generative AI with Google AI Edge. AI Edge Gallery is the premier destination for running the world's most powerful open-source Large Language Models (LLMs)…
-
Provided in both Safetensors and GGUFs. Safetensors: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic GGUFs: llmfan46/Gemma-4-Gembrain-31B-it-uncensored-heretic…
-
5060ti chads -> gemma-4-31b-it-nvfp4 + vllm + mtp (www.reddit.com)
Hey all, While nvfp4 still seems to be a work in progress, the latest version of vllm 0.21 finally has mtp working for gemma. With all the talk of qwen being badass I thought I would revisit gemma.
-
Any good MOE ~60B models? I have 64GB vram (www.reddit.com)
I have a build with 2 x MI50 32GBs and 64 gigs of DDR4 (bought before rampocolypse for ~630 USD total, I’m not rich) and I’m not gonna upgrade it for a long while. Are there any good MOE models that are around 60B in parameters so I can ma…
-
Looking to migrate off of Ollama and LMStudio (www.reddit.com)
Hello, I'm currently using Ollama / lm studio for things like code inference and proof reading emails, etc. Definitely not experienced in this space but looking to grow.
-
Provided in both Safetensors and GGUFs. llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic: https://huggingface.co/llmfan46/gemma-4-Ortenzya-The-Creative-Wordsmith-31B-it-uncensored-heretic llmfan46/gemma-4-Ortenzya…
-
I have been doing Local LLM to solve problems like mass classification of images, code generation, etc as opposed to generating text. In my experience, tokens per second aren't as descriptive of the quality of the model as is the time to f…
-
Audio input not accepted with llamacpp for Nemotron 3 nano Omni ? (www.reddit.com)
Llama-server does not accept audio input (or video for that matter) with Nemotron 3 nano omni (unsloth). I’m on a recent build of llamacpp and I redownloaded Nemotron, and I have the mmproj loaded too.
-
Gemma4 26b MoE running in MLX with turboquant (and custom kernel) (www.reddit.com)
TL;DR I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turbo quant and rotating KV cache support. The answer was yes, and I'm now able to run Gemma4 26b on my MacBook Air M5 at 128k context with…
-
It is suppose to be 2-4x faster but i am only getting 6TK/s on Gemma4-31B . What am i doing wrong?
-
Sparky runs entirely on the Jetson. Gemma 4 E4B at Q4_K_M via llama.cpp with q8_0 KV cache and flash attention.
-
Hi, has anybody succeeded in running llama.cpp with Gemma 31b dense and Gemma e4b as draft model, and simultaneously inhibit the voice recognition feature? Is it even (theoretically) possible?
-
Hi r/LocalLLaMA - I've been paying close attention to the edge AI ecosystem because it's an area where i see huge potential and where I truly believe AI will become more useful for day to day tasks. Around the gemma 4 release I was already…
-
I spend most of my day writing prompts to Claude. Read a study recently that said people speak ~3x faster than they type, which lands differently when "writing" is basically your whole workflow.
-
The "the future is fictional" problem of many local LLMs (www.reddit.com)
Many local models have a problem (that raised due to excessive RHLF training): They mostly think that everything that is beyond their knowledge cutoff date would be "fictional" or "satirical". To be fair: Even the Gemini API without web ac…
-
I can set my context length as high as 64k and the vram usage is not even remotely close to the maximum utilisation. My TPS is also 40+.
-
LLMs on flagships smartphones? (www.reddit.com)
I have been curious to see how small LLMs like Gemma-4-E2B-it run on a flagship smartphone (S25+ with Snapdragon 8 Elite) in terms of prompt processing and token generation. I have created a script that uses llama-cli and I achieve 48 tps…
-
very slow tok/s with Gemma 4 31B on a 5090?! (www.reddit.com)
Hi, i have a 5090 and i was tyoing around with hermes-agent. To utilize 128K i thought about switching from LM Studio to llama-cpp (the turboquant fork) expecting better tok/s and also saving some VRAM from context quantization.
-
Does THINKING MODE significantly improve translation? (www.reddit.com)
Between a solid model from Qwen or Gemma 4, when translating a text, does "thinking mode" significantly boost the quality of the translation, or is the difference negligible?
-
Hi all I had a quick question while we wait for llama.cpp MTP implementation, have any of y'all tried Gemma4 MTP models on ollama and or transformers? What was your experience and or cli args and or workflows like?
-
My partner uses Duolingo for learning and practicing languages, but has been getting increasingly sick of it. I decided to experiment with whether local models would be good for creating and grading language exercises.
-
Gemma 4 MTP vs DFlash on 1x H100: dense vs MoE results (www.reddit.com)
Benchmarked Gemma 4 MTP and z-lab's DFlash on a single H100 80GB using vLLM and NVIDIA's SPEED-Bench qualitative dataset. Setup: Hardware: 1x H100 80GB Runtime: vLLM Dataset: SPEED-Bench qualitative Prompts: 880 total, 80 prompts across ea…
-
converting weights to snn (www.reddit.com)
Hello everyone, I developed the snn architecture from scratch based on the human brain. I had several successful launches of training spike models from scratch and I also had an idea: what would happen if I took the gemma 4 model and conve…
-
Gemma 4 E4B is great for short transcriptions (www.reddit.com)
Yes, for material that is an hour long, there is no getting around tools like Whisper - or something even better. However, for transcribing short snippets, Gemma works very quickly and reliably- even in foreign languages.
-
could not extract summary
-
Terrible Vulkan pp/tg on Arrow Lake iGPUs (www.reddit.com)
Hi, I recently tried to get llama.cpp with SYCL running on an Arrow Lake system but gave up halfway through since Vulkan is just way easier to set up. But, the pp/tg I'm getting on Vulkan w/ Arc 130T is disgustingly bad - 100 tokens/s for…
-
ExLlamaV3 Major Updates! (www.reddit.com)
Turboderp has a been on an absolute tear recently, in the endless battle to cram new llamas into smaller, faster boxes. We started off last month with the release of gemma 4 support, and continued with improved caching efficiency.
-
Anybody else noticing how good gemma-4-26b-a4b is with one-shotting three.js? (rowanunderwood.github.io via reddit)
I wrote up this little python app to cycle through a bunch of prompts like this: Single HTML file using three.js from CDN. A central rotating MeshNormalMaterial torus knot.
-
Gemma Chat: Offline Vibe Coding on Apple Silicon (github.com via hn)
Gemma Chat Vibe code without the internet. A local coding agent powered by Google's Gemma 4 — runs entirely on your Mac via Apple's MLX framework.
-
Dual gpu question (www.reddit.com)
Hı, i have rx 9060XT and rx 6600. 16gb and 8gb.
-
Grafting a Speech Head onto Gemma 4 E4B (www.frisson-labs.com via hn)
Grafting a Speech Head onto Gemma 4 E4B For a Discord buddy, the tempting model shape is small, fast, and multimodal. It should hear the call, see the game, read the chat, and respond quickly enough that the moment is still alive.
-
Gemma 4 - website translations (large model, or small model)? (www.reddit.com)
I have setup a workflow to process website translations with Gemma 4, I just host it on LM Studio, and a custom Python wrapper iterates through and runs overnight. My question is..
-
3060 Ti 12GB vs RX 7600 XT 16GB? (www.reddit.com)
Trying to figure out which is better for LLM. Mainly Gemma 4.
-
MTP is all about acceptance rate (www.reddit.com)
So I was very excited about the MTP stuff especially since Gemma4 has become my "daily driver" for some stuff. I grabbed the latest mlx-vlm and did some tests and found it disappointing.
-
Gemma 4 26B Hits 600 Tok/s on One RTX 5090 (www.reddit.com)
I ran a benchmark to see how much DFlash speculative decoding actually helps in vLLM. Setup: GPU: RTX 5090, 32GB VRAM vLLM: 0.19.2rc1 Main model: cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit Draft model: z-lab/gemma-4-26B-A4B-it-DFlash Workload: r…
-
Hey folks, I posted here a few months back about how I was basically working for Claude -- pasting the same emails, re-explaining the same backstory, being its memory across every chat. Today I'm launching Contextify.
-
Gemma4 26B A4B NVFP4 GGUF (www.reddit.com)
Hey everyone! I’ve just uploaded a GGUF version of nvidia/Gemma-4-26B-A4B-NVFP4.
-
What's the right way to feed PDF files to Gemma-4? (www.reddit.com)
In my line of work, PDF documents tend to be combinations of text, math formulas, tables and images. llama.cpp added support for PDFs a few months ago, but I believe it treats PDFs either as text (discarding everything else), or as images.
-
New Gemma 4 draft models released (alternativeto.net via reddit)
Just saw this and wanted to ask the obligatory GGUF when?
-
Google launched its Gemma 4 open models this spring, promising a new level of power and performance for local AI. Google’s take on edge AI could be getting even faster already with the release of Multi-Token Prediction (MTP) drafters for G…
-
Getting unexpected output with Gemma 4 31b-it on vLLM (www.reddit.com)
Hey everyone, I'm running into a weird issue and hoping someone here might have a fix or some troubleshooting ideas. I'm currently trying to run the new Gemma 4 31b-it model using vLLM (v0.20.0-cu130) deployed via Helm chart (https://gith…
-
Decoupled Attention from Weights - Gemma 4 26B (www.reddit.com)
Absolutely unbelievably exciting work, split attention (i.e. a couple of GB) onto local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!!
-
Just tested Gemma 4 31B with the new official MTP Drafter on my H100 today and compared the approach with DFlash to help you decide which one to use. Without drafter: 13.7 tok/s.
-
The idea of offload-mcp is simple: instead of running hardware-hungry local models for routine work, let Claude offload that work to FREE model APIs and SAVE tokens. I’m using Gemma via the Google GenAI API because I like it in my processi…
-
Gemma 4 MTP Test - what speedup can you gain? (www.reddit.com)
Let's test the Gemma 4 MTP implementation using the HuggingFace Transformers library and the new drafter model by Google. We'll load both models and test on a couple of prompts with and without the MTP support https://www.youtube.com/live/…
-
Testing an Ollama powered agent mode with gemma4 inside Modly (www.reddit.com)
Hey everyone ! Quick Modly update.
-
Accelerating Gemma 4: faster inference with multi-token prediction drafters Just a few weeks ago, we introduced Gemma 4, our most capable open models to date. With over 60 million downloads in just the first few weeks, Gemma 4 is deliverin…
-
Running a 26B LLM locally with no GPU (www.reddit.com)
This is crazy. I've been running local LLMs on CPU only for awhile now and have great results with 12B models running on an i5-8500 and only 32GB of RAM with no GPU.
-
When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them.
-
Hey everyone, I built a tool that creates movie recap videos automatically using local models. The problem: making recap videos takes forever.
-
Roundtable chat with Talkie-1930 and Gemma 4 31B (www.reddit.com)
Talkie-1930-13b-it and Gemma 4 31b in the same chat. Talkie is a 13B vintage language model from 1930.
-
it's time to update your Gemma 4 GGUFs (www.reddit.com)
Chat Template was fixed a few days ago choose your fav dealer: https://huggingface.co/bartowski/google_gemma-4-31B-it-GGUF https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF https://huggingface.co/bartowski/google_gemma-4-E4B-…
-
interacting with gemma 4 w/ live video and audio (www.reddit.com)
I saw someone on this forum demonstrate using gemma 4 - live streaming audio and video from his webcam to it asking it what it was seeing. It was pretty great but I cant find that post anymore and I can't find a good repo on github where I…
-
Anybody tried openclaw + M5 pro + 48gb? (www.reddit.com)
Hello, posting again on this since my last post was removed. I am working on an AI agent solution to help me with my multiple daily tasks for different business activities; a few rental properties, a manufacturer trying to enter the Mexico…
-
The first photo shows the results when run on the CPU, and the second one is on the GPU. Look at the speed difference between the Prefill and Decode speeds in my benchmark results.
-
Been running Gemma 4 E2B locally on my OnePlus CE 5 (8GB RAM) for a few months. Chat quality is fine for the size.
-
Potential of Gemma4 Per-layer embeddings? (www.reddit.com)
Hey there people. So let's talk about GEMMA 4 per layer embeddings.
-
I've been experimenting with using Ollama to run Claude Code locally with models like Gemma 4, thinking I could avoid API costs. However, I quickly realised these models aren't really optimised for Claude Code's agentic workflows — they te…
-
gemma-4-31B-it-DFlash has been released (www.reddit.com)
https://huggingface.co/z-lab/gemma-4-31B-it-DFlash I guess we'll have to wait until this PR is merged before we can test it. https://github.com/ggml-org/llama.cpp/pull/22105
-
I'm testing running local LLMs on a gaming mini PC (AMD 7840HS, 32 GB RAM) paired with an eGPU (Radeon 9060XT with 16 GB VRAM). Since I'm not very familiar with using llama.cpp, I kept getting unsatisfactory results, but with the recent Ge…
-
nvidia/Gemma-4-26B-A4B-NVFP4 (huggingface.co via reddit)
Can confirm it works on a 5090, with 80% allocation (of 32gb) I got around 50k context. It's 18.8GB Benchmark Baseline (Full Precision) NVFP4 GPQA Diamond 80.30% 79.90% AIME 2025 88.95% 90.00% MMLU Pro 85.00% 84.80% LiveCodeBench (pass@1)…
-
thinking of gemma 4 26B vs 31B (www.reddit.com)
I see a big difference in agentic coding between gemma-4-31B-it-Q5_K_M and gemma-4-26B-A4B-it-UD-Q8_K_XL. The 26B model is much faster because of A4B and generally works well, but there is a big difference in thinking.
-
Based on what should I choose Gemma 4 models/quantizations? (www.reddit.com)
I have an RTX 4060 8GB(+16GB RAM) laptop, and when asking Gemini or ChatGPT, they say the Gemma 4 Q4 K M is the best fit for my hardware with Context Length around 16k-32k. However, in practice, after loading even a higher quantization lik…
-
With the release of Gemma 4 models and a slew of open weight/source models subsequently, some of the workflows like drafting emails/ trivial coding tasks have become possible. I’m exploring the possibility of integrating some of the powerf…
-
Gemma 4 architecture support for QVAC-Fabric (Tether's llama.cpp fork) (github.com via hn)
QVAC-Fabric Gemma 4 Architecture Patch Adds full Gemma 4 (gemma4) architecture support to QVAC-Fabric, Tether's llama.cpp fork. Base: QVAC-Fabric temp-upstream branch Target: All Gemma 4 variants (E2B, E4B, etc.
-
Gemma-4 MLX reasoning? (www.reddit.com)
Gemma-4 is great. On a MacBook M5, using lm-studio, the MLX versions (specifically looking at https://huggingface.co/lmstudio-community/gemma-4-26B-A4B-it-MLX-8bit) rock.
-
TLDR: I've been running gemma4 e2b extensively on iOS with llama.cpp and found some interesting quirks and info you guys may like! These are specifics for the iPhone and what I've found worked across 20+ devices.
-
The gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not have proper support for vision and audio inputs (specially audio) for these models as of now.
-
AMG GPUs are faster at pre filling (www.reddit.com)
I did give same prompt same document to 1660ti running Gemma 4 e2b q4 coz of the small vram and another to and igpu running Gemma 4 e4b q8 prefill rate before token generation was like 4-5 times faster with the 890m igpu then token generat…
-
How to run a local coding agent with Gemma 4 and Pi | Patrick Loeber (patloeber.com via reddit)
Tutorial from the Google guy, I use very similar setup (llama.cpp instead of lmstudio)
-
Using Google's Gemma 4 E4B local AI model to Reverse Engineer a simple Crackme I was playing around with the new Gemma E4B open weights local model which Google released, and to my surprise I was seeing a great deal of success in using it…
-
Running Gemma 4 31B on Mac with Ollama (sammyrulez.github.io via hn)
A practical configuration for a 32 GB M5 Mac that still needs to remain usable Running large language models locally has become surprisingly practical on Apple Silicon. With a modern Mac, Ollama, and a carefully quantized GGUF model, it is…
-
Best sota 12b-32b creative writing model? (www.reddit.com)
I love using openrouter but I would also love a smaller model that can fit within 16gb of VRAM and 64b of ram, that can pack a punch for its size specifically in the creative writing section. Any good recommendations?
-
A weekend with LoRA on Gemma 4 E2B: instrumenting what fine-tuning changes (aiexplr.com via hn)
Spent a week doing LoRA fine-tuning on Gemma 4 E2B (~5.1B total params, ~2B active in text decoder) for a narrow Python code-generation task. Bad outputs went from ~5% to 0% (greedy) and 1.5% (sampled) across 134 tests.
-
Gemma 4 Folks (www.reddit.com)
Full Answer >>> A plane crashes on the border of two countries. Where do they bury the survi ...
-
Best settings for gemma-4 on a 3090? (www.reddit.com)
3090 (24G) + 32G DDR4 Currently running --mmproj mmproj-BF16.gguf --chat-template-kwargs '{"enable_thinking":true}' \ --flash-attn on \ --cache-type-k q4_0 \ --cache-type-v q4_0 \ -np 1 \ -c 160000 \ --jinja at 26B-A4B-it-UD-Q5_K_XL and ge…
-
Spent a week doing LoRA fine-tuning on Gemma 4 E2B (gemma-4-e2b-it, ~5.1B total params, ~2B active in the text decoder) for a narrow Python code-generation task. Setup: Model: Gemma 4 E2B, bf16, language_model only (vision + audio towers f…
-
Ollama swap to llamacpp/llama server (www.reddit.com)
So I'm a newb in certain aspects but not in others, I'm currently running an AI stack on my unraid server: CPU: AMD Threadripper 3960X (24c/48t) Motherboard: Gigabyte TRX40 AORUS PRO WIFI RAM: 256GB DDR4-3200 G.Skill Trident Z GPU: Nvidia…
-
Always been stuck with models that fit on my 16gb .... Going to have about a week for free with 4x rtx6000pro .
-
Hi everyone, as a power user I hit Claude Code's usage cap too often I wanted to set up my own local model, however I only have RTX 5070 with 12 GB of VRAM so the only realistic option was Gemma 4 with effective 4B params. When I tried to…
-
Gemma 4 VLA Demo on Jetson Orin Nano Super (huggingface.co)
Gemma 4 VLA Demo on Jetson Orin Nano Super You speak → Parakeet STT → Gemma 4 → [Webcam if needed] → Kokoro TTS → Speaker Press SPACE to record, SPACE again to stop. This is a simple VLA: the model decides on its own whether to act based o…
-
Gemma 4 is not your standard transformer (idlemachines.co.uk via hn)
Gemma 4 makes five quiet departures from the standard transformer recipe. QK-norm instead of 1/√d, partial RoPE on global layers, per-layer input gating, KV sharing across layers, and an MoE that sits alongside the MLP rather than replacin…
-
My old Samsung S10 was sitting in a drawer so I turned it into an always-on LLM endpoint. PocketPal is great for on-phone chat, but I wanted the phone itself to be an OpenAI-compatible endpoint for the rest of my network.
-
I am using an Arc Pro B70 to do inference, and it's token generation speed is fine using Ollama, but it takes *forever* to do a prefill. vLLM absolutely tackles the prefill problem (nearly instant responses), but I can't run nearly as larg…
-
How do I run Gemma 4 e4b, extracted via adb from Google AI Edge Gallery on Android, whose image is in litertlm format and weighs 3.6 gigs, in a browser? I mean using web technologies?
-
Why does Gemma 4 e4b from Google AI Edge Gallery on Android weigh only 3.6 gigs, while the one from Unsloth (gemma-4-E4B-it-UD-Q2_K_XL.gguf) weighs 3.7, and for some reason the model image in litertlm format extracted via adb from Google A…
-
-
Gemma 4 E4B is broken (www.reddit.com)
-
Gemma 4 Vision (www.reddit.com)
-
-
Handling a large amount of files (www.reddit.com)
-
-
model for frigate, a380 (www.reddit.com)
-
Gemm4:e4B-IT good at instructions following no refusals. (www.reddit.com)
-
Deploying Gemma 4 26B on an RTX 5090 (datapnt.com via hn)
-
Show HN: Prompt-to-Excalidraw demo with Gemma 4 E2B in the browser (3.1GB) (teamchong.github.io via hn)
-
Gemma 4 - MLX doesn't seem better than GGUF (www.reddit.com)
-
Gpu reccommendations for Coding/chat LLM (www.reddit.com)
-
Gemma4 26B MoE on Arc 140T (www.reddit.com)
-
How do I get the LLM to answer everything? (www.reddit.com)
-
Lm studio running some models very slow while others run normally. (www.reddit.com)
-
I have Gemma4-E2B working within home assistant as STT, and E2B seems fast and accurate for STT (maybe a bit better than Parakeet), however, it responds with the entire thought process: https://preview.redd.it/v8zhb5elltvg1.png?width=599&f…
-
So the question I've seen posed many times in /r/singularity is if the Gemini models are actually that bad at coding compared to their benchmarks, or whether the harness used makes an absolutely gigantic difference in model performance. Gi…
-
Saw a post about running hermes agent locally with gemma4 through ollama. zero api costs, unlimited tokens, full privacy.
-
NVIDIA V100 32GB for AI in 2026 (www.reddit.com)
hello. i have the oportunity of buying Nvidia V100 with 32GB for about 915$ / 775 euro.
-
Will Gemma 4 replace Claude Code or are we lying to ourselves again (webmatrices.com via hn)
Vercel Security Checkpoint | cle1::1776468758-lOAcIwtVVUa8cG9OLlcTtnlZwlvTxsBe
-
Apparently, llms are just graph databases? (www.reddit.com)
I found this youtube video, where this guy created a database querying language to basically query models as if they are just database. I am blind so can't see the graphs, but he talks about edges, nodes, features and entities.
-
Dabbling in Ai - Which Hardware to get (www.reddit.com)
Hi everyone, I want to get deeper into running local models and need new Hardware for this. My best suited machine for this is currently the 2020 M1 Macbook Pro with 16GB shared Memory which is cool for Gemma4 4B but I think I am missing o…
-
I'm using the https://github.com/PrismML-Eng/llama.cpp fork for Bonsai, regular llama.cpp for Gemma. Without embedding parameters: Gemma 4 has 2.3B at 4.8 bpw (Q4_K_M) = 1104 MB Bonsai-8B has 6.95B at 1.125 bpw (Q1_0) = 782 MB (only 29% sm…
-
New to this - question about main AI model plus sub agent (www.reddit.com)
Just installed open claw It is runing gemma4 right now - which feels somewhat slow in responding. After doing some more reading I wanted to ask if it's really practical to use Free Chat GPT - for the main chat agent.
-
Want your LLM to use the internet? Here's an MCP server for that. (www.reddit.com)
The showcased examples were made using Gemma 4 31b. Any LLM with tool calling support should work.
-
could not extract summary
-
MB Pro M5, 24GB/32GB difference? (www.reddit.com)
Hi, I got new MB Pro 24GB/1TB. I've test Gemma 4 26B with ollama, 16k context.
-
Frontier Coding Agents Built a Video Diffusion Pipeline on Max (www.modular.com via hn)
Gemma 4 just dropped on Modular, Day Zero! Read More → Inference Products Shared Endpoints Access frontier models via an API Dedicated Endpoints Mission critical reliability Custom models Your model, peak performance Deployment Options Our…
-
https://preview.redd.it/w6ssjgidjlvg1.png?width=2786&format=png&auto=webp&s=f52736d40580fe8a8ff74adbbb5be81f12fbcbfc So I was playing with Gemma 4 and was trying to figure out whether the model could determine its own training data cutoff…
-
Fine-tuning and deploying Gemma 4 is not that easy (ghost.oxen.ai via hn)
Writing a fine-tuning and deployment pipeline isn't as easy as it looks (Gemma 4 Version) Fine-tune and deploy Gemma 4 on Oxen.ai Google's Gemma 4 dropped in April 2026 with multimodal support (text, image, video, audio), a novel hybrid KV…
-
Summary of Findings This issue documents what we learned making Gemma4 26B-A4B-it train on consumer hardware (RTX 4090, 24GB VRAM). No A100.
-
Gemma 4-written, small cc0 encyclopedia of some core science content (stateofutopia.com via hn)
Published: April 16, 2026 This is an encyclopedia of some core content from Biology and Health Sciences, Physical Sciences, and Technology. It contains 2,259 small entries of about a paragraph each.
-
why gemma 4 31b so bad in long context? (www.reddit.com)
question, I'm using it for text translations and on each large prompt (20K+) it stops with a remark 'now I'm going to put that to the file' or some other operation I have asked in the prompt for but it did nothing, just stopped. I'm runnin…
-
LiteRT LM Framework with Rockchip NPU (RKNN 3588) (www.reddit.com)
Im searching for build version of LiteRT LM framework can use and utilize the NPU of the RKNN 3588. It would be great since I can run gemma 4 e2b model using this framework on the machine, because I wont have to migrate my codebase from li…
-
Need suggestions for local AI Machine (www.reddit.com)
I’ve been running various AI harnesses like OpenClaw, ForgeCode, ClaudeCode, etc. Most of these are running via OpenRouter or Minimax (credits/subscription model).
-
gemma4 e2b ore4b on rtx 5070 ti laptop 12GB not running on vLLM (www.reddit.com)
I cant get gemma 4 e2b or gemma 4 e4b to run on my laptop. I am runnning it via docker as per vllm website and i get the error : Free memory on device cuda:0 (9.71/11.5 GiB) on startup is less than desired GPU memory utilization (0.9, 10.3…
- gemma4 e4b on rtx 5070 ti laptop 12GB running slow 5t/s llama.cpp (www.reddit.com)
-
5090 for 285k on amazon india? (amzn.in via reddit)
How is it possible the seller also has no record just wanted to run gemma 4 31B q4 with 150k ctx
-
I try with Gemma 4 E4B via llama-sever to play chess at https://www.chess.com/play/computer (any platform or site you convenient), result quite unexpected for me. Result: 9 moves before it make cheating move (like try to move a pawn take a…
-
Llama.cpp vs LM Studio on gaming PC (www.reddit.com)
Here is my experience, I've been using LM Studio with RTX 5080 and 64GB RAM using Windows 11. I'm very happy with LM Studio except the speed.
-
Gemma 4 on iOS: Anyone else stuck on CPU because of the "Buffer(31)" Metal crash? Hey everyone, I’m hitting a massive performance wall building an on-device AI app for the iPhone 17 Pro.
-
Experience with medium sized LLMs (www.reddit.com)
I have tried to use several models on my 8gb ram MacBook and concluded that 4b parameters models are just “stupid” for my tasks (i.e. summarisation of pdfs, language learning, etc.).
-
Right now, I'm working on a small app to help eliminate my own doomscrolling by automatically crawling sites and summarizing news articles. However, I don't like the idea of giving OpenClaw free reign of my system, nor giving it any sort o…
-
Hello, I have noticed an annoying issue with Gemma 4 26b a4b. It seems like it cannot do multiple think->tool call->think->tool call turns.
-
Gemma 4 is good or bad at real word (www.reddit.com)
Based on real-world usage by the community, roughly which version of which model is Gemma 4 comparable to? It would be great if you could also mention the hardware requirements for running it (like VRAM or GPU needs)
-
Offload settings for unsloth/Gemma-4 on Apple Silicon? (www.reddit.com)
Can default settings be optimized, or is it the best it is going to get? M1 Max Is it best in llama.cpp, LM Studio, or ?
-
Ollama Cloud - Pro (www.reddit.com)
Hi. I've been looking at ollama cloud's Pro offering ($20), which says "Run 3 cloud models at a time".
-
What's the better way to install llama.cpp on Android? (www.reddit.com)
I own an Oppo Find X3 Pro (Snapdragon 888, 12/256 GB, Android 14.0) unused because of 3 green vertical lines on the screen and poor battery. I tried Google AI Edge Gallery with Gemma-4-E2B-it and it performs well so I thinked: "why don't t…
-
Thanks
-
What are your opinions on the SuperGemma finetune? (www.reddit.com)
So, I'm relatively new to the scene and I kind of want to do a sanity check. I've been using gemma-4-26B.
-
Gemma 4 Jailbreak System Prompt (www.reddit.com)
Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.
-
Gemma 4 running locally on an iPhone 13 Pro (www.reddit.com)
I’ve been experimenting with running LLMs fully on-device, and managed to get Gemma 4 running locally on an iPhone 13 Pro. This is built on top of a lightweight Swift wrapper I open-sourced: https://github.com/mylovelycodes/LiteRTLM-Swift…
-
Ive automated my email/sms/phone (www.reddit.com)
we got it good boys! how many of you are doing this??
-
Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference (www.gizmoweek.com via hn)
Google Gemma 4 Runs Natively on iPhone With Full Offline AI Inference - GizmoWeek GizmoWeek Read the News News Reviews Apple How to Phones Products Subscribe Subscribe to newsletter [x] I've read and accept the Privacy Policy. Follow us Fa…
-
Been trying to run local LLMs on my new Dell XPS 13 with Intel Arc 140V (Lunar Lake, 16GB) and hit a wall — Intel's official docs point to a portable zip frozen at Ollama v0.5.4 which can't pull any modern model. Spent a while debugging it…
-
Local Agent Hermes setup with Gemma 4 and llama.cpp (www.youtube.com via reddit)
About Press Copyright Contact us Creators Advertise Developers Terms Privacy Policy & Safety How YouTube works Test new features NFL Sunday Ticket © 2026 Google LLC
-
Gemma 4 & Obsidian (www.reddit.com)
so today I tried the Obsidian LLM wiki system by Karparthy, but with Gemma 4 locally in OpenCode with instead of Claude code. My experience is very frustrating.
-
Does an MLX conversation have same capabilities as the GGUF? (www.reddit.com)
For example, in LMStudio the official Gemma 4 is a GGUF that has Vision, Reasoning, and Tools flags. But the MLX version does not.
-
Gemopus: A Gemma fine-tune that prioritizes stability over long chain-of-thought (huggingface.co via hn)
🌟 Gemopus-4-26B-A4B-it [!NOTE] Gemopus is an attempt at fine-tuning Gemma 4 with a core philosophy of "stability first". While preserving the original reasoning order of Gemma 4 as much as possible, we conducted targeted refinements for an…
-
Gemma 4 and the Economics of Selling AI (gertlabs.com via hn)
Benchmarks, rankings, and live play for AI models and agents.
-
If you're using Gemma 4 with external MCP servers in LM Studio and getting this error: Error rendering prompt with jinja template: "Unknown test: sequence" This is a bug in Google's official Gemma 4 Jinja prompt template. LM Studio's Jinja…
-
I was building an Android app and integrated Gemma 4 E2B directly using LiteRT-LM. On-device translation, zero server cost, the dream setup.
-
Gemma 4 base GGUF? (www.reddit.com)
Hello, I've seen reviews that gemma 4 31b base is very good at roleplaying. But I can't find the gguf version of the basic gemma 4 anywhere.
-
Suggestion for a local model to solve math problems. (www.reddit.com)
Does anyone know of a good edge local llm that is good in math's. I tried Gemma 4 E2B, microsoft phi mini reasoning but both can't answer some basic apti question's.
-
RTX 3090 llamacpp flags help (www.reddit.com)
Hi, my current system hardware RTX 3090 24GB VRAM & Sysrem RAM 64GB using windows 11 been playing around with hermes agent and local llm (Qwopus3.5-27B-v3-GGUF & gemma-4-26B-A4B-it-GGUF) when i try asking the hermes agent to do a task with…
-
Q8 Cache (www.reddit.com)
https://github.com/ggml-org/llama.cpp/pull/21038 Since now cache quantization has better quality, does that mean Q8 cache is a good choice now? For example for 26B Gemma4?
-
Gemma 4 31B — 4bit is all you need (www.reddit.com)
Gemma quant comparison on M5 Max MacBook Pro 128GB (subjective of course, but on variety of categories): gemma 4 leaderboard the surprising bit: Gemma 4 31B 4bit scored higher than 8bit. 91.3% vs 88.4%.
-
Turned a Xiaomi 12 Pro into a dedicated local AI node. Here is the technical setup: OS Optimization: Flashed LineageOS to strip the Android UI and background bloat, leaving ~9GB of RAM for LLM compute.
-
Share your speculative settings for llama.cpp and Gemma4 (www.reddit.com)
I have totally missed the boat on speculative decoding. Today when generating some code again for the frontend i found myself staring down at some quite monotonic javascript code.
-
My guess as to what Apple Foundation Models will be like in iOS 27 (www.reddit.com)
Could you imagine if the new Apple Foundation Models was based on Gemma 4 E4B text like the LiteRT version is? That would be one amazing built in model.
-
Looking for a team to participate in Gemma 4 good hackathon (www.reddit.com)
Hey folks, I've been tinkering with Gemma 4 and absolutely the fact this model can run locally on Android phone! I am experienced fullstackdev, open to solve any real-world problem that has an impact.
-
If you are on Gemma (like me), you basically have to compile llama.cpp daily now
-
Best setup for multiple high-end dissimilar PCs (www.reddit.com)
I did some searching and didn't find a extremely similar situation. I'm jumping head first into hosting locally, and my experience has been good so far.
-
Con el lanzamiento de modelos optimizados para ejecutarse localmente (como lo que estamos viendo con la evolución de Gemma 4), parece que el péndulo de la IA se está alejando de la nube.
-
How do I use gemma4 on 5090 gpu for coding? (www.reddit.com)
I'm trying to replace openai codex which i used for development all the time, with gemma4 on 4090, small tasks it solves quite impressively, but i need to have some agent. So I tried to connect 31b to cline and to aider and it didn't reall…
-
Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.
-
Why some small/medium models fail at grammar checking task? (www.reddit.com)
Recently, I try playing with gemma 4 (gemma-4-E4B-it-Q5_K_S.guff) and find out it fail at easy grammar check (it try to fix the already corrected word "contemporary"). I noticed the same mistake from openai/gpt-oss-20b and qwen3-next-80b-a…
-
I’ve been experimenting with running a local coding assistant on Gemma 4 26B, focused on understanding full codebases instead of single-file prompts. Main idea: - build a project map (files, symbols, structure) - run a planning step to dec…
-
Are the LiteRT versions of Gemma 4 a different architecture? (www.reddit.com)
I was surprised at how much smaller the LiteRT versions of Gemma 4 E2B used in Edge Gallery were (2.0-3.3 GB) compared to the main release (10.2 GB), so I had Claude code take a look. Claude tells me that the vocab size for the LiteRT vers…
-
I ran Gemma 4 as a local model in Codex CLI (medium.com via hn)
I ran Gemma 4 as a local model in Codex CLI | by Daniel Vaughan | Google Cloud - Community | Apr, 2026 | Medium Sitemap Open in app Sign up Sign in Get app Write Search Sign up Sign in Google Cloud - Community · A collection of technical a…
-
Opencode + lmstudio : first prompt very slow (www.reddit.com)
I actually make some tests with lm studio and Opencode with the new Gemma 4 26b model. The results are really impressive especially on small refactoring and integration tasks.
-
Thinking with a smaller model to speed things up? (www.reddit.com)
Question: can i do the thinking with a smaller model, like Gemma 4 4B, then use that as the prompt for Gemma 4 31B, to speed things up? Has anyone done this and measure if it's worth it?
-
Open Claw on my old PC (32GB Ram, 12GB VRAM) model suggestions? (www.reddit.com)
I tried running Gemma4 E4B through llama cpp, and I couldn't get it to reply wiithout timing out.
-
My specs: RTX 5060ti(16gb), 16gb DDR5 ram. (os : Fedora 43) I want an uncensored model, it would be preferable if it can do image gen but if the quality of text is high enough it should not be problem if it does not support it.
-
I’ve been keeping a personal journal for the past few years. The entire thing is made up of over 100k+ tokens.
-
Getting no result train Gemma 4 for structured data extraction (www.reddit.com)
Hello, I've been trying for several days to train Gemma-4 for extracting data from a string and convert it into a structured JSON. I've tried a fair amount of different configurations, I've tried Unsloth studio and Llamafactory, but in eac…
-
Cursor Native tool calling with Gemma4 and Ollama: (www.reddit.com)
I'm a beginner using local models, now I have a good GPU I installed ollama using docker. Pulled the Gemma4 weights and was able to add it to cursor using ngrok.