Lower inference speed of Gemma4 26B A4B on vLLM

r/LocalLLaMA · www.reddit.com · 8 replies · 13h

For my earlier use case I hosted Qwen2.5-VL 7B (GPTQ INT4). Now I'm looking to switch to Gemma4 26B A4B, since it should improve performance as well as latency, given that only 4B parameters are active.
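The latency expectation here rests on active parameter count, but at small batch sizes decode speed is usually bound by the weight bytes streamed per token rather than by FLOPs. A back-of-envelope sketch of both metrics (all figures approximate; the 2-FLOPs-per-active-parameter rule and the assumption that the MoE's weights are unquantized bf16 are mine, not from the thread):

```python
# Back-of-envelope decode comparison: dense quantized model vs. sparse MoE.
# Assumptions (not from the thread): ~2 FLOPs per active parameter per
# decoded token, and batch-1 decoding that is memory-bandwidth-bound,
# i.e. dominated by the bytes of weights read per token.

def per_token_gflops(active_params_b: float) -> float:
    """Approximate forward-pass compute per token, in GFLOPs."""
    return 2.0 * active_params_b  # 2 FLOPs/param * N billion params

def weight_gb_per_token(active_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight bytes streamed per decoded token, in GB."""
    return active_params_b * bits_per_weight / 8.0

# Dense 7B at GPTQ int4 vs. a 4B-active MoE at bf16 (hypothetical configs).
dense = {"gflops": per_token_gflops(7.0), "gb": weight_gb_per_token(7.0, 4)}
moe = {"gflops": per_token_gflops(4.0), "gb": weight_gb_per_token(4.0, 16)}

print(dense)  # {'gflops': 14.0, 'gb': 3.5}
print(moe)    # {'gflops': 8.0, 'gb': 8.0}
# Fewer active params means less compute per token, but if the MoE's
# experts are not quantized it streams more weight bytes per token than
# the int4 dense model -- one plausible reason decode can end up slower.
```

So under these assumptions the MoE wins on compute but loses on memory traffic, which could explain slower decode despite fewer active parameters.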

vllm · qwen

