model roundup

Qwen 3.5

15 items · started 2026-06-04 · closed 2026-06-13

DiffusionGemma made me rethink what memory bandwidth means for local agent inference (www.reddit.com via reddit)

2w agentic

Been testing DiffusionGemma 26B A4B for the last few days and the bottleneck profile is completely different from autoregressive models. With autoregressive models you are compute-bound during prefill and memory-bandwidth-bound during deco…
How do i prevent llama.cpp from offloading on Swap? (www.reddit.com via reddit)

2w qwen llama

I have tried preventing this issue by using llama.cpp flags. However, I still have the issue: whenever I'm close to my 96GB of RAM, llama-server / llama.cpp decides to offload the KV cache onto my swap.
NVFP4 with llama.cpp - FAQs? (www.reddit.com via reddit)

2w llama

Let's clarify all things related to NVFP4 in this thread. Sharing a few questions & links here.
Ask HN: Any Local LLM can I run without GPU for Local Agentic workflow AI? (news.ycombinator.com)

+3 2w llama agentic claude-code

Claude Code-like agentic workflow AI is too costly for me. Is there any LLM I can run with VSCode on the setup below? 16GB RAM, Intel Core i7 H-series processor (13th gen), 512GB NVMe SSD. I want to run the AI as a local agentic workflow with VSCode. I want to use LLAMA agent…
Hot Take "Rigid code is better than Flexible code if you're on a budget" (www.reddit.com via reddit)

2w gemma qwen agentic

I've spent the last six months trying to build a fully local, agentic pipeline for a text_processing and extraction tool I use daily. Because I’m running everything on a single consumer GPU setup, my choices are limited to smaller, quanti…
I have 4x 128 GB VRAM now , what should i do. (www.reddit.com via reddit)

2w vllm qwen
nice_meme (www.reddit.com via reddit)

2w copilot

https://preview.redd.it/z66h627yi96h1.png?width=1080&format=png&auto=webp&s=94040bb76c0f8099b58927771c2193dd6a5019da qwen3.5 9b at 0 bit quant>>>>>>>copilot
[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B (www.reddit.com via reddit)

2w qwen llama

I took the liberty to test both models today on my favorite benchmark question, head to head. Device: Apple Mac M3 Max 64GB. Environment: llama.cpp, all defaults. Gemma4-12B's token generation speed: 47 tps with MTP and 2 predicted tokens 29…
Nex N2 has a funny "few words do trick" reasoning (www.reddit.com via reddit)

2w qwen
Preferred two LLM combo (www.reddit.com via reddit)

2w

I’m using my MacBook Pro M1 Pro with 32GB to run Qwen3.5-35B in Q4 as my coding agent. I have a gaming PC with a 5070 Ti that I’m currently not using but would like to.
Dense vs MoE quantization resilience (www.reddit.com via reddit)

2w moe qwen

Which one is more resilient to quantization? Especially at 4-bit?
Running Hermes fully local (www.reddit.com via reddit)

2w agentic

Before Hermes was announced, I was working on my own fully local, personal agentic system. Now, I'm a novice when it comes to coding.
It felt good to return my Asus Spark (www.reddit.com via reddit)

2w moe qwen

It's an incredible little package but too high a price to pay for the performance, and I simply didn't want to be part of the great "Superchip lie" - it could be super, but it's super ruined by its limited memory bandwidth even thoug…
Launch HN: General Instinct (YC P26) – Frontier models on edge devices (news.ycombinator.com)

+42 2w moe

Hey HN, Guanming and Bill here from General Instinct (https://general-instinct.com/). After years of working in robotics, we kept running into the same problem: the best models never fit the hardware we actually had available.
Show HN: Hitoku Draft – Context aware local assistant (hitoku.me via hn)

+5 3w gemma qwen

Hi guys. I have been working on Hitoku Draft, an open-source, voice-first AI assistant that runs entirely locally.