Hey r/LocalLLaMA, we’ve released our ByteShape Qwen 3.6 35B GGUF quantizations in two families: standard NTP (next-token prediction, i.e. non-MTP) and MTP. Blog / Download NTP Models / Download MTP Models. TL;DR: For NTP, “pick the largest quan…
#mmlu
11 items
Qwen 3.6 35B GGUF: NTP vs MTP quantization results across GPUs and CPUs (www.reddit.com)
MiniMax m2.7 under 64gb for Macs - 91% MMLU (www.reddit.com) https://huggingface.co/JANGQ-AI/MiniMax-M2.7-JANGTQ Used TQ as the quantization method where it matters. Finally, Mac users under 64 GB, especially base M5 users, can get a real cloud-SOTA-level LLM running from home.
nvidia/Gemma-4-26B-A4B-NVFP4 (huggingface.co via reddit) Can confirm it works on a 5090: with 80% allocation (of 32 GB) I got around 50k context. It's 18.8 GB. Benchmark (Baseline, full precision / NVFP4): GPQA Diamond 80.30% / 79.90%; AIME 2025 88.95% / 90.00%; MMLU Pro 85.00% / 84.80%; LiveCodeBench (pass@1)…
First DeepSeek V4 Flash-Base-Int4 Quant (huggingface.co via hn) DeepSeek-V4-Flash-Base INT4 A real INT4 packed-storage quantization of deepseek-ai/DeepSeek-V4-Flash-Base — a 284 B-parameter Mixture-of-Experts model. Hero numbers | Metric | This release | Community Q4KM norm | |---|---|---| | MMLU (5 su…
Show HN: Flint – A 30B model fine-tuned for less repetition (springboards.ai via hn) Frontier LLMs have very little output diversity, even for open-ended queries, so we built Flint to see if we could reverse this.
GGUF Quants Arena for MMLU (24GB VRAM + 128GB RAM) (www.reddit.com) Dataset: MMLU subset (DEV+TEST). llama.cpp settings: 3 params only (ctx 8192, seed 42, fa on). Let me know what else you want to see. Thanks.
You don't need all the LLM benchmarks (alex.smola.org via hn) Every time a new model comes out, somebody runs it on MMLU (57 subjects), MTEB (56 tasks), HELM, the Open LLM Leaderboard, AlpacaEval, LiveBench, BigCodeBench, WildBench, Arena-Hard, MT-Bench, and a dozen others. That’s days of GPU time an…
Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas (arxiv.org via hn) Aggregate metacognitive quality scores mask within-model variation across MMLU benchmark domains. We administered 1,500 MMLU items (250 per domain, under an a priori six-domain grouping) to 33 frontier LLMs from eight model families and co…
I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just QCMs) — Age of LLM (www.reddit.com via reddit) In 2017, I watched OpenAI Five destroy pro players at Dota 2. That moment taught me something: games are the ultimate test of emergent intelligence.
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding (arxiv.org) Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources.
We open-sourced Chaperone-Thinking-LQ-1.0 — a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B that hits 84% on MedQA in ~20GB (www.reddit.com) Hey everyone, We just open-sourced our reasoning model, Chaperone-Thinking-LQ-1.0, on Hugging Face. It's built on DeepSeek-R1-Distill-Qwen-32B but goes well beyond a simple quantization — here's what we actually did: The pipeline: 4-bit GP…