#humaneval

12 items

Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com) +9035 8w

Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval (code generation), HellaSwag (commonsense reasoning), BFCL (function calling). Total samples: HumanE…

↯ Function Calling ↯ Qwen 3.6 humaneval function-calling qwen+1
Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090 (www.reddit.com) +5015 8w

Hey fellow Llamas, your time is precious, so I'll keep it short. We built a GGUF port of DFlash speculative decoding.

↯ Qwen 3.6 humaneval
"Second Thoughts" Been playing with adding a small transformer that reads output near the end of generation, and feeds it back near the top as a refinement loop. A quick test of 1.7B model showed drastic improvement in focused tasks (like coding) (bigattichouse.medium.com via reddit) +232 7w

A 1.7B model can actually turn out some code, so I'm running the training for a 9B model, then will re-run HumanEval (a full one this time). I've shown most of my homework in the article, but will be posting to github after I clean things…

humaneval
I Built a desktop app for generating LLM fine-tuning datasets — started it a week ago while learning FT (www.reddit.com) +31 9w

Hey, I've been building side projects with Claude Code for a few months, but I'm completely new to fine-tuning — started experimenting maybe a week ago. From day one I wanted a GUI for the dataset side of the workflow, so this desktop app…

↯ Fine Tuning ↯ Qwen 2.5 humaneval fine-tuning claude-code
Show HN: Agentic Intent Benchmark (github.com via hn) +2 4w

intent-bench: An open-source benchmark measuring whether providing structured intent to coding agents improves implementation effectiveness. What this measures: existing agent benchmarks (SWE-bench, HumanEval, Aider Polyglot) test single-req…

↯ Swe Bench humaneval aider swe-bench+1
how do you decide between q4 and q5 on a 70b when 24gb is the cap? (www.reddit.com) +23 4w

ran into the q4 vs q5 wall again this morning. 70b model.

humaneval
Hito 2B: +35 on GSM8K, 75% on ARC-Challenge, 95% on HumanEval-style (www.reddit.com) +21 9w

[Release] Hito 2B — structured reasoning via trained cognitive tags, +35 pts on GSM8K vs base Qwen3.5-2B (head-to-head) Been cooking this for ~6 months. Finally shipping.

↯ Qwen 3.5 humaneval
Luce DFlash + PFlash on 7900XTX: Qwen3.6-27B at 2.24x decode and 3.05x prefill vs llama.cpp HIP (www.reddit.com) +1 5w

Tested this a bit on my XTX and sharing the results in case they help; thanks to Lucebox! Lucebox DFlash + PFlash PR #119 Reproduction Report (RX 7900 XTX). Hardware environment: GPU: AMD Radeon RX 7900 XTX (Navi 31, gfx1100); VRAM: 24 GiB GDDR6 (~93…

↯ Qwen 3.6 humaneval llama
I built a 1v1 nuclear strategy game to benchmark LLM reasoning (instead of just QCMs) — Age of LLM (www.reddit.com via reddit) 2w

In 2017, I watched OpenAI Five destroy pro players at Dota 2. That moment taught me something: games are the ultimate test of emergent intelligence.

humaneval mmlu openai
I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math (www.reddit.com) 6w

A few months ago, I got stuck on one line in the DeepSeek-R1 paper. It said models could improve through verifiable rewards.

humaneval deepseek
Qwen3.6:27b vs qwen3-coder:30b vs deepseek-coder:33b on code gen, tool calling, and agent tasks (www.reddit.com) 6 7w

Ran a full eval against four local models last weekend and the spread between them is wider than I expected. All running through Ollama on CPU, no cloud, same prompts, same hardware.

↯ Function Calling ↯ Qwen 3.6 humaneval function-calling ollama+1
BigCodeBench: The Next Generation of HumanEval (huggingface.co) 105w

humaneval
