#swe-bench
10 items
Claude is now adopting the advisor strategy (www.reddit.com via reddit)
Running gpt and glm-5.1 side by side. Honestly can't tell the difference (www.reddit.com via reddit)
Local-first agent evaluation collapses once runs are long and stateful? (www.reddit.com via reddit)
Checking my model vibes against SWE-Bench Pro (blog.nilenso.com via hn)
Claude down? TokenMonopoly will help you find the best deals in AI subs (tokenmonopoly.com via hn) Live leaderboard of AI API deals: pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.
Ask HN: Opus 4.7 – is anyone measuring the real token cost on agentic tasks? (news.ycombinator.com via hn) Shipped today. The benchmarks are real: 87.6% SWE-bench (from 80.8%), +13% on coding tasks, 3x more resolved production tasks on Rakuten-SWE-Bench.
Compare harnesses not models: Blitzy vs. GPT-5.4 on SWE-Bench Pro (quesma.com via hn)
I set up Opus as a strategic advisor for my Sonnet workflow. Here is the subagent config that makes it work. (www.reddit.com via reddit) Anthropic published the Advisor Strategy this week. The idea: a cheaper model does the actual work, a stronger model only gets consulted on hard decisions.
DeepSeek V4 reportedly drops late April. 1M context, multimodal, Claude-level coding. (www.reddit.com via reddit) Leaks point to a late April release. Key specs: 1M token context window; native multimodal (image/video input); projected ~85% SWE-Bench Verified (ties or beats Claude Opus 4.6). The base model remains free.
Stop donating your salary to OpenAI: why Minimax M2.5 is making GPT-5.2 Thinking look like an overpriced dinosaur for coding plans (www.reddit.com via reddit)