#swe-bench
10 items
Claude is now adopting the advisor strategy (www.reddit.com via reddit)
Running gpt and glm-5.1 side by side. Honestly can't tell the difference (www.reddit.com via reddit)
Local-first agent evaluation collapses once runs are long and stateful? (www.reddit.com via reddit)
Checking my model vibes against SWE-Bench Pro (blog.nilenso.com via hn)
Claude down? TokenMonopoly will help you find the best deals in AI subs (tokenmonopoly.com via hn) Live leaderboard of AI API deals: pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.
Ask HN: Opus 4.7 – is anyone measuring the real token cost on agentic tasks? (news.ycombinator.com via hn) Shipped today. The benchmarks are real: 87.6% SWE-bench (from 80.8%), +13% on coding tasks, 3x more resolved production tasks on Rakuten-SWE-Bench.
Compare harnesses not models: Blitzy vs. GPT-5.4 on SWE-Bench Pro (quesma.com via hn)
I set up Opus as a strategic advisor for my Sonnet workflow. Here is the subagent config that makes it work. (www.reddit.com via reddit) Anthropic published the Advisor Strategy this week. The idea: a cheaper model does the actual work, a stronger model only gets consulted on hard decisions.
DeepSeek V4 reportedly drops late April. 1M context, multimodal, Claude-level coding. (www.reddit.com via reddit) Leaks point to a late April release. Key specs: 1M token context window; native multimodal (image/video input); projected ~85% SWE-Bench Verified (ties or beats Claude Opus 4.6). The base model remains free.
Stop donating your salary to OpenAI: why Minimax M2.5 is making GPT-5.2 Thinking look like an overpriced dinosaur for coding plans (www.reddit.com via reddit)