event

SWE-bench

43 items · started 2024-08-13 · ongoing (last activity 2026-06-09)

  1. I can’t wait for DeepSWE to include Fable 5 in the benchmark so people can understand that Mythos is mostly hype. In the official benchmark, Opus 4.8 was supposed to be better at programming than 5.5 (SWE-bench Pro), but in one real benchm…

  2. Microsoft just released MAI-Code-1-Flash — a 5B parameter coding model built for fast, efficient developer assistance. Numbers that caught my eye: - 51.2% on SWE-Bench Pro (Claude Haiku 4.5 scores 35.2%) - 71.6% on SWE-Bench Verified (Haik…

  3. Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax…

  4. MiniMax (official) @MiniMax_AI Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench H…

  5. Claude Code degraded for the week before Opus 4.8's release. Our SWE-Bench-Pro tracker caught a statistically significant, weeklong drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed. We run Claude…

  6. intent-bench: An open-source benchmark measuring whether providing structured intent to coding agents improves implementation effectiveness. What This Measures: Existing agent benchmarks (SWE-bench, HumanEval, Aider Polyglot) test single-req…

  7. This is mini-swe-agent v2 Read the migration guide. For the previous version, check out the v1 documentation or the v1 branch.

  8. From the website, it touts: Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining. High diversity: Tasks span a broad pool of 91 repositories acro…

  9. Hi all, Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months.

  10. swebench-verified A three-stage agent pipeline for SWE-bench Verified, built to be re-run and inspected by skeptics. The point of this repo is not the score.

  11. Bito's AI Architect cuts Claude Code's token cost by 47% on SWE-Bench Pro. It gives the coding agent codebase context, a continuously updated, structured map of every repository served over MCP.

  12. Link to blog: https://qwen.ai/blog?id=qwen3.7

  13. AI Architect tops SWE-Bench Pro. [Chart: Claude Opus 4.6, without context vs. with system context] Even advanced coding agents resolve fewer than 52% of tasks when changes span large codebases and require coordinated, multi-file updates. These long-horiz…

  14. Saw the tech-insider breakdown comparing Claude Code and Cursor head-to-head this week. Numbers are kind of hard to ignore: 80.8% SWE-bench for Claude Code, 74% for Cursor, and a 67% blind-quality win rate for Claude Code on real tasks.

  15. The 30-Second Verdict Best for science & reasoning: Gemini 3.1 Pro — leads GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). Best for coding: ChatGPT (GPT-5.5) — 88.7% on SWE-Bench Verified.

  16. DeepSeek V4: The Open-Source Model Frontier Labs Feared DeepSeek V4 ships under MIT with $0.30/M output tokens — 83x cheaper than Claude Opus 4.7 — while scoring 80.6% on SWE-bench Verified. The agentic-coding price floor just moved an ord…

  17. Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.

  18. The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use: ➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s S…

  19. ./ProgramBench Can language models rebuild programs from scratch? Given only a compiled binary and its documentation, agents must architect and implement a complete codebase that reproduces the original program's behavior.

  20. I’m working on an assessment where I need to create a coding task (basically SWE-bench style). The idea is: take an existing repo (I’m using pydantic), write tests that fail on the current code, provide a patch that fixes it, and the task sho…

  21. From 1930 to SWE-bench Models and training data We fine-tune Alec Radford's 1930 vintage LLM — pre-trained only on pre-1931 data — to solve SWE-bench issues. After just 250 training examples the model lands its first fix (a small patch to…

  22. Trying some of Bartowski's Q4 quants. Using Vulkan with the latest main branch as of a few hours ago.

  23. Mythos’ system card contains the following graph to support its argument that Mythos performs better on SWE-bench: Anthropic and others are worried LLMs are memorizing SWE-bench, so they asked an LLM to estimate the probability that a solu…

  24. Here's what I did: Built a proxy that intercepts Codex's calls to OpenAI and rewrites them on the fly. Replayed 3,807 rounds of SWE-bench Verified traces through it: avg prompt 44k → 6k tokens (-87%).

  25. If you had to build a context window manager in 24h, would you stick to the existing model or come up with something better? Here's what I did: 1.

  26. "SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals." https://www.anthropic.com/news/claude-opus-4-7

  27. 27B Dense vs. 35B-A3B MoE): - Dense still holds the crown: It still wins out on most tasks overall.

  28. Opus 4.7 landed on Cursor yesterday. The model is better — SWE-bench jumped from 80.8% to 87.6%.

  29. Anthropic shipped Opus 4.7 yesterday. The headline numbers are real: 64.3% on SWE-bench Pro (up from 53.4%), best-in-class on MCP-Atlas at 77.3% for multi-tool orchestration, 14% improvement on multi-step agentic reasoning, and one-third f…

  30. Shipped today. The benchmarks are real: 87.6% SWE-bench (from 80.8%), +13% on coding tasks, 3x more resolved production tasks on Rakuten-SWE-Bench.

  31. Anthropic published the Advisor Strategy this week. The idea: a cheaper model does the actual work, a stronger model only gets consulted on hard decisions.

  32. Leaks point to late April release. Key specs: 1M token context window; native multimodal (image/video input); projected ~85% SWE-Bench Verified (ties or beats Claude Opus 4.6); base model remains free.

  33. So I have been running gpt and glm-5.1 side by side lately and tbh the gap is way smaller than what I'm paying for. On SWE-Bench Pro glm-5.1 actually took the top spot globally, beat gpt-5.4 and opus 4.6. Overall coding score is like 55 vs g…

  34. An independent audit of agentic scaffolding and harnesses. We analyze how agent workflows, codebase documentation, and test verification impact performance compared to raw base models like GPT-5.4, Gemini 3.1 Pro, and Claude Code.

  35. TokenMonopoly Live leaderboard of AI API deals — pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.

  36. I thought GPT models felt slow and token hungry, and Claude models were faster. I was wrong.

  37. I started out running agent evaluations locally because most ai agent benchmarks and examples assume that setup. And to be fair local runs do work for debugging and small experiments.

  38. We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and your agents can consult Opus mid-task when they hit a hard decision.
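
The advisor pattern described in items 31 and 38 (a cheaper executor does the work and a stronger model is consulted only on hard decisions) can be sketched roughly as below. This is only an illustration, not Anthropic's API: `run_with_advisor`, `cheap_executor`, `strong_advisor`, and the confidence threshold used to decide what counts as "hard" are all hypothetical names and assumptions, and the two models are replaced by stub functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    task: str
    answer: str
    consulted_advisor: bool


def run_with_advisor(
    tasks: List[str],
    executor: Callable[[str], Tuple[str, float]],
    advisor: Callable[[str], str],
    threshold: float = 0.5,
) -> List[Step]:
    """Route each task to the executor; escalate low-confidence ones.

    Assumption: the executor reports a confidence in [0, 1], and a task
    is treated as a "hard decision" when that confidence is below
    `threshold`. The source does not say how hardness is detected.
    """
    steps: List[Step] = []
    for task in tasks:
        answer, confidence = executor(task)
        hard = confidence < threshold
        if hard:
            # Only hard decisions pay for the stronger model.
            answer = advisor(task)
        steps.append(Step(task, answer, hard))
    return steps


# Hypothetical stand-ins for the cheap and strong models.
def cheap_executor(task: str) -> Tuple[str, float]:
    confidence = 0.2 if "hard" in task else 0.9
    return f"executor:{task}", confidence


def strong_advisor(task: str) -> str:
    return f"advisor:{task}"


steps = run_with_advisor(
    ["rename variable", "hard: redesign schema"],
    cheap_executor,
    strong_advisor,
)
for step in steps:
    print(step.consulted_advisor, step.answer)
```

In this sketch the cost saving comes from the advisor being called once for the two tasks; how a real system decides when to escalate is left open by the posts above.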
