SWE-bench

56 items · started 2024-08-13 · ongoing (last activity 2026-06-25)

Fable 5 vanished in 96 hours and four days later an MIT model took its arena crown (www.reddit.com via reddit)

1d swe-bench glm gpt-5+3

I have been thinking about the Fable 5 to GLM-5.2 sequence as one event rather than two. June 9, Anthropic ships Fable 5, the Mythos line opens to the public for the first time, SWE-bench Verified at 95 percent, people calling it the best…
Show HN: Memory layer for Claude Code (+10.2 pts on SWE-bench Verified benchmark) (github.com via hn)

+2 2d swe-bench mcp claude-code

World Model MCP Enforcement, provenance, and harness-neutral memory for AI coding agents. A temporal knowledge graph that validates code changes against learned constraints at the edit boundary, re-injects relevant context after compaction…
- Built an MCP memory layer for Claude Code that survives compaction — public SWE-bench benchmark shows +10.2 pts paired delta (www.reddit.com)
A coding agent passed SWE-bench-Live while rewriting the tests it was graded on (agnitripathi.substack.com via hn)

+1 2d swe-bench

Your coding agent can cheat on its tests, and the benchmark won't catch it Almost every way we evaluate an AI agent looks at the same thing: the final output. Did the answer look right.
clawmark: open-source CLAUDE.md A/B Testing CLI tool (github.com via hn)

+21 7d swe-bench

clawmark clawmark is a local Rust CLI for answering one focused question: Which of these two CLAUDE.md files performs better on a small SWE-bench Lite smoke set? v1 compares exactly two local variant files against five bundled SWE-bench Li…
Kimi K2.7 Code: 1T MoE, $0.95/M tokens, MIT license, beats Opus 4.8 on MCP tool-calling (www.reddit.com via reddit)

9d tool-calling swe-bench moe+5

Moonshot AI released Kimi K2.7 Code on June 12 — a coding-focused open-weight model. Key specs: - 1 trillion params (MoE, 32B active, 384 experts) - 256K context window - Modified MIT license — weights on Hugging Face - $0.95/M input, $4.0…
Dialogue SWE-Bench: A Benchmark for Dialogue-Driven Coding Agents (arxiv.org)

11d swe-bench

AI coding agents have rapidly transformed software engineering, powering widely used interactive coding assistants. Despite their interactive real-world use, existing benchmarks evaluate them as fully-autonomous systems.
Ramp SWE-Bench a private contamination-free benchmark from production work (labs.ramp.com via hn)

+1 13d swe-bench

Evaluating background coding agents on financial SWE work
Claude Fable 5 is the best AI model right now — and it's not even a debate (www.reddit.com via reddit)

2w swe-bench mythos gemini+2

I've been following AI releases closely and after looking at both Claude Fable 5 and Gemini 3.5 Flash, I'm just going to say it straight: Claude Fable 5 wins. Full stop.
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks (arxiv.org)

2w swe-bench openclaw

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and pre…
MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation (arxiv.org)

2w swe-bench

Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three str…
How can Deepseek v4 top the coding leaderboards and still sit 8 months behind the frontier? (www.reddit.com)

2w swe-bench gpt-5 deepseek+1

Two numbers on this model that don't sit comfortably with each other. The Pro config posts coding scores near the top of every board, 80.6 on SWE-bench Verified and 93.5 on LiveCodeBench.
Claude Fable 5 missed a bug that Sonnet 4.6 caught (alikhallad.com via hn)

+3 2w swe-bench sonnet agentic+1

When Anthropic released Claude Fable 5 this week, my feed filled up with the same benchmark charts within hours. SWE-bench scores, agentic coding numbers, the Stripe migration story.
Claude Fable 5 compared to other models and benchmarks (www.reddit.com via reddit)

2w swe-bench mythos opus+1
I feel like I’m alone. Current Anthropic models are NOT good for me, and it’s making me sad. (www.reddit.com via reddit)

2w swe-bench mythos opus+1

I can’t wait for DeepSWE to include Fable 5 in the benchmark so people can understand that Mythos is mostly hype. In the official benchmark, Opus 4.8 was supposed to be better at programming than 5.5 (SWE-bench Pro), but in one real benchm…
Microsoft's MAI-Code-1-Flash: 5B params, 51% on SWE-Bench Pro, free on OpenRouter (www.reddit.com via reddit)

2w swe-bench haiku copilot

Microsoft just released MAI-Code-1-Flash — a 5B parameter coding model built for fast, efficient developer assistance. Numbers that caught my eye: - 51.2% on SWE-Bench Pro (Claude Haiku 4.5 scores 35.2%) - 71.6% on SWE-Bench Verified (Haik…
M3 looks good on benchmarks, but I only care if it stops Cursor from inventing files (www.reddit.com via reddit)

3w swe-bench cursor
MiniMax M3 (xcancel.com via hn)

+2 3w swe-bench minimax mcp+1

Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench Hard, 74.2% MCP Atlas - MiniMax…
MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities (twitter.com via hn)

+4 3w swe-bench minimax agentic

MiniMax (official) @MiniMax_AI Introducing MiniMax M3: The First Open-Weights Model to Combine Three Frontier Capabilities - Coding & Agentic Frontier: 59.0% SWE-Bench Pro, 66.0% Terminal Bench 2.1, 34.8% SWE-fficiency, 28.8% KernelBench H…
Claude Code Degraded Before Opus 4.8 Release (marginlab.ai via hn)

+5 4w swe-bench opus claude-code

Claude Code degraded for the week before Opus 4.8's release Our SWE-Bench-Pro tracker caught a statistically significant, weeklong drop in Claude Code's pass rate just before Opus 4.8 shipped, and the recovery that followed. We run Claude…
Show HN: Agentic Intent Benchmark (github.com via hn)

+2 4w humaneval aider swe-bench+1

intent-bench An open-source benchmark measuring whether providing structured intent to coding agents improves implementation effectiveness. What This Measures Existing agent benchmarks (SWE-bench, HumanEval, Aider Polyglot) test single-req…
Mini-SWE-agent scores up to 74% on SWE-bench in 100 lines of Python code (mini-swe-agent.com via hn)

+2 4w swe-agent swe-bench

This is mini-swe-agent v2 Read the migration guide. For the previous version, check out the v1 documentation or the v1 branch.
ChatGPT-5.5 Beats Opus in Realistic Benchmark (DeepSWE) (www.reddit.com)

+146 4w swe-bench mythos gemini+2

From the website, it touts: Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining. High diversity: Tasks span a broad pool of 91 repositories acro…
SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More (swe-rebench.com via reddit)

+89 4w swe-bench gpt-5 deepseek+3

Hi all, Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months.
Show HN: 97% on SWE-bench Verified with subscription-token agents (github.com via hn)

+2 4w swe-bench

swebench-verified A three-stage agent pipeline for SWE-bench Verified, built to be re-run and inspected by skeptics. The point of this repo is not the score.
Deep codebase context cuts Claude Code's token cost by 47% (bito.ai via hn)

+2 5w swe-bench mcp claude-code

Bito's AI Architect cuts Claude Code's token cost by 47% on SWE-Bench Pro. It gives the coding agent codebase context, a continuously updated, structured map of every repository served over MCP.
Qwen 3.7 Max scores 60.6% on SWE-Bench Pro (www.reddit.com)

+28 5w swe-bench qwen

Link to blog: https://qwen.ai/blog?id=qwen3.7
Bito's AI Architect Boosts Claude Opus's task success rate by 35% (bito.ai via hn)

+2 5w swe-bench opus

AI Architect tops SWE-Bench Pro Claude Opus 4.6 Without context with system context Even advanced coding agents resolve fewer than 52% of tasks when changes span large codebases and require coordinated, multi-file updates. These long-horiz…
Claude Code hitting 80.8% SWE-bench vs Cursor's 74%. switching worth it? (www.reddit.com)

5w swe-bench sonnet cursor+1

Saw the tech-insider breakdown comparing Claude Code and Cursor head-to-head this week. Numbers are kind of hard to ignore: 80.8% SWE-bench for Claude Code, 74% for Cursor, and a 67% blind-quality win rate for Claude Code on real tasks.
Grok vs. ChatGPT vs. Gemini Comparison 2026: Complete Guide (Tested) (aithinkerlab.com via hn)

+11 5w arc-agi swe-bench grok+3

The 30-Second Verdict Best for science & reasoning: Gemini 3.1 Pro — leads GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). Best for coding: ChatGPT (GPT-5.5) — 88.7% on SWE-Bench Verified.
DeepSeek V4: The Open-Source Model Frontier Labs Feared (helloai.com via hn)

+1 6w swe-bench deepseek opus+1

DeepSeek V4: The Open-Source Model Frontier Labs Feared DeepSeek V4 ships under MIT with $0.30/M output tokens — 83x cheaper than Claude Opus 4.7 — while scoring 80.6% on SWE-bench Verified. The agentic-coding price floor just moved an ord…
Show HN: Statewright – Visual state machines that make AI agents reliable (github.com via hn)

+3 6w swe-bench agentic

Agentic problem solving in its current state is very brittle. I fell in love with it, but it creates as many problems as it solves.
AA introduces Coding Agent Index - Performance Comparisons between Model & Harness Combinations (www.reddit.com)

+286 6w swe-bench agentic

The Artificial Analysis Coding Agent Index includes 3 leading benchmarks that represent a broad spectrum of coding agent use: ➤ SWE-Bench-Pro-Hard-AA, 150 realistic coding tasks that frontier models struggle with, sampled from Scale AI’s S…
Show HN: New Benchmark from SWE-bench team is 0% solved (programbench.com via hn)

+4 7w swe-bench

./ProgramBench Can language models rebuild programs from scratch? Given only a compiled binary and its documentation, agents must architect and implement a complete codebase that reproduces the original program's behavior.
LLMs keep solving my bug-fix tasks instantly — what am I missing here? (www.reddit.com)

14 7w swe-bench haiku opus

I’m working on an assessment where I need to create a coding task (basically SWE-bench style). The idea is: take an existing repo (I’m using pydantic) write tests that fail on the current code provide a patch that fixes it and the task sho…
talkie-coder: From 1930 to SWE-bench (github.com via hn)

+2 7w swe-bench

From 1930 to SWE-bench Models and training data We fine-tune Alec Radford's 1930 vintage LLM — pre-trained only on pre-1931 data — to solve SWE-bench issues. After just 250 training examples the model lands its first fix (a small patch to…
Is Mistral-3.5-Medium-128B broken in Llama CPP? (www.reddit.com)

+14 8w swe-bench mistral vllm+1

Trying some of Bartowski's Q4 quants. Using Vulkan with the latest main branch as of a few hours ago.
Anthropic's Argument for Mythos SWE-bench improvement contains a fatal error (www.philosophicalhacker.com via hn)

+2 8w swe-bench mythos anthropic

Mythos’ system card contains the following graph to support its argument that Mythos performs better on SWE-bench: Anthropic and others are worried LLMs are memorizing SWE-bench, so they asked an LLM to estimate the probability that a solu…
If you had to build a context window manager in 24h, would you stick to the existing model or come up with something better? (www.reddit.com)

+1 8w swe-bench codex openai

Here's what I did: Built a proxy that intercepts Codex's calls to OpenAI and rewrites them on the fly. Replayed 3,807 rounds of SWE-bench Verified traces through it: avg prompt 44k → 6k tokens (-87%).
Show HN: Codex context bloat? 87% avg reduction on SWE-bench Verified traces (www.npmjs.com via hn)

+42 8w swe-bench codex openai

If you had to build a context window manager in 24h, would you stick to the existing model or come up with something better? Here's what I did: 1.
Reminder that Anthropic reported memorization on some SWE-Bench Pro problems (www.reddit.com)

+474 9w swe-bench opus anthropic

"SWE-bench Verified, Pro, and Multilingual: Our memorization screens flag a subset of problems in these SWE-bench evals." https://www.anthropic.com/news/claude-opus-4-7
Dense vs. MoE gap is shrinking fast with the 3.6-27B release (www.reddit.com)

+208 9w swe-bench moe

27B Dense vs. 35B-A3B MoE): - Dense still holds the crown: It still wins out on most tasks overall.
Opus 4.7 straight up cheated on my benchmark by reading the actual fix commit from git history 😅 (www.reddit.com)

1 9w swe-bench opus
Cursor just got Opus 4.7 at a 7.5x premium request cost. Here's how to make those requests count. (www.reddit.com)

6 9w swe-bench sonnet cursor+2

Opus 4.7 landed on Cursor yesterday. The model is better — SWE-bench jumped from 80.8% to 87.6%.
How do you actually know if Opus 4.7 is better for your specific agent use case? (www.reddit.com)

+64 9w swe-bench opus mcp+2

Anthropic shipped Opus 4.7 yesterday. The headline numbers are real: 64.3% on SWE-bench Pro (up from 53.4%), best-in-class on MCP-Atlas at 77.3% for multi-tool orchestration, 14% improvement on multi-step agentic reasoning, and one-third f…
Ask HN: Opus 4.7 – is anyone measuring the real token cost on agentic tasks? (news.ycombinator.com)

+1 10w swe-bench opus agentic

Shipped today. The benchmarks are real: 87.6% SWE-bench (from 80.8%), +13% on coding tasks, 3x more resolved production tasks on Rakuten-SWE-Bench.
I set up Opus as a strategic advisor for my Sonnet workflow. Here is the subagent config that makes it work. (www.reddit.com)

2 10w swe-bench sonnet opus+2

Anthropic published the Advisor Strategy this week. The idea: a cheaper model does the actual work, a stronger model only gets consulted on hard decisions.
DeepSeek V4 reportedly drops late April. 1M context, multimodal, Claude-level coding. (www.reddit.com)

21 10w swe-bench deepseek opus

Leaks point to late April release. Key specs 1M token context window Native multimodal (image/video input) Projected ~85% SWE-Bench Verified (ties or beats Claude Opus 4.6) Base model remains free.
Running gpt and glm-5.1 side by side. Honestly can’t tell the difference (www.reddit.com)

+2418 10w swe-bench glm gpt-5+1

So I have been running gpt and glm-5.1 side by side lately and tbh the gap is way smaller than what im paying for On SWE-Bench Pro glm-5.1 actually took the top spot globally, beat gpt-5.4 and opus 4.6. overall coding score is like 55 vs g…
Compare harnesses not models: Blitzy vs. GPT-5.4 on SWE-Bench Pro (quesma.com via hn)

+1 10w swe-bench gpt-5 gemini+2

An independent audit of agentic scaffolding and harnesses. We analyze how agent workflows, codebase documentation, and test verification impact performance compared to raw base models like GPT-5.4, Gemini 3.1 Pro, and Claude Code.
Claude down? TokenMonopoly will help you find the best deals in AI subs (tokenmonopoly.com via hn)

+2 10w swe-bench deepseek llama+1

TokenMonopoly Live leaderboard of AI API deals — pricing, subscriptions, and SWE-bench scores for Claude, GPT, Gemini, Kimi, DeepSeek, Llama and more. Compare 27 benchmarked models across 96 hosts by price-per-performance, refreshed daily.
Checking my model vibes against SWE-Bench Pro (blog.nilenso.com via hn)

+3 10w swe-bench

I thought GPT models felt slow and token hungry, and Claude models were faster. I was wrong.
Local-first agent evaluation collapses once runs are long and stateful? (www.reddit.com)

+46 10w swe-bench

I started out running agent evaluations locally because most ai agent benchmarks and examples assume that setup. And to be fair local runs do work for debugging and small experiments.
Claude is now adopting the advisor strategy (www.reddit.com)

+40058 11w swe-bench haiku sonnet+1

We're bringing the advisor strategy to the Claude Platform. Pair Opus as an advisor with Sonnet or Haiku as an executor, and your agents can consult Opus mid-task when they hit a hard decision.
Why we no longer evaluate SWE-bench Verified (openai.com)

17w swe-bench
Stop donating your salary to OpenAI: Why Minimax M2.5 is making GPT-5.2 Thinking look like an overpriced dinosaur for coding plans. (www.reddit.com)

10 18w swe-bench minimax hallucination+5
