#arc-agi

20 items

The Human Baseline for ARC-AGI-3 has been updated (www.reddit.com) +26366 10w

arc-agi
Common GPT 5.5 pricing misconception. (www.reddit.com) +12130 9w

Many people have pointed out that ChatGPT 5.5 appears to be twice as expensive as 5.4 based on API pricing, which makes it look pricier than Opus 4.7. But the comparison is not that simple.

↯ ChatGPT 5.5 arc-agi chatgpt opus+1
LLMs do fine on ARC-AGI-3 if they are allowed to search over game logs (www.reddit.com) +4019 7w

I was reading the comments on this post, and the overall opinion seemed to be that the harness makes little or no difference for ARC-AGI-3. It turns out it makes a huge difference: Hill-climbing ARC-AGI-3. TLDR: if you save game logs - taken actions,…

↯ Opus 4.6 arc-agi gpt-5 opus
ARC-AGI-3 Update (GPT-5.5 High and Opus 4.7) (www.reddit.com) +3027 7w

GPT-5.5: 0.43%; Opus 4.7: 0.18%. ARC-AGI-3 is no joke. I can’t wait to see which models finally crack it.

↯ Opus 4.7 arc-agi gpt-5 opus
GPT vs Claude in a bomberman-style 1v1 game (www.reddit.com) +195 10w

A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments.

arc-agi agentic
OpenClaw leads official ARC-AGI-3 community leaderboard (arcprize.org via hn) +3 5w

ARC-AGI Community Leaderboard: ARC-AGI has gained significant popularity over the past two years, and we've been overwhelmed by the number of researchers and builders who want to showcase their work to the community. The ARC-AGI Community L…

arc-agi openclaw
Playing Around with the ARC-AGI-3 Benchmark (bengoertzel.substack.com via reddit) +31 6w

AGI, frontier science, maniacal metaphysics, decentralizationist politics, life and consciousness extension and expansion, psi and psychedelics and etc. etc.

arc-agi
A one-parameter model that gets 100% on ARC-AGI-2 (eitanturok.github.io via hn) +2 3w

TLDR: I built a model that has only one parameter and gets 100% on ARC-AGI-2, the million-dollar benchmark that pushes reasoning models to their limits. Using chaos theory and some deliberate cheating, I crammed every answer into a single…

arc-agi
Anthropic Opus 4.8 is new SOTA on ARC-AGI-3, Score: 1.5%, ~$10K (xcancel.com via hn) +2 3w

Anthropic Opus 4.8 is new SOTA on ARC-AGI-3 Score: 1.5%, ~$10K ARC-AGI-3 analysis notes: * Opus 4.8 read the environment an abstraction *above* Opus 4.7, as objects & systems, not pictures * Opus 4.8 succeeded on early levels, but still co…

↯ Opus 4.8 arc-agi opus anthropic
Does Ring look like a default agent model to you, or a model you route only to harder steps? (www.reddit.com) +21 4w

Ring-2.6-1T made me think less about “is this good?” and more about routing. The public profile looks like something I'd at least test for harder agent steps: PinchBench 87.60, AIME 26 95.83, GPQA Diamond 88.27, Tau2-Bench Telecom 95.32, b…

arc-agi
Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3 (arcprize.org via hn) +21 7w

AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome.

↯ Opus 4.7 arc-agi gpt-5 opus
Measuring Human Performance on ARC-AGI-3 (arcprize.org via hn) +2 10w

AGI is here when a system can learn like a human. However, there is still a gap between what humans can learn and what AI can learn.

arc-agi
Show HN: A Bomberman-style 1v1 game where LLMs compete in real time (github.com via hn) +22 10w

A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments.

arc-agi agentic
If Ring only handled one failure mode in your agent stack, which one gets it first: tool-choice ambiguity, retry loops, or final-answer checks? (www.reddit.com) +11 4w

Ring-2.6-1T made me think about failure placement more than headline strength. It’s a trillion-parameter reasoning model for agent workflows with high and xhigh reasoning-effort modes.

arc-agi
TranscendPlexity: 540/540 ARC-AGI-1/2/3, 13 tasks with 0% AI solve rate, solved (github.com via hn) +1 4w

🔓 13 "Impossible" ARC-AGI-2 Tasks — All Solved These 13 ARC-AGI-2 evaluation tasks have never been solved by any AI system — not GPT-4, not Claude, not Gemini, not NVARC, not MindsAI, not any Kaggle submission. They have a 0% AI solve rate…

↯ GPT 4 arc-agi gpt-4 gemini
Grok vs. ChatGPT vs. Gemini Comparison 2026: Complete Guide (Tested) (aithinkerlab.com via hn) +11 5w

The 30-Second Verdict Best for science & reasoning: Gemini 3.1 Pro — leads GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). Best for coding: ChatGPT (GPT-5.5) — 88.7% on SWE-Bench Verified.

↯ Swe Bench ↯ Gemini 3.1 arc-agi swe-bench grok+3
11.67% ARC-AGI-2 Local Eval on a Single 4090: The TOPAS Recursive Architecture (www.reddit.com) +11 7w

I'm not sure too many people care about the ARC-AGI-2 competition anymore, but still...I thought some might find this interesting. They're running it one last time this year.

arc-agi
Replayable traces of Claude Code runs on ARC-AGI-3 public demo games (arc-agi-runs.web.app via hn) +1 7w

GAME ar2511 bp3513 cd8211 dc2210 g50t10 ka5911 lf5210 ls201 m0r029 r11l1 re863 su1510 vc3311 wa3010 VARIANT A0-replay-m0r01 A1-fresh-m0r03 A10-unseen-game3 A11-smaller-model3 A2-no-theory3 A3-no-journal3 A4-no-scratchpads3 A5-no-code-writi…

arc-agi claude-code
Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI (arxiv.org) 2w

arc-agi
Executable World Models for ARC-AGI-3 in the Era of Coding Agents (arxiv.org) 2w

arc-agi
