#arc-agi
20 items
The Human Baseline for ARC-AGI-3 has been updated (www.reddit.com)
Common GPT 5.5 pricing misconception. (www.reddit.com) Many people have pointed out that ChatGPT 5.5 appears to be twice as expensive as 5.4 based on API pricing, which makes it look pricier than Opus 4.7. But the comparison is not that simple.
LLMs do fine on ARC-AGI-3 if they are allowed to search over game logs (www.reddit.com) I was reading the comments to this post and the overall opinion seemed to be that harness makes little/no difference for ARC-AGI-3. Turns out, it makes a huge difference: Hill-climbing ARC-AGI-3 TLDR: if you save game logs - taken actions,…
ARC-AGI-3 Update (GPT-5.5 High and Opus4.7) (www.reddit.com) - GPT-5.5: 0.43% - Opus 4.7: 0.18% ARC-AGI-3 is no joke. I can’t wait to see which models finally crack.
GPT vs Claude in a bomberman-style 1v1 game (www.reddit.com) A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments.
OpenClaw leads official ARC-AGI-3 community leaderboard (arcprize.org via hn) ARC-AGI Community Leaderboard ARC-AGI has gained significant popularity over the past two years, and we've been overwhelmed by the number of researchers and builders who want to showcase their work to the community. The ARC-AGI Community L…
Playing Around with the ARC-AGI-3 Benchmark (bengoertzel.substack.com via reddit) AGI, frontier science, maniacal metaphysics, decentralizationist politics, life and consciousness extension and expansion, psi and psychedelics and etc. etc.
A one-parameter model that gets 100% on ARC-AGI-2 (eitanturok.github.io via hn) TLDR: I built a model that has only one parameter and gets 100% on ARC-AGI-2, the million-dollar benchmark that pushes reasoning models to their limits. Using chaos theory and some deliberate cheating, I crammed every answer into a single…
Anthropic Opus 4.8 is new SOTA on ARC-AGI-3, Score: 1.5%, ~$10K (xcancel.com via hn) ARC-AGI-3 analysis notes: * Opus 4.8 read the environment an abstraction *above* Opus 4.7, as objects & systems, not pictures * Opus 4.8 succeeded on early levels, but still co…
Does Ring look like a default agent model to you, or a model you route only to harder steps? (www.reddit.com) Ring-2.6-1T made me think less about “is this good?” and more about routing. The public profile looks like something I'd at least test for harder agent steps: PinchBench 87.60, AIME 26 95.83, GPQA Diamond 88.27, Tau2-Bench Telecom 95.32, b…
Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3 (arcprize.org via hn) AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome.
Measuring Human Performance on ARC-AGI-3 (arcprize.org via hn) AGI is here when a system can learn like a human. However there is still a gap between what humans can learn and what AI can learn.
Show HN: A Bomberman-style 1v1 game where LLMs compete in real time (github.com via hn) A few weeks ago, ARC-AGI 3 was released. For those unfamiliar, it’s a benchmark designed to study agentic intelligence through interactive environments.
If Ring only handled one failure mode in your agent stack, which one gets it first: tool-choice ambiguity, retry loops, or final-answer checks? (www.reddit.com) Ring-2.6-1T made me think about failure placement more than headline strength. It’s a trillion-parameter reasoning model for agent workflows with high and xhigh reasoning-effort modes.
TranscendPlexity: 540/540 ARC-AGI-1/2/3, 13 tasks with 0% AI solve rate, solved (github.com via hn) 🔓 13 "Impossible" ARC-AGI-2 Tasks — All Solved These 13 ARC-AGI-2 evaluation tasks have never been solved by any AI system — not GPT-4, not Claude, not Gemini, not NVARC, not MindsAI, not any Kaggle submission. They have a 0% AI solve rate…
Grok vs. ChatGPT vs. Gemini Comparison 2026: Complete Guide (Tested) (aithinkerlab.com via hn) The 30-Second Verdict Best for science & reasoning: Gemini 3.1 Pro — leads GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). Best for coding: ChatGPT (GPT-5.5) — 88.7% on SWE-Bench Verified.
11.67% ARC-AGI-2 Local Eval on a Single 4090: The TOPAS Recursive Architecture (www.reddit.com) I'm not sure too many people care about the ARC-AGI-2 competition anymore, but still...I thought some might find this interesting. They're running it one last time this year.
Replayable traces of Claude Code runs on ARC-AGI-3 public demo games (arc-agi-runs.web.app via hn) GAME ar2511 bp3513 cd8211 dc2210 g50t10 ka5911 lf5210 ls201 m0r029 r11l1 re863 su1510 vc3311 wa3010 VARIANT A0-replay-m0r01 A1-fresh-m0r03 A10-unseen-game3 A11-smaller-model3 A2-no-theory3 A3-no-journal3 A4-no-scratchpads3 A5-no-code-writi…
Executable World Models for ARC-AGI-3 in the Era of Coding Agents (arxiv.org)
Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI (arxiv.org)