My last observation re: Anthropic's sabotage (xcancel.com via hn)
My last observation re: Anthropic’s secret sabotage safety policy, is that it undermines actually good safety policy. How?
New OpenAI Academy courses for the next era of work (openai.com)
AI is giving organizations a new capacity to act. Work that once waited for scarce time or expertise can increasingly move forward with AI.
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning (arxiv.org) discussed ↗
When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's…
/architect: Cut Fable token cost. Fable is orchestrator/reviewer, Codex is builder (www.reddit.com via reddit)
https://github.com/DanMcInerney/architect-loop Fable absolutely rules, but the load-bearing work of coding agents is in the design and the review, not the actual coding. So this is two skills: /architect uses Fable as the orchestrator and…
Claude Fable is relentlessly proactive (simonwillison.net)
Claude Fable is relentlessly proactive 11th June 2026 After two days of experience with Claude Fable 5 I think the best way to describe it is relentlessly proactive. It knows a whole lot of tricks and it will deploy pretty much any of them…
-
197 items
model roundup
GPT 5.5
On [Date], a significant leak of the OpenAI Codex model, referred to as GPT-5.5, was captured on video before it was patched. The incident involved models named Arcanine and Glacier-alpha.
- 28m /architect: Reduce Fable tokens by 80%, Fable orchestrates/reviews, Codex builds
- 5h Ask HN: Favorite prompts for improving LLM output?
- 9h What one person can ship in 4 days with two frontier models: a ranking engine, an in-game economy, an AI talk show, and a missions system — for a game that "died" years ago.
- 12h OMG Fable one-shotting everything
- 16h Fable 5 added to the Artificial Analysis Coding Agent Index... barely 1 point ahead of GPT-5.5 ???
132 items
model roundup
Opus 4.8
Claude AI has released Opus 4.8, an upgrade to their Opus class of models available in version 2.1.154 of their software on March 16, 2023, which includes enhanced coding and professional task capabilities along with improved judgment and honesty. Users are reporting usage resets following the update.
- 1h 35 days of claude code, usage ~$50,000 tokens. Total price: ~$200 dollars
- 7h tested Claude Fable 5 and Opus 4.8 across 917 coding-agent scenarios. Fable won by 0.9 points.
- 7h Introducing: DNR-Bench: Do-not-respond Benchmark
- 8h Usage insights
- 9h Ask HN: Is Claude Fable 5 built from scratch or just better data?
Lawsuit: ChatGPT validated suicidal woman's distrust of crisis lines (arstechnica.com)
Last year, a 24-year-old Canadian woman was in a mental health crisis and turned to ChatGPT for help. Hours later, that woman, Alice Carrier, took her own life.
Show HN: Rubric – test what your LLM agent did, not just what it said (github.com via hn)
Rubric Agent behavior testing for LLM apps. Test what your agent did — tools called, arguments, trace, latency — not just what it said.
datasette-agent 0.2a0 (simonwillison.net)
10th June 2026 Highlights from the release notes: - Tools can now ask the user questions mid-execution. Tools that declare a context parameter receive a ToolContext object, and await context.ask_user(...) can ask a yes/no, multiple-choice (o…
- datasette-agent 0.1a4 (simonwillison.net)
- Show HN: Datasette Agent (simonwillison.net via hn)
- datasette-agent 0.1a3 (simonwillison.net)
- datasette-agent 0.1a2 (simonwillison.net)
- datasette-agent 0.1a1 (simonwillison.net)
The Role of Feedback Alignment in Self-Distillation (arxiv.org) discussed ↗
Superficial Beliefs in LLM Decision-Making (arxiv.org) discussed ↗
-
360 items
event
Anthropic Mythos
Anthropic's new update, Claude Mythos, has garnered attention from top AI security researchers like Carlini, who found numerous bugs. The update is noted for its speed and effectiveness, with Anthropic identifying a significant security flaw in FFmpeg and quickly submitting patches.
77 items
model roundup
Gemma 4
Gemma 4 is a family of open-source multimodal models from Google DeepMind, including sizes up to 31B parameters and featuring Dense and Mixture-of-Experts architectures. Notable community highlights include the release of Gemma 4 12B as an encoder-free unified model for laptops, its availability via llama-server on an RTX 5070 Ti GPU, and detailed visual guides showcasing its capabilities.
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity (arxiv.org) discussed ↗
partnerships lead at a series A SaaS. ~1 cold partner outreach per week, ~3 follow-ups per week.
llm 0.32a3 (simonwillison.net)
9th June 2026 Almost entirely written by the new Claude Fable 5, see my write-up for more details. Recent articles - Initial impressions of Claude Fable 5 - 9th June 2026 - Running Python code in a sandbox with MicroPython and WASM - 6th J…
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis (arxiv.org) discussed ↗
part-time B2B consultant, 8 active clients. context is my entire job.
Investing in multi-agent AI safety research (deepmind.google)
-
51 items
event
Deepmind
Google DeepMind has released "Deep Research Max," advancing autonomous research agents, while also facing challenges and competition from other AI companies like Anthropic and Ineffable Intelligence. Meanwhile, DeepMind workers in the UK have voted to unionize, and former DeepMind architect Demis Hassabis is at the center of legal drama involving Elon Musk.
- 1d Google DeepMind is worried about what happens when millions of agents start to interact
- 1d Show HN: Magenta Real-Time Music Generation on iPhone, Without the GPU
- 2d The Great Reframing...
- 3d Show HN: VQAScore – open eval metric/reward model, now for text-to-video
- 7d Inside Google DeepMind: Reasoning, Omni, and Shipping Frontier AI
PULSE8.ai Cortex Agent-native knowledge OS built on Markdown PULSE8.ai Cortex is an agent-native knowledge OS built on Markdown. It gives AI agents and humans a shared vault backed by a typed knowledge graph, full-text search, and a MarkIt…
Steganography Without Modification: Hidden Communication via LLM Seeds (arxiv.org) discussed ↗
Breaking the Ice: Analyzing Cold Start Latency in vLLM (arxiv.org) discussed ↗
The 98% Problem: A Survey of Harness Engineering for AI Agents (labs.beconfident.app via hn)
Agent quality lives below the model: in the loop, the context engine, the tool surface, the safety stack, and the evaluator. A survey of harness engineering as of mid-2026, introducing GROOM, an open-source self-maintaining knowledge harne…
- Agent Harness Engineering: A Survey (picrew.github.io via hn)
TripoSplat Generate 3D models from a single image I asked a coding agent to build a beautiful website showcasing the monuments of Paris as 3D Gaussian splats. I never opened an image generator.
Fable > Opus sprint strategy (www.reddit.com via reddit)
With the release of Fable and the rug pull coming on the 22nd, I got to thinking: how can I maximise its usage?