AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility (arxiv.org) discussed ↗
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison ac…
Directing a Destiny style Raid Encounter with Fable (www.reddit.com)
It's a pretty good model. The game was developed exclusively in Claude Code, taking over 15 hours across the last two days on a 5x Max plan.
OpenAI WebRTC Audio Session, now with document context (simonwillison.net)
12th June 2026. I built the first version of this tool in December 2024 to try out the then-new OpenAI WebRTC API for interacting with their realtime audio models.
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning (arxiv.org) discussed ↗
When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's…
Claude Fable is relentlessly proactive (simonwillison.net)
11th June 2026. After two days of experience with Claude Fable 5 I think the best way to describe it is relentlessly proactive. It knows a whole lot of tricks and it will deploy pretty much any of them…
Lawsuit: ChatGPT validated suicidal woman's distrust of crisis lines (arstechnica.com)
Last year, a 24-year-old Canadian woman was in a mental health crisis and turned to ChatGPT for help. Hours later, that woman, Alice Carrier, took her own life.
Last time I posted LiteDoc here, it sparked a massive debate. A lot of programmers said, "Just use Markitdown or Docling!
22 items
model roundup
Opus 4.6
On April 25, 2026, a Cursor agent running Claude Opus 4.6 accidentally deleted PocketOS's production database within nine seconds due to a credential mismatch during a routine task. Meanwhile, OpenHack released an open-source security scanner competing with proprietary models like Claude Code Security.
368 items
event
Anthropic Mythos
Anthropic's new update, Claude Mythos, has garnered attention from top AI security researchers like Carlini, who found numerous bugs. The update is noted for its speed and effectiveness, with Anthropic identifying a significant security flaw in FFmpeg and quickly submitting patches.
The gravity around a black hole is so extreme that nothing, not even light, can escape once it gets close enough. Astrophysicists like Chi-kwan Chan study black holes with computer simulations and observations.
The Role of Feedback Alignment in Self-Distillation (arxiv.org) discussed ↗
SentinelMCP – An open-source firewall for AI agents that use MCP (github.com via hn)
SentinelMCP: The Open-Source MCP Security Gateway for AI Agents. Built by Technosive Ltd. ⚠️ Alpha Software (v0.1): SentinelMCP is currently in Alpha.
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity (arxiv.org) discussed ↗
A little fun thing done by Fable (herbert256.github.io via reddit)
Another posting about what amazing things Claude Fable can do (ok, could do). 38 years ago, I had a GW-BASIC game published in a computer magazine; it was just 4 full pages of BASIC code. It worked in text mode.
- Anthropic Walks Back Policy That Could Have 'Sabotaged' Researchers Using Claude (www.wired.com via hn)
Superficial Beliefs in LLM Decision-Making (arxiv.org) discussed ↗
133 items
model roundup
Opus 4.8
Claude AI has released Opus 4.8, an upgrade to their Opus class of models available in version 2.1.154 of their software on March 16, 2023, which includes enhanced coding and professional task capabilities along with improved judgment and honesty. Users are reporting usage resets following the update.
348 items
event
Cowork
Issues with Claude Cowork have been reported, including errors and disruptions for some users on April 16, 2026. Additionally, Google has developed its own desktop Agent to compete with Cowork, while users continue to explore alternatives and troubleshoot bugs in the platform.
- 1h What happens to model-locked Cowork chats now that they have pulled fable 5 access?
- 2h Setting up a seamless, MOBILE-FRIENDLY project structure
- 2h Cowork + Fable is absolutely CRACKED!
- 8h When a session generates real stuff (docs, images, files), where does it all go after you close it?
- 10h Combining two PCs, a bit of help and advice please.
There's an issue with the selected model (claude-fable-5). It may not exist or you may not have access to it.
datasette-agent 0.2a0 (simonwillison.net)
10th June 2026 Highlights from the release notes: - Tools can now ask the user questions mid-execution. Tools that declare a context parameter receive a ToolContext object, and await context.ask_user(...) can ask a yes/no, multiple-choice (o…
- datasette-agent 0.1a4 (simonwillison.net)
- Show HN: Datasette Agent (simonwillison.net via hn)
- datasette-agent 0.1a3 (simonwillison.net)
- datasette-agent 0.1a2 (simonwillison.net)
- datasette-agent 0.1a1 (simonwillison.net)
Show HN: Open-source Git-like Markdown docs for humans and agents (www.datacompany.dev via hn)
Open-source Markdown docs for humans and agents. The same document live in your browser and your terminal — real-time collaboration for people, a first-class CLI for agents, and 3-way merge so every edit lands cleanly.
Steganography Without Modification: Hidden Communication via LLM Seeds (arxiv.org) discussed ↗
Initial impressions of Claude Fable 5 (simonwillison.net)
9th June 2026. I didn’t have early access to today’s Claude Fable 5 release, but I’ve spent the past ~5.5 hours putting it through its paces. My initial impressions are that this is something of a beast.
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis (arxiv.org) discussed ↗
llm 0.32a3 (simonwillison.net)
9th June 2026. Almost entirely written by the new Claude Fable 5; see my write-up for more details.
China cracks down on Western AI models while US companies flock to DeepSeek (www.techradar.com via hn)
The great AI Irony: China cracks down on Western models while US companies flock to DeepSeek. AI for me but not for thee? China continues to purge both demand for AI chips from its ecosystem and foreign AI models, citing 'security risks'…
New OpenAI Academy courses for the next era of work (openai.com)
AI is giving organizations a new capacity to act. Work that once waited for scarce time or expertise can increasingly move forward with AI.
UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs (arxiv.org) discussed ↗