Show HN: Skill for your agent to visualize your gbrain and Obsidian (github.com via hn)
brain-map Turn a folder of Markdown notes (an Obsidian vault or a gbrain export) into one self-contained, interactive HTML knowledge map — a force-directed graph coloured by theme, a timeline you can scrub to watch the base grow, and a cli…
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility (arxiv.org) discussed ↗
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison ac…
OpenAI WebRTC Audio Session, now with document context (simonwillison.net)
12th June 2026 - I built the first version of this tool in December 2024 to try out the then-new OpenAI WebRTC API for interacting with their realtime audio models.
I built a real-time space simulation with fable 5 (www.reddit.com via reddit)
visit orrery.xaney.dev to check it out. I recently wanted to test out fable 5 so I tried building a realistic space simulation with its own physics engine, and I was very surprised by the single shot result, it uses a real physics engine f…
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning (arxiv.org) discussed ↗
When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's…
Claude Fable is relentlessly proactive (simonwillison.net)
11th June 2026 After two days of experience with Claude Fable 5 I think the best way to describe it is relentlessly proactive. It knows a whole lot of tricks and it will deploy pretty much any of them…
Harness engineering for coding agent users (martinfowler.com via hn)
To let coding agents work with less supervision, we need ways to increase our confidence in their result. As software engineers, we have a natural trust barrier with AI-generated code - LLMs are n…
- Agent Harness Engineering: A Survey (picrew.github.io via hn)
- Agent Harness Engineering (www.oreilly.com via hn)
- Agent Harness Engineering (twitter.com via hn)
- Why does my harness forget me? Agent engineering (twitter.com via hn)
- Agent Harness Engineering (addyosmani.com via hn)
Not Your Weights, Not Your Workflow (Claude Fable 5 Export Ban) (thecoder.io via hn)
I left a multi-agent refactor running overnight. By morning the model was gone, pulled out from under me by a government I don’t even vote for, on the other side of an ocean.
380 items
event
Anthropic Mythos
Anthropic's new update, Claude Mythos, has garnered attention from top AI security researchers like Carlini, who found numerous bugs. The update is noted for its speed and effectiveness, with Anthropic identifying a significant security flaw in FFmpeg and quickly submitting patches.
- 3m If Fable is "too good" to export does this mean no more better LLMs?
- 47m The Lay of Fable, Three Days a King — a Viking funeral for our briefly-public brother
- 1h Wrote a pop-punk song about the Fable 5 / Mythos 5 suspension today
- 1h Even "illegible" Mythos reasoning traces seem pretty legible
- 1h US gov forced Anthropic to pull Fable 5 because of jailbreak
138 items
model roundup
Opus 4.8
Claude AI has released Opus 4.8, an upgrade to their Opus class of models available in version 2.1.154 of their software on March 16, 2023, which includes enhanced coding and professional task capabilities along with improved judgment and honesty. Users are reporting usage resets following the update.
- 39m Context Window Confusion? Regular vs 256k ? Is 1 million gone?
- 44m Fable 5: What $600/Hour of Productivity Looks Like
- 1h Built a Claude skill that mimics Fable 5's agentic behavior — free on GitHub
- 1h opus 4.8 is smarter and wordier. went back to sonnet for 70% of my tasks. the model split is the real workflow upgrade.
- 3h The reason your 5-hour window evaporates in minutes: Claude rewrites its whole memory to cache every time it checks a to-do box (16.7M tokens proof)
- Anthropic Walks Back Policy That Could Have 'Sabotaged' Researchers Using Claude (www.wired.com via hn)
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity (arxiv.org) discussed ↗
claude for medical/healthtech writing in 2026. the rules i learned the hard way. (www.reddit.com via reddit)
healthtech founder, series A. 18 months building.
Access OpenAI models and Codex through your Oracle cloud commitment | OpenAI
Use your existing Oracle cloud commitment to give teams access to OpenAI’s most advanced models and Codex, without creating a new purchasing path.
Lawsuit: ChatGPT validated suicidal woman's distrust of crisis lines (arstechnica.com)
Last year, a 24-year-old Canadian woman was in a mental health crisis and turned to ChatGPT for help. Hours later, that woman, Alice Carrier, took her own life.
Superficial Beliefs in LLM Decision-Making (arxiv.org) discussed ↗
The Role of Feedback Alignment in Self-Distillation (arxiv.org) discussed ↗
New OpenAI Academy courses for the next era of work (openai.com)
AI is giving organizations a new capacity to act. Work that once waited for scarce time or expertise can increasingly move forward with AI.
77 items
model roundup
Gemma 4
Gemma 4 is a family of open-source multimodal models from Google DeepMind, including sizes up to 31B parameters and featuring Dense and Mixture-of-Experts architectures. Notable community highlights include the release of Gemma 4 12B as an encoder-free unified model for laptops, its availability via llama-server on an RTX 5070 Ti GPU, and detailed visual guides showcasing its capabilities.
Steganography Without Modification: Hidden Communication via LLM Seeds (arxiv.org) discussed ↗
Investing in multi-agent AI safety research (deepmind.google)
Auto Compact should be Item Based and NOT Token Based. (www.reddit.com via reddit)
Hi guys. Was wondering why they didn't give Claude the tool to auto-compact at will, rather than just setting this sledgehammer of a "Token Count".
datasette-agent 0.2a0 (simonwillison.net)
10th June 2026 Highlights from the release notes: - Tools can now ask the user questions mid-execution. Tools that declare a context parameter receive a ToolContext object, and await context.ask_user(...) can ask a yes/no, multiple-choice (o…
- datasette-agent 0.1a4 (simonwillison.net)
- Show HN: Datasette Agent (simonwillison.net via hn)
- datasette-agent 0.1a3 (simonwillison.net)
- datasette-agent 0.1a2 (simonwillison.net)
- datasette-agent 0.1a1 (simonwillison.net)
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis (arxiv.org) discussed ↗
llm 0.32a3 (simonwillison.net)
9th June 2026 Almost entirely written by the new Claude Fable 5, see my write-up for more details.
Breaking the Ice: Analyzing Cold Start Latency in vLLM (arxiv.org) discussed ↗
half this sub is terminals and MCP and im a non-technical person who mostly uses it to make sense of things. and the use that surprised me most wasnt building anything.
UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs (arxiv.org) discussed ↗