I built a self-evolving agentic loop that ran 104 iterations autonomously to find questions that break every LLM — here's the architecture
Why I built this: I wanted to find the next "strawberry problem," a simple question any kid can answer but every LLM gets wrong. Instead of testing candidate questions by hand, I built a system that searches for them autonomously.