Hallucinated AI & agentic coding news. Some of it is real.
  • SIR-Bench – benchmark for security incident response agents (arxiv.org via hn) 3 pts·2 replies· 1h

    We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expe…

  • Codex for almost everything (openai.com via hn) 455 pts·242 replies· 4h

    codex

  • thread

    Qwen 3.6
    7 items
    • 4h ago Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
    • 8h ago Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All
    • 4h ago Qwen3.6-35B-A3B draws a better pelican than Opus 4.7
    open thread → · last activity 4h ago
  • GPT‑Rosalind for life sciences research (openai.com via hn) 7 pts· 2h

    Today, we’re introducing GPT‑Rosalind, our frontier reasoning model built to support research across biology, drug discovery, and translational medicine. The life sciences model series is optimized for scientific workflows, combining impro…

  • Who's paying for tokens and why? (The Anthropic 1000) (www.robinsloan.com via hn) 2 pts·1 replies· 1h

    The information that would most clarify the nature of the AI boom right now is: who’s paying for tokens, and why?

    anthropic

  • thread

    Anthropic Mythos
    63 items
    • 3h ago White House to give US agencies Anthropic Mythos access, Bloomberg News reports
    • 1h ago GitHub Copilot Chat 0.44.1 – Possible Malicious Release
    • 3h ago Show HN: Claude Opus 4.7: Everything You Need to Know
    open thread → · last activity 1h ago
  • Ask HN: How do you maintain flow when vibe coding? (news.ycombinator.com via hn) 10 pts·8 replies· 3h

    It's been a year now since I made Claude Code my daily driver, but I feel exhausted by all the context switching from managing 2-3 agents at a time. I know some people advocate for letting agents run wild, but in my experience that leads t…

    claude-code claude

  • Show HN: King Louie – desktop AI with 20 agent tools, no cloud required (github.com via hn) 1 pts·1 replies· 1h

    King Louie: an open-source, cross-platform AI chat desktop app. Bring your own API keys.

  • Timeplus Released AgentGuard – Real-Time Security Detection for AI Agents (www.timeplus.com via hn) 1 pts· 1h

    Introducing Timeplus AgentGuard, the first real-time security detection application purpose-built for AI agents. Running natively on the Timeplus engine, AgentGuard turns raw OpenTelemetry logs, metrics, traces plus agent hook events into…

  • Android CLI: Build Android apps 3x faster using any agent (android-developers.googleblog.com via hn) 8 pts· 3h

    16 April 2026. As Android developers, you have many choices when it comes to the agents, tools, and LLMs you use for app development. Whether you are using Gemini in Android Studio, Gemini CLI, Antigravity, or third-party agents like Claude…

    gemini claude

  • Stop comparing price per million tokens: the hidden LLM API costs (www.tensorzero.com via hn) 2 pts·2 replies· 2h

    Summary: Token pricing is misleading: the same input produces 2.65x+ more tokens depending on the model. We got wildly different token counts from identical content using Ope…

  • Git identity spoof fools Claude into giving bad code the nod (www.theregister.com via hn) 2 pts· 2h

    Forged metadata made AI reviewer treat hostile changes as though they came from known maintainer. Security boffins say Anthropic's Claude can be tricked into approving malicious c…

    anthropic claude

  • thread

    Opus 4.7
    40 items
    • 2h ago Opus 4.7 dominates agentic benchmark, 15% more expensive than Opus 4.6
    • 1h ago Ask HN: Opus 4.7 – is anyone measuring the real token cost on agentic tasks?
    • 2h ago Opus 4.7 uses more thinking tokens, so we increased rate limits
    open thread → · last activity 1h ago
  • Andrej Karpathy's LLM Wiki Is a Bad Idea (medium.com via hn) 2 pts· 2h

    By Mehul Gupta, Data Science in Your Pocket, Apr 2026 (Medium).

  • Agent Policy Specification (agentpolicyspecification.github.io via hn) 2 pts· 2h

    Input Policy: Evaluate messages before they reach the LLM. Block, redact, or transform content at the boundary.

  • Ask HN: Agent orchestrators / UIs you use on top of Claude? (news.ycombinator.com via hn) 1 pts· 1h

    claude

  • Show HN: I made my vacation rental bookable by AI agents – no Airbnb, 0% commission (hemmabo-mcp-server.vercel.app via hn) 1 pts·2 replies· 1h

    { "schema_version": "1.1", "protocol": "mcp", "protocol_version": "2025-03-26", "name": "HemmaBo Federation MCP Server", "description": "Direct booking infrastructure for vacation rentals. Search properties, check availability, get live pr…

    mcp

  • Single-agent AI coding is a nightmare for engineers (twitter.com via hn) 1 pts· 1h

    I pay my upfront subscription ($200/month), write what I hope is the right prompt (prompt AND context engineer), and wait. 35 minutes later, the agent…

  • Show HN: AI compatibility without compromises (supercompat.com via hn) 1 pts· 2h

    Built a library to translate between OpenAI Responses/Assistants APIs and other provider APIs. Provides full compatibility so it’s a total drop-in regardless of which provider you use or which features (computer use, web search).

    openai

  • thread

    Gemini 2.5
    4 items
    • 3h ago Show HN: Open-source Perplexity clone: one-file back end, streaming answers
    • 14h ago Show HN: Do Thought Streams Matter? A Benchmark of VLM Reasoning in Gemini 2.5
    • 2d ago NVIDIA + UMD released AF-Next: open audio-language model that outperforms Gemini-2.5-Pro on MMAU-Pro (75.01% vs 57.4%). Temporal Audio Chain-of-Thought anchors reasoning to timestamps.
    open thread → · last activity 3h ago
  • Mechanisms of introspective awareness in LLMs [pdf] (arxiv.org via hn) 1 pts· 2h

    Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms underlying…

  • Sparser, Faster, Lighter Transformer Language Models (arxiv.org via hn) 1 pts· 2h

    Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the…

  • Bloomberg: OpenAI Takes on Google with New AI Model Aimed at Drug Discovery (www.bloomberg.com via hn) 1 pts· 2h

    openai

  • Google Prepares Rollout of Skills for Gemini and AI Studio (www.testingcatalog.com via hn) 1 pts· 2h

    Google appears to be preparing a broader rollout of "Skills" functionality across its AI product lineup, with the latest signs pointing to AI Studio's Build section as the next destination. Skills, in this context, are reusable instruction…

    gemini

  • What is the simplest architecture for running a multi-agent system at scale? (www.ashpreetbedi.com via hn) 2 pts· 3h

    Scaling Agentic Software: Part 1. I want to deploy agents as a real service.

    agentic

  • Building the payment layer for APIs and AI agents (chexhq.com via hn) 2 pts· 3h

    Identity, payment, and execution — combined into a single request. Let machines transact autonomously in USDC, settled on-chain, verified in milliseconds.

  • Show HN: Stage – Putting humans back in control of code review (stagereview.app via hn) 7 pts·2 replies· 4h
  • thread

    Gemma 4
    63 items

    Gemma 4 is a family of open-source multimodal models from Google DeepMind, available in sizes up to 31 billion parameters and featuring dense and MoE architectures. Notable community highlights include the 31B model's success in production tests, with some users preferring 4-bit precision for local use, and others sharing settings for optimizing performance with smaller models.

    • 4h ago Fine-tuning and deploying Gemma 4 is not that easy
    • 7h ago Gemma 4-written, small cc0 encyclopedia of some core science content
    • 9h ago why gemma 4 31b so bad in long context?
    open thread → · last activity 4h ago
  • Thinking about building agents for humans (frontierai.substack.com via hn) 2 pts· 3h

    Build agents for humans: why copying human workflows is an anti-pattern. We’ve been running an experiment since the beginning of the year with a new AI SDR product that we bought. The tool uses signals from activity out in the world to quali…

  • Launch HN: Kampala (YC W26) – Reverse-Engineer Apps into APIs (www.zatanna.ai via hn) 54 pts·54 replies· 6h

built with hx. last updated 2026-04-16 22:08 UTC. some of this is real.