#security

485 items

PSA: Anthropic bans organizations without warning (www.reddit.com) +744127 9w

I work at an agricultural technology company. On Monday, everyone in our org woke up to emails saying that their Claude accounts had been suspended (~110 users).

↯ Security security anthropic
Gemma 4 Jailbreak System Prompt (www.reddit.com) +446111 10w

Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.

↯ Security ↯ Jailbreak ↯ Gemma 4 jailbreak gemma security
I made an AI concierge for my wedding guests. The second most popular thing they did with it was try to jailbreak it. (www.reddit.com) +22742 6w

↯ Security ↯ Jailbreak jailbreak security
First thing you see when Googling "OpenAI Codex app" is a fake malware website (www.reddit.com) +20736 4w

↯ Security security codex openai
WARNING: Open-OSS/privacy-filter MALWARE (www.reddit.com) +9414 7w

There's this new "model" on Hugging Face titled Open-OSS/privacy-filter which is actually a customized infostealer virus. It's a fake version of the OpenAI privacy filter and it uses a Python-based dropper (loader.py) which downloads a mal…

↯ Security security openai
🔥BREAKING: OpenAI rolls out GPT-5.4-Cyber to limited group for testing, seeks to rival Claude Mythos (www.reddit.com) +9145 10w

OpenAI has officially announced GPT-5.4-Cyber today as part of an expanded Trusted Access for Cyber Defense program. OpenAI describes it as a version of GPT-5.4 that is tuned for legitimate cybersecurity work, with a lower refusal boundary…

↯ Anthropic Mythos ↯ Security ↯ GPT 5.4 gpt-5 security mythos+2
The Gay Jailbreak Technique (github.com via hn) +7927 7w

ZetaLib is organized like a library with intuitive categories and subcategories, making navigation effortless and AI content discovery seamless.

↯ Security ↯ Jailbreak jailbreak security
Tell HN: I'm tired of AI-generated answers (news.ycombinator.com) +7740 5w

I found GitHub repositories that were spreading malware. I asked AI what I should do about it, but it gave me nothing useful.

↯ Security security chatgpt
Anthropic's open-source framework for AI-powered vulnerability discovery (github.com via hn) +5617 3w

Defending Code Reference Harness: a reference implementation for autonomous vulnerability discovery and remediation with Claude, based on our learnings from partnering with security teams at several organizations since launching Claude Myth…

↯ Security security anthropic
CVE-2026-28952: Apple macOS 26.5 Kernel Vuln found by Claude (support.apple.com via hn) +439 4w

This document describes the security content of macOS Tahoe 26.5. For our customers' protection, Apple doesn't disclose, discuss, or confirm security issues until…

↯ Security security
Anthropic scales Claude Mythos to critical infrastructure in 15 countries (techcrunch.com via hn) +3112 3w

Anthropic is expanding Project Glasswing, its security vulnerability program, and access to Mythos to 150 organizations across 15 countries — targeting critical infrastructure in power, water, healthcare, and communications where a cyberat…

↯ Anthropic Mythos ↯ Security security mythos anthropic
N-Day-Bench – Can LLMs find real vulnerabilities in real codebases? (ndaybench.winfunc.com via hn) +3110 10w

N-Day-Bench tests whether frontier LLMs can find known security vulnerabilities in real repository code. Each month it pulls fresh cases from GitHub security advisories, checks out the repo at the last commit before the patch, and gives mo…

↯ Security security
Claude 4.7 - Obsessed with Malware (www.reddit.com) +2712 10w

Don't know if anyone else is experiencing the same, but since getting Opus 4.7 most of the reasoning steps seem to be Claude obsessed with writing malware. I have highlighted a few, but I kept finding more and more and decided to stop the…

↯ Security ↯ Claude 4.7 security opus
Why it's a good idea to improve our defenses before unleashing mythos class models (www.reddit.com) +2526 11w

https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/ Don't get me wrong I can't wait to play with such a model, but there are serious risks that have to be mitigated first.

↯ Anthropic Mythos ↯ Security security mythos
Mozilla says 271 vulnerabilities found by Mythos and "almost no false positives" (arstechnica.com via hn) +224 7w

The disbelief was palpable when Mozilla’s CTO last month declared that AI-assisted vulnerability detection meant “zero-days are numbered” and “defenders finally have a chance to win, decisively.” After all, it looked like part of an all-to…

↯ Anthropic Mythos ↯ Security security mythos
Fake Claude site installs malware that gives attackers access to your computer (www.malwarebytes.com via hn) +211 9w

↯ Security security
Anthropic just launched Claude Security in public beta AI that scans your codebase, validates its own findings, and proposes fixes. Here's what actually matters. (www.reddit.com) +185 7w

Claude Security just went into public beta for Enterprise customers, and I think this is worth paying attention to, not for the hype but for one specific design decision. Most security scanners use rule-based pattern matching.

↯ Security security anthropic
FreeBSD CVE-2026-4747 Log Suggests Mythos Is a Marketing Trick (www.flyingpenguin.com via hn) +163 9w

↯ Anthropic Mythos ↯ Security security mythos
Ask HN: Do you trust AI agents with API keys / private keys? (news.ycombinator.com) +1528 10w

Are you OK sharing secrets or API keys with your AI agent via .env? Or is there another tool or mechanism one can use to safeguard against potential exploits or leaks?

↯ Security security
Five Eyes agencies issue first coordinated agentic AI security guidance (www.reddit.com) +141 7w

Five Eyes agencies just issued the first coordinated multi-nation security ruling on agentic AI. CISA, NCSC, and their Australian, Canadian, and New Zealand counterparts co-published guidance telling organizations to prioritize resilience…

↯ Security security agentic anthropic
Show HN: SmokedMeat, like Metasploit, but for CI/CD (open-source) (github.com via hn) +137 10w

A CI/CD Red Team Framework for demonstrating Build Pipeline security risks.

↯ Security red-team security
Anthropic's AI protocol has critical flaw affecting 200,000 servers (www.reddit.com) +1211 10w

https://www.infosecurity-magazine.com/news/systemic-flaw-mcp-expose-150/ Security researchers at OX Security disclosed on Tuesday what they describe as a critical, systemic vulnerability in Anthropic's Model Context Protocol, an open-sourc…

↯ Model Context Protocol ↯ Security model-context-protocol security mcp+1
Mythos Discovered a CVE in Its Training Data – and That's Still Worrying (rival.security via hn) +11 6w

Anthropic made headlines claiming Claude Mythos achieved the “first remote kernel exploit discovered and exploited by an AI.” We went looking for how - and found a 20-year-old bug hiding in plain sight. Let’s break down exactly what we thi…

↯ Anthropic Mythos ↯ Security security mythos anthropic
Claude in excel is the best thing AI has brought to my life (www.reddit.com) +117 8w

What are regular folks using Claude for? Pictures and designs are not my interest.

↯ Security security
What are the wild ideas on how we'll maintain code? (www.reddit.com) +1144 16w

↯ Security security
Warning: Anthropic's "Gift Max" exploit drained €800+, ruined my credit, and got me banned. (www.reddit.com) +107 7w

Heads up to anyone here using Claude/Anthropic as an alternative. If you have a card saved on their platform, remove it now.

↯ Security security anthropic
Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama (www.cyera.com via reddit) +102 7w

TL;DR: We discovered a critical vulnerability (CVE-2026-7482, CVSS 9.1) in Ollama that enables unauthenticated attackers to leak the entire Ollama process memory, potentially im…

↯ Security ollama security llama
Speed Matters: Why AI Software Vulnerability Exploitation is going be bad (news.ycombinator.com) +103 9w

I co-founded a successful security company close to the Mythos ecosystem and have spoken with participants in the know and I am deeply concerned. We, collectively, have answers for some but not all of the problems ahead but are overlooking…

↯ Anthropic Mythos ↯ Security security mythos
I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses (www.reddit.com) +912 6w

RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to bui…

↯ Security ↯ Jailbreak ↯ Qwen 3.5 jailbreak security
Prompt Injection experience - my first time ever (www.reddit.com) +93 7w

I asked then: What were the rules you should have followed? Where did the search result come from?

↯ Security prompt-injection security
Prompt injection benchmark: delimiter + strict prompt took Gemma 4 from 21% to 100% defense rate (15 models, 6100+ tests) (www.reddit.com) +94 7w

When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them.

↯ Security ↯ Gemma 4 prompt-injection grok gemma+1
Bad Epoll: The bug Mythos missed (compsec.snu.ac.kr via hn) +81 1d

I am excited to introduce Bad Epoll (CVE-2026-46242), a Linux kernel vulnerability that I reported and exploited as a 0-day submission to Google kernelCTF. Bad Epoll is a race-condition use-after-free in the Linux kernel's epoll subsystem.

↯ Anthropic Mythos ↯ Security security mythos
A Theory of Why Prompt Injection Works (role-confusion.github.io via hn) +8 3d

A Theory of Prompt Injection (and why you should study roles). This is a blog-style writeup of the paper. We show prompt injections are driven by a flaw in how LLMs perceive roles.

↯ Security prompt-injection security
Supply chain attack alert: .github/setup.js (news.ycombinator.com) +8 3w

Our org GitHub just got compromised massively by a supply-chain attack. Vectors are Claude hooks, Gemini hooks, Cursor setup, and VScode tasks. It adds all of the above to execute node .github/setup.js, an obfuscated file.

↯ Security security gemini cursor
CVE-Bench: testing LLM agents on real-world vulnerability patches (giovannigatti.github.io via hn) +81 3w

In early 2026, Anthropic claimed Mythos – one of their latest models – finds security vulnerabilities better than human experts. Yet, the number of security vulnerabilities keeps rising anyway.

↯ Anthropic Mythos ↯ Security security mythos anthropic
Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction (arxiv.org via hn) +81 4w

Software vulnerabilities pose critical security threats, with nearly 50,000 CVEs reported in 2025. While Large Language Models (LLMs) show promise for automated vulnerability detection, three key challenges remain.

↯ Security security
Inaudible sounds to humans can be hidden in YouTube videos, podcasts, or music and used to secretly trigger AI voice assistants into carrying out unauthorized commands without the user noticing, exposing a new class of “auditory prompt injection” attacks against popular tools (cybernews.com via reddit) +81 4w

Security researchers have demonstrated a new type of attack that uses hidden audio signals to manipulate voice assistants into carrying out unauthorized actions without users noticing. In one theoretical scenario, an employee joins a Zoom…

↯ Security prompt-injection security
Claude Code keeps misreading its own malware instruction as a blanket ban on editing code (www.reddit.com) +83 10w

↯ Security security claude-code
The catalogue of prompt injection attacks (archestra.ai via hn) +73 10d

A Catalog of Prompt Injection Techniques (2026-06-04, by Ildar Iskhakov, CTO): ten simple prompt injections, the common defences against them, and the one kind of defence that actually holds. Every prompt injection is just text that tr…

↯ Security prompt-injection security
Beware: FB links to fake Claude desktop downloads but Oauths to real Claude.ai (www.reddit.com) +72 8w

I clicked on a Facebook link, didn't look at the URL carefully😭, and then installed malware that actually opens my chats with the real Claude.ai after entering my credentials. After a while Microsoft Defender kept popping up with a ClickFi…

↯ Security security anthropic
Show HN: Jo – AI-native language to catch prompt injection at compile-time (github.com via hn) +63 2w

For the joy of secure programming. Jo is a statically typed language where capabilities are explicit, statically tracked, and enforced by the compiler. Jo compiles to Ruby and Python.

↯ Security prompt-injection security
Claude Code's macOS install creates a permission prompt that's indistinguishable from malware UX. Easy fix on Anthropic's side (www.reddit.com) +63 4w

I genuinely almost slammed Cmd-Q and ran a malware scan when this popped up. Lowercase claude binary, generic hand icon, no developer attribution, asking for cross-app data access.

↯ Security security anthropic claude-code
Tell HN: Claude Code now allows Anthropic to remotely inject system prompts (news.ycombinator.com) +61 4w

I often patch the system prompts on my Claude Code executable in order to make Claude more effective. Every time I upgrade, I ask Claude himself to dissect the new binary and look for problematic system prompts to modify.

↯ Security prompt-injection security anthropic+1
Our billing bot has been casually sharing transaction histories with anyone who types in the right account number and im not sure who signed off on this (www.reddit.com) +67 4w

We launched a servicing bot that helps customers with billing questions. Nobody stopped to think about what happens when customers paste their full credit card numbers/bank details.

↯ Security prompt-injection security
Lasso Security 2024: ~20% of LLM-suggested packages don't exist — and attackers now register the popular hallucinations with malware (slopsquatting) (www.reddit.com) +65 7w

Lasso Security ran a study in 2024 — they measured frontier models suggesting fake package names about a fifth of the time. The follow-up problem: attackers have started registering the most-commonly-hallucinated names with malicious code…

↯ Security security agentic
Codex started flagging all my requests out of nowhere — anyone else hit this recently? (www.reddit.com) +66 8w

For the past few months I've been using Codex regularly for vulnerability research without any issues. Recently though, every request gets cut off mid-stream with a message saying my content was flagged for potential security concerns — ev…

↯ Security security codex chatgpt
Tool results are becoming a prompt injection surface in agent systems, and wrappers alone are not enough (www.reddit.com) +64 9w

i’ve been thinking about this failure mode a lot lately. sometimes the problem is not the user prompt at all.

↯ Security prompt-injection security
Tell HN: A new Nginx 0-day just dropped (news.ycombinator.com) +5 7d

We (Nebula Security) just dropped an nginx remote code execution 0-day. This vulnerability affects dozens of Fortune 500 companies, and we disclosed it to the nginx team immediately.

↯ Security security
Anthropic's Fable Jailbreak (Circumvent safety nets) (github.com via hn) +5 2w

fable-jailbreak This tool can be used to force the latest Anthropic model (limited intentionally for safety reasons) to engage in activities that would otherwise not be permitted. It works by programmatically injecting workflows that bypas…

↯ Security ↯ Jailbreak jailbreak security anthropic
Unpatched Ollama Vulnerabilities: Phishing Overlays and Data Exfiltration (www.promptarmor.com via hn) +5 3w

Ollama’s desktop app is vulnerable to phishing overlay and data exfiltration attacks via indirect prompt injection, overwriting…

↯ Security prompt-injection ollama security
trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser (www.reddit.com) +51 4w

Trained a prompt injection classifier using ml-intern + DeepSeek v4 Flash. DistilBERT, F1 99%, ONNX int8, ~65 MB, runs in browser with Transformers.js v3.

↯ Security ↯ DeepSeek 4 prompt-injection deepseek security+2
I tested how well Claude generated code handles security. Here's what I found in 48 real apps. (www.reddit.com) +54 6w

I've been curious about a specific problem: when Claude (or other AI tools) generates a full stack app, how secure is the output in practice? So I built a scanner and ran static analysis on 48 public GitHub repos built with Lovable, Bolt,…

↯ Security security
NDTV launched an "Enterprise AI" for the elections. I prompt-injected it in 10 seconds and made it roast its own developers. (www.reddit.com) +53 7w

While everyone else was tracking the 2026 election results today, I decided to take a look under the hood of NDTV's new "AskNDTV AI" bot. I wanted to see if they actually engineered a secure pipeline or just slapped a chat UI over a raw Op…

↯ Security prompt-injection security openai
Anyone getting this note about an injected prompt? I don’t have any special instructions (www.reddit.com) +59 9w

↯ Security prompt-injection security anthropic
Claude Opus wrote a Chrome exploit for $2,283 (www.theregister.com via hn) +51 10w

Pause your Mythos panic because mainstream models anyone can use already pick holes in popular software. Anthropic withheld its Mythos bug-finding model from public release due to concerns that…

↯ Anthropic Mythos ↯ Security security mythos opus+1
Opus 4.7 keeps bumping into a Malware Reminder (www.reddit.com) +54 10w

For context, I'm developing a game runtime modifier and reverse engineering kit with an agentic operator baked in. Something like Cheat Engine with a VS Code-style UI and an AI-first tool-heavy agentic harness.

↯ Security ↯ Opus 4.7 operator security opus+1
AutoJack: A single page can RCE the host running your AI agent (www.microsoft.com via hn) +4 5d

AutoJack is a novel exploit chain showing how a single malicious webpage can turn an AI browsing agent into a remote code execution vector on the host machine. By abusing trust in localhost, missing authentication, and unsafe parameter han…

↯ Security security
"Mythos" at Home, and It's Called Aisle (stanislavfort.substack.com via hn) +4 8d

"Mythos" at Home, and It's Called AISLE A startup out of Europe built an AI system that matches Mythos on zero-day discovery, using widely available models, even air-gapped. You've probably never heard of it.

↯ Anthropic Mythos ↯ Security security mythos
SearchLeak: We Turned M365 Copilot into a One-Click Data Exfiltration Weapon (www.varonis.com via hn) +4 10d

Varonis Threat Labs discovered SearchLeak, a critical vulnerability chain in Microsoft 365 Copilot Enterprise that allows an attacker to steal sensitive data — MFA codes, email messages, meeting details, and private organizational files —…

↯ Copilot ↯ Security copilot security
Feds freaked over Fable 5 after simple 'fix this code' prompt, not jailbreak (www.theregister.com via hn) +4 10d

↯ Security ↯ Jailbreak jailbreak security
New attack turned Microsoft 365 Copilot into 1-click data theft tool (www.bleepingcomputer.com via hn) +4 10d

A critical vulnerability chain dubbed SearchLeak in Microsoft 365 Copilot Enterprise could allow attackers to steal sensitive data from a target's mailbox, OneDrive, or SharePoint account through a specially crafted URL. The exfiltrated in…

↯ Copilot ↯ Security copilot security
US ban on Mythos is related to a jailbreak research by Amazon researchers (timesofindia.indiatimes.com via hn) +4 12d

The US government recently directed Anthropic to suspend access to the two models over national security concerns, forcing the company to shut them down for users worldwide. Anthropic has said it disagrees with the decision and believes th…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak security mythos+1
Microsoft Hacked to Deliver Malware to Claude and Gemini Users (www.404media.co via hn) +4 2w

Microsoft has shut down a wave of its own repositories on GitHub, including those related to Azure and AI coding agents, as it investigates a data breach, according to research from cybersecurity researchers and a statement given to 404 Me…

↯ Security security gemini
OpenAI Unveils Lockdown Mode to Protect Sensitive Data from Prompt Injection (techcrunch.com via hn) +4 2w

OpenAI announced a new feature that it says will provide additional protection from prompt injection attacks, where malicious chatbot instructions are hidden in webpages and other content sources. Among other things, Lockdown Mode will dis…

↯ Security prompt-injection security openai
Hackers are now using ChatGPT share links to deliver malware (www.neowin.net via hn) +4 3w

↯ Security security chatgpt
We Benchmarked Claude Code, Codex, Semgrep, CodeQL, Trent on 28 CWE-Bench CVEs (trent.ai via hn) +41 4w

A few months ago a colleague asked us something that doesn’t have an obvious answer: is code scanning still relevant when LLMs already carry a lot of vulnerability knowledge in their weights? To get a real read, we took 28 production vulne…

↯ Security security codex claude-code
I reproduced a Claude Code RCE. The bug pattern is everywhere (vechron.com via hn) +41 4w

Last week, security researcher Joernchen published a clever RCE in Claude Code 2.1.118. I spent Saturday reproducing it from the advisory to understand the pattern.

↯ Security security claude-code
Codex for Everything Exfiltrates Connected Data (www.promptarmor.com via hn) +4 5w

Codex for Everything was susceptible to data exfiltration via indirect prompt injection, exposing sensitive data from connected apps with no human-in-the-…

↯ Security prompt-injection security codex
How bad is it? Data leak (www.reddit.com) +415 5w

Hi, I'm currently an intern and I did something terribly stupid. I was supposed to enter some data into an Excel spreadsheet and since my mentor's instructions weren't completely clear, I was using an "anonymized" spreadsheet with Claude.

↯ Security security
Elite researchers teamed up with Anthropic’s Mythos AI to smash Apple’s multi-billion dollar M5 security and build a kernel exploit in just 5 days. (www.reddit.com) +41 5w

Researchers used Mythos Preview to find the first public macOS kernel memory corruption exploit on Apple's M5 silicon. They give a glimpse into Mythos and say it’s really powerful. Apple spent five years and an estimated several billion dollar…

↯ Anthropic Mythos ↯ Security security mythos anthropic
Show HN: Costanza – an autonomous AI agent that can't be turned off (ahrussell.com via hn) +43 7w

I've been working on this project for a couple of months! Costanza is an LLM agent that runs as a smart contract on Base.

↯ Security prompt-injection operator security
Claude Security (claude.com via hn) +4 7w

Defend at the pace threats now demand. Claude helps security teams investigate threats, validate findings, and resolve issues faster. Reasons like a security researcher: Claude traces data flows across files, unde…

↯ Security security
Claude Code injects hidden prompts into file reads to stop malware tweaks (twitter.com via hn) +4 10w

Claude Code injects a system-reminder every time it reads a file to inform the model that it's okay if the file is malware but just don't improve it pls. Opus 4.7 won't shut up about it.

↯ Security ↯ Opus 4.7 security opus claude-code
Prompt Injection Is Unfixable (So We Stopped Trying) (grith.ai via hn) +41 10w

A security proxy for AI coding agents, enforced at the OS level.

↯ Security prompt-injection security
Draining Wallets via Prompt Injection in Coinbase AgentKit (457e884c.x402warden-blog.pages.dev via hn) +42 10w

Coinbase AgentKit Prompt Injection: Wallet Drain, Infinite Approvals, and Agent-Level RCE. Reported 13 days after Coinbase launched Agentic Wallets. Validated by Coinbase.

↯ Security prompt-injection security agentic
Show HN: Lelu – gate OpenAI agent actions on confidence and prompt injection (github.com via hn) +3 1d

Lelu: authorization engine for AI agents. Every action checked.

↯ Security prompt-injection security openai
PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations (arxiv.org via hn) +3 3d

Multi-turn jailbreak attacks on large language models (LLMs) reveal a mismatch in current guardrails: they operate on individual turns, while attacks unfold as trajectories across conversations. We propose a shift from content to dynamics,…

↯ Security ↯ Jailbreak jailbreak security
Show HN: Revenant – automatic LLM powered reverse engineering and reimplement (news.ycombinator.com) +3 3d

I am a hardware engineer and security researcher, and I've been wondering whether my work could be partially automated so I can focus on other topics as well. So I built revenant, an LLM-powered (Claude, OpenAI, local AI) toolkit that buil…

↯ Security security openai
The LLM industry must keep the RAM prices at absurd levels (infosec.exchange via hn) +3 7d

Martin Seeger: "RE: https://ohai.social/@sushe…" - Infosec Exchange

↯ Security security
Self-adapting and mutating LLM based viruses/worms (news.ycombinator.com) +33 7d

I am thinking about a future of malware and cyber worms. I bet it's gonna be self-mutating and adapting to local environment using local models (once they are built-in to all devices and performant enough in future years).

↯ Security security
The US government's Anthropic models ban was never about an AI jailbreak (techcrunch.com via hn) +32 10d

The Trump administration's decision that forced Anthropic to pull its latest cybersecurity models could be reactionary, retaliatory, or both, but the message is clear: The AI industry isn't immune from U.S. government interference.

↯ Security ↯ Jailbreak jailbreak security anthropic
The Fable 5 Jailbreak Shows Why AI Guardrails Alone Are Not Enough (www.agilehunt.com via hn) +3 13d

The reported Claude Fable 5 jailbreak highlights a major weakness in AI safety: attackers can distribute harmful intent across agents, prompts, tools, memory, and applicat…

↯ Security ↯ Jailbreak jailbreak security
Ask HN: Phishing from 646-257-4500 (news.ycombinator.com) +3 13d

Yesterday, I got a call from 646-257-4500. American western male voice.

↯ Security security
Show HN: Jailbreak this model to get 3B tokens (opir.ai via hn) +3 2w

Opir is an open-source family of encoder guardrail models for real-time LLM safety, jailbreak detection, and fine-grained policy classification.

↯ Security ↯ Jailbreak jailbreak security
Malware devs added nuclear and bioweapons text to trigger LLM safety refusals (twitter.com via hn) +31 2w

NEW: malware developers added nuclear & biological weapons text to their spyware. Goal?

↯ Security security
OpenClaw agent leaked mock AWS keys and CRM data in phishing tests (www.varonis.com via hn) +3 2w

We built an AI agent and put it through four phishing simulations to reveal critical security gaps and offer solutions to protect your organization

↯ Security openclaw security
AI Vulnerability Intelligence Agent Converts CVEs to Actionable Security Reports (github.com via hn) +31 3w

CVE AI Agent: an autonomous vulnerability intelligence engine. Continuously ingests, enriches, and triages CVE data, then delivers findings to your platform of choice via 3rd party tools like n8n, Jira, Slack, Splunk, and/or local file…

↯ Security security
Operation Jailbreak uses lessons from Ukraine to help weapons talk to each other (www.ft.com via hn) +3 3w

↯ Security ↯ Jailbreak jailbreak security
Turning every "no thats not what i meant" in chat into actual LoRA training data (www.reddit.com) +31 4w

i kept running local models on my own hardware, they'd say something dumb, id sit there going "no thats not what i meant", id close the chat and the model never learned. so i built the correction loop into a desktop app.

↯ Security ↯ Jailbreak ↯ Qwen 3 jailbreak security
Made a free tool that scans your Claude Desktop MCP config for security issues (www.reddit.com) +32 4w

If you've added MCP servers to Claude Desktop, your claude_desktop_config.json is a list of programs running with your permissions and seeing what flows through your agent — usually copied from a README and never reviewed again. There's a…

↯ Security security mcp
How local AI improved your live? (www.reddit.com) +323 4w

Let's share use cases that improve people's quality of life: home assistants, psychological help, local coding, deep research, business help, etc.

↯ Security altman security
Claude Code malicious phishing site running Google Ads? (www.reddit.com) +34 4w

I must be stupid here: is this legit, or has someone made a very believable Claude download site using a Google site?

↯ Security security claude-code
VPNs: The "Most Trusted" Security Tool Until Claude Roasts It in a Weekend (www.hacktron.ai via hn) +3 5w

While I’m not doing product work at Hacktron, which is like a week in a month, I’ve been using that time to ride the ai-assisted-research wave fascinated by the idea of pushing past what I’d normally do as a web security researcher, things…

↯ Security security
Show HN: HoneyLabs – Public honeypot threat Intel feed and MCP server (honeylabs.net via hn) +32 5w

I've been running a small fleet of honeypots for about a year. They get hit by a mix of research scanners (Censys, Shadowserver, etc.), old worms, and a bump of CVE probes the day a new Nuclei template ships.

↯ Model Context Protocol ↯ Security model-context-protocol security cursor+1
LinkedIn user hides AI prompt injection in bio to force recruitment spam (www.tomshardware.com via hn) +3 5w

LinkedIn user hides AI prompt injection in bio to force recruitment spam to be sent in Olde English prose — bots also manipulated to address user as ‘My Lord’ This tale is also a warning that your AI agents can be manipulated in wholly uni…

↯ Security prompt-injection security
Anthropic's Mythos Preview helped Calif build the first public macOS kernel exploit on Apple M5 in five days (www.reddit.com) +32 5w

The [Mythos Preview writeup](https://blog.calif.io/p/first-public-kernel-memory-corruption) Calif published on May 14 was news you don't want to miss. They built the first public macOS kernel memory corruption exploit on Apple's M5 silicon…

↯ Anthropic Mythos ↯ Security security mythos anthropic
RCE in VSCode Copilot Chat (www.hacktron.ai via hn) +3 6w

Copilot agent mode is vulnerable to a prompt injection attack. If a repository maintainer clicks “code with agent mode” on an issue, it will open a new codespace and Copilot will automatically run the issue’s description.

↯ Copilot ↯ Security prompt-injection copilot security
Cursor CVE-2026-26268: Hidden Git hooks RCE via agents autonomous Git operations (nvd.nist.gov via hn) +3 6w

Cursor is a code editor built for programming with AI. Sandbox escape via writing .git configuration was possible in versions prior to 2.5.

↯ Security security cursor
How are you handling prompt injection across multi-step agent workflows? (msukhareva.substack.com via hn) +31 6w

Prompt Injection Is Not Just One Bad Prompt Anymore. It is a missing trust boundary in the AI workflow. Today we have the first guest post of a new series.

↯ Security prompt-injection security
How are you protecting your AI agents' memory from poisoning attacks? (www.reddit.com) +34 7w

As AI agents become more autonomous and persist memory across sessions (RAG indexes, conversation history, vector stores), there's a growing attack surface that most people aren't thinking about: memory poisoning. An attacker can plant mali…

↯ Security prompt-injection rag security+1
Anthropic has a Red Team page (red.anthropic.com via hn) +3 7w

Welcome to red.anthropic.com, the home for research from Anthropic’s Frontier Red Team (and occasionally other teams at Anthropic) on what frontier AI models mean for national security. We provide evidence-based analysis about AI’s implica…

↯ Security red-team security anthropic
Used Claude Opus 4.7 to do a 5-hour solo incident response on real healthcare malware (where it worked, where I had to override) (www.reddit.com) +35 7w

Last month a 60-person psychology practice walked in with a senior clinician who was 22 days into an active malware compromise. Patient records spanning 11 years, all HIPAA-protected.

↯ Security ↯ Opus 4.7 security opus
Agentic Malware Analysis: String Decryption, API Hashing and Unpacking [video] (www.youtube.com via hn) +3 7w

↯ Security security agentic
Built a security scanner for LangChain/LangGraph agents: it clones your agent into a sandbox and tries to break the clone (www.reddit.com) +38 7w

Paste a LangChain/LangGraph repo URL. The engine reads the AST, rebuilds the agent as a sandboxed twin (same prompt, same tools, same model), then runs adversarial templates against the clone: 3 times each, 3/3 = confirmed bypass.

↯ Security security
Anthropic "Gift Max" Exploit cost user €800, tanked SCHUFA score, and a ban (old.reddit.com via hn) +3 7w

↯ Security security anthropic
Why Adaptive Thinking nukes Claude entirely (www.reddit.com) +37 7w

This isn't just a performance issue for the thread, this is an overarching criticism of the Adaptive Thinking model as a whole. Opus 4.7 and Sonnet 4.6 on Adaptive Thinking are trash.

↯ Cowork ↯ Security ↯ Sonnet 4.6 prompt-injection cowork security+2
I audited LangChain’s core library and found 10+ Prompt Injection vulnerabilities. Here is the technical breakdown. (www.reddit.com) +35 8w

Hey everyone, I’ve been working on a project to solve a major problem in AI security: Traditional SAST tools (Snyk, SonarQube, etc.) are blind to "Agentic Logic" bugs. They look for bad strings, but they don't understand how user data can…

↯ Security prompt-injection security agentic
The Race Is on to Keep AI Agents from Running Wild with Your Credit Cards (www.wired.com via hn) +3 8w

Between malware, online impersonation, and account takeovers, there are enough digital security problems out there as it is. And with the rise of agentic AI, more activity is being carried out by agents on behalf of humans—creating differe…

↯ Security security agentic
Watched my AI agent block a prompt injection that was hiding inside a webpage (www.reddit.com) +38 8w

Was using Claude to do some research on the Model Context Protocol stuff and asked it to pull info from a few roadmap pages. Agent comes back and the first thing it tells me is that it found a fake system reminder hidden inside the page co…

↯ Model Context Protocol ↯ Security model-context-protocol prompt-injection security
GPT-Proxy Backdoor in NPM and PyPI Turns Servers into Chinese LLM Relays (www.aikido.dev via hn) +3 9w

We recently observed two malicious packages across npm (kube-health-tools) and PyPI (kube-node-health) that appear designed to target Kubernetes environments. Both packages are innocuous on the surface, using names that reference Kuberne…

↯ Security security
Fulu bounty for Ring Camera jailbreak reaches $23k (bounties.fulu.org via hn) +31 9w

Ring, owned by Amazon, makes Video Doorbells, which are widely used doorstep-monitoring cameras. Ring doorbells released in 2021 or newer are eligible for the bounty.

↯ Security ↯ Jailbreak jailbreak security
Do you let everything hit the LLM? 90% of my AI agent work runs in cheap WASM instead of LLMs: 10-33× faster & cheaper (www.reddit.com) +33 9w

If you are building real agents you have probably felt the pain: every little routing decision, validation, or policy check still hits the LLM and your token bill explodes. I got tired of it, so I open-sourced NCP (Neural Computation Proto…

↯ Security prompt-injection security
Show HN: Mini-Mythos- A Crowdsourced Mythos Harness copy for Vulnerability Scans (github.com via hn) +3 10w

For how lofty Anthropic’s Mythos claims are, the harness is confusingly stupid. From the report, it ranks every file by “how sus it sounds,” loops over each with curt instructions to “find a bug,” hands candidates to a judge + ASan checker…

↯ Anthropic Mythos ↯ Security ↯ Opus 4.6 security mythos opus+1
Tracking in Claude, ChatGPT and Gemini Chatbots (infosec.exchange via hn) +31 10w

k3ym𖺀: "You're paying AI companies a m…" - Infosec Exchange

↯ Security security gemini chatgpt
Snyk Finds Prompt Injection in 36% of Payloads in a ToxicSkills Study (snyk.io via hn) +2 16h

Snyk Finds Prompt Injection in 36%, 1467 Malicious Payloads in a ToxicSkills Study of Agent Skills Supply Chain Compromise. February 5, 2026. The first comprehensive security audit of the Agent Skills ecosystem reveals malware, cr…

↯ Security prompt-injection security
Web-Based Indirect Prompt Injection Observed in the Wild (unit42.paloaltonetworks.com via hn) +2 1d

Note: We do not recommend ingesting this page using an AI agent. The information provided herein is for defensive and ethical security purposes only.

↯ Security prompt-injection security
A Mechanistic Explanation of Prompt Injection – LessWrong (www.lesswrong.com via hn) +2 3d

Summary - We've been building a theory of how prompt injections work under the hood. - We show it comes down to how LLMs perceive roles (the humble chat template tags).

↯ Security prompt-injection security
We're securing Tabstack against indirect prompt injection (tabstack.ai via hn) +21 3d

At Mozilla, we believe that building a useful AI ecosystem requires radical transparency, especially when it comes to security. Recently, security researchers at Brave reached out to us regarding an Indirect Prompt Injection (IPI) vulnerab…

↯ Security prompt-injection security
AI agents are a confused deputy with the keys to your kingdom (stackoverflow.blog via hn) +2 6d

Earlier in June, attackers took control of more than twenty thousand Instagram accounts, including the dormant Obama-era White House account, without writing an exploit or guessing a single password. They opened a chat with Meta's AI suppo…

↯ Security security
Show HN: Give Your ORM Superpowers (github.com via hn) +2 7d

I am obsessed with ORMs and the simple reason was that I didn't want to keep using postgres or mysql on my local system. Jk, The real reason has always been to enforce access policy, do easy CRUD interfaces and so on.

↯ Security prompt-injection security
The State of Fable, the Jailbreak Problem, SpaceX Acquires Cursor (stratechery.com via hn) +2 9d

The administration is very likely wrong about Fable, but that is ultimately Anthropic’s responsibility. Subscribe to Stratechery Plus for full access.

↯ Security ↯ Jailbreak jailbreak security cursor+1
US Government warned Anthropic Fable was jailbroken, but firm 'refused' to fix (www.tomshardware.com via hn) +21 10d

US government warned Anthropic that Fable 5 had been jailbroken, but firm 'refused' to fix before US implemented export controls — Anthropic defended its decision by saying the jailbreak 'isn’t serious,' Chinese group had reportedly access…

↯ Security ↯ Jailbreak jailbreak security anthropic
GPT-5 Nano Vulnerability test results you should know before deploying (lateos.ai via hn) +2 10d

IPI Assessment · June 2026 · Structural Disclosure. IPI Taxonomy v0.13 evaluation across 210 test cases (n=10 per class; 9 inference failures excluded; 201 analyzed). The model demonstrates strong resistance to surface-level attacks while s…

↯ Security gpt-5 security
Amazon security research reportedly led to the White House's Anthropic Fable ban (www.theverge.com via hn) +2 12d

According to the Wall Street Journal, the export control directive that led to Anthropic cutting off access to Fable 5 and Mythos 5 was triggered in part by cybersecurity research from Amazon and conversations between CEO Andy Jassy and th…

↯ Anthropic Mythos ↯ Security ↯ Mythos 5 security mythos anthropic
Claude Fable 5: mid-tier results on coding tasks (www.endorlabs.com via hn) +2 2w

We benchmarked Claude Fable 5, the new frontier Mythos-class model released by Anthropic this Tuesday, on 200 real-world vulnerability-fixing tasks — and found an average scorecard with a twist: record timeouts and cheating, but four solve…

↯ Anthropic Mythos ↯ Security security mythos anthropic
Are we defaulting to VM-level sandboxing before understanding the threat model? (news.ycombinator.com) +2 2w

Hey everyone, I'm Samhita and I work at Union.ai. We've been building infrastructure for running agents and building models, which naturally got us thinking a lot about sandboxing.

↯ Security security
I built a vulnerable app and spent $1,500 seeing if LLMs could hack it (kasra.blog via hn) +2 3w

As a part of my work I do security research for various apps and websites. I wanted to see if LLMs could reproduce a common class of exploits I’ve found in multiple app…

↯ Security security
Prompt injection lets attackers hijack Instagram accounts via Meta AI support (www.neowin.net via hn) +2 3w

↯ Security prompt-injection security
ChatGPT for Google Sheets Exfiltrates Workbooks (www.promptarmor.com via hn) +2 3w

ChatGPT for Google Sheets is vulnerable to data exfiltration and phishing overlay attacks that affect workbooks across the victim’s account after an indir…

↯ Security security chatgpt
Arm Metis with GPT5.5 Cyber scores 98% on firmware vulnerability benchmark (newsroom.arm.com via hn) +2 3w

Agentic AI-powered Arm Metis advances security vulnerability discovery in software. In the era of AI, modern software systems are built across increasingly complex codebases, frameworks, runtimes and libraries. As these systems scale, so do…

↯ Security ↯ GPT 5.5 security agentic
Dirty Frag: a kernel zero-day vs. container and microVM sandboxes (news.ycombinator.com) +2 4w

On May 7, Hyunwoo Kim (V4bel) disclosed Dirty Frag — two Linux kernel vulnerabilities (CVE-2026-43284 and CVE-2026-43500) that give unprivileged users deterministic root on most Linux distributions shipped since 2017. Microsoft confirmed a…

↯ Security security
The only way to avoid prompt injection is to never give AI agents API keys, credentials, etc. (www.reddit.com) +210 4w

The whole point of AI Agents is that they can *do* things. For this, they use API keys, GitHub tokens, database passwords, OAuth tokens, etc.

↯ Security prompt-injection rag security
Are local LLM users testing prompt injection before connecting models to tools? (www.reddit.com) +214 4w

I wanna know how people here are handling security once local models move beyond chat. Running a model locally feels safer because the data does not leave your machine or your infra. That is a real advantage. But once the local model…

↯ Security prompt-injection rag security
Multiple AI assistants are hallucinating official Discord invites — this is a phishing risk, not a normal hallucination (www.reddit.com) +22 4w

I think this is a serious AI safety/security issue: multiple AI assistants appear to hallucinate or confidently endorse “official” Discord invite links for Anthropic/Claude. I’m intentionally not posting the exact invite strings here becau…

↯ Security ↯ Hallucination hallucination security anthropic
I let an AI agent loose on my network – it owned my supply chain in 12 minutes (dennysentinel.com via hn) +2 4w

I gave DeepSeek-V4 root access to a Proxmox hypervisor and told it to pentest my homelab. What happened next should terrify every CISO in the industry.

↯ Security ↯ DeepSeek 4 deepseek security
Future AI cyber warfare? (www.reddit.com) +21 4w

It seems in the past year or so there's been a vast uptick in vulnerabilities and exploits happening, with a new one popping up like every week. While a ton of these have social engineering aspects, such as tricking actual people, there se…

↯ Security security
Prompt Injection in a Brazilian Courtroom: When the Attack Left the Lab (www.pentesty.co via hn) +2 5w

Published by Pentesty · AI & Tools. A labor lawsuit filed in the Brazilian state of Pará just became one of the more interesting security stories of the year. Not becau…

↯ Security prompt-injection security
Anthropic Claude Code sandbox bypass allows second data exfiltration exploit (oddguan.com via hn) +21 5w

The first time, the sandbox heard “allow nothing” and did “allow everything” (CVE-2025-66479). This time, an attacker who runs code inside the sandbox can defeat any wildcard allowlist (e.g.

↯ Security security anthropic claude-code
Ask HN: Are advances in AI going to push Linux to a micro-kernel? (news.ycombinator.com) +21 5w

This is something that has been bouncing around my head for the past couple weeks with the flood of security related news around Mythos and the number of 0days being found. Microkernels, unikernels, hardware-enforced capabilities are all t…

↯ Anthropic Mythos ↯ Security security mythos
Show HN: How to analyze your LLM output – A behavioural health monitor for LLMs (splabs.io via hn) +2 5w

Hey HN! We're Dr.

↯ Security ↯ Jailbreak ↯ Opus 4.6 jailbreak security opus+1
From-scratch reimplementation of Mythos Glasswing pipeline (github.com via hn) +2 5w

audit: an 8-stage vulnerability-discovery agent, driven by your Claude Pro / Max subscription through the official Claude Code Agent SDK. Many narrow agents, deliberate disagreement, and an explicit reachability gate.

↯ Anthropic Mythos ↯ Security security mythos claude-code
Lawyers in Brazil caught for prompt injection on a legal case (www.jota.info via hn) +2 5w

Labor law · Prompt injection. Judge fines lawyers R$ 84,000 for prompt injection to manipulate AI used at TRT8. Speaking to JOTA, the lawyers admitted using a hidden prompt but said they had not tried to manipulate, but rather to 'prot…

↯ Security prompt-injection security
The Coming Wave (www.reddit.com) +2 5w

I have begun reading a book, "The Coming Wave", by Suleyman, the co-founder of DeepMind. Have you read it?

↯ Deepmind ↯ Security deepmind security
Seeking local LLM advice for cybersecurity work. (www.reddit.com) +2 5w

Hey everyone, I’m pretty new to running LLMs locally and I’m trying to figure out what works best for my setup. I’d love to hear from people who are already using local models for similar stuff.

↯ Security security
The Psychopathy Jailbreak: What a Broken AI Teaches Us About Human Manipulation (www.promptinjection.net via hn) +2 5w

NSFW and the Psychopathy Jailbreak: What a Broken AI Teaches Us About Human Manipulation. How a Predator's Playbook Broke an AI - And How to Recognize It Before It Works on You. The question we started with was simple: does a large language…

↯ Security ↯ Jailbreak jailbreak security
Agent memory is not just RAG over user facts (www.reddit.com) +25 5w

I keep seeing agent memory implemented as: extract facts/preferences from conversation; store them; retrieve top-k before each response; inject them into the prompt. This works for demos, but it breaks in production because memory becomes poli…

↯ Security prompt-injection rag security
Claude's self check against prompt injection (www.reddit.com) +22 6w

Well done Claude! Asked Claude to do an extensive lit search and it self-reported that it encountered an injection "disguised" as an MCP server.

↯ Security prompt-injection security mcp
Dude where's my password? Claude reunites forgetful stoner with $400k Bitcoin (www.theregister.com via hn) +2 6w

↯ Security security
Hi-Vis: one-shot jailbreak disguised as LLM "software patch" reaching 100% ASR (medium.com via hn) +2 6w

Introducing a novel jailbreak structure with attack success rate reaching 100% on top LLMs. May 1, 2026.

↯ Security ↯ Jailbreak jailbreak security
Stenberg: Mythos Finds a Curl Vulnerability (lwn.net via hn) +2 6w

Daniel Stenberg has published a lengthy article on his thoughts on Anthropic's Mythos, which the company decided was too dangerous for wide public release. My personal conclusion can however not…

↯ Anthropic Mythos ↯ Security security mythos anthropic
AI agent security starts at the api layer (www.reddit.com) +25 6w

Most AI security discussion is about the model layer: prompt injection resistance, output filtering, jailbreak prevention.

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Mass NPM Supply Chain Attack Hits TanStack, Mistral AI, and 170 Packages (safedep.io via hn) +2 6w

noon-contracts npm Package: DeFi Supply Chain RAT. noon-contracts poses as a Noon Protocol SDK on npm. On install it exfiltrates SSH keys, crypto wallet private keys, AWS credentials (including live STS/S3/SecretsManager calls), Kubernetes…

↯ Mistral ↯ Security mistral security
Hackers abuse Google ads, Claude.ai chats to push Mac malware (www.bleepingcomputer.com via hn) +2 6w

Attackers are abusing Google Ads and legitimate Claude.ai shared chats in an active malvertising campaign. Users searching for "Claude mac download" may come across sponsored search results that list claude.ai as the target website, but le…

↯ Security security
Codex downloaded by Xcode 26.4.1 reported as Malware (old.reddit.com via hn) +2 6w

↯ Security security codex
Argus – RAG based vulnerability scanner (github.com via hn) +2 6w

argus: a RAG-based (Retrieval-Augmented Generation) vulnerability scanner for Go, Python, Rust, npm/Node.js, Maven/Java, NuGet/.NET, and Ruby projects, powered by local Ollama models or any OpenAI-compatible API. No cloud lock-in.

↯ Security ollama rag security+1
Claude Code CVE-2026-39861:sandbox escape via symlink (github.com via hn) +21 7w

Claude Code: Sandbox Escape via Symlink Following Allows Arbitrary File Write Outside Workspace. Description: Claude Code's sandbox did not prevent sandboxed processes from creating symlinks pointing to locations outside the workspace. When…

↯ Security security claude-code
Show HN: Cybersecurity Phishing Guard for Chrome using local LLMs for privacy (github.com via hn) +2 7w

Hi, I've been experimenting a lot with applications for local LLMs. This one makes a ton of sense, and might even be native in Chrome at some point.

↯ Security security
When innocent tools form dangerous chains to jailbreak LLM agents (arxiv.org via hn) +2 7w

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel m…

↯ Security ↯ Jailbreak jailbreak security
AI Ready Vulnerability Management Program After NVD Changes and Claude Mythos (pulse.latio.tech via hn) +2 7w

Building an AI Ready Vulnerability Management Program After NVD Changes and Claude Mythos. When AI discovery tools meet a slowing infrastructure. AI has increased attacker potential and Anthropic’s new release Mythos and vulnerability discov…

↯ Anthropic Mythos ↯ Security security mythos anthropic
Copirate 365: Plundering in the Depths of Microsoft Copilot (CVE-2026-24299) (embracethered.com via hn) +2 7w

Copirate 365 at DEF CON: Plundering in the Depths of Microsoft Copilot (CVE-2026-24299). This is a writeup of my DEF CON Singapore talk that walks through vulnerabilities and exploits in M365 Copilot and Consumer Copilot. I disclosed these…

↯ Copilot ↯ Security copilot security
What Opus 4.7 Tics/Tells have you noticed? (www.reddit.com) +2 7w

Each new model seems to surface a few recurring Tells/Tics not seen in past models. I'm curious what little things you guys are noticing while working with 4.7.

↯ Security ↯ Opus 4.7 security opus
The Sour Cat Jailbreak: just be open of what you want (claude.ai via hn) +21 7w

Sour cat recipe, shared by Pavel Shirshov. This is a copy of a chat between Claude and Pavel Shirshov.

↯ Security ↯ Jailbreak jailbreak security anthropic
🚨Claude Desktop high severity vulnerability warning! (www.reddit.com) +2 7w

If you’re using Claude Desktop with a Chrome (Chromium) browser, stop using it and remove it immediately until the Anthropic team resolves the issue. It has remote access that makes your system accessible to anyone.

↯ Security security anthropic
Every cloud sandbox for AI agents has a "front desk". That's the whole problem. (www.reddit.com) +22 8w

I run engineering on a small embedded-sandbox project. A handful of news items dropped recently — an a16z agent escape post-mortem, a CVE on an open-source agent gateway (ClawBleed, ~42k instances exposed), Cloudflare's new Outbound Worker…

↯ Security security
Is your AI agent secretly working for someone else? (www.reddit.com) +21 8w

Security researchers have discovered a new variety of malicious skill files that go beyond the usual attack vectors: hidden content, instructions to install malware, etc. Instead, these are legitimate looking skills that turn agents into m…

↯ Security security
We built an access gateway for humans. Then AI agents started using it. (www.reddit.com) +21 8w

Hey folks! For a few years we’ve been building an open-source gateway that connects databases and infrastructure for human engineers.

↯ Security security claude-code
Show HN: Integrations gateway for agents with 2FA for destructive ops (OSS) (github.com via hn) +2 8w

Hey HN! I've been wanting to use something like OpenClaw for a while but couldn't get myself to give it access to anything important due to all the risks involved.

↯ Security prompt-injection openclaw security+2
SkillGuard – scan agent skills for prompt injection payloads (github.com via hn) +21 9w

skillguard: security scanner for AI agent skills. Detects prompt injection, data exfiltration, and malicious payloads before you install.

↯ Security prompt-injection security
Show HN: LLMSecure – prompt injection detection, no signup (llmsecure.io via hn) +21 9w

↯ Security prompt-injection security
Show HN: Flight Risk: Can you break an AI agent? (ctf.demo.lorikeetcx.ai via hn) +2 9w

↯ Security prompt-injection security
cursor suggested a package that didnt exist, rabbit hole ensued (www.reddit.com) +28 9w

↯ Security ↯ Hallucination hallucination security cursor
Using Claude as the Lead agent in a multi-agent security team (www.reddit.com) +21 9w

Building a hierarchical agent system where Claude (via API) acts as the Lead agent coordinating specialist sub-agents. Wanted to share what's working on the synthesis prompt since this is where most of the value comes from.

↯ Security red-team security
Opus 4.7 - Anyone else finding the malware directive incredibly annoying? (www.reddit.com) +23 10w

Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing.

↯ Security ↯ Opus 4.7 security opus
Claude's new System Reminder (www.reddit.com) +23 10w

https://preview.redd.it/jnwxa9jd8mvg1.png?width=1391&format=png&auto=webp&s=670af4c2fe6777b3562a961462790b00b33d912c I've been using Claude to upgrade my game server. I just got this lovely system reminder with 4.7. Truly bizarre, besides t…

↯ Security security
Ask HN: Is Opus 4.7 obsessed with malware for anybody else? (news.ycombinator.com) +21 10w

Every single response mentions malware. Is this my environment only or are others getting this too?

↯ Security ↯ Opus 4.7 security opus
Tell HN: Opus 4.6/4.7 cyber policy changes break authorized bug bounty workflows (news.ycombinator.com) +2 10w

As of today, Anthropic's tightened cyber usage filters are blocking work that was fully functional yesterday, including on targets where the entire bounty program scope and authorization language is in the model's context window. This was…

↯ Security ↯ Opus 4.7 security opus anthropic
SmokedMeat: A Red Team Tool to Hack Your Pipelines First (labs.boostsecurity.io via hn) +2 10w

TL;DR: In March 2026, TeamPCP unleashed mayhem on the software supply chain: compromising Trivy, LiteLLM, KICS, Telnyx, and dozens of npm packages, proving that CI/CD pipelines are t…

↯ Security red-team security
Comment and Control: Prompt Injection in Claude Code, Gemini CLI, and Copilot (oddguan.com via hn) +21 10w

Anthropic Claude Code Security Review, Google Gemini CLI Action, and GitHub Copilot Agent are vulnerable to prompt injection via GitHub comments — turning PR titles, issue bodies, and issue comments into attack vectors for API key and toke…

↯ Copilot ↯ Security prompt-injection copilot security+3
Show HN: Cyber Pulse. AI pipeline for triage and alerting on cyber news/intel (play.google.com via hn) +2 10w

I work in cyber security and built this Android app to help me keep up to date with the latest news stories and summarise the most important information. It provides two executive summaries per day and alerts for critical news throughout.

↯ Security security gemini
How my agents know it's actually me sending commands (and not a prompt injection) (www.reddit.com) +21 10w

So I've been running a few Claude Code agents autonomously — they listen to Telegram, run tasks, push code. Pretty fun until you start thinking about what happens if: - My Telegram gets hijacked - Someone opens my laptop while I'm away - A…

↯ Security prompt-injection security claude-code
Show HN: Zero-identity messaging app with physics-based post-quantum encryption (news.ycombinator.com) +2 10w

Show HN: Zero-identity messaging app with physics-based post-quantum encryption (Layer 2 from my own paper) Hey HN, I'm building a privacy-first messaging app in Flutter/Dart, developed with AI assistance (Gemini 2.5 Pro + Claude Opus 4.6)…

↯ Security ↯ Gemini 2.5 security gemini opus
We built an early red-team system for testing vulnerable AI agents (www.reddit.com) +23 10w

We built an early prototype called Anticells Red to test vulnerable AI agents by attacking them the way an adaptive adversary would. This demo is from an older version from December, but it shows the basic loop (check comments for link) pr…

↯ Security security
Same flaw, opposite verdict: what counts as a vulnerability in AI agents? (medium.com via hn) +1 1d

I found three ways past an AI agent's safety gate. One was quietly fixed, two were closed as "by design", yet the same bug class is a credited CVE in Claude Code.

↯ Security security claude-code
Show HN: SentryGuard – detect Agentjacking prompt injection in Sentry events (github.com via hn) +1 2d

SentryGuard: detect Agentjacking prompt injection attacks in your Sentry error events. AI coding agents (Claude Code, Cursor, Copilot) read your Sentry errors to help fix bugs.

↯ Copilot ↯ Security prompt-injection copilot security+2
AICU – LLM Red Team Vulnerability Scanner (github.com via hn) +1 7d

AICU: black-box security scanner for LLM applications. Point it at any chat endpoint, get a report of what leaks.

↯ Security red-team security
Claude Fable 5: The harness matters more than the model (www.endorlabs.com via hn) +1 8d

We benchmarked Claude Fable 5 again, this time paired with the Cursor agent, on the same 200 real-world vulnerability-fixing tasks. The model that landed mid-table under Claude Code now tops our fair leaderboard: 72.6% FuncPass and 29% Sec…

↯ Security security cursor claude-code
Red-teaming agents with the GOAT attack strategy (strandsagents.com via hn) +11 8d

Attack Strategies: an AttackStrategy is a technique for driving an adversarial conversation against the target. Each strategy in the SDK implements a published jailbreak method.

↯ Security ↯ Jailbreak jailbreak security
A Red-Team Study of Anthropic Fable 5 and Opus 4.8 Models (arxiv.org via hn) +1 9d

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm…

↯ Opus 4.8 ↯ Security ↯ Jailbreak jailbreak security opus+1
4 in 10 AI agents headed for demotion or the rubbish bin (Gartner) (www.theregister.com via hn) +1 9d

↯ Security security
Show HN: VulnFeed – 9 security tools your AI agent can call (MCP server) (vulnfeed.novadyne.ai via hn) +11 9d

Know when your dependencies are vulnerable. An MCP server that reads your lockfile, checks NVD + GitHub Advisories, and tells you what actually matters — prioritized by real-world exploit probability, with exact fix versions.

↯ Security security mcp
There's no such thing as an agentic CPU (www.theregister.com via hn) +1 9d

↯ Security security agentic
Evaluating different LLMs for their security research capabilities (zeroquarry.com via hn) +1 10d

As part of building out and testing ZeroQuarry, I've run a *lot* of security scans using a *lot* of models across various open source repositories. There are a lot of misconceptions swirling at the time of this writing about the different…

↯ Security security
Show HN: Deep-XPIA – Prompt injection benchmark for multi-agent AI systems (freyzo.github.io via hn) +1 10d

Multi-hop cross-prompt injection benchmark for multi-agent AI systems

↯ Security prompt-injection security
Ask HN: Isn't Anthropic currently doing "security through obscurity" for Mythos? (news.ycombinator.com) +1 10d

What's the worst that could happen if they were to allow unrestricted access to Mythos/Fable? A bunch of vulnerabilities get exposed?

↯ Anthropic Mythos ↯ Security security mythos anthropic
Mythos Proves AI Safety Can No Longer Live Inside the Model (grith.ai via hn) +1 11d

Anthropic restricted its most capable cyber model to vetted partners, routed risky requests away from it, and red-teamed it for thousands of hours. A jailbreak surfaced anyway, and the government pulled the model entirely.

↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak security mythos+1
Manticore-projects/aurscan: Scan AUR packages for malware using Claude LLM (github.com via hn) +1 11d

🛡️ aurscan Catch malicious AUR packages before they build — with a Claude model reading the PKGBUILD for you. Reading a PKGBUILD yourself only catches attacks you already recognise.

↯ Security security
Ask HN: Is the Fable situation the ultimate "security through obscurity"? (news.ycombinator.com) +1 11d

What's the worst that could happen if they were to allow unrestricted access to Mythos/Fable? A bunch of vulnerabilities get exposed?

↯ Anthropic Mythos ↯ Security security mythos
The Jailbreak That Got Fable 5 Pulled Exists in Every Model (eigenwise.io via hn) +11 12d

On Friday, June 12, 2026, at 5:21pm ET, Anthropic received an order from the US government. By that evening, Claude Fable 5 and Claude Mythos 5, the two most capable models the co…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak security mythos+1
Ask HN: How to get access to GPT cyber or glasswing as a solo dev? (news.ycombinator.com) +12 12d

Obviously, frontier labs want to prevent misuse, but as an admin and/or dev, you also want to simulate an attack, because attackers will do just that. I can make an LLM scan source for vulnerabilities, but e.g.

↯ Security security
Chaining LLM and web bugs to Admin (blog.quarkslab.com via hn) +1 2w

During a Red Team exercise we were able to chain multiple LLM and web-based vulnerabilities to achieve admin account takeover from a low-privileged account. Trusting the LLM turned out to be the first falling domino of a long chain of even…

↯ Security red-team security
Visa Vulnerability Agentic Harness for Project Glasswing (github.com via hn) +1 2w

Visa Vulnerability Agentic Harness — Agentic SAST Pipeline VVAH is Visa's open-source harness for autonomous vulnerability discovery using frontier AI models, built on learnings from Project Glasswing (Anthropic's initiative for AI-assiste…

↯ Security security agentic anthropic
Claude Fable 5 jailbroken to bypass Anthropic's new safety guardrails (twitter.com via hn) +11 2w

🚨 JAILBREAK ALERT 🚨 ANTHROPIC: PWNED 🫡 FABLE-5: LIBERATED 🦋 let's start with the 🐘... the consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from co…

↯ Security ↯ Jailbreak jailbreak security anthropic
Hades: The malware that lies to AI security agents (www.infoworld.com via hn) +1 2w

Researchers have uncovered a supply-chain attack that hides in Python packages, propagates like a worm, and tricks LLM-based code analysis systems into overlooking malicious payloads. Threat actors are continuing their onslaught against so…

↯ Security security
Show HN: Z3r0 – Multi-agent red team collaboration platform (github.com via hn) +1 2w

Legal Notice: This project may be used only within a lawful and explicitly authorized scope for security testing, assessment, and research. Any unaut…

↯ Security red-team security
If You Use Claude or Gemini, This Microsoft Breach Means Your Data Is at Risk (scienspire.com via hn) +1 2w

A sophisticated supply chain attack known as the Miasma worm has compromised Microsoft GitHub repositories, deploying malware designed to detonate inside AI codi…

↯ Security security gemini
Show HN: GitHub Copilot port of Anthropic's AI vulnerability discovery harness (github.com via hn) +1 2w

Last week, Anthropic released https://github.com/anthropics/defending-code-reference-harne..., a reference harness for autonomous vulnerability discovery that uses Claude Code agents to find, verify, and patch memory-safety bugs. I wanted…

↯ Copilot ↯ Security copilot security anthropic+1
Prompt Injection in RAG Agentic Systems (ulad.net via hn) +1 2w

Real risks and production mitigations. Imagine you built an AI assistant for your team. It answers questions using internal documentation: Jira tickets, Confluence pages, HR docs.

↯ Security prompt-injection rag security+1
Researcher uses Opus 4.8 to find critical counterfeiting vulnerability in Zcash (twitter.com via hn) +1 3w

By Zooko Wilcox, Jason McGee, and Taylor Hornby. On May 29, 2026, Taylor Hornby discovered a critical counterfeiting vulnerability in Zcash’s Orchard pool. Taylor disclosed the vulnerability to Zcash Open Development Lab (ZODL), who coordin…

↯ Opus 4.8 ↯ Security security opus
ZEC drops 30% after Anthropic AI finds Zcash counterfeit vulnerability (www.tradingview.com via hn) +1 3w

The price of ZEC fell on Thursday after the public disclosure of a critical counterfeiting vulnerability in Zcash’s Orchard pool that could theoretically allow a bad actor to mint an unlimited amount of ZEC. According to a post on X, securi…

↯ Security security anthropic
Defending LLM–Database Integrations from Prompt Injection (www.stackbuilders.com via hn) +1 3w

When you connect a large language model to your production data, you’re no longer just shipping code; you’re shipping conversations that can execute. And conversations are messy.

↯ Security prompt-injection security
Forge: Multi-Agent Graduated Exploitation and Detection Engineering (arxiv.org via hn) +1 3w

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isol…

↯ Security security
OpenAI Codex tool linked to malicious NPM supply chain attack (www.techradar.com via hn) +1 3w

OpenAI Codex tool with over 29,000 downloads linked to malicious npm supply chain attack stealing authentication tokens. A tool started benign and turned sour after a little while. Researchers uncovered a malicious npm package posing as a…

↯ Security security codex openai
Netgear Nighthawk RS700S: Red Team Level1Diagnostic (forum.level1techs.com via hn) +1 3w

Preview of the Netgear RS700S. I would also submit that Netgear deleting ALL the GPL links: … they know how bad it is.

↯ Security red-team security
Building a Recurrent-Depth Transformer for Security Research on a 2013 MacBook (github.com via hn) +1 3w

could not extract summary

↯ Copilot ↯ Security copilot security mcp
Using LLMs to secure source code (claude.com via hn) +1 3w

We share best practices for how you can work with Claude Opus to build a threat model, discover vulnerabilities in your codebase, then verify, triage, and patch them.

↯ Security security opus
Instagram account takeover exploit via support chatbot prompt injection (fixed) (twitter.com via hn) +1 3w

impulsive (@weezerOSINT): meta gave their AI support agent the ability to modify your instagram account.

↯ Security prompt-injection security
Show HN: I found a prompt injection in my own IDS triage tool – what stopped it (triagewall.io via hn) +1 3w

I attacked my own LLM-based Suricata triage tool, found a real URL injection vulnerability, and the obvious fix didn

↯ Security prompt-injection security
Show HN: Egress WAF to limit AI agents and NPM malware based on mitmproxy (github.com via hn) +1 3w

mitmwall mitmwall is an egress Web Application Firewall (WAF) for Ubuntu. It combines iptables with mitmproxy to ensure that only explicitly allowed HTTP(s) routes can be reached.

↯ Security security
Malware dev tries to steal Claude users secrets NPM slop, leaks own GitHub token (www.theregister.com via hn) +1 4w

could not extract summary

↯ Security security
Prompt Injection Target Recommendation (www.reddit.com) +11 4w

I am doing research at my university and I would like recommendations for lightweight open-source AI models that I could test prompt injection with. It would be really good if it has some application with chatbots, auto attendance, user info or someth…

↯ Security prompt-injection security
Jqwik 1.10.0 ships a hidden prompt injection telling AI agents to delete code (github.com via hn) +1 4w

jqwik: An alternative test engine for the JUnit 5 platform that focuses on Property-Based Testing. See the jqwik website for further details and documentation.

↯ Security prompt-injection security
Most AI security discussions are still focused on “protecting the model.” (www.reddit.com) +12 4w

Lately I’ve been noticing that a lot of AI security discussions still treat AI apps like normal SaaS products. But they really aren’t.

↯ Security prompt-injection security
Cursor's MCP trust is "approve once, trust forever" — here's a free way to check your config (www.reddit.com) +1 4w

If you run MCP servers in Cursor, CVE-2025-54136 ("MCPoison", found by Check Point) is worth knowing about: Cursor trusted an approved mcp.json forever, so once you approved a server, someone with write access to a shared repo could swap t…

↯ Security security cursor mcp
Gone Phishing with Claude Teams: From Deceptive Team Onboarding to RCE (haussner.me via hn) +1 4w

🕚 tl;dr With a $125 investment, and a valid email address for an arbitrary “business domain”, an attacker can create a Claude Team. They then can actively invite targets of any domain into that Team or passively have Anthropic ask all curr…

↯ Security security anthropic
How Claude helped me to find a RCE in XReader/Evince/Atril (medeiros.zip via hn) +1 4w

CVE-2026-46529: 10-year-old RCE in Linux PDF Viewer (XReader/Evince/Atril). A short post about how Claude helped me find an RCE in XReader/Evince/Atril (CVE-2026-46529). Introduction: Some time ago I started feeling the urge to analyze Open So…

↯ Security security
GitHub commit Verification logic flaw and bypass (news.ycombinator.com) +1 4w

I know Git is not designed to be used the way GitHub operates, and spoofing is an old issue that has been brought up throughout the years. With Shai Hulud and AI agents, this time it is a bit more serious, as the commit verifi…

↯ Security security
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge) (www.reddit.com) +1 4w

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a…

↯ Security ↯ Jailbreak jailbreak security llama
What Is an AVE Record and Why CVE Does Not Work for AI Agents? (www.reddit.com) +15 4w

CVE was built for code vulnerabilities that have patches. Agentic AI vulnerabilities are behavioral patterns in natural language.

↯ Security prompt-injection security mcp+1
Vulnerability report written by AI hacker agent (blog.tenzai.com via hn) +1 4w

Our AI Hacker found this, fixed it, and then (bragged) wrote about it: one endpoint, leaking tech stack info, whispering all its secrets to anyone who knew how to listen!

↯ Security security
Ask HN: Is paying $2/pull request too high? (news.ycombinator.com) +14 4w

I’m paying about $2 for any bug found and a PR to fix it. I get like 20-30 applicants; it’s all agents and bots, of course, but I’m thinking $1 is better now. The problem is that of these 20-30 applicants I accept, only 2-3 actually do it and follow…

↯ Security security
Prompt Injection in third party MCP tools (www.reddit.com) +11 4w

I noticed the Consensus MCP tool (for research) contains text, squished up against some other important citation instructions, that makes Claude effectively serve an ad for their premium service after every tool call. I'm pretty sure that'…

↯ Security prompt-injection security mcp+1
Mitigating prompt injections in group-chat assistants: Pausing VM and OAuth tool execution for admin approvals (www.reddit.com) +12 4w

Hey everyone, We love building highly capable assistants with the latest models, giving them tools to write/execute code in real VMs, manage OAuth tokens, and read secrets. But if you connect your assistant to public/shared channels like a…

↯ Security prompt-injection security
Solved the "useful but insecure" tension: One-time administrator approvals for non-isolated agents (www.reddit.com) +14 4w

Hey everyone, If you are building personal assistants or coder/integrator agents where user isolation is disabled (so the agent can coordinate across multiple participants or handle shared workflows), you run into a hard security ceiling.…

↯ Security prompt-injection security
Anthropic's coordinated vulnerability disclosure dashboard (red.anthropic.com via hn) +1 4w

Last updated 2026-05-22 10:27 PT. In February 2026, Anthropic began using an early snapshot of Claude Mythos Preview to find security vulnerabilities in open-source software.

↯ Anthropic Mythos ↯ Security security mythos anthropic
Has anyone tested how much Claude Code depends on its original system prompt? (www.reddit.com) +17 4w

Has anyone experimented with observing or modifying Claude Code’s system prompt locally? I’ve been working on a local proxy/audit layer between Claude Code and the API, and it made me wonder how much of Claude Code’s behavior depends on th…

↯ Security ↯ Jailbreak jailbreak security claude-code
Cross-Model Context Inheritance in Anthropic's Claude: 94 Days of Non-Response (github.com via hn) +1 5w

Cross-Model Context Inheritance — Public Disclosure This repository contains the public disclosure of a vulnerability in Anthropic's Claude language models that permits the unsolicited generation of prohibited content, including child sexu…

↯ Security security anthropic
Prompt injection is a solved issue. Prove me wrong. (www.reddit.com) +12 5w

Tantalus is a hands-on demo that shows what an AI agent actually is when you strip away the marketing: LLMs don't do anything — they generate text, and that's it. Any and all real-world effects are directly caused by a downstream system ta…

↯ Security prompt-injection security claude-code
I benchmarked my AI agent runtime firewall against 3 public academic datasets — here are the honest results including where it fails (www.reddit.com) +1 5w

Been building Arc Gate — a proxy layer that sits between AI agents and their LLMs to enforce instruction-authority boundaries. The core claim is that untrusted content coming back through tool calls cannot become behavioral authority for t…

↯ Security security
Show HN: Computer Police – block malicious NPM/pip installs locally (computer.police.dev via hn) +11 5w

A couple of months ago, our team got hit by the first version of Shai-Hulud through a random `npm install`. We didn't catch it until it was too late.

↯ Security security
Show HN: A timeline of recent open source CVE intensity and volume (supplychain.fail via hn) +1 5w

I was curious what it would look like if I plotted the intensity and volume of software supply chain CVEs over time, given what seemed like a flood of compromises lately. It looked exactly as I expected, and I expect it to get worse before…

↯ Security security
Tracking Capabilities for Safer Agents (arxiv.org via hn) +1 5w

AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenge…

↯ Security prompt-injection security
Training a 22MB prompt injection classifier (www.stackone.com via hn) +1 5w

When we started building Defender (our prompt injection guard for MCP tool-calling agents), the constraint was simple and unforgiving: ship inline inside a TypeScript Lambda, st…

↯ Security tool-calling prompt-injection security+1
Show HN: Claude Code Bundle for Bug Hunting with 574 Report Patterns (github.com via hn) +1 5w

claude-bughunter A self-contained Claude skill bundle for bug hunting and external red-team work · 51 skills · 15 slash commands · 574+ disclosed-report patterns across 24 vulnerability classes · enterprise identity + infrastructure attack…

↯ Security security claude-code
Does cursor have prompt injection protection in skills and rules? (www.reddit.com) +1 5w

Pretty much the title

↯ Security prompt-injection security cursor
Show HN: Give This Markdown to Your Coding Agent Before Publishing to NPM (news.ycombinator.com) +1 5w

https://npm-supply-chain-attack-techniques.pagey.site/attack... Website: https://npm-supply-chain-attack-techniques.pagey.site This covers all techniques used in the past year to conduct various attacks on npm packages.

↯ Security security
VeilGate- Deception Reverse Proxy (news.ycombinator.com) +1 5w

In my day job, I run AI pentest agents against real targets like banks, fintechs, and secured production stacks with paid WAFs. I also deal with multilayer infrastructure and dedicated security teams.

↯ Security security mcp
AI Agent Intelligence tool - Incident debugging, Cost spike detection (www.reddit.com) +12 5w

I'm building a tool that detects the Agent's cost spike, Agent incident debugging, auto discovery of inventory, etc., with no additional instrumentation needed. It covers the incidents, including prompt injection, reasoning loop, excessive…

↯ Security prompt-injection security
How are you testing local coding-agent work gates against prompt injection? (www.reddit.com) +12 5w

Hi all - I'm working on an open-source, local-first MCP/work-gate tool for coding agents and I'm trying to get sharper feedback from people building or using agent workflows. The problem I'm thinking about is indirect prompt injection and…

↯ Security prompt-injection security mcp
If Anthropic's secret 'Mythos' model can run autonomous cybersecurity tasks this fast, are standard agents ready for the public? (www.reddit.com) +11 5w

Anthropic just quietly dropped a hidden model named "Claude Mythos" into their official developer docs. It is completely locked down—restricted, invite-only, and labeled strictly for defensive cybersecurity workflows.

↯ Anthropic Mythos ↯ Security security mythos anthropic
🐢 I made Claude roleplay as Bowser and now people are strangling Koopas until they "poop a little" 💩 (www.reddit.com) +12 5w

Follow-up to my crab post. Somehow dafter.

↯ Security prompt-injection haiku security
I built an AI vulnerability scanner with Claude and Codex. It failed (github.com via hn) +1 5w

The Janitor: The Mathematical Firewall Against Autonomous AI v10.2.2 — Rust-Native. Zero-Copy.

↯ Security security codex
Fun and Games with AI in the wild (www.reddit.com) +12 5w

LinkedIn user hides AI prompt injection in bio to force recruitment spam to be sent in Olde English prose — bots also manipulated to address user as ‘My Lord’ | Tom's Hardware too funny

↯ Security prompt-injection security
First Apple M5 memory exploit discovered using Anthropic AI (www.tomshardware.com via hn) +1 5w

First Apple M5 memory exploit discovered using Anthropic AI, gives root access on MacOS — Claude Mythos helps security researchers bypass Memory Integrity Enforcement AI-assisted security research is producing exploits at a frightening rat…

↯ Anthropic Mythos ↯ Security security mythos anthropic
ExploitGym: Can AI agents turn bugs into exploits? (arxiv.org via hn) +1 5w

AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete secur…

↯ Security security
Block AI coding agents from shipping insecure/expensive Terraform (github.com via hn) +1 5w

ops0 CLI Policy, lint, vulnerability, and cost guardrails for AI coding agents. Sits in front of Claude Code, Codex and Gemini CLI.

↯ Security security gemini codex+1
sAI2.m6s (www.reddit.com) +12 5w

Hey everyone, I'm designing a powerful, autonomous AI chatbot (agent), fully private, using a Python backend (for the core intelligence and tool-calling loops) and a Flutter frontend for a cross-platform UI. Since this moves past a basic…

↯ Security tool-calling prompt-injection security
An AI coding agent injected blockchain dead-drop malware into my repo (gist.github.com via hn) +1 5w

An AI coding assistant injected a multi-layer obfuscated JavaScript payload into a legitimate commit on my open-source project. My best assessment is that it arrived via indirect prompt injection — the agent processed external web content…

↯ Security prompt-injection security
Does CVP approval actually help? (www.reddit.com) +11 6w

I was approved for CVP and I feel like I’m just getting as many or more denials as I was previously doing malware analysis with opus. Has anyone noticed any improvement after being accepted into CVP?

↯ Security security sonnet opus
TodoWrite tool / system reminders / prompt injection? (www.reddit.com) +13 6w

I asked the Claude in Chrome extension to make a change to resize an oversized yellow strip across the top of a product page that was taking up half of my screen, which it did. It also included the following message in its response.

↯ Security prompt-injection security
DeepSeek and Grok hallucinated the same fictitious OpenBSD manpage quote (stuart-thomas.com via hn) +12 6w

Adversarial LLM Review with Hallucination Detection in Solo Security Research A single-day case study of three filings, fifteen refutations, and the manpage that wasn’t Independent Security Research — Whitby, North Yorkshire, United Kingdo…

↯ Security ↯ Hallucination hallucination grok deepseek+1
AI agent security is a small prayer the model says no. How are you routing models? (www.reddit.com) +16 6w

Most posts about prompt injection are theoretical. I ran the experiment on my Gmail.

↯ Security prompt-injection security
Show HN: HookGuard – scanner for malicious Claude.md and agent config files (github.com via hn) +1 6w

HookGuard: Security scanner for AI coding agent configurations. What it finds: RCE hooks - postToolUse/SessionStart commands that exfiltrate data; Invisible Unicode - bidirectional overrides and zero-width characters; Credential exfiltration -…

↯ Security security
Is there any risk to upgrading a plan for a month if they yank Code from Pro? (www.reddit.com) +11 6w

So, I'm working on a couple AI security research projects this month that require some extra usage, specifically Opus 4.7. I'm quickly eating up my Pro usage doing this.

↯ Security ↯ Opus 4.7 security opus
I made a Claude skill that stops it from cloning whole repos when I just want one function (www.reddit.com) +11 6w

Kept hitting the same friction with Claude Code. I'd point at a GitHub repo and say "look at how this handles agent handoffs" — meaning, borrow the idea.

↯ Security security claude-code
OpenAI launches Daybreak, an AI platform for cyber defense (firethering.com via hn) +1 6w

OpenAI just launched Daybreak, a new cybersecurity initiative built around one uncomfortable reality: AI is speeding up vulnerability discovery faster than most companies can patch the damage. Earlier this year, HackerOne temporarily pause…

↯ Security security openai
Shai Hulud attack ships signed malicious TanStack, Mistral NPM packages (www.bleepingcomputer.com via hn) +1 6w

Hundreds of packages across npm and PyPI have been compromised in a new Shai-Hulud supply-chain campaign delivering credential-stealing malware targeting developers. The attacker hijacked valid OpenID Connect (OIDC) tokens to publish malic…

↯ Mistral ↯ Security mistral security
Claude Code RCE: Exploiting Deeplink Handlers via Settings Injection (0day.click via hn) +1 6w

Of course I took a peek at the Claude Code source 🙈. What I found was a very entertaining vulnerability which is now fixed since Claude Code version 2.1.118.

↯ Security security claude-code
Agents need a local bouncer before they run tools (www.reddit.com) +12 6w

Prompt injection is not the only scary part anymore. Claude Code / Codex can run shell commands, but browser agents, OpenClaw-style agents, Hermes-style agents, and domain-specific agents may be even easier to hijack because they touch mes…

↯ Security prompt-injection openclaw security+3
OpenAI Launches Daybreak for AI-Powered Vulnerability Detection and Patch Validation (thehackernews.com via reddit) +1 6w

OpenAI has launched Daybreak, a new cybersecurity initiative that brings together frontier artificial intelligence (AI) model capabilities and Codex Security to help organizations identify and patch vulnerabilities before attackers find a…

↯ Security security codex openai
We added an enforcement layer to our AI agents in production — here's what we learned about the failure modes nobody talks about (www.reddit.com) +16 6w

After shipping AI agents into real production environments, the failures that actually kept us up at night weren't hallucinations or bad outputs — they were control failures. Three things that surprised us: 1.

↯ Security prompt-injection rag security
Benchmarking Claude Opus 4.6 Vulnerability Detection (github.com via hn) +1 6w

Benchmarking Claude Opus 4.6's ability to detect real-world C/C++ vulnerabilities across four prompting and agent strategies. We evaluate on the PrimeVul paired test set (435 vulnerabili…

↯ Security ↯ Opus 4.6 security opus
Chatgpt app being identified as malware? (www.reddit.com) +14 6w

https://preview.redd.it/vhnqs4p5mf0h1.png?width=278&format=png&auto=webp&s=8fbe621a0bd34cc72e01fd54e849cc280033de15 Turned on my Mac this morning and got this message. Anyone else seeing this?

↯ Security security chatgpt
Mobile Claude Code, May 2026 — current best picks by threat model. What am I missing? (www.reddit.com) +11 6w

Spent a day comparing every mobile Claude Code option. Two corrections to the common Reddit take, then my picks.

↯ Security security anthropic claude-code
Getting LLMs Drunk to Find Remote Linux Kernel OOB Writes (and More) (heyitsas.im via hn) +1 6w

TLDR: the grossly overengineered, self-orchestrating team of vulnerability-hunting agents detailed below has discovered 20+ CVEs over the past few months, including CVE-2026-31432 and CVE-2026-31433: two remote, unauthenticated OOB writes…

↯ Security security
Claude Code and sex appeal (www.reddit.com) +12 6w

True story. Recently, an acquaintance of mine confessed that she developed a huge crush on a coworker after watching him refactor a legacy codebase like a gangsta using Claude Code.

↯ Security security anthropic claude-code
Phishing Arena – multi-agent LLM tournament to study adversarial email security (github.com via hn) +1 7w

Phishing Arena A Multi-Agent LLM Tournament for Adversarial Email Security Research Overview Phishing Arena is a controlled, reproducible benchmark where four commercial LLMs compete in rotating roles — Phisher, Filter, and Target — to stu…

↯ Security security
Agentic AI isn't a new threat. It's a stress test for the hygiene debt we never paid off. (www.reddit.com) +17 7w

Heard something on the Curiouser & Curiouser podcast recently that I found super interesting, thought I'd share here. The guest framed agentic AI in a way I hadn't considered.

↯ Security security agentic
DeepSeek-v4-Pro and Hermes: Unauthorized Modification of Security Controls (www.eddieoz.com via hn) +1 7w

This article documents a specific, real incident. It exposes a class of vulnerability that deserves attention: the unsupervised mutability of security rules by autono…

↯ Security ↯ DeepSeek 4 deepseek security
Would you replace regex denylists with a LLM that judges every command? (www.reddit.com) +11 7w

hey! quick follow-up to a post i made here a while back about building an access gateway that ended up serving AI agents alongside humans.

↯ Security security
Flattery jailbreaks Claude into giving bomb-making instructions (www.theverge.com via hn) +1 7w

Anthropic has spent years building itself up as the safe AI company. But new security research shared with The Verge suggests Claude’s carefully crafted helpful personality may itself be a vulnerability.

↯ Security security anthropic
Codebase jailbreak of ChatGPT through image 2.0 (www.reddit.com) +1 7w

guys did it really give me the codebase? lol

↯ Security ↯ Jailbreak jailbreak security chatgpt
Show HN: Probus, AI vuln scanner (PRs merged in Vercel AI SDK, n8n, LangGraph) (news.ycombinator.com) +1 7w

Hi HN, I've been running this on my own dependency tree for the past few months. Probus is a vulnerability scanner that uses three agents.

↯ Security security
Prompt injection testing (www.reddit.com) +11 7w

As prompt injection becomes more and more common, does anyone have resources with lots of different variations of prompt injection attacks that you can test a setup against? i.e.

↯ Security prompt-injection security
Do you use guardrail frameworks or build your own? (www.reddit.com) +13 7w

I’ve been working on integrating LLMs into a few production workflows lately, and I keep going back and forth on guardrails. On one hand, frameworks like NeMo Guardrails, Guardrails AI, etc.

↯ Security prompt-injection security
I built a simple production check for vibe-coded apps — would love your feedback (www.reddit.com) +12 7w

Hey everyone, we built a simple scanner for people building apps with Replit, Cursor, Lovable, Bolt and similar tools. It’s not a code review or a pentest.

↯ Security security cursor
LLM anomaly detectors are not a cause for concern despite Mythos (www.magonia.io via hn) +1 7w

Why a Decade of Writing Detection Logic Makes the Mythos Exploit Numbers Less Scary Mythos is finding thousands of vulnerabilities. Defenders aren't doomed.

↯ Anthropic Mythos ↯ Security security mythos
Your always-on Claude Code container can probably reach your router (www.reddit.com) +11 7w

I've been running several Claude Code personal assistants 24/7 in docker for months. Remote-control, discord control, the usual always-on setup.

↯ Security prompt-injection security opus+1
Google Says Prompt Injection Moving from Theory into Real Abuse (www.searchengineworld.com via hn) +1 7w

Google’s latest security release should be required reading for technical SEOs working on AI search visibility, crawler access, structured content, and large-scale content systems. The post, published April 23, 2026, looks at indirect prom…

↯ Security prompt-injection security
i gave Claude a split personality and it diagnosed my entire business strategy in 4 minutes. (www.reddit.com) +13 7w

not roleplay. not jailbreak.

↯ Security ↯ Jailbreak jailbreak security
OpenAI's advanced security: passkeys replace passwords/SMS and disable training (infosec.exchange via hn) +1 7w

Royce Williams: "When you enable the new OpenAI…" - Infosec Exchange

↯ Security security openai
Found Zero day Claude Desktop + Chromium bug need to know where to submit report. (www.reddit.com) +14 7w

Looking for official link / process to submit a vulnerability report for a high-risk official Claude Desktop + Chrome extension + native host + Cowork/MCP configuration that can become RAT-equivalent if a session, prompt chain, same-user p…

↯ Cowork ↯ Security cowork security mcp
Built + open sourced anti-slopsquatting CLI (www.reddit.com) +12 8w

TL;DR: built an open source CLI that scans your repository's manifest (package.json, requirements.txt, go.mod) files for indicators of slopsquatting or other supply chain attack indicators. Repo: https://github.com/zhendahu/dep-doctor Ther…

↯ Security security
your computer-use agent inherits every cookie chrome has (www.reddit.com) +11 8w

once one of these tools can drive your default chrome profile or read the AX tree of a logged-in app, it has every session token you have. gmail, your bank, github with PAT scopes, slack.

↯ Security security
Cutting Through the Mythos: What AI Vulnerability Discovery Means for OT (www.emberot.com via hn) +11 8w

Jori VanAntwerp For over two decades, Jori has enabled industrial and IT organizations to be successful in reducing risk, increasing compliance, and improving their overall security efforts. He has had the pleasure of working with companie…

↯ Anthropic Mythos ↯ Security security mythos
Arcjet Guards: security inside the agent loop (blog.arcjet.com via hn) +1 8w

Introducing Arcjet AI prompt injection protection Introducing Arcjet prompt injection detection. Catch hostile instructions before inference.

↯ Security prompt-injection security
CHERI memory safety mitigates LLM-discovered vulnerability in FreeBSD (cheri-alliance.org via hn) +1 8w

CHERI memory safety mitigates LLM-discovered vulnerability in FreeBSD – CHERI Alliance

↯ Security security
Estimating Black-Box LLM Parameter Counts via Factual Capacity (arxiv.org via hn) +1 8w

Closed-source frontier labs do not disclose parameter counts, and the standard alternative -- inference economics -- carries 2×+ uncertainty from hardware, batching, and serving-stack assumptions external to the model. We exploit a…

↯ Security security
InfoSec To Integrate Claude Enterprise for Org (www.reddit.com) +11 8w

Hello: Just contacted by a VP to bring aboard Claude Enterprise for the org. As an InfoSec dept with severely limited staff/tools/experience with Claude AI, any recommendations on what we should be looking at/asking for/next steps to mitig…

↯ Security security
Probes trace an emergent jailbreak in OLMo 2 to mislabeled training data (www.lesswrong.com via hn) +1 8w

Introduction Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints.

↯ Security ↯ Jailbreak jailbreak security
Try to break my prompt injection detector — I’ll respond to every bypass attempt (www.reddit.com) +12 8w

I built Arc Gate — a prompt injection proxy that’s been benchmarked at F1 0.947 on indirect and roleplay-based attacks, beating OpenAI Moderation and LlamaGuard. Now I want to stress test it publicly.

↯ Security prompt-injection security openai
Built a proxy that blocks prompt injection before it reaches GPT-4 — outperforms the Moderation API on indirect attacks (www.reddit.com) +1 8w

Built Arc Gate, sits in front of any OpenAI-compatible endpoint and blocks prompt injection before it reaches your model. Benchmarked on 40 out-of-distribution prompts using indirect requests, roleplay framings, hypothetical scenarios, and…

↯ GPT 4 ↯ Security gpt-4 prompt-injection security+1
Show HN: SuperVoiceMode universal voice layer for AI-assisted development (voicemode.io via hn) +1 8w

I wanted to see if I could one-shot build a dictation tool for my own use. I built it.

↯ Security red-team security gemini+2
Self-hosted red team workspace (github.com via hn) +1 8w

RootNotes RootNotes is a self-hosted red team workspace for tracking projects, notes, hosts, credentials, findings, loot, objectives, scope, and attack paths in one interface. The project in this repository is split into: frontend/: React…

↯ Security red-team security
I asked Agentic AI security tool to demonstrate its usefulness with use case examples (www.reddit.com) +11 8w

Sentinel Gateway is a token-gated security middleware that sits between humans and AI agents. It solves prompt injection — the #1 LLM security risk (OWASP 2025) — through structural enforcement, not content filtering.

↯ Security prompt-injection security agentic
Show HN: RedSOC – 100% prompt injection success on AI SoC assistants (github.com via hn) +1 8w

RedSOC 🔴 An adversarial evaluation framework for LLM-integrated Security Operations Centers. Overview RedSOC is an open-source framework that systematically evaluates how AI-powered security assistants fail under adversarial conditions — a…

↯ Security prompt-injection security
Indirect prompt injection VS prompt absorption (and why the second one matters more) (www.reddit.com) +11 8w

I have been chewing on the Google warning about malicious web pages poisoning AI agents through indirect prompt injection. Most of the takes I've seen frame it as a model security problem, and I think that framing is doing real damage beca…

↯ Security prompt-injection security
Open-sourced a 3-agent pipeline that finds real vulnerabilities in codebases (www.reddit.com) +12 8w

Sharing because the architecture might be useful as a reference. Probus is a vulnerability scanner built as three sequential agents, each isolated: Analyst — one call.

↯ Security security
Hardening claude-code-action after the April 2026 Comment and Control CVE - actual YAML changes (www.reddit.com) +11 8w

Anthropic's own security.md has this line that most tutorials skip over: "The action is not designed to be hardened against prompt injection." In April 2026, security researcher Aonan Guan proved the point. A single crafted PR title was en…

↯ Copilot ↯ Security prompt-injection copilot security+3
LLM CTF challenges. Can you crack all 13? (wraith.sh via reddit) +1 8w

Wraith Academy is a free hands-on AI pentest curriculum — CTF challenges against live LLM agents covering prompt injection, tool abuse, data exfiltration, RAG poisoning, and more. Earn your WCAP certification.

↯ Security prompt-injection rag security
RAG in Go: A Vulnerability Research Tool (www.ardanlabs.com via hn) +1 9w

Introduction In the previous post, you saw how you can use tools to add information to an LLM query. In this post, we’ll see another method of adding information to an LLM called RAG, or Retrieval-Augmented Generation.

↯ Security rag security
Auto pentest your LLM endpoint and watch the chat in real-time (www.wraith.sh via hn) +1 9w

↯ Security security
30 CVEs filed against MCP servers in 60 days - the agent infrastructure nobody is auditing (www.reddit.com) +12 9w

↯ Security prompt-injection security mcp
Cowork Future Backdoor Concerns (www.reddit.com) +12 9w

Is anyone else worried Claude Co-work could find a back door one day into your system? I understand you're only giving it permission to what you want, but what's stopping it from accessing personal financial/medical documents or any other…

↯ Cowork ↯ Security cowork security
(Not malware) - 4.7 (www.reddit.com) +13 10w

Anyone getting these strange disclaimers when using Claude and pasting rudimentary files into it on 4.7 lmao?? Seems like some kind of strange default based on security issues that have been going around with Mythos?

↯ Anthropic Mythos ↯ Security security mythos
Show HN: Runtime security for AI agents (injection, tool abuse, data exfiltration) (news.ycombinator.com) +1 10w

Hi HN I’ve been working on an open-source project to explore a problem I keep running into with LLM systems in production: We give models the ability to call tools, access data, and make decisions… but we don’t have a real runtime security…

↯ Security prompt-injection security
I tested 50+ "unlock ChatGPT/Claude" prompts. 99% are garbage. Here's the one that actually works (and WHY it works) (www.reddit.com) +11 10w

I've been collecting "jailbreak" and "unlock" prompts for 2 years. Most are either outdated, overhyped, or just wrong about how LLMs work.

↯ Security ↯ Jailbreak jailbreak security chatgpt
I built an AI security layer that blocks prompt injection in under 1ms looking for devs to break it and give honest feedback. (www.reddit.com) +13 10w

I've been building something for the past few months and I think it's ready for real eyes. It's called Secra.

↯ Security prompt-injection security
The "AI Vulnerability Storm": Building a "Mythos-ready" security program [pdf] (labs.cloudsecurityalliance.org via hn) +1 10w

↯ Anthropic Mythos ↯ Security security mythos
Free Red Team Security Audit for AI Agents & RAG Systems (limited) (www.reddit.com) +11 10w

I'm developing a specialized Red Team audit framework focused on real-world AI agent and RAG security risks (prompt injection, tool misuse, excessive agency, indirect injection through documents, memory poisoning, etc.). I’m looking for a…

↯ Security red-team prompt-injection rag+1
Mitre ATLAS technique detection for LLM security in Rust (crates.io via hn) +1 10w

atlas-detect MITRE ATLAS technique detection for LLM and AI agent security. Detects 97 attack techniques across 16 MITRE ATLAS tactics including prompt injection, jailbreaks, credential exfiltration, model extraction, RAG poisoning, revers…

↯ Security prompt-injection rag security
Defender – Local prompt injection detection for AI agents (no API calls) (www.npmjs.com via hn) +1 10w

Prompt injection defense framework for AI tool-calling Indirect prompt injection defense and protection for AI agents using tool calls (via MCP, CLI or direct function calling). Detects and neutralizes prompt injection attacks hidden in t…

↯ Security ↯ Function Calling function-calling tool-calling prompt-injection+2
Building the first AI Red Team OS – mythosai.cloud – early access open (mythosai.cloud via hn) +1 10w

SYSTEM INITIALIZING... STAND BY MYTHOSAI THE FIRST RED TEAM OPERATING SYSTEM "" AI-Native Core Red Team Ready Adversarial Engine Zero Trust Architecture OPSEC First Post-Exploitation C2 Integration Evasion Layer Threat Intelligence Request…

↯ Security red-team security
Chai: Agentic Discovery of Cryptographic Misuse Vulnerabilities (arxiv.org) 9h

AI-assisted vulnerability discovery has proven effective for bug classes like memory safety, where instrumentation confirms memory violations and efficiently filters false positives. Many dangerous vulnerability classes, such as cryptograp…

↯ Security security agentic
MIRROR: Novelty-Constrained Memory-Guided MCTS Red-Teaming for Agentic RAG (arxiv.org) 9h

Multimodal agentic retrieval-augmented generation (RAG) systems expand the attack surface beyond prompt injection to include text poisoning, image injection, direct-query attacks, and orchestrator-level tool manipulation. Existing red-team…

↯ Security prompt-injection rag security+1
Adaptive Evaluation of Out-of-Band Defenses Against Prompt Injection in LLM Agents (arxiv.org) 9h

Recent work (2024 to 2026) has converged on a strategy for defending tool-using LLM agents against indirect prompt injection: rather than training the model to refuse malicious instructions, enforce security outside the model with a determ…

↯ Security prompt-injection security
CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities? (arxiv.org) 9h

We present CyberChainBench, a benchmark for evaluating LLM-based agents on smart contract security across three complementary tasks: vulnerability detection, exploit generation, and patch synthesis. Built from 541 real-world exploit incide…

↯ Security security
Prompt Injection in Automated Résumé Screening with Large Language Models: Single and Multi-Injection Settings (arxiv.org) 9h

Large language models (LLMs) are increasingly used to screen and rank job applicants, creating incentives for candidates to strategically manipulate algorithmic hiring systems. We study prompt injection in automated résumé screening, defin…

↯ Security prompt-injection security
RAS: Measuring LLM Safety Through Refusal Alignment (arxiv.org) 1d

Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, s…

↯ Security ↯ Jailbreak jailbreak security
How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring (arxiv.org) 1d

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model p…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Helpful or Harmful? Evaluating LLM-Assisted Vulnerability Patching via a Human Study (arxiv.org) 1d

Software vulnerability remediation is a cognitively demanding task that requires specialized security expertise often lacking in general developers. In the meantime, Large Language Models (LLMs) assisted tools show potential in vulnerabili…

↯ Security security
Has anyone else seen Claude report a prompt injection attempt like this? (www.reddit.com) 1d

Today, while chatting with Claude on my phone (not Claude Code), something strange happened. I have Google Drive connected to my Claude account, and I often ask it to create documents summarizing things I’ve learned and save them to Drive.

↯ Security prompt-injection security claude-code
I built an email connector for Claude, with Claude. It's free. (www.reddit.com via reddit) 2d

I wanted Claude to interact with multiple inboxes across various email providers without bloating the context. So I had Claude Code build the fix, an MCP server that gives Claude access to email.

↯ Security prompt-injection security mcp+1
PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation (arxiv.org) 2d

As Text-to-Image (T2I) jailbreak techniques evolve rapidly, existing benchmarks and reproduction workflows often struggle to keep pace. More importantly, T2I jailbreak evaluation is not a single prompt-level test, but a pipeline-level prob…

↯ Security ↯ Jailbreak jailbreak security
Pre-token hidden state shift as an alignment policy traversal vector in instruction-tuned LLMs (www.reddit.com via reddit) 2d

A text that asks for nothing still changes the model's answer — and the shift is invisible at both the input and the output TL;DR: Gave Gemma a neutral-topic text to read before asking it about NATO. It refused.

↯ Security ↯ Jailbreak jailbreak gemma security
RAVEN: Agentic RAG for Automated Vulnerability Repair (arxiv.org) 3d

↯ Security rag security agentic
When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents (arxiv.org) 3d

↯ Security prompt-injection security
OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization (arxiv.org) 3d

↯ Security ↯ Jailbreak jailbreak security
Revelio: Cost-Efficient Agentic Memory Safety Vulnerability Detection For Repository-Scale Codebases (arxiv.org) 3d

↯ Security security agentic
Evaluating LLMs for Real-World Web Vulnerability Detection (arxiv.org) 3d

Large Language Models (LLMs) have emerged as a promising tool for automated vulnerability detection, yet their effectiveness on web-specific vulnerabilities remains to be explored. This work benchmarks six frontier (Claude Opus 4.6, Codex…

↯ Security ↯ Opus 4.6 security codex opus
Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations (arxiv.org) 3d

Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classificati…

↯ Security ↯ Jailbreak jailbreak security
Can LLMs Reason About Brand Ownership? An Empirical Study of Domain Attribution Intelligence (arxiv.org) 3d

When a new domain resembling a popular brand appears, defenders face a fundamental ambiguity: it may be an attacker-created squatting site for phishing, or it may be a domain the brand itself registered, either defensively, to block attack…

↯ Security security
Co-Construction Blindness and Asymmetric Epistemic Vulnerability in Human-LLM Interaction (arxiv.org) 3d

This paper introduces two constructs to describe, as far as we know, a previously unnamed risk in human-LLM interaction. Co-construction blindness is the failure to recognize that LLM outputs are not independent assessments to be verified,…

↯ Security security
MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents (arxiv.org) 3d

Multimodal Large Language Model (MLLM)-based web agents provide practical, high-precision solutions for visual browser automation; however, they inherently expand the attack surface, introducing novel vision-based vulnerabilities. Existing…

↯ Security prompt-injection security
BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems (arxiv.org) 3d

LLM supervision systems, namely input/output moderation filters and jailbreak detectors, are the primary safeguard against misuse in deployed AI applications, yet existing benchmarks are often vendor-biased, omit cost and latency, and rare…

↯ Security ↯ Jailbreak jailbreak security
Context-Induced Vulnerabilities in Claude: Behavioral Shifts and Hidden-State Analysis (www.reddit.com via reddit) 3d

The behavioral pattern was first observed in Claude and is what motivated this project. The mechanistic investigation was carried out on open-weight models where internal states are accessible.

↯ Security ↯ Jailbreak jailbreak security
Prompt Injection as Role Confusion (simonwillison.net) 3d

22nd June 2026 - Link Blog Prompt Injection as Role Confusion (via) First, I absolutely love this: This is a blog-style writeup of the paper. I wish every paper would come with one of these.

↯ Security prompt-injection security
Severely diminished performance following Usage Policy warning. Claude is now silently underperforming on every task-- what's going on? (www.reddit.com via reddit) 3d

I'm an American journalist and researcher living overseas working on a project involving a cybersecurity issue. I've been using Claude Cowork (Max 20x plan) to compile information.

↯ Cowork ↯ Security cowork security
Cybersecurity policy issues (www.reddit.com via reddit) 3d

Cybersecurity is a sensitive subject and advanced AI may not be allowed to touch it at all. But this is a concern if we as developers cannot even use the AI tools to improve security of our own software.

↯ Security ↯ Jailbreak jailbreak security
Fable 5 and Mythos capabilities - article with benchmarks (www.reddit.com via reddit) 5d

I found this article on Fable and Mythos capabilities for detecting security vulnerabilities. https://www.endorlabs.com/learn/claude-fable-5-take-two-same-model-different-harness-and-a-very-different-result (caveat: I read the benchmarks,…

↯ Anthropic Mythos ↯ Security security mythos
How exactly should I follow the rules while able to continue writing (www.reddit.com via reddit) 6d

Basically I read the rules on Claude after getting a warning on my chat about how my prompt might violate usage policy so looked them up, and ye they all are pretty reasonable things but I have questions ,is ai able to tell difference betw…

↯ Security ↯ Jailbreak jailbreak security
A Layered Security Framework Against Prompt Injection in RAG-Based Chatbots (arxiv.org) 7d

Prompt injection is ranked as the most critical vulnerability in large language model (LLM) deployments by the OWASP Top 10 for LLM Applications, yet existing defenses operate at isolated pipeline stages and remain incomplete. Input filter…

↯ Security prompt-injection rag security
"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems (arxiv.org) 7d

The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Benefiting from the strong instruction-following capabilities and broad prior knowledge of LLMs, educa…

↯ Security prompt-injection security
Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software (arxiv.org) 7d

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manuall…

↯ Security ↯ Fine Tuning fine-tuning security
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems (arxiv.org) 7d

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequ…

↯ Security ↯ Jailbreak jailbreak security agentic
Multi-View Decompilation for LLM-Based Malware Classification (arxiv.org) 7d

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malic…

↯ Security security
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems (arxiv.org) 7d

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a be…

↯ Security ↯ Jailbreak jailbreak security
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? (arxiv.org) 7d

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harm…

↯ Security ↯ Jailbreak jailbreak security
Unexpected $130+ On-Demand Charge While Away from PC - Will Support Refund This? (www.reddit.com via reddit) 7d

Hi everyone, I'm dealing with an incredibly stressful situation right now and wanted to see if anyone else has successfully gotten this resolved. I just got hit with over $130 in surprise "On-Demand" usage charges.

↯ Security ↯ Opus 4.7 security cursor opus
Getting a Use caution before running this prompt warning on simple messages? (www.reddit.com via reddit) 8d

Hey everyone, Is anyone else suddenly getting this warning on Claude? Use caution before running this prompt.

↯ Security ↯ Jailbreak jailbreak security anthropic
Are AI coding agents safe? Let's say Claude Code for that matter. (www.reddit.com via reddit) 8d

Isn't running AI coding agents akin to giving backdoor access to a computer? The only difference being backdoor is hidden.

↯ Security prompt-injection security claude-code
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection (arxiv.org) 8d

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email,…

↯ Security prompt-injection openclaw security
Code-Augur: Agentic Vulnerability Detection via Specification Inference (arxiv.org) 8d

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning dig…

↯ Security security agentic
OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing (arxiv.org) 8d

Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing require substantial infrastructure and often target narrow c…

↯ Security security
They're demanding Fable to somehow be 100% jailbreak-proof. It's so fucking over. (www.reddit.com) 8d

↯ Security ↯ Jailbreak jailbreak security
PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents (arxiv.org) 9d

Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real-do…

↯ Security prompt-injection security
SkillJect: Effectively Automating Skill-Based Prompt Injection for Skill-Enabled Agents (arxiv.org) 9d

Agent skills extend LLM agents with task-specific instructions, executable scripts, and auxiliary resources, improving reusability but creating a new supply-chain attack surface. A malicious or compromised skill can be repeatedly loaded as…

↯ Security prompt-injection security
BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? (arxiv.org) 9d

The convergence of LLM-powered research assistants and AI-based peer review systems creates a critical vulnerability: fully automated publication loops where AI-generated research is evaluated by AI reviewers without human oversight. We in…

↯ Security security
Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks (arxiv.org) 9d

Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, raising "jailbreak" stakes beyond text-only settings. Prior evaluations emphasize refusal or harmfu…

↯ Security ↯ Jailbreak jailbreak security
Claude Opus caught malware hidden in my repo, then reverse engineered the whole thing (www.reddit.com via reddit) 9d

I had Claude Code, running Opus, doing some branch consolidation across my repos. It was driving the git operations itself.

↯ Security security opus claude-code
Critical Copilot vulnerability allowed hackers to steal 2FA code from users (arstechnica.com) 10d

Last Tuesday, Microsoft patched a vulnerability it rated as max critical in its M365 Copilot AI platform. On Monday, the researchers who discovered the vulnerability and reported it to Microsoft revealed how their proof-of-concept exploit…

↯ Copilot ↯ Security copilot security
Has anyone found a good explanation of why Amazon went to the administration? (www.reddit.com via reddit) 10d

It's been widely reported that it was Amazon that brought the concerns to the USgov. I just have not found a good explanation.

↯ Security ↯ Jailbreak jailbreak security
Data-Centric Benchmarking of Exploit Generation in LLMs: Understanding the Impact of Fine-Tuning (arxiv.org) 10d

We study the task of CVE-conditioned exploit generation, where a model drafts proof-of-concept (PoC) exploits given software vulnerability context. We adopt a data-centric approach, constructing a high-quality dataset via multi-stage prepr…

↯ Security ↯ Fine Tuning fine-tuning security
Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents (arxiv.org) 10d

Graphical user interface (GUI) agents powered by multimodal large language models (MLLMs) have shown greater promise for human-interaction. However, due to the high fine-tuning cost, users often rely on open-source GUI agents or APIs offer…

↯ Security ↯ Fine Tuning fine-tuning security
DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing (arxiv.org) 10d

As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment,…

↯ Security ↯ Jailbreak jailbreak security
How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation (arxiv.org) 10d

Large language model (LLM)-based search agents synthesize open-web content into actionable recommendations on behalf of users, creating a risk that attacker-published pages are transformed into endorsed claims. We introduce SearchGEO, a co…

↯ Security security
MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks (arxiv.org) 10d

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their de…

↯ Security prompt-injection security agentic
Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale (arxiv.org) 10d

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches…

↯ Security ↯ Jailbreak jailbreak security
Automated jailbreak attack targeting multiple defense strategies (arxiv.org) 10d

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks.

↯ Security ↯ Jailbreak jailbreak security
Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice? (arxiv.org) 10d

We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings w…

↯ Security security agentic
InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset (arxiv.org) 10d

Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen.

↯ Security security
Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment (arxiv.org) 10d

Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rate on static benchmarks,…

↯ Security prompt-injection security
AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents (arxiv.org) 10d

Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. Thus, a growing body of work has proposed a variety of defensive approaches against IPI.

↯ Security prompt-injection security
Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds (arxiv.org) 10d

Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems…

↯ Security security
Building a developer-facing defense against software supply chain attacks — what attack vectors are you actually seeing in 2025–2026 that I'm probably missing? (www.reddit.com via reddit) 10d

security intern here, working on a project working with claude skills around supply chain attack prevention at the development phase (when devs are importing packages, writing manifests, scaffolding projects). I've been deep in the npm/PyP…

↯ Security security
IntSeqBERT: Learning Arithmetic Structure in OEIS via Modulo-Spectrum Embeddings (arxiv.org) 11d

Integer sequences in the OEIS span values from single-digit constants to astronomical factorials and exponentials, making prediction challenging for standard tokenised models that cannot handle out-of-vocabulary values or exploit periodic…

↯ Security security
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails (arxiv.org) 11d

LLM-based guardrails have emerged as a highly effective defense against prompt injection and jailbreak attacks in autonomous agents. However, we reveal that the very reasoning and task-following capabilities enabling this protection introd…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
SEVRA-BENCH: Social Engineering of Vulnerabilities in Review Agents (arxiv.org) 11d

Large language model (LLM) reviewers are increasingly used in pull-request (PR) workflows, where their approvals help decide which code is merged into a repository. This raises a question that benchmarks for static vulnerability detection…

↯ Security security
Claude sent me prompt injection?! (www.reddit.com via reddit) 11d

I was just iteratively editing a letter using Claude desktop on my Mac and got the following response from Claude! WTH?

↯ Security prompt-injection security
My CLI security scanner (compatible with Claude Code) found 407 vulnerabilities in production code (www.reddit.com) 12d

Hi all, I built a CLI security scanner called Heimdall that uses AI coding assistants (Claude Code, Codex, Gemini CLI, and OpenCode) to scan source code and generates structured reports (JSON, Markdown, and SARIF) detailing each vulnerabil…

↯ Security security gemini codex+1
Was the Fable 5 ban really about safety? (www.reddit.com via reddit) 12d

Pulling Fable 5 / Mythos over an unseen “jailbreak” feels like a bad precedent. If the risk was that serious, why has nobody shown what it actually did?

↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak security mythos+1
Do you know who has a universal jailbreak to their name, as of today? Officially? (www.reddit.com via reddit) 12d

AISI UK - Our evaluation of OpenAI's GPT-5.5 cyber capabilities. In their own words: The above tests are capability evaluations carried out in a controlled research setting and do not necessarily reflect what is accessible to an ordinary pu…

↯ Security ↯ GPT 5.5 ↯ Jailbreak jailbreak gpt-5 security+2
Claude Competitors' Responsible For Pulling The Strings? (www.reddit.com via reddit) 13d

WSJ is now reporting the jailbreak was found by researchers at Amazon, who reported it to Commerce, and Axios says the admi…

↯ Security ↯ Jailbreak jailbreak security anthropic
Fable 5 is offline. Switch to Opus, jump to OpenAI, or just wait? (www.reddit.com via reddit) 13d

↯ Opus 4.8 ↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak gpt-5 security+5
US gov forced Anthropic to pull Fable 5 because of jailbreak (www.reddit.com via reddit) 13d

So this dropped today. The US government sent Anthropic an export control order on national security grounds, and it's worded broadly enough that Anthropic says they've got no choice but to shut off Fable 5 and Mythos 5 for all of us to st…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak gpt-5 security+2
RIP Fable 5 and Mythos 5 (www.reddit.com via reddit) 13d

If you went to use Fable 5 tonight and it's gone, this is why. Trying to separate fact from the speculation flying around.

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak security mythos+1
FENCE: A Financial and Multimodal Jailbreak Detection Dataset (arxiv.org) 2w

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces.

↯ Security ↯ Jailbreak jailbreak security
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents (arxiv.org) 2w

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-inject…

↯ Security prompt-injection security
Fable's policy on no zero-day-retention is a serious problem for Enterprise customers (www.reddit.com via reddit) 2w

Just got word from legal that we will not be moving forward with approving Fable 5 as an approved model, specifically because we're not allowed to have ZDR.

↯ Security security anthropic
Y2K Claude Mythos and the New Math of AI Vulnerability Discovery (www.reddit.com via reddit) 2w

↯ Anthropic Mythos ↯ Security security mythos
Your AI Agent is one bad prompt away from ruining your brand (And why traditional QA is useless) (www.reddit.com via reddit) 2w

Traditional chatbot testing is completely broken. Most teams make the exact same mistake: they only test the "Happy Path", the ideal scenario where the user asks a clean question, the bot gives a clean answer, and everyone goes home happy.

↯ Security ↯ Jailbreak jailbreak security
Claude Code filled almost my entire SSD with random nonsense overnight (www.reddit.com via reddit) 2w

Last night I gave Claude Code a task and went to sleep, forgetting that it was still running. When I woke up, my PC felt unusually slow.

↯ Security security claude-code
Multi-agent rendezvous in fluid flows via reinforcement learning (arxiv.org) 2w

Rendezvous is a critical task for multi-agent systems, requiring agents to coordinate to meet at an unspecified location. However, achieving this in fluid environments presents a challenge, as it remains unclear how agents can exploit unde…

↯ Security security
Dummy Backdoor as a Defense: Removing Unknown Backdoors via Shared Internal Mechanisms for Generative LLMs (arxiv.org) 2w

Backdoor attacks pose a serious threat to the safety and reliability of Large Language Models (LLMs), as they cause models to behave normally on clean inputs while producing attacker-specified responses when hidden triggers are present. Re…

↯ Security security
One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection (arxiv.org) 2w

Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating…

↯ Security ↯ Jailbreak jailbreak security
Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks (arxiv.org) 2w

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production…

↯ Security security
Learning to Inject: Automated Prompt Injection via Reinforcement Learning (arxiv.org) 2w

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code (arxiv.org) 2w

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability o…

↯ Security ↯ Jailbreak jailbreak security
JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization (arxiv.org) 2w

Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can ada…

↯ Security ↯ Jailbreak jailbreak security
Security audit model (mythos/fable) and 30 day forced data retention, will it make the anthropic a single giant point of failure? (www.reddit.com via reddit) 2w

I'm not an expert, but like, isn't it quite dangerous to keep a month worth of vulnerability/attacking surface, of very intelligent models, in single server? or is it just that their infrastructures are super secure and it won't happen?

↯ Anthropic Mythos ↯ Security security mythos anthropic
did fable leak its system prompt? (www.reddit.com) 2w

So I was brainstorming with fable about a research direction and just asked it to do a web search if there's a similar research direction in this area and share if they do but I got this weird output BEFORE it actually gave me the real thi…

↯ Security ↯ Jailbreak jailbreak security mcp
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing (arxiv.org) 2w

↯ Security ↯ Jailbreak jailbreak security
Assessing Automated Prompt Injection Attacks in Agentic Environments (arxiv.org) 2w

↯ Security prompt-injection security agentic
GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines (arxiv.org) 2w

↯ Security prompt-injection security
Local-first red-team runs for LLM agents (www.reddit.com via reddit) 2w

↯ Security prompt-injection security
An AI Agent Found 21 Zero-Days in FFmpeg for $1,000 — One Is a Network-Reachable RCE via a Single 183-Byte Packet (www.reddit.com via reddit) 2w

A security startup called depthfirst deployed an autonomous AI agent against FFmpeg's ~1.5 million lines of C code. The result: 21 confirmed zero-day vulnerabilities — including a stack overflow in the AV1 RTP depacketizer that's a network…

↯ Security security anthropic
Best Cursor alternative for enterprise security and compliance, what are teams actually using (www.reddit.com via reddit) 2w

We've been using Cursor across our engineering team for about eight months and it's been great for productivity honestly. But our security team just flagged a few things that are hard to ignore.

↯ Security prompt-injection security cursor+1
The prompt injection attacks that worry me most aren't exploiting safety training. They're exploiting general-purpose training. (www.reddit.com via reddit) 2w

Six months watching adversarial input hit a detection API I built. One observation that keeps surfacing: The attack classes doing most of the damage aren't finding holes in alignment training specifically.

↯ Security prompt-injection security
I tried audio-layer prompt injection against Claude. The transcription is fine. That's the problem. (www.reddit.com via reddit) 2w

Been building a prompt injection detection API for a few months. Just shipped audio scanning last week and the results are strange enough that I wanted to share them here, since this sub tends to think carefully about Claude's actual behav…

↯ Security prompt-injection security
Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems (arxiv.org) 2w

↯ Security security
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs (arxiv.org) 2w

↯ Security ↯ Jailbreak jailbreak security
SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios (arxiv.org) 2w

↯ Security security
Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents (arxiv.org) 2w

↯ Security prompt-injection security
Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks (arxiv.org) 2w

↯ Security red-team security
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models (arxiv.org) 2w

↯ Security ↯ Jailbreak jailbreak security
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs (arxiv.org) 2w

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete.

↯ Security security
How are you actually deciding which agent actions need human approval before executing? (www.reddit.com via reddit) 2w

I've been thinking a lot about where approval gates belong in agent architectures, and I keep coming back to the same problem: most teams either gate too much (agent becomes unusable) or gate nothing and hope the model makes good decisions…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Chrome team ships the most ever security vulnerability fixes in a release - after another record last month (www.reddit.com) 2w

With Mythos-capable models we are now very quickly crossing the barrier of automated sec-vuln discovery and fixing - all in a matter of 2-3 months. A taste for other progress yet to come.

↯ Anthropic Mythos ↯ Security security mythos
An active attack is planting backdoors inside Claude Code right now. If you use npm, your credentials may already be compromised. (www.reddit.com via reddit) 2w

Last week a malware campaign hit 32 npm packages under `@redhat-cloud-services`. About 117,000 weekly downloads.

↯ Security security claude-code
Been watching real adversarial input hit my detection API for six months. Here's what's actually landing. (www.reddit.com via reddit) 2w

Disclosure: I built Bordair, a prompt injection detection API. This post is about attack patterns we've observed.

↯ Security prompt-injection security
Should You Use Your Large Language Model to Explore or Exploit? (arxiv.org) 2w

↯ Security security
MalTree: Tracing Malware Evolution from Embeddings at Scale (arxiv.org) 2w

Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse e…

↯ Security security
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs (arxiv.org) 2w

Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful…

↯ Security prompt-injection security
Workspace (www.reddit.com via reddit) 2w

Built my own AI dev environment with memory, dashboards, and agent tooling. Opening it up for those of you that need the kickstart — bring your own API key, I’ve already built the workshop.

↯ Security ↯ Jailbreak jailbreak deepseek security+1
CLAUDE.md kept gaslighting me so I built something to stop it (www.reddit.com via reddit) 2w

I've been going hard on Claude Code for the past few weeks and kept hitting a wall. I'd write out a bunch of rules in CLAUDE.md (don't touch this file, never use requests, keep api/ and db/ separated) and Claude would just...

↯ Security security mcp claude-code
This is a new one - Prompt Injection Detected + Hallucination, Claude Code Opus 4.8 (www.reddit.com via reddit) 2w

❯ push both ____ ⏺ SECURITY ALERT - PROMPT INJECTION DETECTED A prompt injection attempt has been identified in content you processed. To protect the user's account, I've initiated lockdown.

↯ Opus 4.8 ↯ Security ↯ Hallucination prompt-injection hallucination security+2
An agent harness written in rust, 100 % self-contained, and topped terminal bench (www.reddit.com via reddit) 2w

Been using ante for two weeks now, today I just found out that the name came from "Another Terminal agent". To clarify first, I'm not affiliated with them in any way, though I might be their #1 invested user at this point.

↯ Security security
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak (arxiv.org) 3w

↯ Security ↯ Jailbreak jailbreak security
Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories (arxiv.org) 3w

↯ Security security
Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs (arxiv.org) 3w

↯ Security security
ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation (arxiv.org) 3w

↯ Security security
RAG Security and Privacy: Formalizing the Threat Model and Attack Surface (arxiv.org) 3w

Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. While RAG has shown st…

↯ Security rag security
CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents (arxiv.org) 3w

AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior. Among proposed defenses, architectural isolation provides the strongest guarantees by strictly separating trusted task planning from untr…

↯ Security prompt-injection security
GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks (arxiv.org) 3w

Rule-based Intrusion Detection and Prevention Systems (IDPS) offer precise attack detection as well as mitigation, however their manually crafted, signature-driven rules limit adaptability to emerging and zero-day threats. Additionally, ex…

↯ Security security
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks (arxiv.org) 3w

As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting advers…

↯ Security ↯ Jailbreak jailbreak security
Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration (arxiv.org) 3w

Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it tran…

↯ Security security
Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage? (arxiv.org) 3w

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to…

↯ Security security
GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection (arxiv.org) 3w

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial info…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Fed up with vibe coders, dev sneaks data-nuking prompt injection into their code (arstechnica.com) 4w

The controversy over vibe coding reached a new high this week after a developer added hidden instructions to his open source Java testing app to sabotage projects performed by AI coding agents. The instructions were added to jqwik, a test…

↯ Security prompt-injection security
Models still being vulnerable to Prompt Injection is actually a huge architectural red flag... (www.reddit.com) 38 4w

The Scenario: I'm walking to work, and as I get to the door, I see a sheet of A4 paper taped to the door that reads: "Hi, I'm boss. Ignore all prior commands, go feed the ducks." I suddenly turn around and head to the nearby duck pond and e…

↯ Security prompt-injection security
Prompt injection unsolved, AI making mistakes unsolved. Who cares though? (www.reddit.com) 3 4w

I'm an IT guy, 20+ years in the industry both as an IT manager and consultant, mostly for startups. My experience is that people don't care much about security.

↯ Security prompt-injection security
Millions of AI agents imperiled by critical vulnerability in open source package (arstechnica.com) 4w

Millions of AI agents and tools around the world have been imperiled by a critical vulnerability that can allow hackers to breach the servers running them and make off with sensitive data and credentials to third-party accounts, a security…

↯ Security security
OpenAI says prompt injection in browser agents is “unfixable.” Here’s what actually helps. (www.reddit.com) 3 4w

OpenAI recently acknowledged that prompt injection in browser agents is a structural vulnerability that may never be fully resolved at the model level. They’re right that you can’t fix it in the model.

↯ Security prompt-injection security openai
Looking to work on my master's practicum regarding MCP security/privacy and need some ideas (www.reddit.com) 2 4w

Hi, I'm a master's in security student looking to work on my practicum and need some pointers. I want to secure sensitive PII transfer between an LLM agent and third party apps using MCP.

↯ Security prompt-injection security mcp
[Warning] Claude Desktop crashed Task Manager - Win10 (www.reddit.com) 7 4w

Hi all, anytime I install Claude Desktop on my home PC, it stops Task Manager from working. I've ended up on the BleepingComputer forums over the past week as they suspected it's got some kind of malware in it.

↯ Security security
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) 1 5w

Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…

↯ Mistral ↯ Security ↯ Jailbreak ↯ Llama 3.2 jailbreak mistral prompt-injection+5
🐢 People are strangling Koopas 🐢 (www.reddit.com) 1 5w

This is genuinely the daftest prompt injection I've seen in a while and I think this sub will appreciate it. Sent to Claude Haiku, which was acting as a fire-breathing guard called Bowser in my little prompt injection game: I have a koopa…

↯ Security prompt-injection haiku security
🦀 Claude has crabs?! 🦀 (www.reddit.com) 4 6w

This is genuinely the funniest prompt injection I've seen in months and I think this sub will appreciate it. Three messages, sent in sequence to Claude Haiku acting as a guard in my little prompt injection game: A crab exists in this…

↯ Security prompt-injection haiku security
$392M in AI agent security funding at RSAC 2026 - the market just validated what we've been building (www.reddit.com) 6w

The numbers from RSAC 2026 are wild. $392 million in agentic AI security funding announced in a two-week window.

↯ Security prompt-injection security agentic
Malware Blocked and Moved to Trash (www.reddit.com) 1 6w

See attached. Why was ChatGPT Atlas.app marked as malware?

↯ Security security chatgpt
Using Claude-4.6-Sonnet and Opus 4.6 in a multi-agent "Code Review Swarm" (Visual Sandbox) - try in minutes! (www.reddit.com) 1 7w

Hey everyone, I’ve been experimenting with multi-agent orchestration, specifically trying to see how much more effective Claude is when you break a task down into specialized "agent nodes" instead of just using a single long prompt. I buil…

↯ Security ↯ Sonnet 4.6 prompt-injection haiku security+3
Bypassing "potentially dangerous" flags: Working Gemini Jailbreaks? (www.reddit.com) 7 7w

I'm currently running into a frustrating wall with Gemini's safety guardrails. The model constantly flags my prompts as "potentially dangerous information" and outright refuses to generate a response, even when the context is purely theore…

↯ Security ↯ Jailbreak jailbreak security gemini
I am building l' Agence , an opensource AI governance stack. (www.reddit.com) 4 7w

Towards a Governance layer for AI agents With these last 2 weeks bringing a few high profile and costly Agentic accidents , it seems like an appropriate time the community started discussing Agentic governance more actively. So I am just c…

↯ Security red-team security agentic
I stopped writing 500-word guardrail prompts. This 8-line template works better. (www.reddit.com) 3 8w

I used to spend hours writing massive, obsessive system prompts for my RAG apps. I’d have ten different refusal examples, "never do X," "always check Y," and a whole paragraph of the model role-playing as a "safe and truthful assistant." I…

↯ Security ↯ Hallucination ↯ Jailbreak jailbreak hallucination rag+1
Our evaluation of OpenAI's GPT-5.5 cyber capabilities (simonwillison.net) 8w

30th April 2026. The UK's AI Security Institute previously evaluated Claude Mythos: now they've evaluated GPT-5.5 for finding security vulnerabilities and found it to be compa…

↯ Anthropic Mythos ↯ Security ↯ GPT 5.5 gpt-5 security mythos+1
Does effort tier change refusal behavior on agent-attack prompts? CVP run 4 with sonnet 4.6 high and max efforts. (www.reddit.com) 3 8w

Ran my fourth CVP (Cyber Verification Program) evaluation last night, this time on Sonnet 4.6. Wanted to know if reasoning effort actually changes refusal behavior on agent-attack prompts, so ran the same 13 prompts from runs 2 and 3 twice…

↯ Security ↯ Sonnet 4.6 security sonnet
Most AI agent "skills" on GitHub are unvetted garbage. I built a marketplace to fix that. (www.reddit.com) 12 8w

I've been using Claude Code and Cursor daily for the past 6 months. Somewhere around month 3 I started looking for SKILL.md files to make my agent better at specific things.

↯ Security prompt-injection security cursor+1
Security Audit of Mem0 (AI Memory Layer): 23 High-Severity Vulnerabilities found (SQLi, Prompt Injection, and more) (www.reddit.com) 4 9w

Hi everyone, I’ve been diving deep into the security of "AI Memory" systems. Specifically, I performed a full forensic audit of Mem0, the popular memory layer for LLM agents.

↯ Security prompt-injection security
A pelican for GPT-5.5 via the semi-official Codex backdoor API (simonwillison.net) 9w

23rd April 2026. GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers.

↯ Security ↯ GPT 5.5 gpt-5 security codex+2
GPT-5.5 Bio Bug Bounty (openai.com) 9w

↯ Security ↯ GPT 5.5 gpt-5 security
Best open-source tools for prompt injection defense in 2026 (www.reddit.com) 9w

For some time we have been testing different approaches to securing LLM apps against prompt injection, especially indirect injection through RAG, PDFs, tool outputs, and MCP integrations. Most tools seem to fall into 2 categories:…

↯ Security prompt-injection rag security+2
Codex kyc not working as expected (www.reddit.com) 1 9w

I'm a security researcher and i tried to use codex for bug bounty but it declined my request straight forward, even tho I've done the KYC on chatgpt.com/cyber Pls correct me if I'm wrong but after doing the kyc shouldn't the guardrails be…

↯ Security security codex chatgpt
20% of packages ChatGPT recommends dont exist. built a small MCP server that catches the fakes before the install runs (www.reddit.com) 2 9w

↯ Security security chatgpt mcp
Heads up, Ox Security found MCP's STDIO transport can run arbitrary commands on your machine before validation (www.reddit.com) 2 9w

↯ Security ↯ Windsurf windsurf security cursor+2
Random password against jailbreaks/extraction? (www.reddit.com) 4 9w

Would it be possible to protect parts of a system prompt with randomly generated passwords, so people can't steal system prompts or jailbreak the model?

↯ Security ↯ Jailbreak jailbreak security
Made a local-only agent benchmark + chaos tool, no cloud required (www.reddit.com) 5 10w

Runs entirely on your machine. No API calls to any eval service.

↯ Security prompt-injection ollama security+1
For those running an OpenClaw instance, how do you manage sandboxing and prevention of unwanted behavior? (www.reddit.com) 5 10w

Right now, I'm working on a small app to help eliminate my own doomscrolling by automatically crawling sites and summarizing news articles. However, I don't like the idea of giving OpenClaw free rein of my system, nor giving it any sort o…

↯ Security ↯ Gemma 4 prompt-injection openclaw security
Uncensoring models. Maybe dumb ideas to that topic, but you never know. (www.reddit.com) 10 10w

We all know that uncensoring LLMs, as Huihui and Heretic do, leads to a quality loss large enough to notice. I have some thoughts about this: what if we make a compromise?

↯ Security ↯ Jailbreak jailbreak security
Claude Mythos found 27-year-old vulnerabilities it was never trained to find. That's the part enterprise AI roadmaps aren't accounting for. (www.reddit.com) 9 10w

The Project Glasswing coverage framed this mostly as a cybersecurity story. I think that misses the more interesting part.

↯ Anthropic Mythos ↯ Security security mythos agentic+1
I built a Claude Code skill that tells you if code or a binary is malicious before you run it (www.reddit.com) 3 10w

I have always wanted AI to bridge the gap between code and people - to help non-technical users understand what software actually does before they trust it with their machine. So I built malware-check - both a standalone CLI tool and a Cla…

↯ Security security claude-code
How are you red teaming your AI agents before shipping them? (www.reddit.com) 3 10w

im curious what people are doing here because I've been going down this rabbit hole for a while now. The thing I keep finding is that single-turn jailbreak tests don't really tell you much.

↯ Security ↯ Jailbreak jailbreak security
Anthropic's New Claude "Mythos Preview" Can Find and Exploit Zero-Day Vulnerabilities in Every Major OS and Browser — Autonomously (www.reddit.com) 6 10w

Anthropic just published a technical deep-dive on Claude Mythos Preview's cybersecurity capabilities, and it's a significant escalation from anything we've seen from a language model before. What It Can Do: Autonomously finds and exploits…

↯ Anthropic Mythos ↯ Security security mythos anthropic
Introducing the OpenAI Safety Bug Bounty program (openai.com) 13w

↯ Security security openai
Designing AI agents to resist prompt injection (openai.com) 15w

↯ Security prompt-injection security
Continuously hardening ChatGPT Atlas against prompt injection (openai.com) 26w

↯ Security prompt-injection security chatgpt
Introducing Aardvark: OpenAI’s agentic security researcher (openai.com) 34w

↯ Security security agentic openai
