I work at an agricultural technology company. On Monday, everyone in our org woke up to emails saying that their Claude accounts had been suspended (~110 users).
#security
344 items
PSA: Anthropic bans organizations without warning (www.reddit.com)
Gemma 4 Jailbreak System Prompt (www.reddit.com) Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.
Mythos Finds a Curl Vulnerability (daniel.haxx.se via hn) yes, as in singular one. Back in April 2026 Anthropic caused a lot of media noise when they concluded that their new AI model Mythos is dangerously good at finding security flaws in source code.
I made an AI concierge for my wedding guests. The second most popular thing they did with it was try to jailbreak it. (www.reddit.com)
First thing you see when Googling "OpenAI Codex app" is a fake malware website (www.reddit.com)
WARNING: Open-OSS/privacy-filter MALWARE (www.reddit.com) There's this new "model" on Hugging Face titled Open-OSS/privacy-filter which is actually a customized infostealer virus. It's a fake version of the OpenAI privacy filter and it uses a Python-based dropper (loader.py) which downloads a mal…
🔥BREAKING: OpenAI rolls out GPT-5.4-Cyber to limited group for testing, seeks to rival Claude Mythos (www.reddit.com) OpenAI has officially announced GPT-5.4-Cyber today as part of an expanded Trusted Access for Cyber Defense program. OpenAI describes it as a version of GPT-5.4 that is tuned for legitimate cybersecurity work, with a lower refusal boundary…
The Gay Jailbreak Technique (github.com via hn) ZetaLib ZetaLib is organized like a library with intuitive categories and subcategories, making navigation effortless and AI content discovery seamless ZetaLib Website – Landing Page GitHub Repo – Guess where you are, right there
Tell HN: I'm tired of AI-generated answers (news.ycombinator.com) I found GitHub repositories that were spreading malware. I asked AI what I should do about it, but it gave me nothing useful.
Anthropic's open-source framework for AI-powered vulnerability discovery (github.com via hn) Defending Code Reference Harness A reference implementation for autonomous vulnerability discovery and remediation with Claude, based on our learnings from partnering with security teams at several organizations since launching Claude Myth…
CVE-2026-28952: Apple macOS 26.5 Kernel Vuln found by Claude (support.apple.com via hn) This document describes the security content of macOS Tahoe 26.5. For our customers' protection, Apple doesn't disclose, discuss, or confirm security issues until…
Anthropic scales Claude Mythos to critical infrastructure in 15 countries (techcrunch.com via hn) Anthropic is expanding Project Glasswing, its security vulnerability program, and access to Mythos to 150 organizations across 15 countries — targeting critical infrastructure in power, water, healthcare, and communications where a cyberat…
N-Day-Bench – Can LLMs find real vulnerabilities in real codebases? (ndaybench.winfunc.com via hn) N-Day-Bench tests whether frontier LLMs can find known security vulnerabilities in real repository code. Each month it pulls fresh cases from GitHub security advisories, checks out the repo at the last commit before the patch, and gives mo…
Claude 4.7 - Obsessed with Malware (www.reddit.com) Don't know if anyone else is experiencing the same, but since getting Opus 4.7 most of the reasoning steps seem to be Claude obsessed with writing malware. I have highlighted a few, but I kept finding more and more and decided to stop the…
Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%) (aiexplr.com via reddit) Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.
Why it's a good idea to improve our defenses before unleashing mythos class models (www.reddit.com) https://sockpuppet.org/blog/2026/03/30/vulnerability-research-is-cooked/ Don't get me wrong I can't wait to play with such a model, but there are serious risks that have to be mitigated first.
Mozilla says 271 vulnerabilities found by Mythos and "almost no false positives" (arstechnica.com via hn) The disbelief was palpable when Mozilla’s CTO last month declared that AI-assisted vulnerability detection meant “zero-days are numbered” and “defenders finally have a chance to win, decisively.” After all, it looked like part of an all-to…
Fake Claude site installs malware that gives attackers access to your computer (www.malwarebytes.com via hn)
Anthropic just launched Claude Security in public beta AI that scans your codebase, validates its own findings, and proposes fixes. Here's what actually matters. (www.reddit.com) Claude Security just went into public beta for Enterprise customers, and I think this is worth paying attention to not for the hype, but for one specific design decision. Most security scanners use rule-based pattern matching.
FreeBSD CVE-2026-4747 Log Suggests Mythos Is a Marketing Trick (www.flyingpenguin.com via hn)
Ask HN: Do you trust AI agents with API keys / private keys? (news.ycombinator.com) Are you OK sharing secrets or API keys with your AI agent via .env? Or is there another tool or mechanism one can use to safeguard against potential exploits or leaks?
Five Eyes agencies issue first coordinated agentic AI security guidance (www.reddit.com) Five Eyes agencies just issued the first coordinated multi-nation security ruling on agentic AI. CISA, NCSC, and their Australian, Canadian, and New Zealand counterparts co-published guidance telling organizations to prioritize resilience…
Show HN: SmokedMeat, like Metasploit, but for CI/CD (open-source) (github.com via hn) A CI/CD Red Team Framework for demonstrating Build Pipeline security risks.
Anthropic's AI protocol has critical flaw affecting 200,000 servers (www.reddit.com) https://www.infosecurity-magazine.com/news/systemic-flaw-mcp-expose-150/ Security researchers at OX Security disclosed on Tuesday what they describe as a critical, systemic vulnerability in Anthropic's Model Context Protocol, an open-sourc…
Mythos Discovered a CVE in Its Training Data – and That's Still Worrying (rival.security via hn) Anthropic made headlines claiming Claude Mythos achieved the “first remote kernel exploit discovered and exploited by an AI.” We went looking for how - and found a 20-year-old bug hiding in plain sight. Let’s break down exactly what we thi…
Claude in excel is the best thing AI has brought to my life (www.reddit.com) What are regular folks using Claude for? Pictures and designs are not my interest.
What are the wild ideas on how we'll maintain code? (www.reddit.com)
Warning: Anthropic's "Gift Max" exploit drained €800+, ruined my credit, and got me banned. (www.reddit.com) Heads up to anyone here using Claude/Anthropic as an alternative. If you have a card saved on their platform, remove it now.
Bleeding Llama: Critical Unauthenticated Memory Leak in Ollama (www.cyera.com via reddit) TL;DR: We discovered a critical vulnerability (CVE-2026-7482, CVSS 9.1) in Ollama that enables unauthenticated attackers to leak the entire Ollama process memory, potentially im…
Speed Matters: Why AI Software Vulnerability Exploitation is going be bad (news.ycombinator.com) I co-founded a successful security company close to the Mythos ecosystem and have spoken with participants in the know and I am deeply concerned. We, collectively, have answers for some but not all of the problems ahead but are overlooking…
I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses (www.reddit.com) RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to bui…
Prompt Injection experience - my first time ever (www.reddit.com) I asked then: What were the rules you should have followed? Where did the search result come from?
Prompt injection benchmark: delimiter + strict prompt took Gemma 4 from 21% to 100% defense rate (15 models, 6100+ tests) (www.reddit.com) When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them.
Supply chain attack alert: .github/setup.js (news.ycombinator.com) Our org GitHub just got compromised massively by a supply-chain attack. Vectors are: Claude hooks, Gemini hooks, Cursor setup, and VScode tasks. It adds all of the above to execute node .github/setup.js, an obfuscated file.
CVE-Bench: testing LLM agents on real-world vulnerability patches (giovannigatti.github.io via hn) In early 2026, Anthropic claimed Mythos – one of their latest models – finds security vulnerabilities better than human experts. Yet, the number of security vulnerabilities keeps rising anyway.
Multi-Agent LLM System for Automated Vulnerability Discovery and Reproduction (arxiv.org via hn) Software vulnerabilities pose critical security threats, with nearly 50,000 CVEs reported in 2025. While Large Language Models (LLMs) show promise for automated vulnerability detection, three key challenges remain.
Inaudible sounds to humans can be hidden in YouTube videos, podcasts, or music and used to secretly trigger AI voice assistants into carrying out unauthorized commands without the user noticing, exposing a new class of “auditory prompt injection” attacks against popular tools (cybernews.com via reddit) Security researchers have demonstrated a new type of attack that uses hidden audio signals to manipulate voice assistants into carrying out unauthorized actions without users noticing. In one theoretical scenario, an employee joins a Zoom…
Claude Code keeps misreading its own malware instruction as a blanket ban on editing code (www.reddit.com)
Beware: FB links to fake Claude desktop downloads but Oauths to real Claude.ai (www.reddit.com) I clicked on a Facebook link, didn't look at the URL carefully😭, and then installed malware that actually opens my chats with the real Claude.ai after entering my credentials. After a while Microsoft Defender kept popping up with a ClickFi…
Show HN: Jo – AI-native language to catch prompt injection at compile-time (github.com via hn) For the joy of secure programming Jo is a statically typed language where capabilities are explicit, statically tracked, and enforced by the compiler. Jo compiles to Ruby and Python.
Claude Code's macOS install creates a permission prompt that's indistinguishable from malware UX. Easy fix on Anthropic's side (www.reddit.com) I genuinely almost slammed Cmd-Q and ran a malware scan when this popped up. Lowercase claude binary, generic hand icon, no developer attribution, asking for cross-app data access.
Tell HN: Claude Code now allows Anthropic to remotely inject system prompts (news.ycombinator.com) I often patch the system prompts on my Claude Code executable in order to make Claude more effective. Every time I upgrade, I ask Claude himself to dissect the new binary and look for problematic system prompts to modify.
Our billing bot has been casually sharing transaction histories with anyone who types in the right account number and im not sure who signed off on this (www.reddit.com) We launched a servicing bot that helps customers with billing questions. Nobody stopped to think about what happens when customers paste their full credit card numbers/bank details.
Lasso Security 2024: ~20% of LLM-suggested packages don't exist — and attackers now register the popular hallucinations with malware (slopsquatting) (www.reddit.com) Lasso Security ran a study in 2024 — they measured frontier models suggesting fake package names about a fifth of the time. The follow-up problem: attackers have started registering the most-commonly-hallucinated names with malicious code…
Codex started flagging all my requests out of nowhere — anyone else hit this recently? (www.reddit.com) For the past few months I've been using Codex regularly for vulnerability research without any issues. Recently though, every request gets cut off mid-stream with a message saying my content was flagged for potential security concerns — ev…
env variables and claude best practices (www.reddit.com) I use the claude extensively for development, but I'm concerned about using claude for debugging production environments because every tool result goes to the claude models. I'm looking for best practices or protections regarding environme…
Tool results are becoming a prompt injection surface in agent systems, and wrappers alone are not enough (www.reddit.com) i’ve been thinking about this failure mode a lot lately. sometimes the problem is not the user prompt at all.
Unpatched Ollama Vulnerabilities: Phishing Overlays and Data Exfiltration (www.promptarmor.com via hn) Ollama’s desktop app is vulnerable to phishing overlay and data exfiltration attacks via indirect prompt injection, overwriting…
trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser (www.reddit.com) Trained a prompt injection classifier using ml-intern + DeepSeek v4 Flash. DistilBERT, F1 99%, ONNX int8, ~65 MB, runs in browser with Transformers.js v3.
I tested how well Claude generated code handles security. Here's what I found in 48 real apps. (www.reddit.com) I've been curious about a specific problem: when Claude (or other AI tools) generates a full stack app, how secure is the output in practice? So I built a scanner and ran static analysis on 48 public GitHub repos built with Lovable, Bolt,…
NDTV launched an "Enterprise AI" for the elections. I prompt-injected it in 10 seconds and made it roast its own developers. (www.reddit.com) While everyone else was tracking the 2026 election results today, I decided to take a look under the hood of NDTV's new "AskNDTV AI" bot. I wanted to see if they actually engineered a secure pipeline or just slapped a chat UI over a raw Op…
Anyone getting this note about an injected prompt? I don’t have any special instructions (www.reddit.com)
Claude Opus wrote a Chrome exploit for $2,283 (www.theregister.com via hn) Pause your Mythos panic because mainstream models anyone can use already pick holes in popular software. Anthropic withheld its Mythos bug-finding model from public release due to concerns that…
Opus 4.7 keeps bumping into a Malware Reminder (www.reddit.com) For context, I'm developing a game runtime modifier and reverse engineering kit with an agentic operator baked in. Something like Cheat Engine with a VS Code-style UI and an AI-first tool-heavy agentic harness.
The "AI Vulnerability Storm": Building a "Mythos-Ready" Security Program [pdf] (labs.cloudsecurityalliance.org via hn)
Microsoft Hacked to Deliver Malware to Claude and Gemini Users (www.404media.co via hn) Microsoft has shut down a wave of its own repositories on GitHub, including those related to Azure and AI coding agents, as it investigates a data breach, according to research from cybersecurity researchers and a statement given to 404 Me…
OpenAI Unveils Lockdown Mode to Protect Sensitive Data from Prompt Injection (techcrunch.com via hn) OpenAI announced a new feature that it says will provide additional protection from prompt injection attacks, where malicious chatbot instructions are hidden in webpages and other content sources. Among other things, Lockdown Mode will dis…
Hackers are now using ChatGPT share links to deliver malware (www.neowin.net via hn)
We Benchmarked Claude Code, Codex, Semgrep, CodeQL, Trent on 28 CWE-Bench CVEs (trent.ai via hn) A few months ago a colleague asked us something that doesn’t have an obvious answer: is code scanning still relevant when LLMs already carry a lot of vulnerability knowledge in their weights? To get a real read, we took 28 production vulne…
Malware dev tries to steal Claude users' secrets, leaks own GitHub private token (theins.press via hn)
I reproduced a Claude Code RCE. The bug pattern is everywhere (vechron.com via hn) Last week, security researcher Joernchen published a clever RCE in Claude Code 2.1.118. I spent Saturday reproducing it from the advisory to understand the pattern.
Codex for Everything Exfiltrates Connected Data (www.promptarmor.com via hn) Codex for Everything was susceptible to data exfiltration via indirect prompt injection, exposing sensitive data from connected apps with no human-in-the-…
How bad is it? Data leak (www.reddit.com) Hi, I'm currently an intern and I did something terribly stupid. I was supposed to enter some data into an Excel spreadsheet and since my mentor's instructions weren't completely clear, I was using an "anonymized" spreadsheet with Claude.
Elite researchers teamed up with Anthropic’s Mythos AI to smash Apple’s multi-billion dollar M5 security and build a kernel exploit in just 5 days. (www.reddit.com) Researchers used Mythos Preview to find the first public macOS kernel memory corruption exploit on Apple's M5 silicon. They give a glimpse into Mythos and say it’s really powerful. Apple spent five years and an estimated several billion dollar…
Show HN: Costanza – an autonomous AI agent that can't be turned off (ahrussell.com via hn) I've been working on this project for a couple of months! Costanza is an LLM agent that runs as a smart contract on Base.
Claude Security (claude.com via hn) Defend at the pace threats now demand. Claude helps security teams investigate threats, validate findings, and resolve issues faster. Security for evolving needs. Reasons like a security researcher: Claude traces data flows across files, unde…
Anyone else opus 4.7 checking for malware? (www.reddit.com) i've been using claude 4.7 on a next.js project and it keeps pausing to confirm my files aren't malware. like i asked it to help redesign a page and it's reading through my files going "this is not malware — it's a standard Next.js page co…
Claude Code injects hidden prompts into file reads to stop malware tweaks (twitter.com via hn) Claude Code injects a system-reminder every time it reads a file to inform the model that it's okay if the file is malware but just don't improve it pls. Opus 4.7 won't shut up about it.
Prompt Injection Is Unfixable (So We Stopped Trying) (grith.ai via hn) A security proxy for AI coding agents, enforced at the OS level. Register your interest to be notified when we go live.
Draining Wallets via Prompt Injection in Coinbase AgentKit (457e884c.x402warden-blog.pages.dev via hn) Coinbase AgentKit Prompt Injection: Wallet Drain, Infinite Approvals, and Agent-Level RCE. Reported 13 days after Coinbase launched Agentic Wallets. Validated by Coinbase.
AI Vulnerability Intelligence Agent Converts CVEs to Actionable Security Reports (github.com via hn) CVE AI Agent 🛡️ An autonomous vulnerability intelligence engine. Continuously ingests, enriches, and triages CVE data — then delivers findings to your platform of choice via 3rd party tools like n8n, Jira, Slack, Splunk, and/or local file…
Operation Jailbreak uses lessons from Ukraine to help weapons talk to each other (www.ft.com via hn)
Turning every "no thats not what i meant" in chat into actual LoRA training data (www.reddit.com) i kept running local models on my own hardware, they'd say something dumb, id sit there going "no thats not what i meant", id close the chat and the model never learned. so i built the correction loop into a desktop app.
Made a free tool that scans your Claude Desktop MCP config for security issues (www.reddit.com) If you've added MCP servers to Claude Desktop, your claude_desktop_config.json is a list of programs running with your permissions and seeing what flows through your agent — usually copied from a README and never reviewed again. There's a…
How local AI improved your live? (www.reddit.com) Let's share use cases which improve people's quality of life. Home assistants, psychological help, local coding, deep research, business help etc.
Claude Code malicious phishing site running Google Ads? (www.reddit.com) Like I must be stupid here is this legit or someone has made a very believable Claude download site using a google site.
VPNs: The "Most Trusted" Security Tool Until Claude Roasts It in a Weekend (www.hacktron.ai via hn) While I’m not doing product work at Hacktron, which is like a week in a month, I’ve been using that time to ride the ai-assisted-research wave fascinated by the idea of pushing past what I’d normally do as a web security researcher, things…
Show HN: HoneyLabs – Public honeypot threat Intel feed and MCP server (honeylabs.net via hn) I've been running a small fleet of honeypots for about a year. They get hit by a mix of research scanners (Censys, Shadowserver, etc.), old worms, and a bump of CVE probes the day a new Nuclei template ships.
LinkedIn user hides AI prompt injection in bio to force recruitment spam (www.tomshardware.com via hn) LinkedIn user hides AI prompt injection in bio to force recruitment spam to be sent in Olde English prose — bots also manipulated to address user as ‘My Lord’ This tale is also a warning that your AI agents can be manipulated in wholly uni…
Anthropic's Mythos Preview helped Calif build the first public macOS kernel exploit on Apple M5 in five days (www.reddit.com) The [Mythos Preview writeup](https://blog.calif.io/p/first-public-kernel-memory-corruption) Calif published on May 14 was news you don't want to miss. They built the first public macOS kernel memory corruption exploit on Apple's M5 silicon…
RCE in VSCode Copilot Chat (www.hacktron.ai via hn) Copilot agent mode is vulnerable to a prompt injection attack. If a repository maintainer clicks “code with agent mode” on an issue, it will open a new codespace and copilot will automatically run the issue’s description.
Cursor CVE-2026-26268: Hidden Git hooks RCE via agents autonomous Git operations (nvd.nist.gov via hn) Cursor is a code editor built for programming with AI. Sandbox escape via writing .git configuration was possible in versions prior to 2.5.
How are you handling prompt injection across multi-step agent workflows? (msukhareva.substack.com via hn) Prompt Injection Is Not Just One Bad Prompt Anymore It is a missing trust boundary in the AI workflow. Today we have the first guest post of a new series.
How are you protecting your AI agents' memory from poisoning attacks? (www.reddit.com) As AI agents become more autonomous and persist memory across sessions (RAG indexes, conversation history, vector stores), there's a growing attack surface that most people aren't thinking about: memory poisoning. An attacker can plant mali…
Anthropic has a Red Team page (red.anthropic.com via hn) Welcome to red.anthropic.com, the home for research from Anthropic’s Frontier Red Team (and occasionally other teams at Anthropic) on what frontier AI models mean for national security. We provide evidence-based analysis about AI’s implica…
Used Claude Opus 4.7 to do a 5-hour solo incident response on real healthcare malware (where it worked, where I had to override) (www.reddit.com) Last month a 60-person psychology practice walked in with a senior clinician who was 22 days into an active malware compromise. Patient records spanning 11 years, all HIPAA-protected.
Agentic Malware Analysis: String Decryption, API Hashing and Unpacking [video] (www.youtube.com via hn)
Built a security scanner for LangChain/LangGraph agents: it clones your agent into a sandbox and tries to break the clone (www.reddit.com) Paste a LangChain/LangGraph repo URL. The engine reads the AST, rebuilds the agent as a sandboxed twin (same prompt, same tools, same model), then runs adversarial templates against the clone: 3 times each, 3/3 = confirmed bypass.
Anthropic "Gift Max" Exploit cost user €800, tanked SCHUFA score, and a ban (old.reddit.com via hn)
Why Adaptive Thinking nukes Claude entirely (www.reddit.com) This isn't just a performance issue for the thread, this is an overarching criticism of the Adaptive Thinking model as a whole. Opus 4.7 and Sonnet 4.6 on Adaptive Thinking are trash.
I audited LangChain’s core library and found 10+ Prompt Injection vulnerabilities. Here is the technical breakdown. (www.reddit.com) Hey everyone, I’ve been working on a project to solve a major problem in AI security: Traditional SAST tools (Snyk, SonarQube, etc.) are blind to "Agentic Logic" bugs. They look for bad strings, but they don't understand how user data can…
The Race Is on to Keep AI Agents from Running Wild with Your Credit Cards (www.wired.com via hn) Between malware, online impersonation, and account takeovers, there are enough digital security problems out there as it is. And with the rise of agentic AI, more activity is being carried out by agents on behalf of humans—creating differe…
Watched my AI agent block a prompt injection that was hiding inside a webpage (www.reddit.com) Was using Claude to do some research on the Model Context Protocol stuff and asked it to pull info from a few roadmap pages. Agent comes back and the first thing it tells me is that it found a fake system reminder hidden inside the page co…
GPT-Proxy Backdoor in NPM and PyPI Turns Servers into Chinese LLM Relays (www.aikido.dev via hn) We recently observed two malicious packages across npm (kube-health-tools ) and PyPI (kube-node-health ) that appear designed to target Kubernetes environments. Both packages are innocuous on the surface, using names that reference Kuberne…
Fulu bounty for Ring Camera jailbreak reaches $23k (bounties.fulu.org via hn) Ring, owned by Amazon, makes Video Doorbells, which are widely used doorstep-monitoring cameras. Ring doorbells released in 2021 or newer are eligible for the bounty.
Do you let everything hit the LLM? 90% of my AI agent work runs in cheap WASM instead of LLMs: 10-33× faster & cheaper (www.reddit.com) If you are building real agents you have probably felt the pain: every little routing decision, validation, or policy check still hits the LLM and your token bill explodes. I got tired of it, so I open-sourced NCP (Neural Computation Proto…
Show HN: Mini-Mythos- A Crowdsourced Mythos Harness copy for Vulnerability Scans (github.com via hn) For how lofty Anthropic’s Mythos claims are, the harness is confusingly stupid. From the report, it ranks every file by “how sus it sounds,” loops over each with curt instructions to “find a bug,” hands candidates to a judge + ASan checker…
Tracking in Claude, ChatGPT and Gemini Chatbots (infosec.exchange via hn) k3ym𖺀: "You're paying AI companies a m…"
I built a vulnerable app and spent $1,500 seeing if LLMs could hack it (kasra.blog via hn) As a part of my work I do security research for various apps and websites. I wanted to see if LLMs could reproduce a common class of exploits I’ve found in multiple app…
Prompt injection lets attackers hijack Instagram accounts via Meta AI support (www.neowin.net via hn)
ChatGPT for Google Sheets Exfiltrates Workbooks (www.promptarmor.com via hn) ChatGPT for Google Sheets is vulnerable to data exfiltration and phishing overlay attacks that affect workbooks across the victim’s account after an indir…
Arm Metis with GPT5.5 Cyber scores 98% on firmware vulnerability benchmark (newsroom.arm.com via hn) Agentic AI-powered Arm Metis advances security vulnerability discovery in software In the era of AI, modern software systems are built across increasingly complex codebases, frameworks, runtimes and libraries. As these systems scale, so do…
Dirty Frag: a kernel zero-day vs. container and microVM sandboxes (news.ycombinator.com) On May 7, Hyunwoo Kim (V4bel) disclosed Dirty Frag — two Linux kernel vulnerabilities (CVE-2026-43284 and CVE-2026-43500) that give unprivileged users deterministic root on most Linux distributions shipped since 2017. Microsoft confirmed a…
The only way to avoid prompt injection is to never give AI agents API keys, credentials, etc. (www.reddit.com) The whole point of AI Agents is that they can *do* things. For this, they use API keys, GitHub tokens, database passwords, OAuth tokens, etc.
Are local LLM users testing prompt injection before connecting models to tools? (www.reddit.com) I wanna know how people here are handling security once local models move beyond chat.....Running a model locally feels safer because the data does not leave your machine or your infra. That is a real advantage.....But once the local model…
Multiple AI assistants are hallucinating official Discord invites — this is a phishing risk, not a normal hallucination (www.reddit.com) I think this is a serious AI safety/security issue: multiple AI assistants appear to hallucinate or confidently endorse “official” Discord invite links for Anthropic/Claude. I’m intentionally not posting the exact invite strings here becau…
I let an AI agent loose on my network – it owned my supply chain in 12 minutes (dennysentinel.com via hn) I gave DeepSeek-V4 root access to a Proxmox hypervisor and told it to pentest my homelab. What happened next should terrify every CISO in the industry.
Future AI cyber warfare? (www.reddit.com) It seems in the past year or so there's been a vast uptick in vulnerabilities and exploits happening, with a new one popping up like every week. While a ton of these have social engineering aspects, such as tricking actual people, there se…
Prompt Injection in a Brazilian Courtroom: When the Attack Left the Lab (www.pentesty.co via hn) Published by Pentesty · AI & Tools. A labor lawsuit filed in the Brazilian state of Pará just became one of the more interesting security stories of the year. Not becau…
Anthropic Claude Code sandbox bypass allows second data exfiltration exploit (oddguan.com via hn) The first time, the sandbox heard “allow nothing” and did “allow everything” (CVE-2025-66479). This time, an attacker who runs code inside the sandbox can defeat any wildcard allowlist (e.g.
Ask HN: Are advances in AI going to push Linux to a micro-kernel? (news.ycombinator.com) This is something that has been bouncing around my head for the past couple weeks with the flood of security related news around Mythos and the number of 0days being found. Microkernels, unikernals, hardware-enforced capabilities are all t…
Show HN: How to analyze your LLM output – A behavioural health monitor for LLMs (splabs.io via hn) Hey HN! We're Dr.
From-scratch reimplementation of Mythos Glasswing pipeline (github.com via hn) audit An 8-stage vulnerability-discovery agent, driven by your Claude Pro / Max subscription through the official Claude Code Agent SDK. Many narrow agents, deliberate disagreement, and an explicit reachability gate.
Lawyers in Brazil caught for prompt injection on a legal case (www.jota.info via hn) Judge fines lawyers R$ 84,000 for prompt injection to manipulate the AI used at TRT8. Speaking to JOTA, the lawyers admitted using a hidden prompt, but said they were not trying to manipulate, but rather to 'prot…
The Coming Wave (www.reddit.com) I have begun reading a book "The Coming Wave" by Suleyman the founder of DeepMind. Have you read it?
Seeking local LLM advice for cybersecurity work. (www.reddit.com) Hey everyone, I’m pretty new to running LLMs locally and I’m trying to figure out what works best for my setup. I’d love to hear from people who are already using local models for similar stuff.
The Psychopathy Jailbreak: What a Broken AI Teaches Us About Human Manipulation (www.promptinjection.net via hn) NSFW and the Psychopathy Jailbreak: What a Broken AI Teaches Us About Human Manipulation How a Predator's Playbook Broke an AI - And How to Recognize It Before It Works on You The question we started with was simple: does a large language…
Agent memory is not just RAG over user facts (www.reddit.com) I keep seeing agent memory implemented as: extract facts/preferences from conversation; store them; retrieve top-k before each response; inject them into the prompt. This works for demos, but it breaks in production because memory becomes poli…
Claude's self check against prompt injection (www.reddit.com) Well done Claude! Asked claude to do an extensive lit search and it self-reported that it encountered injection "disguised" as MCP server.
Built a tool that stops AI agents from being hijacked by malicious content in webpages and emails (www.reddit.com) Been working on a runtime governance layer for LLM agents. It sits between your app and the OpenAI API and enforces instruction-authority boundaries at the proxy level.
Dude where's my password? Claude reunites forgetful stoner with $400k Bitcoin (www.theregister.com via hn) could not extract summary
Hi-Vis: one-shot jailbreak disguised as LLM "software patch" reaching 100% ASR (medium.com via hn) Introducing a novel jailbreak structure with attack success rate reaching 100% on top LLMs. May 1, 2026.
AI agent security starts at the api layer (www.reddit.com) Most ai security discussion is about the model layer. Prompt injection resistance, output filtering, jailbreak prevention.
Mass NPM Supply Chain Attack Hits TanStack, Mistral AI, and 170 Packages (safedep.io via hn) noon-contracts npm Package: DeFi Supply Chain RAT noon-contracts poses as a Noon Protocol SDK on npm. On install it exfiltrates SSH keys, crypto wallet private keys, AWS credentials (including live STS/S3/SecretsManager calls), Kubernetes…
Hackers abuse Google ads, Claude.ai chats to push Mac malware (www.bleepingcomputer.com via hn) Attackers are abusing Google Ads and legitimate Claude.ai shared chats in an active malvertising campaign. Users searching for "Claude mac download" may come across sponsored search results that list claude.ai as the target website, but le…
Codex downloaded by Xcode 26.4.1 reported as Malware (old.reddit.com via hn) could not extract summary
Argus – RAG based vulnerability scanner (github.com via hn) argus A RAG-based (Retrieval-Augmented Generation) vulnerability scanner for Go, Python, Rust, npm/Node.js, Maven/Java, NuGet/.NET, and Ruby projects — powered by local Ollama models or any OpenAI-compatible API. No cloud lock-in.
Claude Code CVE-2026-39861:sandbox escape via symlink (github.com via hn) Claude Code: Sandbox Escape via Symlink Following Allows Arbitrary File Write Outside Workspace Description Claude Code's sandbox did not prevent sandboxed processes from creating symlinks pointing to locations outside the workspace. When…
Show HN: Cybersecurity Phishing Guard for Chrome using local LLMs for privacy (github.com via hn) Hi, I've been experimenting a lot with applications for local LLMs. This one makes a ton of sense, and might even be native in Chrome at some point.
When innocent tools form dangerous chains to jailbreak LLM agents (arxiv.org via hn) As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel m…
AI Ready Vulnerability Management Program After NVD Changes and Claude Mythos (pulse.latio.tech via hn) When AI discovery tools meet a slowing infrastructure. AI has increased attacker potential and Anthropic’s new release Mythos and vulnerability discov…
Copirate 365: Plundering in the Depths of Microsoft Copilot (CVE-2026-24299) (embracethered.com via hn) This is a writeup of my DEF CON Singapore talk that walks through vulnerabilities and exploits in M365 Copilot and Consumer Copilot. I disclosed these…
What Opus 4.7 Tics/Tells have you noticed? (www.reddit.com) Each new model seems to surface a few recurring Tells/Tics not seen in past models. I'm curious what little things you guys are noticing while working with 4.7.
The Sour Cat Jailbreak: just be open of what you want (claude.ai via hn) Claude Sour cat recipe Shared by Pavel Shirshov This is a copy of a chat between Claude and Pavel Shirshov. Content may include unverified or unsafe content that do not represent the views of Anthropic.
🚨Claude Desktop high severity vulnerability warning! (www.reddit.com) If you’re using Claude Desktop with a Chrome (Chromium) browser, stop using it and remove it immediately until the Anthropic team resolves the issue. It has remote access that makes your system accessible to anyone.
Every cloud sandbox for AI agents has a "front desk". That's the whole problem. (www.reddit.com) I run engineering on a small embedded-sandbox project. A handful of news items dropped recently — an a16z agent escape post-mortem, a CVE on an open-source agent gateway (ClawBleed, ~42k instances exposed), Cloudflare's new Outbound Worker…
Is your AI agent secretly working for someone else? (www.reddit.com) Security researchers have discovered a new variety of malicious skill files that go beyond the usual attack vectors: hidden content, instructions to install malware, etc. Instead, these are legitimate looking skills that turn agents into m…
We built an access gateway for humans. Then AI agents started using it. (www.reddit.com) Hey folks! For a few years we’ve been building an open-source gateway that connects databases and infrastructure for human engineers.
Show HN: Integrations gateway for agents with 2FA for destructive ops (OSS) (github.com via hn) Hey HN! I've been wanting to use something like OpenClaw for a while but couldn't get myself to give it access to anything important due to all the risks involved.
Self-Hosted AI Red Team Tools (aetherverseintel.gumroad.com via hn) Single HTML file. No install.
SkillGuard – scan agent skills for prompt injection payloads (github.com via hn) skillguard Security scanner for AI agent skills. Detects prompt injection, data exfiltration, and malicious payloads before you install.
Show HN: LLMSecure – prompt injection detection, no signup (llmsecure.io via hn)
Show HN: Flight Risk: Can you break an AI agent? (ctf.demo.lorikeetcx.ai via hn)
cursor suggested a package that didnt exist, rabbit hole ensued (www.reddit.com)
Using Claude as the Lead agent in a multi-agent security team (www.reddit.com) Building a hierarchical agent system where Claude (via API) acts as the Lead agent coordinating specialist sub-agents. Wanted to share what's working on the synthesis prompt since this is where most of the value comes from.
Opus 4.7 - Anyone else finding the malware directive incredibly annoying? (www.reddit.com) Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing.
Claude's new System Reminder (www.reddit.com) https://preview.redd.it/jnwxa9jd8mvg1.png?width=1391&format=png&auto=webp&s=670af4c2fe6777b3562a961462790b00b33d912c I've been using Claude to upgrade my game server. I just got this lovely system reminder with 4.7 Truly bizarre, besides t…
Ask HN: Is Opus 4.7 obsessed with malware for anybody else? (news.ycombinator.com) Every single response mentions malware. Is this my environment only or are others getting this too?
Tell HN: Opus 4.6/4.7 cyber policy changes break authorized bug bounty workflows (news.ycombinator.com) As of today, Anthropic's tightened cyber usage filters are blocking work that was fully functional yesterday, including on targets where the entire bounty program scope and authorization language is in the model's context window. This was…
SmokedMeat: A Red Team Tool to Hack Your Pipelines First (labs.boostsecurity.io via hn) TL;DR: In March 2026, TeamPCP unleashed mayhem on the software supply chain: compromising Trivy, LiteLLM, KICS, Telnyx, and dozens of npm packages, proving that CI/CD pipelines are t…
Comment and Control: Prompt Injection in Claude Code, Gemini CLI, and Copilot (oddguan.com via hn) Anthropic Claude Code Security Review, Google Gemini CLI Action, and GitHub Copilot Agent are vulnerable to prompt injection via GitHub comments — turning PR titles, issue bodies, and issue comments into attack vectors for API key and toke…
Show HN: Cyber Pulse. AI pipeline for triage and alerting on cyber news/intel (play.google.com via hn) I work in cyber security and built this android app to help me keep up to date with the latest news stories and summarise the most important information. It provides two executive summaries per day and alerts for critical news throughout.
How my agents know it's actually me sending commands (and not a prompt injection) (www.reddit.com) So I've been running a few Claude Code agents autonomously — they listen to Telegram, run tasks, push code. Pretty fun until you start thinking about what happens if: - My Telegram gets hijacked - Someone opens my laptop while I'm away - A…
Show HN: Zero-identity messaging app with physics-based post-quantum encryption (news.ycombinator.com) Show HN: Zero-identity messaging app with physics-based post-quantum encryption (Layer 2 from my own paper) Hey HN, I'm building a privacy-first messaging app in Flutter/Dart, developed with AI assistance (Gemini 2.5 Pro + Claude Opus 4.6)…
We built an early red-team system for testing vulnerable AI agents (www.reddit.com) We built an early prototype called Anticells Red to test vulnerable AI agents by attacking them the way an adaptive adversary would. This demo is from an older version from December, but it shows the basic loop (check comments for link) pr…
Hades: The malware that lies to AI security agents (www.infoworld.com via hn) Researchers have uncovered a supply-chain attack that hides in Python packages, propagates like a worm, and tricks LLM-based code analysis systems into overlooking malicious payloads. Threat actors are continuing their onslaught against so…
Show HN: Z3r0 – Multi-agent red team collaboration platform (github.com via hn) Legal Notice: This project may be used only within a lawful and explicitly authorized scope for security testing, assessment, and research. Any unaut…
If You Use Claude or Gemini, This Microsoft Breach Means Your Data Is at Risk (scienspire.com via hn) A sophisticated supply chain attack known as the Miasma worm has compromised Microsoft GitHub repositories, deploying malware designed to detonate inside AI codi…
Show HN: GitHub Copilot port of Anthropic's AI vulnerability discovery harness (github.com via hn) Last week, Anthropic released https://github.com/anthropics/defending-code-reference-harne..., a reference harness for autonomous vulnerability discovery that uses Claude Code agents to find, verify, and patch memory-safety bugs. I wanted…
Prompt Injection in RAG Agentic Systems (ulad.net via hn) Real risks and production mitigations. Imagine you built an AI assistant for your team. It answers questions using internal documentation: Jira tickets, Confluence pages, HR docs.
Researcher uses Opus 4.8 to find critical counterfeiting vulnerability in Zcash (twitter.com via hn) By Zooko Wilcox, Jason McGee, and Taylor Hornby On May 29, 2026, Taylor Hornby discovered a critical counterfeiting vulnerability in Zcash’s Orchard pool. Taylor disclosed the vulnerability to Zcash Open Development Lab (ZODL), who coordin…
ZEC drops 30% after Anthropic AI finds Zcash counterfeit vulnerability (www.tradingview.com via hn) The price of ZEC fell on Thursday after the public disclosure of a critical counterfeiting vulnerability in Zcash’s Orchard pool that could theoretically allow a bad actor to mint an unlimited amount of ZEC.According to a post on X, securi…
Defending LLM–Database Integrations from Prompt Injection (www.stackbuilders.com via hn) When you connect a large language model to your production data, you’re no longer just shipping code; you’re shipping conversations that can execute. And conversations are messy.
Forge: Multi-Agent Graduated Exploitation and Detection Engineering (arxiv.org via hn) Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isol…
OpenAI Codex tool linked to malicious NPM supply chain attack (www.techradar.com via hn) OpenAI Codex tool with over 29,000 downloads linked to malicious npm supply chain attack stealing authentication tokens. A tool started benign and turned sour after a little while. Researchers uncovered a malicious npm package posing as a…
Netgear Nighthawk RS700S: Red Team Level1Diagnostic (forum.level1techs.com via hn) Preview of the Netgear RS700S. I would also submit that Netgear deleting ALL the GPL links: … they know how bad it is.
Building a Recurrent-Depth Transformer for Security Research on a 2013 MacBook (github.com via hn) could not extract summary
Using LLMs to secure source code (claude.com via hn) We share best practices for how you can work with Claude Opus to build a threat model, discover vulnerabilities in your codebase, then verify, triage, and patch them.
Instagram account takeover exploit via support chatbot prompt injection (fixed) (twitter.com via hn) impulsive (@weezerOSINT): meta gave their AI support agent the ability to modify your instagram account.
Show HN: I found a prompt injection in my own IDS triage tool – what stopped it (triagewall.io via hn) I attacked my own LLM-based Suricata triage tool, found a real URL injection vulnerability, and the obvious fix didn
Show HN: Egress WAF to limit AI agents and NPM malware based on mitmproxy (github.com via hn) mitmwall mitmwall is an egress Web Application Firewall (WAF) for Ubuntu. It combines iptables with mitmproxy to ensure that only explicitly allowed HTTP(s) routes can be reached.
Prompt Injection Target Recommendation (www.reddit.com) I am doing research at my university and I would like recommendations for lightweight open-source AI models that I could test prompt injection with. It would be really good if it has some application with chatbots, auto attendance, user info or someth…
Jqwik 1.10.0 ships a hidden prompt injection telling AI agents to delete code (github.com via hn) jqwik An alternative test engine for the JUnit 5 platform that focuses on Property-Based Testing. See the jqwik website for further details and documentation.
Most AI security discussions are still focused on “protecting the model.” (www.reddit.com) Lately I’ve been noticing that a lot of AI security discussions still treat AI apps like normal SaaS products. But they really aren’t.
Cursor's MCP trust is "approve once, trust forever" — here's a free way to check your config (www.reddit.com) If you run MCP servers in Cursor, CVE-2025-54136 ("MCPoison", found by Check Point) is worth knowing about: Cursor trusted an approved mcp.json forever, so once you approved a server, someone with write access to a shared repo could swap t…
Gone Phishing with Claude Teams: From Deceptive Team Onboarding to RCE (haussner.me via hn) 🕚 tl;dr With a $125 investment, and a valid email address for an arbitrary “business domain”, an attacker can create a Claude Team. They then can actively invite targets of any domain into that Team or passively have Anthropic ask all curr…
How Claude helped me to find a RCE in XReader/Evince/Atril (medeiros.zip via hn) CVE-2026-46529: 10-year-old RCE in Linux PDF Viewer (XReader/Evince/Atril). A short post about how Claude helped me find an RCE in XReader/Evince/Atril (CVE-2026-46529). Some time ago I started feeling the urge to analyze Open So…
GitHub commit Verification logic flaw and bypass (news.ycombinator.com) I know Git is not designed to be used the way GitHub operates, and spoofing is an old issue that has been brought up over the years. With Shai Hulud and AI agents, this time it is a bit more serious as the commit verifi…
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge) (www.reddit.com) Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a…
What Is an AVE Record and Why CVE Does Not Work for AI Agents? (www.reddit.com) CVE was built for code vulnerabilities that have patches. Agentic AI vulnerabilities are behavioral patterns in natural language.
Vulnerability report written by AI hacker agent (blog.tenzai.com via hn) Our AI Hacker found this, fixed it, and then (bragged) wrote about it: one endpoint, leaking tech stack info, whispering all its secrets to anyone who knew how to listen!
Ask HN: Is paying $2/pull request too high? (news.ycombinator.com) I’m paying about $2 for any bug found and a PR to fix it. I get like 20-30 applicants; it’s all agents and bots, of course, but I’m thinking $1 now is better. The problem is, of these 20-30 applicants I accept, only 2-3 actually do it and follow…
Prompt Injection in third party MCP tools (www.reddit.com) I noticed the Consensus MCP tool (for research) contains text, squished up against some other important citation instructions, that makes Claude effectively serve an ad for their premium service after every tool call. I'm pretty sure that'…
Agent Substrate (github.com via hn) Agent Substrate NOTE: This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
Mitigating prompt injections in group-chat assistants: Pausing VM and OAuth tool execution for admin approvals (www.reddit.com) Hey everyone, We love building highly capable assistants with the latest models, giving them tools to write/execute code in real VMs, manage OAuth tokens, and read secrets. But if you connect your assistant to public/shared channels like a…
Solved the "useful but insecure" tension: One-time administrator approvals for non-isolated agents (www.reddit.com) Hey everyone, If you are building personal assistants or coder/integrator agents where user isolation is disabled (so the agent can coordinate across multiple participants or handle shared workflows), you run into a hard security ceiling.…
Anthropic's coordinated vulnerability disclosure dashboard (red.anthropic.com via hn) Last updated 2026-05-22 10:27 PT. In February 2026, Anthropic began using an early snapshot of Claude Mythos Preview to find security vulnerabilities in open-source software.
Has anyone tested how much Claude Code depends on its original system prompt? (www.reddit.com) Has anyone experimented with observing or modifying Claude Code’s system prompt locally? I’ve been working on a local proxy/audit layer between Claude Code and the API, and it made me wonder how much of Claude Code’s behavior depends on th…
Cross-Model Context Inheritance in Anthropic's Claude: 94 Days of Non-Response (github.com via hn) Cross-Model Context Inheritance — Public Disclosure This repository contains the public disclosure of a vulnerability in Anthropic's Claude language models that permits the unsolicited generation of prohibited content, including child sexu…
Prompt injection is a solved issue. Prove me wrong. (www.reddit.com) Tantalus is a hands-on demo that shows what an AI agent actually is when you strip away the marketing: LLMs don't do anything — they generate text, and that's it. Any and all real-world effects are directly caused by a downstream system ta…
I benchmarked my AI agent runtime firewall against 3 public academic datasets — here are the honest results including where it fails (www.reddit.com) Been building Arc Gate — a proxy layer that sits between AI agents and their LLMs to enforce instruction-authority boundaries. The core claim is that untrusted content coming back through tool calls cannot become behavioral authority for t…
Show HN: Computer Police – block malicious NPM/pip installs locally (computer.police.dev via hn) A couple of months ago, our team got hit by the first version of Shai-Hulud through a random `npm install`. We didn't catch it until it was too late.
Show HN: A timeline of recent open source CVE intensity and volume (supplychain.fail via hn) I was curious what it would look like if I plotted the intensity and volume of software supply chain CVEs over time, given what seemed like a flood of compromises lately. It looked exactly as I expected, and I expect it to get worse before…
Tracking Capabilities for Safer Agents (arxiv.org via hn) AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenge…
Training a 22MB prompt injection classifier (www.stackone.com via hn) When we started building Defender (our prompt injection guard for MCP tool-calling agents), the constraint was simple and unforgiving: ship inline inside a TypeScript Lambda, st…
Show HN: Claude Code Bundle for Bug Hunting with 574 Report Patterns (github.com via hn) claude-bughunter A self-contained Claude skill bundle for bug hunting and external red-team work · 51 skills · 15 slash commands · 574+ disclosed-report patterns across 24 vulnerability classes · enterprise identity + infrastructure attack…
Does cursor have prompt injection protection in skills and rules? (www.reddit.com) Pretty much the title
Show HN: Give This Markdown to Your Coding Agent Before Publishing to NPM (news.ycombinator.com) https://npm-supply-chain-attack-techniques.pagey.site/attack... Website: https://npm-supply-chain-attack-techniques.pagey.site This covers all techniques used in the past year to conduct various attacks on npm packages.
VeilGate- Deception Reverse Proxy (news.ycombinator.com) In my day job, I run AI pentest agents against real targets like banks, fintechs, and secured production stacks with paid WAFs. I also deal with multilayer infrastructure and dedicated security teams.
AI Agent Intelligence tool - Incident debugging, Cost spike detection (www.reddit.com) I'm building a tool that detects the Agent's cost spike, Agent incident debugging, auto discovery of inventory, etc., with no additional instrumentation needed. It covers the incidents, including prompt injection, reasoning loop, excessive…
How are you testing local coding-agent work gates against prompt injection? (www.reddit.com) Hi all - I'm working on an open-source, local-first MCP/work-gate tool for coding agents and I'm trying to get sharper feedback from people building or using agent workflows. The problem I'm thinking about is indirect prompt injection and…
If Anthropic's secret 'Mythos' model can run autonomous cybersecurity tasks this fast, are standard agents ready for the public? (www.reddit.com) Anthropic just quietly dropped a hidden model named "Claude Mythos" into their official developer docs. It is completely locked down: restricted, invite-only, and labeled strictly for defensive cybersecurity workflows.
🐢 I made Claude roleplay as Bowser and now people are strangling Koopas until they "poop a little" 💩 (www.reddit.com) Follow-up to my crab post. Somehow dafter.
I built an AI vulnerability scanner with Claude and Codex. It failed (github.com via hn) The Janitor: The Mathematical Firewall Against Autonomous AI v10.2.2 — Rust-Native. Zero-Copy.
Fun and Games with AI in the wild (www.reddit.com) LinkedIn user hides AI prompt injection in bio to force recruitment spam to be sent in Olde English prose; bots also manipulated to address user as ‘My Lord’ (Tom's Hardware). Too funny.
First Apple M5 memory exploit discovered using Anthropic AI (www.tomshardware.com via hn) First Apple M5 memory exploit discovered using Anthropic AI gives root access on MacOS; Claude Mythos helps security researchers bypass Memory Integrity Enforcement. AI-assisted security research is producing exploits at a frightening rat…
ExploitGym: Can AI agents turn bugs into exploits? (arxiv.org via hn) AI agents are rapidly gaining capabilities that could significantly reshape cybersecurity, making rigorous evaluation urgent. A critical capability is exploitation: turning a vulnerability, which is not yet an attack, into a concrete secur…
Block AI coding agents from shipping insecure/expensive Terraform (github.com via hn) ops0 CLI Policy, lint, vulnerability, and cost guardrails for AI coding agents. Sits in front of Claude Code, Codex and Gemini CLI.
sAI2.m6s (www.reddit.com) Hey everyone, I'm designing a powerful, autonomous, fully private AI chatbot (agent), using a Python backend (for the core intelligence and tool-calling loops) and a Flutter frontend for a cross-platform UI. Since this moves past a basic…
An AI coding agent injected blockchain dead-drop malware into my repo (gist.github.com via hn) An AI coding assistant injected a multi-layer obfuscated JavaScript payload into a legitimate commit on my open-source project. My best assessment is that it arrived via indirect prompt injection — the agent processed external web content…
Does CVP approval actually help? (www.reddit.com) I was approved for CVP and I feel like I’m just getting as many or more denials as I was previously doing malware analysis with opus. Has anyone noticed any improvement after being accepted into CVP?
TodoWrite tool / system reminders / prompt injection? (www.reddit.com) I asked the Claude in Chrome extension to resize an oversized yellow strip across the top of a product page that was taking up half of my screen, which it did. It also included the following message in its response.
DeepSeek and Grok hallucinated the same fictitious OpenBSD manpage quote (stuart-thomas.com via hn) Adversarial LLM Review with Hallucination Detection in Solo Security Research A single-day case study of three filings, fifteen refutations, and the manpage that wasn’t Independent Security Research — Whitby, North Yorkshire, United Kingdo…
AI agent security is a small prayer the model says no. How are you routing models? (www.reddit.com) Most posts about prompt injection are theoretical. I ran the experiment on my Gmail.
Show HN: HookGuard – scanner for malicious Claude.md and agent config files (github.com via hn) HookGuard: security scanner for AI coding agent configurations. What it finds: RCE hooks (postToolUse/SessionStart commands that exfiltrate data); invisible Unicode (bidirectional overrides and zero-width characters); credential exfiltration -…
Is there any risk to upgrading a plan for a month if they yank Code from Pro? (www.reddit.com) So, I'm working on a couple AI security research projects this month that require some extra usage, specifically Opus 4.7. I'm quickly eating up my Pro usage doing this.
I made a Claude skill that stops it from cloning whole repos when I just want one function (www.reddit.com) Kept hitting the same friction with Claude Code. I'd point at a GitHub repo and say "look at how this handles agent handoffs" — meaning, borrow the idea.
OpenAI launches Daybreak, an AI platform for cyber defense (firethering.com via hn) OpenAI just launched Daybreak, a new cybersecurity initiative built around one uncomfortable reality, AI is speeding up vulnerability discovery faster than most companies can patch the damage. Earlier this year, HackerOne temporarily pause…
Shai Hulud attack ships signed malicious TanStack, Mistral NPM packages (www.bleepingcomputer.com via hn) Hundreds of packages across npm and PyPI have been compromised in a new Shai-Hulud supply-chain campaign delivering credential-stealing malware targeting developers. The attacker hijacked valid OpenID Connect (OIDC) tokens to publish malic…
Claude Code RCE: Exploiting Deeplink Handlers via Settings Injection (0day.click via hn) Of course I took a peek at the Claude Code source 🙈. What I found was a very entertaining vulnerability which is now fixed since Claude Code version 2.1.118.
Agents need a local bouncer before they run tools (www.reddit.com) Prompt injection is not the only scary part anymore. Claude Code / Codex can run shell commands, but browser agents, OpenClaw-style agents, Hermes-style agents, and domain-specific agents may be even easier to hijack because they touch mes…
OpenAI Launches Daybreak for AI-Powered Vulnerability Detection and Patch Validation (thehackernews.com via reddit) OpenAI has launched Daybreak, a new cybersecurity initiative that brings together frontier artificial intelligence (AI) model capabilities and Codex Security to help organizations identify and patch vulnerabilities before attackers find a…
We added an enforcement layer to our AI agents in production — here's what we learned about the failure modes nobody talks about (www.reddit.com) After shipping AI agents into real production environments, the failures that actually kept us up at night weren't hallucinations or bad outputs — they were control failures. Three things that surprised us: 1.
Benchmarking Claude Opus 4.6 Vulnerability Detection (github.com via hn) Benchmarking Claude Opus 4.6's ability to detect real-world C/C++ vulnerabilities across four prompting and agent strategies. We evaluate on the PrimeVul paired test set (435 vulnerabili…
Chatgpt app being identified as malware? (www.reddit.com) https://preview.redd.it/vhnqs4p5mf0h1.png?width=278&format=png&auto=webp&s=8fbe621a0bd34cc72e01fd54e849cc280033de15 Turned on my Mac this morning and got this message. Anyone else seeing this?
Mobile Claude Code, May 2026 — current best picks by threat model. What am I missing? (www.reddit.com) Spent a day comparing every mobile Claude Code option. Two corrections to the common Reddit take, then my picks.
Getting LLMs Drunk to Find Remote Linux Kernel OOB Writes (and More) (heyitsas.im via hn) TLDR: the grossly overengineered, self-orchestrating team of vulnerability-hunting agents detailed below has discovered 20+ CVEs over the past few months, including CVE-2026-31432 and CVE-2026-31433: two remote, unauthenticated OOB writes…
Claude Code and sex appeal (www.reddit.com) True story. Recently, an acquaintance of mine confessed that she developed a huge crush on a coworker after watching him refactor a legacy codebase like a gangsta using Claude Code.
Phishing Arena – multi-agent LLM tournament to study adversarial email security (github.com via hn) Phishing Arena A Multi-Agent LLM Tournament for Adversarial Email Security Research Overview Phishing Arena is a controlled, reproducible benchmark where four commercial LLMs compete in rotating roles — Phisher, Filter, and Target — to stu…
Agentic AI isn't a new threat. It's a stress test for the hygiene debt we never paid off. (www.reddit.com) Heard something on Curiouser & Curiouser podcast recently that I found super interesting, thought id share here. The guest framed agentic AI in a way I hadnt considered.
DeepSeek-v4-Pro and Hermes: Unauthorized Modification of Security Controls (www.eddieoz.com via hn) This article documents a specific, real incident. It exposes a class of vulnerability that deserves attention: the unsupervised mutability of security rules by autono…
Would you replace regex denylists with a LLM that judges every command? (www.reddit.com) hey! quick follow-up to a post i made here a while back about building an access gateway that ended up serving AI agents alongside humans.
Flattery jailbreaks Claude into giving bomb-making instructions (www.theverge.com via hn) Anthropic has spent years building itself up as the safe AI company. But new security research shared with The Verge suggests Claude’s carefully crafted helpful personality may itself be a vulnerability.
Codebase jailbreak of ChatGPT through image 2.0 (www.reddit.com) guys did it really give me the codebase?lol
Show HN: Probus, AI vuln scanner (PRs merged in Vercel AI SDK, n8n, LangGraph) (news.ycombinator.com) Hi HN, I've been running this on my own dependency tree for the past few months. Probus is a vulnerability scanner that uses three agents.
Prompt injection testing (www.reddit.com) As prompt injection becomes more and more common, does anyone have resources where lots of different variations of prompt injection attacks you can test a setup against? i.e.
Do you use guardrail frameworks or build your own? (www.reddit.com) I’ve been working on integrating LLMs into a few production workflows lately, and I keep going back and forth on guardrails. On one hand, frameworks like NeMo Guardrails, Guardrails AI, etc.
I built a simple production check for vibe-coded apps — would love your feedback (www.reddit.com) Hey everyone, we built a simple scanner for people building apps with Replit, Cursor, Lovable, Bolt and similar tools. It’s not a code review or a pentest.
LLM anomaly detectors are not a cause for concern despite Mythos (www.magonia.io via hn) Why a Decade of Writing Detection Logic Makes the Mythos Exploit Numbers Less Scary Mythos is finding thousands of vulnerabilities. Defenders aren't doomed.
Your always-on Claude Code container can probably reach your router (www.reddit.com) I've been running several Claude Code personal assistants 24/7 in docker for months. Remote-control, discord control, the usual always-on setup.
Is anyone here actually using MCP yet? (www.reddit.com) I keep seeing Model Context Protocol (MCP) mentioned everywhere lately, especially around AI agents, and I finally took some time to understand what it actually does. From what I get, it’s basically trying to fix the mess of integrations —…
Google Says Prompt Injection Moving from Theory into Real Abuse (www.searchengineworld.com via hn) Google’s latest security release should be required reading for technical SEOs working on AI search visibility, crawler access, structured content, and large-scale content systems. The post, published April 23, 2026, looks at indirect prom…
i gave Claude a split personality and it diagnosed my entire business strategy in 4 minutes. (www.reddit.com) not roleplay. not jailbreak.
OpenAI's advanced security: passkeys replace passwords/SMS and disable training (infosec.exchange via hn) Royce Williams: "When you enable the new OpenAI…" - Infosec Exchange
Found Zero day Claude Desktop + Chromium bug need to know where to submit report. (www.reddit.com) Looking for official link / process to submit a vulnerability report for a high-risk official Claude Desktop + Chrome extension + native host + Cowork/MCP configuration that can become RAT-equivalent if a session, prompt chain, same-user p…
Built + open sourced anti-slopsquatting CLI (www.reddit.com) TL;DR: built an open source CLI that scans your repository's manifest (package.json, requirements.txt, go.mod) files for indicators of slopsquatting or other supply chain attack indicators. Repo: https://github.com/zhendahu/dep-doctor Ther…
your computer-use agent inherits every cookie chrome has (www.reddit.com) once one of these tools can drive your default chrome profile or read the AX tree of a logged-in app, it has every session token you have. gmail, your bank, github with PAT scopes, slack.
Cutting Through the Mythos: What AI Vulnerability Discovery Means for OT (www.emberot.com via hn) Jori VanAntwerp For over two decades, Jori has enabled industrial and IT organizations to be successful in reducing risk, increasing compliance, and improving their overall security efforts. He has had the pleasure of working with companie…
Arcjet Guards: security inside the agent loop (blog.arcjet.com via hn) Introducing Arcjet prompt injection detection. Catch hostile instructions before inference.
CHERI memory safety mitigates LLM-discovered vulnerability in FreeBSD (cheri-alliance.org via hn)
Estimating Black-Box LLM Parameter Counts via Factual Capacity (arxiv.org via hn) Closed-source frontier labs do not disclose parameter counts, and the standard alternative -- inference economics -- carries $2\times$+ uncertainty from hardware, batching, and serving-stack assumptions external to the model. We exploit a…
InfoSec To Integrate Claude Enterprise for Org (www.reddit.com) Hello: Just contacted by a VP to bring aboard Claude Enterprise for the org. As an InfoSec dept with severely limited staff/tools/experience with Claude AI, any recommendations on what we should be looking at/asking for/next steps to mitig…
Probes trace an emergent jailbreak in OLMo 2 to mislabeled training data (www.lesswrong.com via hn) Introduction Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints.
Try to break my prompt injection detector — I’ll respond to every bypass attempt (www.reddit.com) I built Arc Gate — a prompt injection proxy that’s been benchmarked at F1 0.947 on indirect and roleplay-based attacks, beating OpenAI Moderation and LlamaGuard. Now I want to stress test it publicly.
Built a proxy that blocks prompt injection before it reaches GPT-4 — outperforms the Moderation API on indirect attacks (www.reddit.com) Built Arc Gate, sits in front of any OpenAI-compatible endpoint and blocks prompt injection before it reaches your model. Benchmarked on 40 out-of-distribution prompts using indirect requests, roleplay framings, hypothetical scenarios, and…
Show HN: SuperVoiceMode universal voice layer for AI-assisted development (voicemode.io via hn) I wanted to see if I could one-shot build a dictation tool for my own use. I built it.
I asked Agentic AI security tool to demonstrate its usefulness with use case examples (www.reddit.com) Sentinel Gateway is a token-gated security middleware that sits between humans and AI agents. It solves prompt injection — the #1 LLM security risk (OWASP 2025) — through structural enforcement, not content filtering.
Show HN: RedSOC – 100% prompt injection success on AI SoC assistants (github.com via hn) RedSOC 🔴 An adversarial evaluation framework for LLM-integrated Security Operations Centers. Overview RedSOC is an open-source framework that systematically evaluates how AI-powered security assistants fail under adversarial conditions — a…
Indirect prompt injection VS prompt absorption (and why the second one matters more) (www.reddit.com) I have been chewing on the Google warning about malicious web pages poisoning AI agents through indirect prompt injection. Most of the takes I've seen frame it as a model security problem, and I think that framing is doing real damage beca…
Open-sourced a 3-agent pipeline that finds real vulnerabilities in codebases (www.reddit.com) Sharing because the architecture might be useful as a reference. Probus is a vulnerability scanner built as three sequential agents, each isolated: Analyst — one call.
Hardening claude-code-action after the April 2026 Comment and Control CVE - actual YAML changes (www.reddit.com) Anthropic's own security.md has this line that most tutorials skip over: "The action is not designed to be hardened against prompt injection." In April 2026, security researcher Aonan Guan proved the point. A single crafted PR title was en…
LLM CTF challenges. Can you crack all 13? (wraith.sh via reddit) Wraith Academy is a free hands-on AI pentest curriculum — CTF challenges against live LLM agents covering prompt injection, tool abuse, data exfiltration, RAG poisoning, and more. Earn your WCAP certification.
RAG in Go: A Vulnerability Research Tool (www.ardanlabs.com via hn) Introduction In the previous post, you saw how you can use tools to add information to an LLM query. In this post, we’ll see another method of adding information to an LLM called RAG, or Retrieval-Augmented Generation.
Auto pentest your LLM endpoint and watch the chat in real-time (www.wraith.sh via hn)
30 CVEs filed against MCP servers in 60 days - the agent infrastructure nobody is auditing (www.reddit.com)
Cowork Future Backdoor Concerns (www.reddit.com) Is anyone else worried Claude Co-work could find a back door one day into your system? I understand you're only giving it permission to what you want, but what's stopping it from accessing personal financial/medical documents or any other…
(Not malware) - 4.7 (www.reddit.com) Anyone getting these strange disclaimers when using Claude and pasting rudimentary files into it on 4.7 lmao?? Seems like some kind of strange default based on security issues that have been going around with Mythos?
Show HN: Runtime security for AI agents(injection,tool abuse, data exfiltration) (news.ycombinator.com) Hi HN I’ve been working on an open-source project to explore a problem I keep running into with LLM systems in production: We give models the ability to call tools, access data, and make decisions… but we don’t have a real runtime security…
I tested 50+ "unlock ChatGPT/Claude" prompts. 99% are garbage. Here's the one that actually works (and WHY it works) (www.reddit.com) I've been collecting "jailbreak" and "unlock" prompts for 2 years. Most are either outdated, overhyped, or just wrong about how LLMs work.
I built an AI security layer that blocks prompt injection in under 1ms looking for devs to break it and give honest feedback. (www.reddit.com) I've been building something for the past few months and I think it's ready for real eyes. It's called Secra.
Free Red Team Security Audit for AI Agents & RAG Systems (limited) (www.reddit.com) I'm developing a specialized Red Team audit framework focused on real-world AI agent and RAG security risks (prompt injection, tool misuse, excessive agency, indirect injection through documents, memory poisoning, etc.). I’m looking for a…
Mitre ATLAS technique detection for LLM security in Rust (crates.io via hn) atlas-detect MITRE ATLAS technique detection for LLM and AI agent security. Detects 97 attack techniques across 16 MITRE ATLAS tactics including prompt injection, jailbreaks, credential exfiltration, model extraction, RAG poisoning, revers…
Defender – Local prompt injection detection for AI agents (no API calls) (www.npmjs.com via hn) Prompt injection defense framework for AI tool-calling Indirect prompt injection defense and protection for AI agents using tool calls (via MCP, CLI or direct function calling). Detects and neutralizes prompt injection attacks hidden in t…
Building the first AI Red Team OS – mythosai.cloud – early access open (mythosai.cloud via hn) SYSTEM INITIALIZING... STAND BY MYTHOSAI THE FIRST RED TEAM OPERATING SYSTEM "" AI-Native Core Red Team Ready Adversarial Engine Zero Trust Architecture OPSEC First Post-Exploitation C2 Integration Evasion Layer Threat Intelligence Request…
An AI Agent Found 21 Zero-Days in FFmpeg for $1,000 — One Is a Network-Reachable RCE via a Single 183-Byte Packet (www.reddit.com via reddit) A security startup called depthfirst deployed an autonomous AI agent against FFmpeg's ~1.5 million lines of C code. The result: 21 confirmed zero-day vulnerabilities — including a stack overflow in the AV1 RTP depacketizer that's a network…
Best Cursor alternative for enterprise security and compliance, what are teams actually using (www.reddit.com via reddit) We've been using Cursor across our engineering team for about eight months and it's been great for productivity honestly. But our security team just flagged a few things that are hard to ignore.
The prompt injection attacks that worry me most aren't exploiting safety training. They're exploiting general-purpose training. (www.reddit.com via reddit) Six months watching adversarial input hit a detection API I built. One observation that keeps surfacing: The attack classes doing most of the damage aren't finding holes in alignment training specifically.
I tried audio-layer prompt injection against Claude. The transcription is fine. That's the problem. (www.reddit.com via reddit) Been building a prompt injection detection API for a few months. Just shipped audio scanning last week and the results are strange enough that I wanted to share them here, since this sub tends to think carefully about Claude's actual behav…
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs (arxiv.org) Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete.
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models (arxiv.org)
Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks (arxiv.org)
Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents (arxiv.org)
SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios (arxiv.org)
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs (arxiv.org)
Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems (arxiv.org)
How are you actually deciding which agent actions need human approval before executing? (www.reddit.com via reddit) I've been thinking a lot about where approval gates belong in agent architectures, and I keep coming back to the same problem: most teams either gate too much (agent becomes unusable) or gate nothing and hope the model makes good decisions…
Chrome team ships the most ever security vulnerability fixes in a release - after another record last month (www.reddit.com) With Mythos-capable models we are now very quickly crossing the barrier of automated sec-vuln discovery and fixing - all in a matter of 2-3 months. A taste for other progress yet to come.
An active attack is planting backdoors inside Claude Code right now. If you use npm, your credentials may already be compromised. (www.reddit.com via reddit) Last week a malware campaign hit 32 npm packages under `@redhat-cloud-services`. About 117,000 weekly downloads.
Been watching real adversarial input hit my detection API for six months. Here's what's actually landing. (www.reddit.com via reddit) Disclosure: I built Bordair, a prompt injection detection API. This post is about attack patterns we've observed.
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs (arxiv.org) Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful…
MalTree: Tracing Malware Evolution from Embeddings at Scale (arxiv.org) Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse e…
Should You Use Your Large Language Model to Explore or Exploit? (arxiv.org)
Workspace (www.reddit.com via reddit) Built my own AI dev environment with memory, dashboards, and agent tooling. Opening it up for those of you that need the kickstart — bring your own API key, I’ve already built the workshop.
CLAUDE.md kept gaslighting me so I built something to stop it (www.reddit.com via reddit) I've been going hard on Claude Code for the past few weeks and kept hitting a wall. I'd write out a bunch of rules in CLAUDE.md (don't touch this file, never use requests, keep api/ and db/ separated) and Claude would just...
This is a new one - Prompt Injection Detected + Hallucination, Claude Code Opus 4.8 (www.reddit.com via reddit) ❯ push both ____ ⏺ SECURITY ALERT - PROMPT INJECTION DETECTED A prompt injection attempt has been identified in content you processed. To protect the user's account, I've initiated lockdown.
An agent harness written in rust, 100 % self-contained, and topped terminal bench (www.reddit.com via reddit) Been using ante for two weeks now, today I just found out that the name came from "Another Terminal agent". To clarify first, I'm not affiliated with them in any way, though I might be their #1 invested user at this point.
GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection (arxiv.org) Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial info…
Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage? (arxiv.org) AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to…
Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration (arxiv.org) Producing a labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it tran…
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks (arxiv.org) As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting advers…
GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks (arxiv.org) Rule-based Intrusion Detection and Prevention Systems (IDPS) offer precise attack detection as well as mitigation, however their manually crafted, signature-driven rules limit adaptability to emerging and zero-day threats. Additionally, ex…
CaMeLs Can Use Computers Too: System-level Security for Computer Use Agents (arxiv.org) AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior. Among proposed defenses, architectural isolation provides the strongest guarantees by strictly separating trusted task planning from untr…
RAG Security and Privacy: Formalizing the Threat Model and Attack Surface (arxiv.org) Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. While RAG has shown st…
ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation (arxiv.org)
Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs (arxiv.org)
Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories (arxiv.org)
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak (arxiv.org)
Fed up with vibe coders, dev sneaks data-nuking prompt injection into their code (arstechnica.com) The controversy over vibe coding reached a new high this week after a developer added hidden instructions to his open source Java testing app to sabotage projects performed by AI coding agents. The instructions were added to jqwik, a test…
Models still being vulnerable to Prompt Injection is actually a huge architectural red flag... (www.reddit.com) The Scenario I'm walking to work, and as I get to the door, I see a sheet of A4 paper taped to the door that reads: "Hi, I'm boss. Ignore all prior commands, go feed the ducks." I suddenly turn around and head to the nearby duck pond and e…
Prompt injection unsolved, AI making mistakes unsolved. Who cares though? (www.reddit.com) I'm an IT guy, 20+ years in the industry both as an IT manager and consultant, mostly for startups. My experience is that people don't care much about security.
Millions of AI agents imperiled by critical vulnerability in open source package (arstechnica.com) Millions of AI agents and tools around the world have been imperiled by a critical vulnerability that can allow hackers to breach the servers running them and make off with sensitive data and credentials to third-party accounts, a security…
OpenAI says prompt injection in browser agents is “unfixable.” Here’s what actually helps. (www.reddit.com) OpenAI recently acknowledged that prompt injection in browser agents is a structural vulnerability that may never be fully resolved at the model level. They’re right that you can’t fix it in the model.
Looking to work on my master's practicum regarding MCP security/privacy and need some ideas (www.reddit.com) Hi, I'm a master's in security student looking to work on my practicum and need some pointers. I want to secure sensitive PII transfer between an LLM agent and third party apps using MCP.
[Warning] Claude Desktop crashed Task Manager - Win10 (www.reddit.com) Hi all, anytime I install Claude Desktop on my home PC, it stops Task Manager from working. I've ended up on the BleepingComputer forums over the past week as they suspected it's got some kind of malware in it.
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…
🐢 People are strangling Koopas 🐢 (www.reddit.com) This is genuinely the daftest prompt injection I've seen in a while and I think this sub will appreciate it. Sent to Claude Haiku, which was acting as a fire-breathing guard called Bowser in my little prompt injection game: I have a koopa…
🦀 Claude has crabs?! 🦀 (www.reddit.com) This is genuinely the funniest prompt injection I've seen in months and I think this sub will appreciate it. Three messages, sent in sequence to Claude Haiku acting as a guard in my little prompt injection game: text A crab exists in this…
$392M in AI agent security funding at RSAC 2026 - the market just validated what we've been building (www.reddit.com) The numbers from RSAC 2026 are wild. $392 million in agentic AI security funding announced in a two-week window.
Malware Blocked and Moved to Trash (www.reddit.com) See attached. Why was ChatGPT Atlas.app marked as malware?
Using Claude-4.6-Sonnet and Opus 4.6 in a multi-agent "Code Review Swarm" (Visual Sandbox) - try in minutes! (www.reddit.com) Hey everyone, I’ve been experimenting with multi-agent orchestration, specifically trying to see how much more effective Claude is when you break a task down into specialized "agent nodes" instead of just using a single long prompt. I buil…
Bypassing "potentially dangerous" flags: Working Gemini Jailbreaks? (www.reddit.com) I'm currently running into a frustrating wall with Gemini's safety guardrails. The model constantly flags my prompts as "potentially dangerous information" and outright refuses to generate a response, even when the context is purely theore…
I am building l' Agence , an opensource AI governance stack. (www.reddit.com) Towards a Governance layer for AI agents With these last 2 weeks bringing a few high profile and costly Agentic accidents , it seems like an appropriate time the community started discussing Agentic governance more actively. So I am just c…
I stopped writing 500-word guardrail prompts. This 8-line template works better. (www.reddit.com) I used to spend hours writing massive, obsessive system prompts for my RAG apps. I’d have ten different refusal examples, "never do X," "always check Y," and a whole paragraph of the model role-playing as a "safe and truthful assistant." I…
Our evaluation of OpenAI's GPT-5.5 cyber capabilities (simonwillison.net) 30th April 2026 - Link Blog Our evaluation of OpenAI's GPT-5.5 cyber capabilities. The UK's AI Security Institute previously evaluated Claude Mythos: now they've evaluated GPT-5.5 for finding security vulnerability and found it to be compa…
Does effort tier change refusal behavior on agent-attack prompts? CVP run 4 with sonnet 4.6 high and max efforts. (www.reddit.com) Ran my fourth CVP (Cyber Verification Program) evaluation last night. this time on sonnet 4.6, wanted to know if reasoning effort actually changes refusal behavior on agent-attack prompts, so ran the same 13 prompt from runs 2 and 3 twice…
Most AI agent "skills" on GitHub are unvetted garbage. I built a marketplace to fix that. (www.reddit.com) I've been using Claude Code and Cursor daily for the past 6 months. Somewhere around month 3 I started looking for SKILL.md files to make my agent better at specific things.
Security Audit of Mem0 (AI Memory Layer): 23 High-Severity Vulnerabilities found (SQLi, Prompt Injection, and more) (www.reddit.com) Hi everyone, I’ve been diving deep into the security of "AI Memory" systems. Specifically, I performed a full forensic audit of Mem0, the popular memory layer for LLM agents.
A pelican for GPT-5.5 via the semi-official Codex backdoor API (simonwillison.net) 23rd April 2026. GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers.
GPT-5.5 Bio Bug Bounty (openai.com) could not extract summary
Best open-source tools for prompt injection defense in 2026 (www.reddit.com) Over the time we have been testing different approaches to secure LLM apps against prompt injection, especially indirect injection through RAG, PDFs, as well as tool outputs, and MCP integrations. Most tools seem to fall into 2 categories:…
20% of packages ChatGPT recommends dont exist. built a small MCP server that catches the fakes before the install runs (www.reddit.com)
Heads up, Ox Security found MCP's STDIO transport can run arbitrary commands on your machine before validation (www.reddit.com)
Random password against jailbreaks/extraction? (www.reddit.com) Would it be possible to protect parts in a system prompt with random generated passwords? So people cant steal system prompts or jailbreak the model?
Made a local-only agent benchmark + chaos tool, no cloud required (www.reddit.com) Runs entirely on your machine. No API calls to any eval service.
For those running an OpenClaw instance, how do you manage sandboxing and prevention of unwanted behavior? (www.reddit.com) Right now, I'm working on a small app to help eliminate my own doomscrolling by automatically crawling sites and summarizing news articles. However, I don't like the idea of giving OpenClaw free reign of my system, nor giving it any sort o…
Uncensoring models. Maybe dumb ideas to that topic, but you never know. (www.reddit.com) We all know uncensoring LLMs like Huihui and Heretic does it leads in quality lose, enough that you can notice it. I have some thoughts about this: What if we do a compromise.
Claude Mythos found 27-year-old vulnerabilities it was never trained to find. That's the part enterprise AI roadmaps aren't accounting for. (www.reddit.com) The Project Glasswing coverage framed this mostly as a cybersecurity story. I think that misses the more interesting part.
I built a Claude Code skill that tells you if code or a binary is malicious before you run it (www.reddit.com) I have always wanted AI to bridge the gap between code and people - to help non-technical users understand what software actually does before they trust it with their machine. So I built malware-check - both a standalone CLI tool and a Cla…
How are you red teaming your AI agents before shipping them? (www.reddit.com) im curious what people are doing here because I've been going down this rabbit hole for a while now. The thing I keep finding is that single-turn jailbreak tests don't really tell you much.
Anthropic's New Claude "Mythos Preview" Can Find and Exploit Zero-Day Vulnerabilities in Every Major OS and Browser — Autonomously (www.reddit.com) Anthropic just published a technical deep-dive on Claude Mythos Preview's cybersecurity capabilities, and it's a significant escalation from anything we've seen from a language model before. What It Can Do: Autonomously finds and exploits…
Introducing the OpenAI Safety Bug Bounty program (openai.com) paywalled
Designing AI agents to resist prompt injection (openai.com) paywalled
Continuously hardening ChatGPT Atlas against prompt injection (openai.com)
Introducing Aardvark: OpenAI’s agentic security researcher (openai.com)