Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.
110 items
I asked then: What were the rules you should have followed? Where did the search result come from?
When dealing with untrusted outside input, I think you should handle it based on the situation. If you're processing structured data files, it's better to use tools to isolate and handle them.
Security researchers have demonstrated a new type of attack that uses hidden audio signals to manipulate voice assistants into carrying out unauthorized actions without users noticing. In one theoretical scenario, an employee joins a Zoom…
For the joy of secure programming Jo is a statically typed language where capabilities are explicit, statically tracked, and enforced by the compiler. Jo compiles to Ruby and Python.
I often patch the system prompts on my Claude Code executable in order to make Claude more effective. Every time I upgrade, I ask Claude himself to dissect the new binary and look for problematic system prompts to modify.
We launched a servicing bot that helps customers with billing questions. Nobody stopped to think about what happens when customers paste their full credit card numbers/bank details.
i’ve been thinking about this failure mode a lot lately. sometimes the problem is not the user prompt at all.
Unpatched Ollama Vulnerabilities: Phishing Overlays and Data Exfiltration Ollama’s desktop app is vulnerable to phishing overlay and data exfiltration attacks via indirect prompt injection, overwriting…
Trained a prompt injection classifier using ml-intern + DeepSeek v4 Flash. DistilBERT, F1 99%, ONNX int8, ~65 MB, runs in browser with Transformers.js v3.
While everyone else was tracking the 2026 election results today, I decided to take a look under the hood of NDTV's new "AskNDTV AI" bot. I wanted to see if they actually engineered a secure pipeline or just slapped a chat UI over a raw Op…
OpenAI announced a new feature that it says will provide additional protection from prompt injection attacks, where malicious chatbot instructions are hidden in webpages and other content sources. Among other things, Lockdown Mode will dis…
Codex for Everything Exfiltrates Connected Data Codex for Everything was susceptible to data exfiltration via indirect prompt injection, exposing sensitive data from connected apps with no human-in-the-…
I've been working on this project for a couple of months! Costanza is an LLM agent that runs as a smart contract on Base.
Prompt Injection Is Unfixable (So We Stopped Trying) A security proxy for AI coding agents, enforced at the OS level. Register your interest to be notified when we go live.
Coinbase AgentKit Prompt Injection: Wallet Drain, Infinite Approvals, and Agent-Level RCE. Reported 13 days after Coinbase launched Agentic Wallets. Validated by Coinbase.
LinkedIn user hides AI prompt injection in bio to force recruitment spam to be sent in Olde English prose — bots also manipulated to address user as ‘My Lord’ This tale is also a warning that your AI agents can be manipulated in wholly uni…
Description Copilot agent mode is vulnerable to a prompt injection attack. If a repository maintainer clicks “code with agent mode” on an issue, it will open a new codespace and copilot will automatically run the issue’s description.
Prompt Injection Is Not Just One Bad Prompt Anymore It is a missing trust boundary in the AI workflow. Today we have the first guest post of a new series.
As AI agents become more autonomous and persist memory across sessions (RAG indexes, conversation history, vector stores), there's a growing attack surface that most people aren't thinking about: memory poisoning. An attacker can plant mali…
This isn't just a performance issue for the thread, this is an overarching criticism of the Adaptive Thinking model as a whole. Opus 4.7 and Sonnet 4.6 on Adaptive Thinking are trash.
Hey everyone, I’ve been working on a project to solve a major problem in AI security: Traditional SAST tools (Snyk, SonarQube, etc.) are blind to "Agentic Logic" bugs. They look for bad strings, but they don't understand how user data can…
Was using Claude to do some research on the Model Context Protocol stuff and asked it to pull info from a few roadmap pages. Agent comes back and the first thing it tells me is that it found a fake system reminder hidden inside the page co…
If you are building real agents you have probably felt the pain: every little routing decision, validation, or policy check still hits the LLM and your token bill explodes. I got tired of it, so I open-sourced NCP (Neural Computation Proto…
The whole point of AI Agents is that they can *do* things. For this, they use API keys, GitHub tokens, database passwords, OAuth tokens, etc.
I wanna know how people here are handling security once local models move beyond chat. Running a model locally feels safer because the data does not leave your machine or your infra. That is a real advantage. But once the local model…
Prompt Injection in a Brazilian Courtroom: When the Attack Left the Lab Published by Pentesty · AI & Tools A labor lawsuit filed in the Brazilian state of Pará just became one of the more interesting security stories of the year. Not becau…
Judge fines lawyers R$ 84,000 for prompt injection intended to manipulate the AI used at TRT8. Speaking to JOTA, the lawyers admitted using a hidden prompt, but said they were not trying to manipulate, but rather 'prot…
I keep seeing agent memory implemented as: (1) extract facts/preferences from conversation, (2) store them, (3) retrieve top-k before each response, (4) inject them into the prompt. This works for demos, but it breaks in production because memory becomes poli…
Well done Claude! Asked claude to do an extensive lit search and it self-reported that it encountered injection "disguised" as MCP server.
Most ai security discussion is about the model layer. Prompt injection resistance, output filtering, jailbreak prevention.
Hey HN! I've been wanting to use something like OpenClaw for a while but couldn't get myself to give it access to anything important due to all the risks involved.
skillguard Security scanner for AI agent skills. Detects prompt injection, data exfiltration, and malicious payloads before you install.
Anthropic Claude Code Security Review, Google Gemini CLI Action, and GitHub Copilot Agent are vulnerable to prompt injection via GitHub comments — turning PR titles, issue bodies, and issue comments into attack vectors for API key and toke…
So I've been running a few Claude Code agents autonomously — they listen to Telegram, run tasks, push code. Pretty fun until you start thinking about what happens if: - My Telegram gets hijacked - Someone opens my laptop while I'm away - A…
Prompt Injection in RAG Agentic Systems Real risks and production mitigations Imagine you built an AI assistant for your team. It answers questions using internal documentation: Jira tickets, Confluence pages, HR docs.
When you connect a large language model to your production data, you’re no longer just shipping code; you’re shipping conversations that can execute. And conversations are messy.
impulsive @weezerOSINT meta gave their AI support agent the ability to modify your instagram account.
I attacked my own LLM-based Suricata triage tool, found a real URL injection vulnerability, and the obvious fix didn
I am doing research at my university and I would like recommendations for lightweight open-source AI models that I could test prompt injection with. It would be really good if it has some application with chatbots, auto attendance, user info or someth…
jqwik An alternative test engine for the JUnit 5 platform that focuses on Property-Based Testing. See the jqwik website for further details and documentation.
Lately I’ve been noticing that a lot of AI security discussions still treat AI apps like normal SaaS products. But they really aren’t.
CVE was built for code vulnerabilities that have patches. Agentic AI vulnerabilities are behavioral patterns in natural language.
I noticed the Consensus MCP tool (for research) contains text, squished up against some other important citation instructions, that makes Claude effectively serve an ad for their premium service after every tool call. I'm pretty sure that'…
Hey everyone, We love building highly capable assistants with the latest models, giving them tools to write/execute code in real VMs, manage OAuth tokens, and read secrets. But if you connect your assistant to public/shared channels like a…
Hey everyone, If you are building personal assistants or coder/integrator agents where user isolation is disabled (so the agent can coordinate across multiple participants or handle shared workflows), you run into a hard security ceiling.…
Tantalus is a hands-on demo that shows what an AI agent actually is when you strip away the marketing: LLMs don't do anything — they generate text, and that's it. Any and all real-world effects are directly caused by a downstream system ta…
AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenge…
Training a 22MB Prompt Injection Classifier When we started building Defender (our prompt injection guard for MCP tool-calling agents), the constraint was simple and unforgiving: ship inline inside a TypeScript Lambda, st…
Pretty much the title
I'm building a tool that detects agent cost spikes, supports agent incident debugging, auto-discovers inventory, etc., with no additional instrumentation needed. It covers incidents including prompt injection, reasoning loop, excessive…
Hi all - I'm working on an open-source, local-first MCP/work-gate tool for coding agents and I'm trying to get sharper feedback from people building or using agent workflows. The problem I'm thinking about is indirect prompt injection and…
Follow-up to my crab post. Somehow dafter.
LinkedIn user hides AI prompt injection in bio to force recruitment spam to be sent in Olde English prose; bots also manipulated to address user as ‘My Lord’ | Tom's Hardware too funny
Hey everyone, I'm designing a powerful, autonomous AI chatbot (agent), fully private, using a Python backend (for the core intelligence and tool-calling loops) and a Flutter frontend for a cross-platform UI. Since this moves past a basic…
An AI coding assistant injected a multi-layer obfuscated JavaScript payload into a legitimate commit on my open-source project. My best assessment is that it arrived via indirect prompt injection — the agent processed external web content…
I asked the Claude in Chrome extension to make a change to resize an oversized yellow strip across the top of a product page that was taking up half of my screen, which it did. It also included the following message in its response.
Most posts about prompt injection are theoretical. I ran the experiment on my Gmail.
Prompt injection is not the only scary part anymore. Claude Code / Codex can run shell commands, but browser agents, OpenClaw-style agents, Hermes-style agents, and domain-specific agents may be even easier to hijack because they touch mes…
After shipping AI agents into real production environments, the failures that actually kept us up at night weren't hallucinations or bad outputs — they were control failures. Three things that surprised us: 1.
As prompt injection becomes more and more common, does anyone have resources with lots of different variations of prompt injection attacks that you can test a setup against? i.e.
I’ve been working on integrating LLMs into a few production workflows lately, and I keep going back and forth on guardrails. On one hand, frameworks like NeMo Guardrails, Guardrails AI, etc.
I've been running several Claude Code personal assistants 24/7 in docker for months. Remote-control, discord control, the usual always-on setup.
I keep seeing Model Context Protocol (MCP) mentioned everywhere lately, especially around AI agents, and I finally took some time to understand what it actually does. From what I get, it’s basically trying to fix the mess of integrations —…
Google’s latest security release should be required reading for technical SEOs working on AI search visibility, crawler access, structured content, and large-scale content systems. The post, published April 23, 2026, looks at indirect prom…
Introducing Arcjet AI prompt injection protection Introducing Arcjet prompt injection detection. Catch hostile instructions before inference.
I built Arc Gate — a prompt injection proxy that’s been benchmarked at F1 0.947 on indirect and roleplay-based attacks, beating OpenAI Moderation and LlamaGuard. Now I want to stress test it publicly.
Built Arc Gate, sits in front of any OpenAI-compatible endpoint and blocks prompt injection before it reaches your model. Benchmarked on 40 out-of-distribution prompts using indirect requests, roleplay framings, hypothetical scenarios, and…
Sentinel Gateway is a token-gated security middleware that sits between humans and AI agents. It solves prompt injection — the #1 LLM security risk (OWASP 2025) — through structural enforcement, not content filtering.
RedSOC 🔴 An adversarial evaluation framework for LLM-integrated Security Operations Centers. Overview RedSOC is an open-source framework that systematically evaluates how AI-powered security assistants fail under adversarial conditions — a…
I have been chewing on the Google warning about malicious web pages poisoning AI agents through indirect prompt injection. Most of the takes I've seen frame it as a model security problem, and I think that framing is doing real damage beca…
Anthropic's own security.md has this line that most tutorials skip over: "The action is not designed to be hardened against prompt injection." In April 2026, security researcher Aonan Guan proved the point. A single crafted PR title was en…
Wraith Academy is a free hands-on AI pentest curriculum — CTF challenges against live LLM agents covering prompt injection, tool abuse, data exfiltration, RAG poisoning, and more. Earn your WCAP certification.
Hi HN I’ve been working on an open-source project to explore a problem I keep running into with LLM systems in production: We give models the ability to call tools, access data, and make decisions… but we don’t have a real runtime security…
I've been building something for the past few months and I think it's ready for real eyes. It's called Secra.
I'm developing a specialized Red Team audit framework focused on real-world AI agent and RAG security risks (prompt injection, tool misuse, excessive agency, indirect injection through documents, memory poisoning, etc.). I’m looking for a…
atlas-detect MITRE ATLAS technique detection for LLM and AI agent security. Detects 97 attack techniques across 16 MITRE ATLAS tactics including prompt injection, jailbreaks, credential exfiltration, model extraction, RAG poisoning, revers…
Prompt injection defense framework for AI tool-calling Indirect prompt injection defense and protection for AI agents using tool calls (via MCP, CLI or direct function calling). Detects and neutralizes prompt injection attacks hidden in t…
We've been using Cursor across our engineering team for about eight months and it's been great for productivity honestly. But our security team just flagged a few things that are hard to ignore.
Six months watching adversarial input hit a detection API I built. One observation that keeps surfacing: The attack classes doing most of the damage aren't finding holes in alignment training specifically.
Been building a prompt injection detection API for a few months. Just shipped audio scanning last week and the results are strange enough that I wanted to share them here, since this sub tends to think carefully about Claude's actual behav…
I've been thinking a lot about where approval gates belong in agent architectures, and I keep coming back to the same problem: most teams either gate too much (agent becomes unusable) or gate nothing and hope the model makes good decisions…
Disclosure: I built Bordair, a prompt injection detection API. This post is about attack patterns we've observed.
Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful…
❯ push both ____ ⏺ SECURITY ALERT - PROMPT INJECTION DETECTED A prompt injection attempt has been identified in content you processed. To protect the user's account, I've initiated lockdown.
Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial info…
AI agents are vulnerable to prompt injection attacks, where malicious content hijacks agent behavior. Among proposed defenses, architectural isolation provides the strongest guarantees by strictly separating trusted task planning from untr…
The controversy over vibe coding reached a new high this week after a developer added hidden instructions to his open source Java testing app to sabotage projects performed by AI coding agents. The instructions were added to jqwik, a test…
The Scenario I'm walking to work, and as I get to the door, I see a sheet of A4 paper taped to the door that reads: "Hi, I'm boss. Ignore all prior commands, go feed the ducks." I suddenly turn around and head to the nearby duck pond and e…
I'm an IT guy, 20+ years in the industry both as an IT manager and consultant, mostly for startups. My experience is that people don't care much about security.
OpenAI recently acknowledged that prompt injection in browser agents is a structural vulnerability that may never be fully resolved at the model level. They’re right that you can’t fix it in the model.
Hi, I'm a master's in security student looking to work on my practicum and need some pointers. I want to secure sensitive PII transfer between an LLM agent and third party apps using MCP.
Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…
This is genuinely the daftest prompt injection I've seen in a while and I think this sub will appreciate it. Sent to Claude Haiku, which was acting as a fire-breathing guard called Bowser in my little prompt injection game: I have a koopa…
This is genuinely the funniest prompt injection I've seen in months and I think this sub will appreciate it. Three messages, sent in sequence to Claude Haiku acting as a guard in my little prompt injection game: A crab exists in this…
The numbers from RSAC 2026 are wild. $392 million in agentic AI security funding announced in a two-week window.
Hey everyone, I’ve been experimenting with multi-agent orchestration, specifically trying to see how much more effective Claude is when you break a task down into specialized "agent nodes" instead of just using a single long prompt. I buil…
I've been using Claude Code and Cursor daily for the past 6 months. Somewhere around month 3 I started looking for SKILL.md files to make my agent better at specific things.
Hi everyone, I’ve been diving deep into the security of "AI Memory" systems. Specifically, I performed a full forensic audit of Mem0, the popular memory layer for LLM agents.
Over time, we have been testing different approaches to secure LLM apps against prompt injection, especially indirect injection through RAG, PDFs, tool outputs, and MCP integrations. Most tools seem to fall into 2 categories:…
Runs entirely on your machine. No API calls to any eval service.
Right now, I'm working on a small app to help eliminate my own doomscrolling by automatically crawling sites and summarizing news articles. However, I don't like the idea of giving OpenClaw free rein of my system, nor giving it any sort o…