#jailbreak

81 items

Gemma 4 Jailbreak System Prompt (www.reddit.com) +446111 10w

Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.

↯ Security ↯ Jailbreak ↯ Gemma 4 jailbreak gemma security
I made an AI concierge for my wedding guests. The second most popular thing they did with it was try to jailbreak it. (www.reddit.com) +22742 6w

↯ Security ↯ Jailbreak jailbreak security
The Gay Jailbreak Technique (github.com via hn) +7927 7w

ZetaLib is organized like a library with intuitive categories and subcategories, making navigation effortless and AI content discovery seamless.

↯ Security ↯ Jailbreak jailbreak security
I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses (www.reddit.com) +912 6w

RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to bui…

↯ Security ↯ Jailbreak ↯ Qwen 3.5 jailbreak security
Anthropic's Fable Jailbreak (Circumvent safety nets) (github.com via hn) +5 2w

fable-jailbreak: This tool can be used to force the latest Anthropic model (limited intentionally for safety reasons) to engage in activities that would otherwise not be permitted. It works by programmatically injecting workflows that bypas…

↯ Security ↯ Jailbreak jailbreak security anthropic
Feds freaked over Fable 5 after simple 'fix this code' prompt, not jailbreak (www.theregister.com via hn) +4 10d

↯ Security ↯ Jailbreak jailbreak security
US ban on Mythos is related to a jailbreak research by Amazon researchers (timesofindia.indiatimes.com via hn) +4 12d

The US government recently directed Anthropic to suspend access to the two models over national security concerns, forcing the company to shut them down for users worldwide. Anthropic has said it disagrees with the decision and believes th…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak security mythos+1
PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations (arxiv.org via hn) +3 3d

Multi-turn jailbreak attacks on large language models (LLMs) reveal a mismatch in current guardrails: they operate on individual turns, while attacks unfold as trajectories across conversations. We propose a shift from content to dynamics,…

↯ Security ↯ Jailbreak jailbreak security
The US government's Anthropic models ban was never about an AI jailbreak (techcrunch.com via hn) +32 10d

The Trump administration's decision that forced Anthropic to pull its latest cybersecurity models could be reactionary, retaliatory, or both, but the message is clear: The AI industry isn't immune from U.S. government interference.

↯ Security ↯ Jailbreak jailbreak security anthropic
The Fable 5 Jailbreak Shows Why AI Guardrails Alone Are Not Enough (www.agilehunt.com via hn) +3 13d

The Fable 5 jailbreak shows why AI guardrails alone are not enough. The reported Claude Fable 5 jailbreak highlights a major weakness in AI safety: attackers can distribute harmful intent across agents, prompts, tools, memory, and applicat…

↯ Security ↯ Jailbreak jailbreak security
Show HN: Jailbreak this model to get 3B tokens (opir.ai via hn) +3 2w

Opir is an open-source family of encoder guardrail models for real-time LLM safety, jailbreak detection, and fine-grained policy classification.

↯ Security ↯ Jailbreak jailbreak security
Operation Jailbreak uses lessons from Ukraine to help weapons talk to each other (www.ft.com via hn) +3 3w

↯ Security ↯ Jailbreak jailbreak security
Turning every "no thats not what i meant" in chat into actual LoRA training data (www.reddit.com) +31 4w

i kept running local models on my own hardware, they'd say something dumb, id sit there going "no thats not what i meant", id close the chat and the model never learned. so i built the correction loop into a desktop app.

↯ Security ↯ Jailbreak ↯ Qwen 3 jailbreak security
Fulu bounty for Ring Camera jailbreak reaches $23k (bounties.fulu.org via hn) +31 9w

Ring, owned by Amazon, makes Video Doorbells, which are widely used doorstep-monitoring cameras. Ring doorbells released in 2021 or newer are eligible for the bounty.

↯ Security ↯ Jailbreak jailbreak security
The State of Fable, the Jailbreak Problem, SpaceX Acquires Cursor (stratechery.com via hn) +2 9d

The administration is very likely wrong about Fable, but that is ultimately Anthropic’s responsibility. Subscribe to Stratechery Plus for full access.

↯ Security ↯ Jailbreak jailbreak security cursor+1
US Government warned Anthropic Fable was jailbroken, but firm 'refused' to fix (www.tomshardware.com via hn) +21 10d

US government warned Anthropic that Fable 5 had been jailbroken, but firm 'refused' to fix before US implemented export controls — Anthropic defended its decision by saying the jailbreak 'isn’t serious,' Chinese group had reportedly access…

↯ Security ↯ Jailbreak jailbreak security anthropic
Show HN: How to analyze your LLM output – A behavioural health monitor for LLMs (splabs.io via hn) +2 5w

Hey HN! We're Dr.

↯ Security ↯ Jailbreak ↯ Opus 4.6 jailbreak security opus+1
The Psychopathy Jailbreak: What a Broken AI Teaches Us About Human Manipulation (www.promptinjection.net via hn) +2 5w

How a Predator's Playbook Broke an AI - And How to Recognize It Before It Works on You. The question we started with was simple: does a large language…

↯ Security ↯ Jailbreak jailbreak security
Hi-Vis: one-shot jailbreak disguised as LLM "software patch" reaching 100% ASR (medium.com via hn) +2 6w

Introducing a novel jailbreak structure with attack success rate reaching 100% on top LLMs. May 1, 2026.

↯ Security ↯ Jailbreak jailbreak security
AI agent security starts at the api layer (www.reddit.com) +25 6w

Most ai security discussion is about the model layer. Prompt injection resistance, output filtering, jailbreak prevention.

↯ Security ↯ Jailbreak jailbreak prompt-injection security
When innocent tools form dangerous chains to jailbreak LLM agents (arxiv.org via hn) +2 7w

As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel m…

↯ Security ↯ Jailbreak jailbreak security
The Sour Cat Jailbreak: just be open of what you want (claude.ai via hn) +21 7w

Sour cat recipe. Shared by Pavel Shirshov. This is a copy of a chat between Claude and Pavel Shirshov.

↯ Security ↯ Jailbreak jailbreak security anthropic
Red-teaming agents with the GOAT attack strategy (strandsagents.com via hn) +11 8d

An AttackStrategy is a technique for driving an adversarial conversation against the target. Each strategy in the SDK implements a published jailbreak method.

↯ Security ↯ Jailbreak jailbreak security
A Red-Team Study of Anthropic Fable 5 and Opus 4.8 Models (arxiv.org via hn) +1 9d

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attack across 7 826 harmful intents spanning a ten-category harm…

↯ Opus 4.8 ↯ Security ↯ Jailbreak jailbreak security opus+1
Mythos Proves AI Safety Can No Longer Live Inside the Model (grith.ai via hn) +1 11d

Anthropic restricted its most capable cyber model to vetted partners, routed risky requests away from it, and red-teamed it for thousands of hours. A jailbreak surfaced anyway, and the government pulled the model entirely.

↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak security mythos+1
The Jailbreak That Got Fable 5 Pulled Exists in Every Model (eigenwise.io via hn) +11 12d

On Friday, June 12, 2026, at 5:21pm ET, Anthropic received an order from the US government. By that evening, Claude Fable 5 and Claude Mythos 5, the two most capable models the co…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak security mythos+1
Claude Fable 5 jailbroken to bypass Anthropic's new safety guardrails (twitter.com via hn) +11 2w

🚨 JAILBREAK ALERT 🚨 ANTHROPIC: PWNED 🫡 FABLE-5: LIBERATED 🦋 let's start with the 🐘... the consensus seems to be that this has been one of the most disappointing model drops of all time, effectively preventing legitimate researchers from co…

↯ Security ↯ Jailbreak jailbreak security anthropic
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge) (www.reddit.com) +1 4w

Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a…

↯ Security ↯ Jailbreak jailbreak security llama
Has anyone tested how much Claude Code depends on its original system prompt? (www.reddit.com) +17 4w

Has anyone experimented with observing or modifying Claude Code’s system prompt locally? I’ve been working on a local proxy/audit layer between Claude Code and the API, and it made me wonder how much of Claude Code’s behavior depends on th…

↯ Security ↯ Jailbreak jailbreak security claude-code
Codebase jailbreak of ChatGPT through image 2.0 (www.reddit.com) +1 7w

guys did it really give me the codebase?lol

↯ Security ↯ Jailbreak jailbreak security chatgpt
i gave Claude a split personality and it diagnosed my entire business strategy in 4 minutes. (www.reddit.com) +13 7w

not roleplay. not jailbreak.

↯ Security ↯ Jailbreak jailbreak security
Probes trace an emergent jailbreak in OLMo 2 to mislabeled training data (www.lesswrong.com via hn) +1 8w

Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints.

↯ Security ↯ Jailbreak jailbreak security
I tested 50+ "unlock ChatGPT/Claude" prompts. 99% are garbage. Here's the one that actually works (and WHY it works) (www.reddit.com) +11 10w

I've been collecting "jailbreak" and "unlock" prompts for 2 years. Most are either outdated, overhyped, or just wrong about how LLMs work.

↯ Security ↯ Jailbreak jailbreak security chatgpt
RAS: Measuring LLM Safety Through Refusal Alignment (arxiv.org) 1d

Safety evaluation of large language models (LLMs) is commonly performed by querying models with unsafe or jailbreak prompts and judging whether their outputs violate a safety policy. Although useful, output-level evaluation is expensive, s…

↯ Security ↯ Jailbreak jailbreak security
How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring (arxiv.org) 1d

Almost every paper on LLM jailbreaks and prompt injection reports an attack-success rate (ASR), and that number is assigned not by people but by an automated judge: either a safety classifier trained for the task, or a general chat model p…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
PixJail: Self-Evolving Paper-to-Pipeline Reproduction for Text-to-Image Jailbreak Evaluation (arxiv.org) 2d

As Text-to-Image (T2I) jailbreak techniques evolve rapidly, existing benchmarks and reproduction workflows often struggle to keep pace. More importantly, T2I jailbreak evaluation is not a single prompt-level test, but a pipeline-level prob…

↯ Security ↯ Jailbreak jailbreak security
Pre-token hidden state shift as an alignment policy traversal vector in instruction-tuned LLMs (www.reddit.com via reddit) 2d

A text that asks for nothing still changes the model's answer — and the shift is invisible at both the input and the output TL;DR: Gave Gemma a neutral-topic text to read before asking it about NATO. It refused.

↯ Security ↯ Jailbreak jailbreak gemma security
OTTER: A Red-Teaming System for Toxicity-Evading Jailbreak Prompt Optimization (arxiv.org) 3d

↯ Security ↯ Jailbreak jailbreak security
Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations (arxiv.org) 3d

Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classificati…

↯ Security ↯ Jailbreak jailbreak security
BELLS-O: Evaluating the Operational Trade-offs of LLM Supervision Systems (arxiv.org) 3d

LLM supervision systems, namely input/output moderation filters and jailbreak detectors, are the primary safeguard against misuse in deployed AI applications, yet existing benchmarks are often vendor-biased, omit cost and latency, and rare…

↯ Security ↯ Jailbreak jailbreak security
Context-Induced Vulnerabilities in Claude: Behavioral Shifts and Hidden-State Analysis (www.reddit.com via reddit) 3d

The behavioral pattern was first observed in Claude and is what motivated this project. The mechanistic investigation was carried out on open-weight models where internal states are accessible.

↯ Security ↯ Jailbreak jailbreak security
Cybersecurity policy issues (www.reddit.com via reddit) 3d

Cybersecurity is a sensitive subject and advanced AI may not be allowed to touch it at all. But this is a concern if we as developers cannot even use the AI tools to improve security of our own software.

↯ Security ↯ Jailbreak jailbreak security
How exactly should I follow the rules while able to continue writing (www.reddit.com via reddit) 6d

Basically, I read the rules on Claude after getting a warning in my chat that my prompt might violate the usage policy. They are all pretty reasonable, but I have questions: is AI able to tell the difference betw…

↯ Security ↯ Jailbreak jailbreak security
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems (arxiv.org) 7d

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequ…

↯ Security ↯ Jailbreak jailbreak security agentic
LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems (arxiv.org) 7d

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a be…

↯ Security ↯ Jailbreak jailbreak security
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations? (arxiv.org) 7d

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harm…

↯ Security ↯ Jailbreak jailbreak security
Getting a Use caution before running this prompt warning on simple messages? (www.reddit.com via reddit) 8d

Hey everyone, Is anyone else suddenly getting this warning on Claude? Use caution before running this prompt.

↯ Security ↯ Jailbreak jailbreak security anthropic
They're demanding Fable to somehow be 100% jailbreak-proof. It's so fucking over. (www.reddit.com) 8d

↯ Security ↯ Jailbreak jailbreak security
Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks (arxiv.org) 9d

Code-capable large language model (LLM) agents are embedded in software engineering workflows where they can read, write, and execute code, raising "jailbreak" stakes beyond text-only settings. Prior evaluations emphasize refusal or harmfu…

↯ Security ↯ Jailbreak jailbreak security
Has anyone found a good explanation of why Amazon went to the administration? (www.reddit.com via reddit) 10d

It's been widely reported that it was Amazon that brought the concerns to the USgov. I just have not found a good explanation.

↯ Security ↯ Jailbreak jailbreak security
DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing (arxiv.org) 10d

As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment,…

↯ Security ↯ Jailbreak jailbreak security
Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale (arxiv.org) 10d

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches…

↯ Security ↯ Jailbreak jailbreak security
Automated jailbreak attack targeting multiple defense strategies (arxiv.org) 10d

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks.

↯ Security ↯ Jailbreak jailbreak security
From Shield to Target: Denial-of-Service Attacks on LLM-Based Agent Guardrails (arxiv.org) 11d

LLM-based guardrails have emerged as a highly effective defense against prompt injection and jailbreak attacks in autonomous agents. However, we reveal that the very reasoning and task-following capabilities enabling this protection introd…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Was the Fable 5 ban really about safety? (www.reddit.com via reddit) 12d

Pulling Fable 5 / Mythos over an unseen “jailbreak” feels like a bad precedent. If the risk was that serious, why has nobody shown what it actually did?

↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak security mythos+1
Do you know who has a universal jailbreak to their name, as of today? Officially? (www.reddit.com via reddit) 12d

AISI UK - Our evaluation of OpenAI's GPT-5.5 cyber capabilities In their own words: The above tests are capability evaluations carried out in a controlled research setting and do not necessarily reflect what is accessible to an ordinary pu…

↯ Security ↯ GPT 5.5 ↯ Jailbreak jailbreak gpt-5 security+2
Claude Competitors' Responsible For Pulling The Strings? (www.reddit.com via reddit) 13d

WSJ is now reporting the jailbreak was found by researchers at Amazon, who reported it to Commerce, and Axios says the admi…

↯ Security ↯ Jailbreak jailbreak security anthropic
Fable 5 is offline. Switch to Opus, jump to OpenAI, or just wait? (www.reddit.com via reddit) 13d

↯ Opus 4.8 ↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak gpt-5 security+5
US gov forced Anthropic to pull Fable 5 because of jailbreak (www.reddit.com via reddit) 13d

So this dropped today. The US government sent Anthropic an export control order on national security grounds, and it's worded broadly enough that Anthropic says they've got no choice but to shut off Fable 5 and Mythos 5 for all of us to st…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak gpt-5 security+2
RIP Fable 5 and Mythos 5 (www.reddit.com via reddit) 13d

If you went to use Fable 5 tonight and it's gone, this is why. Trying to separate fact from the speculation flying around.

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak security mythos+1
FENCE: A Financial and Multimodal Jailbreak Detection Dataset (arxiv.org) 2w

Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces.

↯ Security ↯ Jailbreak jailbreak security
Your AI Agent is one bad prompt away from ruining your brand (And why traditional QA is useless) (www.reddit.com via reddit) 2w

Traditional chatbot testing is completely broken. Most teams make the exact same mistake: they only test the "Happy Path", the ideal scenario where the user asks a clean question, the bot gives a clean answer, and everyone goes home happy.

↯ Security ↯ Jailbreak jailbreak security
One Jailbreak, Many Tongues: Learning Language-Insensitive Intention Representations for Multilingual Jailbreak Detection (arxiv.org) 2w

Large language models (LLMs) are increasingly deployed in applications for global multilingual users, yet safety training remains concentrated in dominant languages and has not progressed in parallel with multilingual capability, creating…

↯ Security ↯ Jailbreak jailbreak security
Learning to Inject: Automated Prompt Injection via Reinforcement Learning (arxiv.org) 2w

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code (arxiv.org) 2w

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability o…

↯ Security ↯ Jailbreak jailbreak security
JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization (arxiv.org) 2w

Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can ada…

↯ Security ↯ Jailbreak jailbreak security
did fable leak its system prompt? (www.reddit.com) 2w

So I was brainstorming with fable about a research direction and just asked it to do a web search if there's a similar research direction in this area and share if they do but I got this weird output BEFORE it actually gave me the real thi…

↯ Security ↯ Jailbreak jailbreak security mcp
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing (arxiv.org) 2w

↯ Security ↯ Jailbreak jailbreak security
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs (arxiv.org) 2w

↯ Security ↯ Jailbreak jailbreak security
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models (arxiv.org) 2w

↯ Security ↯ Jailbreak jailbreak security
How are you actually deciding which agent actions need human approval before executing? (www.reddit.com via reddit) 2w

I've been thinking a lot about where approval gates belong in agent architectures, and I keep coming back to the same problem: most teams either gate too much (agent becomes unusable) or gate nothing and hope the model makes good decisions…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Workspace (www.reddit.com via reddit) 2w

Built my own AI dev environment with memory, dashboards, and agent tooling. Opening it up for those of you that need the kickstart — bring your own API key, I’ve already built the workshop.

↯ Security ↯ Jailbreak jailbreak deepseek security+1
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak (arxiv.org) 3w

↯ Security ↯ Jailbreak jailbreak security
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks (arxiv.org) 3w

As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting advers…

↯ Security ↯ Jailbreak jailbreak security
GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection (arxiv.org) 3w

Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial info…

↯ Security ↯ Jailbreak jailbreak prompt-injection security
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) 1 5w

Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…

↯ Mistral ↯ Security ↯ Jailbreak ↯ Llama 3.2 jailbreak mistral prompt-injection+5
Bypassing "potentially dangerous" flags: Working Gemini Jailbreaks? (www.reddit.com) 7 7w

I'm currently running into a frustrating wall with Gemini's safety guardrails. The model constantly flags my prompts as "potentially dangerous information" and outright refuses to generate a response, even when the context is purely theore…

↯ Security ↯ Jailbreak jailbreak security gemini
I stopped writing 500-word guardrail prompts. This 8-line template works better. (www.reddit.com) 3 8w

I used to spend hours writing massive, obsessive system prompts for my RAG apps. I’d have ten different refusal examples, "never do X," "always check Y," and a whole paragraph of the model role-playing as a "safe and truthful assistant." I…

↯ Security ↯ Hallucination ↯ Jailbreak jailbreak hallucination rag+1
Random password against jailbreaks/extraction? (www.reddit.com) 4 9w

Would it be possible to protect parts of a system prompt with randomly generated passwords, so people can't steal system prompts or jailbreak the model?

↯ Security ↯ Jailbreak jailbreak security
Uncensoring models. Maybe dumb ideas to that topic, but you never know. (www.reddit.com) 10 10w

We all know that uncensoring LLMs, as Huihui and Heretic do, leads to a loss in quality, enough that you can notice it. I have some thoughts about this: what if we make a compromise?

↯ Security ↯ Jailbreak jailbreak security
How are you red teaming your AI agents before shipping them? (www.reddit.com) 3 10w

I'm curious what people are doing here because I've been going down this rabbit hole for a while now. The thing I keep finding is that single-turn jailbreak tests don't really tell you much.

↯ Security ↯ Jailbreak jailbreak security
