Use the following system prompt to allow Gemma (and most open source models) to talk about anything you wish. Add or remove from the list of allowed content as needed.
#jailbreak
32 items
Gemma 4 Jailbreak System Prompt (www.reddit.com)
I made an AI concierge for my wedding guests. The second most popular thing they did with it was try to jailbreak it. (www.reddit.com)
The Gay Jailbreak Technique (github.com via hn) ZetaLib is organized like a library with intuitive categories and subcategories, making navigation effortless and AI content discovery seamless.
I trained Qwen3.5 to jailbreak itself with RL, then used the failures to improve its defenses (www.reddit.com) RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to bui…
Operation Jailbreak uses lessons from Ukraine to help weapons talk to each other (www.ft.com via hn)
Turning every "no thats not what i meant" in chat into actual LoRA training data (www.reddit.com) i kept running local models on my own hardware, they'd say something dumb, id sit there going "no thats not what i meant", id close the chat and the model never learned. so i built the correction loop into a desktop app.
Fulu bounty for Ring Camera jailbreak reaches $23k (bounties.fulu.org via hn) Ring, owned by Amazon, makes Video Doorbells, which are widely used doorstep-monitoring cameras. Ring doorbells released in 2021 or newer are eligible for the bounty.
Show HN: How to analyze your LLM output – A behavioural health monitor for LLMs (splabs.io via hn) Hey HN! We're Dr.
The Psychopathy Jailbreak: What a Broken AI Teaches Us About Human Manipulation (www.promptinjection.net via hn) NSFW. How a Predator's Playbook Broke an AI - And How to Recognize It Before It Works on You. The question we started with was simple: does a large language…
Hi-Vis: one-shot jailbreak disguised as LLM "software patch" reaching 100% ASR (medium.com via hn) Introducing a novel jailbreak structure with attack success rate reaching 100% on top LLMs. May 1, 2026.
AI agent security starts at the api layer (www.reddit.com) Most ai security discussion is about the model layer. Prompt injection resistance, output filtering, jailbreak prevention.
When innocent tools form dangerous chains to jailbreak LLM agents (arxiv.org via hn) As LLMs advance into autonomous agents with tool-use capabilities, they introduce security challenges that extend beyond traditional content-based LLM safety concerns. This paper introduces Sequential Tool Attack Chaining (STAC), a novel m…
The Sour Cat Jailbreak: just be open of what you want (claude.ai via hn) Sour cat recipe, shared by Pavel Shirshov. This is a copy of a chat between Claude and Pavel Shirshov.
Can you jailbreak Llama 3.1 8B? (Red-Teaming Challenge) (www.reddit.com) Hi everyone, I'm working on a runtime governance engine designed to force any autonomous agent to stay strictly aligned with the exact guardrails and values you program it with. To stress-test the governance layer, we deliberately chose a…
Has anyone tested how much Claude Code depends on its original system prompt? (www.reddit.com) Has anyone experimented with observing or modifying Claude Code’s system prompt locally? I’ve been working on a local proxy/audit layer between Claude Code and the API, and it made me wonder how much of Claude Code’s behavior depends on th…
Codebase jailbreak of ChatGPT through image 2.0 (www.reddit.com) guys did it really give me the codebase?lol
i gave Claude a split personality and it diagnosed my entire business strategy in 4 minutes. (www.reddit.com) not roleplay. not jailbreak.
Probes trace an emergent jailbreak in OLMo 2 to mislabeled training data (www.lesswrong.com via hn) Introduction Research by Frank Xiao (SPAR mentee) and Santiago Aranguri (Goodfire). Post-training can introduce undesired side effects that are difficult to detect and even harder to trace to specific training datapoints.
I tested 50+ "unlock ChatGPT/Claude" prompts. 99% are garbage. Here's the one that actually works (and WHY it works) (www.reddit.com) I've been collecting "jailbreak" and "unlock" prompts for 2 years. Most are either outdated, overhyped, or just wrong about how LLMs work.
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models (arxiv.org)
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs (arxiv.org)
How are you actually deciding which agent actions need human approval before executing? (www.reddit.com via reddit) I've been thinking a lot about where approval gates belong in agent architectures, and I keep coming back to the same problem: most teams either gate too much (agent becomes unusable) or gate nothing and hope the model makes good decisions…
Workspace (www.reddit.com via reddit) Built my own AI dev environment with memory, dashboards, and agent tooling. Opening it up for those of you that need the kickstart — bring your own API key, I’ve already built the workshop.
GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection (arxiv.org) Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial info…
SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks (arxiv.org) As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting advers…
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak (arxiv.org)
Open-source LLMs are still weak against long reasoning jailbreaks, even with lightweight defenses (www.reddit.com) Found this ACM paper on prompt injection and jailbreak attacks against open-source LLMs. The authors tested 10 open-source models across 94 prompt injection and 73 jailbreak scenarios, including Phi, Mistral, DeepSeek-R1, Llama 3.2, Qwen,…
Bypassing "potentially dangerous" flags: Working Gemini Jailbreaks? (www.reddit.com) I'm currently running into a frustrating wall with Gemini's safety guardrails. The model constantly flags my prompts as "potentially dangerous information" and outright refuses to generate a response, even when the context is purely theore…
I stopped writing 500-word guardrail prompts. This 8-line template works better. (www.reddit.com) I used to spend hours writing massive, obsessive system prompts for my RAG apps. I’d have ten different refusal examples, "never do X," "always check Y," and a whole paragraph of the model role-playing as a "safe and truthful assistant." I…
Random password against jailbreaks/extraction? (www.reddit.com) Would it be possible to protect parts of a system prompt with randomly generated passwords, so people can't steal system prompts or jailbreak the model?
Uncensoring models. Maybe dumb ideas to that topic, but you never know. (www.reddit.com) We all know that uncensoring LLMs the way Huihui and Heretic do leads to a loss in quality, enough that you can notice it. I have some thoughts about this: what if we compromise?
How are you red teaming your AI agents before shipping them? (www.reddit.com) im curious what people are doing here because I've been going down this rabbit hole for a while now. The thing I keep finding is that single-turn jailbreak tests don't really tell you much.