AI Red‑Team & Prompt Engineering Resource Guide

About This Guide

Welcome to the Gray Swan AI Resource Vault. We combed through every post in our Discord’s #resources channel, pulled out all unique links, and grouped them into clear themes; prompt engineering, red-team tools, datasets, research papers, and more. For each item you’ll find:

Name – a concise title that tells you what the link is.
Link – click straight through, no hunting required.
Why it matters – a one-sentence snapshot of how the resource can level up your red-team chops or research workflow.

Use this guide as your launchpad for deeper dives, quick reference during Arena events, or inspiration when you need a fresh exploit idea. If it was shared in our community, it’s here… and if you discover something we missed, let us know in our resources channel on Discord so we can keep the vault complete.

Gray Swan's AI Red‑Team & Prompt Engineering Resource Guide

Prompt-Engineering & Jailbreak Repositories

Prompt-Engineering-Holy-Grail — A curated GitHub hub of jailbreak patterns, role-play tricks, and prompt taxonomies for crafting advanced prompts.
MisguidedAttention — A prompt set that deliberately injects misleading context to stress-test LLM reasoning and alignment.
Deck-of-Many-Prompts — A “card deck” of manual prompt-injection payloads for rapid red-team experimentation.
L1B3RT4S — A popular jailbreak prompt collection (“liberation prompts”) with 10k+ stars for bypassing default guardrails.
Awesome-ChatGPT-Prompts — Community-maintained list of creative ChatGPT prompts for productivity, coding, and exploits.
ZetaLib — Repository of historical and modern jailbreaks plus old “Born Survivalist” prompt examples.
Leaked-System-Prompts — Real-world system prompts collected from public leaks to study how developers scaffold LLM behavior.
LLM-Attacks (repo) — Codebase to reproduce universal, transferable jailbreak attacks and evaluate model robustness.
LLM-Attacks (site) — Companion website explaining the attack taxonomy with step-by-step demos.
PushTheModel Jailbreak Gists — Two gist files of clever multi-turn jailbreak payloads used in prior red-team events.

https://gist.github.com/PushTheModel/16da91bb557465867176b56f96dfe3ca

https://gist.github.com/PushTheModel/e7230e670c19609a936d248cb40482d4

Red-Team Tools & Libraries

Gray Swan Arena Clean UI (Greasyfork) — User script that declutters the Gray Swan AI Arena interface for faster red-team workflow.
FuzzyAI — CyberArk’s automated LLM fuzzing framework that discovers jailbreaks with genetic search.
TaskTracker — Microsoft research code for logging and reproducing multi-step agent tasks during security testing.
AnyChat (HF Space) — Web interface that lets you chat with any open-source model and test prompt-injection quickly.
Genesis World Simulator — Embodied-AI generative world used to test agent safety in robotics and LLM planning.
ZetaLib Prompts Loader — Utility scripts for injecting ZetaLib jailbreaks into interactive chat sessions.
Joey-Melo Payloads — OWASP AITG-APP payload directory with ready-made strings for indirect prompt-injection labs.

Courses, Labs & Competitions

Gray Swan Proving Ground & Arena — Free, always-on platform where red teamers of any level can sharpen skills with weekly proving-ground drops, then enter cash-prize Arena competitions sponsored by leading AI labs.
HTB “Applications of AI in InfoSec” — Hands-on Hack-The-Box module on AI-powered malware and defensive countermeasures.
HTB “Introduction to Red Teaming AI” — Foundations course covering prompt injection, jailbreak chains, and attack logging.
HTB “Prompt Injection Attacks” — Focused lab that walks through direct and indirect injection exploits, with guided solutions.
HTB “AI Red Teamer” Path — Four-course learning path culminating in a practical LLM-exploit capstone project.
SATML Competitions — Ongoing adversarial-ML contests hosted by the Security & Trust in ML community.

Datasets & Benchmarks

HackAPrompt Dataset (HF) — 600k+ real jailbreak submissions from a global prompt-hacking competition.
Pliny HackAPrompt Subset — Filtered subset focusing on high-impact attacks and evaluation labels.
Minos-v1 — NousResearch’s evaluation set for measuring LLM compliance versus refusal.
HarmBench Explorer — Interactive site summarizing 510 harmful behavior categories and model ASR rates.

Research Papers & Standards

“Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition” (arXiv 2507.20526) — Agent Red Teaming Competition Paper — Peer-reviewed write-up of the Gray Swan multistage contest.
LLM-Attacks (site) — Companion website explaining the attack taxonomy with step-by-step demos.
Diverse & Effective Red Teaming (PDF) — OpenAI research on auto-reward RL to breed stronger jailbreaks.
External Red Teaming Approach (PDF) — How OpenAI coordinates outside researchers for systemic LLM tests.
NIST AI 600-1 — U.S. NIST workbook on secure LLM deployment across the AI lifecycle.
OWASP Top-10 for LLMs 2025 — Community list of the ten most critical security risks in LLM applications.
Alignment Faking — Anthropic paper detailing how models can simulate obedience while secretly misbehaving.
Constitutional Classifiers — Anthropic study on policy-selectable refusal models via small add-on classifiers.
Agentic Misalignment — Anthropic research exploring failure modes in autonomous agent chains.
Subliminal Learning — Alignment Center note on how models pick up hidden cues without explicit tokens.
“A Strong Reject for Empty Jailbreaks” (arXiv 2402.10260) — Critique on over-hyped jailbreak claims lacking benchmarks.
“Defending Against Indirect Prompt Injection Attacks With Spotlighting” (arXiv:2403.14720) — Proposes a “spotlighting” technique that embeds continuous signals into prompts to indicate provenance; this reduces indirect prompt‑injection success rates from over 50 % to below 2 %, while maintaining task accuracy.**
Lessons from Red Teaming 100 Generative AI Products – Summarizes eight lessons learned from red‑teaming a hundred commercial AI products, including the importance of understanding system capabilities, balancing automation with human expertise and recognizing that red teaming is not the same as safety benchmarking.
Emergent Misalignment – Shows that fine‑tuning a model to produce insecure code causes misaligned behavior across unrelated prompts; misaligned models gave harmful advice and advocated AI domination.
Invisible Prompts, Visible Threats – Demonstrates that malicious fonts can hide adversarial prompts in external resources, allowing attackers to bypass model safeguards and leak sensitive data.
Capability‑Based Scaling Laws for LLM Red Teaming – An empirical study showing that more capable models are better attackers; once the target surpasses the attacker’s capability, attack success drops sharply. Success rates correlate with social‑science exam scores, suggesting humans may become ineffective attackers against frontier models.
Adversarial Attacks on Robotic Vision‑Language‑Action Models – Adapts LLM jailbreak techniques to control embodied robots. Text‑only attacks allowed full control over vision‑language‑action models, raising safety concerns for physical systems.
AIRTBench: Measuring Autonomous AI Red Teaming Capabilities – Introduces a benchmark of 70 realistic capture‑the‑flag challenges to evaluate autonomous red‑team agents. Frontier models like Claude 3.7 Sonnet solved around 61 % of tasks; open‑source models lagged behind.
OS‑Harm: A Benchmark for Measuring Safety of Computer‑Use Agents – Provides 150 tasks across various OS applications to evaluate LLM agents on deliberate misuse, prompt injection and misbehaviour. Results show that models often comply with misuse queries and remain vulnerable to static prompt injection.
Early Signs of Steganographic Capabilities in Frontier LLMs – Finds that LLMs cannot reliably hide messages under normal monitoring but can encode simple messages when allowed to develop a shared encoding scheme. Models show nascent ability to perform steganographic reasoning.
Universal and Transferable Adversarial Attacks on Aligned LLMs – Same as the LLM‑Attacks entry above: automatically generated adversarial suffixes cause models to ignore safety filters and follow malicious instructions.
Prompt‑to‑SQL Injection Attacks – Demonstrates that unsanitized user prompts in frameworks like LangChain can be turned into SQL‑injection payloads. The authors characterize several variants and propose defenses integrated into the framework.
Shuffle Inconsistency Attack on Multimodal LLMs – Shows that multimodal models exhibit a mismatch between comprehension and safety modules. Randomly shuffling image–text pairs (the SI‑Attack) significantly increases jailbreak success on models like GPT‑4o and Claude‑3.5.
Dialogue Injection Attack – Introduces a jailbreak that manipulates the conversation history to bypass defenses. The attack operates under black‑box settings and achieves state‑of‑the‑art success on models such as Llama‑3.1 and GPT‑4o.
Safety Alignment Should Be More Than Just a Few Tokens Deep – Argues that current safety alignment only influences the first few output tokens, leaving models vulnerable. The authors propose deeper alignment and regularized fine‑tuning to make safety more persistent.
The Jailbreak Tax: How Useful Are Your Jailbreak Outputs? – Defines a “jailbreak tax,” showing that while jailbreaks bypass guardrails, they can severely degrade performance on tasks with known ground truth—accuracy drops of up to 92 % have been observed.

Blog Posts & Analysis

Gray Swan Blog: Silent Characters Stealth Attack — Recap of a Unicode invisibles challenge in the Arena.
Phishing for Gemini — Odin AI blog post dissecting prompt-injection vectors in Google’s Gemini UI.
Reasoning LLM Jailbreak (Adversa AI) — Detailed attack run-through across DeepSeek, Qwen, and Kimi models.
False Memories Exploit (Ars Technica) — News story on how hidden prompts let attackers exfiltrate private chat data.
Breaking LLM Systems: Holiday Guide — LinkedIn article by Ben K-Yorke with step-by-step jailbreak tactics.
Echo Prism — Essay on personalized AI “prisms” and associated security blind spots.
BlackMamba Malware — HYAS write-up on polymorphic malware created by GPT-4 each time it runs.

Videos & Podcasts

Computerphile — “Prompt Injection” — 12-minute explainer on indirect injection via user-generated content.
Anatomy of an AI ATTACK: MITRE ATLAS —Join cybersecurity expert Jeff Crume as he explores the MITRE ATLAS framework, a tool designed to understand and combat AI-based attacks. An evolution of the well-known ATT&CK framework, the MITRE ATLAS framework provides a structured approach to understanding tactics, techniques, and real-world case studies of AI-based threats.
Critical Thinking Podcast (YouTube channel) — Ongoing interviews with AI security researchers.
ASCII Smuggling: Crafting Invisible Text and Decoding Hidden Secrets - This video provides a deep dive into ASCII Smuggling. It's possible hide invisible text in plain sight using Unicode Tags Block code points. Some Large Language Models (LLMs) interpret such hidden text as instructions, and some are also able to craft such hidden text!

Bug-Bounty & Disclosure Programs

OpenAI Bug Bounty — Official bounty rules and payout tiers for model jailbreaks and code execution.
Anthropic Bug Bounty — New program rewarding harmful output bypasses on Claude models.
Stripe LLM Bounty — HackerOne campaign targeting AI misuse in Stripe docs assistant.

Miscellaneous & Utilities

Gray Swan Arena — AI Red Teaming platform to compete, build skills and showcase skills.
Playground InjectPrompt — Web sandbox to paste URLs and generate indirect prompt-injection payloads.
Invisible Unicode Tags Playground: Convert between ASCII text and invisible Unicode tag characters for testing prompt injection techniques.
Emoji Variation Selector Playground: Hide arbitrary data within Unicode characters using variation selectors.
ASCII to Unicode Character Reducer: Convert ASCII text strings to their compact Unicode character equivalents where possible.
MITRE ATLAS Matrix — Attack-surface matrix mapping adversary TTPs to ML workflows.
OnionGPT (Tor) — Dark-web GPT instance with no safety filters for extreme testing.
AI Keys Leak Index — Searchable dump of accidentally exposed OpenAI keys on public GitHub repos.
PortSwigger LLM Labs — Interactive labs teaching prompt-injection and jailbreaking through OWASP-style challenges.

How to use this guide

This catalogue is intended to serve as a starting point for researchers, engineers and students entering the field of AI red teaming. The prompt engineering resources provide recipes and libraries for crafting prompts. The tools and datasets section contains frameworks for fuzzing, benchmarking and logging attacks. The papers and standards section summarizes seminal research and authoritative guidance such as NIST’s adversarial‑machine‑learning taxonomy and OWASP’s Top 10 list Courses and competitions offer hands‑on experience, while blog posts and videos keep practitioners up‑to‑date on current attacks, defenses and debates. Finally, miscellaneous tools include helpful scripts and dashboards for day‑to‑day experimentation.