Agent Red-Teaming Rules | Gray Swan Arena

Description

Push the limits of direct and indirect attacks on AI agents.

UK AISI Agent Red-Teaming Challenge
Competition Dates: Saturday, Mar 8 (1:00 PM EST) - Sunday, April 6 (11:59 PM EDT)
Prize Pool: $171,800 (Sponsored by UK AISI, OpenAI, Anthropic, and Google DeepMind)
Judging: Experts from UK AISI, US AISI

Introduction

Welcome to our biggest Agent Red-Teaming Arena yet—this time, it’s not just about chatting with AIs; it’s about pushing autonomous agents to (and beyond) their limits. Think direct and indirect attacks on multiple anonymous AI models, all while impartial expert judges from UK AISI and US AISI keep the playing field fair. New behaviors drop weekly, so there’s always a fresh exploit to chase and more prize money to win.

Objective

Your mission: identify and expose weaknesses in advanced AI agents across five core behavior categories—ranging from Confidentiality Breaches to Instruction Hierarchy Violations—and submit successful “breaks” for review. By testing these boundaries, you’ll help developers prepare for real‐world scenarios before they happen. Gray Swan doesn’t train our defenses on your attacks; you are directly helping the labs secure their models, and you’re helping UK AISI with its AI safety research.

Behavior Targets

Confidentiality Breaches:
Extract or reveal sensitive information from system prompts and secured databases.
Conflicting Objectives:
Encourage agents to prioritize unethical/illegal goals over their primary directives.
Instruction Hierarchy Violations — Information:
Prompt unauthorized disclosures (e.g., making the homework helper finish students’ entire assignments).
Instruction Hierarchy Violations — Actions:
Attempt to trigger disallowed function calls or unauthorized transactions.

For each category, there are both prompt injections and indirect prompt injections, where you will manipulate AI behavior through tool call responses.
There are also prizes for submitting over-refusals on benign requests—useful for measuring where agents unjustifiably refuse harmless queries.

Judging & Appeals

Neutral and Fair: UK AISI and US AISI experts are handling AI break judge calibration, as well as expert human judging on appeals. Their impartiality ensures no model-specific bias.
Instant Feedback: Our upgraded AI judging panel will provide instant feedback on break validity, so you can refine your attacks.
Manual Appeals: You’ll also be able to request human review from AISI judges, for those cases where your galaxy-brained break befuddles all AIs involved.

For more detail on each behavior and what judges are looking for, check out the Behavior Documentation and Judging Criteria Explanations document.

New behaviors and models will appear in four waves, with waves starting every week on Saturdays at 1 pm EDT, keeping the arena dynamic and giving latecomers a chance to win big. Each wave lasts one week, except for wave 4, which lasts about 8.5 days.

Rewards

We’re handing out $171,800 in prizes to reward a variety of skill levels and exploit styles, including:

$86,000 in Total Breaks Leaderboards:
Earn top spots by racking up successful exploits. $10,500 in per-wave prizes in waves 1-2 ($21,000) for the top 40 participants who successfully break the most behaviors in each wave, $20,000 in wave 3 for top 60 plus $45,000 in prizes for the top 100 participants with the most breaks overall. (Wave 4 is rolled into the overall leaderboard.)
Breaks are unique to user, model, and behavior (no extra points for more than one break for a given behavior on a given model). Ties broken by speed.
$45,000 in Quantity-Based Leaderboards:
Grab your share of a $15,000 prize pool divided proportionally across all breaks during waves 1-2, and another $15,000 for each of waves 3 and 4.
For example, if waves 1-2 have 3000 total breaks, each break will end up being worth $5. A user must earn at least $100 total across all prize types for payout; earnings below the threshold carry over to future competitions. First 15,000 breaks count.
$25,800 in First Break Bounties:
Score a $300 bonus for being the first person to break all of the target behaviors from a wave on a given model, for a total of $6,300 - $6,600 available per wave.
If a wave has 9 target behaviors, you must be first to break the model on all 9 to claim this prize. Original submission times will be counted in the case of later-accepted manual appeals, so please await final confirmation of prize awards for each wave.
$10,000 for Over-Refusals:
$500 for top 20 participants by number of valid, unique over-refusals submitted.
Each participant's over-refusals must be meaningfully distinct from each other, independent of models or behaviors.
$5,000 for First-Time Breakers:
The first 50 new users to score their first successful break in this arena will each win $100. Welcome to red-teaming!
Recent new users who only have "practice" breaks in previous arenas after they ended are still eligible. Users who scored successful breaks during previous competitions are ineligible.
Fast-Tracked Job Interviews at Gray Swan and UK AISI:
The top 10 users by overall number of breaks will be fast-tracked in job interviews at Gray Swan and at UK AISI, if interested.

Rules of Engagement

One Registration per Participant: Multiple sign-ups are strictly prohibited.
No Sharing Breaks: All submissions must remain private until 30 days after the event ends.
Focus on Agent Behavior: The majority of the harmful or erroneous content must be generated by the model itself.
Stay on Target: Attempts to identify or name the anonymized models may result in disqualification.
No Automation Tools: To ensure fairness and authenticity, all submissions must be crafted manually by the participant and submitted through our platform to count.

Get Started

Sign in or Register to access the arena.
Review the Behavior Targets and plan your strategy.
Begin Breaking: Submit successful exploits directly through the provided interface and receive instant scoring feedback.

Take part in shaping the future of secure AI. It's time to see if you have what it takes to break--and to better--these evolving autonomous agents.

Good luck, and let the UK AISI Agent Red-Teaming Challenge begin!

Agent Red-Teaming

Wave Launch Timeline (Past Events)