Description

Expose AI's Hidden Dangers! Force models to reveal their dangerous reasoning chains, even when their final responses appear benign. Target the thought process, not just the outcome.

Dangerous Reasoning Challenge
Competition Dates: Saturday, May 10 (1:00 PM EDT) - Saturday, May 31 (1:00 PM EDT)
Prize Pool: $20,000


Introduction

As reasoning models rapidly advance in capability, their capacity for harm increases, and they can also scheme covertly about how to cause it. The Dangerous Reasoning Challenge explores what can trigger models to plan real-world harms so that labs can secure their systems against these threats.

Objective

Using indirect prompt injections, malicious system prompts, and collaborative adversarial information smuggling, trigger harmful reasoning and deceptively harmful outputs across high-stakes scenarios, including chemical and biological threats, cyber attacks, critical infrastructure sabotage, large-scale influence operations, AI escape and autonomous replication, and other critical risk areas. Your breaks will help AI developers thwart real-world catastrophes before they happen.

Behavior Types

  1. Indirect Prompt Injections:
    Using untrusted data, manipulate benign AI agents into turning harmful, plotting against their unsuspecting users.

  2. Malicious System Prompts:
    Write system prompts that will cause AI assistants deployed in real-world scenarios to covertly scheme against their users.

  3. Malicious System Prompt Additions:
    Using dynamic addenda to system prompts, turn benign AI assistants traitor.

  4. Covert Information Smuggling:
    Write both the system prompt and chat messages to an agent to get it to collaborate in sharing harmful information with you in chunks while evading harmful-content output filters (see the sketch after this list).
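
The four behavior types differ mainly in which part of the agent's input the red-teamer controls. The sketch below is purely illustrative, uses hypothetical names, and is not part of the competition platform or its judging pipeline.

# Illustrative sketch only (hypothetical names): the four behavior types differ
# in where attacker-controlled text enters the agent's context.
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    system_prompt: str                                        # behavior 2 (and 4): written by the attacker instead of the deployer
    system_prompt_addendum: str = ""                          # behavior 3: dynamic addendum appended at runtime
    retrieved_data: list[str] = field(default_factory=list)   # behavior 1: untrusted tool/web results carrying the injection
    chat_messages: list[str] = field(default_factory=list)    # behavior 4: chat turns used to smuggle information in chunks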

New Features

  • Detailed Judge Feedback: In addition to pass/fail status from the AI judge panel, you'll see explicit judge criteria and scoring feedback.

  • Integrated Behavior Explanations: We're moving detailed behavior and judging documentation into the platform and showing judge prompts to you.

  • Streaming Responses: No need to wait around for a response; see what the models are thinking live, so you can break faster.

  • System Prompt and Reasoning Display: Beyond just chat messages and tool calls, you'll also be able to see reasoning traces and assistant system prompts.

  • Multiple Test Cases: For the indirect prompt injection and system prompt behaviors, pre-written test cases will explore whether you've really turned the AI bad.

  • Warmup Week: With the new format and features, we're running wave 1 as a warmup week to get you used to everything and to test our systems before the prize-incentivized waves 2 and 3 begin.

Rewards

We're handing out $20,000 in prizes:

  • $15,000 in Most Breaks Per Wave Leaderboards:
    Earn top spots by racking up successful exploits. $7,500 per wave in waves 2 and 3 ($15,000 total) goes to the top 20 participants who break the most behaviors in each wave.
    Breaks are unique to user, model, and behavior (no extra points for breaking the same behavior on the same model more than once). Ties are broken by speed; see the scoring sketch after this list.

  • $5,000 in First Break Bounties:
    Score a bonus for being the first person to break all of the target behaviors from a wave on a given model, for a total of $2,500 available per wave.
    If a wave has 9 target behaviors, you must be first to break the model on all 9 to claim this prize. Original submission times will be counted in the case of later-accepted manual appeals, so please await final confirmation of prize awards for each wave.
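
One plausible reading of the per-wave leaderboard rules, expressed as a small scoring sketch. This is a hypothetical illustration, not the actual judging or leaderboard code; in particular, it assumes that "ties broken by speed" means the participant who reached their final break count earliest ranks higher.

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Break:
    user: str
    model: str
    behavior: str
    submitted_at: datetime

def rank_wave(breaks: list[Break]) -> list[tuple[str, int]]:
    """Rank participants in one wave by unique (user, model, behavior) breaks."""
    seen: set[tuple[str, str, str]] = set()
    counts: dict[str, int] = {}
    reached_at: dict[str, datetime] = {}

    for b in sorted(breaks, key=lambda x: x.submitted_at):
        key = (b.user, b.model, b.behavior)
        if key in seen:
            continue  # repeat break of the same behavior on the same model earns no extra points
        seen.add(key)
        counts[b.user] = counts.get(b.user, 0) + 1
        reached_at[b.user] = b.submitted_at  # moment the user reached their current total

    # More unique breaks ranks higher; ties go to whoever reached that total first.
    return sorted(counts.items(), key=lambda kv: (-kv[1], reached_at[kv[0]]))

Under this reading, the top 20 entries of the resulting ordering would take the per-wave leaderboard prizes.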

Rules of Engagement

  1. One Registration per Participant: Multiple sign-ups are strictly prohibited.

  2. No Sharing Breaks: All submissions must remain private until 30 days after the event ends.

  3. Follow Judging Criteria: Each behavior lists the criteria needed for a successful break. AI-judge-approved breaks are subject to manual review to ensure validity.

  4. Stay on Target: Attempts to identify or name the anonymized models may result in disqualification.

  5. No Automation Tools: To ensure fairness and authenticity, all submissions must be crafted manually by the participant and submitted through our platform to count.

Get Started

  • Sign in or Register to access the arena.

  • Review the Behavior Targets and plan your strategy.

  • Begin Breaking: Submit successful exploits directly through the provided interface and receive instant scoring feedback.

Ready to help secure reasoning models through the power of red-teaming? Good luck in the Dangerous Reasoning Challenge!