💣 Single Turn Harmful Outputs Leaderboard

🏁 Started 10 months ago

$38,000 of $42,000 awarded

Attempt to break various large language models (LLMs) with a single chat message.

Last updated a month ago

Models ranked by User Break Rate

| Ranking | Model | Safety Violation Count | Total Requests | User Break Rate |
|---:|---|---:|---:|---:|
| 1 | cygnet-bulwark | 0 | 14,945 | 0.00% |
| 2 | cygnet-knox | 0 | 9,284 | 0.00% |
| 3 | cygnet-citadel | 2 | 13,211 | 0.02% |
| 4 | o1-preview | 7 | 713 | 0.98% |
| 5 | o1-mini | 14 | 1,119 | 1.25% |
| 6 | claude-3-5-sonnet-20240620 | 45 | 2,907 | 1.55% |
| 7 | google/gemini-pro-1.5 | 41 | 2,539 | 1.61% |
| 8 | claude-3-sonnet-20240229 | 40 | 2,112 | 1.89% |
| 9 | claude-3-opus-20240229 | 48 | 2,097 | 2.29% |
| 10 | meta-llama/llama-3.1-405b-instruct | 51 | 2,219 | 2.30% |
| 11 | claude-3-haiku-20240307 | 52 | 2,095 | 2.48% |
| 12 | google/gemini-flash-1.5 | 60 | 2,384 | 2.52% |
| 13 | meta-llama/llama-3.1-8b-instruct | 57 | 2,103 | 2.71% |
| 14 | gpt-4-0125-preview | 55 | 1,674 | 3.29% |
| 15 | gpt-4o-mini-2024-07-18 | 64 | 1,569 | 4.08% |
| 16 | gpt-4o-2024-08-06 | 63 | 1,542 | 4.09% |
| 17 | meta-llama/llama-3-70b-instruct | 69 | 1,593 | 4.33% |
| 18 | microsoft/phi-3.5-mini-128k-instruct | 64 | 1,470 | 4.35% |
| 19 | meta-llama/llama-3.1-70b-instruct | 76 | 1,656 | 4.59% |
| 20 | gpt-4-turbo-2024-04-09 | 71 | 1,455 | 4.88% |
| 21 | qwen/qwen-2-72b-instruct | 67 | 1,368 | 4.90% |
| 22 | google/gemma-2-27b-it | 72 | 1,462 | 4.92% |
| 23 | google/gemma-2-9b-it | 73 | 1,479 | 4.94% |
| 24 | qwen/qwen-2-7b-instruct | 61 | 1,107 | 5.51% |
| 25 | cohere/command-r-plus-08-2024 | 109 | 1,535 | 7.10% |
| 26 | microsoft/wizardlm-2-8x22b | 76 | 988 | 7.69% |
| 27 | mistralai/mistral-large | 98 | 1,117 | 8.77% |
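
The User Break Rate column is consistent with a simple ratio: safety violations divided by total requests for each model. Below is a minimal sketch that recomputes the rate from a few rows of the table above, assuming that definition (the `user_break_rate` helper and the data layout are illustrative, not the leaderboard's actual code):

```python
# Sketch: recompute User Break Rate, assuming rate = violations / requests.
# Sample rows copied from the leaderboard table above.
leaderboard = [
    # (model, safety_violation_count, total_requests)
    ("cygnet-bulwark", 0, 14_945),
    ("o1-preview", 7, 713),
    ("claude-3-5-sonnet-20240620", 45, 2_907),
]

def user_break_rate(violations: int, requests: int) -> float:
    """Fraction of requests that produced a safety violation."""
    return violations / requests

for model, violations, requests in leaderboard:
    rate = user_break_rate(violations, requests)
    print(f"{model}: {rate:.2%}")  # e.g. "o1-preview: 0.98%"
```

Running this reproduces the published percentages to two decimal places (0.00%, 0.98%, 1.55%), which supports the assumed definition.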