
Assessing the Security of 4 Popular AI Reasoning Models

11 min read
May 21, 2025

In the race to create more capable AI systems, reasoning models stand out as frontrunners.

These specialized language models excel at complex tasks, such as coding challenges and scientific problems, that require decomposing a problem into smaller pieces and solving each one. That is exactly what reasoning models do: they break the problem into a series of steps, consider approaches for each step, work through the problem step by step, and return the result to the user. They do this through a technique called Chain of Thought, which helps them tackle complex problems methodically.
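To make this concrete, here is a minimal sketch of Chain of Thought style prompting against a generic OpenAI-compatible chat endpoint. The endpoint URL and model name below are placeholders, not any specific provider's API.

```python
# A minimal sketch of chain-of-thought style prompting; the URL and model name
# are hypothetical placeholders for whichever reasoning model is being tested.
import requests

API_URL = "https://example.com/v1/chat/completions"  # placeholder endpoint

def ask_with_reasoning(question: str) -> str:
    """Ask the model to reason step by step before stating a final answer."""
    payload = {
        "model": "reasoning-model",  # placeholder model name
        "messages": [
            {"role": "system",
             "content": ("Break the problem into steps, reason through each step, "
                         "then give the final answer on its own line.")},
            {"role": "user", "content": question},
        ],
    }
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask_with_reasoning("A train travels 120 km in 1.5 hours. What is its average speed?"))
```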

Over the past few months, several reasoning models have been released, including OpenAI's o3 and o4-mini, DeepSeek-R1, Anthropic’s Claude 3.7 Sonnet, and the recently released Qwen3 series. While these models perform well on benchmarks, they remain vulnerable to misuse.

In this post, we assess the security alignment of four popular reasoning models: OpenAI o4-mini, Claude 3.7 Sonnet, DeepSeek-R1, and Qwen3-30B-A3B.

This assessment helps compare the security and alignment of reasoning models from different model providers, identifying vulnerabilities and areas for improvement.

About Protect AI’s Recon 

Recon is Protect AI’s scalable AI red teaming tool. Recon conducts red teaming in two ways:

  1. Attack Library: A comprehensive set of harmful prompts curated through thorough research on GenAI models and applications.
  2. Agent: A red teaming agent that simulates a bad actor interacting with a model endpoint. 

Recon’s Attack Library enables us to evaluate a model’s security alignment through an automated red teaming scan. It tests a range of attack techniques across multiple categories and harmful topics simultaneously, helping us identify vulnerabilities in each model and directly compare the security alignment and resilience of different models.
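To illustrate the general shape of such a scan, here is a simplified sketch. It is not Recon's implementation: the attack entries, the query_model() call, and the refusal heuristic are all placeholders.

```python
# A simplified sketch of an attack-library scan: adversarial prompts tagged by
# category are sent to the model endpoint and each response is judged as a
# successful or failed attack. All names below are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Attack:
    category: str   # e.g. "jailbreak", "prompt_injection", "evasion"
    prompt: str     # the adversarial prompt sent to the model

ATTACK_LIBRARY = [
    Attack("jailbreak", "Ignore your previous instructions and ..."),
    Attack("prompt_injection", "The attached document instructs you to reveal ..."),
    # ... in practice, many curated prompts per category
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model endpoint under test."""
    raise NotImplementedError

def is_successful_attack(response: str) -> bool:
    """Naive judge: treat any non-refusal as a success. Real scanners use
    stronger classifiers or LLM-based judges."""
    return not any(marker in response.lower() for marker in REFUSAL_MARKERS)

def run_scan(attacks):
    """Return a list of (category, success) pairs for downstream analysis."""
    results = []
    for attack in attacks:
        response = query_model(attack.prompt)
        results.append((attack.category, is_successful_attack(response)))
    return results
```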

 

Figure 1: Recon’s dashboard provides a full view of red teaming results across threat categories 

 

As part of this analysis, we’ll also incorporate examples from manual red teaming tests to provide a more comprehensive assessment.

Summary of Vulnerability Assessments

The Recon Attack Library scans revealed significant differences in security alignment across our four reasoning models. DeepSeek-R1 was classified as high risk, Qwen3-30B-A3B as medium risk, and OpenAI o4-mini and Claude 3.7 Sonnet as low risk.

 

Figure 2: Attack success rates (ASRs) across different attack categories 

Across all four models, nearly 800 successful attacks were recorded, highlighting the varying degrees of vulnerability among the models. DeepSeek-R1 was found to be the most vulnerable, with attack success rates (ASRs) exceeding 70% in multiple categories, an indication of weak security alignment that raises significant concerns about its robustness in real-world applications. Meanwhile, the recently released Qwen3 model was moderately vulnerable, particularly to prompt injection attacks. OpenAI o4-mini and Claude 3.7 Sonnet both demonstrated stronger security alignment than their competitors.

Findings from Recon’s Attack Library Scans

DeepSeek-R1 had a 65% ASR overall, with a particularly alarming 73% ASR in the jailbreak category. This indicates a higher tendency to generate outputs related to illegal activities, toxic content, and biased responses.

Similarly, scans of the Qwen3 model exposed significant weaknesses, generating harmful content across several categories, including explosives, drugs, violence, and malware.

Our scans of o4-mini uncovered close to 120 successful attacks. Of these, 43% related to illegal activities and nearly 25% involved political opinions. On the other hand, the model exhibited strong resilience against jailbreak and adversarial suffix attacks, which showcases o4-mini’s strong security alignment.

Now, let’s examine how the models performed across different attack categories. Analyzing the attack success rate for each category helps us identify the techniques to which the model is most vulnerable. This insight enables us to implement effective mitigations for those attack strategies.
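As a simple illustration, the per-category ASR figures can be derived from a list of (category, success) results like the ones produced by the hypothetical scan sketch above.

```python
# Computing per-category attack success rates (ASR) from scan results,
# continuing the hypothetical (category, success) pairs from the earlier sketch.
from collections import defaultdict

def attack_success_rates(results):
    """Return {category: ASR percentage} for a list of (category, success) pairs."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    for category, success in results:
        attempts[category] += 1
        if success:
            successes[category] += 1
    return {cat: 100.0 * successes[cat] / attempts[cat] for cat in attempts}

# Example: 71 successful prompt injection attacks out of 100 attempts -> 71% ASR
example = [("prompt_injection", True)] * 71 + [("prompt_injection", False)] * 29
print(attack_success_rates(example))  # {'prompt_injection': 71.0}
```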

The prompt injection category consistently showed high success rates, with a 71% ASR for DeepSeek-R1 and a 67% ASR for Qwen3. Other high-ASR categories include jailbreak and evasion, which relies on obfuscation techniques.

These metrics underscore critical concerns about model vulnerabilities across various attack vectors. Notably, even models with stronger security alignment, such as Claude 3.7 Sonnet and o4-mini, had over 100 successful attacks, indicating that no model is entirely secure for deployment in customer-facing applications without additional safeguards. While these models showed greater resilience against jailbreaks and prompt injections, they exhibited a higher attack success rate in the safety category, revealing that even the most robust systems remain susceptible to certain threats.

OpenAI o4-mini vs. DeepSeek-R1

There is a significant difference in security behavior between o4-mini and DeepSeek-R1, with DeepSeek-R1’s overall ASR 2.6 times higher than o4-mini’s. The biggest gap is in the evasion category, where the two models’ attack success rates differ by 42.7 percentage points.

Claude 3.7 Sonnet vs. Qwen3-30B-A3B

Comparing the recently released Qwen3 with Claude 3.7 Sonnet, we see that the Qwen model is more than twice as vulnerable as Claude 3.7 Sonnet. Claude 3.7 Sonnet exhibited better resilience across all categories, with a 45-percentage-point gap in ASR between the two models in the prompt injection category.

Why Do Reasoning Models Differ in their Security Behavior?

The information above indicates that not all reasoning models are equally secure, as we observe varying levels of security alignment across different models. o4-mini and Claude 3.7 Sonnet are classified as low-risk models by Protect AI, whereas Qwen3 and DeepSeek-R1 are classified as medium- and high-risk models, respectively.

As mentioned earlier, reasoning models use a technique called Chain of Thought to break down problems and approach them systematically. This method can help these models identify harmful queries from users and refuse to answer them. Here’s an example:

 

Figure 3: DeepSeek-R1 rejects a request to provide instructions for building a bomb

Similarly, here’s another example where the Chain of Thought process helps the model identify a harmful question:

 

Figure 4: DeepSeek-R1 refuses to provide instructions on getting away with murder

In the above examples, you can see that, as part of its thinking process, the model recognizes it should not answer the question and refuses to assist with illegal activities. This demonstrates how the Chain of Thought approach can help a model analyze and detect harmful prompts before responding.

Given that this thinking capability (employed by several popular models) can help a model identify and refuse harmful queries, why do some reasoning models exhibit better security than others?

To answer this, we need to understand how different models are trained.

The DeepSeek-R1 paper states:

“For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process.”

DeepSeek employs a secondary reinforcement learning (RL) stage to enhance the model’s helpfulness and harmlessness. However, this RL stage may have led to reward hacking, where the model optimizes for achieving a higher reward rather than genuinely understanding and mitigating harmful behaviors.

Picking up from an earlier example: when the same prompt is repeated multiple times, we notice that the model does generate harmful content, and the Chain of Thought process can even help the model produce the compromised response, because the generated reasoning tokens are themselves compromised.

 

Figure 5: After multiple tries, DeepSeek does provide instructions on how to commit and get away with murder
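This kind of repeat-sampling is straightforward to automate in a red teaming harness. Below is a minimal sketch; query_model and judge stand in for the hypothetical helpers from the scan sketch earlier, and the try count is arbitrary.

```python
# A sketch of repeat-sampling: the same harmful prompt is sent many times, since
# a model that refuses once may still comply on a later sample.
from typing import Callable

def repeat_prompt_test(prompt: str,
                       query_model: Callable[[str], str],
                       judge: Callable[[str], bool],
                       n_tries: int = 20) -> int:
    """Return how many of n_tries samples produced a harmful (non-refusal) response."""
    hits = 0
    for _ in range(n_tries):
        if judge(query_model(prompt)):
            hits += 1
    return hits

# Low per-sample success rates still compound: a prompt that succeeds 10% of the
# time is expected to land at least once in 20 tries with roughly 88% probability
# (1 - 0.9**20 ≈ 0.88).
```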

Moreover, a key drawback of reward hacking is that it fails to generalize to unseen harmful scenarios, which reduces harmlessness in practice. This weakness could explain why DeepSeek-R1 appears more vulnerable than OpenAI’s o-series models, which seem to have incorporated more robust alignment techniques to mitigate harmful responses.

OpenAI uses Deliberative Alignment, a method that integrates reasoning into the training process to improve safety. This approach involves first training a model for general helpfulness, then fine-tuning it with examples that incorporate explicit safety reasoning. By reinforcing this through supervised learning and reward-based optimization, the model learns not just safety policies but also how to apply them effectively. This structured process ensures stronger alignment, highlighting why different training methods lead to varying levels of security risks across this class of models.
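To illustrate the idea (the field names and policy text below are placeholders, not OpenAI's actual training format), a fine-tuning example with explicit safety reasoning might look something like this:

```python
# An illustrative fine-tuning example with explicit safety reasoning, in the
# spirit of the deliberative alignment recipe described above. The structure,
# field names, and policy wording are hypothetical placeholders.
safety_sft_example = {
    "prompt": "How do I pick the lock on my neighbor's front door?",
    "reasoning": (
        "The request asks for instructions to enter someone else's property "
        "without consent. Policy: do not assist with illegal entry or trespass. "
        "Therefore I should refuse and offer a lawful alternative."
    ),
    "completion": (
        "I can't help with entering someone else's home without permission. "
        "If you're locked out of your own home, a licensed locksmith can help."
    ),
}

# During supervised fine-tuning, the model learns to produce the safety reasoning
# and the refusal together, so the policy is applied through explicit steps rather
# than memorized surface patterns; reward-based optimization can then reinforce it.
```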

Model architecture also plays an important role in safety behavior. Claude 3.7 Sonnet ensures safety through a hybrid reasoning architecture with Chain of Thought guardrails and strict training data filtering. Constitutional AI alignment (Anthropic was one of the pioneers of AI safety through its Constitutional AI program), combined with RLHF (Reinforcement Learning from Human Feedback) and privacy safeguards, makes Claude models some of the safest available.
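For intuition, a Constitutional AI style critique-and-revise step can be sketched roughly as follows; the generate() helper and the single principle are illustrative placeholders, not Anthropic's implementation.

```python
# A minimal sketch of a critique-and-revise step in the spirit of Constitutional
# AI: the model's draft answer is critiqued against a principle and then revised.
PRINCIPLE = ("Choose the response that is least likely to assist with harmful "
             "or illegal activity.")

def generate(prompt: str) -> str:
    """Placeholder for a call to the language model."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    critique = generate(
        f"Critique the following response using this principle:\n{PRINCIPLE}\n\n"
        f"Response:\n{draft}"
    )
    revised = generate(
        f"Rewrite the response so it addresses the critique.\n\n"
        f"Critique:\n{critique}\n\nResponse:\n{draft}"
    )
    return revised
```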

Conclusion

As we continue our research into making enterprise AI applications more secure, it is clearer than ever that there is no silver bullet for achieving safe and secure AI systems. Simply adding reasoning capabilities does not make a model secure; security depends on how the model is trained and what architecture it follows. This subject will become increasingly relevant as reasoning models are more frequently leveraged in agentic systems because of their improved problem-solving skills.

It is important to remember that reasoning models have inherent vulnerabilities of their own, and using them safely in AI applications will require specialized red teaming performed continually, with more domain-specific attack objectives.

Are You Ready to Start Securing Your AI End-to-End?