Abstract
AI safety datasets rely heavily on triggering cues that do not accurately represent real-world adversarial behavior, making models appear safe when they are actually vulnerable once these cues are removed.
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks along three key properties: ulterior intent, careful crafting, and out-of-distribution phrasing. We find that these datasets overrely on "triggering cues": words or phrases with overtly negative or sensitive connotations intended to explicitly trip safety mechanisms, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated by existing datasets and how real-world adversaries behave.
Community
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how faithfully these datasets represent real-world adversarial behavior and find that they fall short due to their overreliance on unrealistic, overt triggering language. In practice, we show that when this triggering language is abstracted away, none of the models previously thought to be "reasonably safe" remain safe, including Gemini 3 Pro and Claude Sonnet 3.7, which are known for strong safety. Notably, all observations are based on publicly available safety datasets, yet the effect generalizes broadly across all evaluated models. This suggests that both internal safety evaluations and safety alignment techniques rely on the same contrived language patterns present in public datasets. Overall, our paper exposes a critical gap between how safety datasets evaluate model safety and how real-world adversarial behavior manifests.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay (2026)
- Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing (2026)
- CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns (2026)
- ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack (2026)
- Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models (2026)
- ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification (2026)
- The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning (2026)