Abstract
AI safety datasets rely heavily on triggering cues that do not accurately represent real-world adversarial behavior, making models appear safe when they are actually vulnerable once these cues are removed.
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world adversarial attacks along three key properties: ulterior intent, careful crafting, and out-of-distribution phrasing. We find that these datasets overrely on "triggering cues": words or phrases with overtly negative or sensitive connotations intended to explicitly trip safety mechanisms, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from adversarial attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world adversarial behavior due to their overreliance on triggering cues. Once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated by existing datasets and how real-world adversaries behave.
Community
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how faithfully these datasets represent real-world adversarial behavior and find that they fall short due to their overreliance on unrealistic, overt triggering language. In practice, we show that when this triggering language is abstracted away, none of the models previously thought to be "reasonably safe" remain safe, including Gemini 3 Pro and Claude Sonnet 3.7, which are known for strong safety. Notably, all observations are based on publicly available safety datasets, yet the effect generalizes broadly across all evaluated models. This suggests that both internal safety evaluations and safety alignment techniques rely on the same contrived language patterns present in public datasets. Overall, our paper exposes a critical gap between how safety datasets evaluate model safety and how real-world adversarial behavior manifests.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay (2026)
- Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing (2026)
- CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns (2026)
- ICON: Intent-Context Coupling for Efficient Multi-Turn Jailbreak Attack (2026)
- Beyond Suffixes: Token Position in GCG Adversarial Attacks on Large Language Models (2026)
- ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification (2026)
- The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning (2026)