Safety Devolution in AI Agents
Date
2025-05-20
Abstract
As retrieval-augmented AI agents become more embedded in society, their safety
properties and ethical behavior remain insufficiently understood. In particular, the
growing integration of LLMs and AI agents raises critical questions about how they
engage with and are influenced by their environments. This study investigates how
expanding retrieval access—from no external sources to Wikipedia-based retrieval
and open web search—affects model reliability, bias propagation, and harmful
content generation. Through extensive benchmarking of censored and uncensored
LLMs and AI Agents, our findings reveal a consistent degradation in refusal rates,
bias sensitivity, and harmfulness safeguards as models gain broader access to
external sources, culminating in a phenomenon we term safety devolution. Notably,
retrieval-augmented agents built on aligned LLMs often behave more unsafely
than uncensored models without retrieval. This effect persists even under strong
retrieval accuracy and prompt-based mitigation, suggesting that the mere presence
of retrieved content reshapes model behavior in structurally unsafe ways. These
findings underscore the need for robust mitigation strategies to ensure fairness
and reliability in retrieval-augmented and increasingly autonomous AI systems.
Description
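Through systematic experiments and analysis of the safety and alignment problems of retrieval-augmented AI agents, this paper shows that access to external information has a structural effect on agent behavior.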
Keywords
Retrieval-Augmented Generation, AI Agents, Safety Devolution, Refusal Rate
Citation
Cheng Yu, Benedikt Stroebl, Diyi Yang, Orestis Papakyriakopoulos. Safety Devolution in AI Agents. arXiv:2505.14215 [cs.CY], 2025. https://arxiv.org/abs/2505.14215