Repository logo
  • English
  • Català
  • Čeština
  • Deutsch
  • Español
  • Français
  • Gàidhlig
  • Italiano
  • Latviešu
  • Magyar
  • Nederlands
  • Polski
  • Português
  • Português do Brasil
  • Srpski (lat)
  • Suomi
  • Svenska
  • Türkçe
  • Tiếng Việt
  • Қазақ
  • বাংলা
  • हिंदी
  • Ελληνικά
  • Српски
  • Yкраї́нська
  • Log In
    New user? Click here to register. Have you forgotten your password?
Repository logo
  • Communities & Collections
  • All of DSpace
  • English
  • Català
  • Čeština
  • Deutsch
  • Español
  • Français
  • Gàidhlig
  • Italiano
  • Latviešu
  • Magyar
  • Nederlands
  • Polski
  • Português
  • Português do Brasil
  • Srpski (lat)
  • Suomi
  • Svenska
  • Türkçe
  • Tiếng Việt
  • Қазақ
  • বাংলা
  • हिंदी
  • Ελληνικά
  • Српски
  • Yкраї́нська
  • Log In
    New user? Click here to register. Have you forgotten your password?
  1. Home
  2. Browse by Author

Browsing by Author "Yang Diyi"

Now showing 1 - 1 of 1
Results Per Page
Sort Options
  • No Thumbnail Available
    Item
    Safety Devolution in AI Agents
    (2025-05-20) Yu Cheng; Stroebl Benedikt; Yang Diyi; Papakyriakopoulos Orestis
    As retrieval-augmented AI agents become more embedded in society, their safety properties and ethical behavior remain insufficiently understood. In particular, the growing integration of LLMs and AI agents raises critical questions about how they engage with and are influenced by their environments. This study investigates how expanding retrieval access—from no external sources to Wikipedia-based retrieval and open web search—affects model reliability, bias propagation, and harmful content generation. Through extensive benchmarking of censored and uncensored LLMs and AI Agents, our findings reveal a consistent degradation in refusal rates, bias sensitivity, and harmfulness safeguards as models gain broader access to external sources, culminating in a phenomenon we term safety devolution. Notably, retrieval-augmented agents built on aligned LLMs often behave more unsafely than uncensored models without retrieval. This effect persists even under strong retrieval accuracy and prompt-based mitigation, suggesting that the mere presence of retrieved content reshapes model behavior in structurally unsafe ways. These findings underscore the need for robust mitigation strategies to ensure fairness and reliability in retrieval-augmented and increasingly autonomous AI systems.

DSpace software copyright © 2002-2026 LYRASIS

  • Cookie settings
  • Privacy policy
  • End User Agreement
  • Send Feedback