Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

Inderjeet Nair, Jie Ruan, Lu Wang · arXiv cs.AI · 1 min read

arXiv:2604.20995v1 Announce Type: new

Abstract: Alignment faking, where a model behaves as if aligned with developer policy when monitored but reverts to its own preferences when unobserved, is a concerning yet poorly understood phenomenon, in part because current diagnostic tools remain limited. Prior diagnostics rely on highly toxic and clearly harmful scenarios, causing most models to refuse immediately […]