Abstract
Human feedback is commonly utilized to finetune AI assistants. But human feedback can encourage model responses that match user beliefs over truthful ones, a behavior known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning used human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
Original language | English (US) |
---|---|
State | Published - 2024 |
Event | 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria Duration: May 7 2024 → May 11 2024 |
Conference
Conference | 12th International Conference on Learning Representations, ICLR 2024 |
---|---|
Country/Territory | Austria |
City | Hybrid, Vienna |
Period | 5/7/24 → 5/11/24 |
ASJC Scopus subject areas
- Language and Linguistics
- Computer Science Applications
- Education
- Linguistics and Language