TOWARDS UNDERSTANDING SYCOPHANCY IN LANGUAGE MODELS

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

    Research output: Contribution to conference › Paper › peer-review

    Abstract

    Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behavior known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand whether human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

    Original language: English (US)
    State: Published - 2024
    Event: 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria
    Duration: May 7, 2024 – May 11, 2024

    Conference

    Conference: 12th International Conference on Learning Representations, ICLR 2024
    Country/Territory: Austria
    City: Hybrid, Vienna
    Period: 5/7/24 – 5/11/24

    ASJC Scopus subject areas

    • Language and Linguistics
    • Computer Science Applications
    • Education
    • Linguistics and Language
