Does Putting a Linguist in the Loop Improve NLU Data Collection?

Alicia Parrish, William Huang, Omar Agha, Soo Hwan Lee, Nikita Nangia, Alex Warstadt, Karmanya Aggarwal, Emily Allaway, Tal Linzen, Samuel R. Bowman

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Many crowdsourced NLP datasets contain systematic artifacts that are identified only after data collection is complete. Earlier identification of these issues should make it easier to create high-quality training and evaluation data. We attempt this by evaluating protocols in which expert linguists work 'in the loop' during data collection to identify and address these issues by adjusting task instructions and incentives. Using natural language inference as a test case, we compare three data collection protocols: (i) a baseline protocol with no linguist involvement, (ii) a linguist-in-the-loop intervention with iteratively-updated constraints on the writing task, and (iii) an extension that adds direct interaction between linguists and crowdworkers via a chatroom. We find that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to baseline, and adding a chatroom has no effect on the data. Linguist involvement does, however, lead to more challenging evaluation data and higher accuracy on some challenge sets, demonstrating the benefits of integrating expert analysis during data collection.

    Original language: English (US)
    Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
    Subtitle of host publication: EMNLP 2021
    Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-Tau Yih
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 4886-4901
    Number of pages: 16
    ISBN (Electronic): 9781955917100
    State: Published - 2021
    Event: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 - Punta Cana, Dominican Republic
    Duration: Nov 7 2021 to Nov 11 2021

    Publication series

    Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021

    Conference

    Conference: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
    Country/Territory: Dominican Republic
    City: Punta Cana
    Period: 11/7/21 to 11/11/21

    ASJC Scopus subject areas

    • Language and Linguistics
    • Linguistics and Language
