What Will it Take to Fix Benchmarking in Natural Language Understanding?

Samuel R. Bowman, George E. Dahl

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue that most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.

    Original language: English (US)
    Title of host publication: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics
    Subtitle of host publication: Human Language Technologies, Proceedings of the Conference
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 4843-4855
    Number of pages: 13
    ISBN (Electronic): 9781954085466
    State: Published - 2021
    Event: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021 - Virtual, Online
    Duration: Jun 6 2021 - Jun 11 2021

    Publication series

    Name: NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference

    Conference

    Conference: 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021
    City: Virtual, Online
    Period: 6/6/21 - 6/11/21

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Hardware and Architecture
    • Information Systems
    • Software
