Human vs. Muppet: A conservative estimate of human performance on the GLUE benchmark

Nikita Nangia, Samuel R. Bowman

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: Our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.

    Original language: English (US)
    Title of host publication: ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 4566-4575
    Number of pages: 10
    ISBN (Electronic): 9781950737482
    State: Published - 2020
    Event: 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019 - Florence, Italy
    Duration: Jul 28 2019 to Aug 2 2019

    Publication series

    Name: ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference

    Conference

    Conference: 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
    Country/Territory: Italy
    City: Florence
    Period: 7/28/19 to 8/2/19

    ASJC Scopus subject areas

    • Language and Linguistics
    • Computer Science(all)
    • Linguistics and Language
