Precise task formalization matters in Winograd Schema Evaluations

Haokun Liu, William Huang, Dhara A. Mungra, Samuel R. Bowman

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Performance on the Winograd Schema Challenge (WSC), a respected English commonsense reasoning benchmark, recently rocketed from chance accuracy to 89% on the SuperGLUE leaderboard, with relatively little corroborating evidence of a correspondingly large improvement in reasoning ability. We hypothesize that much of this improvement comes from recent changes in task formalization (the combination of input specification, loss function, and reuse of pretrained parameters) by users of the dataset, rather than improvements in the pretrained model's reasoning ability. We perform an ablation on two Winograd Schema datasets that interpolates between the formalizations used before and after this surge, and find (i) framing the task as multiple choice improves performance by 2-6 points and (ii) several additional techniques, including the reuse of a pretrained language modeling head, can mitigate the model's extreme sensitivity to hyperparameters. We urge future benchmark creators to impose additional structure to minimize the impact of formalization decisions on reported results.
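
The difference between a pointwise and a multiple-choice formalization can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of plain per-candidate scores, and the example values are all assumptions made for illustration. It only shows how the same candidate scores lead to different loss functions depending on how the task is framed.

```python
import math


def pointwise_loss(scores, gold_index):
    """Binary cross-entropy: each candidate referent is judged on its own.

    `scores` are hypothetical real-valued model outputs, one per candidate.
    """
    total = 0.0
    for i, s in enumerate(scores):
        p = 1.0 / (1.0 + math.exp(-s))  # sigmoid turns a score into a probability
        label = 1.0 if i == gold_index else 0.0
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(scores)


def multiple_choice_loss(scores, gold_index):
    """Softmax cross-entropy: candidates compete within a single example."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]


def predict_multiple_choice(scores):
    """Pick the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical scores for two candidate referents of one pronoun.
example_scores = [1.2, 0.4]
print(pointwise_loss(example_scores, gold_index=0))
print(multiple_choice_loss(example_scores, gold_index=0))
print(predict_multiple_choice(example_scores))
```

In the multiple-choice framing only the relative order of the candidate scores matters for the prediction, whereas the pointwise framing also requires each score to be calibrated against a fixed threshold.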

    Original language: English (US)
    Title of host publication: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 8275-8280
    Number of pages: 6
    ISBN (Electronic): 9781952148606
    State: Published - 2020
    Event: 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020 - Virtual, Online
    Duration: Nov 16 2020 to Nov 20 2020

    Publication series

    Name: EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

    Conference

    Conference: 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020
    City: Virtual, Online
    Period: 11/16/20 to 11/20/20

    ASJC Scopus subject areas

    • Information Systems
    • Computer Science Applications
    • Computational Theory and Mathematics
