TY - GEN
T1 - Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
T2 - 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
AU - McCoy, R. Thomas
AU - Pavlick, Ellie
AU - Linzen, Tal
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1746891 and the 2018 Jelinek Summer Workshop on Speech and Language Technology (JSALT). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the JSALT workshop.
Funding Information:
We are grateful to Adam Poliak, Benjamin Van Durme, Samuel Bowman, the members of the JSALT General-Purpose Sentence Representation Learning team, and the members of the Johns Hopkins Computation and Psycholinguistics Lab for helpful comments, and to Brian Leonard for assistance with the Mechanical Turk experiment. Any errors remain our own.
Publisher Copyright:
© 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
AB - A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
UR - http://www.scopus.com/inward/record.url?scp=85077985542&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077985542&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85077985542
T3 - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
SP - 3428
EP - 3448
BT - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 28 July 2019 through 2 August 2019
ER -