TY - CONF
T1 - GLUE: A multi-task benchmark and analysis platform for natural language understanding
T2 - 7th International Conference on Learning Representations, ICLR 2019
AU - Wang, Alex
AU - Singh, Amanpreet
AU - Michael, Julian
AU - Hill, Felix
AU - Levy, Omer
AU - Bowman, Samuel R.
N1 - Funding Information:
We thank Ellie Pavlick, Tal Linzen, Kyunghyun Cho, and Nikita Nangia for their comments on this work at its early stages, and we thank Ernie Davis, Alex Warstadt, and Quora's Nikhil Dandekar and Kornel Csernai for providing access to private evaluation data. This project has benefited from financial support to SB by Google, Tencent Holdings, and Samsung Research, and to AW from AdeptMind and an NSF Graduate Research Fellowship.
Funding Information:
We are grateful for support under National Science Foundation grant CCF-1563098, and from the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.
Publisher Copyright:
© 7th International Conference on Learning Representations, ICLR 2019. All Rights Reserved.
PY - 2019
Y1 - 2019
AB - For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.
UR - http://www.scopus.com/inward/record.url?scp=85083952595&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083952595&partnerID=8YFLogxK
M3 - Paper
AN - SCOPUS:85083952595
Y2 - 6 May 2019 through 9 May 2019
ER -