TY - GEN
T1 - Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling
AU - Wang, Alex
AU - Hula, Jan
AU - Xia, Patrick
AU - Pappagari, Raghavendra
AU - McCoy, R. Thomas
AU - Patel, Roma
AU - Kim, Najoung
AU - Tenney, Ian
AU - Huang, Yinghui
AU - Yu, Katherin
AU - Jin, Shuning
AU - Chen, Berlin
AU - Van Durme, Benjamin
AU - Grave, Edouard
AU - Pavlick, Ellie
AU - Bowman, Samuel R.
N1 - Funding Information:
Parts of this work were conducted as part of the Fifth Frederick Jelinek Memorial Summer Workshop (JSALT) at Johns Hopkins University, and benefited from support by the JSALT sponsors and a team-specific donation of computing resources from Google. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research. AW is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1342536. PX and BVD were supported by DARPA AIDA. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
N2 - Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo's pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.
AB - Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo's pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.
UR - http://www.scopus.com/inward/record.url?scp=85084066669&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85084066669&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85084066669
T3 - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
SP - 4465
EP - 4476
BT - ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
Y2 - 28 July 2019 through 2 August 2019
ER -