TY - GEN
T1 - Can NLI Models Verify QA Systems' Predictions?
AU - Chen, Jifan
AU - Choi, Eunsol
AU - Durrett, Greg
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
AB - To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just "good enough" in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pretrained models and recent prior datasets to construct powerful question conversion and decontextualization modules, which can reformulate QA instances as premise-hypothesis pairs with very high reliability. Then, by combining standard NLI datasets with NLI examples automatically derived from QA training data, we can train NLI models to evaluate QA systems' proposed answers. We show that our approach improves the confidence estimation of a QA model across different domains. Careful manual analysis over the predictions of our NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, i.e., when the answer sentence does not address all aspects of the question.
UR - http://www.scopus.com/inward/record.url?scp=85121642082&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85121642082&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85121642082
T3 - Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
SP - 3841
EP - 3854
BT - Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
A2 - Moens, Marie-Francine
A2 - Huang, Xuanjing
A2 - Specia, Lucia
A2 - Yih, Scott Wen-Tau
PB - Association for Computational Linguistics (ACL)
T2 - 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
Y2 - 7 November 2021 through 11 November 2021
ER -