TY - GEN
T1 - How Do We Answer Complex Questions: Discourse Structure of Long-form Answers
T2 - 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
AU - Xu, Fangyuan
AU - Li, Junyi Jessy
AU - Choi, Eunsol
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
AB - Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets, ELI5 (Fan et al., 2019), WebGPT (Nakano et al., 2021) and Natural Questions (Kwiatkowski et al., 2019). Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers, and annotate 3.9k sentences in 640 answer paragraphs. Different answer collection methods manifest in different discourse structures. We further analyze model-generated answers - finding that annotators agree less with each other when annotating model-generated answers compared to annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.
UR - http://www.scopus.com/inward/record.url?scp=85131033020&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131033020&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85131033020
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 3556
EP - 3572
BT - ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
A2 - Muresan, Smaranda
A2 - Nakov, Preslav
A2 - Villavicencio, Aline
PB - Association for Computational Linguistics (ACL)
Y2 - 22 May 2022 through 27 May 2022
ER -