TY - CONF
T1 - What ingredients make for an effective crowdsourcing protocol for difficult NLU data collection tasks?
AU - Nangia, Nikita
AU - Sugawara, Saku
AU - Trivedi, Harsh
AU - Warstadt, Alex
AU - Vania, Clara
AU - Bowman, Samuel R.
N1 - Publisher Copyright:
© 2021 Association for Computational Linguistics
PY - 2021
Y1 - 2021
AB - Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments, is an effective means of collecting challenging data. In contrast, using crowdsourced judgments, instead of expert judgments, to qualify workers and send feedback does not prove to be effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human-model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.
UR - http://www.scopus.com/inward/record.url?scp=85115877390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115877390&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85115877390
T3 - ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
SP - 1221
EP - 1235
BT - ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
T2 - Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL-IJCNLP 2021
Y2 - 1 August 2021 through 6 August 2021
ER -