TY - GEN
T1 - BBQ: A hand-built bias benchmark for question answering
T2 - 60th Annual Meeting of the Association for Computational Linguistics, ACL 2022
AU - Parrish, Alicia
AU - Chen, Angelica
AU - Nangia, Nikita
AU - Padmakumar, Vishakh
AU - Phang, Jason
AU - Thompson, Jana
AU - Htut, Phu Mon
AU - Bowman, Samuel R.
N1 - Publisher Copyright:
© 2022 Association for Computational Linguistics.
PY - 2022
Y1 - 2022
N2 - It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.
AB - It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model's biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.
UR - http://www.scopus.com/inward/record.url?scp=85127234032&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85127234032&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85127234032
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 2086
EP - 2105
BT - ACL 2022 - 60th Annual Meeting of the Association for Computational Linguistics, Findings of ACL 2022
A2 - Muresan, Smaranda
A2 - Nakov, Preslav
A2 - Villavicencio, Aline
PB - Association for Computational Linguistics (ACL)
Y2 - 22 May 2022 through 27 May 2022
ER -