TY - GEN
T1 - Rissanen Data Analysis
T2 - 38th International Conference on Machine Learning, ICML 2021
AU - Perez, Ethan
AU - Kiela, Douwe
AU - Cho, Kyunghyun
N1 - Publisher Copyright: Copyright © 2021 by the author(s)
PY - 2021
Y1 - 2021
N2 - We introduce a method to determine if a certain capability helps to achieve an accurate model of given data. We view labels as being generated from the inputs by a program composed of subroutines with different capabilities, and we posit that a subroutine is useful if and only if the minimal program that invokes it is shorter than the one that does not. Since minimum program length is uncomputable, we instead estimate the labels' minimum description length (MDL) as a proxy, giving us a theoretically-grounded method for analyzing dataset characteristics. We call the method Rissanen Data Analysis (RDA) after the father of MDL, and we showcase its applicability on a wide variety of settings in NLP, ranging from evaluating the utility of generating subquestions before answering a question, to analyzing the value of rationales and explanations, to investigating the importance of different parts of speech, and uncovering dataset gender bias.
UR - http://www.scopus.com/inward/record.url?scp=85161305338&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85161305338&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85161305338
T3 - Proceedings of Machine Learning Research
SP - 8500
EP - 8513
BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021
PB - ML Research Press
Y2 - 18 July 2021 through 24 July 2021
ER -