TY - GEN
T1 - AI as a Sport
T2 - 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024
AU - Orr, Will
AU - Kang, Edward B.
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/6/3
Y1 - 2024/6/3
N2 - Artificial Intelligence (AI) systems are evaluated using competitive methods that rely on benchmark datasets to determine performance. These benchmark datasets, however, are often constructed through arbitrary processes that fall short in encapsulating the depth and breadth of the tasks they are intended to measure. In this paper, we interrogate the naturalization of benchmark datasets as veracious metrics by examining the historical development of benchmarking as an epistemic practice in AI research. Specifically, we highlight three key case studies that were crucial in establishing the existing reliance on benchmark datasets for evaluating the capabilities of AI systems: (1) the sharing of Highleyman's OCR dataset in the 1960s, which solidified a community of knowledge production around a shared benchmark dataset; (2) the Common Task Framework (CTF) of the 1980s, a state-led project to standardize benchmark datasets as legitimate indicators of technical progress; and (3) the Netflix Prize, which further solidified benchmarking as a competitive goal within the ML research community. This genealogy highlights how contemporary dynamics and limitations of benchmarking developed from a longer history of collaboration, standardization, and competition. We end with reflections on how this history informs our understanding of benchmarking in the current era of generative artificial intelligence.
AB - Artificial Intelligence (AI) systems are evaluated using competitive methods that rely on benchmark datasets to determine performance. These benchmark datasets, however, are often constructed through arbitrary processes that fall short in encapsulating the depth and breadth of the tasks they are intended to measure. In this paper, we interrogate the naturalization of benchmark datasets as veracious metrics by examining the historical development of benchmarking as an epistemic practice in AI research. Specifically, we highlight three key case studies that were crucial in establishing the existing reliance on benchmark datasets for evaluating the capabilities of AI systems: (1) the sharing of Highleyman's OCR dataset in the 1960s, which solidified a community of knowledge production around a shared benchmark dataset; (2) the Common Task Framework (CTF) of the 1980s, a state-led project to standardize benchmark datasets as legitimate indicators of technical progress; and (3) the Netflix Prize, which further solidified benchmarking as a competitive goal within the ML research community. This genealogy highlights how contemporary dynamics and limitations of benchmarking developed from a longer history of collaboration, standardization, and competition. We end with reflections on how this history informs our understanding of benchmarking in the current era of generative artificial intelligence.
KW - Benchmark datasets
KW - Benchmarking for generative AI
KW - History of benchmarking
KW - Machine learning benchmarks
KW - Machine learning competitions
UR - http://www.scopus.com/inward/record.url?scp=85196635823&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196635823&partnerID=8YFLogxK
U2 - 10.1145/3630106.3659012
DO - 10.1145/3630106.3659012
M3 - Conference contribution
AN - SCOPUS:85196635823
T3 - 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024
SP - 1875
EP - 1884
BT - 2024 ACM Conference on Fairness, Accountability, and Transparency, FAccT 2024
PB - Association for Computing Machinery, Inc
Y2 - 3 June 2024 through 6 June 2024
ER -