TY - JOUR
T1 - A cross-verified database of notable people, 3500BC-2018AD
AU - Laouenan, Morgane
AU - Bhargava, Palaash
AU - Eyméoud, Jean Benoît
AU - Gergaud, Olivier
AU - Plique, Guillaume
AU - Wasmer, Etienne
N1 - Funding Information:
In social science, working papers are shared prior to submission to collect comments from colleagues and peers. A slightly longer version of this text was initially released as a working paper CEPR DP 15852, March 2021. Financial support from LIEPP (ANR-11-LABX-0091, ANR-11-IDEX-0005–02) and NYUAD is gratefully acknowledged. In the paper, BC refers to Before Common Era (negative calendar years) and AD to Anno Domini (positive calendar years). Previous work by some of the authors of this manuscript used Freebase and the English edition of Wikipedia to build a dataset of 1,243,776 notable individuals that they match to a broader set of cities. We thank Paul Girard and the Sciences Po Medialab for helping us collect additional data and creating an effective visualization tool that greatly helped us improve the quality of our database, the Atelier de Cartographie at Sciences Po and in particular Thomas Ansart, Anouk Pettes and Patrice Mitrano, Sarah Asset, Simon Fredon and Nicolas Britton, Marie N’Dongue, Jordane Roussel, Marie Le Tallec, Maeva Hartmann, Cassiopeia Van den Bussche from Sciences Po, as well as Ke Shi, Ian Quinn Lutz, Anna Pustovoit, Amna Hassan, Alemayehu Mekonen Abebe, Anas Jawed, Sorin Panfile, Oleksandr Serhiyovych Petriv, Samridha Man Shrestha, Martin Smit, Karim Boudlal, Mouhamad Ba, Mate Hekfusz, Minda Belete from NYUAD, Chandan Thapa, Abhishek Nehra, Aditya Chhabra, Hema Baid, Apoorv Somanchi from DSE for expert research assistance. Special thanks to Julia Mink for expert verification of the database, to the students of the different sections of the CORE class “5000 years of human lives” at NYUAD and the instructors Mendgi Song and Dayin Wijaya. We also thank Sascha O. Becker, Karol Borowiecki, Nicolas Baumard, David de la Croix, Djellel Difallah, Michel Serafinelli, Guido Tabellini, Oded Galor, David Weil, Alexander Yarkin, Stelios Michalopoulos, Louis Putterman, Guillaume Blanc, Markus Poschke, Fabian Lange, Camille Hémet, Bryan Waterman as well as participants at various conferences and seminars for insightful discussions.
Publisher Copyright:
© 2022, The Author(s).
PY - 2022/12
Y1 - 2022/12
N2 - A new strand of literature aims at building the most comprehensive and accurate database of notable individuals. We collect a massive amount of data from various editions of Wikipedia and Wikidata. Using deduplication techniques over these partially overlapping sources, we cross-verify each piece of retrieved information. For some variables, Wikipedia adds 15% more information when missing in Wikidata. We find very few errors in the part of the database that contains the most documented individuals but nontrivial error rates in the bottom of the notability distribution, due to sparse information and classification errors or ambiguity. Our strategy results in a cross-verified database of 2.29 million individuals (an elite of 1/43,000 of human beings having ever lived), including a third who are not present in the English edition of Wikipedia. Data collection is driven by specific social science questions on gender, economic growth, urban and cultural development. We document an Anglo-Saxon bias present in the English edition of Wikipedia, and document when it matters and when not.
UR - http://www.scopus.com/inward/record.url?scp=85131705508&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131705508&partnerID=8YFLogxK
U2 - 10.1038/s41597-022-01369-4
DO - 10.1038/s41597-022-01369-4
M3 - Article
C2 - 35680895
AN - SCOPUS:85131705508
SN - 2052-4463
VL - 9
JO - Scientific Data
JF - Scientific Data
IS - 1
M1 - 290
ER -