TY - GEN
T1 - Practical, efficient, and customizable active learning for named entity recognition in the digital humanities
AU - Erdmann, Alexander
AU - Wrisley, David Joseph
AU - Allen, Benjamin
AU - Brown, Christopher
AU - Cohen-Bodénès, Sophie
AU - Elsner, Micha
AU - Feng, Yukun
AU - Joseph, Brian
AU - Joyeux-Prunel, Béatrice
AU - de Marneffe, Marie Catherine
N1 - Funding Information:
We thank the Herodotos Project annotators for their contributions: Petra Ajaka, William Little, Andrew Kessler, Colleen Kron, and James Wolfe. Furthermore, we gratefully acknowledge support from the New York University–Paris Sciences Lettres Spatial Humanities Partnership, the Computational Approaches to Modeling Language lab at New York University Abu Dhabi, and a National Endowment for the Humanities grant, award HAA-256078-17. We also greatly appreciate the feedback of three anonymous reviewers.
Publisher Copyright:
© 2019 Association for Computational Linguistics
PY - 2019
Y1 - 2019
N2 - Scholars in inter-disciplinary fields like the Digital Humanities are increasingly interested in semantic annotation of specialized corpora. Yet, under-resourced languages, imperfect or noisily structured data, and user-specific classification tasks make it difficult to meet their needs using off-the-shelf models. Manual annotation of large corpora from scratch, meanwhile, can be prohibitively expensive. Thus, we propose an active learning solution for named entity recognition, attempting to maximize a custom model's improvement per additional unit of manual annotation. Our system robustly handles any domain or user-defined label set and requires no external resources, enabling quality named entity recognition for Humanities corpora where such resources are not available. Evaluating on typologically disparate languages and datasets, we reduce required annotation by 20-60% and greatly outperform a competitive active learning baseline.
UR - http://www.scopus.com/inward/record.url?scp=85073781464&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073781464&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85073781464
T3 - NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
SP - 2223
EP - 2234
BT - Long and Short Papers
PB - Association for Computational Linguistics (ACL)
T2 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019
Y2 - 2 June 2019 through 7 June 2019
ER -