Practical, efficient, and customizable active learning for named entity recognition in the digital humanities

Alexander Erdmann, David Joseph Wrisley, Benjamin Allen, Christopher Brown, Sophie Cohen-Bodénès, Micha Elsner, Yukun Feng, Brian Joseph, Béatrice Joyeux-Prunel, Marie Catherine de Marneffe

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Scholars in inter-disciplinary fields like the Digital Humanities are increasingly interested in semantic annotation of specialized corpora. Yet, under-resourced languages, imperfect or noisily structured data, and user-specific classification tasks make it difficult to meet their needs using off-the-shelf models. Manual annotation of large corpora from scratch, meanwhile, can be prohibitively expensive. Thus, we propose an active learning solution for named entity recognition, attempting to maximize a custom model's improvement per additional unit of manual annotation. Our system robustly handles any domain or user-defined label set and requires no external resources, enabling quality named entity recognition for Humanities corpora where such resources are not available. Evaluating on typologically disparate languages and datasets, we reduce required annotation by 20-60% and greatly outperform a competitive active learning baseline.

Original languageEnglish (US)
Title of host publicationLong and Short Papers
PublisherAssociation for Computational Linguistics (ACL)
Pages2223-2234
Number of pages12
ISBN (Electronic)9781950737130
StatePublished - 2019
Event2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019 - Minneapolis, United States
Duration: Jun 2 2019Jun 7 2019

Publication series

NameNAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Volume1

Conference

Conference2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019
Country/TerritoryUnited States
CityMinneapolis
Period6/2/196/7/19

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Practical, efficient, and customizable active learning for named entity recognition in the digital humanities'. Together they form a unique fingerprint.

Cite this