Fine-Tuning NER with spaCy for Transliterated Entities Found in Digital Collections From the Multilingual Persian Gulf

Almazhan Kapan, Suphan Kirmizialtin, Rhythm Kukreja, David Joseph Wrisley

Research output: Contribution to journal › Conference article › peer-review

Abstract

Text recognition technologies increase access to global archives and make possible their computational study using techniques such as Named Entity Recognition (NER). In this paper, we present an approach to extracting a variety of named entities (NE) in unstructured historical datasets from open digital collections dealing with a space of informal British empire: the Persian Gulf region. The sources are largely concerned with people, places and tribes as well as economic and diplomatic transactions in the region. Since models in state-of-the-art NER systems function with limited tag sets and are generally trained on English-language media, they struggle to capture entities of interest to the historian and do not perform well with entities transliterated from other languages. We build custom spaCy-based NER models trained on domain-specific annotated datasets. We also extend the set of named entity labels provided by spaCy and focus on detecting entities of non-Western origin, particularly from Arabic and Farsi. We test and compare performance of the blank, pre-trained and merged spaCy-based models, suggesting further improvements. Our study makes an intervention into thinking beyond Western notions of the entity in digital historical research by creating more inclusive models using non-metropolitan corpora in English.
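The comparison of blank, pre-trained and merged models described above implies some entity-level scoring. The abstract does not say which metrics were used, so the following is only a minimal sketch, assuming exact-match precision, recall and F1 over character-offset spans (the span convention spaCy uses). The sample sentence, the `TRIBE` label name, the two sets of predictions and the helper names `span` and `score_entities` are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Made-up sample sentence; it does not come from the paper's corpus.
TEXT = "Shaikh Isa bin Ali met the Bani Yas at Abu Dhabi."


def span(phrase, label, text=TEXT):
    """Return a (start_char, end_char, label) tuple for the first match of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def score_entities(gold, predicted):
    """Exact-match precision, recall and F1, per label and overall ("ALL")."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = Counter(label for _, _, label in gold_set & pred_set)
    n_gold = Counter(label for _, _, label in gold_set)
    n_pred = Counter(label for _, _, label in pred_set)

    def prf(tp, guessed, expected):
        p = tp / guessed if guessed else 0.0
        r = tp / expected if expected else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return {"p": p, "r": r, "f": f}

    labels = sorted(set(n_gold) | set(n_pred))
    scores = {lab: prf(correct[lab], n_pred[lab], n_gold[lab]) for lab in labels}
    scores["ALL"] = prf(sum(correct.values()), len(pred_set), len(gold_set))
    return scores


# Gold annotation with an extended label set (TRIBE is a hypothetical label).
gold = [
    span("Shaikh Isa bin Ali", "PERSON"),
    span("Bani Yas", "TRIBE"),
    span("Abu Dhabi", "GPE"),
]

# Hypothetical output of a generic model: truncated person name, no tribe.
pred_generic = [span("Isa bin Ali", "PERSON"), span("Abu Dhabi", "GPE")]

# Hypothetical output of a domain-specific model that recovers all three spans.
pred_custom = list(gold)

scores_generic = score_entities(gold, pred_generic)
scores_custom = score_entities(gold, pred_custom)

for name, scores in [("generic", scores_generic), ("custom", scores_custom)]:
    overall = scores["ALL"]
    print(f"{name}: P={overall['p']:.2f} R={overall['r']:.2f} F1={overall['f']:.2f}")
```

Exact matching is deliberately strict: a transliterated name recognised with the wrong boundary counts as both a false positive and a false negative. A partial-overlap scheme would be more lenient and would be a different design choice.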

Original language: English (US)
Pages (from-to): 288-296
Number of pages: 9
Journal: CEUR Workshop Proceedings
Volume: 3232
State: Published - 2022
Event: 6th Digital Humanities in the Nordic and Baltic Countries Conference, DHNB 2022 - Uppsala, Sweden
Duration: Mar 15 2022 - Mar 18 2022

Keywords

  • Colonial Archives
  • Gulf Studies
  • Named Entity Recognition
  • Persian Gulf
  • Transliterated Names
  • spaCy

ASJC Scopus subject areas

  • General Computer Science
