Automated Transcription of Non-Latin Script Periodicals: A Case Study in the Ottoman Turkish Print Archive

Suphan Kirmizialtin, David Joseph Wrisley

Research output: Contribution to journal › Article › peer-review

Abstract

Our study discusses the automated transcription, using deep learning methods, of a digital newspaper collection printed in a historical language, Arabic-script Ottoman Turkish (OT), dating to the late nineteenth and early twentieth centuries. We situate OT text collections within a larger history of digitization of periodicals, underscoring special challenges faced by Arabic-script languages. Our paper approaches the question of automated transcription of non-Latin script languages, such as OT, from the broader perspective of debates surrounding OCR use for historical archives. In our study with OT, we have opted for training handwritten text recognition (HTR) models that generate transcriptions in the left-to-right Latin writing system familiar to contemporary readers of Turkish, and not, as some scholars may expect, in right-to-left Arabic-script text. As a one-to-one correspondence between the writing systems of OT and modern Turkish does not exist, we also discuss approaches to transcription and the creation of ground truth, and argue that the challenges faced in the training of HTR models also call into question straightforward notions of transcription, especially where divergent writing systems are involved. Finally, we reflect on the potential domain bias of HTR models in other historical languages exhibiting spatio-temporal variance, as well as the significance of working between writing systems for language communities that have also experienced language reform and script change.

Original language: English (US)
Journal: Digital Humanities Quarterly
Volume: 16
Issue number: 2
State: Published - 2022

ASJC Scopus subject areas

  • Communication
  • Arts and Humanities (all)
  • Computer Networks and Communications
  • Library and Information Sciences
