HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Haneul Yoo, Jiho Jin, Juhee Son, Jin Yeong Bak, Kyunghyun Cho, Alice Oh

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed up the process. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models continued pretraining on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. 1 We compare the models with several baselines on all tasks and show there are significant improvements gained by training on the two corpora. Additionally, we run zeroshot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by the historians, and not at all by the NLP community.

Original languageEnglish (US)
Title of host publicationFindings of the Association for Computational Linguistics
Subtitle of host publicationNAACL 2022 - Findings
PublisherAssociation for Computational Linguistics (ACL)
Pages1832-1844
Number of pages13
ISBN (Electronic)9781955917766
DOIs
StatePublished - 2022
Event2022 Findings of the Association for Computational Linguistics: NAACL 2022 - Seattle, United States
Duration: Jul 10 2022Jul 15 2022

Publication series

NameFindings of the Association for Computational Linguistics: NAACL 2022 - Findings

Conference

Conference2022 Findings of the Association for Computational Linguistics: NAACL 2022
Country/TerritoryUnited States
CitySeattle
Period7/10/227/15/22

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Fingerprint

Dive into the research topics of 'HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea'. Together they form a unique fingerprint.

Cite this