TY - GEN
T1 - HUE
T2 - 2022 Findings of the Association for Computational Linguistics: NAACL 2022
AU - Yoo, Haneul
AU - Jin, Jiho
AU - Son, Juhee
AU - Bak, Jin Yeong
AU - Cho, Kyunghyun
AU - Oh, Alice
N1 - Funding Information:
We would like to thank Yoonman Heo (Institute for the Translation of Korean Classics) for providing expertise on hanja and Korean Classical Chinese. This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921). This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00421, Artificial Intelligence Graduate School Program (Sungkyunkwan University)). Kyunghyun Cho was supported by the NYU Center for Data Science, the National Science Foundation (Award 1922658), and the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI).
Publisher Copyright:
© Findings of the Association for Computational Linguistics: NAACL 2022 - Findings.
PY - 2022
Y1 - 2022
N2 - Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters that is not understood by modern Korean or Chinese speakers. Historians with expertise in this period have been analyzing the documents, but that process is very difficult and time-consuming, and language models could significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models further pretrained on two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has been studied little by historians and not at all by the NLP community.
AB - Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters that is not understood by modern Korean or Chinese speakers. Historians with expertise in this period have been analyzing the documents, but that process is very difficult and time-consuming, and language models could significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models further pretrained on two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has been studied little by historians and not at all by the NLP community.
UR - http://www.scopus.com/inward/record.url?scp=85137363090&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137363090&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85137363090
T3 - Findings of the Association for Computational Linguistics: NAACL 2022 - Findings
SP - 1832
EP - 1844
BT - Findings of the Association for Computational Linguistics: NAACL 2022
PB - Association for Computational Linguistics (ACL)
Y2 - 10 July 2022 through 15 July 2022
ER -