TY - CPAPER
T1 - HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
T2 - Findings of the Association for Computational Linguistics: NAACL 2022
AU - Yoo, Haneul
AU - Jin, Jiho
AU - Son, Juhee
AU - Bak, Jin Yeong
AU - Cho, Kyunghyun
AU - Oh, Alice
N1 - Publisher Copyright:
© Findings of the Association for Computational Linguistics: NAACL 2022.
PY - 2022
Y1 - 2022
AB - Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters that is understood by neither modern Korean nor modern Chinese speakers. Historians with expertise in this period have been analyzing these documents, but the process is difficult and time-consuming, and language models could significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation (HUE) dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models with continued pretraining on two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show significant improvements from training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI), a dataset that has not been studied much by historians and not at all by the NLP community.
UR - http://www.scopus.com/inward/record.url?scp=85137363090&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137363090&partnerID=8YFLogxK
U2 - 10.18653/v1/2022.findings-naacl.140
DO - 10.18653/v1/2022.findings-naacl.140
M3 - Conference contribution
AN - SCOPUS:85137363090
T3 - Findings of the Association for Computational Linguistics: NAACL 2022
SP - 1832
EP - 1844
BT - Findings of the Association for Computational Linguistics: NAACL 2022
PB - Association for Computational Linguistics (ACL)
Y2 - 10 July 2022 through 15 July 2022
ER -