TY - GEN
T1 - Crosslingual Section Title Alignment in Wikipedia
AU - Difallah, Djellel
AU - Saez-Trumper, Diego
AU - Augustine, Eriq
AU - West, Robert
AU - Zia, Leila
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Sections are the building blocks of Wikipedia articles. They are used by editors to create a structure for the content of articles, which in turn improves reading and editing workflows. Today, millions of carefully curated section titles exist in more than 160 actively edited Wikipedia languages as standalone components of a larger system. Understanding the connection and correspondence of section titles across languages presents various application opportunities such as article template recommendation, i.e., given a source language article, we can generate a skeleton of section titles for a target language. Inspired by this real-world data mining problem, the present paper introduces the problem of aligning section titles across Wikipedia languages and proposes a probabilistic method for identifying such correspondences. Instead of applying translation tools to section titles (which may generate out-of-lexicon titles), we develop a supervised model that identifies cross-language mappings based on section content features. We collected a ground-truth dataset created for this purpose with the help of volunteers. In addition, we use Probabilistic Soft Logic to model the dependencies between multilingual section pairings. We show that our approach performs better than machine translation solutions in about 80% of the language pairs, including distant language mappings such as Arabic to Russian or French to Japanese, and in many of the more closely related languages such as French to Spanish.
AB - Sections are the building blocks of Wikipedia articles. They are used by editors to create a structure for the content of articles, which in turn improves reading and editing workflows. Today, millions of carefully curated section titles exist in more than 160 actively edited Wikipedia languages as standalone components of a larger system. Understanding the connection and correspondence of section titles across languages presents various application opportunities such as article template recommendation, i.e., given a source language article, we can generate a skeleton of section titles for a target language. Inspired by this real-world data mining problem, the present paper introduces the problem of aligning section titles across Wikipedia languages and proposes a probabilistic method for identifying such correspondences. Instead of applying translation tools to section titles (which may generate out-of-lexicon titles), we develop a supervised model that identifies cross-language mappings based on section content features. We collected a ground-truth dataset created for this purpose with the help of volunteers. In addition, we use Probabilistic Soft Logic to model the dependencies between multilingual section pairings. We show that our approach performs better than machine translation solutions in about 80% of the language pairs, including distant language mappings such as Arabic to Russian or French to Japanese, and in many of the more closely related languages such as French to Spanish.
KW - Cross-lingual Alignment
KW - Crowdsourcing
KW - Instance Matching
KW - Probabilistic Soft Logic
KW - Wikipedia
UR - http://www.scopus.com/inward/record.url?scp=85147966502&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147966502&partnerID=8YFLogxK
U2 - 10.1109/BigData55660.2022.10020462
DO - 10.1109/BigData55660.2022.10020462
M3 - Conference contribution
AN - SCOPUS:85147966502
T3 - Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022
SP - 5892
EP - 5901
BT - Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022
A2 - Tsumoto, Shusaku
A2 - Ohsawa, Yukio
A2 - Chen, Lei
A2 - Van den Poel, Dirk
A2 - Hu, Xiaohua
A2 - Motomura, Yoichi
A2 - Takagi, Takuya
A2 - Wu, Lingfei
A2 - Xie, Ying
A2 - Abe, Akihiro
A2 - Raghavan, Vijay
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 IEEE International Conference on Big Data, Big Data 2022
Y2 - 17 December 2022 through 20 December 2022
ER -