Crosslingual Section Title Alignment in Wikipedia

Djellel Difallah, Diego Saez-Trumper, Eriq Augustine, Robert West, Leila Zia

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Sections are the building blocks of Wikipedia articles. Editors use them to structure the content of articles, which in turn improves reading and editing workflows. Today, millions of carefully curated section titles exist in more than 160 actively edited Wikipedia languages as standalone components of a larger system. Understanding the connection and correspondence of section titles across languages opens up various applications, such as article template recommendation: given a source-language article, we can generate a skeleton of section titles for a target language. Inspired by this real-world data mining problem, the present paper introduces the problem of aligning section titles across Wikipedia languages and proposes a probabilistic method for identifying such correspondences. Instead of applying translation tools to section titles (which may generate out-of-lexicon titles), we develop a supervised model that identifies cross-language mappings based on section content features. With the help of volunteers, we collected a ground-truth dataset for this purpose. In addition, we use Probabilistic Soft Logic to model the dependencies between multilingual section pairings. We show that our approach performs better than machine translation solutions in about 80% of the language pairs, including distant language pairs such as Arabic to Russian or French to Japanese, as well as many more closely related pairs such as French to Spanish.

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022
Editors: Shusaku Tsumoto, Yukio Ohsawa, Lei Chen, Dirk Van den Poel, Xiaohua Hu, Yoichi Motomura, Takuya Takagi, Lingfei Wu, Ying Xie, Akihiro Abe, Vijay Raghavan
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 5892-5901
Number of pages: 10
ISBN (Electronic): 9781665480451
State: Published - 2022
Event: 2022 IEEE International Conference on Big Data, Big Data 2022 - Osaka, Japan
Duration: Dec 17 2022 to Dec 20 2022

Keywords

  • Cross-lingual Alignment
  • Crowdsourcing
  • Instance Matching
  • Probabilistic Soft Logic
  • Wikipedia

ASJC Scopus subject areas

  • Modeling and Simulation
  • Computer Networks and Communications
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
  • Control and Optimization
