Clustering Wikipedia infoboxes to discover their types

Thanh Hoang Nguyen, Huong Dieu Nguyen, Viviane Moreira, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high-quality clusters.
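
The abstract does not say which similarity measure or clustering algorithm the authors use. As a rough illustration of the general idea only (grouping infoboxes by their structure rather than by template labels), the sketch below assumes each infobox is reduced to its set of attribute names, compares those sets with Jaccard similarity, and merges any pair above a threshold. The function names, the threshold of 0.5, and the toy data are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of attribute names."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_infoboxes(infoboxes, threshold=0.5):
    """Single-link clustering over attribute-name sets (illustrative only).

    Any two infoboxes whose Jaccard similarity is >= threshold are placed
    in the same cluster; clusters are merged with a small union-find.
    """
    names = list(infoboxes)
    attrs = {n: set(infoboxes[n]) for n in names}
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if jaccard(attrs[a], attrs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data: article title -> attribute names found in its infobox.
infoboxes = {
    "Inception": {"director", "starring", "running_time", "release_date"},
    "Alien": {"director", "starring", "running_time", "budget"},
    "Paris": {"country", "population", "mayor", "area"},
    "Lyon": {"country", "population", "mayor", "elevation"},
}

for members in cluster_infoboxes(infoboxes):
    print(members)
```

On this toy input the two film infoboxes share three of five distinct attribute names and so fall into one cluster, as do the two city infoboxes, without consulting any template label. A realistic pipeline would also need to handle attribute-name synonyms and attribute values, which this sketch ignores.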

Original language: English (US)
Title of host publication: CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
Pages: 2134-2138
Number of pages: 5
DOIs: https://doi.org/10.1145/2396761.2398588
State: Published - Dec 19 2012
Event: 21st ACM International Conference on Information and Knowledge Management, CIKM 2012 - Maui, HI, United States
Duration: Oct 29 2012 to Nov 2 2012

Publication series

Name: ACM International Conference Proceeding Series

Other

Other: 21st ACM International Conference on Information and Knowledge Management, CIKM 2012
Country: United States
City: Maui, HI
Period: 10/29/12 to 11/2/12

Keywords

  • clustering
  • wikipedia infobox

ASJC Scopus subject areas

  • Software
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Networks and Communications

Cite this

Nguyen, T. H., Nguyen, H. D., Moreira, V., & Freire, J. (2012). Clustering Wikipedia infoboxes to discover their types. In CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 2134-2138). (ACM International Conference Proceeding Series). https://doi.org/10.1145/2396761.2398588