TY - GEN
T1 - Clustering Wikipedia infoboxes to discover their types
AU - Nguyen, Thanh Hoang
AU - Nguyen, Huong Dieu
AU - Moreira, Viviane
AU - Freire, Juliana
PY - 2012
Y1 - 2012
N2 - Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriate categories and provide infoboxes that follow pre-defined templates, many do not follow the guidelines or follow them loosely. This leads to undesirable effects, such as template duplication, heterogeneity, and schema drift. As a step towards addressing this problem, we propose a new unsupervised approach for clustering Wikipedia infoboxes. Instead of relying on manually assigned categories and template labels, we use the structured information available in infoboxes to group them and infer their entity types. Experiments using over 48,000 infoboxes indicate that our clustering approach is effective and produces high-quality clusters.
KW - clustering
KW - Wikipedia infobox
UR - http://www.scopus.com/inward/record.url?scp=84871054933&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84871054933&partnerID=8YFLogxK
U2 - 10.1145/2396761.2398588
DO - 10.1145/2396761.2398588
M3 - Conference contribution
AN - SCOPUS:84871054933
SN - 9781450311564
T3 - ACM International Conference Proceeding Series
SP - 2134
EP - 2138
BT - CIKM 2012 - Proceedings of the 21st ACM International Conference on Information and Knowledge Management
T2 - 21st ACM International Conference on Information and Knowledge Management, CIKM 2012
Y2 - 29 October 2012 through 2 November 2012
ER -