TY - GEN
T1 - SwissLink
T2 - 13th International Conference on Semantic Systems, SEMANTiCS 2017
AU - Prokofyev, Roman
AU - Luggen, Michael
AU - Difallah, Djellel Eddine
AU - Cudré-Mauroux, Philippe
N1 - Funding Information: European Research Council (ERC)
N1 - Publisher Copyright: © 2017 Copyright held by the owner/author(s).
PY - 2017/9/11
Y1 - 2017/9/11
AB - Webpages are an abundant source of textual information with manually annotated entity links, and are often used as a source of training data for a wide variety of machine learning NLP tasks. However, manual annotations such as those found on Wikipedia are sparse, noisy, and biased towards popular entities. Existing entity linking systems deal with those issues by relying on simple statistics extracted from the data. While such statistics can effectively deal with noisy annotations, they introduce bias towards head entities and are ineffective for long-tail (e.g., unpopular) entities. In this work, we first analyze statistical properties linked to manual annotations by studying a large annotated corpus composed of all English Wikipedia webpages, in addition to all pages from the CommonCrawl containing English Wikipedia annotations. We then propose and evaluate a series of entity linking approaches, with the explicit goal of creating highly accurate (precision > 95%) and broad annotated corpora for machine learning tasks. Our results show that our best approach achieves maximal precision at usable recall levels, and outperforms both state-of-the-art entity linking systems and human annotators.
KW - Entity Linking
KW - Machine learning
KW - Manual annotations
UR - http://www.scopus.com/inward/record.url?scp=85041439820&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85041439820&partnerID=8YFLogxK
U2 - 10.1145/3132218.3132234
DO - 10.1145/3132218.3132234
M3 - Conference contribution
AN - SCOPUS:85041439820
T3 - ACM International Conference Proceeding Series
SP - 65
EP - 72
BT - Proceedings of the 13th International Conference on Semantic Systems, SEMANTiCS 2017
A2 - Hoekstra, Rinke
A2 - de Boer, Victor
A2 - Pellegrini, Tassilo
A2 - Faron-Zucker, Catherine
PB - Association for Computing Machinery
Y2 - 12 September 2017 through 13 September 2017
ER -