TY - GEN
T1 - Learning to discover domain-specific web content
AU - Pham, Kien
AU - Santos, Aécio
AU - Freire, Juliana
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/2/2
Y1 - 2018/2/2
N2 - The ability to discover all content relevant to an information domain has many applications, from helping in the understanding of humanitarian crises to countering human and arms trafficking. In such applications, time is of the essence: it is crucial to both maximize coverage and identify new content as soon as it becomes available, so that appropriate action can be taken. In this paper, we propose new methods for efficient domain-specific re-crawling that maximize the yield of new content. By learning the patterns of pages that have a high yield, our methods select a small set of pages that can be re-crawled frequently, increasing coverage and freshness while conserving resources. Unlike previous approaches to this problem, our methods combine different factors to optimize the re-crawling strategy, do not require full snapshots for the learning step, and dynamically adapt the strategy as the crawl progresses. In an empirical evaluation, we simulated the framework over 600 partial crawl snapshots in three different domains. The results show that our approach achieves 150% higher coverage than existing state-of-the-art techniques and captures 80% of new relevant content within 4 hours of publication.
UR - http://www.scopus.com/inward/record.url?scp=85046901390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046901390&partnerID=8YFLogxK
DO - 10.1145/3159652.3159724
M3 - Conference contribution
AN - SCOPUS:85046901390
T3 - WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining
SP - 432
EP - 440
BT - WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery, Inc
T2 - 11th ACM International Conference on Web Search and Data Mining, WSDM 2018
Y2 - 5 February 2018 through 9 February 2018
ER -