TY - GEN
T1 - Optimal freshness crawl under politeness constraints
AU - Kolobov, Andrey
AU - Lubetzky, Eyal
AU - Peres, Yuval
AU - Horvitz, Eric
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/7/18
Y1 - 2019/7/18
N2 - A Web crawler is an essential part of a search engine that procures information subsequently served by the search engine to its users. As the Web is becoming increasingly more dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine's index, in order to keep the index fresh by picking up the pages' changed content. Determining how often to recrawl pages requires making tradeoffs based on the pages' relative importance and change rates, subject to multiple resource constraints - the limited daily budget of crawl requests on the search engine's end and politeness constraints restricting the rate at which pages can be requested from a given host. In this paper, we introduce PoliteBinaryLambdaCrawl, the first optimal algorithm for freshness crawl scheduling in the presence of politeness constraints as well as non-uniform page importance scores and the crawler's own crawl request limit. We also propose an approximation for it, stating its theoretical optimality conditions and in the process discovering a connection to an approach previously thought of as a mere heuristic for freshness crawl scheduling. We explore the relative performance of PoliteBinaryLambdaCrawl and other methods for handling politeness constraints on a dataset collected by crawling over 18.5M URLs daily over 14 weeks.
AB - A Web crawler is an essential part of a search engine that procures information subsequently served by the search engine to its users. As the Web is becoming increasingly more dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine's index, in order to keep the index fresh by picking up the pages' changed content. Determining how often to recrawl pages requires making tradeoffs based on the pages' relative importance and change rates, subject to multiple resource constraints - the limited daily budget of crawl requests on the search engine's end and politeness constraints restricting the rate at which pages can be requested from a given host. In this paper, we introduce PoliteBinaryLambdaCrawl, the first optimal algorithm for freshness crawl scheduling in the presence of politeness constraints as well as non-uniform page importance scores and the crawler's own crawl request limit. We also propose an approximation for it, stating its theoretical optimality conditions and in the process discovering a connection to an approach previously thought of as a mere heuristic for freshness crawl scheduling. We explore the relative performance of PoliteBinaryLambdaCrawl and other methods for handling politeness constraints on a dataset collected by crawling over 18.5M URLs daily over 14 weeks.
KW - Convex optimization
KW - Lagrange multiplier
KW - Planning under uncertainty
KW - Politeness constraint
KW - Search engine
KW - Web crawling
UR - http://www.scopus.com/inward/record.url?scp=85073809496&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073809496&partnerID=8YFLogxK
U2 - 10.1145/3331184.3331241
DO - 10.1145/3331184.3331241
M3 - Conference contribution
T3 - SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 495
EP - 504
BT - SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019
Y2 - 21 July 2019 through 25 July 2019
ER -