Optimal freshness crawl under politeness constraints

Andrey Kolobov, Eyal Lubetzky, Yuval Peres, Eric Horvitz

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

A Web crawler is an essential part of a search engine that procures information subsequently served by the search engine to its users. As the Web is becoming increasingly more dynamic, in addition to discovering new web pages a crawler needs to keep revisiting those already in the search engine's index, in order to keep the index fresh by picking up the pages' changed content. Determining how often to recrawl pages requires making tradeoffs based on the pages' relative importance and change rates, subject to multiple resource constraints - the limited daily budget of crawl requests on the search engine's end and politeness constraints restricting the rate at which pages can be requested from a given host. In this paper, we introduce PoliteBinaryLambdaCrawl, the first optimal algorithm for freshness crawl scheduling in the presence of politeness constraints as well as non-uniform page importance scores and the crawler's own crawl request limit. We also propose an approximation for it, stating its theoretical optimality conditions and in the process discovering a connection to an approach previously thought of as a mere heuristic for freshness crawl scheduling. We explore the relative performance of PoliteBinaryLambdaCrawl and other methods for handling politeness constraints on a dataset collected by crawling over 18.5M URLs daily over 14 weeks.

Original languageEnglish (US)
Title of host publicationSIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
PublisherAssociation for Computing Machinery, Inc
Pages495-504
Number of pages10
ISBN (Electronic)9781450361729
DOIs
StatePublished - Jul 18 2019
Event42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019 - Paris, France
Duration: Jul 21 2019Jul 25 2019

Publication series

NameSIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019
CountryFrance
CityParis
Period7/21/197/25/19

Keywords

  • Convex optimization
  • Lagrange multiplier
  • Planning under uncertainty
  • Politeness constraint
  • Search engine
  • Web crawling

ASJC Scopus subject areas

  • Information Systems
  • Applied Mathematics
  • Software

Fingerprint Dive into the research topics of 'Optimal freshness crawl under politeness constraints'. Together they form a unique fingerprint.

Cite this