Understanding website behavior based on user agent

Kien Pham, Aécio Santos, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to crawl such sites by simulating user behavior in a browser, this approach is not scalable. Understanding existing adversarial techniques is therefore important for designing crawling strategies that can adapt and retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on the sites' responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.
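
The probing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the user-agent strings, the function names (`fetch_with_urllib`, `probe`, `differs_by_agent`), and the comparison by status code plus body hash are all assumptions made for illustration, and the six user-agents used in the study are not listed in this record. The sketch also omits the TOR proxy step.

```python
import hashlib
import urllib.error
import urllib.request
from typing import Callable, Dict, Tuple

# Hypothetical user-agent strings, chosen only for illustration.
USER_AGENTS: Dict[str, str] = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
               "(KHTML, like Gecko) Chrome/50.0 Safari/537.36",
    "crawler": "ExampleBot/1.0 (+http://example.com/bot)",
    "empty": "",
}

Response = Tuple[int, bytes]          # (HTTP status code, response body)
Fetch = Callable[[str, str], Response]


def fetch_with_urllib(url: str, user_agent: str, timeout: float = 10.0) -> Response:
    """Issue one HTTP GET with the given User-Agent header (standard library only)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are still observations, so record them.
        return err.code, err.read()


def probe(url: str,
          fetch: Fetch = fetch_with_urllib,
          agents: Dict[str, str] = USER_AGENTS) -> Dict[str, Tuple[int, str]]:
    """Request the same URL once per user-agent and fingerprint each response."""
    fingerprints = {}
    for name, user_agent in agents.items():
        status, body = fetch(url, user_agent)
        fingerprints[name] = (status, hashlib.sha256(body).hexdigest())
    return fingerprints


def differs_by_agent(fingerprints: Dict[str, Tuple[int, str]]) -> bool:
    """True if at least two user-agents received different fingerprints."""
    return len(set(fingerprints.values())) > 1
```

Exact hash comparison is a deliberately crude signal: pages with dynamic content can differ between two identical requests, so a practical probe would likely need repeated requests or a similarity measure before concluding that a site treats user-agents differently. Connection failures other than HTTP error statuses are left to propagate to the caller.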

Original language: English (US)
Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 1053-1056
Number of pages: 4
ISBN (Electronic): 9781450342902
State: Published - Jul 7 2016
Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
Duration: Jul 17 2016 to Jul 21 2016

Publication series

Name: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

Other

Other: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Country: Italy
City: Pisa
Period: 7/17/16 to 7/21/16

Keywords

  • Stealth crawling
  • User-agent string
  • Web cloaking
  • Web crawler detection

ASJC Scopus subject areas

  • Information Systems
  • Software
