TY - GEN
T1 - Understanding website behavior based on user agent
AU - Pham, Kien
AU - Santos, Aécio
AU - Freire, Juliana
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/7/7
Y1 - 2016/7/7
N2 - Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important to design crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on their responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.
AB - Web sites have adopted a variety of adversarial techniques to prevent web crawlers from retrieving their content. While it is possible to simulate user behavior using a browser to crawl such sites, this approach is not scalable. Therefore, understanding existing adversarial techniques is important to design crawling strategies that can adapt to retrieve the content as efficiently as possible. Ideally, a web crawler should detect the nature of the adversarial policies and select the most cost-effective means to defeat them. In this paper, we discuss the results of a large-scale study of web site behavior based on their responses to different user-agents. We issued over 9 million HTTP GET requests to 1.3 million unique web sites from DMOZ using six different user-agents and the TOR network as an anonymous proxy. We observed that web sites do change their responses depending on user-agents and IP addresses. This suggests that probing sites for these features can be an effective means to detect adversarial techniques.
KW - Stealth crawling
KW - User-agent string
KW - Web cloaking
KW - Web crawler detection
UR - http://www.scopus.com/inward/record.url?scp=84980349497&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84980349497&partnerID=8YFLogxK
U2 - 10.1145/2911451.2914757
DO - 10.1145/2911451.2914757
M3 - Conference contribution
AN - SCOPUS:84980349497
T3 - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1053
EP - 1056
BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Y2 - 17 July 2016 through 21 July 2016
ER -