Abstract
Focused crawlers are effective tools for applications that require a large number of pages belonging to a specific topic. Several strategies for implementing these crawlers have been proposed in the literature. They aim to improve crawling efficiency by increasing the number of relevant pages retrieved while avoiding non-relevant pages. However, an important aspect of these crawlers has been largely overlooked: the selection of the seed pages that serve as the starting points for a crawl. In this paper, we show that the seeds can greatly influence the performance of crawlers, and we propose a new framework for automatically finding seeds. We describe a system that implements this framework and show, through a detailed experimental evaluation, that crawlers given a large and varied seed set achieve not only higher harvest rates but also improved topic coverage.
Original language | English (US) |
---|---|
Pages (from-to) | 449-474 |
Number of pages | 26 |
Journal | World Wide Web |
Volume | 19 |
Issue number | 3 |
State | Published - May 1 2016 |
Keywords
- Focused crawling
- Relevance feedback
- Web crawling
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications