TY - GEN
T1 - Bootstrapping domain-specific content discovery on the web
AU - Pham, Kien
AU - Santos, Aécio
AU - Freire, Juliana
N1 - Publisher Copyright:
© 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
PY - 2019/5/13
Y1 - 2019/5/13
N2 - The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in D, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites . DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-ofthe-art methods.
AB - The ability to continuously discover domain-specific content from the Web is critical for many applications. While focused crawling strategies have been shown to be effective for discovery, configuring a focused crawler is difficult and time-consuming. Given a domain of interest D, subject-matter experts (SMEs) must search for relevant websites and collect a set of representative Web pages to serve as training examples for creating a classifier that recognizes pages in D, as well as a set of pages to seed the crawl. In this paper, we propose DISCO, an approach designed to bootstrap domain-specific search. Given a small set of websites, DISCO aims to discover a large collection of relevant websites . DISCO uses a ranking-based framework that mimics the way users search for information on the Web: it iteratively discovers new pages, distills, and ranks them. It also applies multiple discovery strategies, including keyword-based and related queries issued to search engines, backward and forward crawling. By systematically combining these strategies, DISCO is able to attain high harvest rates and coverage for a variety of domains. We perform extensive experiments in four social-good domains, using data gathered by SMEs in the respective domains, and show that our approach is effective and outperforms state-ofthe-art methods.
KW - Domain-specific Website discovery
KW - Focused crawling
KW - Meta search
UR - http://www.scopus.com/inward/record.url?scp=85066884701&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066884701&partnerID=8YFLogxK
U2 - 10.1145/3308558.3313709
DO - 10.1145/3308558.3313709
M3 - Conference contribution
AN - SCOPUS:85066884701
T3 - The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019
SP - 1476
EP - 1486
BT - The Web Conference 2019 - Proceedings of the World Wide Web Conference, WWW 2019
PB - Association for Computing Machinery, Inc
T2 - 2019 World Wide Web Conference, WWW 2019
Y2 - 13 May 2019 through 17 May 2019
ER -