TY - GEN
T1 - DSDD
T2 - 30th ACM International Conference on Information and Knowledge Management, CIKM 2021
AU - Zhang, Haoxiang
AU - Santos, Aécio
AU - Freire, Juliana
N1 - Funding Information:
This work was partially supported by the DARPA D3M program and NSF award OAC-1640864. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and DARPA.
Publisher Copyright:
© 2021 ACM.
PY - 2021/10/26
Y1 - 2021/10/26
N2 - With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But as datasets are spread over a plethora of Web sites, finding data that are relevant for a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model to recognize sites that contain datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed bandit based algorithm to balance the trade-offs of different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.
AB - With the push for transparency and open data, many datasets and data repositories are becoming available on the Web. This opens new opportunities for data-driven exploration, from empowering analysts to answer new questions and obtain insights to improving predictive models through data augmentation. But as datasets are spread over a plethora of Web sites, finding data that are relevant for a given task is difficult. In this paper, we take a first step towards the construction of domain-specific data lakes. We propose an end-to-end dataset discovery system, targeted at domain experts, which given a small set of keywords, automatically finds potentially relevant datasets on the Web. The system makes use of search engines to hop across Web sites, uses online learning to incrementally build a model to recognize sites that contain datasets, utilizes a set of discovery actions to broaden the search, and applies a multi-armed bandit based algorithm to balance the trade-offs of different discovery actions. We report the results of an extensive experimental evaluation over multiple domains, and demonstrate that our strategy is effective and outperforms state-of-the-art content discovery methods.
KW - domain-specific dataset discovery
KW - focused crawling
KW - meta search
KW - multi-armed bandit
KW - online learning
UR - http://www.scopus.com/inward/record.url?scp=85119212306&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119212306&partnerID=8YFLogxK
U2 - 10.1145/3459637.3482427
DO - 10.1145/3459637.3482427
M3 - Conference contribution
AN - SCOPUS:85119212306
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 2527
EP - 2536
BT - CIKM 2021 - Proceedings of the 30th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
Y2 - 1 November 2021 through 5 November 2021
ER -