TY - GEN
T1 - A Sketch-based Index for Correlated Dataset Search
AU - Santos, Aecio
AU - Bessa, Aline
AU - Musco, Christopher
AU - Freire, Juliana
N1 - Funding Information:
This work was partially supported by the DARPA D3M program and NSF award ISS-2106888. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF and DARPA.
Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Dataset search is emerging as a critical capability in both research and industry: it has spurred many novel applications, ranging from enriching analyses of real-world phenomena to improving machine learning models. Recent research in this field has explored a new class of data-driven queries, in which the query is itself a dataset and the result is a set of related datasets retrieved from a large collection. In this paper, we study a specific type of data-driven query that supports relational data augmentation through numerical data relationships: given an input query table, find the top-k tables that are both joinable with it and contain columns that are correlated with a column in the query. We propose a novel hashing scheme that allows the construction of a sketch-based index to support efficient correlated table search. We show that the proposed approach is effective and efficient, and that it achieves better trade-offs, significantly improving both ranking accuracy and recall compared to state-of-the-art solutions.
KW - dataset search
KW - sketching
KW - table search
UR - http://www.scopus.com/inward/record.url?scp=85136424640&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85136424640&partnerID=8YFLogxK
U2 - 10.1109/ICDE53745.2022.00264
DO - 10.1109/ICDE53745.2022.00264
M3 - Conference contribution
AN - SCOPUS:85136424640
T3 - Proceedings - International Conference on Data Engineering
SP - 2928
EP - 2941
BT - Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
PB - IEEE Computer Society
T2 - 38th IEEE International Conference on Data Engineering, ICDE 2022
Y2 - 9 May 2022 through 12 May 2022
ER -