A Sketch-based Index for Correlated Dataset Search

Aecio Santos, Aline Bessa, Christopher Musco, Juliana Freire

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Dataset search is emerging as a critical capability in both research and industry: it has spurred many novel applications, ranging from the enrichment of analyses of real-world phenomena to the improvement of machine learning models. Recent research in this field has explored a new class of data-driven queries, in which the query is itself a dataset and the result is a set of related datasets retrieved from a large collection. In this paper, we study a specific type of data-driven query that supports relational data augmentation through numerical data relationships: given an input query table, find the top-k tables that are both joinable with it and contain columns that are correlated with a column in the query. We propose a novel hashing scheme that allows the construction of a sketch-based index to support efficient correlated table search. We show that our proposed approach is effective and efficient, and that it achieves better trade-offs, significantly improving both ranking accuracy and recall compared to state-of-the-art solutions.
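To make the query definition above concrete, the following is a minimal brute-force sketch of a top-k joinable-and-correlated search. It is not the hashing scheme or sketch-based index proposed in the paper; it only illustrates the exact query that such an index would approximate. Several choices here are assumptions made for illustration: each table is reduced to a single key-to-value mapping, "joinable" is taken to mean sharing at least two join keys, correlation is measured with the absolute Pearson coefficient, and the names `pearson` and `top_k_correlated` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def top_k_correlated(query, candidates, k):
    """Exhaustive baseline: join each candidate with the query, then rank.

    query:      dict mapping join key -> numeric value (assumed layout)
    candidates: dict mapping column name -> dict of join key -> numeric value
    Returns up to k (name, |correlation|) pairs, strongest first.
    """
    scored = []
    for name, column in candidates.items():
        shared = sorted(query.keys() & column.keys())  # keys surviving the join
        if len(shared) < 2:
            continue  # assumption: fewer than two shared keys is not joinable
        r = pearson([query[key] for key in shared],
                    [column[key] for key in shared])
        scored.append((abs(r), name))
    # Sort by descending strength; break ties by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k]]


if __name__ == "__main__":
    query = {"a": 1, "b": 2, "c": 3, "d": 4}
    candidates = {
        "perfect": {"a": 2, "b": 4, "c": 6, "d": 8},
        "noisy": {"a": 1, "b": 3, "c": 2, "d": 4},
        "disjoint": {"x": 1, "y": 2},
    }
    print(top_k_correlated(query, candidates, k=2))
```

This baseline performs a full join against every candidate, so its cost grows with both the collection size and the table sizes; that cost is the motivation for answering the same query from compact sketches instead.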

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE 38th International Conference on Data Engineering, ICDE 2022
Publisher: IEEE Computer Society
Number of pages: 14
ISBN (Electronic): 9781665408837
State: Published - 2022
Event: 38th IEEE International Conference on Data Engineering, ICDE 2022 - Virtual, Online, Malaysia
Duration: May 9 2022 to May 12 2022

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627


Conference: 38th IEEE International Conference on Data Engineering, ICDE 2022
City: Virtual, Online


  • dataset search
  • sketching
  • table search

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

