Correlation Sketches for Approximate Join-Correlation Queries

Aécio Santos, Aline Bessa, Fernando Chirigati, Christopher Musco, Juliana Freire

Research output: Contribution to journal › Conference article › peer-review


The increasing availability of structured datasets, from Web tables and open-data portals to enterprise data, opens up opportunities to enrich analytics and improve machine learning models through relational data augmentation. In this paper, we introduce a new class of data augmentation queries: join-correlation queries. Given a column Q and a join column K_Q from a query table T_Q, retrieve tables T_X in a dataset collection such that T_X is joinable with T_Q on K_Q and there is a column C ∈ T_X such that Q is correlated with C. A naïve approach to evaluate these queries, which first finds joinable tables and then explicitly joins and computes correlations between Q and all columns of the discovered tables, is prohibitively expensive. To efficiently support correlated column discovery, we (1) propose a sketching method that enables the construction of an index for a large number of tables and that provides accurate estimates for join-correlation queries, and (2) explore different scoring strategies that effectively rank the query results based on how well the columns are correlated with the query. We carry out a detailed experimental evaluation, using both synthetic and real data, which shows that our sketches attain high accuracy and the scoring strategies lead to high-quality rankings.
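To make the query definition concrete, the following is a minimal sketch of the naïve baseline described in the abstract, not the paper's sketching method. It assumes each column is held in memory as a mapping from join key to numeric value, that join keys are unique within a table, and that results are ranked by absolute Pearson correlation. The function names and the data layout are hypothetical choices made for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def naive_join_correlation(query, candidates):
    """Explicitly join Q with every candidate column and correlate the pairs.

    query:      dict mapping join key -> value of Q (assumed layout)
    candidates: dict mapping table name -> {column name -> {join key -> value}}
    Returns (table, column, correlation) tuples, strongest correlation first.
    """
    results = []
    for table, columns in candidates.items():
        for column, values in columns.items():
            # The explicit join: keep only keys present on both sides.
            shared = [key for key in query if key in values]
            r = pearson([query[k] for k in shared], [values[k] for k in shared])
            if r is not None:
                results.append((table, column, r))
    # Ranking by |r| is an assumption, not one of the paper's scoring strategies.
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)
```

Every candidate column is joined in full and scanned once per query, so the cost grows with the total size of the collection. That is the expense the abstract calls prohibitive and that the proposed sketches are meant to avoid.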

Original language: English (US)
Pages (from-to): 1531-1544
Number of pages: 14
Journal: Proceedings of the ACM SIGMOD International Conference on Management of Data
State: Published - 2021
Event: 2021 International Conference on Management of Data, SIGMOD 2021 - Virtual, Online, China
Duration: Jun 20 2021 - Jun 25 2021


Keywords

  • approximate query processing
  • confidence intervals
  • dataset search
  • join-correlation estimation
  • sketching algorithms

ASJC Scopus subject areas

  • Software
  • Information Systems

