Abstract
Recently, Bessa et al. (PODS 2023) showed that sketches based on coordinated weighted sampling theoretically and empirically outperform popular linear sketching methods like Johnson-Lindentrauss projection and CountSketch for the ubiquitous problem of inner product estimation. We further develop this finding by introducing and analyzing two alternative sampling-based methods. In contrast to the computationally expensive algorithm in Bessa et al., our methods run in linear time (to compute the sketch) and perform better in practice, significantly beating linear sketching on a variety of tasks. For example, they provide state-of-the-art results for estimating the correlation between columns in unjoined tables, a problem that we show how to reduce to inner product estimation in a black-box way. While based on known sampling techniques (threshold and priority sampling) we introduce significant new theoretical analysis to prove approximation guarantees for our methods.
Original language | English (US) |
---|---|
Pages (from-to) | 2185-2197 |
Number of pages | 13 |
Journal | Proceedings of the VLDB Endowment |
Volume | 17 |
Issue number | 9 |
DOIs | |
State | Published - 2024 |
Event | 50th International Conference on Very Large Data Bases, VLDB 2024 - Guangzhou, China Duration: Aug 25 2024 → Aug 29 2024 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- General Computer Science