Selectivity estimation for Boolean queries

Zhiyuan Chen, Flip Korn, Nick Koudas, S. Muthukrishnan

    Research output: Contribution to conference, Paper, peer-review

    Abstract

    In a variety of applications ranging from optimizing queries on alphanumeric attributes to providing approximate counts of documents containing several query terms, there is an increasing need to quickly and reliably estimate the number of strings (tuples, documents, etc.) matching a Boolean query. Boolean queries in this context consist of substring predicates composed using Boolean operators. While there has been some work in estimating the selectivity of substring queries, the more general problem of estimating the selectivity of Boolean queries over substring predicates has not been studied. Our approach is to extract selectivity estimates from relationships between the substring predicates of the Boolean query. However, storing the correlation between all possible predicates in order to provide an exact answer to such predicates is clearly infeasible, as there is a super-exponential number of possible combinations of these predicates. Instead, our novel idea is to capture correlations in a space-efficient but approximate manner. We employ a Monte Carlo technique called set hashing to succinctly represent the set of strings containing a given substring as a signature vector of hash values. Correlations among substring predicates can then be generated on-the-fly by operating on these signatures. We formalize our approach and propose an algorithm for estimating the selectivity of any Boolean query using the signatures of its substring predicates. We then experimentally demonstrate the superiority of our approach over a straightforward approach based on the independence assumption wherein correlations are not explicitly captured.
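    The signature idea described in the abstract can be illustrated with a minimal sketch. It assumes that "set hashing" refers to min-wise hashing style signatures, and it uses linear hash functions modulo a prime as a stand-in for random permutations. All function names, the choice of prime, and the sample data are hypothetical and are not taken from the paper; this is not the authors' algorithm, only one way the signatures might be built and combined.

```python
import random

# Assumption: string ids are non-negative integers smaller than PRIME.
PRIME = 2_147_483_647  # the Mersenne prime 2^31 - 1


def make_hash_params(k, seed=0):
    """Draw k (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(ids, params):
    """Signature of a non-empty id set: the minimum hash value under each function."""
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def resemblance(sig_a, sig_b):
    """Estimate |A n B| / |A u B| as the fraction of matching signature components."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def union_signature(sig_a, sig_b):
    """Signature of A u B, computed from the two signatures alone."""
    return [min(u, v) for u, v in zip(sig_a, sig_b)]


def intersection_size(size_a, size_b, rho):
    """Solve rho = |A n B| / (|A| + |B| - |A n B|) for |A n B|."""
    return rho * (size_a + size_b) / (1 + rho)


def independence_estimate(size_a, size_b, n):
    """Baseline that treats the two predicates as independent over n strings."""
    return size_a * size_b / n


def ids_containing(strings, substring):
    """Ids (list positions) of the strings that contain the given substring."""
    return {i for i, s in enumerate(strings) if substring in s}


if __name__ == "__main__":
    # Hypothetical toy data set, for illustration only.
    strings = ["database", "datum", "basement", "update", "baseline", "metadata"]
    params = make_hash_params(k=128, seed=42)
    a = ids_containing(strings, "data")
    b = ids_containing(strings, "base")
    rho = resemblance(signature(a, params), signature(b, params))
    print("signature estimate:", intersection_size(len(a), len(b), rho))
    print("independence estimate:", independence_estimate(len(a), len(b), len(strings)))
```

    In this sketch a conjunction of two substring predicates is estimated from the resemblance of their signatures, while a disjunction can be handled by taking the component-wise minimum of the signatures. The estimate becomes more reliable as the number of hash functions k grows, at the cost of longer signatures.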

    Original language: English (US)
    Pages: 216-225
    Number of pages: 10
    State: Published - 2000
    Event: PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems - Dallas, TX, USA
    Duration: May 15 2000 to May 17 2000

    Conference

    Conference: PODS 2000 - 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
    City: Dallas, TX, USA
    Period: 5/15/00 to 5/17/00

    ASJC Scopus subject areas

    • Software
    • Information Systems
    • Hardware and Architecture
