Faster top-k document retrieval using block-max indexes

Shuai Ding, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    Large search engines process thousands of queries per second over billions of documents, making query processing a major performance bottleneck. An important class of optimization techniques called early termination achieves faster query processing by avoiding the scoring of documents that are unlikely to be in the top results. We study new algorithms for early termination that outperform previous methods. In particular, we focus on safe techniques for disjunctive queries, which return the same result as an exhaustive evaluation over the disjunction of the query terms. The current state-of-the-art methods for this case, the WAND algorithm by Broder et al. [11] and the approach of Strohman and Croft [30], achieve great benefits but still leave a large performance gap between disjunctive and (even non-early terminated) conjunctive queries. We propose a new set of algorithms by introducing a simple augmented inverted index structure called a block-max index. Essentially, this is a structure that stores the maximum impact score for each block of a compressed inverted list in uncompressed form, thus enabling us to skip large parts of the lists. We show how to integrate this structure into the WAND approach, leading to considerable performance gains. We then describe extensions to a layered index organization, and to indexes with reassigned document IDs, that achieve additional gains that narrow the gap between disjunctive and conjunctive top-k query processing.

    Original languageEnglish (US)
    Title of host publicationSIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    PublisherAssociation for Computing Machinery
    Pages993-1002
    Number of pages10
    ISBN (Print)9781450309349
    DOIs
    StatePublished - 2011
    Event34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011 - Beijing, China
    Duration: Jul 24 2011Jul 28 2011

    Publication series

    NameSIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

    Conference

    Conference34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
    Country/TerritoryChina
    CityBeijing
    Period7/24/117/28/11

    Keywords

    • Early termination
    • IR query processing
    • Inverted index
    • Top-k query processing

    ASJC Scopus subject areas

    • Information Systems

    Fingerprint

    Dive into the research topics of 'Faster top-k document retrieval using block-max indexes'. Together they form a unique fingerprint.

    Cite this