Faster temporal range queries over versioned text

Jinru He, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Versioned textual collections are collections that retain multiple versions of a document as it evolves over time. Important large-scale examples are Wikipedia and the web collection of the Internet Archive. Search queries over such collections often use keywords as well as temporal constraints, most commonly a time range of interest. In this paper, we study how to support such temporal range queries over versioned text. Our goal is to process these queries faster than the corresponding keyword-only queries, by exploiting the additional constraint. A simple approach might partition the index into different time ranges, and then access only the relevant parts. However, specialized inverted index compression techniques are crucial for large versioned collections, and a naive partitioning can negatively affect index compression and query throughput. We show how to achieve high query throughput by using smart index partitioning techniques that take index compression into account. Experiments on over 85 million versions of Wikipedia articles show that queries can be executed in a few milliseconds on memory-based index structures, and only slightly more time on disk-based structures. We also show how to efficiently support the recently proposed stable top-k search primitive on top of our schemes.
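The "simple approach" mentioned in the abstract, partitioning the index by time range and reading only the relevant parts, can be illustrated with a toy sketch. This is not the authors' method: the class name `TimePartitionedIndex`, the fixed partition boundaries, and the half-open validity intervals `[start, end)` are all assumptions made for illustration, and nothing here is compressed.

```python
from bisect import bisect_right
from collections import defaultdict


class TimePartitionedIndex:
    """Toy inverted index split into fixed time partitions (hypothetical)."""

    def __init__(self, boundaries):
        # Sorted partition start times, e.g. [0, 100, 200]; the last
        # partition is open-ended. Times are assumed >= boundaries[0].
        self.boundaries = list(boundaries)
        self.partitions = [defaultdict(set) for _ in self.boundaries]

    def _partition_of(self, t):
        # Index of the partition whose time range contains t.
        return max(0, bisect_right(self.boundaries, t) - 1)

    def add_version(self, doc_id, version, start, end, text):
        """Index a version valid over [start, end)."""
        first = self._partition_of(start)
        last = self._partition_of(end - 1)
        terms = set(text.lower().split())
        # A version that crosses a boundary is replicated into every
        # partition it overlaps.
        for p in range(first, last + 1):
            for term in terms:
                self.partitions[p][term].add((doc_id, version, start, end))

    def query(self, terms, q_start, q_end):
        """Conjunctive keyword query restricted to [q_start, q_end)."""
        if not terms:
            return set()
        first = self._partition_of(q_start)
        last = self._partition_of(q_end - 1)
        results = set()
        # Only partitions overlapping the query range are touched.
        for p in range(first, last + 1):
            postings = [self.partitions[p].get(t.lower(), set()) for t in terms]
            for doc_id, version, start, end in set.intersection(*postings):
                # Keep versions whose validity interval overlaps the query.
                if start < q_end and end > q_start:
                    results.add((doc_id, version))
        return results
```

In this sketch, a version whose lifetime spans several partitions is stored once per partition, so finer partitions mean fewer postings scanned per query but more replicated postings overall. That is one way a naive partitioning can inflate the index, which is the kind of trade-off the abstract says must be weighed against index compression.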

    Original language: English (US)
    Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
    Publisher: Association for Computing Machinery
    Pages: 565-574
    Number of pages: 10
    ISBN (Print): 9781450309349
    State: Published - 2011
    Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011 - Beijing, China
    Duration: Jul 24 2011 → Jul 28 2011

    Publication series

    Name: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

    Conference

    Conference: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
    Country/Territory: China
    City: Beijing
    Period: 7/24/11 → 7/28/11

    Keywords

    • Inverted index
    • Query processing
    • Range queries
    • Temporal search
    • Versioned documents

    ASJC Scopus subject areas

    • Information Systems
