Compressing term positions in web indexes

Hao Yan, Shuai Ding, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    Abstract

    Large search engines process thousands of queries per second on billions of pages, making query processing a major factor in their operating costs. This has led to a lot of research on how to improve query throughput, using techniques such as massive parallelism, caching, early termination, and inverted index compression. We focus on techniques for compressing term positions in web search engine indexes. Most previous work has focused on compressing docID and frequency data, or position information in other types of text collections. Compression of term positions in web pages is complicated by the fact that term occurrences tend to cluster within documents but not across document boundaries, making it harder to exploit clustering effects. Also, typical access patterns for position data are different from those for docID and frequency data. We perform a detailed study of a number of existing and new techniques for compressing position data in web indexes. We also study how to efficiently access position data for ranking functions that take proximity features into account.

    Original language: English (US)
    Title of host publication: Proceedings - 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009
    Pages: 147-154
    Number of pages: 8
    State: Published - 2009
    Event: 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009 - Boston, MA, United States
    Duration: Jul 19 2009 → Jul 23 2009

    Keywords

    • Index compression
    • Inverted index
    • Search engines

    ASJC Scopus subject areas

    • Computer Science Applications
    • Information Systems
    • Information Systems and Management
