Scalable techniques for document identifier assignment in inverted indexes

Shuai Ding, Josh Attenberg, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    Web search engines depend on the full-text inverted index data structure. Because the query processing performance is so dependent on the size of the inverted index, a plethora of research has focused on fast end effective techniques for compressing this structure. Recently, several authors have proposed techniques for improving index compression by optimizing the assignment of document identifiers to the documents in the collection, leading to significant reduction in overall index size. In this paper, we propose improved techniques for document identifier assignment. Previous work includes simple and fast heuristics such as sorting by URL, as well as more involved approaches based on the Traveling Salesman Problem or on graph partitioning. These techniques achieve good compression but do not scale to larger document collections. We propose a new framework based on performing a Traveling Salesman computation on a reduced sparse graph obtained through Locality Sensitive Hashing. This technique achieves improved compression while scaling to tens of millions of documents. Based on this framework, we describe a number of new algorithms, and perform a detailed evaluation on three large data sets showing improvements in index size.

    Original languageEnglish (US)
    Title of host publicationProceedings of the 19th International Conference on World Wide Web, WWW '10
    Pages311-320
    Number of pages10
    DOIs
    StatePublished - 2010
    Event19th International World Wide Web Conference, WWW2010 - Raleigh, NC, United States
    Duration: Apr 26 2010Apr 30 2010

    Publication series

    NameProceedings of the 19th International Conference on World Wide Web, WWW '10

    Other

    Other19th International World Wide Web Conference, WWW2010
    Country/TerritoryUnited States
    CityRaleigh, NC
    Period4/26/104/30/10

    Keywords

    • documentID reassignment
    • index compression

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Computer Science Applications

    Fingerprint

    Dive into the research topics of 'Scalable techniques for document identifier assignment in inverted indexes'. Together they form a unique fingerprint.

    Cite this