To index or not to index: Time–space trade-offs for positional ranking functions in search engines

Diego Arroyuelo, Senén González, Mauricio Marin, Mauricio Oyarzún, Torsten Suel, Luis Valenzuela

    Research output: Contribution to journalArticlepeer-review

    Abstract

    Positional ranking functions, widely used in web search engines and related search systems, improve result quality by exploiting the positions of the query terms within documents. However, it is well known that positional indexes demand large amounts of extra space, typically about three times the space of a basic nonpositional index. Textual data, on the other hand, is needed to produce text snippets. In this paper, we study time–space trade-offs for search engines with positional ranking functions and text snippet generation. We consider both index-based and non-index based alternatives for positional data. We aim to answer the question of whether positional data should be indexed, and how. We show that there is a wide range of practical time–space trade-offs. Moreover, we show that using about 1.30 times the space of positional data, we can store everything needed for efficient query processing, with a minor increase in query time. This yields considerable space savings and outperforms, both in space and time, recent alternatives from literature. We also propose several efficient compressed text representations for snippet generation, which are able to use about half of the space of current state-of-the-art alternatives with little impact in query processing time.

    Original languageEnglish (US)
    Article number101466
    JournalInformation Systems
    Volume89
    DOIs
    StatePublished - Mar 2020

    Keywords

    • Index compression
    • Positional indexing
    • Snippet generation
    • Text compression
    • Wavelet trees

    ASJC Scopus subject areas

    • Software
    • Information Systems
    • Hardware and Architecture

    Fingerprint

    Dive into the research topics of 'To index or not to index: Time–space trade-offs for positional ranking functions in search engines'. Together they form a unique fingerprint.

    Cite this