Efficient search in large textual collections with redundancy

Jiangong Zhang, Torsten Suel

    Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

    Abstract

    Current web search engines focus on searching only the most recent snapshot of the web. In some cases, however, it would be desirable to search over collections that include many different crawls and versions of each page. One important example of such a collection is the Internet Archive, though there are many others. Since the data size of such an archive is multiple times that of a single snapshot, this presents us with significant performance challenges. Current engines use various techniques for index compression and optimized query execution, but these techniques do not exploit the significant similarities between different versions of a page, or between different pages.

    In this paper, we propose a general framework for indexing and query processing of archival collections and, more generally, any collections with a sufficient amount of redundancy. Our approach results in significant reductions in index size and query processing costs on such collections, and it is orthogonal to and can be combined with the existing techniques. It also supports highly efficient updates, both locally and over a network. Within this framework, we describe and evaluate different implementations that trade off index size versus CPU cost and other factors, and discuss applications ranging from archival web search to local search of web sites, email archives, or file systems. We present experimental results based on a search engine query log and a large collection consisting of multiple crawls.

    Original language: English (US)
    Title of host publication: 16th International World Wide Web Conference, WWW2007
    Pages: 411-420
    Number of pages: 10
    State: Published - 2007
    Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB, Canada
    Duration: May 8 2007 to May 12 2007

    Publication series

    Name: 16th International World Wide Web Conference, WWW2007

    Other

    Other: 16th International World Wide Web Conference, WWW2007
    Country/Territory: Canada
    City: Banff, AB
    Period: 5/8/07 to 5/12/07

    Keywords

    • Index compression
    • Inverted index
    • Query execution
    • Redundancy elimination
    • Search engines

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Software
