Efficient algorithms for document retrieval problems

S. Muthukrishnan

    Research output: Chapter in Book/Report/Conference proceedingConference contribution

    Abstract

    We are given a collection D of text documents d1,...,dk, with Σ1|di| = n, which may be preprocessed. In the document listing problem, we are given an online query comprising of a pattern string p of length m and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing following which the occurrence listing problem can be solved in time 0(m + output) where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated. We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involve performing "local" encodings whereby they are reduced to range query problems on geometric objects -points and lines -that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.

    Original languageEnglish (US)
    Title of host publicationProceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002
    PublisherAssociation for Computing Machinery
    Pages657-666
    Number of pages10
    ISBN (Electronic)089871513X
    StatePublished - 2002
    Event13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 - San Francisco, United States
    Duration: Jan 6 2002Jan 8 2002

    Publication series

    NameProceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms
    Volume06-08-January-2002

    Other

    Other13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002
    CountryUnited States
    CitySan Francisco
    Period1/6/021/8/02

    ASJC Scopus subject areas

    • Software
    • Mathematics(all)

    Fingerprint Dive into the research topics of 'Efficient algorithms for document retrieval problems'. Together they form a unique fingerprint.

  • Cite this

    Muthukrishnan, S. (2002). Efficient algorithms for document retrieval problems. In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2002 (pp. 657-666). (Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms; Vol. 06-08-January-2002). Association for Computing Machinery.