Cluster-based delta compression of a collection of files

Z. Ouyang, N. Memon, T. Suel, D. Trendafilov

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. We study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of Web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
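The idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and every name in it (`compressed_size`, `shingles`, `choose_references`, the `threshold` parameter) is hypothetical. It makes three loud assumptions: zlib with a preset dictionary stands in for a real delta compressor, Jaccard similarity over byte shingles stands in for the paper's text clustering step, and a greedy edge selection stands in for an exact maximum-weight branching algorithm, so the result is a valid branching but not necessarily an optimal one.

```python
import zlib
from itertools import permutations


def compressed_size(data: bytes, reference: bytes = b"") -> int:
    """Bytes needed to zlib-compress `data`, optionally primed with `reference`.

    Assumption: a stand-in for a real delta compressor. zlib only uses the
    last 32 KB of the preset dictionary, so large references are truncated.
    """
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def shingles(data: bytes, k: int = 8) -> set:
    """Set of all k-byte substrings, used as a cheap similarity signature."""
    return {data[i:i + k] for i in range(max(1, len(data) - k + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def choose_references(files: dict, threshold: float = 0.3) -> dict:
    """Return {target: reference}, a branching over the pruned delta graph."""
    sigs = {name: shingles(data) for name, data in files.items()}
    edges = []
    for ref, target in permutations(files, 2):
        # Pruning step: only similar-looking pairs get an (expensive) delta.
        if jaccard(sigs[ref], sigs[target]) < threshold:
            continue
        standalone = compressed_size(files[target])
        delta = compressed_size(files[target], files[ref])
        if standalone > delta:
            edges.append((standalone - delta, ref, target))

    parent = {}
    # Greedy heuristic (not an exact maximum branching): take the largest
    # savings first, keeping one reference per file and no cycles.
    for _saving, ref, target in sorted(edges, reverse=True):
        if target in parent:
            continue
        node, creates_cycle = ref, False
        while node in parent:
            node = parent[node]
            if node == target:
                creates_cycle = True
                break
        if not creates_cycle:
            parent[target] = ref
    return parent
```

Calling `choose_references({"a.html": data_a, "b.html": data_b})` returns a mapping from each delta-encoded file to its reference; files absent from the mapping would be compressed on their own. Without the similarity filter the loop would compute a delta for every ordered pair of files, which is the quadratic cost the abstract says does not scale.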

    Original language: English (US)
    Title of host publication: WISE 2002 - Proceedings of the 3rd International Conference on Web Information Systems Engineering
    Editors: Wee Keong Ng, Tok Wang Ling, Angela Goh, Umeshwar Dayal, Elisa Bertino
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 257-266
    Number of pages: 10
    ISBN (Electronic): 0769517668, 9780769517667
    State: Published - 2002
    Event: 3rd International Conference on Web Information Systems Engineering, WISE 2002 - Singapore, Singapore
    Duration: Dec 12 2002 to Dec 14 2002

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Information Systems
    • Control and Systems Engineering
