On compression-based text classification

Yuval Marton, Ning Wu, Lisa Hellerstein

    Research output: Contribution to journal, Conference article, peer-review

    Abstract

    Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
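To make the general idea concrete, here is a minimal sketch of one common way to classify text with an off-the-shelf compressor. It is not necessarily any of the methods evaluated in the paper. The sketch assumes that a document belongs to the class whose training text it compresses best alongside, measured as the extra bytes needed when the document is appended to that class's corpus. The function names `compressed_size` and `classify`, the toy corpora, and the choice of `zlib` as the compressor are all illustrative assumptions.

```python
import zlib


def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document: str, training_corpora: dict) -> str:
    """Pick the label whose corpus 'explains' the document most cheaply.

    For each class, measure how many extra compressed bytes are needed
    when the document is appended to that class's training text.
    The class with the smallest increase wins.
    """
    def extra_bytes(corpus: str) -> int:
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training_corpora, key=lambda label: extra_bytes(training_corpora[label]))


# Toy training data (hypothetical, for illustration only).
corpora = {
    "sports": "the team won the match after scoring two goals "
              "in the second half of the game " * 5,
    "cooking": "simmer the onions and garlic in olive oil then "
               "add the chopped tomatoes and basil " * 5,
}

print(classify("the team won the match after scoring two goals", corpora))
print(classify("add the chopped tomatoes and basil", corpora))
```

Because the compressor works on raw bytes, no tokenization or stemming is needed, which reflects the "virtually no preprocessing" property noted in the abstract. Repeated compression of each class corpus for every test document also hints at the slow running time the abstract mentions.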

    Original language: English (US)
    Pages (from-to): 300-314
    Number of pages: 15
    Journal: Lecture Notes in Computer Science
    Volume: 3408
    State: Published - 2005
    Event: 27th European Conference on IR Research, ECIR 2005 - Santiago de Compostela, Spain
    Duration: Mar 21 2005 - Mar 23 2005

    ASJC Scopus subject areas

    • Theoretical Computer Science
    • General Computer Science
