Abstract
Compression-based text classification methods are easy to apply, requiring virtually no preprocessing of the data. Most such methods are character-based, and thus have the potential to automatically capture non-word features of a document, such as punctuation, word-stems, and features spanning more than one word. However, compression-based classification methods have drawbacks (such as slow running time), and not all such methods are equally effective. We present the results of a number of experiments designed to evaluate the effectiveness and behavior of different compression-based text classification methods on English text. Among our experiments are some specifically designed to test whether the ability to capture non-word (including super-word) features causes character-based text compression methods to achieve more accurate classification.
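As a rough illustration of why such methods need so little preprocessing, the sketch below assigns a document to the class whose training text lets the document be encoded in the fewest additional bytes. It is a minimal sketch rather than the method evaluated in the paper: the use of `zlib` as the compressor, the helper names `compressed_size` and `classify`, and the toy training texts are all assumptions made for this example.

```python
import zlib


def compressed_size(text):
    """Return the size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def classify(document, training):
    """Return the label whose training text makes the document cheapest to encode.

    `training` maps each class label to one string of concatenated training text
    (a hypothetical layout chosen for this sketch).
    """
    def extra_bytes(label):
        corpus = training[label]
        # Cost of the document given the class: growth in compressed size
        # when the document is appended to that class's training text.
        return compressed_size(corpus + " " + document) - compressed_size(corpus)

    return min(training, key=extra_bytes)


# Toy training data, invented for illustration only.
training = {
    "pets": "the cat sat on the mat. the dog chased the cat. the cat ignored the dog.",
    "finance": "stocks fell sharply as markets reacted to interest rates. bond yields rose.",
}

label = classify("the dog chased the cat.", training)
```

The compressor operates on raw characters, so punctuation, word fragments, and multi-word phrases all contribute to the match without any tokenization step. One caveat of this particular stand-in: DEFLATE only looks back over a 32 KiB window, so with large training corpora most of the training text would not influence the result.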
Original language | English (US) |
---|---|
Pages (from-to) | 300-314 |
Number of pages | 15 |
Journal | Lecture Notes in Computer Science |
Volume | 3408 |
State | Published - 2005 |
Event | 27th European Conference on IR Research, ECIR 2005 - Santiago de Compostela, Spain. Duration: Mar 21 2005 → Mar 23 2005 |
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science