Vector space classification of DNA sequences

H. M. Müller, S. E. Koonin

Research output: Contribution to journal › Article › peer-review

Abstract

Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
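The pipeline described in the abstract (word-content vectors, then a PCA, then Gaussian classes) can be illustrated with a minimal sketch. This is not the authors' implementation. The word length `k=3`, the use of overlapping words, the single PCA over the pooled training set, the diagonal Gaussian per class, and the names `word_vector`, `fit`, and `classify` are all assumptions made for illustration; the abstract does not specify any of them.

```python
from itertools import product

import numpy as np

ALPHABET = "ACGT"


def word_vector(seq, k=3):
    """Return the frequencies of overlapping k-letter words in seq.

    Assumption: overlapping words of length k over A, C, G, T.
    Words containing any other symbol are skipped.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {w: i for i, w in enumerate(words)}
    vec = np.zeros(len(words))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    total = vec.sum()
    return vec / total if total else vec


def fit(labelled, k=3, n_components=2):
    """Fit one PCA on all training vectors, then one Gaussian per class.

    labelled maps a class name to a list of sequences.
    Assumption: a shared PCA and diagonal covariances in PC space.
    """
    names = list(labelled)
    X = np.array([word_vector(s, k) for n in names for s in labelled[n]])
    y = np.array([n for n in names for _ in labelled[n]])
    mean = X.mean(axis=0)
    # Rows of vt are the principal axes of the centred data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    axes = vt[:n_components]
    Z = (X - mean) @ axes.T
    classes = {
        n: (Z[y == n].mean(axis=0), Z[y == n].var(axis=0) + 1e-6)
        for n in names
    }
    return {"k": k, "mean": mean, "axes": axes, "classes": classes}


def classify(seq, model):
    """Return the class whose Gaussian gives seq the highest log-likelihood."""
    z = model["axes"] @ (word_vector(seq, model["k"]) - model["mean"])

    def log_likelihood(params):
        mu, var = params
        return -0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var))

    return max(model["classes"], key=lambda n: log_likelihood(model["classes"][n]))
```

With training data supplied as, say, `fit({"exon": exon_seqs, "intron": intron_seqs})` (both lists hypothetical), `classify(seq, model)` returns the label of the more likely class. The hexamer baseline mentioned in the abstract would instead count non-overlapping words of length six.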

Original language: English (US)
Pages (from-to): 161-169
Number of pages: 9
Journal: Journal of Theoretical Biology
Volume: 223
Issue number: 2
State: Published - Jul 21 2003

Keywords

  • Clustering
  • Document vector
  • Gene structure
  • Genomics
  • Intron-exon identification
  • Principal component analysis

ASJC Scopus subject areas

  • Statistics and Probability
  • Modeling and Simulation
  • Biochemistry, Genetics and Molecular Biology (all)
  • Immunology and Microbiology (all)
  • Agricultural and Biological Sciences (all)
  • Applied Mathematics
