CLUSTER ANALYSIS OF ENGLISH TEXT.

Godfried T. Toussaint, Rajjan Shinghal

Research output: Contribution to conference › Paper › peer-review

Abstract

Large English texts on ten different subject matters were compiled. Estimates were obtained of the n-gram probability distributions and of word length for each of the texts, as well as for English as a whole. Experiments were done to test for pairwise differences among the ten texts. Principal component analysis and hierarchical clustering analysis were applied to the data in order to discover any possible similarities and dissimilarities among the different texts. Estimates were obtained of first-, second-, and third-order entropies for each text, and the texts were tested for pairwise differences according to their first-order entropy estimates. The results are of interest to researchers in psychology, biology, anthropology, and computational linguistics, as well as pattern recognition.

Original language: English (US)
Pages: 164-117
Number of pages: 48
State: Published - 1978
Event: Proc IEEE Comput Soc Conf Pattern Recognition Image Process - Chicago, IL, USA
Duration: May 31 1978 to Jun 2 1978

Other

Other: Proc IEEE Comput Soc Conf Pattern Recognition Image Process
City: Chicago, IL, USA
Period: 5/31/78 to 6/2/78

ASJC Scopus subject areas

  • General Engineering
