Abstract
The problem of estimating the performance of a given classifier on a given data set is discussed for the case when no knowledge is available concerning the underlying distributions. A new method of estimating the probability of misclassification is proposed which yields essentially unbiased results similar to Lachenbruch's U-method with far less computation involved. While no theoretical work is presented, a practical rule of thumb is given for choosing the parameters of the estimator. The results are based on experiments performed with a data set concerning six diseases related to epigastric pain, and underline the importance of reporting performance on both the testing data and the training data. Whereas previous papers have continually reported results with a probability of correct classification as high as 74.3 per cent on the raw data and 92.0 per cent on "processed" data, in this paper it is shown that a much more significant estimate of the probability of correct classification based on this data set is 51.0 per cent.
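The gap the abstract describes between training-data and testing-data performance can be illustrated with a minimal sketch. It is not the estimator proposed in the paper, whose details the abstract does not give. It compares the resubstitution error (testing on the training data) with a leave-one-out error, the idea behind Lachenbruch's U-method, for a nearest neighbour rule. The function names, the Euclidean distance, and the toy data are assumptions made for illustration only and are unrelated to the epigastric pain data set.

```python
import math


def nearest_label(x, points, labels, skip=None):
    """Return the label of the point closest to x (1-NN rule).

    If skip is an index, that point is excluded from the search,
    which is what leave-one-out estimation requires.
    """
    best_dist, best_label = math.inf, None
    for i, (p, lab) in enumerate(zip(points, labels)):
        if i == skip:
            continue
        d = math.dist(x, p)  # Euclidean distance (assumed metric)
        if d < best_dist:
            best_dist, best_label = d, lab
    return best_label


def resubstitution_error(points, labels):
    """Error rate when the classifier is tested on its own training data."""
    wrong = sum(
        nearest_label(x, points, labels) != y
        for x, y in zip(points, labels)
    )
    return wrong / len(points)


def leave_one_out_error(points, labels):
    """Error rate when each sample is classified by all the other samples."""
    wrong = sum(
        nearest_label(x, points, labels, skip=i) != y
        for i, (x, y) in enumerate(zip(points, labels))
    )
    return wrong / len(points)


# Hypothetical two-class toy data, invented for this sketch.
POINTS = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (0.5, 0.5), (0.6, 0.4)]
LABELS = ["a", "a", "b", "b", "a", "b"]

if __name__ == "__main__":
    print("resubstitution error:", resubstitution_error(POINTS, LABELS))
    print("leave-one-out error: ", leave_one_out_error(POINTS, LABELS))
```

On this toy data the resubstitution error is 0, because every point is its own nearest neighbour, while the leave-one-out error is 2/6. Reporting only the first figure would overstate how well the rule classifies unseen cases, and leave-one-out pays for its lower bias with one classifier evaluation per held-out sample.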
Field | Value |
---|---|
Original language | English (US) |
Pages (from-to) | 269-278 |
Number of pages | 10 |
Journal | Computers in Biology and Medicine |
Volume | 4 |
Issue number | 3-4 |
State | Published - Feb 1975 |
Keywords
- Classification
- Epigastric pain
- Feature size
- Nearest neighbour rule
- Nonparametric
- Pattern recognition
- Probability of misclassification
- Sample size
- Symptom diagnosis
- Testing sets
- Training sets
ASJC Scopus subject areas
- Computer Science Applications
- Health Informatics