Predictive and Optimal Small-sample Model-based Classification


Informatics-based discovery can be greatly accelerated by integrating existing scientific theory with big data, particularly in high-dimensional settings where sample points are expensive or difficult to acquire.

Consider a difficult problem where one is given a small dataset from which to train a classifier. The scientific content of any model is characterized by its predictive capacity, and in the case of classification, predictive capacity is quantified by the misclassification rate. Given uncertainty in the true underlying model, the error rate of a classifier is typically assessed via error estimation methods, but an error estimator alone does not constitute scientific knowledge: without addressing the accuracy of an error estimator, it is simply a computation without meaning.

When there are few samples and none can be spared as a holdout set for validation, one must turn to training-data error estimators, which are typically based on resampling and counting methods. The simulation on the right, from [1], shows scatter plots of true versus estimated errors on a real breast cancer dataset. For each point in these plots, 50 training points are used for t-test feature selection and LDA classification, followed by (left to right in the figure) leave-one-out, 10-fold cross-validation, and bootstrap error estimation. The remaining 245 points serve as a holdout set to approximate the true error. A least-squares linear regression of estimated on true error is not only virtually parallel to the horizontal axis, but in fact has a slight negative slope: the true error and estimated error are negatively correlated!
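For concreteness, the snippet below sketches, in Python, the kind of experiment behind such a scatter plot. It is only an illustration under assumed conditions: synthetic Gaussian data stand in for the breast cancer set, the t-test feature selection step is omitted, and the sample sizes and repetition count are arbitrary. A small training set is used to design an LDA classifier and a 10-fold cross-validation error estimate, while a large holdout set approximates the true error.

    # Minimal sketch (illustrative, not the original study's code).
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    true_errs, cv_errs = [], []
    for _ in range(200):                        # one scatter point per repetition
        # draw a synthetic two-class Gaussian problem (assumed, for illustration)
        mu = 0.25 * rng.standard_normal(20)
        X = np.vstack([rng.standard_normal((150, 20)) - mu,
                       rng.standard_normal((150, 20)) + mu])
        y = np.repeat([0, 1], 150)
        idx = rng.permutation(300)
        train, hold = idx[:50], idx[50:]        # 50 training points, rest is holdout
        clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
        true_errs.append(1 - clf.score(X[hold], y[hold]))       # holdout "true" error
        cv_errs.append(1 - cross_val_score(LinearDiscriminantAnalysis(),
                                           X[train], y[train], cv=10).mean())
    # A least-squares fit of estimated on true error exhibits the weak
    # (often negative) regression discussed above.
    slope = np.polyfit(true_errs, cv_errs, 1)[0]
    print(f"regression slope of estimated on true error: {slope:.3f}")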

Lack of regression between true and estimated errors, and thus poor error estimation accuracy, is typical for model-free counting methods, particularly in the small-sample setting where they are actually employed. Hence, in difficult problems where little data is available, modeling assumptions are necessary, at least if one desires a reproducible study and a classifier capable of prediction.

Put another way, to improve classifier performance it becomes crucial to take full advantage of expert knowledge and available data by integrating them into a unified model-based framework and applying optimization, rather than relying on data-driven algorithms that carry few or no small-sample performance guarantees.

Given the necessity of modeling assumptions, my recent work takes a Bayesian approach. Expert knowledge is provided in the form of a prior over an uncertainty class of potential underlying feature-label distributions. The prior is updated to a posterior upon observing data, and, as illustrated to the right, the posterior concentrates on the true distribution as more data are observed.
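As a toy illustration of this prior-to-posterior mechanism, the sketch below uses an assumed one-dimensional Gaussian model with known variance and an unknown class mean (not a model from the publications); the standard conjugate update shows the posterior concentrating as the sample grows.

    # Illustrative Normal-Normal conjugate update (assumed toy model).
    import numpy as np

    rng = np.random.default_rng(1)
    true_mean = 0.7
    m0, v0 = 0.0, 4.0          # prior mean and variance for the unknown class mean
    for n in (5, 20, 100, 1000):
        x = true_mean + rng.standard_normal(n)   # n observations, known variance 1
        v = 1.0 / (1.0 / v0 + n)                 # posterior variance
        m = v * (m0 / v0 + x.sum())              # posterior mean
        print(f"n={n:4d}: posterior mean {m:.3f}, posterior std {np.sqrt(v):.3f}")
    # The posterior standard deviation shrinks toward zero, i.e., the posterior
    # collapses onto the true mean as more data are observed.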

The posterior provides a mathematical foundation for analyzing and optimizing classifier design and performance evaluation. My work has shown that both optimal classifier error estimation (the MMSE estimate of the true misclassification rate) and optimal classification (minimization of the expected misclassification rate) fall out of this theory, each conditioned on the available training data. We call the MMSE error estimate the Bayesian error estimator (BEE) and the minimum-expected-error classifier the Optimal Bayesian Classifier (OBC).
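In symbols (notation chosen here for exposition; the papers' own notation may differ), with theta the unknown feature-label distribution, S_n the training sample, psi a classifier, and epsilon(theta, psi) its true misclassification rate under theta, the two quantities are

    \widehat{\varepsilon}_{\mathrm{BEE}}(\psi) = \mathrm{E}_{\theta}\left[\, \varepsilon(\theta,\psi) \mid S_n \,\right],
    \qquad
    \psi_{\mathrm{OBC}} = \arg\min_{\psi} \, \mathrm{E}_{\theta}\left[\, \varepsilon(\theta,\psi) \mid S_n \,\right].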

The BEE is a training-data error estimator like cross-validation and bootstrap; however, it is computed not by resampling and counting errors on surrogate classifiers, but by solving an optimization problem relative to the actual classifier in hand. The BEE is also unbiased, and while not necessarily optimal for any fixed distribution, it is optimal on average across the posterior. In the plots on the left, we observe that the BEE tends to perform best in moderately difficult classification problems, precisely where error estimation is needed most, whereas classical error estimators tend to perform well only when classification is easy.

An example OBC is shown on the right for a Gaussian model. The dotted lines are level curves of the Gaussian class-conditional densities corresponding to the expected mean and covariance under a given posterior. In this example, the expected covariances of the two classes are equal, and the black solid line is a plug-in linear classifier, the Bayes classifier for the expected mean and covariance parameters. The red dotted line is the OBC: note that although we assume a Gaussian model, the decision boundary is in general polynomial rather than linear or quadratic. This tends to happen, for example, when the posterior reflects more certainty in the parameters of one class than in those of the other.

Both the BEE for a given classifier and the OBC can be found efficiently via effective densities, which are the class-conditional densities averaged over all members of the uncertainty class, weighted by the posterior. Essentially, one may treat the effective density as if it were known to be the actual density and evaluate the classical true error of a given classifier (to find the BEE) or derive the Bayes classifier (to find the OBC). For instance, when the densities are known to be Gaussian, under a conjugate prior the effective densities are heavier-tailed multivariate Student-t distributions.
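The sketch below illustrates this idea under assumed conditions: the posterior hyperparameters are hypothetical, and the parameter mapping is the standard normal-inverse-Wishart posterior-predictive formula. Each class's effective density is a multivariate Student-t, and the OBC assigns a point to the class with the larger weighted effective density.

    # Illustrative sketch of classification via effective densities.
    import numpy as np
    from scipy.stats import multivariate_t

    def effective_density(x, m, kappa, nu, Psi):
        """Student-t effective density for a normal-inverse-Wishart posterior
        with hyperparameters (m, kappa, nu, Psi), via the standard
        posterior-predictive parameter mapping."""
        d = len(m)
        df = nu - d + 1
        shape = (kappa + 1) / (kappa * df) * Psi
        return multivariate_t.pdf(x, loc=m, shape=shape, df=df)

    # hypothetical posterior hyperparameters for two classes (2 features);
    # class 1 is deliberately given a less certain posterior than class 0
    post0 = dict(m=np.array([0.0, 0.0]), kappa=12.0, nu=15.0, Psi=np.eye(2) * 10)
    post1 = dict(m=np.array([1.0, 1.0]), kappa=4.0,  nu=6.0,  Psi=np.eye(2) * 4)
    c0, c1 = 0.5, 0.5                     # expected class prior probabilities

    x = np.array([0.8, 0.2])
    label = int(c1 * effective_density(x, **post1) >
                c0 * effective_density(x, **post0))
    print("OBC label for x:", label)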

Perhaps more importantly, it becomes possible to evaluate, in practice, the accuracy of any error estimate conditioned on the actual training data and classifier in hand, which is not possible in a model-free approach. This is done via the sample-conditioned MSE of a classifier error estimate. Given a sample and a designed classifier, one can report not only the BEE, an optimal estimate of the misclassification rate, but also the conditional MSE, an indication of the accuracy of any error estimator. The figure to the left represents a simulation drawing random distributions from a specified prior and random samples from these distributions. From each sample we train a classifier, obtain an error estimate, and evaluate the conditional MSE. The graph is essentially a histogram of the conditional MSE values across samples. We observe that the sample conditions uncertainty to a great extent, in the sense that these conditional MSE values vary widely from sample to sample.
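For orientation, and in the notation used above, the sample-conditioned MSE of an arbitrary error estimate admits the standard decomposition into the posterior variance of the true error plus the squared deviation of the estimate from the BEE:

    \mathrm{MSE}(\widehat{\varepsilon} \mid S_n)
    = \mathrm{E}_{\theta}\left[ \left( \varepsilon(\theta,\psi) - \widehat{\varepsilon} \right)^2 \mid S_n \right]
    = \mathrm{Var}_{\theta}\left( \varepsilon(\theta,\psi) \mid S_n \right)
      + \left( \widehat{\varepsilon}_{\mathrm{BEE}}(\psi) - \widehat{\varepsilon} \right)^2,

so the BEE minimizes the conditional MSE, and the residual posterior variance indicates how accurate any estimate can possibly be for the sample in hand.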

One application is a censored sampling technique in which sample points are acquired one at a time until the BEE and conditional MSE reach desired stopping criteria. An illustration is shown to the right. This approach leads to classifiers with more reliable performance, often with a simultaneous savings in the number of sample points needed. In essence, we trade variance in performance (RMS) at a fixed sample size for variance in sample size at a fixed level of performance.
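The schematic loop below conveys the structure of such a procedure; the function names are placeholders for a model-specific implementation and the thresholds are illustrative, not values from the publications.

    # Schematic censored-sampling loop (placeholder functions, illustrative thresholds).
    def censored_sampling(acquire_point, train_classifier, bee, conditional_mse,
                          max_error=0.15, max_rmse=0.03, max_points=200):
        sample = []
        while len(sample) < max_points:
            sample.append(acquire_point())       # acquire one labeled point
            clf = train_classifier(sample)       # design classifier on data so far
            err_hat = bee(sample, clf)           # optimal (MMSE) error estimate
            rmse = conditional_mse(sample, clf, err_hat) ** 0.5
            # stop once the estimated error and its sample-conditioned accuracy
            # are both acceptable
            if err_hat <= max_error and rmse <= max_rmse:
                break
        return clf, err_hat, rmse, len(sample)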

Frequentist asymptotics can also be addressed in the theory. Thanks to convergence of the posteriors to the true distributions, we have shown that under mild conditions in multinomial and Gaussian models, the BEE is a consistent estimator of the true error, the conditional MSE converges to zero, and the OBC converges pointwise to the Bayes classifier.

Bayesian frameworks make predictive small-sample classifier design and analysis possible by incorporating scientific knowledge and facilitating optimization. Under such frameworks, the Bayesian error estimator, optimal Bayesian classifier, and sample conditioned MSE are all optimal solutions to important classifier design and analysis questions, each relative to the precise sample, classifier and error estimate in hand.

[1] E. R. Dougherty, C. Sima, J. Hua, B. Hanczar, and U. M. Braga-Neto, "Performance of Error Estimators for Classification," Curr. Bioinf., vol. 5, pp. 53-67, 2010.


Relevant Publications

Bayesian MMSE Error Estimation

Journal Publications
  • Optimal MSE calibration of classifier error estimators under Bayesian models
    L. A. Dalton and E. R. Dougherty, Pattern Recognition, vol. 45, no. 6, pp. 2308-2320, June 2012.
    [link]
  • Application of the Bayesian MMSE Estimator for Classification Error to Gene-Expression Microarray Data
    L. A. Dalton and E. R. Dougherty, Bioinformatics, vol. 27, no. 13, pp. 1822-1831, May 2011.
    [link] [supplementary material]
  • Bayesian Minimum Mean-Square Error Estimation for Classification Error--Part II: The Bayesian MMSE Error Estimator for Linear Classification of Gaussian Distributions
    L. A. Dalton and E. R. Dougherty, IEEE Transactions on Signal Processing, vol. 59, no. 1, pp. 130-144, January 2011.
    [link]
  • Bayesian Minimum Mean-Square Error Estimation for Classification Error--Part I: Definition and the Bayesian MMSE Error Estimator for Discrete Classification
    L. A. Dalton and E. R. Dougherty, IEEE Transactions on Signal Processing, vol. 59, no. 1, pp. 115-129, January 2011.
    [link]
Conference Publications
  • Bayesian MMSE Estimation of Classification Error and Performance on Real Genomic Data
    L. A. Dalton and E. R. Dougherty, in Proceedings of the 9th IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '10), Cold Spring Harbor, NY, November 2010.
    [link]

Optimal Bayesian Classification

Journal Publications
  • Optimal Classifiers with Minimum Expected Error within a Bayesian Framework--Part II: Properties and Performance Analysis
    L. A. Dalton and E. R. Dougherty, Pattern Recognition, vol. 46, no. 5, pp. 1288-1300, May 2013.
    [link] [supplementary material]
  • Optimal Classifiers with Minimum Expected Error within a Bayesian Framework--Part I: Discrete and Gaussian Models
    L. A. Dalton and E. R. Dougherty, Pattern Recognition, vol. 46, no. 5, pp. 1301-1314, May 2013.
    [link]
Conference Publications
  • Classification with Minimum Expected Error Over an Uncertainty Class of Gaussian Distributions
    L. A. Dalton, in Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver, Canada, May 2013.
    [link]
  • Optimal Bayesian Classification and Its Application to Gene Regulatory Networks
    L. A. Dalton and E. R. Dougherty, in Proceedings of the 2012 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '12), Washington, DC, December 2012.
    [link]
  • Optimal Classifiers Within a Bayesian Framework
    L. A. Dalton and E. R. Dougherty, in Proceedings of the 2012 IEEE Statistical Signal Processing Workshop (SSP '12), Ann Arbor, MI, August 2012.
    [link]

Sample-Conditioned MSE

Journal Publications
  • Exact Sample Conditioned MSE Performance of the Bayesian MMSE Estimator for Classification Error--Part II: Consistency and Performance Analysis
    L. A. Dalton and E. R. Dougherty, IEEE Transactions on Signal Processing, vol. 60, no. 5, pp. 2588-2603, May 2012.
    [link]
  • Exact Sample Conditioned MSE Performance of the Bayesian MMSE Estimator for Classification Error--Part I: Representation
    L. A. Dalton and E. R. Dougherty, IEEE Transactions on Signal Processing, vol. 60, no. 5, pp. 2575-2587, May 2012.
    [link]
Conference Publications
  • Application of the Sample-conditioned MSE to Non-linear Classification and Censored Sampling
    L. A. Dalton, in Proceedings of the 21st European Signal Processing Conference (EUSIPCO 2013), Marrakech, Morocco, September 2013. (invited paper)
  • A Novel Censored Sampling Paradigm for Genomic Data Classification
    L. A. Dalton and E. R. Dougherty, in Proceedings of the 9th International Workshop on Computational Systems Biology (WCSB 2012), Ulm, Germany, June 2012.
    [pdf]
  • Classifier Error Estimator Performance in a Bayesian Context
    L. A. Dalton and E. R. Dougherty, in Proceedings of the 2011 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '11), San Antonio, TX, December 2011.
    [link]
  • Exact MSE performance of the Bayesian MMSE Estimator for Classification Error
    L. A. Dalton and E. R. Dougherty, in Proceedings of the 2011 Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2011.
    [link]