Informatics-based discovery can be greatly accelerated by integrating existing scientific theory with big data, particularly in high-dimensional settings where sample points are expensive or difficult to acquire.
Consider a difficult problem where one is given a small dataset from which to train a classifier. The scientific content of any model is characterized by its predictive capacity, and in the case of classification, predictive capacity is quantified by misclassification rate. Given uncertainty in the true underlying model, the error rate of a classifier is typically assessed via error estimation methods, but an error estimator alone does not constitute scientific knowledge--without addressing its accuracy, an error estimate is simply a computation without meaning.
When there are few samples and none can be spared as a holdout for validation, one must turn to training-data error estimators, which are typically based on resampling and counting methods. The simulation on the right shows scatter plots of true and estimated errors from [1], conducted on a real breast cancer dataset. For each point in these figures, 50 training points are used with t-test feature selection and LDA classification, followed by (left to right in the figure) leave-one-out, 10-fold cross-validation, and bootstrap error estimation. The remaining 245 points serve as a holdout set to approximate the true error. We observe that the least-squares regression line is not only virtually parallel to the horizontal axis, but in fact carries a slight negative slope--the true and estimated errors are negatively correlated!
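The counting character of such estimators can be sketched with a minimal leave-one-out loop. The nearest-mean classifier and synthetic 1-D Gaussian data below are deliberate simplifications standing in for the LDA-with-feature-selection pipeline of [1]:

```python
import random

# Minimal sketch of a resampling/counting error estimator. The
# nearest-mean classifier and 1-D Gaussian data are assumed stand-ins
# for the LDA pipeline described in the text.
def nearest_mean_classifier(xs, ys):
    # Fit the two class means; classify a point by the nearer mean.
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return lambda x: 0 if abs(x - m0) <= abs(x - m1) else 1

def loo_error(xs, ys):
    # Leave-one-out: hold out each point, train on the rest, count errors.
    errors = 0
    for i in range(len(xs)):
        clf = nearest_mean_classifier(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors += clf(xs[i]) != ys[i]
    return errors / len(xs)

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(25)] + [random.gauss(1, 1) for _ in range(25)]
ys = [0] * 25 + [1] * 25
est = loo_error(xs, ys)
print(round(est, 2))
```

A bootstrap estimator would instead resample the training set with replacement and count errors on the points left out of each resample; in both cases the estimate is a pure count, with no model of the underlying distribution.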
This lack of regression between true and estimated errors, and thus poor error estimation accuracy, is typical of model-free counting methods, particularly in the small-sample settings where they are actually employed. Hence, in difficult problems where little data is available, modeling assumptions are necessary, at least if one desires a reproducible study and a classifier capable of prediction.
Put another way, to improve classifier performance it becomes crucial to take full advantage of expert knowledge and available data by integrating them into a unified model-based framework and applying optimization, rather than relying on data-driven algorithms with few or no small-sample performance guarantees.
Given the necessity of modeling assumptions, my recent work takes a Bayesian approach. Expert knowledge is provided in the form of a prior over an uncertainty class of potential underlying feature-label distributions. The prior is updated to a posterior upon observing data, and, as illustrated to the right, the posterior converges to singularity on the true distribution as more data is observed.
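This convergence can be illustrated with a minimal conjugate example; the Beta-Bernoulli model below is an assumed stand-in for the feature-label distribution models in my work, with invented prior pseudo-counts:

```python
import random

# Sketch of posterior concentration in a conjugate Beta-Bernoulli model
# (an assumed stand-in for the uncertainty class in the text): the prior
# Beta(a, b) updates to Beta(a + k, b + n - k) after observing k
# successes in n draws, and its variance shrinks toward zero.
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

random.seed(1)
a, b = 2.0, 2.0          # prior pseudo-counts (hypothetical)
p_true = 0.3             # underlying parameter, unknown to the model
variances = []
for n in [10, 100, 1000]:
    k = sum(random.random() < p_true for _ in range(n))
    variances.append(beta_var(a + k, b + n - k))
print([round(v, 5) for v in variances])
```

As n grows the posterior variance collapses, i.e. the posterior concentrates on the true parameter, mirroring the convergence to singularity described above.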
The posterior provides a mathematical foundation for analyzing and optimizing classifier design and performance evaluation. My work has shown that optimal classifier error estimation (the MMSE estimate of the true misclassification rate) and optimal classification (minimizing the expected misclassification rate) fall out of the theory, both conditioned on the available training data. We call the MMSE error estimate the Bayesian error estimator (BEE) and the minimal-expected-error classifier the optimal Bayesian classifier (OBC).
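As a toy illustration (not the general construction), consider a single binary feature with equal class priors, where each class-conditional probability P(X = 1 | y) carries a Beta posterior. The OBC then reduces to the Bayes classifier for the posterior-mean class-conditional probabilities; the numbers below are invented for the sketch:

```python
# Toy OBC for one binary feature (assumed setup): assign x to the class
# with the larger expected probability mass c_y * f_y(x), where p_y is
# the posterior mean of P(X = 1 | class y).
def obc(c0, p0, c1, p1):
    def psi(x):
        mass0 = c0 * (p0 if x == 1 else 1 - p0)
        mass1 = c1 * (p1 if x == 1 else 1 - p1)
        return 0 if mass0 >= mass1 else 1
    return psi

# Hypothetical posterior means 0.25 and 0.75, equal class priors.
psi = obc(0.5, 0.25, 0.5, 0.75)
print(psi(0), psi(1))
```

The resulting rule assigns X = 0 to class 0 and X = 1 to class 1, as the posterior means suggest.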
The BEE is a training-data error estimator like cross-validation and bootstrap; however, it is not evaluated by resampling and counting errors on surrogate classifiers, but rather by solving an optimization problem relative to the actual classifier in hand. The BEE is also unbiased, and while not necessarily optimal for a fixed distribution, it is optimal on average across the posterior. In the plots on the left, we observe that the BEE tends to perform best in moderately difficult classification problems, precisely where error estimation is needed most, whereas classical error estimators tend to perform well only when classification is easy.
An example OBC is shown on the right for a Gaussian model. The dotted lines are level curves for the Gaussian class-conditional densities corresponding to the expected mean and covariance for a given posterior. In this example, the expected covariances for both classes are the same, and the black solid line is a plug-in linear classifier corresponding to the Bayes classifier for the expected mean and covariance parameters. The red dotted line is the OBC--note that although we assume a Gaussian model, the decision boundary is not quadratic but in general polynomial. This tends to happen, for example, when the posterior reflects more certainty in the parameters of one class relative to the other.
Both the BEE for a given classifier and the OBC can be found efficiently via effective densities, which are the average class-conditional densities over all members of the uncertainty class weighted by the posterior. Essentially, one may treat the effective density as if it were known to be the actual density and evaluate the classical true error of a given classifier (to find the BEE) or the Bayes classifier (to find the OBC). For instance, when the densities are known to be Gaussian, under a conjugate prior the effective densities are heavier-tailed multivariate student-t distributions.
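For a concrete (and deliberately simple) case, take one binary feature with a uniform Beta(1, 1) prior on each class-conditional probability: the effective density is just the posterior mean, and the BEE of a fixed classifier is its classical error evaluated at these means. All counts below are hypothetical:

```python
# BEE via effective densities in a Beta-Bernoulli toy model (assumed).
def posterior_mean(a, b, ones, n):
    # Beta(a, b) prior with `ones` successes in n draws -> posterior mean.
    return (a + ones) / (a + b + n)

# Hypothetical data: class 0 saw 2 ones in 10 points, class 1 saw 8 in 10.
p0_eff = posterior_mean(1, 1, 2, 10)
p1_eff = posterior_mean(1, 1, 8, 10)

# Classifier psi(x) = x with equal class priors:
# error = 0.5 * P0(X = 1) + 0.5 * P1(X = 0), at the effective densities.
bee = 0.5 * p0_eff + 0.5 * (1 - p1_eff)
print(round(bee, 3))
```

In richer models the same recipe applies with the heavier-tailed effective densities mentioned above in place of the posterior means.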
Perhaps more importantly, it becomes possible to practically evaluate the accuracy of any error estimate conditioned on the actual training data and classifier in hand, which is not possible in a model-free approach. This is done via the sample-conditioned MSE of a classifier error estimate. Given a sample and a designed classifier, one can report not only the BEE, an optimal estimate of the misclassification rate, but also the conditional MSE, an indication of the accuracy of any error estimator. The figure to the left presents a simulation drawing random distributions from a specified prior and random samples from these distributions. From each sample we train a classifier, obtain an error estimate, and evaluate the conditional MSE. The graph is essentially a histogram of the conditional RMS (root of the conditional MSE) values from each sample. We observe that samples condition uncertainty to a great extent, in the sense that there can be high variance in these conditional RMS values.
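In the same hypothetical Beta-Bernoulli toy model, the sample-conditioned MSE even has a closed form: for the classifier psi(x) = x with equal class priors, the true error is 0.5*p0 + 0.5*(1 - p1), so the conditional MSE of the BEE is the posterior variance of that quantity:

```python
# Sample-conditioned MSE of the BEE in a Beta-Bernoulli toy model
# (assumed): with independent posteriors Beta(3, 9) and Beta(9, 3),
# MSE = 0.25 * Var(p0) + 0.25 * Var(p1).
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

cmse = 0.25 * beta_var(3, 9) + 0.25 * beta_var(9, 3)
rms = cmse ** 0.5
print(round(rms, 4))
```

Different samples yield different posteriors and hence different conditional RMS values, which is exactly the sample-to-sample variance the histogram displays.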
One application is a censored sampling technique in which sample points are acquired one at a time until the BEE and conditional MSE reach desired stopping criteria. An illustration is shown to the right. This approach leads to classifiers with more reliable performance, often with a simultaneous savings in the number of sample points needed. We essentially trade a fixed sample size with variable performance (RMS) for a fixed performance with a variable sample size.
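A minimal sketch of such a stopping rule, again in the hypothetical Beta-Bernoulli model: points arrive one at a time, the posteriors update, and sampling halts once the conditional RMS drops below a target.

```python
import random

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Censored sampling sketch (assumed Beta-Bernoulli model): acquire points
# until the conditional RMS of the BEE falls below a target.
random.seed(2)
target_rms = 0.05
a0 = b0 = a1 = b1 = 1.0          # uniform Beta(1, 1) priors for both classes
p0_true, p1_true = 0.2, 0.8      # underlying parameters, unknown to the rule
n = 0
rms = float("inf")
while rms >= target_rms:
    y = random.randint(0, 1)     # draw a point from a random class
    x = random.random() < (p1_true if y else p0_true)
    if y == 0:
        a0, b0 = a0 + x, b0 + (not x)
    else:
        a1, b1 = a1 + x, b1 + (not x)
    n += 1
    rms = (0.25 * beta_var(a0, b0) + 0.25 * beta_var(a1, b1)) ** 0.5
print(n, round(rms, 3))
```

The stopping sample size n is random, but the reported accuracy is guaranteed to meet the target, which is the trade described above.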
Frequentist asymptotics can also be addressed in the theory. Thanks to convergence of the posteriors to the true distributions, we have shown that under mild conditions in multinomial and Gaussian models, the BEE is a consistent estimator of the true error, the conditional MSE converges to zero, and the OBC converges pointwise to the Bayes classifier.
Bayesian frameworks make predictive small-sample classifier design and analysis possible by incorporating scientific knowledge and facilitating optimization. Under such frameworks, the Bayesian error estimator, optimal Bayesian classifier, and sample-conditioned MSE are all optimal solutions to important classifier design and analysis questions, each relative to the precise sample, classifier, and error estimate in hand.
[1] E. R. Dougherty, C. Sima, J. Hua, B. Hanczar, and U. M. Braga-Neto, "Performance of Error Estimators for Classification," Curr. Bioinf., vol. 5, pp. 53-67, 2010.