The binary classification problem. A. The basic structure of a multivariate binary classification problem involves projecting an n × p data matrix X (n = number of samples, p = number of variables), via some mapping (transformation) function, onto a one-dimensional vector of predictions ŷ (1 × n). This function may be a simple weighted linear sum (as in Multiple Linear Regression), a two-layer function requiring the estimation of a set of Latent Variables (e.g. PLS-DA), or a multi-layer or tree-based non-linear transformation (e.g. a multi-layer perceptron, or a parse tree as commonly used in Genetic Programming). The optimisation/selection of this mapping function is based on a priori information about the class membership of each of the n samples, y (usually encoded as dummy values, ‘control’ = ‘0’ and ‘case’ = ‘1’). The utility of a classifier is then assessed using the a priori class membership of a second, independent set of data (a test set). If the classifier produces a binary output it can be assessed in the form of a confusion matrix, whereas if it produces a continuous output it can be assessed in the same way as a single variable subjected to a univariate significance test. B. The so-called confusion matrix describing the outcome of predictive models, which cross-tabulates the observed and predicted +/− (or 1/0) patterns in a binary classification problem. If, in a binary prediction model, we label the two classes as 1 (cases) and 0 (controls), treating the cases as ‘positive’, there are two possible prediction errors: false positives (FP) and false negatives (FN). There are also, happily, true positives (TP) and true negatives (TN) that are correctly predicted by the model. B is adapted and extended from http://asio.jde.aca.mmu.ac.uk/multivar/da4.htm, which also contains other information, and is derived from (Fielding and Bell, 1997).
This basic structure is entirely general (figure 1A), and is well described in textbooks such as (Duda et al., 2001) and (Hastie et al., 2001).
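As a minimal sketch of this structure, assuming Python with scikit-learn and using PLSRegression as a simple stand-in for a PLS-DA classifier (the data and variable names here are purely illustrative, not taken from the article), the mapping from X and the known labels y to a vector of predictions ŷ on an independent test set might look as follows:

```python
# Sketch of the structure in figure 1A: a mapping from the n x p matrix X to a
# one-dimensional vector of predictions, optimised against known class labels
# and then assessed on an independent test set.
# Assumes scikit-learn; PLSRegression is used as a stand-in for PLS-DA.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 100, 20                       # n samples, p variables (illustrative sizes)
X = rng.normal(size=(n, p))          # data matrix X (n x p)
y = rng.integers(0, 2, size=n)       # a priori labels: 'control' = 0, 'case' = 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = PLSRegression(n_components=2)               # two latent variables
model.fit(X_train, y_train)                         # optimise the mapping on training data
y_hat_continuous = model.predict(X_test).ravel()    # continuous output on the test set
y_hat_binary = (y_hat_continuous > 0.5).astype(int) # binary output for a confusion matrix
```

Thresholding the continuous PLS output at 0.5 is just one simple way of obtaining the binary predictions needed for a confusion matrix; other thresholds or decision rules are of course possible.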
Since there are two possible classes, the outcome of any predictions relative to the ‘true’ class membership is usually set out as a 2 × 2 matrix, the so-called confusion matrix (figure 1B), consisting of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
A number of metrics can be derived from the confusion matrix of figure 1, where N is the total number of samples and a, b, c and d refer to the counts (rather than percentages) in its four cells.
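A hedged sketch of how such metrics follow from the confusion matrix, continuing the illustrative example above (the assignment of a, b, c and d to particular cells is an assumption here, not taken from the article's table), might be:

```python
# Standard metrics derivable from the 2x2 confusion matrix.
# Assumed cell naming (not from the article): a = TP, b = FP, c = FN, d = TN.
from sklearn.metrics import confusion_matrix

# y_test and y_hat_binary come from the previous sketch
tn, fp, fn, tp = confusion_matrix(y_test, y_hat_binary, labels=[0, 1]).ravel()
N = tn + fp + fn + tp                 # total number of test samples

accuracy    = (tp + tn) / N           # fraction of all predictions that are correct
sensitivity = tp / (tp + fn)          # true positive rate (recall)
specificity = tn / (tn + fp)          # true negative rate
precision   = tp / (tp + fp)          # positive predictive value
```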
Some methods such as Principal Components Analysis (Jolliffe, 1986) and a variety of clustering methods (Everitt, 1993; Handl et al., 2005) use only the X-data as defined in figure 1 and are fundamentally designed for what Tukey called Exploratory Data Analysis (Tukey, 1977).
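By contrast with the supervised classifier sketched earlier, an exploratory method such as PCA uses only the X matrix and never sees the class labels y. A minimal sketch, again assuming scikit-learn and reusing the illustrative X from above:

```python
# Exploratory (unsupervised) analysis: PCA operates on X alone,
# with no reference to the class labels y.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
scores = pca.fit_transform(X)             # n x 2 matrix of PC scores for plotting
print(pca.explained_variance_ratio_)      # variance explained by PC1 and PC2
```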
Either way, the final model can be expressed in terms of a multivariate classifier as defined in figure 1A.