1 Introduction

Q2 is defined as one minus the ratio of the prediction error sum of squares (PRESS) to the total sum of squares (TSS) of the response vector y (Cruciani et al. 1992). When the PLS method was introduced for classification, the Q2 parameter survived as a measure of class prediction ability, and today it is regularly used to validate discrimination models such as PLSDA (Lutz et al. 2006; Wiklund et al. 2008). One problem with the Q2 parameter is that it is unclear which Q2 value corresponds to a good discrimination model. Therefore, the Q2 value can be compared to a distribution of Q2 values obtained from models of the same data with randomly permuted class labels (Lindgren et al. 1996; Westerhuis et al. 2008). In this way, a statistical significance (P-value) can be obtained for a given discrimination model.

In PLSDA, the response vector y consists of class labels −1 and 1 (or 0 and 1) for the two-class problem. When the class of new samples is predicted with a PLSDA model (e.g. in a cross-validation scheme), the squared prediction error is summed over all samples to give the PRESS value (see Eq. 1).

$$ {\text{PRESS}} = \sum\limits_{i} {\left( {y_{i} - \hat{y}_{i} } \right)}^{2} \tag{1} $$

Note that here \( \hat{y}_{i} \) represents a real prediction in which sample i was not used in the model building process. The Q2 value is then calculated as

$$ Q^{2} = 1 - \frac{{\sum\limits_{i} {\left( {y_{i} - \hat{y}_{i} } \right)}^{2} }}{{\sum\limits_{i} {\left( {y_{i} - \bar{y} } \right)}^{2} }} = 1 - \frac{\text{PRESS}}{\text{TSS}} $$

Here the total sum of squares (TSS) is a constant and the PRESS should quantify how well the samples are classified. When the prediction for a sample is close to the discrimination border (0 for the −1/1 coding, 0.5 for the 0/1 coding), the PRESS value increases because the sample is almost misclassified. This seems a reasonable approach. However, when a sample of class 1 receives a class prediction of +1.5, it is not at all close to the discrimination border, yet the PRESS still increases, although the prediction obviously corresponds to a correct class assignment. As this is counterintuitive, we developed the discriminant Q2 (DQ2) statistic, in which the prediction error is disregarded when the class prediction lies beyond the class label.

$$ \begin{gathered} \mathop {\text{PRESSD}}\limits_{{{\text{Class}}1}} = \sum\limits_{{\hat{y}_{i}<1}} {\left( {y_{i} - \hat{y}_{i} } \right)}^{2} \hfill \\ \mathop {\text{PRESSD}}\limits_{{{\text{Class}}-1}} = \sum\limits_{{\hat{y}_{i}>-1}} {\left( {y_{i} - \hat{y}_{i} } \right)}^{2} \hfill \\ \end{gathered} $$

Thus, when the prediction is above 1 for a class 1 sample or below −1 for a class −1 sample, the prediction error for that sample is ignored. This is illustrated in Fig. 1, where the red curve represents the prediction error for class 1 samples and the blue curve the prediction error for class −1 samples. Q2 penalizes a class prediction of 2 for a class 1 sample in the same way as a class prediction of 0 (which would mean a misclassification), since both give a squared error of 1. In the DQ2 statistic, the error of a prediction of 2 for a class 1 sample is disregarded. With PRESSD taken as the sum of the two class-wise terms above, DQ2 is then defined as

$$ {\text{DQ}}^{2} = 1 - \frac{\text{PRESSD}}{\text{TSS}} $$

Fig. 1 Representation of the prediction error in Q2 and DQ2. The blue curve represents the prediction error for class −1 samples and the red curve represents the prediction error for class 1 samples.

The idea of the DQ2 statistic is related to other discrimination methods such as logistic regression or support vector machines (SVM), in which samples close to the discriminating line are more important for the model than samples far away from that line.
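
The difference between the two statistics can be made concrete with a minimal sketch. It assumes the −1/1 class coding and treats `y_hat` as cross-validated predictions; the function name `q2_and_dq2` and the example predictions are invented for illustration and are not taken from the study.

```python
def q2_and_dq2(y, y_hat):
    """Return (Q2, DQ2) for labels coded -1/+1 and cross-validated predictions."""
    y_mean = sum(y) / len(y)
    tss = sum((yi - y_mean) ** 2 for yi in y)

    press = 0.0   # all squared errors, as in Eq. 1
    pressd = 0.0  # only errors of predictions that are not beyond the class label
    for yi, pi in zip(y, y_hat):
        err = (yi - pi) ** 2
        press += err
        # "Beyond the label": at or above +1 for class 1, at or below -1 for class -1.
        beyond_label = pi >= 1 if yi > 0 else pi <= -1
        if not beyond_label:
            pressd += err

    return 1 - press / tss, 1 - pressd / tss


# Hypothetical predictions: every sample lies on the correct side of 0,
# but two predictions overshoot their class label.
y = [1, 1, 1, -1, -1, -1]
y_hat = [1.5, 0.8, 1.0, -2.0, -0.5, -1.0]
q2, dq2 = q2_and_dq2(y, y_hat)
```

For these invented predictions the two overshooting samples lower Q2 but leave DQ2 untouched, so DQ2 comes out higher than Q2 although the class assignments are identical.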

2 Experimental

1D 1H NOESY NMR spectra were obtained of urine samples from 28 male and female human subjects aged 35–75 years who were mildly hypertensive (systolic blood pressure: 130–179 mmHg; diastolic blood pressure: <100 mmHg). An exponential window function with a line-broadening factor of 0.5 Hz was applied to the free induction decay (FID) prior to Fourier transformation. The Fourier-transformed NMR data were manually phase and baseline corrected and calibrated against the TSP reference resonance at δ 0.0 ppm. The NMR spectra were subdivided into 550 discrete regions (‘buckets’) of equal width (0.02 ppm), whose integrals were determined using AMIX (Analysis of Mixtures, Bruker GmbH, Germany). The spectral region δ 4.3–5.2 ppm was excluded from the data set to avoid interference from residual water. The urine profiles were normalized to the integral of the creatinine methyl peak at δ 3.05–3.10 ppm.

Monte Carlo simulations were performed by adding a predefined effect to the spectra of 14 randomly selected volunteers, whereas no effect was added for the other 14 individuals. PLSDA was used to discriminate between the two groups. Model performance was assessed by cross model validation (Anderssen et al. 2006), sometimes called double cross validation (Smit et al. 2007). In each double cross validation the samples were divided into seven groups, and each double cross validation yields one (D)Q2 value. Because the (D)Q2 value depends strongly on the specific assignment of samples to the seven groups, 25 double cross validations were performed, each with a different distribution of the samples over the groups. The resulting 25 (D)Q2 values were averaged. Then 2,000 permutations were performed in which the class labels were randomly permuted, and for each permutation the average (D)Q2 was computed in the same way. The P-value is the number of permutations whose average (D)Q2 is larger than the average (D)Q2 of the original labeling, divided by 2,000. Finally, the whole procedure was repeated five times, each time with a different selection of the 14 individuals that received the treatment, giving five P-values. The average of these five P-values is used to compare Q2 and DQ2.
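
The permutation step of this procedure can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the helper `mean_difference` is a toy stand-in for the averaged (D)Q2 from 25 double cross validations of a PLSDA model, and the function names, seeds, and simulated data are all assumptions made for the example.

```python
import random


def permutation_p_value(statistic, X, y, n_perm=2000, seed=0):
    """Fraction of label permutations whose statistic is larger than the observed one."""
    rng = random.Random(seed)
    observed = statistic(X, y)
    labels = list(y)
    n_larger = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # randomly permute the class labels
        if statistic(X, labels) > observed:
            n_larger += 1
    return n_larger / n_perm


def mean_difference(X, y):
    """Toy stand-in for the averaged (D)Q2: class-mean difference of one variable."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Hypothetical data: 14 'treated' samples shifted upward, 14 untreated samples.
data_rng = random.Random(1)
y = [1] * 14 + [-1] * 14
X = [data_rng.gauss(0, 1) + (3.0 if label == 1 else 0.0) for label in y]
p_value = permutation_p_value(mean_difference, X, y, n_perm=2000)
```

To follow the described procedure more closely, `statistic` would be replaced by a function that fits the PLSDA model under double cross validation and returns the average Q2 or DQ2, and the resulting P-value would be averaged over the five selections of treated individuals.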

Two types of effect were added to the 14 selected individuals (see Fig. 2): a univariate effect in which a single NMR peak changed (effect 1) and a multivariate effect (effect 2).

Fig. 2 NMR profiles of urine of 28 healthy individuals with two simulated effects

3 Results

Figure 3 shows the statistical significance of the PLSDA discrimination model between the 14 ‘treated’ and the 14 ‘non-treated’ individuals. In the top panel, the P-values decrease from 0.45 to 0 with increasing effect size. The X-axis represents how much of the effect was added to the spectrum: the treated spectrum was calculated by adding the effect, multiplied by the value on the X-axis, to the untreated spectrum. Note that in Fig. 2, effect 1 was multiplied by 20 and effect 2 by 50 to make them visible. The effects actually added are thus much smaller than those shown in Fig. 2.
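
The construction of a treated spectrum can be written compactly. The symbols are introduced here for illustration only and do not appear in the original text: \( \mathbf{x}_{\text{untreated}} \) is the measured spectrum, \( \mathbf{e} \) is the simulated effect profile (effect 1 or effect 2), and \( c \) is the effect size read from the X-axis of Fig. 3.

$$ \mathbf{x}_{\text{treated}} = \mathbf{x}_{\text{untreated}} + c\,\mathbf{e} $$

With \( c = 0 \) the treated and untreated spectra coincide, which corresponds to the left end of the X-axis.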

Fig. 3 Statistical significance profiles for PLSDA discrimination model when DQ2 or Q2 are used for a univariate effect (top) and a multivariate effect (bottom)

If an α = 0.05 significance limit is used to reject the null hypothesis of no effect, the effect size needs to be about 3.6 when Q2 is used, whereas an effect size of about 3.4 already leads to a significant discrimination model when DQ2 is used. For the multivariate effect (effect 2) in the bottom plot of Fig. 3, the difference between Q2 and DQ2 is even larger: a multivariate effect size of 2.8 gives a statistically significant model when DQ2 is used, whereas Q2 requires an effect size of 3.2.

4 Conclusion

In this paper the discriminant Q2 (DQ2) statistic is introduced as a replacement for the traditionally used Q2 value to represent class prediction ability. Rigorous Monte Carlo simulation shows that statistically significant discrimination models can be found for a smaller effect size when DQ2 is used than when the traditional Q2 is used. This is particularly beneficial in metabolomics-based discrimination problems, where the biological responses can be subtle and highly variable among individuals.