A first overview of the data (Data Overview) was obtained by comparing the densities of the metabolite concentration distributions for each of the 100 urine specimens of the classification dataset. Supplemental Fig. S1 shows the creatinine-adjusted intensity distributions. For comparison, the distribution of the Quantile-normalized data, which represents an average of the intensity distributions, is indicated in red. Roughly similar distributions were obtained for all specimens.
Next, we investigated for each normalization method whether comparable profiles were obtained for all samples of the classification dataset (Overall between-Sample Normalization Performance). To that end, all preprocessing methods were included, although the variable scaling and variance stabilization methods are not specifically designed to reduce between-sample variation. For all features, we calculated the pair-wise differences in intensity between spectra. We argue that if these differences do not scatter around zero, this is evidence that the concentrations of one spectrum of a pair are systematically over- or underestimated. To assess the performance of the methods, we calculated for each pair-wise comparison the ratio of the median of the differences to their inter-quartile range (IQR) and averaged the absolute values of these ratios across all pairs of samples (average median/IQR ratios). Dividing by the IQR ensures that the differences are assessed on comparable scales. The smaller the average median/IQR ratios, the better the global between-sample normalization performance of a method. The results for the classification dataset are shown in the first row of Table 1.
Table 1 Analysis of average inter- and intra-sample differences by means of interquartile ranges
Comparing the average median/IQR ratios, PQN (0.04), Quantile (0.06), Cyclic Loess (0.06), VSN (0.07), and Cubic Spline Normalization (0.07) reduced overall differences between samples best, compared to a ratio of 0.46 for the creatinine-normalized data alone. The other methods, except for Contrast and Li-Wong Normalization, all improved the comparability between samples, but did not perform as well as the methods mentioned above. Note that the two variable scaling methods performed similarly and were therefore summarized as one entry in Table 1. The good performance of VSN can be explained by the fact that it combines variance stabilization with between-sample normalization. In comparison to the creatinine-normalized data, Auto and Pareto Scaling also showed some improvement.
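Of the well-performing methods, Quantile Normalization has the most compact description (Bolstad et al. 2003): each sample's sorted intensities are replaced by the mean sorted intensities across all samples, so every sample ends up with an identical distribution. A minimal sketch, with tie handling simplified to original feature order:

```python
def quantile_normalize(spectra):
    """Quantile-normalize a list of equally long intensity vectors.

    Builds a reference distribution from the mean of the k-th smallest
    value over all samples, then maps each sample's ranks onto it.
    """
    n = len(spectra[0])
    sorted_cols = [sorted(s) for s in spectra]
    ref = [sum(col[k] for col in sorted_cols) / len(spectra)
           for k in range(n)]
    out = []
    for s in spectra:
        order = sorted(range(n), key=lambda i: s[i])  # feature ranks
        norm = [0.0] * n
        for rank, i in enumerate(order):
            norm[i] = ref[rank]
        out.append(norm)
    return out
```

After this transformation all samples share the same intensity distribution, which is why the average median/IQR ratio of Quantile-normalized data is close to zero.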
While good between-sample normalization is desirable, it should not be achieved at the cost of reducing the genuine biological signal in the data. We tested for this in the Latin-square data. By experimental design, all intensity fluctuations except those of the spiked-in metabolites are caused by measurement imprecision; the spike-ins should stand out in each pair-wise comparison of spectra. That is, spike-in features must be variable, while all other features should be constant. We assessed this quantitatively by calculating the IQR of the spike-in feature intensities and dividing it by the IQR of the non-spike-in feature intensities (average IQR ratios). These ratios are given in the second row of Table 1. High values indicate a good separation between spiked and non-spiked data points and are therefore favorable.
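For a single pair-wise comparison, this separation measure can be sketched as follows (an illustrative helper, not the paper's code; averaging over all pairs then gives the values in Table 1):

```python
def iqr(values):
    """Inter-quartile range via a simple sorted-position estimate."""
    s = sorted(values)
    n = len(s)
    return s[(3 * n) // 4] - s[n // 4]

def spike_separation(differences, is_spike):
    """Ratio of spike-in to background IQR for one spectrum pair.

    `differences` are per-feature intensity differences between the two
    spectra; `is_spike` flags the spiked-in features. High ratios mean
    the spike-in signal stands out clearly above measurement noise.
    """
    spike = [d for d, f in zip(differences, is_spike) if f]
    background = [d for d, f in zip(differences, is_spike) if not f]
    return iqr(spike) / iqr(background)
```

A normalization method that shrinks the spike-in differences toward the background noise, as the variable scaling methods do here, drives this ratio down.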
For the non-normalized data a ratio of 5.12 was obtained, i.e. the spike-in signal stood out clearly. Similar results were obtained for the PQN and Linear Baseline methods. For the Cyclic Loess, Quantile, Cubic Spline, Contrast, VSN, and Li-Wong approaches, the ratio was slightly reduced, demonstrating that normalization may affect the true signals to some extent. Nevertheless, the signal-to-noise ratios for these methods were still above 4 and the signals continued to stand out.
Importantly, Auto and Pareto Scaling compromised the signal-to-noise ratio severely. As for the classification data, the two variable scaling methods performed comparably and were summarized as one entry.
This prompted us to systematically investigate technical biases in these data (Analysis of Intensity-Dependent Bias). As illustrated in Fig. 1a and b, M versus rank(A) plots (M-rank(A)-plots) allow the identification of intensity-dependent shifts between pairs of feature vectors. Data in M-rank(A)-plots are log base 2 transformed, so that a fold change of two corresponds to a difference of one. For each feature, its difference in a pair of samples (y-axis) is plotted against the rank of its mean value (x-axis). Hence, the x-axis corresponds to the dynamic spectrum of feature intensities, while the y-axis displays the corresponding variability of the intensities.
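The coordinates of an M-rank(A)-plot follow directly from this description. A minimal sketch for one pair of spectra (plotting itself omitted; the function name is ours):

```python
from math import log2

def m_rank_a(sample1, sample2):
    """Per-feature (rank(A), M) pairs for one pair of spectra.

    M  = log2(sample1) - log2(sample2), the log fold change;
    A  = mean of the two log2 intensities;
    using rank(A) spreads features evenly along the dynamic range.
    """
    logs = [(log2(x), log2(y)) for x, y in zip(sample1, sample2)]
    m = [lx - ly for lx, ly in logs]
    a = [(lx + ly) / 2 for lx, ly in logs]
    order = sorted(range(len(a)), key=lambda i: a[i])
    rank = [0] * len(a)
    for r, i in enumerate(order):
        rank[i] = r
    return list(zip(rank, m))
```

A two-fold concentration difference appears at M = 1, and any intensity-dependent trend shows up as a drift of M away from zero along the rank(A) axis.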
For all possible pair-wise comparisons of spectra and all investigated normalization methods, M-rank(A)-plots were produced from the classification data as well as from the Latin-square data. Representative plots for a randomly selected pair of spectra from each of the two datasets are displayed in Fig. 1a and b. Shown are plots for the creatinine-normalized classification data and the non-normalized Latin-square data, respectively, as well as for data after Cyclic Loess, Quantile, and Cubic Spline Normalization. In the absence of bias, the points should align evenly around the straight line at M = 0. The additionally computed loess line (curved line) represents a fit of the data and helps to determine how closely the data approach M = 0.
In the M-rank(A)-plots of creatinine-normalized (Fig. 1a) and non-normalized data (Fig. 1b), the curved loess line clearly does not coincide with the straight line at M = 0. The plot of the creatinine-normalized classification data in Fig. 1a suggests that intensities in sample 2 of the pair are systematically overestimated at both ends of the dynamic spectrum but not in the middle. One might want to attribute this observation to a technical bias in the measurements. While we cannot prove directly that the observation indeed originates from a technical bias rather than biological variation, we will show later that correcting for the effect improves the estimation of fold changes, the detection of differentially produced metabolites, and the classification of samples.
Here, we first evaluated the normalization methods with respect to their performance in reducing such an effect. Looking at the Cyclic Loess normalized data in Fig. 1a, the bias is gone for the mid and high intensities; however, in the low-intensity region additional bias is introduced, affecting up to 20% of the data points. With Quantile and Cubic Spline Normalization, nearly no deviation from M = 0 can be recognized in Fig. 1a; they seem to remove almost all bias. Similar trends were observed for the other pair-wise comparisons within the classification data (plots not shown). Application of the other normalization methods to the classification data showed that PQN and VSN evened out most bias well, although they sometimes left the loess line s-shaped. The Linear Baseline method reduced bias only partially. Contrast, Li-Wong, and the two variable scaling methods hardly reduced bias at all.
The M-rank(A)-plots of the Latin-square data, of which four examples are shown in Fig. 1b, generally resemble those obtained for the classification data, except for one major difference: here, we have a large number of differential spike-in features covering a range of 2- to 128-fold changes. The spike-in differences should not be lost to normalization. Therefore, for better visualization, all data points of the spiked-in metabolites were marked differently, while the non-differential data points were marked in black (Fig. 1b). Ideally, all data points corresponding to the spiked-in metabolites should be found in the high- and mid-intensity range (A). Moreover, differences (M) should increase with increasing spike-in concentrations, resulting in a triangle-shaped distribution of the spike-in data points, with the curved loess line staying close to M = 0. As expected, the spike-ins stood out clearly in the non-normalized data. This was also the case for the PQN, Cyclic Loess, Contrast, Quantile, Linear Baseline, Li-Wong, Cubic Spline, and VSN normalized data, but not for the variable scaling normalized data.
The performance of all methods with respect to correcting dynamic-range-related bias can be compared in loess-line plots (Bolstad et al. 2003). In these plots, rank(A) (x-axis) is plotted against the difference of the average loess line from the baseline at M = 0 (y-axis). The average loess line was computed for each normalization method by a loess fit of the absolute loess lines of the M-rank(A)-plots for all pairs of NMR spectra. Our plots are a variation of those used by Bolstad et al. (2003) and Keeping and Collins (2011) in that we use rank(A) instead of A on the x-axis. Any local offset from zero indicates that the normalization method does not work properly in the corresponding part of the dynamic range.
We calculated these plots for both the classification data (Fig. 2a) and the spike-in data (Fig. 2b). Since in most cases similar trends were obtained for both datasets, the best performing methods are discussed together unless stated otherwise. In the absence of normalization, an offset that increases with decreasing intensity is observed for the lower ranks of both datasets. Cyclic Loess Normalization reduced the distance for the mid intensities well, but increased the offset for low intensities. Contrast, Quantile, and VSN Normalization all removed the intensity dependency of the offset well. Regarding the overall distance, Quantile Normalization reduced it best, followed by VSN; Contrast Normalization left the distance rather large. Taken together, this analysis shows that intensity-dependent measurement bias is corrected by only a few normalization approaches. Not surprisingly, these are methods that model the dynamic range of intensities explicitly.
M-rank(A)-plots can also detect unwanted heteroscedasticity, which may compromise the comparability of intensity changes across features. Spreading of the point cloud at one end of the dynamic range, as exemplified by the solely creatinine-normalized and non-normalized data in Fig. 1a and b, respectively, indicates a decrease in the reliability of measurements. In the absence of evidence that these effects reflect true biology or are due to spike-ins (data points corresponding to spiked-in metabolites in Fig. 1b), one should aim at correcting this bias. Otherwise, feature lists ranked by fold changes might be dominated by strong random fluctuations at the ends of the dynamic spectrum. Between-feature comparability will only be achieved if the standard deviation of feature intensities is kept low over the entire dynamic spectrum.
The influence of the different normalization techniques on the standard deviation relative to the dynamic spectrum was investigated using plots of the standard deviation for both the classification (Fig. 3a) and the Latin-square dataset (Fig. 3b). For this, the standard deviation of the logged data in a window of features with similar average intensities was plotted versus the rank of the averaged feature intensity, similarly to Irizarry et al. (2003). The plots show, for both the creatinine-normalized (Fig. 3a) and the non-normalized data (Fig. 3b), that the standard deviation decreases with increasing feature intensity. The same is true for the PQN normalized data. VSN, in contrast, keeps the standard deviation fairly constant over the whole intensity regime, whereas Li-Wong increases the standard deviation compared to the non-normalized data. The two variable scaling approaches increase the standard deviation substantially.
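The running standard-deviation curve underlying these plots can be sketched as follows; this is our simplified stand-in (a plain sliding window over rank-ordered features) for the windowed estimate used in Fig. 3:

```python
from statistics import mean, stdev

def running_sd(logged_spectra, window=5):
    """Sliding-window standard deviation along the dynamic range.

    Features are ordered by their average logged intensity; for each
    window of neighbouring features the per-feature standard deviations
    (across samples) are averaged, giving one smoothed SD value per
    window position, as in the plots of Irizarry et al. (2003).
    """
    n_feat = len(logged_spectra[0])
    means = [mean(s[i] for s in logged_spectra) for i in range(n_feat)]
    order = sorted(range(n_feat), key=lambda i: means[i])
    sds = [stdev([s[i] for s in logged_spectra]) for i in range(n_feat)]
    curve = []
    for start in range(n_feat - window + 1):
        win = order[start:start + window]
        curve.append(mean(sds[i] for i in win))
    return curve
```

A downward-sloping curve, as observed for the creatinine-normalized and PQN data, means low-intensity features are measured less reliably than high-intensity ones.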
Next, we investigated the influence of preprocessing on the detection of differentially produced metabolites, the estimation of fold changes from feature intensities, and the classification of samples based on urinary NMR fingerprints.
In the Latin-square data, we know by experimental design which features have different intensities and which do not. The goal of the following analysis is to detect the spike-in related differences and to separate them from random fluctuations among the non-spiked metabolites (Detection of Fold Changes). To that end, features with expected spike-in signals were identified and separated from background features. Features affected by the tails of spike-in signals, and regions in which several spike-in signals overlapped, were excluded. As the background signal in the bins containing spike-in signals was generally not negligible, it was subtracted to avoid distorting the fold change measures.
Then, all feature intensities in all pairs of samples were compared and fold changes were estimated. Fold changes that resulted from a spike-in were flagged. Next, the entire list of fold changes was sorted. Ideally, all flagged fold changes should rank higher than those resulting from random fluctuations. In reality, however, flagged and non-flagged fold changes mix to some degree; by design, smaller spike-in fold changes tend to be surpassed by random fluctuations. The flagging was performed with three different foci: first, all spike-in features were flagged; second, only low spike-in fold changes up to three; and third, only high fold changes above ten.
Receiver operating characteristic (ROC) curves with corresponding area under the curve (AUC) values were calculated for each normalization method and are given in Supplemental Table S2. Looking at the AUC values, only four methods yielded consistently better classification results than those obtained with the non-normalized data: Contrast, Quantile, Linear Baseline, and Cubic Spline Normalization. Quantile Normalization reached the highest AUC values in all runs, Cubic Spline and the Linear Baseline method showed comparable results, and Contrast Normalization performed slightly better than the non-normalized data.
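Because the ROC analysis here depends only on how the flagged fold changes rank against the non-flagged ones, the AUC can be computed directly with the rank-sum (Mann-Whitney) identity, without tracing the curve. A minimal sketch (function name ours):

```python
def auc_from_scores(scores, labels):
    """AUC of a ROC curve from fold-change magnitudes.

    Equals the fraction of (spike, background) pairs in which the
    spike-in fold change scores higher; ties count one half. This is
    the Mann-Whitney U statistic divided by n_pos * n_neg.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 1.0 means every flagged fold change outranks every random fluctuation; 0.5 means the two are indistinguishable.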
Differentially produced metabolites may be detected correctly even if the actual fold changes of the concentrations are systematically misestimated, as the ROC curves depend only on the order of the fold changes and not on their actual values. This can be sufficient in hypothesis-generating research but might be problematic in more complex settings such as metabolic network modeling. Therefore, we evaluated the impact of the preprocessing techniques on the accurate determination of fold changes. Based on published reference spectra, for each metabolite a set of features corresponding to the spike-in signals was determined; features with overlapping spike-in signals were removed and the background signal was subtracted. Within this set, the feature with the highest measured fold change between the pair of samples with the highest expected fold change was chosen for evaluating the accuracy of the fold change determined for the respective metabolite. Note that the spike-in metabolite creatinine was excluded because of the absence of any non-overlapping spike-in bins. Then, plots of the spike-in versus the measured fold changes between all pairs of samples were computed for each metabolite and each normalization method. For taurine, Fig. 4 shows exemplary results obtained from non-normalized data and from data after Cyclic Loess, Quantile, and Li-Wong Normalization, respectively.
In analogy to Bolstad et al. (2003), the following linear model was used to describe the observed signal x of bin i in sample j:
$$ \log x_{ij} = \gamma \log c_{0} + \gamma \log c_{\text{spike-in}} + \varepsilon_{ij} \quad (1) $$
Here, $c_{0}$ denotes the signal present without spike-in, $c_{\text{spike-in}}$ the spike-in concentration of the respective metabolite, $\gamma$ the proportionality between signal intensity and spike-in concentration, which is assumed to be concentration independent within the linear dynamic range of the NMR spectrometer, and $\varepsilon_{ij}$ the residual error.
Comparing two samples $j_{1}$ and $j_{2}$ leads to the following linear equation, for which we estimate the intercept $a$ and the regression slope $b$:
$$ \log \frac{x_{ij_{1}}}{x_{ij_{2}}} = a + b \log \frac{c_{\text{spike-in}_{1}}}{c_{\text{spike-in}_{2}}} \quad (2) $$
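Estimating a and b in Eq. (2) is ordinary least squares on the log-transformed ratios. A self-contained sketch (our helper, using log base 2 so that slopes read directly as fold-change proportionality):

```python
from math import log2

def fit_fold_changes(measured_ratios, spikein_ratios):
    """Least-squares estimates of intercept a and slope b in Eq. (2).

    Both inputs are linear-scale intensity/concentration ratios over
    all sample pairs. Ideal normalization gives a ~ 0 and b ~ 1;
    b > 1 means measured fold changes overestimate the true ones.
    """
    x = [log2(r) for r in spikein_ratios]
    y = [log2(r) for r in measured_ratios]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b
```

For example, measured ratios that are the squares of the spiked ratios yield b = 2, the kind of overestimation discussed below for Supplemental Table S3.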
In Supplemental Table S3, the slope estimates b for the different metabolites and normalization methods are given. Again, the variable scaling methods are summarized in a single entry. Nearly all values exceed one, meaning that the fold changes are overestimated. This can be explained by the choice of features: as one metabolite generally contributes to several features, and the feature with the highest fold change between the pair of samples with the highest spike-in difference is selected for each metabolite, features overestimating the fold change are preferred over features underestimating or correctly estimating it. We nevertheless favored this automated selection algorithm over manually searching for the “nicest looking” bin, to minimize effects of human interference.
Apart from that, the slope estimates b show that normalization performs quite differently for different metabolites. The methods with the most uniform results across all metabolites investigated are Quantile, Contrast, Linear Baseline, and Cubic Spline Normalization.
In Supplemental Table S4, values for the intercept $a$, the slope $b$, and the coefficient of determination $R^{2}$ are given, averaged over all metabolites. The data show that the methods that performed best in accurately estimating fold changes are Quantile and Cubic Spline Normalization.
Another common application of metabolomics is the classification of samples. To investigate the degree to which the different normalization methods affect this task, the dataset consisting of the ADPKD patient group and the control group was used. Classifications were carried out using a support vector machine (SVM) with a nested cross-validation consisting of an inner loop for parameter optimization and an outer loop for assessing classification performance (Gronwald et al. 2011). The nested cross-validation approach yields an almost unbiased estimate of the true classification error (Varma and Simon 2006). For the nested cross-validation, a set of n samples was selected randomly from the dataset. This subset was then normalized and classified as detailed above. Classification performance was assessed by inspection of the corresponding ROC curves (Supplemental Fig. S2). The classification was conducted five times for every normalization method and classification dataset size n.
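The nested cross-validation structure can be sketched generically. This is a pure-Python skeleton under our own assumptions: `train_fn` and `param_grid` are hypothetical stand-ins for the SVM training routine and its tuning grid, and accuracy replaces the AUC used in the paper.

```python
from statistics import mean

def nested_cv(samples, labels, train_fn, param_grid,
              k_outer=5, k_inner=3):
    """Nested cross-validation skeleton (Varma & Simon 2006).

    The inner loop picks the parameter with the best validation accuracy
    using only the training folds; the outer loop then scores the tuned
    model on a held-out fold it never saw, so the averaged outer score
    is a nearly unbiased estimate of the true classification error.
    `train_fn(xs, ys, param)` must return a predict function.
    """
    def folds(idx, k):
        return [idx[i::k] for i in range(k)]

    idx = list(range(len(samples)))
    outer_scores = []
    for test_idx in folds(idx, k_outer):
        train_idx = [i for i in idx if i not in test_idx]
        # inner loop: parameter selection on the training folds only
        best_param, best_score = None, -1.0
        for param in param_grid:
            inner_scores = []
            for val_idx in folds(train_idx, k_inner):
                fit_idx = [i for i in train_idx if i not in val_idx]
                model = train_fn([samples[i] for i in fit_idx],
                                 [labels[i] for i in fit_idx], param)
                inner_scores.append(mean(
                    1.0 if model(samples[i]) == labels[i] else 0.0
                    for i in val_idx))
            if mean(inner_scores) > best_score:
                best_param, best_score = param, mean(inner_scores)
        # outer loop: assess the tuned model on the untouched test fold
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx], best_param)
        outer_scores.append(mean(
            1.0 if model(samples[i]) == labels[i] else 0.0
            for i in test_idx))
    return mean(outer_scores)
```

The key design point is that the test fold never influences parameter selection; collapsing the two loops into one would leak information and inflate the estimated performance.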
In Table 2, the AUC values and standard deviations of the ROC curves are given for all normalization methods and classification dataset sizes of n = 20, n = 40, n = 60, n = 80, and n = 100. As expected, the classification performance of most normalization methods depended strongly on the size of the training set. The method with the highest overall AUC value was Quantile Normalization: with 0.903 for n = 100, 0.854 for n = 80, and 0.812 for n = 60, it performed best among the methods tested, albeit for the larger dataset sizes only; for dataset sizes n ≤ 40, its performance was about average. Cubic-Spline Normalization performed nearly as well, yielding the second highest AUC values for the larger training set sizes of n = 100 (0.892) and n = 80 (0.841). In contrast to Quantile Normalization, it also performed well for smaller dataset sizes: for n = 20 (0.740), it was the best performing method. VSN also showed good classification results over the whole dataset size range, with AUC values only slightly inferior to those of Cubic-Spline Normalization. Cyclic Loess performed almost as well as Quantile Normalization: for small dataset sizes, its classification results were only slightly better than average, but for the larger dataset sizes it was among the best-performing methods. Over the whole dataset size range, the classification results of PQN, Contrast, and Linear Baseline Normalization and those of the variable scaling methods were similar to the results obtained with creatinine-normalized data. Supplemental Table S5 gives the median (first column) and mean (second column) number of features used for classification for each normalization method. As can be seen, the number of selected features depended strongly on the normalization method used.
The best-performing Quantile Normalization led to a median number of 21 features, while the application of Cubic Spline Normalization and VSN resulted in the selection of 27 and 34 features, respectively. Employment of the PQN approach and the variable scaling methods mostly resulted in a greater number of selected features without improving classification performance. The third column of Supplemental Table S5 gives the percentage of selected features that are identical to those selected by the SVM following Quantile Normalization. PQN yielded about 95% identical features, followed by Li-Wong and the Linear Baseline method with approximately 90%. These data show that the ranking of features based on t-values, which was the basis for our feature selection, is only moderately influenced by normalization. The smallest percentage of identical features (52.4%) was observed for Contrast Normalization, which also performed the poorest overall (Table 2).
Table 2 Classification performance measured on classification dataset
We also investigated the impact of using creatinine as a scale basis for renal excretion by subjecting the classification data directly to Quantile and Cubic-Spline Normalization without prior creatinine normalization. For n = 100, AUC values of 0.902 and 0.886 were obtained for Quantile and Cubic-Spline Normalization, respectively. These values are very similar to those obtained for creatinine-normalized data, which had been 0.903 and 0.892, respectively. However, without prior creatinine normalization, an increase in the average number of selected features was noticed, namely from 21 to 31 features for Quantile and from 27 to 36 features for Cubic-Spline Normalization. In summary, Quantile and Cubic-Spline Normalization are the two best performing methods with respect to sample classification, irrespective of whether prior creatinine normalization has been performed.
Different preprocessing techniques have also been evaluated with respect to the NMR analysis of metabolites in blood serum (de Meyer et al. 2010). In particular, Integral Normalization, in which the total sum of the intensities of each spectrum is kept constant, and PQN were tested in combination with different binning approaches. PQN fared best, but it was noted that none of the methods tested yielded optimal results, calling for improvements in both spectral data acquisition and preprocessing. The PQN technique has also been applied to the investigation of NMR spectra obtained from cerebrospinal fluid (Maher et al. 2011).
Several of the preprocessing techniques compared here have also been applied to mass spectrometry-derived metabolomic data and proteomics measurements. Van den Berg et al. (2006) applied eight different preprocessing methods to GC-MS data: Centering, Auto Scaling, Range Scaling, Pareto Scaling, Vast Scaling, Level Scaling, Log Transformation, and Power Transformation. They found, as expected, that the selection of the proper data pre-treatment method depends on the biological question, the general properties of the dataset, and the subsequent statistical analysis. Within these constraints, Auto Scaling and Range Scaling showed the best overall performance. For the NMR metabolomic data presented here, these two methods were clearly outperformed by Quantile, Cubic Spline, and VSN Normalization, none of which were included in the analysis of the GC-MS data. In the proteomics field, Quantile and VSN normalization are commonly employed (Jung 2011).