Three spectral representations obtained from the Fourier-transformed FIDs, namely the real spectrum (REAL), the imaginary spectrum (IMAG), and the absolute value spectrum (ABS), are given for the set of 60 tea spectra in Fig. 1. The ABS is the absolute value of the complex spectrum (i.e., REAL + i·IMAG) and represents the magnitude of the magnetization in the complex plane. The peaks of the ABS spectrum are broader and less symmetric than those in the REAL spectrum. For this reason, the REAL spectrum is the preferred choice for spectroscopists who are concerned with qualitative analysis. Note that the IMAG spectrum does not contribute to the amplitude of the ABS spectrum at the peak maxima, because it passes through zero at the chemical shifts where the REAL peak maxima occur. Its contribution occurs instead at the peak edges, so the wider ABS peaks have larger areas and capture more of the signal. When peak resolution is unimportant, as is the case for spectral pattern recognition and comparison, the ABS spectrum is beneficial because it uses the entire NMR signal. In theory, the signal-to-noise ratio should improve by a factor of the square root of two.
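The relationship among the three representations can be illustrated with a single idealized line. The sketch below assumes a perfectly phased Lorentzian peak of unit height (an assumption for illustration; real spectra contain overlapping and imperfectly phased peaks), and the function names are hypothetical. It shows that the imaginary part is zero at the peak maximum and that the magnitude line is broader than the real line.

```python
import math

def lorentzian(offset, half_width=1.0):
    """Return (real, imag, magnitude) of an ideal complex Lorentzian line.

    `offset` is the distance from the peak centre and `half_width` is the
    half width at half maximum of the real (absorption) line, both in the
    same arbitrary frequency units.
    """
    x = offset / half_width
    line = complex(1.0, -x) / (1.0 + x * x)  # algebraically 1 / (1 + i*x)
    return line.real, line.imag, abs(line)

def full_width_at_half_max(component, half_width=1.0, step=1e-4):
    """Scan outward from the centre until the chosen component
    (0 = REAL, 2 = ABS) drops to half of its maximum value."""
    peak = lorentzian(0.0, half_width)[component]
    offset = 0.0
    while lorentzian(offset, half_width)[component] > 0.5 * peak:
        offset += step
    return 2.0 * offset  # the ideal line is symmetric about its centre

real0, imag0, abs0 = lorentzian(0.0)     # values at the peak maximum
fwhm_real = full_width_at_half_max(0)    # width of the REAL line
fwhm_abs = full_width_at_half_max(2)     # width of the ABS line
print(real0, imag0, abs0)
print(fwhm_abs / fwhm_real)              # close to sqrt(3) for this ideal line
```

For this ideal line the magnitude peak is wider than the absorption peak by a factor of about the square root of three, which is why the ABS representation trades resolution for signal.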
To evaluate the reproducibility, the pooled standard deviation about the 12 tea sample means was calculated from the normalized spectra. This figure of merit measures the inherent error of the measurement. The pooled standard deviation serves two purposes in this paper. First, it characterizes the measurement error of the experiment. Second, it is used to scale the datasets that have a high dynamic range (i.e., both very large and very small peaks). The benefit of this scaling is demonstrated with the liquor study.
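A pooled standard deviation of this kind can be sketched as follows. The sketch assumes the usual definition, in which the squared deviations of the replicate spectra about their own sample mean are summed over all samples and divided by the number of spectra minus the number of samples; the exact formula used by the authors is not stated here, and the function name and toy data are hypothetical.

```python
import math
from collections import defaultdict

def pooled_std(spectra, labels):
    """Pooled standard deviation at each spectral point about the sample means.

    spectra: list of equal-length lists, one list per spectrum
    labels:  one sample identifier per spectrum (replicates share a label)
    """
    groups = defaultdict(list)
    for spectrum, label in zip(spectra, labels):
        groups[label].append(spectrum)

    n_points = len(spectra[0])
    dof = len(spectra) - len(groups)  # N spectra minus K sample means
    if dof <= 0:
        raise ValueError("need replicate spectra for at least one sample")

    sum_sq = [0.0] * n_points
    for members in groups.values():
        for j in range(n_points):
            mean_j = sum(s[j] for s in members) / len(members)
            sum_sq[j] += sum((s[j] - mean_j) ** 2 for s in members)
    return [math.sqrt(v / dof) for v in sum_sq]

# Hypothetical toy data: two samples, two replicates each, three points.
toy_spectra = [
    [1.0, 2.0, 5.0],
    [3.0, 2.0, 5.0],
    [10.0, 4.0, 0.0],
    [12.0, 8.0, 0.0],
]
toy_labels = ["tea_A", "tea_A", "tea_B", "tea_B"]
toy_pooled = pooled_std(toy_spectra, toy_labels)
print(toy_pooled)
```

Because the deviations are taken about each sample's own mean, differences between samples do not inflate the result; only the replicate-to-replicate variation is measured.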
The pooled standard deviations for the REAL, IMAG, and ABS spectra are given in Fig. 2. The larger the pooled standard deviation at a given chemical shift, the greater the error. In this figure, the ABS error profile has the lowest error throughout most of the spectral range, while the REAL and IMAG spectra have greater errors. For pattern recognition, reproducibility is key, and the classification results reported below are consistent with this finding.
All four datasets were evaluated under the same conditions. The spectral range was restricted to [0.5, 7.0] ppm to exclude the solvent peak at δ 7.26 ppm and the TMS peak at δ 0.00 ppm. The number of spectral measurements (i.e., data points per spectrum) was 20,000. Each spectrum was normalized to unit vector length. For two datasets, the liquor and the hops, the spectra were also scaled by the pooled standard deviation because those spectra have high dynamic ranges; without scaling, poor classification accuracy was obtained (e.g., 60%). This scaling is henceforth referred to as error scaling. All comparisons examine the REAL versus the ABS spectrum because the IMAG spectrum generally gave the worst classification results. BLPs with 100 bootstraps were used for statistical validation and to yield sufficient statistical power. A matched sample t test is used to compare the classification results for each bootstrap between the REAL and ABS spectra. Positive t scores favor the ABS representation and negative t scores favor the REAL representation.
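The preprocessing steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the `floor` guard against division by zero, and the toy values are all hypothetical.

```python
import math

def select_range(ppm, spectrum, low=0.5, high=7.0):
    """Keep only the points whose chemical shift lies in [low, high] ppm."""
    kept = [(p, y) for p, y in zip(ppm, spectrum) if low <= p <= high]
    return [p for p, _ in kept], [y for _, y in kept]

def unit_normalize(spectrum):
    """Scale a spectrum to unit Euclidean (vector) length."""
    norm = math.sqrt(sum(y * y for y in spectrum))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return [y / norm for y in spectrum]

def error_scale(spectrum, pooled_sd, floor=1e-12):
    """Divide each point by its pooled standard deviation.

    `floor` is a hypothetical safeguard for points with zero pooled error.
    """
    return [y / max(sd, floor) for y, sd in zip(spectrum, pooled_sd)]

# Hypothetical four-point spectrum with a TMS point (0.00 ppm) and a
# solvent point (7.26 ppm) that fall outside the retained range.
ppm_axis = [0.0, 1.0, 2.0, 7.26]
raw = [50.0, 3.0, 4.0, 90.0]

kept_ppm, kept = select_range(ppm_axis, raw)
normalized = unit_normalize(kept)
scaled = error_scale(normalized, [0.3, 0.1])  # hypothetical pooled errors
print(kept_ppm, normalized, scaled)
```

In the toy example the second retained point is smaller after error scaling only if its pooled error is larger; here its smaller error gives it the greater weight, which is the intended effect for small but reproducible peaks.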
Most of the classifiers were parameter free; the exception was the SVM, whose cost factor C was arbitrarily set to inf, which in MATLAB denotes positive infinity (in effect, an arbitrarily large cost). The sPLS-DA was the super PLS implementation, which determines the optimal number of latent variables by an internal BLP of the calibration set. FuRES is the softest classifier and tends to be the most sensitive to the representation of the data because it balances variance and bias (i.e., larger peaks are favored over smaller features). SVMTreeG is the softest of the SVM classifiers, and SVMTreeH trades softness for efficiency in building minimal spanning trees.
A brief description of the teas is given in Table 1; missing fields in the table correspond to unknown information. The spectra for the tea extracts are given in Fig. 1. The principal component scores and the classification trees are given in Fig. 3. The principal component scores allow the distribution of the spectra to be visualized. The REAL results are in the left column and the ABS results are in the right column of this figure. The two sets of scores appear similar; however, the total percent variance (the sum of the percentages on the axes) is 95% for the ABS scores versus 92% for the REAL scores, which indicates that the ABS scores have a better noise distribution. At the bottom are two classification trees obtained from SVMTreeH, a fuzzy entropy-based support vector machine tree. For both trees, all the classes have been resolved. The tree structures are the same except for rules #6, #8, and #9, which characterize groups that are closer together in the data space. Table 4 reports the average results of the 100 bootstraps and five Latin partitions. The measures of precision presented with the averages are 95% confidence intervals. A matched sample t test was used to compare the classification rates between the REAL and ABS spectra. Positive t scores indicate a higher classification rate for the ABS set of data. For all six classifiers, the ABS spectra gave significantly better classification rates.
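The sign convention of the matched sample t test can be made concrete with a short sketch. It assumes the standard paired t statistic computed on the per-bootstrap differences (ABS minus REAL); the function name and the classification rates are hypothetical and are not taken from Table 4.

```python
import math

def matched_sample_t(abs_rates, real_rates):
    """Paired t statistic on per-bootstrap differences (ABS minus REAL).

    A positive value means the ABS rates are higher on average.
    """
    if len(abs_rates) != len(real_rates) or len(abs_rates) < 2:
        raise ValueError("need two equal-length sequences with >= 2 pairs")
    diffs = [a - r for a, r in zip(abs_rates, real_rates)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d / math.sqrt(var_d / n)

# Hypothetical classification rates (%) from four matched bootstraps.
toy_abs = [98.0, 97.0, 99.0, 98.0]
toy_real = [95.0, 96.0, 95.0, 96.0]
t_score = matched_sample_t(toy_abs, toy_real)
print(t_score)  # positive, so this toy comparison favors ABS
```

Pairing by bootstrap removes the variation that both representations share within a given partitioning, so the test responds only to the difference between them. The statistic is compared against a t distribution with n - 1 degrees of freedom.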
The next dataset comprises eight liquor samples from various phases of production; a description is given in Table 2. Figure 4 demonstrates the usefulness of the error scaling procedure. The spectra for both the REAL (left) and ABS (right) representations are dominated by the ethanol peaks, whereas the characteristic peaks from the other compounds are minuscule. The middle of the figure shows the principal component scores for the normalized spectra, and the bottom of the figure shows the principal component scores obtained after error scaling. Two trends are obvious. First, error scaling greatly enhances the resolution of the objects in the different classes by giving appropriate weights to the smaller peaks in the spectra. Second, the ABS spectral scores exhibit much greater resolution of samples than the REAL spectral scores. The classification results using 100 bootstraps and three Latin partitions are given in Table 5. The ABS dataset gave significantly improved results for all classifiers.
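The validation scheme behind these tables can be sketched as follows. This is one plausible reading of bootstrapped Latin partitions, not necessarily the authors' implementation: each object is used exactly once for prediction within a bootstrap, the class proportions are preserved across partitions, and the random partitioning is repeated for each bootstrap. The function names, the seed, and the class layout (eight classes with three replicates each) are hypothetical.

```python
import random
from collections import defaultdict

def latin_partitions(labels, n_partitions, rng):
    """Assign every object to exactly one prediction partition, class by
    class, so each partition keeps roughly the overall class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    partitions = [[] for _ in range(n_partitions)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random assignment within the class
        for position, index in enumerate(members):
            partitions[position % n_partitions].append(index)
    return partitions

def bootstrapped_splits(labels, n_partitions, n_bootstraps, seed=0):
    """Yield (train, test) index lists for every partition of every bootstrap."""
    rng = random.Random(seed)
    everything = set(range(len(labels)))
    for _ in range(n_bootstraps):
        for test in latin_partitions(labels, n_partitions, rng):
            yield sorted(everything - set(test)), sorted(test)

# Hypothetical layout: 8 classes with 3 replicate spectra each.
liquor_like_labels = [k for k in range(8) for _ in range(3)]
splits = list(bootstrapped_splits(liquor_like_labels, n_partitions=3,
                                  n_bootstraps=2))
print(len(splits))  # 2 bootstraps x 3 partitions = 6 train/test splits
```

With three replicates per class and three partitions, every prediction set contains one spectrum from each class, and averaging over the bootstraps gives the precision estimates reported with the classification rates.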
The third dataset comprised nine samples of hops extracts with replicate measurements collected on different days. A description of these samples is given in Table 6. The spectra and principal component scores are given in Fig. 5. There are many smaller but characteristic peaks downfield from 2 ppm. For this case, error scaling also improved the classification results significantly. There are subtle differences between the principal component scores of the REAL and ABS sets, and the ABS scores have a greater cumulative variance than the REAL scores. The results are reported in Table 7. For all six classifiers, the results were significantly better for the ABS data.
The last set was also the largest. It comprised 25 Cannabis extracts, each with 5 replicates, yielding 125 spectra. Error scaling was not required for these data. Table 3 gives a description of the sample extracts, and Fig. 6 contains the spectra and principal component scores. When comparing the principal component scores, the REAL set has the greater cumulative variance, 80% compared with 79% for the ABS set. The classification results are given in Table 8. For all classifiers except SVMTreeG, the ABS representation gave significantly better results.