As can be seen in the bright-field images (Fig. 1), the dry pollen grains from the five different grass species are similar in size and morphology. In general, grass pollen is characterized by a simple spherical shape, a single circular and annulate aperture situated distally, and microechinate grain wall ornamentation. Grass pollen has very limited mechanisms for preventing desiccation. As a result, grass pollen morphology changes dramatically after shedding, from the spherical shape of fresh pollen to the extensive infolding of dry pollen. The extensive infolding leads to large variation in Mie scattering effects, resulting in poorly reproducible spectra. Although the pollen grains of all five measured species have similar morphology, those of Poa alpina and Anthoxanthum odoratum are slightly smaller than the pollen grains of Lolium perenne, Bromus inermis, and Hordeum bulbosum.
Influence of the paraffin spectral contribution
Following our recently established protocol, we embedded the pollen samples in paraffin to avoid scattering artifacts in the spectra. Figure 2 shows the averages of the baseline-corrected and vector-normalized spectra of the non-embedded (Fig. 2a) and the paraffin-embedded pollen grains (Fig. 2b) for each pollen species. The spectra of the embedded pollen show much less variation within each species (Fig. 2b) than the spectra of the unembedded samples, which have a large standard deviation (Fig. 2a). The most prominent bands in the spectra are found at 989 and 1045 cm−1, both assigned to carbohydrates, at 1161 cm−1 assigned to lipids and carbohydrates, at 1549 and 1659 cm−1 assigned to amide II and amide I vibrations of proteins, respectively, and at 1745 cm−1 assigned to lipids. In Fig. 2b, the characteristic absorbance of paraffin adds to this pollen signature and is particularly prominent in the region from 1300 to 1500 cm−1. In particular, bands associated with the methyl rocking vibration at 1377 cm−1 and the CH2 bending and CH3 deformation modes at 1462 cm−1 dominate the spectra of all the embedded pollen samples (Fig. 2b). Although they are much less dominant there, signals in this spectral region are also present in the spectra from the non-embedded samples.
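The two preprocessing steps named above, baseline correction and vector normalization, can be sketched as follows. This is a minimal illustration on a synthetic band, not the procedure used in the study: the straight-line baseline through the two end points is an assumption (the baseline model is not specified here), and the function names are hypothetical.

```python
import numpy as np

def baseline_correct(spectrum):
    """Subtract a straight line through the first and last points (assumed baseline model)."""
    y = np.asarray(spectrum, dtype=float)
    line = np.linspace(y[0], y[-1], y.size)
    return y - line

def vector_normalize(spectrum):
    """Scale a spectrum to unit Euclidean (L2) norm."""
    y = np.asarray(spectrum, dtype=float)
    norm = np.linalg.norm(y)
    return y / norm if norm > 0 else y

# Synthetic example: one Gaussian band near the amide I position on a sloping offset
wavenumbers = np.linspace(800, 1800, 501)
raw = np.exp(-((wavenumbers - 1659) / 20.0) ** 2) + 0.001 * (wavenumbers - 800) + 0.2
processed = vector_normalize(baseline_correct(raw))
```

After both steps the spectrum starts and ends at zero and has unit length, so spectra recorded with different sample thickness become comparable in relative band intensity.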
The paraffin bands at 1377 and 1462 cm−1 in the spectra of the embedded samples vary between the different species (Fig. 2b). In the spectra of pollen from Poa alpina and Anthoxanthum odoratum, both bands have higher relative absorbance values, whereas for Lolium perenne and Bromus inermis, they are weaker. In the spectrum of Hordeum bulbosum, the two characteristic paraffin signals have much smaller contributions, and the spectrum in the region from 1300 to 1500 cm−1 resembles the averaged spectrum from the non-embedded pollen grains (compare the two blue traces in Fig. 2a and b). The different relative contribution of the embedding paraffin to the spectra of the different species is likely related to the different size of the pollen grains, leading to a different relative amount of paraffin versus pollen material in the probed microscopic volume.
We have tested three different approaches for the correction of FTIR spectra of the paraffin-embedded samples. For comparison, in the simplest procedure, paraffin spectral contributions were not suppressed, and the spectra were only baseline-corrected and vector-normalized. The assessment of this preprocessing by PLS-DA with full CV indicates clearly that the spectra of the different species can be discriminated (Table 1). An overall success rate of 79% was achieved, with individual success rates of approximately 90% for Hordeum bulbosum, Anthoxanthum odoratum, and Poa alpina spectra. The average spectra in Fig. 2 already suggest that the different extent of the paraffin spectral contribution could also influence the discrimination of the different pollen species. The results of the PCA corroborate this, and the loadings of the first principal component (PC 1) (see Electronic Supplementary Material (ESM) Fig. S1, right column) show the two strong paraffin-related signals at 1377 and 1462 cm−1. In other principal components as well, e.g., PC 4 (ESM Fig. S1, right column), the paraffin signals may be a reason for the species-related variation, as can be seen from the presence of a signal at 1460 cm−1. This indicates that the paraffin contribution needs to be minimized before data analysis in order to obtain classification and identification based on pollen chemistry alone.
Selection of non-affected spectral ranges
As discussed above, the strong deformation modes of paraffin affect the spectra mostly in the spectral range from 1300 to 1500 cm−1, with the two bands at 1377 and 1462 cm−1. Therefore, this spectral region was omitted from the data set (compare Scheme 1, approach 1), similar to the approach in our first paraffin-embedding study. By eliminating only the two strongest paraffin bands here, we assume that other spectral features contributed by the paraffin in the regions 800–1300 and 1500–1800 cm−1 are negligibly small compared to the absorption bands of the pollen samples themselves. Following the removal of the 1300–1500 cm−1 spectral range, the spectra were normalized by the simple EMSC model.
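Omitting the paraffin-dominated range amounts to masking the corresponding columns of the data matrix. The sketch below assumes a hypothetical 2 cm−1 wavenumber grid and random placeholder spectra; the function name drop_region is illustrative only.

```python
import numpy as np

def drop_region(wavenumbers, spectra, low=1300.0, high=1500.0):
    """Remove all data points with low <= wavenumber <= high from every spectrum."""
    wn = np.asarray(wavenumbers, dtype=float)
    X = np.atleast_2d(np.asarray(spectra, dtype=float))
    keep = (wn < low) | (wn > high)
    return wn[keep], X[:, keep]

# Placeholder data: 5 spectra on an assumed grid from 800 to 1800 cm^-1
wavenumbers = np.arange(800, 1801, 2.0)
spectra = np.random.default_rng(0).random((5, wavenumbers.size))
wn_kept, spectra_kept = drop_region(wavenumbers, spectra)
```

The remaining 800–1300 and 1500–1800 cm−1 segments are simply concatenated, so any subsequent model sees them as one feature vector.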
The assessment by PLS-DA with full CV shows that the overall classification success rate (76%, see Table 2) is lower than for the preprocessing in which the paraffin contribution is not corrected for (79%, Table 1). As in those results, the success rates vary enormously between the pollen species, ranging from 46% for Bromus inermis, where one fourth of the actual Bromus inermis pollen spectra were misclassified as Hordeum bulbosum, to 91% correct classification of Anthoxanthum odoratum and Poa alpina spectra. PCA results show that the main variances within this data set are found in the spectral range from 850 to 1150 cm−1 (ESM Fig. S2A, right, loadings of PC 1 and PC 2), which can be assigned mainly to carbohydrates [14, 18]. A differentiation between the pollen spectra from Anthoxanthum odoratum and Poa alpina and between Hordeum bulbosum and Lolium perenne can be achieved in PC 3 and PC 6, respectively, as found in the scores plot (ESM Fig. S2B). The finding that excluding the range from 1300 to 1500 cm−1 leads to only a relatively small drop in classification success rates compared to the simple preprocessing, which does not consider the paraffin influence, is in agreement with a previous work that reports the successful discrimination of paraffin-embedded pollen from other plant species.
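Per-species and overall success rates of the kind quoted above can be read from a confusion matrix. The sketch below uses made-up counts for three classes, not the values of Table 2, and assumes that the overall rate is the fraction of all spectra that are classified correctly (it could alternatively be defined as the mean of the per-species rates).

```python
import numpy as np

def success_rates(confusion):
    """Rows are true classes, columns are predicted classes."""
    C = np.asarray(confusion, dtype=float)
    per_class = np.diag(C) / C.sum(axis=1)   # fraction correct within each true class
    overall = np.trace(C) / C.sum()          # fraction correct over all spectra
    return per_class, overall

# Illustrative counts only (3 hypothetical classes, 20 spectra each)
confusion = [[12, 4, 4],
             [1, 18, 1],
             [0, 2, 18]]
per_class, overall = success_rates(confusion)
```

The off-diagonal entries of a row show where the misclassified spectra of that species end up, which is how a statement such as "one fourth was misclassified as another species" is obtained.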
Decomposition of spectra from paraffin-embedded pollen using NMF
A decomposition of spectral signatures belonging to different chemical constituents within the same spectrum of a complex sample can be achieved by a matrix factorization algorithm, such as NMF. This can result in a more detailed analysis of the spectral features from the different chemical constituents. In addition, such matrix factorization algorithms have been shown to be very useful for the exclusion of disruptive contributions from spectra, e.g., for de-noising [51, 52]. Therefore, NMF was used in another preprocessing approach (Scheme 1, approach 2) to split our spectra into pollen spectra and paraffin spectra. In this procedure, the 1004 spectra of individual pollen grains and 190 spectra of pure paraffin were decomposed together several times using different numbers of components. The decomposition using six components was chosen as optimal based on visual inspection, which indicated a good separation of the paraffin spectra in components 2 and 6 (Fig. 3). These two components show the typical paraffin bands at 1377 and 1462 cm−1. The reconstructed paraffin and pollen spectra were obtained for each measured spectrum (each pollen grain), and the averages of these two sets of reconstructed spectra for each species are shown in Fig. 4a and b, respectively. The reconstructed paraffin spectra (Fig. 4a) are in good agreement with a paraffin reference spectrum (Fig. 4a, top). Table 3 shows the normalized relative amount of each of the six components. The variation of the relative paraffin contribution (Table 3, components 2 and 6) is in good agreement with the visual observation of the pollen spectra (Fig. 2), showing a larger contribution to the Anthoxanthum odoratum and Poa alpina spectra and a smaller contribution to the spectra of the other three species. The averages of the spectra that were reconstructed from the remaining four components show no characteristic paraffin signals (Fig. 4b).
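The decomposition and the split into paraffin and pollen reconstructions can be illustrated with a small sketch. Several things here are assumptions and not taken from the study: the multiplicative-update NMF is a generic stand-in for whichever implementation was used, the data are synthetic with two components instead of six, and the paraffin component is picked automatically by its weight in the pure paraffin spectra, whereas the study selected components by visual inspection.

```python
import numpy as np

def nmf(X, k, n_iter=500, seed=0, eps=1e-12):
    """Factorize X (spectra x wavenumbers) as W @ H with non-negative W and H
    using multiplicative updates for the Frobenius norm."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + 0.1
    H = rng.random((k, X.shape[1])) + 0.1
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic data: 20 "embedded pollen" spectra plus 10 pure "paraffin" spectra
wn = np.linspace(1300, 1800, 251)
pollen_band = np.exp(-((wn - 1659) / 15.0) ** 2)
paraffin_band = np.exp(-((wn - 1462) / 10.0) ** 2)
rng = np.random.default_rng(1)
embedded = (np.outer(rng.uniform(0.5, 1.5, 20), pollen_band)
            + np.outer(rng.uniform(0.2, 1.0, 20), paraffin_band))
pure_paraffin = np.outer(rng.uniform(0.5, 1.0, 10), paraffin_band)
X = np.vstack([embedded, pure_paraffin])
is_paraffin_ref = np.arange(X.shape[0]) >= 20

W, H = nmf(X, k=2)

# Hypothetical selection rule: the component that dominates the pure paraffin spectra
paraffin_idx = [int(np.argmax(W[is_paraffin_ref].mean(axis=0)))]
pollen_idx = [j for j in range(H.shape[0]) if j not in paraffin_idx]

# Partial reconstructions; together they add up to the full model W @ H
paraffin_part = W[:, paraffin_idx] @ H[paraffin_idx]
pollen_part = W[:, pollen_idx] @ H[pollen_idx]
```

NMF solutions are not unique in general, so a pollen component can retain some paraffin signal when every pollen spectrum contains paraffin. This is consistent with the residual paraffin contributions considered for the reconstructed spectra.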
Compared to the spectra from unembedded single pollen grains on a ZnSe slide discussed above (compare Fig. 2a), three characteristic bands at 1236, 1331, and 1408 cm−1 are more clearly visible. They can be assigned to phospholipids, indicated, e.g., by the P=O stretching vibration at 1236 cm−1, to amino acids, as illustrated by the COO− stretching mode at 1408 cm−1, and to carbohydrates, the latter possibly causing the band at 1331 cm−1, which is likely due to a ring deformation vibration [18, 49, 53].
The PLS-DA with full CV classification of the pollen spectra reconstructed by the NMF approach results in a higher success rate (82%) than the PLS-DA of the previous preprocessing procedures (compare Table 4 with Tables 1 and 2). The success rates for Bromus inermis and Lolium perenne are slightly higher (65 and 71%, Table 4) than the classification results of approach 1 (46 and 64%, Table 2). Nevertheless, the PCA results (ESM Fig. S3) indicate that the variation within the NMF-decomposed spectra might still be slightly affected by bands from paraffin, as indicated by the variation in the CH2 deformation at 1460 cm−1, which on the one hand is assigned to lipids in the pollen, but which could also be present due to residual paraffin contributions (ESM Fig. S3, right column, loadings of PC 2 and PC 6).
Correction of the spectra using EMSC with a paraffin constituent spectrum
EMSC can be used to correct scattering and other non-absorption effects in FTIR data [35, 54, 55]. This is achieved by a model-based normalization with the help of a reference spectrum. In the preprocessing for approach 1 (cf. Scheme 1 and “Selection of non-affected spectral ranges” section), we used a simple EMSC model with linear and quadratic terms and the mean spectrum of the spectral data set. Here, in approach 3 (cf. Scheme 1), we used the complex EMSC with a modeled paraffin contribution. We assume that the spectra are composed of two components, a paraffin and a pollen constituent. In order to apply EMSC to the data set, the pollen constituent spectrum was chosen as the reference spectrum, and an averaged pure paraffin spectrum was added to the model as discussed previously. As a result, the spectra are normalized so that particularly the bands at 1377 and 1462 cm−1 show less variation between the spectra from the five species (Fig. 5). For the classification, this would mean that the variation induced by the differences in the paraffin-embedding medium can be minimized and that classification is only based on the spectral contributions of the pollen grains themselves.
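An EMSC model with an added paraffin constituent can be written as a least-squares fit of each spectrum to a constant, linear and quadratic baseline terms, the reference spectrum, and the paraffin spectrum. The sketch below is one common variant, given under stated assumptions: the fitted paraffin term is subtracted here, whereas the text above only states that the paraffin bands show less variation after correction, so the treatment in the study may differ. The function name and the synthetic spectra are hypothetical.

```python
import numpy as np

def emsc_with_constituent(spectra, wavenumbers, reference, constituent):
    """Fit x = a + c*t + d*t^2 + b*reference + p*constituent for each spectrum x,
    then return (x - a - c*t - d*t^2 - p*constituent) / b and the ratio p / b."""
    X = np.atleast_2d(np.asarray(spectra, dtype=float))
    wn = np.asarray(wavenumbers, dtype=float)
    # Scale wavenumbers to [-1, 1] so the polynomial terms are well conditioned
    t = 2 * (wn - wn.min()) / (wn.max() - wn.min()) - 1
    M = np.column_stack([np.ones_like(t), t, t ** 2, reference, constituent])
    coefs, *_ = np.linalg.lstsq(M, X.T, rcond=None)   # shape (5, n_spectra)
    a, c, d, b, p = coefs
    interference = (np.outer(a, np.ones_like(t)) + np.outer(c, t)
                    + np.outer(d, t ** 2) + np.outer(p, constituent))
    corrected = (X - interference) / b[:, None]
    return corrected, p / b

# Synthetic example: a "pollen" reference band and a "paraffin" band
wn = np.linspace(800, 1800, 501)
reference = np.exp(-((wn - 1659) / 30.0) ** 2)
paraffin = np.exp(-((wn - 1462) / 10.0) ** 2)
b_true = np.array([0.8, 1.0, 1.3])   # effective thickness / scaling
p_true = np.array([0.5, 0.2, 0.1])   # paraffin amount
baseline = 0.1 + 1e-4 * (wn - 800)
X = np.outer(b_true, reference) + np.outer(p_true, paraffin) + baseline

corrected, paraffin_ratio = emsc_with_constituent(X, wn, reference, paraffin)
```

The fitted ratio p/b is a per-spectrum estimate of the paraffin contribution relative to the pollen signal, which can itself be inspected for species-dependent differences.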
The PLS-DA with full CV classification of the pollen spectra preprocessed by the complex EMSC approach results in the highest success rate (83%) of all the tested approaches (Table 5). In particular, the success rate for Bromus inermis is higher (63% in Table 5) than in the classification of the data set corrected using approach 1 (46% in Table 2). This indicates that the already very promising classification results obtained in the previous study on 11 plant species can be improved even further by optimizing the spectral preprocessing step. In general, approach 2 (the NMF approach, Table 4) and approach 3 (the complex EMSC approach, Table 5) result in relatively similar success rates. For all preprocessing procedures, the success rates vary between the different pollen species. The pollen spectra of the species Anthoxanthum odoratum, Hordeum bulbosum, and Poa alpina can be classified well (Table 5, > 90%), while the identification of the spectra belonging to Bromus inermis and Lolium perenne is more challenging, with success rates of 63 and 69%, respectively.
Classification by hierarchical cluster analysis and principal component analysis
The success rates for the full cross-validation PLS-DA-based classification indicate that the three approaches of minimizing the paraffin contribution to the spectra, namely (i) omitting a part of the spectrum (approach 1), (ii) non-negative matrix factorization (approach 2), and (iii) normalization of the paraffin signals by EMSC (approach 3), lead to a similar ability to discriminate the pollen spectra from the species Anthoxanthum odoratum, Hordeum bulbosum, and Poa alpina, and a less efficient classification of the two species Bromus inermis and Lolium perenne within the data set. It has been shown before that the spectra of some grass pollen species have more unique spectral features than others, so that their discrimination within a data set of similar pollen species is less difficult [20, 23]. In order to assess intra- versus interspecies differences, a hierarchical cluster analysis was performed, using the spectral data obtained by approach 3 (Scheme 1, left blue box). This pretreatment has the advantage that no supervision is needed, and automated pattern recognition tools could be developed for a fast identification of the spectra.
The hierarchical cluster analysis was carried out with the average spectra of 20 single pollen spectra of each sample, leading to 50 spectra in total, using Euclidean distances and Ward’s algorithm. The resulting dendrogram is shown in Fig. 6. Most of the spectra of Poa alpina (Fig. 6, purple branches), Anthoxanthum odoratum (Fig. 6, black branches), and Hordeum bulbosum (Fig. 6, blue branches) are clustered almost exclusively in their respective groups. This is in good agreement with the PLS-DA identification discussed above and indicates low variances within the spectra of the respective species. The high similarity of the majority of spectra from Bromus inermis (Fig. 6, red branches) to those of Hordeum bulbosum (Fig. 6, blue branches) agrees with the high number of Bromus inermis spectra that are misclassified in the PLS-DA as Hordeum bulbosum spectra (cf. Table 5). We therefore infer a high similarity in the composition of the pollen from these two species, in agreement with the close relationship of the tribes Hordeeae (Triticeae) and Bromeae within the Pooideae subfamily [56, 57]. The cluster in the dendrogram that comprises all except one spectrum from Poa alpina (Fig. 6, purple branches) is very similar to a group of spectra that contains average pollen spectra from Anthoxanthum odoratum and Lolium perenne plants (Fig. 6, black branches and green branches, respectively), also in agreement with the misclassification by PLS-DA of several individual spectra from Anthoxanthum odoratum as Lolium perenne and Poa alpina, and vice versa (Table 5). Moreover, it can be concluded that the chemical composition of these pollen has more similarities than that of the pollen from the other species, in agreement with the fact that all of them belong to the Poeae/Aveneae tribe complex.
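The clustering setup described here (averaging 20 single-grain spectra per sample, then Ward linkage on Euclidean distances) can be sketched with SciPy. The data are synthetic, with two well-separated artificial groups and six samples instead of 50, and the helper average_per_sample is a hypothetical name.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def average_per_sample(spectra, sample_ids):
    """Average all single-grain spectra that belong to the same sample (plant)."""
    ids = np.asarray(sample_ids)
    unique_ids = np.unique(ids)
    means = np.vstack([spectra[ids == s].mean(axis=0) for s in unique_ids])
    return unique_ids, means

rng = np.random.default_rng(0)
n_points = 50
centers = np.vstack([np.zeros(n_points), np.full(n_points, 5.0)])  # two synthetic groups
sample_group = np.repeat([0, 1], 3)            # 6 samples, 3 per group
sample_ids = np.repeat(np.arange(6), 20)       # 20 single-grain spectra per sample
spectra = centers[sample_group[sample_ids]] + rng.normal(0, 0.1, (120, n_points))

ids, sample_means = average_per_sample(spectra, sample_ids)
Z = linkage(sample_means, method="ward", metric="euclidean")  # dendrogram as a linkage matrix
clusters = fcluster(Z, t=2, criterion="maxclust")             # cut the tree into two clusters
```

The linkage matrix Z can be passed to scipy.cluster.hierarchy.dendrogram for plotting. Cutting the tree at a fixed number of clusters is shown only to make the grouping explicit.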
In a PCA, the differences between the spectra of the five pollen species can be identified based on the loadings spectra. Figure 7a shows the scores plot and corresponding loadings of the first and second principal components of a PCA with the average spectra of the 50 plants. The first and second PCs explain 54% of the total variance in the data set. As visible in the scores plot in Fig. 7a, the mostly positive score values on the first PC of spectra from Poa alpina and Anthoxanthum odoratum, as well as of most spectra from Lolium perenne, confirm the high similarity of the pollen composition in these three species. Similarly, the spectra from Bromus inermis and Hordeum bulbosum show mostly negative score values on the first PC (Fig. 7a). According to the loadings in Fig. 7a, the most pronounced differences between the spectra from the Bromus inermis/Hordeum bulbosum group and those from the two species Poa alpina and Anthoxanthum odoratum are in the broad bands around 945 cm−1 (second PC) and 1678 cm−1 (first PC) that could be assigned to molecular vibrations of carbohydrates and proteins, respectively [14, 49, 53]. This would lead to the conclusion that pollen from Bromus inermis can be discriminated from Poa alpina and Anthoxanthum odoratum based on a different carbohydrate and protein composition. The scores plot in Fig. 7b shows that separation of Poa alpina and the Bromus inermis/Hordeum bulbosum group from the other species is also possible based on the variance in the third PC. According to the corresponding loading spectra in Fig. 7b, the discrimination is caused by signals that can be assigned to carbohydrates at 966 cm−1, to sporopollenin at 1167 and 1610 cm−1, here tentatively assigned to lipids at 1423 and 1460 cm−1, and to proteins at 1651 and 1691 cm−1 [16, 18].
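Scores, loadings, and the fraction of variance explained by each PC, as discussed above, can be obtained from a singular value decomposition of the mean-centered data. The sketch uses random placeholder data with the same number of rows as the 50 averaged plant spectra; the function name pca is illustrative.

```python
import numpy as np

def pca(X, n_components=3):
    """PCA by SVD of the mean-centered data matrix (rows = spectra)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # coordinates of each spectrum
    loadings = Vt[:n_components]                      # one loading spectrum per PC
    explained = s ** 2 / np.sum(s ** 2)               # fraction of total variance per PC
    return scores, loadings, explained[:n_components]

# Placeholder data: 50 "average spectra" with 100 spectral points each
X = np.random.default_rng(0).random((50, 100))
scores, loadings, explained = pca(X)
variance_pc1_pc2 = explained[:2].sum()   # quantity analogous to the stated 54%
```

Bands with large absolute values in a loading spectrum are those that drive the separation of the samples along the corresponding PC in the scores plot.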
Pattern recognition for classification of grass pollen spectra from independent populations
The PLS-DA models discussed in the previous sections (“Selection of non-affected spectral ranges” and “Correction of the spectra using EMSC with a paraffin constituent spectrum”) show high success rates for the discrimination of the three pollen species Anthoxanthum odoratum, Hordeum bulbosum, and Poa alpina in a leave-one-out approach, where each individual spectrum of each sample is tested separately. Nevertheless, in such a setting, the model is trained with spectra from different plants, but from the same population as those of the pollen that is identified. A robust, reliable discrimination method should include the possibility to identify pollen spectra that come from different plant populations as well. In our experiments, the plants in each of the five species originate from two different populations. Therefore, a PLS-DA model was constructed using spectra from just one population per species, amounting to 502 spectra. The success rates were obtained by using the respective other population from each species as an independent test set, comprising the other 502 spectra. The results for the identification of the unknown populations by PLS-DA are shown in the upper section of Table 6. Comparing the success rates with the results obtained in the leave-one-spectrum-out approach above (Table 5), we find that the success rates are only slightly lower for the species Anthoxanthum odoratum, Hordeum bulbosum, and Poa alpina when spectra come from an unknown population. Nevertheless, they are very low in those species where success rates were already low in the leave-one-spectrum-out identification, that is, in Bromus inermis and Lolium perenne (compare the upper section of Table 6 with Table 5), with the success rate for classification of the former dropping from 63% to 26%. We attribute this low success rate to a greater variation between the different populations in these two species.
Similar observations have been reported for other grass species with bulk FTIR and MALDI mass spectrometry approaches as well and have been discussed in the greater context of adaptive variation [23, 58]. We have also used the second derivatives of the spectra, which yielded similar success rates (ESM Table S1).
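The independent-population test described above, training on one population per species and testing on the other, can be sketched as follows. A nearest-centroid classifier is used here purely as a simple stand-in for the PLS-DA model, and the data are synthetic, with two artificial species and a small assumed shift between populations.

```python
import numpy as np

def fit_centroids(X, y):
    """Mean spectrum per class, computed from the training population only."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(X, classes, centroids):
    """Assign each spectrum to the class with the nearest centroid (Euclidean)."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dist, axis=1)]

rng = np.random.default_rng(0)
species = np.repeat(["species_1", "species_2"], 40)
population = np.tile(np.repeat([1, 2], 20), 2)        # two populations per species
level = np.where(species == "species_2", 3.0, 0.0) + 0.3 * (population == 2)
X = level[:, None] + rng.normal(0, 0.1, (80, 10))     # synthetic "spectra"

train = population == 1   # model sees only population 1 of each species
test = population == 2    # population 2 is the unknown, independent test set
classes, centroids = fit_centroids(X[train], species[train])
predicted = predict(X[test], classes, centroids)
success_rate = np.mean(predicted == species[test])
```

The important design point is that no spectrum from the test population enters the model, in contrast to leave-one-spectrum-out validation, where other spectra from the same population remain in the training set.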
Apart from the linear classifier PLS-DA, we used machine learning for the identification of spectra from the respective unknown populations. A feed-forward ANN was trained with the same set of 502 spectra, divided into a training, validation, and internal test set, and tested with the 502 spectra from the respective other populations. The success rates were very similar, with a higher number of correct species assignments in Lolium perenne and similar misclassifications, e.g., assignment of Lolium perenne as Poa alpina (Table 6, middle section). The slightly diminished success rate for the identification of Hordeum bulbosum compared to the PLS-DA classifier is balanced by the 66% correct identification of the spectra from Lolium perenne pollen, which is an improvement compared to the PLS-DA model (compare top and middle sections in Table 6). Consistent with the results of HCA (Fig. 6) and PCA (Fig. 7), almost all incorrectly assigned spectra of Hordeum bulbosum are labeled with Bromus inermis as output class, also in agreement with the close phylogenetic relationship of the two species mentioned above [56, 57]. Identification using a random forest (RF) algorithm as another machine learning approach results in success rates similar to those of the PLS-DA model in the case of Bromus inermis and Lolium perenne, but in lower numbers of correct identifications than PLS-DA and ANN for the other three species (Table 6, last section). Changing the number of trees in the RF from 300 (cf. results in Table 6), determined to be the optimum, to higher numbers results in similar success rates.
The strong decrease in classification success in Bromus inermis when an unknown population must be identified (compare the respective columns in Table 5 and in the three sections of Table 6) and the quite high success rates for other species are in agreement with the different intraspecies variance that was observed between populations of other Poaceae species [23, 58]. Especially in Anthoxanthum odoratum and also Poa alpina, which show the highest success rates here (Table 6), the ability to distinguish spectra from different populations of the same species was challenging based on FTIR spectra but could be achieved using other chemical information of the pollen samples.
The fact that identification is based here on spectra from individual pollen grains rather than averages from one plant adds another source of variation, as was recently also discussed when we compared different spectroscopic methods that probe either bulk samples or individual pollen grains and their potential for pollen identification. Nevertheless, the possibility to study pollen spectra in mixtures could in the future open possibilities for FTIR imaging-based identification of mixed grass pollen samples, similar to existing high-throughput and mapping approaches [9, 10].