High-Throughput Prediction of Acacia and Eucalypt Lignin Syringyl/Guaiacyl Content Using FT-Raman Spectroscopy and Partial Least Squares Modeling

High-throughput techniques are necessary to efficiently screen potential lignocellulosic feedstocks for the production of renewable fuels, chemicals, and bio-based materials, thereby reducing experimental time and expense while supplanting tedious, destructive methods. The ratio of lignin syringyl (S) to guaiacyl (G) monomers has been routinely quantified as a way to probe biomass recalcitrance. Mid-infrared and Raman spectroscopy have been demonstrated to produce robust partial least squares models for the prediction of lignin S/G ratios in a diverse group of Acacia and eucalypt trees. The most accurate Raman model has now been used to predict the S/G ratio from 269 unknown Acacia and eucalypt feedstocks. This study demonstrates the application of a partial least squares model composed of Raman spectral data and lignin S/G ratios measured using pyrolysis/molecular beam mass spectrometry (pyMBMS) for the prediction of S/G ratios in an unknown data set. The predicted S/G ratios calculated by the model were averaged according to plant species, and the means were not found to differ from the pyMBMS ratios when evaluating the mean values of each method within the 95 % confidence interval. Pairwise comparisons within each data set were employed to assess statistical differences between each biomass species. While some pairwise appraisals failed to differentiate between species, Acacias, in both data sets, clearly display significant differences in their S/G composition which distinguish them from eucalypts. This research shows the power of using Raman spectroscopy to supplant tedious, destructive methods for the evaluation of the lignin S/G ratio of diverse plant biomass materials.


Background
The ratio of lignin syringyl (S) to guaiacyl (G) moieties has been routinely quantified as one method to evaluate biomass recalcitrance [1-7]. While higher S/G ratios have resulted in increased lignin pulping reactivity [2,7], a clear trend linking S/G ratio to the enzymatic degradation of plant cell walls has not been established. Some reports have indicated that high S/G ratios correlate with increased sugar release [4,6], while other studies have concluded that lower ratios are optimal [1]. Regardless of the exact effects S/G ratios have on the saccharification of biomass, this parameter has proven to be important in developing a better understanding of lignin structure and degradation.
In a previous study, mid-infrared (MIR), near-infrared (NIR), and FT-Raman spectra were coupled with lignin S/G ratios obtained using pyrolysis/molecular beam mass spectrometry (pyMBMS) for the construction of multivariate analysis (MVA) models [8]. Various iterations were performed to determine the spectral processing techniques that provided the most robust calibration models. The models were rigorously assessed using statistical metrics, including root mean square error (RMSE) or Scree plots for determining the appropriate number of factors, and the RMSE values and calculated coefficients of correlation (r) and determination (R²) after using a full cross-validation and a 50-sample validation data set. These parameters illustrated the increased accuracy obtained from using MIR and Raman spectroscopy for the development of models capable of predicting lignin S/G ratios. For example, MIR and Raman spectroscopy partial least squares (PLS) models resulted in a root mean square error of prediction (RMSEP) of 0.13 to 0.15 and 0.13 to 0.16, respectively, while the RMSEP obtained using NIR spectra was slightly higher (0.18 to 0.21). While the construction of these models illustrated the potential of MVA and vibrational spectroscopy to screen biomass based on lignin S/G ratios, the complete evaluation of the MVA models required the prediction of the S/G ratios for an unknown data set. The execution of this step is integral to determining a model's practicality for assessing future samples.
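The error and correlation metrics named above can be illustrated with a minimal sketch. The S/G values below are hypothetical and are not taken from the study; R² is taken here as the square of r, which is one common convention and may differ from how the modeling software reports it.

```python
import math

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)

def pearson_r(reference, predicted):
    """Pearson coefficient of correlation between two equal-length sequences."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n
    cov = sum((r - mean_ref) * (p - mean_pred) for r, p in zip(reference, predicted))
    var_ref = sum((r - mean_ref) ** 2 for r in reference)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov / math.sqrt(var_ref * var_pred)

# Hypothetical reference (e.g. pyMBMS) and predicted (e.g. Raman PLS) S/G ratios.
reference = [1.3, 1.7, 2.1, 2.6, 2.8]
predicted = [1.4, 1.6, 2.2, 2.5, 2.9]

rmsep = rmse(reference, predicted)   # every prediction is off by 0.1
r = pearson_r(reference, predicted)
r_squared = r ** 2                   # assumption: R^2 reported as r squared

print(f"RMSEP = {rmsep:.3f}, r = {r:.3f}, R^2 = {r_squared:.3f}")
```

When the same function is applied to an independent validation set, the result is an RMSEP; when applied to cross-validated predictions, it is an RMSECV.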
The motivations for conducting this study were twofold. The first was to evaluate the most robust FT-Raman model for the prediction of lignin S/G ratios from 269 trees from three genera (Acacia, Corymbia, and Eucalyptus) encompassing 17 diverse species. The S/G predictions calculated from this model displayed an accuracy comparable to the pyMBMS reference results, highlighting the use of nondestructive Raman spectroscopy to reduce experimental time and expense. The second was to determine which plants (whether measured directly using pyMBMS or predicted using the Raman model) had the lowest and highest lignin S/G ratios. Evaluations between species were conducted using pairwise comparisons within the measured and predicted S/G data sets to ensure that any statistical differences found within the modeled data could be verified against a widely accepted chemical analysis technique.

Wood Samples
The sampling techniques used to acquire the wood samples in this study have been described in a previous manuscript [8]. In addition to the 245 samples used for the construction of the PLS model, 269 diverse Acacia and eucalypt samples comprised the unknown sample matrix.

Fourier Transform Raman Spectroscopy
The FT-Raman spectral collection parameters have been described in a previous manuscript [8].

Pyrolysis/Molecular Beam Mass Spectrometry
The pyMBMS instrumental and spectral processing methodologies have been previously described [9].

Multivariate Analysis
All modeling was conducted using the Unscrambler X software package (Camo, Inc., Oslo, Norway). Samples used for calibrating and validating the original models were united, creating a 245-sample calibration matrix composed of Acacia, Corymbia, and Eucalyptus trees. The model was evaluated employing a full cross-validation (CV) before predicting the S/G ratios of 269 unknown samples. Overfitting of the data was gauged by analyzing the RMSE or Scree plot and by studying the effects of using non-optimal numbers of factors on the predictive capacity of the model. The most influential variables utilized for model construction were identified from the regression coefficients plot. The model was recalculated using solely these vibrational modes, thereby diminishing spectral noise and subsequently increasing the calibration and prediction accuracy.
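The workflow described above (PLS calibration, full cross-validation, and inspection of the error as a function of the number of factors) can be sketched as follows. This is an illustration only and not the Unscrambler X implementation: it uses a textbook NIPALS algorithm for a single response, assumes that "full cross-validation" means leave-one-out, and runs on synthetic data. The function names `pls1_fit` and `rmsecv_leave_one_out` are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit a single-response PLS model (NIPALS); return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # mean-centred residuals
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)                 # weight vector
        t = Xr @ w                             # scores for this factor
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        c = yr @ t / tt                        # y loading
        Xr = Xr - np.outer(t, p)               # deflate X
        yr = yr - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression coefficients
    return coef, y_mean - x_mean @ coef

def rmsecv_leave_one_out(X, y, n_factors):
    """RMSE of leave-one-out cross-validated predictions."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, intercept = pls1_fit(X[keep], y[keep], n_factors)
        preds[i] = X[i] @ coef + intercept
    return float(np.sqrt(np.mean((preds - y) ** 2)))

# Synthetic "spectra": 40 samples, 60 variables, driven by 3 latent sources.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 3))
X = latent @ rng.normal(size=(3, 60)) + 0.01 * rng.normal(size=(40, 60))
y = 2.0 + latent @ np.array([0.5, -0.3, 0.2]) + 0.01 * rng.normal(size=40)

# Scan the number of factors, as one would when reading an RMSE plot.
for k in (1, 2, 3, 5):
    print(k, round(rmsecv_leave_one_out(X, y, k), 3))
```

Plotting the cross-validated error against the number of factors gives the kind of curve used to judge overfitting: the error typically stops improving once the factors begin to describe noise rather than systematic spectral variance.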

Statistical Analysis
The predicted Raman and measured pyMBMS lignin S/G ratios were compared to assess whether there were statistical differences between the samples. A non-parametric Kruskal-Wallis test (χ² = 155.99, p value < 2.2 × 10⁻¹⁶) was used to evaluate differences between taxa for the Raman S/G predicted values [10]. Post hoc comparisons between taxa were carried out using Mann-Whitney U tests with a Holm adjustment for multiple comparisons [11]. Pyrolysis S/G ratios were analyzed with a standard one-factor analysis of variance (ANOVA) (F(18,182) = 16.82, p value < 2 × 10⁻¹⁶). Tukey's honestly significant differences (HSD) protocol was performed as a post hoc comparison between taxa. To determine if there were any significant differences between the predicted and reference S/G values for each species (Table 2), a Mann-Whitney U test was conducted. Analyses were performed using R Studio (R Studio, version 3.0.2, Boston, MA, USA).
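The Holm adjustment applied to the post hoc Mann-Whitney U tests can be sketched in a few lines. The analyses in this study were run in R; the Python function below is an illustrative re-implementation of the step-down Holm procedure, and the raw p values are hypothetical placeholders rather than results from the study.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p values in the same order as the input."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p values from three pairwise Mann-Whitney U tests.
pairs = [
    ("A. microbotrya", "E. dunnii"),
    ("A. saligna", "E. crebra"),
    ("E. crebra", "E. loxophleba"),
]
raw_p = [0.01, 0.04, 0.03]

for pair, p_adj in zip(pairs, holm_adjust(raw_p)):
    verdict = "different" if p_adj < 0.05 else "not distinguished"
    print(pair, round(p_adj, 3), verdict)
```

Because the adjusted values grow with the number of comparisons, a pair that looks significant in isolation can lose significance once all pairwise tests are considered together.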

Results and Discussion
In a previous study, PLS models employing first derivative Raman spectra and an extended multiplicative scatter correction (EMSC) provided the highest accuracy and robustness [8]. These models, developed to predict lignin S/G ratios in Acacias and eucalypts, contained randomly generated calibration and validation sets encompassing 195 and 50 samples, respectively. Figure 1 provides a comparison between first derivative, EMSC-transformed Raman spectra of Acacia microbotrya (black) and Eucalyptus globulus subspecies globulus (red) trees. These two specific samples represent the extremes encompassed in the pyMBMS measurements, and the Raman spectra were analyzed in an attempt to elucidate the spectral differences corresponding to this range. While spectral differences near 781, 1037, 1150, 1259, 1332, 1603, and 1627 cm⁻¹ can be identified between the two samples, lignin and its derivatives have vibrational modes at these locations corresponding to both S and G moieties, as well as lignin skeletal and phenyl ring vibrations, making the assignment of these bands challenging (see Table 1). Cellulose can further complicate the assignment of some of these bands, as it has known Raman peaks near 1119 and 1150 cm⁻¹ [12]. The complexity of the Raman spectra of heterogeneous biomass samples obscures sample-to-sample qualitative comparisons. Figure 1 also illustrates the lack of striking spectral disparities, thereby highlighting the value of employing MVA to home in on otherwise obscured sample variance.
To predict the S/G ratio for the unknown data set consisting of 269 Acacia and eucalypt samples, the validation set was combined into the calibration matrix, providing a new, more robust model. This resulted in a five-factor calibration model with a calibration R² of 0.848 and a validation R² of 0.824, following a full CV, generating an RMSECV of 0.13. It should be noted that these five factors do not represent individual biomass constituents but rather represent sources of spectral variance being drawn out by the model. The score plot produced by the model is shown in Fig. 2 and represents the classification of samples based on the second and third principal components (PCs) or factors. In this plot, the blue squares, red circles, and green triangles represent the Acacia, Corymbia, and Eucalyptus genera, respectively. Three distinct groups can be identified, although the Corymbia and Eucalyptus groups have some overlap. This is expected, as the pyMBMS measured lignin S/G ratios of these two genera are closer to each other than to those of the Acacia samples. Also, as anticipated, Corymbia and Eucalyptus trees measured to contain lower S/G ratios using pyMBMS, such as all Eucalyptus crebra samples (S/G = 1.6 ± 0.4) or a Corymbia torelliana heartwood sample (S/G = 1.5), were closest to the Acacias (A. microbotrya S/G = 1.3 ± 0.1, Acacia saligna S/G = 1.7 ± 0.2). Since the bottom left quadrant contains the plants with lower S/G ratios, on average, the top right quadrant was expected to reveal an opposite trend. Indeed, samples located at the farthest corner of this quadrant show increased pyMBMS lignin S/G ratios (Corymbia citriodora subspecies variegata (CCV) = 2.6, Eucalyptus cladocalyx = 2.6, Eucalyptus dunnii = 2.8, E. globulus = 2.8, and Eucalyptus moluccana = 2.5). Given the lack of statistical differences between many of these higher S/G samples (see Table 3), however, the classifications are much less defined than for the Acacia cluster. The loading plots for the first three factors are provided in Fig. 3a-c. Loading plots show which vibrational modes are important in composing a specific factor. The vibrational modes of polymeric lignin and its individual phenylpropanoid constituents have similar spectral signatures, complicating the analysis of the loading plots. While specific peaks indicative of G, S, and polymeric lignin can be identified in the loading plots of the first three factors, there is no discernible trend aligning a specific factor with an unambiguous lignin moiety. Rather, each of the loadings contributes G, S, and lignin spectral features to the overall classification. This can be exemplified in Fig. 2, where, as previously discussed, the lower left quadrant contains samples with the lowest S/G ratios, while, in general, higher ratios can be identified along a path to the upper right quadrant. This suggests that both factors 2 and 3 are being employed to develop the classification of the trees based on S/G ratios. Figure 4 shows the linearity of this pyMBMS/Raman model for both the calibration and full CV data sets. The reference and cross-validated lignin S/G ratios deviate from the linear trendline at higher S/G values. This overcompensation of S lignin is likely due to the fact that syringyl units are being preferentially released during the chemical degradation of lignin [13,14]. Regardless of this deviation, the calibration (blue) and full CV (red) trendlines display a strong correlation. Plotting the regression coefficients (Fig. 5) allowed the isolation of integral spectral regions used for constructing the model. Table 1 lists the shaded wavenumber sections identified in Fig. 5, and characteristic lignin and lignin monomer vibrational modes potentially corresponding to these regions, as previously assigned in the literature. It should be noted that, given the complex nature of biomass, the vibrational modes of lignin and lignin monomers may overlap with those of other cell wall constituents.

Fig. 1 Comparison of the first derivative + EMSC Raman spectra of low S/G Acacia microbotrya (black spectrum, S/G = 1.2) and higher S/G Eucalyptus globulus subspecies maidenii (red spectrum, S/G = 3.0), as measured by pyrolysis/molecular beam mass spectrometry. The x-axis is in wavenumbers, while the y-axis represents the Raman intensity. Vertical dashed lines have been added to illustrate spectral differences. EMSC extended multiplicative scatter correction, S/G syringyl to guaiacyl ratio
The model successfully identified and extracted the lignin spectral regions, including the regions of significant variance ascertained from Fig. 1. The regression coefficient plot was evaluated for specific monomeric trends; however, no distinct pattern emerged regarding the relationship between S or G moieties being predominantly positively or negatively correlated (Table 1). Despite some overlap between the Raman spectral assignments, there is a general consensus among the references regarding peak locations and their classification as bonds indicative of lignin and lignin monomers. It should be noted that differences in instrumental configurations can result in variation of vibrational mode peak locations. The strongest vibrational modes of cellulose occur at 1091 and 1117 cm⁻¹ [12]. These and weaker cellulose peaks are encompassed by the spectral regions identified in the Raman regression coefficients plots. Further analysis revealed that the cellulose or polysaccharide vibrational modes were either negatively correlated (Fig. 5, 1091 cm⁻¹), not identified as important to the model construction (1268 cm⁻¹), or positively correlated due to spectral overlap with lignin vibrational modes (e.g., 896, 1117, and 1338 cm⁻¹). Other potential sources of spectral overlap include xylan and extractive material such as proteins, lipids, etc. [12,15-17]. Interestingly, three spectral regions identified in the regression coefficient plot correlated with postulated H lignin markers, as listed in Table 1. These occur between 833-838 and 1176-1178 cm⁻¹, and at 1488 cm⁻¹. The distinctive bands of S and G lignin, as well as cellulose, can be eliminated as corresponding to these vibrational modes. Although its content in hardwoods is often minute, Acacia and Eucalyptus species have been determined to contain potentially 2-9 % H lignin, depending on the age of the trees [18-21]. Further chemical or instrumental analysis, such as thioacidolysis or 2D nuclear magnetic resonance, is required to determine if these vibrational modes correspond to H lignin.

Fig. 3 Graphical representations of the a first PC loadings, b second PC loadings, and c third PC loadings used in the classification of the plant samples by genus

Fig. 4 Plot of the predicted lignin S/G ratio using a model built from first derivative, EMSC-transformed FT-Raman spectra and pyMBMS reference data. The blue and red lines signify the linearity of the calibration and prediction data sets, following a full cross-validation. The x-axis depicts the pyMBMS measured ratio, while the y-axis indicates the Raman predicted ratio. S/G syringyl to guaiacyl ratio, FT Fourier transform, pyMBMS pyrolysis molecular beam mass spectrometry, EMSC extended multiplicative scatter correction
As previously mentioned, the first motivation for conducting this research was to evaluate whether MVA models produced using Raman spectra and pyMBMS S/G ratios could accurately predict the S/G ratios in an unknown sample set, diminishing the need to destructively pyrolyze all samples. The pyMBMS reference S/G ratios averaged for each plant species, including the number of samples for each tree, and the range of S/G ratios contained in each data set were previously reported [8]. Figure 6 illustrates the mean pyMBMS ratios using the 95 % confidence interval. Table 2 reveals the Raman predicted S/G ratios for the plant species in the unknown data matrix. A comparison between the pyMBMS and Raman predicted S/G values for each species using a non-parametric Mann-Whitney U test shows no significant differences at the 0.05 significance level, with the exception of Eucalyptus argophloia (W = 2, p value = 0.03). This significant difference could be attributed to a small reference population (n = 5), different genetic backgrounds, and/or environmental microsite variation, but the statistical comparison clearly illustrates the power and advantages of using robust, high-throughput multivariate modeling to predict lignin monomeric content, as the predicted lignin S/G ratios exhibit strong correlation with the pyMBMS range and average.

[Caption fragment, beginning missing in source] ... variegata, E. argophloia, E. cladocalyx, E. cloeziana, E. crebra, E. dunnii, E. globulus (subspecies globulus and maidenii), E. grandis, E. longirostrata, E. loxophleba, E. moluccana, E. occidentalis, and E. polybractea. S/G syringyl to guaiacyl lignin ratio
Once the Raman/pyMBMS model was verified to exhibit high accuracy, the ensuing question to explore was which samples exhibited the most significant variance in S/G ratios (i.e., which samples were at the measured or predicted S/G extremes). Evaluation of statistical differences within the pyMBMS and Raman data sets exposes some unique trends between the S/G ratios of the species measured. Pairwise comparisons between species in each data set were evaluated using Tukey's honestly significant differences for the pyMBMS data and Holm corrected Mann-Whitney U tests for the Raman data. These statistical analyses were selected since the pyMBMS data roughly followed a normal distribution, whereas the Raman predictions did not (negatively skewed). A p value lower than 0.05 indicates that the S/G ratios of the two species being compared are statistically different, while p values at or above 0.05 indicate statistically indistinguishable S/G ratios. A. microbotrya, A. saligna, and E. crebra are among the lowest S/G ratios in both the reference [8] and predicted data sets (Table 2). The Raman predictions for both Acacia species show significant differences from each Corymbia species (Corymbia citriodora subspecies citriodora (CCC) and CCV) and eight Eucalyptus species (Table 4). This result was confirmed in the pyMBMS data set, where the Acacias displayed statistical differences from 13 other eucalypts (Table 3). Although the S/G ratios of the Acacia trees were similar to E. crebra values in the Raman data set, E. crebra only showed significant disparity with the Corymbia samples and Eucalyptus loxophleba. This could potentially be attributed to the small number (n = 6) of E. crebra samples used to generate the Raman prediction model, the fact that E. crebra was from a limited number of provenances sampled at only one site, or the greater number of Corymbia (CCV, n = 61; CCC, n = 44, sampled across multiple sites and provenances) and E. loxophleba (n = 23, multiple provenances) samples contained in the model, thereby increasing the predictive capabilities of the model for those samples. The need for larger sample sizes to predict significant variance between species is further illustrated by E. cladocalyx (n = 3), which showed similarity with all other species (Table 4). CCV S/G ratios, predicted by Raman spectroscopy, reveal more statistical differences than those of CCC, when compared to the other plant species. E. dunnii, Eucalyptus kochii, Eucalyptus longirostrata, E. moluccana, and Eucalyptus occidentalis show similarities within the genus, but significant variance when juxtaposed with both Acacias (E. kochii differs only from A. microbotrya) and CCC. The prediction of the S/G ratio of unknown E. globulus subspecies maidenii shows no statistical dissimilarity to other eucalypts, with the exception of Eucalyptus cloeziana, Eucalyptus grandis, and Eucalyptus polybractea.
An assessment of the Raman predicted S/G ratios with the pyMBMS data further exemplifies the predictive capabilities of Raman spectroscopic modeling. Using Tables 3 and 4, a total of 136 evaluations were made to investigate which statistical dissimilarities were found in both predicted and reference data sets. Pairwise comparisons between species in the Raman data set identified 40 statistical differences (shown in bold), whereas 54 significant differences (shown in bold) were detected for the same species within the pyMBMS data set (C. torelliana and Corymbia hybrids were excluded from the comparison as they were only tested using the reference method). Of those, 28 evaluations between species were found to be significant in both data sets (denoted with an asterisk). These results clearly illustrate the lower S/G values of the Acacias when contrasted with the eucalypt samples. The other significant variations in S/G ratios, discovered between species within one data set, but not both, are likely due to the input of small sample sizes into the tests (E. grandis (reference), n = 2; E. cladocalyx (predicted), n = 2), or the use of a single factor analysis of variance (ANOVA) and post hoc testing with the pyMBMS data set, which is more robust at elucidating significant discrepancies than non-parametric tests. The data sets clearly illustrate differences in monomeric lignin composition between a diverse group of Acacia and eucalypt wood samples. Further investigation into additional sources of variance, such as age and site effects, will provide understanding into the biological context of wood formation for these important forestry species.
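The counts reported above can be cross-checked with simple arithmetic. The sketch below assumes that the 136 evaluations correspond to every unordered pair among the 17 species in the unknown data set; that reading is consistent with the numbers in the text but is not stated explicitly. The breakdown into agreement categories is derived only from the reported totals (40, 54, and 28).

```python
from math import comb

n_species = 17                      # species in the unknown data set
n_pairs = comb(n_species, 2)        # unordered species pairs

raman_significant = 40              # significant pairs in the Raman data set
pymbms_significant = 54             # significant pairs in the pyMBMS data set
both_significant = 28               # pairs significant in both data sets

# Partition the pairwise comparisons into four mutually exclusive groups.
raman_only = raman_significant - both_significant
pymbms_only = pymbms_significant - both_significant
neither = n_pairs - both_significant - raman_only - pymbms_only

print("pairs:", n_pairs)
print("significant in both:", both_significant)
print("Raman only:", raman_only)
print("pyMBMS only:", pymbms_only)
print("neither:", neither)
print("agreement fraction:", round((both_significant + neither) / n_pairs, 2))
```

Under this assumption, the two methods reach the same verdict (both significant or both not significant) for the pairs counted in the first and last groups, and disagree for the remainder.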

Conclusions
There are more than 900 diverse species of both Acacias and eucalypts, the latter including Corymbia and Eucalyptus. In order to isolate which trees may be the most advantageous for developing biofuels and bio-based chemicals, phenotypic traits that correlate to plant cell wall structure and recalcitrance must be evaluated, such that suitable deconstruction strategies can be postulated. Many of the standard techniques for measuring monomeric content and ratio are laborious, destructive, toxic, and may require complex data analysis, making these methods unsuitable for screening large populations. The employment of Raman spectroscopy can enable the rapid, nondestructive screening of potential feedstocks, such as Acacias and eucalypts, for traits deemed important for biofuel and/or bio-based chemical production, and most attuned to the needs of biorefineries. The construction of a robust, multivariate, high-throughput Raman model has been previously established. The current study examined the practicality of using this model to gauge the lignin S/G ratio in a large unknown data set of Acacias and eucalypts. The means of the predicted Acacia and eucalypt S/G ratios were not statistically different from those measured using pyMBMS, with the exception of E. argophloia, which could be due to the small sample size analyzed, genetic variations, and/or environmental microsite variations. This research shows the potential of using Raman spectroscopy to supplant tedious, destructive methods for the evaluation of the lignin S/G ratios of different biomass.

Fig. 5 Regression coefficients plot illustrating the spectral regions denoted as integral to the model calculation. The black shaded spectral regions illustrate the vibrational modes used in producing the model, while the blue shaded areas signify spectral noise excluded from the model construction. The x-axis is in wavenumbers, while the y-axis is the calculated regression coefficient values. FT Fourier transform, PLS partial least squares

C. citriodora subsp. citriodora (CCC). Values in bold indicate statistically different pairwise comparisons, while the asterisk (*) denotes statistically different comparisons found in both the pyMBMS and Raman data sets. pyMBMS pyrolysis molecular beam mass spectrometry, S/G lignin syringyl to guaiacyl ratio, ANOVA analysis of variance. a E. globulus includes subspecies globulus and maidenii

Table 1
Raman vibrational modes identified from regression coefficient plot and spectral assignments corresponding to lignin and/or lignin monomers

Table 2
Prediction matrix sample characteristics, S/G averages, and comparisons with the pyMBMS measured ratios. a E. globulus includes subspecies globulus and maidenii

Table 4
Statistical comparison between Raman spectroscopy predicted lignin S/G ratios using calculated p values from Kruskal-Wallis test (χ ...). Values in bold indicate a statistically different pairwise comparison, while the asterisk (*) denotes statistically different comparisons found in both the pyMBMS and Raman data sets. S/G lignin syringyl to guaiacyl ratio. a E. globulus includes subspecies globulus and maidenii