Amino acid side chain contribution to protein FTIR spectra: impact on secondary structure evaluation

Prediction of protein secondary structure from FTIR spectra usually relies on the absorbance in the amide I–amide II region of the spectrum. It assumes that the absorbance in this spectral region, roughly 1700–1500 cm−1, arises solely from amide contributions. Yet it is accepted that, on average, about 20% of the absorbance is due to amino acid side chains. The present paper evaluates the contribution of amino acid side chains in this spectral region and the potential to improve secondary structure prediction after correcting for their contribution. We show that β-sheet content prediction is improved upon subtraction of amino acid side chain contributions in the amide I–amide II spectral range. The improvement is appreciable: for instance, the error of prediction of β-sheet content decreases from 5.42% to 4.97% when evaluated by ascending stepwise regression. The other methods tested, partial least squares regression and support vector machines, also show improved accuracy for β-sheet content evaluation. Other structures, such as the α-helix, do not significantly benefit from side chain contribution subtraction; in some cases prediction is even degraded. We show that co-linearity between secondary structure content and amino acid composition is not a main limitation for improving secondary structure prediction. We also show that, even though they are based on different criteria, the secondary structures defined by DSSP and XTLSSTR lead to the same conclusion: only the β-sheet structure clearly benefits from side chain subtraction. We conclude that the benefit of side chain subtraction for the evaluation of the other secondary structure contents is limited by the very rough description of side chain absorbance, which does not take into account variations related to the side chain environment. The study was performed on a large protein set. To deal with the large number of proteins, we worked on protein microarrays deposited on BaF2 slides, and FTIR spectra were acquired with an imaging system.

Supplementary Information: The online version contains supplementary material available at 10.1007/s00249-021-01507-7.
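The side chain correction described in the abstract can be illustrated with a minimal sketch. It assumes that the side chain contribution of a protein is approximated by the sum of per-residue side chain spectra weighted by the amino acid composition, and that this sum is subtracted from the measured spectrum. The function names, the `scale` factor and all numerical values below are hypothetical toy choices for illustration; they are not the parameters of Table S1.

```python
# Sketch: subtract a composition-weighted sum of side chain spectra
# from a protein spectrum. All values are hypothetical toy data.

def side_chain_spectrum(composition, side_chain_spectra):
    """Sum the side chain absorbance at each wavenumber.

    composition: dict, amino acid code -> mole fraction in the protein.
    side_chain_spectra: dict, amino acid code -> list of per-residue
        absorbances, one value per wavenumber.
    """
    n_points = len(next(iter(side_chain_spectra.values())))
    total = [0.0] * n_points
    for aa, fraction in composition.items():
        spectrum = side_chain_spectra.get(aa)
        if spectrum is None:
            continue  # no tabulated side chain band for this residue
        for i, absorbance in enumerate(spectrum):
            total[i] += fraction * absorbance
    return total

def subtract_side_chains(protein_spectrum, composition,
                         side_chain_spectra, scale=1.0):
    """Return the spectrum after removing the scaled side chain sum.

    scale is a hypothetical factor matching the side chain sum to the
    intensity of the measured spectrum.
    """
    sc = side_chain_spectrum(composition, side_chain_spectra)
    return [p - scale * s for p, s in zip(protein_spectrum, sc)]

# Toy example on five wavenumbers (1700, 1650, 1600, 1550, 1500 cm-1).
toy_side_chains = {
    "D": [0.0, 0.0, 0.2, 0.4, 0.0],
    "Y": [0.0, 0.0, 0.0, 0.1, 0.5],
}
toy_composition = {"D": 0.5, "Y": 0.2, "G": 0.3}
toy_protein = [0.1, 1.0, 0.5, 0.7, 0.3]

corrected = subtract_side_chains(toy_protein, toy_composition,
                                 toy_side_chains)
print(corrected)
```

Such a fixed per-residue description ignores how the environment of a side chain modifies its absorbance, which is the limitation the abstract points to.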


Evolution of RMSECV as a function of the number of latent variables (LVs) in PLS models. The numbers circled in red have been used in this work. Prediction does not appear significant for dG. The same result is obtained for dB, dT and dI (not shown). These results were obtained on the original spectra.

Figure S4: Relation between the actual and predicted secondary structure content. The actual structure content is obtained from the analysis of the high-resolution structure using the definitions of XTLSSTR (King and Johnson 1999). XTLSSTR is an alternative to DSSP for obtaining secondary structure content from PDB files. It defines helix (H), β-strand (E), and a series of structures that are not abundant enough to obtain good predictions, including 3₁₀-helix, hydrogen-bonded turn, non-hydrogen-bonded turn and poly(L-proline) II type 3₁-helix. Lower-case letters indicate residues that are not part of the core of the main structures but are located either at the end of a structure or disconnected from it. Predictions were obtained in LOO cross-validation by ASLR using 4 wavenumbers. Evaluation was carried out before (left column, no side chain subtraction) and after (right column, side chain subtracted) subtraction of the side chain contributions. Panel titles: "Prediction for xH, RMSECV = 6.19"; "Prediction for xE, RMSECV = 4.14", β-strand (core).
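The leave-one-out (LOO) cross-validation used for these predictions can be sketched as follows. This is only an illustration of the LOO procedure itself: for brevity it fits a one-variable least-squares line, not the ASLR model with 4 wavenumbers, and the function names and toy values are hypothetical.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def loo_predictions(x, y):
    """Predict each sample from a model trained on all other samples."""
    predictions = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        a, b = fit_line(x_train, y_train)
        predictions.append(a + b * x[i])
    return predictions

def rmse(predicted, reference):
    """Root mean square error between predictions and reference values."""
    n = len(reference)
    return math.sqrt(
        sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n
    )

# Hypothetical toy data: absorbance at one wavenumber versus the
# reference structure content (%) of five proteins.
absorbance = [0.10, 0.20, 0.30, 0.40, 0.50]
content = [12.0, 22.0, 29.0, 41.0, 52.0]

cv_predictions = loo_predictions(absorbance, content)
rmsecv = rmse(cv_predictions, content)
print(rmsecv)
```

Because each protein is predicted by a model that never saw it, the resulting RMSECV estimates the error expected for a new protein rather than the quality of the fit.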

Figure S5: ASLR prediction of XTLSSTR-defined structures: detailed analysis for the β-strand. Detailed analysis of the prediction by ASLR of the total β-strand content defined by XTLSSTR, before (no side chain subtraction) and after (side chain subtracted) subtraction of the side chain contribution. Left: difference between predicted and actual content as a function of the spectrum number; right: the predicted concentrations are reported as a function of the actual (reference) concentrations. The best fit is indicated by the central dashed line; the two other red lines indicate ± one standard deviation. The amino acid composition is reported below for these proteins.
Amino acid composition of the proteins singled out in the figure above.

Table S1: Parameters describing side chain contribution between 1800 and 1400 cm−1.
The parameters below are presented in a format that can be read directly by MATLAB. Copying and pasting the lines below should load the data. Finally, a plot is provided, using the relative mean amino acid content found in cSP92.

Table S2: Characterization of models based on SVM (top), PLS (middle) and ASLR (bottom) for the prediction of DSSP-defined secondary structure content in cSP92 protein FTIR spectra. The left part of the table reports results obtained in leave-one-out cross-validation mode, the right part results obtained on the Kennard-Stone subset (1/3 of the proteins). STDDEV REF is the standard deviation of the reference (DSSP) secondary structure content in the test set, RMSECV is the root mean square error in leave-one-out cross-validation and RMSEKS is the root mean square error for the Kennard-Stone subset. The ratio reported for cross-validation is defined as STDDEV REFCV/RMSECV and the ratio reported for the Kennard-Stone subset as STDDEV REFKS/RMSEKS; r is the correlation coefficient. The number of LVs used is indicated for PLS. For ASLR, 4 wavenumbers have been used. Spectra were either the original spectra (raw spectra, dark yellow) or the spectra corrected for amino acid side chain contributions (aa corrected, light yellow).
Column headers: cross-validation (raw spectra, aa corrected); Kennard-Stone (raw spectra, aa corrected).
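The quantities defined in this caption can be computed as in the short sketch below. It assumes the sample standard deviation (n − 1 denominator) for STDDEV REF, which the caption does not specify; the function names and the toy values are hypothetical.

```python
import math

def std_dev(values):
    """Sample standard deviation (n - 1 denominator, an assumption)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def rmse(predicted, reference):
    """Root mean square error of the predictions."""
    n = len(reference)
    return math.sqrt(
        sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n
    )

def pearson_r(predicted, reference):
    """Pearson correlation coefficient between predictions and reference."""
    n = len(reference)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    spr = sum((p - mean_p) * (r - mean_r)
              for p, r in zip(predicted, reference))
    spp = sum((p - mean_p) ** 2 for p in predicted)
    srr = sum((r - mean_r) ** 2 for r in reference)
    return spr / math.sqrt(spp * srr)

def summarize(predicted, reference):
    """Collect STDDEV REF, RMSE, their ratio, and r for one test set."""
    sd = std_dev(reference)
    err = rmse(predicted, reference)
    return {
        "stddev_ref": sd,
        "rmse": err,
        "ratio": sd / err,  # STDDEV REF / RMSE
        "r": pearson_r(predicted, reference),
    }

# Hypothetical reference and predicted structure contents (%).
reference = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
summary = summarize(predicted, reference)
print(summary)
```

A ratio close to 1 means the model predicts no better than the mean of the reference values; larger ratios indicate that the error is small relative to the spread of the structure content in the test set.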