Introduction

Proteins are among the most important biological macromolecules, performing a variety of functions such as enzyme catalysis, ion and molecular transport, antibody production, and regulation of cellular and physiological activity. Protein activities are heavily influenced by the three-dimensional structure of the protein1. Furthermore, protein and protein complex structures provide a wealth of information for understanding inter-residue interactions and the phenomena governed by them, such as protein folding mechanisms, folding and unfolding rates, protein structure stability, stability upon mutation, and recognition mechanisms of protein–protein, protein–nucleic acid and protein–ligand complexes, which are instrumental for structure-based drug design2,3. Thermophilic proteins (TPPs) have already established a critical role in biotechnology and chemical processing4. TPPs are stable at high temperatures of about 80–100 °C, corresponding to the environmental temperature of the host organism5,6. Additionally, specific amino acid properties such as shape, Gibbs free energy change of hydration in native proteins, dipeptide composition, contacts between amino acid residues, number of ion pairs, hydrogen bonds, packing, and aromatic clusters all play an important role in TPP stability5,7. According to a thorough examination of all interactions, hydrophobicity is the most important feature in TPP stability, followed by ion pairs and hydrogen bonds8. Understanding the molecular basis of protein thermostability is critical for designing proteins for specific industrial and medical applications that require exceptional stability3. Furthermore, TPPs are resistant to denaturation by chemical agents such as detergents, surfactants and oxidizing agents, and to degradation by proteases9,10. As a result of these properties, TPPs can be easily purified by heat treatment and can withstand harsh industrial conditions for a longer period of time11. It should be noted that higher thermostability of therapeutic proteins can extend their blood survival time12.
As for their advantages in high-temperature industrial catalysis, TPPs offer reduced contamination, easier mixing owing to low viscosity, a high mass transfer rate, and higher solubility of substrates and products13. Furthermore, TPPs are advantageous in the high-temperature pelleting process14 and in endothermic processes such as the isomerization of glucose to generate high fructose syrups15. Although experimental methods are the definitive way to certify the thermostability of proteins, these methods are usually labor-intensive, time-consuming and expensive. Thus, it is desirable to develop a rapid and accurate approach for identifying TPPs from a large collection of proteins.

Several previous studies have shown that machine learning (ML)-based tools can accurately characterize various protein functions using only protein primary sequences16,17,18,19,20,21,22,23,24. Several ML-based computational efforts have been made in recent years to identify TPPs20,21,24,25,26,27,28,29,30,31,32,33, as summarized in Table 1. As can be seen from Table 1, the support vector machine (SVM) method is the most widely used technique for identifying TPPs20,21,24,25,26,28,29,30. The first TPP predictor was developed by Zhang and Fang31 using amino acid composition (AAC) descriptors and the partial least squares (PLS) method on a small set of training data (76 TPPs and 76 mesophilic proteins (MPPs)). Afterwards, the same group32 introduced a LogitBoost predictor based on a larger dataset consisting of 3521 TPPs and 4895 MPPs (called Zhang2007). In 2008, Gromiha et al.27 established a new dataset (called Gromiha2008) by applying the CD-HIT program34 with a threshold of 0.4 to the Zhang2007 data so as to remove additional redundant sequences. In 2011, Lin et al.20 constructed a more reliable benchmark dataset containing 915 TPPs and 793 non-TPPs (called Lin2011). Using this dataset, ThermoPred was developed by means of the SVM method in conjunction with AAC and dipeptide composition (DPC); it achieved an accuracy (ACC) of 0.933 as evaluated by jackknife cross-validation, an improvement over the model of Gromiha et al.27 in their comparative analysis. In addition, Fan et al.25 introduced a new TPP predictor (called PSSM400_pKa) based on the SVM method and trained on three different feature encodings, namely AAC, acid dissociation constant (pKa) and position-specific scoring matrices (PSSM). The PSSM400_pKa predictor was developed based on the Gromiha2008 dataset and its predictive performance was validated using two independent test datasets; the Gromiha2008 data and these two independent test datasets are collectively referred to as Fan2016.

Table 1 Summary of existing ML-based models for thermophilic protein prediction.

Although existing methods can achieve good predictive performance, their overall utility is limited in terms of interpretability and practical application. The following important issues need to be addressed. Firstly, SVM-based predictors are not easy to use and are difficult for biologists and biochemists to implement on their own datasets. Moreover, the ability of biologists and biochemists to understand the resulting models is of great importance if these models are to be applied in a real-world setting. Secondly, existing datasets do not include comprehensive sets of TPPs and non-TPPs. Therefore, these datasets might not contain sufficient information for the development of comprehensive TPP predictors. Finally, almost all existing methods (with the exception of ThermoPred20) did not provide a web server for public use; therefore, their practical application is quite limited.

In this paper, we present SCMTPP, a novel, simple-to-implement, and interpretable computational model that is designed to improve predictive performance and model interpretability for the identification of TPPs. Figure 1 summarizes the SCMTPP's overall framework. Firstly, we established an up-to-date dataset (i.e. 1823 TPPs and 3124 non-TPPs) by combining positive and negative samples from datasets of previous studies20,25,32,35. Secondly, propensity scores of 20 amino acids and 400 g-gap dipeptides were estimated via the scoring card method (SCM). Finally, derived propensity scores were used for the development of a prediction model (SCMTPP) based on a scoring function for determining important biophysical and biochemical properties for TPPs. Results indicated that SCMTPP could outperform existing methods and widely used ML-based classifiers in terms of simplicity, interpretability, and practical application (according to tenfold cross-validation and independent tests).

Figure 1

Schematic framework of the development of SCMTPP. This can be summarized into five main steps: (i) Training and independent test datasets preparation, (ii) Feature extraction, (iii) SCM-based model development, (iv) TPPs characterization and (v) SCMTPP webserver construction.

Materials and methods

Dataset preparation

In this study, we created an up-to-date dataset by combining three previously reported datasets, namely Zhang200732,35, Lin201120 and Fan201625, which contained 8419, 1708 and 4684 sequences, respectively. Herein, TPPs and non-TPPs were considered as positive and negative samples, respectively. The positive dataset was extracted from thermophilic organisms20,25,31,32, while the negative dataset integrates non-TPPs extracted from non-thermophilic organisms (i.e. Lin201120) and mesophilic proteins (MPPs) extracted from mesophilic organisms (i.e. Zhang200732,35 and Fan201625). From these, we excluded protein sequences containing nonstandard letters such as “B”, “U”, “X”, or “Z”. Subsequently, redundant sequences were removed by applying the CD-HIT program with a threshold of 0.4 to both the positive and negative datasets so as to avoid overestimating the model performance. As a result, a total of 4945 sequences containing 1823 TPPs and 3124 non-TPPs were obtained, which we consider the largest and most up-to-date dataset for this task. Among these, we randomly selected 80% of the positive dataset (1482 TPPs) and an equal number of non-TPPs from the negative dataset to construct a training dataset called TPP-TRN (1482 TPPs and 1482 non-TPPs). Meanwhile, the remaining TPPs and an equal number of non-TPPs formed the independent test dataset called TPP-IND (371 TPPs and 371 non-TPPs). For reproducibility purposes, the TPP-TRN and TPP-IND datasets can be downloaded from our web server (at http://pmlabstack.pythonanywhere.com/SCMTPP).
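The filtering and balanced splitting steps described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `is_standard` and `balanced_split`, the fixed random seed and the toy sequences are assumptions, and the CD-HIT redundancy removal step is not reproduced here.

```python
import random

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acid letters


def is_standard(seq):
    """True if the sequence is non-empty and uses only standard letters."""
    return len(seq) > 0 and set(seq.upper()) <= STANDARD


def balanced_split(positives, negatives, train_frac=0.8, seed=0):
    """Hypothetical helper: split positives by train_frac and pair each part
    with an equal number of negatives, never reusing a negative sample."""
    rng = random.Random(seed)
    pos = [s for s in positives if is_standard(s)]
    neg = [s for s in negatives if is_standard(s)]
    if len(neg) < len(pos):
        raise ValueError("not enough negatives for a balanced split")
    rng.shuffle(pos)
    rng.shuffle(neg)
    n_train = round(train_frac * len(pos))
    n_test = len(pos) - n_train
    train = (pos[:n_train], neg[:n_train])
    test = (pos[n_train:], neg[n_train:n_train + n_test])
    return train, test


# Toy usage with made-up sequences (not real proteins).
toy_pos = ["ACDE" + c for c in "ACDEFGHIKL"] + ["AXC"]
toy_neg = ["WYVT" + c for c in "ACDEFGHIKLMNPQR"] + ["BZZ"]
(train_pos, train_neg), (test_pos, test_neg) = balanced_split(toy_pos, toy_neg)
```

Negatives that are not drawn for either part are simply left unused, which mirrors the equal-sized classes reported for TPP-TRN and TPP-IND.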

Feature representation

The g-gap dipeptide composition (GDC) descriptor generalizes the DPC descriptor (which corresponds to \(\mathrm{g}=0\)) by representing the fraction of any two amino acids \({(\mathrm{aa}}_{\mathrm{i}},{\mathrm{aa}}_{\mathrm{j}})\) separated by g intervening residues (i.e. \(j-i=g+1\)) in a given peptide P. This descriptor can be formulated as:

$$\mathrm{GDC }\left(\mathrm{g}\right)=\left[{f}_{1}^{g}, {f}_{2}^{g},\dots {f}_{400}^{g}\right]$$
(1)

where \({f}_{i}^{g}\) is the fraction of the ith (\(i=\mathrm{1,2},\dots ,400\)) g-gap dipeptide among all g-gap dipeptides in the sequence:

$${f}_{i}^{g}=\frac{{n}_{i}^{g}}{{\sum }_{j=1}^{400}{n}_{j}^{g}}$$
(2)

where \({n}_{i}^{g}\) represents the total number of ith g-gap dipeptide in a given peptide P. The dimension of the GDC descriptor is 400.
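Equations (1) and (2) can be illustrated with a short sketch. The function name `gdc`, the alphabetical ordering of the 400 dipeptides and the decision to skip pairs containing nonstandard letters are assumptions made for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed (alphabetical) order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def gdc(sequence, g=0):
    """Return the 400-dimensional g-gap dipeptide composition of a sequence.

    A g-gap dipeptide pairs residue i with residue i + g + 1, so g = 0
    gives the ordinary dipeptide composition (DPC).
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    step = g + 1
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # ignore pairs with nonstandard letters
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


# Toy usage: "AKAK" has the adjacent pairs AK, KA, AK.
features = gdc("AKAK", g=0)
ak_fraction = features[DIPEPTIDES.index("AK")]  # 2 of 3 pairs
```

With g = 1 the same toy sequence yields the pairs AA and KK, each with a fraction of 0.5.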

Scoring card method

The SCM method has been demonstrated to perform admirably in terms of conceptual simplicity, ease of implementation and interpretability16,18,36,37,38,39. In 2012, Huang et al.19 first introduced the original SCM method. More recently, Charoenkwan et al. developed an improved version designed for predicting and characterizing anticancer peptides38. It is well-recognized that the SCM method is effective for identifying proteins and providing information on the underlying molecular mechanism of proteins. The following points summarize the benefits of the SCM method. To begin, unlike well-known ML methods (such as SVM and NB methods), the SCM method uses only one threshold value to distinguish positives from negatives. Second, the SCM method is a cost-effective method for performing a genome-wide prediction of any protein family. Finally, the information from the propensity scores of 20 amino acids and 400 dipeptides helps wet-lab researchers gain insights into the properties of proteins. The following phases describe the concepts and optimization procedures of an SCM classifier trained with GDC (g = 0):

Phase 1: Preparing the TPP-TRN and TPP-IND datasets for SCM classifier development and evaluation.

Phase 2: Calculating initial propensity scores of GDC (\(\mathrm{g}=0\)) using a statistical approach. For convenience of discussion, we denote propensity scores of the g-gap dipeptide term as PSGD (g = 0, 1, 2, …, 9). Further details of this statistical approach are provided in our previous studies16,18,36,37,38,39,40.

Phase 3: Optimizing the initial PSGD (g = 0) and estimating the threshold value using a genetic algorithm (GA) in order to improve the predictive performance39. Specifically, the fitness function of the GA balances two factors: the area under the receiver operating characteristic curve (AUC), weighted by \({W}_{1}\), and the Pearson’s correlation coefficient (R value) between the initial and optimized PSGD (g = 0), weighted by \({W}_{2}\). To avoid overfitting, the fitness function \(\mathrm{Fit}\left(.\right)\) was evaluated via a tenfold cross-validation procedure and is represented as follows:

$$\mathrm{Fit}\left(\mathrm{PSGD}\right)=0.9\times \mathrm{AUC}+ 0.1\times \mathrm{R}$$
(3)

Furthermore, the weights \({W}_{1}\) and \({W}_{2}\) were set to 0.9 and 0.1, respectively, based on our previous studies18,37,38,39,40.

Phase 4: Constructing a scoring function S(P) based on the SCM method to calculate TPP score of an unknown protein P. Herein, the scoring function was created using the optimized propensity scores of 400 dipeptides and can be defined as follows:

$$S(P)=\sum_{i=1}^{400}{DP}_{i}{PS}_{i}$$
(4)

where \({DP}_{i}\) and \({PS}_{i}\) represent the total number and the propensity score of the ith dipeptide, respectively.

Phase 5: Identifying the biological function of an unknown protein P using the scoring function S(P). Particularly, for a given unknown protein sequence P, it is classified as TPP if S(P) is greater than the threshold value, otherwise P is classified as non-TPP.

$$\mathrm{Class}\left(P\right)=\left\{\begin{array}{ll}1,& \sum_{i=1}^{400}{DP}_{i}{PS}_{i}>\mathrm{threshold}\\ 0,& \sum_{i=1}^{400}{DP}_{i}{PS}_{i}\le \mathrm{threshold}\end{array}\right.$$
(5)

where \(\mathrm{Class}(P)\) is the predicted label of P, and \(1\) and \(0\) represent prediction results as TPP and non-TPP, respectively.
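The phases above can be sketched in a few functions. This is an illustrative sketch under stated assumptions and not the SCMTPP implementation: the initial scores are assumed to be the difference in mean composition between positives and negatives, rescaled to 0–1000 (the text defers the statistical approach to earlier studies); the feature vectors are assumed to be composition fractions, whereas Eq. (4) is worded in terms of counts; the GA search and the tenfold cross-validation loop are omitted; and all function names are hypothetical.

```python
def initial_scores(pos_feats, neg_feats):
    """Assumed Phase 2 estimate: mean composition in positives minus mean
    composition in negatives, min-max scaled to the range 0..1000."""
    n = len(pos_feats[0])

    def mean(rows, j):
        return sum(r[j] for r in rows) / len(rows)

    diff = [mean(pos_feats, j) - mean(neg_feats, j) for j in range(n)]
    lo, hi = min(diff), max(diff)
    if hi == lo:
        return [500.0] * n
    return [1000.0 * (d - lo) / (hi - lo) for d in diff]


def scm_score(feats, scores):
    """Eq. (4): weighted sum of dipeptide terms and propensity scores."""
    return sum(f * s for f, s in zip(feats, scores))


def classify(feats, scores, threshold):
    """Eq. (5): 1 (TPP) if the score exceeds the threshold, else 0."""
    return 1 if scm_score(feats, scores) > threshold else 0


def auc(pos_scores, neg_scores):
    """AUC as the chance that a positive outscores a negative (ties 0.5)."""
    wins = sum((p > q) + 0.5 * (p == q)
               for p in pos_scores for q in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def pearson_r(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


def fitness(candidate, initial, pos_feats, neg_feats, w1=0.9, w2=0.1):
    """Eq. (3): w1 * AUC + w2 * R between initial and candidate scores."""
    pos = [scm_score(f, candidate) for f in pos_feats]
    neg = [scm_score(f, candidate) for f in neg_feats]
    return w1 * auc(pos, neg) + w2 * pearson_r(initial, candidate)


# Toy usage with 3-dimensional features instead of 400 dipeptides.
toy_pos = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
toy_neg = [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]]
init = initial_scores(toy_pos, toy_neg)
fit = fitness(init, init, toy_pos, toy_neg)
label = classify(toy_pos[0], init, threshold=500)
```

In this sketch a GA would propose candidate score vectors and keep those with the highest `fitness`; the correlation term discourages candidates that drift far from the statistically estimated scores.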

Characterization of thermophilic proteins using SCMTPP

Propensity scores of 20 amino acids were estimated and used in this study to provide a better understanding of the biophysical and biochemical properties of TPPs using SCMTPP. Particularly, a statistical approach was used to calculate the propensity score of each amino acid. The propensity score for Glu, for example, is calculated by averaging the propensity scores of the 40 dipeptides that contain Glu. In addition, the propensity scores of 20 amino acids were used to identify a set of informative physicochemical properties (PCPs) from the amino acid index database (AAindex)41, by computing the R values between the propensity scores of 20 amino acids and the values of each of 531 PCPs.
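The averaging and correlation steps can be sketched as follows. The sketch assumes that the 40 dipeptides for a residue are the 20 pairs starting with it plus the 20 pairs ending with it, so that the homodipeptide (e.g. EE) is counted twice; the function names and the toy scores are hypothetical, and no real AAindex values are used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_scores(dipeptide_scores):
    """Average the 40 dipeptide entries (20 'aX' plus 20 'Xa') per residue.
    The homodipeptide 'aa' belongs to both groups and is counted twice."""
    result = {}
    for a in AMINO_ACIDS:
        terms = [dipeptide_scores[a + x] for x in AMINO_ACIDS]
        terms += [dipeptide_scores[x + a] for x in AMINO_ACIDS]
        result[a] = sum(terms) / len(terms)
    return result


def pearson_r(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    var_x = sum((p - mx) ** 2 for p in x)
    var_y = sum((q - my) ** 2 for q in y)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


def rank_properties(aa_scores, properties):
    """Sort property names by R against the amino acid propensity scores.
    `properties` maps a name to a dict of 20 per-residue values."""
    x = [aa_scores[a] for a in AMINO_ACIDS]
    rs = {name: pearson_r(x, [vals[a] for a in AMINO_ACIDS])
          for name, vals in properties.items()}
    return sorted(rs.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage: every dipeptide earns 100 points per Glu (E) it contains.
toy_dipeptides = {a + b: 100.0 * ((a == "E") + (b == "E"))
                  for a in AMINO_ACIDS for b in AMINO_ACIDS}
aa = amino_acid_scores(toy_dipeptides)
ranked = rank_properties(aa, {
    "inverse": {a: -aa[a] for a in AMINO_ACIDS},
    "same": dict(aa),
})
```

Properties at the top and bottom of such a ranking correspond to the highest and lowest R values, respectively.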

Performance evaluation

In order to evaluate the prediction ability of the model, we used four widely used metrics for the two-class prediction problems as follows:

$$\mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\left(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}\right)}$$
(6)
$$\mathrm{Sn}=\frac{\mathrm{TP}}{\left(\mathrm{TP}+\mathrm{FN}\right)}$$
(7)
$$\mathrm{Sp}=\frac{\mathrm{TN}}{\left(\mathrm{TN}+\mathrm{FP}\right)}$$
(8)
$$\mathrm{MCC}=\frac{\mathrm{TP}\times \mathrm{TN}-\mathrm{FP}\times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}$$
(9)

where ACC, Sn, Sp and MCC represent accuracy, sensitivity, specificity and Matthews correlation coefficient, respectively. The numbers of correctly predicted TPPs and non-TPPs are indicated by TP and TN, respectively. Furthermore, FP stands for the number of non-TPPs that are predicted to be TPPs, and FN stands for the number of TPPs that are predicted to be non-TPPs. The proposed model was also compared to previously described models using the receiver operating characteristic (ROC) curve, which is a threshold-independent measure. Accordingly, the area under the ROC curve (AUC) was used to evaluate prediction performance, with AUC values of 0.5 and 1 denoting random and perfect models, respectively42,43,44,45,46,47.
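Equations (6)–(9) translate directly into code. The sketch below is illustrative: the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example and is not taken from the text.

```python
from math import sqrt


def confusion(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = TPP, 0 = non-TPP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Return ACC, Sn, Sp and MCC as defined in Eqs. (6)-(9)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Toy usage with made-up labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
scores = metrics(*confusion(y_true, y_pred))
```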

Analysis of three-dimensional structure of thermophilic proteins

Herein, Galaxy TBM (http://galaxy.seoklab.org/index.html) was used for the determination of three-dimensional structures of TPPs and non-TPPs. The workflow of protein modelling consisted of two main stages: (i) selecting reliable models that are aligned with PROMALS3D48 and MODELLERCSA49 models and (ii) detecting and remodelling loop areas using the refining method. Particularly, protein structures of selected models were refined using 3Dpro (http://scratch.proteomics.ics.uci.edu/explanation.html#3Dpro) and GalaxyRefine (http://galaxy.seoklab.org/cgi-bin/submit.cgi?type=REFINE). Finally, the ProSA-web server (https://prosa.services.came.sbg.ac.at/prosa.php) and Ramachandran plots were used to validate the three-dimensional structures. Moreover, hydrophobic and charged surfaces were visualized using the BIOVIA Discovery Studio software (Dassault Systèmes BIOVIA, Discovery Studio Modeling Environment, Release 2018, San Diego: Dassault Systèmes, 2016).

Results and discussion

Prediction assessment of different propensity scores of g-gap dipeptides

The predictive performance of SCM classifiers trained with different PSGD (g = 0–9) was evaluated by means of tenfold cross-validation and independent tests on the TPP-TRN and TPP-IND datasets, respectively. For each gap value g, the GA was used to optimize and generate 10 sets of propensity scores, yielding 10 different SCM classifiers. Among these ten sets, the one with the highest cross-validation MCC was chosen as the best. Supplementary Tables S1-S10 list the predictive performance of the various SCM classifiers trained with PSGD (g = 0–9). Moreover, the predictive performance of the 10 SCM classifiers trained with the 10 optimal sets of PSGD (g = 0–9), as evaluated by tenfold cross-validation and independent tests, is summarized in Tables 2 and 3, respectively.

Table 2 Cross-validation results of SCM models using different optimal propensity scores of g-gap dipeptides.
Table 3 Independent test results of SCM models using different optimal propensity scores of g-gap dipeptides.

The mean ± SD values of ACC, Sn, Sp, MCC and AUC over the 10 SCM classifiers are 0.867 ± 0.006, 0.871 ± 0.012, 0.864 ± 0.015, 0.735 ± 0.013 and 0.916 ± 0.005, respectively, using tenfold cross-validation. As can be seen from Table 2, PSGD (g = 0) achieved the highest ACC of 0.883 with an MCC of 0.766 and an AUC of 0.926. Furthermore, PSGD (g = 1) and PSGD (g = 3) also performed well, as they afforded the second and third highest ACC of 0.872 and 0.869, respectively. In the case of independent test results, Table 3 shows that the mean ± SD values of ACC, Sn, Sp, MCC and AUC over the 10 SCM classifiers are 0.850 ± 0.010, 0.842 ± 0.017, 0.858 ± 0.016, 0.700 ± 0.019 and 0.909 ± 0.006, respectively. PSGD (g = 6) achieved the highest ACC and MCC of 0.867 and 0.733, respectively, while PSGD (g = 0) achieved the second highest ACC and MCC of 0.865 and 0.731, respectively. From Table 3, it can be observed that PSGD (g = 0) achieved independent test results very comparable to those of PSGD (g = 6) in terms of all metrics (i.e. ACC, Sn, Sp, MCC and AUC). Taking both the tenfold cross-validation and independent test results into consideration, the SCM classifier trained with PSGD (g = 0) (i.e. the propensity scores of dipeptides) was judged to be the optimal one for the identification of TPPs and is referred to as SCMTPP. Further details of the propensity scores of dipeptides are depicted in Fig. 2.

Figure 2

Propensity scores of 400 dipeptides as obtained from the proposed SCMTPP.

Comparison of initial and optimized propensity scores

The improved predictive performance of SCMTPP is mainly due to the estimated propensity scores of dipeptides derived from the SCM approach. In order to understand this phenomenon, we first compared the predictive performance of the optimized (optimized-PS) and initial (initial-PS) propensity scores of dipeptides. Table 4 shows the predictive performance of optimized-PS and initial-PS as evaluated by tenfold cross-validation and independent tests. As shown in Table 4, the optimized-PS achieved cross-validation ACC, Sp and MCC of 0.883, 0.887 and 0.766, which represent improvements of 3.9%, 5.8% and 7.8%, respectively, over those of the initial-PS. Furthermore, independent test results of the optimized-PS were found to be consistently higher than those of the initial-PS. Particularly, the optimized-PS afforded improvements in ACC, Sp and MCC of 1.7%, 3.7% and 3.8%, respectively, when compared to the initial-PS. In addition, histogram plots were used to represent the scores of TPPs and non-TPPs as derived from SCMTPP using the initial-PS (Fig. 3A) and the optimized-PS (Fig. 3B). As can be seen in Fig. 3, the optimized-PS shows a clearer distinction between TPPs and non-TPPs, thereby indicating that the optimized-PS was more effective than the initial-PS for discriminating TPPs from non-TPPs.

Table 4 Cross-validation and independent test results of SCM-based classifiers using initial-PS and optimized-PS.
Figure 3

Histogram plots representing the scores of thermophilic and non-thermophilic proteins as derived from SCMTPP using initial (A) and optimized (B) dipeptide propensity scores on the training dataset, where the mean and standard deviation are indicated by bars and closed circles, respectively.

Comparison of SCMTPP with well-known ML classifiers and the existing method

In order to assess the predictive effectiveness of the proposed SCMTPP, we compared its performance with that of well-known ML classifiers as well as with the existing method on the same training and independent test datasets. Herein, we constructed and optimized several ML classifiers using SVM, decision tree (DT), k-nearest neighbor (KNN) and naive Bayes (NB) with AAC, DPC and amino acid index (AAI) features. All of these ML classifiers were constructed using the scikit-learn Python machine learning package (version 0.22)50. Figure 4 and Supplementary Tables S11-S12 summarize the results of SCMTPP and the ML classifiers as evaluated by tenfold cross-validation and independent tests. With regard to the existing methods, Table 1 shows that three of them (i.e. Montanucci et al.’s method21, ThermoPred20 and Zuo et al.’s method33) were available as webservers. However, ThermoPred was the only webserver that was functional at the time of this manuscript’s preparation. Therefore, the performance of SCMTPP was compared only with ThermoPred, and the results are reported in Table 5.

Figure 4

Performance evaluations of SCMTPP and conventional TPP predictors. (A,B) tenfold cross-validation of ACC and MCC from SCMTPP versus conventional TPP predictors. (C,D) Independent test of ACC and MCC from SCMTPP versus conventional TPP predictors.

Table 5 Cross-validation and independent test results of SCMTPP and ThermoPred.

Insights gained from Fig. 4, Table 5 and Supplementary Tables S11-S12 can be summarized as follows: (i) two SVM-based classifiers, SVM-DPC and SVM-AAC, were found to achieve the two highest performances, with ACC (cross-validation and independent test) of (0.910 and 0.904) and (0.906 and 0.898) for SVM-DPC and SVM-AAC, respectively; (ii) SCMTPP achieved performance very comparable to that of these two classifiers as well as ThermoPred, with cross-validation and independent test ACC of 0.883 and 0.865, respectively; (iii) SCMTPP and the SVM-based classifiers (except for SVM-AAI) performed better than the DT-based, KNN-based and NB-based classifiers. Particularly, the cross-validation ACC of SCMTPP was 7.05–16.83%, 3.78–14.68% and 1.86–14% higher than that of the DT-based, KNN-based and NB-based classifiers, respectively. It is well-known that the SVM method is a complicated approach from which the underlying biological implications are not straightforward to derive16,18,36,37,38,39,40. On the other hand, the SCM method is based on a simple weighted-sum approach that is easier for biologists to understand and provides interpretable propensity scores of dipeptides. Altogether, these comparative results indicate that the proposed SCMTPP predictor is well suited for the identification and analysis of TPPs in terms of conceptual simplicity, ease of implementation and effectiveness.

Identification of potential thermophilic proteins

Unlike existing methods, the proposed SCMTPP predictor is an easy-to-use and cost-effective tool for determining the likelihood that uncharacterized proteins are TPPs using a simple scoring function \(S(P)\)16,18,36,37,38,39,40. Recently, Charoenkwan et al. used the SCM method to determine a new potential peptide-based drug for the hypoxia inducible factor 1α (HIF-1α)36. Herein, the scoring function \(S\left(P\right)\) was used to calculate TPP scores (PS-TPP) for all proteins in the TPP-TRN dataset. Table 6 records the ten top-ranked proteins having the highest TPP scores along with their name, PS-TPP, UniProt ID, function and source organism. As seen in Table 6, all of the ten top-ranked proteins exhibited TPP scores greater than 418. In addition, Fig. 5 depicts three-dimensional structures of TPPs (Q9YFR9, Q57676 and Q9YD25) and non-TPPs (Q8ZDC4, Q66A07 and A1AZ52) having the highest (528.74, 527.79 and 525.29, respectively) and lowest (319.67, 331.20 and 340.61, respectively) TPP scores, respectively. The five top-ranked proteins, with their TPP scores and UniProt IDs, were: 50S ribosomal protein L38E (528.74, Q9YFR9), Uncharacterized protein MJ0223 (527.79, Q57676), 50S ribosomal protein L31e (525.29, Q9YD25), Protein Grp (519.54, Q9WZV4) and Elongation factor 1-beta (519.28, Q8TYN8). The ten proteins came from five organisms: Aeropyrum pernix (Q9YFR9, Q9YD25 and P58289), Archaeoglobus fulgidus (O28071), Methanocaldococcus jannaschii (Q57676), Methanopyrus kandleri (Q8TYN8, Q8TX34, Q8TXI4 and Q8TWL9) and Thermotoga maritima (Q9WZV4). Interestingly, the uncharacterized protein MJ0223 is from Methanocaldococcus jannaschii, which is an anaerobic thermophilic archaeon51.

Table 6 Top ten TPPs having the highest PS-TPP derived from the proposed SCMTPP.
Figure 5

Three-dimensional structures of TPPs (Q9YFR9, Q57676 and Q9YD25) and non-TPPs (Q8ZDC4, Q66A07 and A1AZ52) having the highest (528.74, 527.79 and 525.29, respectively) and lowest (319.67, 331.20 and 340.61, respectively) TPP scores, respectively, where the optimal threshold value is 418.

Characterization of thermophilic proteins using propensity scores of amino acids

In this section, the propensity scores of 20 amino acids and 400 dipeptides to be TPPs were analyzed to provide a good understanding of the physicochemical properties of TPPs. As mentioned above, these propensity scores were generated by SCMTPP based on the training dataset containing 1482 TPPs and 1482 non-TPPs. Table 7 records the propensity scores of amino acids along with the percentage of amino acid compositions, while Fig. 2 displays the propensity scores of dipeptides. As seen in Table 7, the correlation coefficient R between the propensity scores of amino acids and the difference in the percentage of amino acid compositions between TPPs and non-TPPs is 0.96. This again confirmed that the propensity scores of amino acids and dipeptides have the discriminative power to capture the key differences between TPPs and non-TPPs. Considering the propensity scores of amino acids, the top-five amino acids to be TPPs consisted of Glu, Lys, Val, Arg and Ile with scores of 510.18, 480.00, 470.75, 464.08 and 435.65, respectively, while the top-five amino acids to be non-TPPs consisted of Gln, Thr, Ala, Asn and Phe with scores of 255.43, 306.00, 323.63, 332.48 and 351.25, respectively. In the case of the propensity scores of dipeptides, the eleven top-ranked dipeptides to be TPPs consisted of EE, GW, SG, WS, KY, YP, PW, IM, VY, EG and RI with scores of 1000, 979, 956, 952, 908, 881, 876, 864, 860, 853 and 838, respectively, while the eleven top-ranked dipeptides to be non-TPPs consisted of AA, LQ, NM, FW, MQ, AD, MT, SI, QL, QA and AQ with scores of 0, 11, 27, 41, 47, 71, 99, 104, 115, 129 and 144, respectively.

Table 7 Propensity scores of twenty amino acids in becoming a thermophilic protein (PS-TPP) along with amino acid compositions (%) of TPPs and non-TPPs.

As shown in Table 7, the ranks of the top-five amino acids to be TPPs (propensity, difference) for Glu, Lys, Val, Arg and Ile are (1, 1), (2, 2), (3, 3), (4, 4) and (5, 5), respectively, while the ranks of the top-five amino acids to be non-TPPs for Gln, Thr, Ala, Asn and Phe are (20, 20), (19, 18), (18, 19), (17, 16) and (16, 13), respectively. Many previous studies indicated that Glu, Lys and Arg had higher occurrence in TPPs than in MPPs20,27,28,35,52,53,54,55. For example, Haney et al.53 conducted a comprehensive analysis of 115 protein sequences from M. jannaschii. Their amino acid composition analysis showed that Ile, Arg, Glu, Lys and Pro play an important role in the thermostability of proteins, while Ser, Asn, Gln, Thr and Met contributed to the mesostability of proteins. Haney et al.53 also reported that important physicochemical and biochemical properties for TPPs consisted of hydrophobicity, charged and uncharged polar residues. Zhang and Fang35 provided a residue distribution analysis by employing DPC on 3521 TPPs and 4895 MPPs. Based on their analysis, they reported that the dipeptide compositions of EX and KX were significantly higher in TPPs than in MPPs, while the dipeptide compositions of AX, HX, NX, QX and TX were significantly higher in MPPs than in TPPs, where X denotes any amino acid. In 2004, Ding et al.54 mainly focused on the influence of single amino acid composition on TPPs by analyzing a large dataset containing three thermophilic organisms, ten hyperthermophilic organisms and 52 mesophilic organisms, which were collected from the NCBI database. From amongst 400 dipeptides, archaeal proteins had compositions of VK, KI, YK, IK, KV, KY and EV that effectively contributed to the increase of TPPs, while compositions of DA, AD, TD, DD, DT, HD, DH, DR and DG contributed to the increase of MPPs. Meanwhile, bacterial proteins had compositions of KE, EE, EK, YE, VK, KV, KK, LK, EI, EV, RK, EF, KY, VE, KI, KG, EY, FK, KF, FE, KR, VY, MK, WK and WE that contributed to the increase of TPPs, while compositions of WQ, AA, QA, MQ, AW, QW, QQ, RQ, QH, HQ, AD, AQ, WL, QL, HA and DA contributed to the increase of MPPs. Altogether, our estimated propensity scores of amino acids as derived from SCMTPP are quite consistent with those of previous studies20,27,28,54,55,56. However, there are other factors responsible for improving the thermal stability of proteins, such as hydrogen bonds, hydrophobic interactions, electrostatic interactions, α-helix formation and the entropy of unfolding55,57. More details on the characterization of the thermal stability of proteins are described below.

Characterization of thermophilic proteins using informative PCPs

Numerous studies have demonstrated that biochemical and biophysical properties such as side chain56,58 or beta-sheet propensity22 are essential for understanding the thermostability of proteins. As can be seen in Table 8, the three informative PCPs selected by SCMTPP, along with their corresponding R values, consisted of FUKS010101 (R = 0.616), FUKS010102 (R = 0.523) and FUKS010109 (R = 0.307). In addition, the top-twenty informative PCPs having the highest and lowest R values are recorded in Supplementary Tables S13 and S14, respectively.

Table 8 Summary of four important physicochemical properties as determined by SCMTPP.

The FUKS010101 property is described as the Surface composition of amino acids in intracellular proteins of thermophiles (percent) (Fukuchi-Nishikawa, 2001)56. Fukuchi and Nishikawa suggested that proteins from thermophilic bacteria had 45.1% charged residues containing 23.6% negatively charged residues and 21.5% positively charged residues on the surface, which was found to be higher than those of other groups (19.9% nonpolar residues, 16.6% polar residues and 18.5% others)56. Figure 6 provides an example on the interpolated charge surface plot of TPPs and non-TPPs. Figure 6A,B shows interpolated charge surface plots of Q9YFR9 (TPP) and P0A223 (non-TPP). The blue surfaces of the P0A223 indicates that the interpolated charge of the entire P0A223 is higher than that of P0A223. In general, the interpolated charge surface are often used to determine hydrogen bonding patterns, electrostatic interaction and strengths of salt bridges in biomolecular simulations59. Many studies have also confirmed that amino acids with charged side chains could be regarded as the important factor for the increase of the thermostability of proteins35,57 where positively and negatively charged amino acids contain (Arg, His and Lys) and (Asp and Glu), respectively. As shown in Table 8, the ranks of propensity scores (PS-TPP, FUKS010101) for Lys, Glu, Arg, Asp and His are (1, 1), (2, 2), (4, 3), (11, 5) and (14, 17), respectively. Interestingly, from amongst these charged amino acids, three of these were found in the top-five amino acids contributing to TPPs (i.e. Lys, Glu and Arg). At the typical biological pH, Lys and Glu is capable of carrying a charge for forming hydrogen bonds. This phenomenon render it as one of the crucial factors that is responsible for enhancing the thermostability of proteins. In the meanwhile, it is well-recognized that TTPs could participate in salt bridge interaction, which is known as a typical charge–charge interaction between oppositely charged residues. 
Many research groups have shown that the number of salt bridges is positively correlated with the thermostability of proteins35,60,61,62,63. Interestingly, the FUKS010101 and FUKS010102 properties are described in the AAindex as Surface composition of amino acids in intracellular proteins of thermophiles (percent) and of mesophiles (percent) (Fukuchi-Nishikawa, 2001)56, respectively, while the ZIMJ680101 property is described in the AAindex as Hydrophobicity (Zimmerman et al., 1968). Specifically, the FUKS010101 and FUKS010102 properties suggest that the fraction of hydrophobic residues in the surface composition of thermophilic bacteria (19.9%) is comparable to that of mesophilic bacteria (17.3%)56. Figure 7A,B shows example surface hydrophobicity plots of Q9YFR9 (TPP) and P0A223 (non-TPP). The brown surfaces of Q9YFR9 were found to be quite similar to those of P0A223. Vieille and Zeikus13 conducted a comparative analysis of residue contents between TPPs and MPPs on genome sequences containing seven TPPs and eight MPPs. Their analysis revealed that the content of hydrophobic amino acids in TPPs was quite similar to that in MPPs, which is consistent with previous works53,64,65.
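The agreement between the two rankings quoted from Table 8 for the five charged residues can be checked with a few lines of Python. This is only a minimal sketch: the rank pairs are those stated in the text, while the helper names rank_gaps and top_k are hypothetical and introduced here purely for illustration.

```python
# (PS-TPP rank, FUKS010101 rank) for the five charged residues,
# as quoted in the text from Table 8.
RANKS = {
    "Lys": (1, 1),
    "Glu": (2, 2),
    "Arg": (4, 3),
    "Asp": (11, 5),
    "His": (14, 17),
}


def rank_gaps(ranks):
    """Absolute difference between the two ranks of each residue."""
    return {aa: abs(a - b) for aa, (a, b) in ranks.items()}


def top_k(ranks, k=5, column=0):
    """Residues ranked within the top k in the chosen column
    (0 = PS-TPP, 1 = FUKS010101), in alphabetical order."""
    return sorted(aa for aa, pair in ranks.items() if pair[column] <= k)


if __name__ == "__main__":
    print(rank_gaps(RANKS))
    print(top_k(RANKS, k=5, column=0))
```

On these five pairs, Lys, Glu and Arg differ by at most one rank position between PS-TPP and FUKS010101, whereas Asp and His differ by six and three positions, respectively.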

Figure 6

Interpolated charge surface of Q9YFR9 (TPP) and P0A223 (non-TPP) having TPP scores of 528.74 and 341.99, respectively, where the optimal threshold value is 418. Blue, white and red colors denote high, medium and low interpolated charge, respectively.

Figure 7

Surface hydrophobicity of Q9YFR9 (TPP) and P0A223 (non-TPP) having TPP scores of 528.74 and 341.99, respectively, where the optimal threshold value is 418. Brown, white and blue colors denote high, medium and low hydrophobicity, respectively.

The analyses herein were based on the propensity scores of the 20 amino acids to be TPPs (i.e. derived from primary sequence information). In particular, selected TPPs and non-TPPs were used to analyze interpolated charge and hydrophobicity. However, this analysis was limited by the small number of samples used. To understand this phenomenon more explicitly, the average values of interpolated charge and hydrophobicity over the 1482 TPPs and 1482 non-TPPs should be computed in future analyses.
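To make concrete how a propensity-based score can be compared against the threshold of 418 quoted in the captions of Figures 6 and 7, the following minimal Python sketch scores a sequence as the composition-weighted mean of per-residue propensity scores. Several things here are assumptions rather than the published method: the propensity values in TOY_PROPENSITY are hypothetical placeholders (not the Table 8 values), the function names scm_score and is_tpp are introduced for illustration, the decision rule "score greater than threshold" is assumed, and the sketch uses single-residue propensities for brevity, whereas the actual SCMTPP score may be computed from the 400 dipeptide propensity scores.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# HYPOTHETICAL propensity scores, for illustration only.
# These are NOT the SCMTPP-derived values reported in Table 8.
TOY_PROPENSITY = {aa: 400.0 for aa in AMINO_ACIDS}
TOY_PROPENSITY.update({"K": 600.0, "E": 580.0, "R": 520.0})

# Optimal threshold value quoted in the captions of Figures 6 and 7.
THRESHOLD = 418.0


def scm_score(sequence, propensity):
    """Composition-weighted mean of per-residue propensity scores.

    Characters that are not keys of `propensity` are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in propensity]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    total = sum(propensity[aa] * n for aa, n in counts.items())
    return total / len(residues)


def is_tpp(sequence, propensity=TOY_PROPENSITY, threshold=THRESHOLD):
    """Assumed decision rule: predict TPP when the score exceeds the threshold."""
    return scm_score(sequence, propensity) > threshold


if __name__ == "__main__":
    for seq in ("KKEE", "AAAA"):
        print(seq, scm_score(seq, TOY_PROPENSITY), is_tpp(seq))
```

With real propensity scores in place of the placeholders, the same comparison would reproduce the type of decision illustrated in the figure captions, where Q9YFR9 (528.74) lies above and P0A223 (341.99) lies below the threshold.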

Utilization of the proposed SCMTPP

Finally, we created a user-friendly web server, SCMTPP, to give the scientific community easy access to the model. SCMTPP is freely available online at http://pmlabstack.pythonanywhere.com/SCMTPP. Step-by-step guidelines on how to use the SCMTPP web server are provided in the Supplementary information.

Conclusions

The accurate identification of novel TPPs from a large number of uncharacterized protein sequences is important in basic research as well as in a variety of applications in the food industry. Herein, we propose SCMTPP as a novel and interpretable computational model for the identification and characterization of TPPs. Firstly, we established an up-to-date dataset from published literature in order to develop an effective prediction model. Propensity scores of 20 amino acids and 400 g-gap dipeptides were then calculated using the SCM method. Unlike previous methods, our predictor aims to provide a better understanding of the molecular basis of TPPs as well as to improve prediction accuracy. Our empirical studies based on cross-validation and independent tests demonstrated the effectiveness and applicability of SCMTPP, which outperformed existing methods and widely used ML-based predictors while remaining simple, interpretable and practical to apply. Finally, SCMTPP was set up as a publicly accessible web server at http://pmlabstack.pythonanywhere.com/SCMTPP to help experimental scientists with large-scale TPP identification. The proposed SCMTPP web server and the SCMTPP-derived propensity scores are expected to be useful tools for facilitating basic research and a variety of applications in the food industry.