Amino Acids, Volume 42, Issue 5, pp 1947–1953

Detecting thermophilic proteins through selecting amino acid and dipeptide composition features

  • Songyot Nakariyakul
  • Zhi-Ping Liu
  • Luonan Chen
Original Article


Detecting thermophilic proteins is an important step in engineering proteins that remain stable at temperatures of interest. In this work, we develop a simple but efficient method to discriminate thermophilic proteins from mesophilic ones using amino acid and dipeptide compositions. Since most of the amino acid and dipeptide compositions are redundant, we propose a new forward floating selection technique that selects only a useful subset of these compositions as features for support vector machine-based classification. We test the proposed method on a benchmark data set of 915 thermophilic and 793 mesophilic proteins. Using 28 amino acid and dipeptide compositions, our method achieves an accuracy of 93.3% under the jackknife cross-validation test, which is higher than that of existing methods and also higher than that obtained using all amino acid and dipeptide compositions.
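To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes the 20 amino acid and 400 dipeptide composition values for a sequence and then runs a classic sequential forward floating selection over feature indices. The function names, the `score` callback, and the toy criterion in the usage note are assumptions introduced for illustration. The paper proposes its own new floating selection variant, which is not reproduced here, and in practice the criterion would be something like the cross-validated accuracy of a support vector machine.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def composition_features(sequence):
    """Return 20 amino acid fractions followed by 400 dipeptide fractions."""
    # Simplification (assumption): non-standard symbols are dropped before counting,
    # so residues on either side of a dropped symbol are treated as adjacent.
    seq = [c for c in sequence.upper() if c in AMINO_ACIDS]
    n = len(seq)
    aa_counts = dict.fromkeys(AMINO_ACIDS, 0)
    dp_counts = dict.fromkeys(DIPEPTIDES, 0)
    for c in seq:
        aa_counts[c] += 1
    for a, b in zip(seq, seq[1:]):
        dp_counts[a + b] += 1
    aa = [aa_counts[a] / n if n else 0.0 for a in AMINO_ACIDS]
    dp = [dp_counts[d] / (n - 1) if n > 1 else 0.0 for d in DIPEPTIDES]
    return aa + dp


def floating_forward_selection(n_features, score, k):
    """Classic sequential forward floating selection (hypothetical helper).

    score(subset) takes a tuple of feature indices and returns a number,
    higher being better. Returns the best subset of size k that was found.
    """
    selected = []
    best = {}  # subset size -> (score, subset), best seen so far at that size
    while len(selected) < k:
        # Forward step: add the single feature that maximises the criterion.
        candidates = [f for f in range(n_features) if f not in selected]
        f_add = max(candidates, key=lambda f: score(tuple(selected + [f])))
        selected.append(f_add)
        s = score(tuple(selected))
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, tuple(selected))
        # Floating step: remove features while doing so beats the best
        # subset previously recorded at the smaller size.
        while len(selected) > 2:
            f_drop = max(
                selected,
                key=lambda f: score(tuple(x for x in selected if x != f)),
            )
            reduced = [x for x in selected if x != f_drop]
            s_red = score(tuple(reduced))
            if s_red > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_red, tuple(reduced))
            else:
                break
    return list(best[k][1])
```

As a usage example, `composition_features("ACAC")` returns a 420-element vector, and `floating_forward_selection(5, toy_score, 2)` with a toy criterion that rewards features 1 and 3 selects exactly those two indices. Wrapping a classifier inside `score` turns this into a wrapper-style selector, at the cost of one model evaluation per candidate subset.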


Keywords: Amino acid composition; Dipeptide composition; Feature selection; Floating search method; Protein thermostability



This work was supported by the Chinese Academy of Sciences Fellowship for Young International Scientist with Grant No. 2010Y1Sb10 and NSFC with Grant No. 31050110435 (S. Nakariyakul). This work was also supported by the Chief Scientist Program of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences with Grant No. 2009CSP002 (L. Chen), and supported by NSFC under Grants No. 61072149 and No. 91029301 (L. Chen and Z.P. Liu), the Knowledge Innovation Program of CAS with Grant No. KSCX2-EW-R-01 (L. Chen and Z.P. Liu), and supported by the Key Project of Shanghai Education Committee (B.10-0412-08-001), Japan (JSPS) FIRST Program (L. Chen) and Shanghai Natural Science Foundation under Grant No. 11ZR1443100 (Z.P. Liu).

Supplementary material

726_2011_923_MOESM1_ESM.pdf (57 kb)
Supplementary material 1 (PDF 56 kb)



Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. Key Laboratory of Systems Biology, SIBS-Novo Nordisk Translational Research Centre for PreDiabetes, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
  2. Department of Electrical and Computer Engineering, Thammasat University, Pathumthani, Thailand
