Detecting thermophilic proteins through selecting amino acid and dipeptide composition features
Detecting thermophilic proteins is an important step in engineering proteins that remain stable at the temperatures of interest. In this work, we develop a simple but efficient method to distinguish thermophilic proteins from mesophilic ones using amino acid and dipeptide compositions. Because most amino acid and dipeptide compositions are redundant, we propose a new forward floating selection technique that selects only a useful subset of these compositions as features for support vector machine-based classification. We test the proposed method on a benchmark data set of 915 thermophilic and 793 mesophilic proteins. Using 28 amino acid and dipeptide compositions, our method achieves an accuracy of 93.3% in the jackknife cross-validation test, which is higher than that of the existing methods and also higher than that obtained using all amino acid and dipeptide compositions.
Keywords: Amino acid composition; Dipeptide composition; Feature selection; Floating search method; Protein thermostability
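To make the candidate feature set concrete, the following is a minimal sketch of how amino acid and dipeptide compositions could be computed from a sequence. It assumes the standard 20-letter amino acid alphabet, which gives 20 + 400 = 420 candidate features. The function name `composition_features` and the choice to skip non-standard residues are illustrative assumptions, not details taken from the paper. The feature selection step and the support vector machine are not shown.

```python
from itertools import product

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids: "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def composition_features(sequence):
    """Return 420 values: 20 amino acid frequencies, then 400 dipeptide frequencies.

    Hypothetical helper for illustration. Residues outside the standard
    alphabet (for example 'X') are skipped, which is an assumption.
    """
    seq = sequence.upper()
    aa_counts = dict.fromkeys(AMINO_ACIDS, 0)
    dp_counts = dict.fromkeys(DIPEPTIDES, 0)

    # Count single residues.
    for residue in seq:
        if residue in aa_counts:
            aa_counts[residue] += 1

    # Count overlapping adjacent pairs; pairs containing a
    # non-standard residue are ignored.
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in dp_counts:
            dp_counts[pair] += 1

    # Normalize counts to frequencies; "or 1" avoids division by zero
    # for empty or very short sequences.
    n_aa = sum(aa_counts.values()) or 1
    n_dp = sum(dp_counts.values()) or 1
    aa_freq = [aa_counts[a] / n_aa for a in AMINO_ACIDS]
    dp_freq = [dp_counts[d] / n_dp for d in DIPEPTIDES]
    return aa_freq + dp_freq


if __name__ == "__main__":
    features = composition_features("MKVLAAGIVGLK")  # toy sequence, not real data
    print(len(features))
```

Each protein then becomes a fixed-length vector regardless of sequence length, and a selection procedure such as the one proposed here would retain only a small subset of the 420 entries (28 in the reported result).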
This work was supported by the Chinese Academy of Sciences Fellowship for Young International Scientist with Grant No. 2010Y1Sb10 and NSFC with Grant No. 31050110435 (S. Nakariyakul). This work was also supported by the Chief Scientist Program of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences with Grant No. 2009CSP002 (L. Chen), and supported by NSFC under Grants No. 61072149 and No. 91029301 (L. Chen and Z.P. Liu), the Knowledge Innovation Program of CAS with Grant No. KSCX2-EW-R-01 (L. Chen and Z.P. Liu), and supported by the Key Project of Shanghai Education Committee (B.10-0412-08-001), Japan (JSPS) FIRST Program (L. Chen) and Shanghai Natural Science Foundation under Grant No. 11ZR1443100 (Z.P. Liu).