Abstract
Developing accurate and reliable intelligent decision-making methods for cancer diagnosis systems is one of the fastest-growing research areas in the health sciences. Such decision-making systems can provide useful information for cancer diagnosis and drug discovery. Descriptors derived from the physicochemical properties of protein sequences are very useful for classifying cancerous proteins, and several studies on breast cancer classification have been reported recently. To this end, we propose exploiting physicochemical properties of the amino acids in protein primary sequences, such as hydrophobicity (Hd) and hydrophilicity (Hb), for breast cancer classification. Recent literature reports that the Hd and Hb properties are quite effective in characterizing the constituent amino acids, and they are used to study protein folding, interactions, structures, and sequence-order effects. In particular, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed ‘IDM-PhyChm-Ens’ method was developed by combining the decision spaces of a specific classifier trained on different feature spaces: amino acid composition, split amino acid composition, and pseudo amino acid composition. We thus exploit different feature spaces using the Hd and Hb properties of amino acids to develop an accurate method for classifying cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN), each trained on the different feature spaces. For cancer classification, ensemble-RF performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM, and ensemble-KNN are more effective than their individual counterparts. The proposed ‘IDM-PhyChm-Ens’ method showed improved performance compared with existing techniques.
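As a rough illustration of the feature spaces and the decision fusion described in the abstract, the following Python sketch computes amino acid composition and split amino acid composition for a sequence and combines per-feature-space predictions by majority vote. It is only a sketch under stated assumptions and not the authors' MATLAB implementation: the function names (`aac`, `split_aac`, `majority_vote`), the 25-residue terminal segment lengths, and the choice of majority voting as the fusion rule are all hypothetical and not taken from the abstract. Pseudo amino acid composition is omitted because it requires Hd and Hb scale values that the abstract does not list.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def split_aac(sequence, n_term=25, c_term=25):
    """Split amino acid composition: AAC of the N-terminal, middle, and
    C-terminal segments, concatenated into 60 values.

    The segment lengths are illustrative assumptions.
    """
    seq = sequence.upper()
    head = seq[:n_term]
    rest = seq[n_term:]
    tail = rest[-c_term:] if c_term > 0 else ""
    middle = rest[:len(rest) - len(tail)]
    return aac(head) + aac(middle) + aac(tail)


def majority_vote(predictions):
    """Fuse the labels predicted by one classifier type trained on several
    feature spaces. Ties go to the tied label that appears first in the input.

    Majority voting is an assumed fusion rule, used here for illustration.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label
```

In use, one classifier of a given type (for example RF) would be trained per feature space, and the labels those classifiers predict for a query protein would be passed to `majority_vote`, for example `majority_vote(["cancer", "healthy", "cancer"])`.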
Acknowledgments
The authors are very grateful to the Pakistan Institute of Engineering and Applied Sciences (PIEAS) for providing useful resources for this work.
Conflict of interest
None.
Additional information
The MATLAB-based code developed for this study can be provided to academic researchers on request.
Cite this article
Ali, S., Majid, A. & Khan, A. IDM-PhyChm-Ens: Intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids. Amino Acids 46, 977–993 (2014). https://doi.org/10.1007/s00726-013-1659-x
DOI: https://doi.org/10.1007/s00726-013-1659-x