An entropy-based classification of breast cancerous genes using microarray data

  • Mausami Mondal
  • Rahul Semwal
  • Utkarsh Raj
  • Imlimaong Aier
  • Pritish Kumar Varadwaj
Original Article


Gene expression levels obtained from microarray data provide a promising basis for classifying cancerous tissue. Because microarray datasets are high-dimensional, redundant genes must be removed so that only significant genes are used to build the classifier. In this work, an entropy-based method was used in a supervised learning setting to differentiate between normal tissue and breast tumor tissue on the basis of their gene expression profiles. Four widely used machine learning techniques were employed for breast cancer prediction: support vector machine (SVM), random forest, k-nearest neighbor (KNN) and naive Bayes. Their performance was evaluated using four classification performance measures, and SVM was more accurate than the other algorithms, achieving a classification accuracy of 91.5% with an F1 measure of 0.833. The techniques were further compared using ROC curves and calibration graphs.
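The abstract does not spell out the entropy criterion, so the following is only a minimal sketch of one common entropy-based filter: ranking genes by information gain, i.e. the reduction in class-label entropy obtained by splitting the samples at a single expression threshold. The single-threshold split, the function names, and the toy expression values are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Largest entropy reduction achievable with one threshold on a gene's values."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        best = max(best, base - conditional)
    return best


def rank_genes(expression, labels, top_k):
    """Return the top_k gene names ordered by decreasing information gain.

    expression maps a gene name to its expression values, one per sample,
    in the same sample order as labels.
    """
    scores = {gene: information_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes measured on six samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "gene_a": [5.1, 4.8, 5.5, 1.2, 0.9, 1.4],  # separates the two classes cleanly
    "gene_b": [2.0, 3.1, 1.0, 2.1, 3.0, 1.1],  # carries little class information
    "gene_c": [4.0, 4.2, 1.0, 1.1, 0.8, 1.3],  # partially informative
}
top = rank_genes(expression, labels, top_k=2)
print(top)
```

In a pipeline of the kind the abstract describes, the retained genes would then form the feature set passed to the classifiers (SVM, random forest, KNN, naive Bayes); how many genes to keep is a tuning choice this sketch does not address.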


Keywords: Support vector machine; k-Nearest neighbor; Random forest; Naive Bayes; Classification; Machine learning algorithm



The authors acknowledge the Department of Bioinformatics and Applied Sciences, Indian Institute of Information Technology, Allahabad, for providing computing facilities.

Compliance with ethical standards

Conflict of interest

The authors have no conflict of interest.


Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  • Mausami Mondal (1)
  • Rahul Semwal (1)
  • Utkarsh Raj (1)
  • Imlimaong Aier (1)
  • Pritish Kumar Varadwaj (1)

  1. Department of Bioinformatics and Applied Sciences, Indian Institute of Information Technology - Allahabad, Prayagraj, India
