In silico markers: an evolutionary and statistical approach to select informative genes of human breast cancer subtypes

  • Shib Sankar Bhowmick
  • Debotosh Bhattacharjee
  • Luis Rato
Research Article



Recent advances in bioinformatics make it possible to identify informative genes from high-dimensional gene expression data. Selecting informative genes from these large datasets has become a major concern among researchers.


Analysis of gene expression data helps to explain gene functionality and regulatory mechanisms. Here, we present a computational method to identify informative genes for breast cancer subtypes such as Basal, human epidermal growth factor receptor 2 (Her2), luminal A (LumA), and luminal B (LumB).


The proposed In Silico Markers method is a wrapper feature selection method based on the Least Absolute Shrinkage and Selection Operator (LASSO) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), with a Support Vector Machine (SVM) as the classifier. In addition, a composite measure consisting of the relevance, redundancy, and rank score of frequently appearing genes is used to select informative genes.
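The abstract does not give the formula for the composite measure, so the following is only a minimal sketch of how relevance, redundancy, and a frequency-based rank score could be combined. Everything specific in it is an assumption made for illustration: absolute Pearson correlation stands in for relevance and redundancy, the three terms are weighted equally, and the names `pearson`, `composite_scores`, `expr`, `labels`, and `runs` as well as the toy data are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def composite_scores(expr, labels, runs):
    """Score every gene that appears in at least one selection run.

    expr   : dict mapping gene name -> expression values, one per sample
    labels : 0/1 class label per sample (two-class setting)
    runs   : list of gene lists, one per run of a wrapper selector
    """
    # How often each gene was selected across the runs.
    counts = Counter(g for run in runs for g in set(run))
    candidates = [g for g in counts if g in expr]
    scores = {}
    for g in candidates:
        # Assumed relevance: |correlation| between the gene and the class label.
        relevance = abs(pearson(expr[g], labels))
        # Assumed redundancy: mean |correlation| with the other candidate genes.
        others = [h for h in candidates if h != g]
        redundancy = (
            sum(abs(pearson(expr[g], expr[h])) for h in others) / len(others)
            if others
            else 0.0
        )
        # Assumed rank score: fraction of runs in which the gene was selected.
        frequency = counts[g] / len(runs)
        # Equal weighting is an illustrative choice, not the paper's.
        scores[g] = relevance - redundancy + frequency
    return scores


# Toy data (hypothetical): six samples, three in each class.
expr = {
    "G1": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],  # separates the two classes
    "G2": [1.1, 1.3, 1.0, 3.2, 3.0, 3.1],  # close to a copy of G1
    "G3": [2.0, 0.5, 1.5, 1.0, 2.5, 0.8],  # unrelated to the class
}
labels = [0, 0, 0, 1, 1, 1]
runs = [["G1", "G3"], ["G1", "G2"], ["G1"]]

scores = composite_scores(expr, labels, runs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch a gene scores highly when it tracks the class label, adds information not already carried by the other candidates, and is selected consistently across runs. A mutual-information estimate could replace the correlation terms without changing the overall structure.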


The informative genes are validated by statistical and biologically relevant criteria. For a comparative evaluation of the proposed approach, a biological similarity score based on a semantic similarity measure of Gene Ontology (GO) terms is investigated. Further, the proposed technique is compared with 7 existing gene selection techniques on two-class annotated breast cancer subtype datasets.


This method can support the discovery of informative genes. Furthermore, in a multiple-criteria decision-making setup, the informative genes selected by In Silico Markers are found to be preferable to the genes selected by the compared methods.


Keywords: Breast cancer subtype; Biological analysis; Gene selection; Messenger RNA; Statistical analysis


Compliance with ethical standards

Conflict of interest

Shib Sankar Bhowmick, Debotosh Bhattacharjee and Luis Rato declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Supplementary material

13258_2019_816_MOESM1_ESM.pdf (856 kb)
Supplementary material 1 (PDF 856 kb)
13258_2019_816_MOESM2_ESM.pdf (55 kb)
Supplementary material 2 (PDF 56 kb)
13258_2019_816_MOESM3_ESM.pdf (54 kb)
Supplementary material 3 (PDF 54 kb)
13258_2019_816_MOESM4_ESM.pdf (40 kb)
Supplementary material 4 (PDF 41 kb)



Copyright information

© The Genetics Society of Korea 2019

Authors and Affiliations

  1. Department of Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, India
  2. Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  3. Department of Informatics, University of Evora, Evora, Portugal
