Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data

  • S.I. : WorldCIST'20
Neural Computing and Applications

Abstract

Cancer is a severe condition in which uncontrolled cell division forms a tumor that can spread to other tissues of the body, so new medications and treatment methods are in demand. Classification of microarray data plays a vital role in this effort, and selecting the relevant genes is an important step in that classification. This work presents gene encoder, an unsupervised two-stage feature selection technique for classifying cancer samples. The first stage aggregates three filter methods: principal component analysis, correlation, and spectral-based feature selection. The second stage applies a genetic algorithm that evaluates each chromosome using autoencoder-based clustering. The resulting feature subset is used for the classification task. Three classifiers, namely support vector machine, k-nearest neighbors, and random forest, are used so that the results do not depend on any single classifier. Six benchmark gene expression datasets are used for the performance evaluation, and the method is compared with four state-of-the-art related algorithms. Three sets of experiments are carried out: evaluating the selected features through sample-based clustering, tuning the parameters, and identifying the best-performing classifier. The comparison is based on accuracy, recall, false positive rate, precision, F-measure, and entropy. The results suggest that the proposed method performs better than the compared algorithms.
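The comparison measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so the code below assumes the standard textbook definitions: the classification measures are computed from two-class confusion counts, and entropy is taken as the size-weighted class entropy of each cluster. The function names `binary_metrics` and `clustering_entropy` are hypothetical and chosen only for this illustration.

```python
import math
from collections import Counter


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures for a two-class problem.

    Assumption: textbook definitions, not necessarily the exact
    formulas used in the paper.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster class entropy, in bits.

    A value of 0 means every cluster contains samples of one class only.
    """
    n = len(cluster_ids)
    members = {}
    for cluster_id, label in zip(cluster_ids, class_labels):
        members.setdefault(cluster_id, []).append(label)

    total = 0.0
    for labels in members.values():
        size = len(labels)
        counts = Counter(labels)
        entropy = -sum((c / size) * math.log2(c / size) for c in counts.values())
        total += (size / n) * entropy
    return total


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not results from the paper).
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
    print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

Lower entropy indicates purer clusters, whereas higher values are better for accuracy, recall, precision, and F-measure, and lower values are better for the false positive rate.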

Acknowledgements

The authors are indebted to the editor and anonymous reviewers for their helpful comments and suggestions. The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under the GA-F scheme.

Author information

Corresponding author

Correspondence to Uzma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Uzma, Al-Obeidat, F., Tubaishat, A. et al. Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Comput & Applic 34, 8309–8331 (2022). https://doi.org/10.1007/s00521-020-05101-4

  • DOI: https://doi.org/10.1007/s00521-020-05101-4
