Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data

Uzma; Al-Obeidat, Feras; Tubaishat, Abdallah; Shah, Babar; Halim, Zahid

doi:10.1007/s00521-020-05101-4

S.I.: WorldCIST'20
Published: 14 June 2020

Volume 34, pages 8309–8331, (2022)

Neural Computing and Applications

Abstract

Cancer is a severe condition in which uncontrolled cell division forms a tumor that can spread to other tissues of the body, so new medications and treatment methods are in demand. Classification of microarray data plays a vital role in this effort, and selecting the relevant genes is an important step in that classification. This work presents gene encoder, an unsupervised two-stage feature selection technique for classifying cancer samples. The first stage aggregates three filter methods: principal component analysis, correlation, and spectral-based feature selection. The second stage applies a genetic algorithm that evaluates each chromosome using autoencoder-based clustering. The resulting feature subset is used for the classification task. Three classifiers (support vector machine, k-nearest neighbors, and random forest) are used to avoid dependence on any single classifier. Six benchmark gene expression datasets are used for performance evaluation, and the method is compared with four state-of-the-art related algorithms. Three sets of experiments evaluate the proposed method: assessing the selected features through sample-based clustering, tuning the parameters, and identifying the better-performing classifier. The comparison is based on accuracy, recall, false positive rate, precision, F-measure, and entropy. The results suggest that the proposed method performs better than the compared algorithms.
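The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the three filter scores are simple stand-ins (variance-weighted PCA loadings, low mean absolute correlation, and a Laplacian-score style spectral criterion), the filters are aggregated by taking the union of their top-ranked genes, and a linear projection followed by k-means stands in for the autoencoder-based clustering that the genetic algorithm uses to evaluate a chromosome. All function names and parameters (`stage_one`, `stage_two`, `top_k`, `p_mut`, and so on) are hypothetical.

```python
import numpy as np

def pca_score(X, n_components=3):
    """Variance-weighted squared loadings of each gene on the top components."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    k = min(n_components, len(s))
    return ((s[:k, None] * vt[:k]) ** 2).sum(axis=0)

def correlation_score(X):
    """Low redundancy: 1 - mean absolute correlation with every other gene."""
    c = np.nan_to_num(np.abs(np.corrcoef(X, rowvar=False)))
    np.fill_diagonal(c, 0.0)
    return 1.0 - c.sum(axis=0) / (X.shape[1] - 1)

def laplacian_score(X):
    """Negated Laplacian score on an RBF sample graph (higher = better)."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (d2.mean() + 1e-12))
    deg = W.sum(axis=1)
    F = X - (deg @ X) / deg.sum()                        # remove degree-weighted mean
    num = (F * (deg[:, None] * F - W @ F)).sum(axis=0)   # f^T L f with L = D - W
    den = (F * (deg[:, None] * F)).sum(axis=0) + 1e-12   # f^T D f
    return -num / den

def stage_one(X, top_k):
    """Assumed aggregation: union of the top_k genes picked by each filter."""
    picks = [np.argsort(f(X))[::-1][:top_k]
             for f in (pca_score, correlation_score, laplacian_score)]
    return np.unique(np.concatenate(picks))

def cluster_fitness(X, mask, k):
    """Encode the masked genes, run k-means, return 1 - within/total scatter."""
    if mask.sum() < 2:
        return -1.0                                      # penalize near-empty subsets
    Xc = X[:, mask] - X[:, mask].mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ vt[:2].T                                    # linear stand-in for the encoder
    rng = np.random.default_rng(0)                       # fixed seed: deterministic fitness
    centers = Z[rng.choice(len(Z), k, replace=False)]
    for _ in range(20):
        labels = ((Z[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    total = (Z ** 2).sum()
    within = ((Z - centers[labels]) ** 2).sum()
    return 1.0 - within / total if total > 0 else 0.0

def stage_two(X, k, seed=0, pop_size=20, generations=15, p_mut=0.05):
    """Genetic search over boolean gene masks (chromosomes)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < 0.5
    best, best_fit = pop[0].copy(), -np.inf
    for _ in range(generations):
        fits = np.array([cluster_fitness(X, m, k) for m in pop])
        i = int(fits.argmax())
        if fits[i] > best_fit:
            best, best_fit = pop[i].copy(), fits[i]
        a, b = rng.integers(0, pop_size, size=(2, pop_size))   # binary tournaments
        parents = np.where((fits[a] >= fits[b])[:, None], pop[a], pop[b])
        partners = parents[rng.permutation(pop_size)]
        children = np.where(rng.random((pop_size, n)) < 0.5, parents, partners)
        children ^= rng.random((pop_size, n)) < p_mut           # bit-flip mutation
        children[0] = best                                      # elitism
        pop = children
    return best

def gene_encoder_sketch(X, k, top_k=10, seed=0):
    """Stage 1 narrows the gene pool; stage 2 searches it for a subset."""
    candidates = stage_one(X, top_k)
    mask = stage_two(X[:, candidates], k, seed=seed)
    return candidates[mask]

# Synthetic example: 40 samples, 60 genes, two groups separated on genes 0-4.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 60))
X[:20, :5] += 3.0
selected = gene_encoder_sketch(X, k=2)
print(selected)
```

The returned indices would then feed whichever classifier is under evaluation (the abstract names support vector machine, k-nearest neighbors, and random forest); that step is omitted here. The pairwise-distance matrix in `laplacian_score` and the full correlation matrix in `correlation_score` are only practical at toy scale, so real microarray data with thousands of genes would need a more economical formulation.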

Acknowledgements

The authors are indebted to the editor and anonymous reviewers for their helpful comments and suggestions. The authors wish to thank GIK Institute for providing research facilities. This work was sponsored by the GIK Institute graduate research fund under the GA-F scheme.

Author information

Authors and Affiliations

The Machine Intelligence Research Group (MInG), Faculty of Computer Science and Engineering, Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan
Uzma & Zahid Halim
College of Technological Innovation at Zayed University, Abu Dhabi, UAE
Feras Al-Obeidat, Abdallah Tubaishat & Babar Shah

Corresponding author

Correspondence to Uzma.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Uzma, Al-Obeidat, F., Tubaishat, A. et al. Gene encoder: a feature selection technique through unsupervised deep learning-based clustering for large gene expression data. Neural Comput & Applic 34, 8309–8331 (2022). https://doi.org/10.1007/s00521-020-05101-4

Received: 10 March 2020
Accepted: 04 June 2020
Published: 14 June 2020
Issue Date: June 2022
DOI: https://doi.org/10.1007/s00521-020-05101-4
