Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data

Venkataramana, Lokeswari; Jacob, Shomona Gracia; Ramadoss, Rajavel; Saisuma, Dodda; Haritha, Dommaraju; Manoja, Kunthipuram

doi:10.1007/s13258-019-00859-x

Research Article
Published: 19 August 2019

Volume 41, pages 1301–1313, (2019)
Genes & Genomics

Lokeswari Venkataramana ORCID: orcid.org/0000-0003-2331-4507¹,
Shomona Gracia Jacob²,
Rajavel Ramadoss³,
Dodda Saisuma¹,
Dommaraju Haritha¹ &
Kunthipuram Manoja¹

Abstract

Background

Data mining techniques extract previously unknown knowledge from large datasets. Microarray gene expression (MGE) data plays a major role in predicting the type of cancer. Because MGE data is large in volume, applying traditional data mining approaches is time-consuming. Parallel programming frameworks such as Hadoop, Spark and Mahout are therefore needed to ease the computation.

Objective

Not all gene expressions are needed for prediction, so selecting the important genes is essential for improving classification accuracy. Feature selection algorithms are therefore parallelized and executed on the Spark framework to eliminate unnecessary genes and identify only the predictive genes in much less time, without affecting prediction accuracy.
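As an illustration of why per-gene filtering parallelizes well, the following is a minimal sketch, not the authors' Spark implementation. It assumes each gene can be scored independently of the others, splits the genes into partitions, scores the partitions concurrently, and merges the results. A standard-library thread pool stands in for Spark executors, absolute Pearson correlation with the class label stands in for the scoring function, and all names and data (`parallel_gene_scores`, `toy_genes`, and so on) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def score_partition(partition, labels):
    """Map step: score each gene in one partition by |correlation with the label|."""
    return [(name, abs(pearson(values, labels))) for name, values in partition]


def parallel_gene_scores(genes, labels, n_partitions=2):
    """Split genes into partitions, score them concurrently, and merge the results."""
    items = list(genes.items())
    partitions = [items[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(lambda part: score_partition(part, labels), partitions)
        scores = {}
        for part_scores in results:  # Reduce step: merge per-partition scores.
            scores.update(part_scores)
    return scores


# Hypothetical toy data: 3 genes measured on 6 samples, with binary class labels.
toy_genes = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 3.0, 2.8],  # tracks the label closely
    "gene_b": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # constant, carries no information
    "gene_c": [0.5, 2.5, 1.0, 2.0, 0.7, 2.2],  # only weakly related to the label
}
toy_labels = [0, 0, 0, 1, 1, 1]

toy_scores = parallel_gene_scores(toy_genes, toy_labels)
```

Because no partition needs data from another, the same map-and-merge shape carries over to a cluster framework, where each partition would live on a different worker.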

Methods

A parallelized hybrid feature selection (HFS) method is proposed for this purpose. The method applies parallelized correlation feature subset selection followed by rank-based feature selection methods. The selected subset of genes is evaluated using parallel classification algorithms. The accuracy values obtained are compared with those of the existing rank-weight feature selection and parallelized recursive feature selection methods, and with the values obtained by executing parallelized HFS on DistributedWekaSpark.
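The two-stage idea can be sketched sequentially as follows. This is an illustrative approximation under stated assumptions, not the published algorithm. Stage 1 uses a greedy forward search with the merit heuristic commonly associated with correlation-based feature selection, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean gene-class correlation and r_ff the mean gene-gene correlation. Stage 2 ranks the surviving genes with a signal-to-noise score. The abstract does not name the rank-based methods, so that criterion is an assumption, as are the function names, the binary 0/1 labels, and the toy data.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def cfs_merit(subset, genes, labels):
    """CFS-style merit: rewards gene-class correlation, penalizes gene-gene correlation."""
    k = len(subset)
    r_cf = mean(abs(pearson(genes[g], labels)) for g in subset)
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = mean(abs(pearson(genes[a], genes[b])) for a, b in pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def correlation_subset(genes, labels):
    """Stage 1: greedy forward search; stop when no candidate improves the merit."""
    selected, best = [], 0.0
    remaining = sorted(genes)
    while remaining:
        merit, gene = max(
            (cfs_merit(selected + [g], genes, labels), g) for g in remaining
        )
        if merit <= best:
            break
        selected.append(gene)
        remaining.remove(gene)
        best = merit
    return selected


def signal_to_noise(values, labels):
    """Assumed rank criterion: |class mean difference| over summed class std devs."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread else 0.0


def hybrid_select(genes, labels, top_k):
    """Stage 1 (subset selection) followed by stage 2 (rank and keep the top_k genes)."""
    subset = correlation_subset(genes, labels)
    ranked = sorted(
        subset, key=lambda g: signal_to_noise(genes[g], labels), reverse=True
    )
    return ranked[:top_k]


# Hypothetical toy data: 4 genes measured on 6 samples, with binary class labels.
toy_genes = {
    "gene_a": [1.0, 1.2, 0.9, 3.1, 3.0, 2.8],       # strongly tied to the label
    "gene_a_like": [1.1, 1.0, 1.3, 2.9, 3.2, 2.7],  # behaves much like gene_a
    "gene_c": [0.5, 2.5, 1.0, 2.0, 0.7, 2.2],       # weakly tied to the label
    "gene_noise": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],   # constant, uninformative
}
toy_labels = [0, 0, 0, 1, 1, 1]

top_gene = hybrid_select(toy_genes, toy_labels, top_k=1)
```

In this sketch the subset stage discards genes that do not raise the merit, so the ranking stage only has to order a much smaller candidate set. The selected genes would then be passed to a classifier for evaluation.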

Results

The classification accuracy obtained with the proposed parallelized HFS method is 97% for gastric cancer and 79% for childhood leukemia. Compared with previous methods, the proposed method improved classification accuracy by approximately 4% to 15%.

Conclusion

The results show that the proposed parallelized feature selection algorithm scales to growing medical data and predicts cancer sub-types in less time and with higher accuracy.

Acknowledgements

This research work is part of a project funded by the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), under the Young Scientist Scheme, Early Start-up Research Grant, titled “Investigation on the effect of Gene and Protein Mutants in the onset of Neuro-Degenerative Brain Disorders (Alzheimer’s and Parkinson’s disease): A Computational Study”, with Reference no. SERB—YSS/2015/000737/ES.

Author information

Authors and Affiliations

Department of CSE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
Lokeswari Venkataramana, Dodda Saisuma, Dommaraju Haritha & Kunthipuram Manoja
Muscat, Oman
Shomona Gracia Jacob
Department of ECE, Sri Sivasubramaniya Nadar College of Engineering, Kalavakkam, Chennai, India
Rajavel Ramadoss

Corresponding author

Correspondence to Lokeswari Venkataramana.

Ethics declarations

Conflict of interest

Lokeswari Venkataramana, Shomona Gracia Jacob, Rajavel Ramadoss, Dodda Saisuma, Dommaraju Haritha and Kunthipuram Manoja declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent is not necessary as this article does not involve human or animal participants.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Venkataramana, L., Jacob, S.G., Ramadoss, R. et al. Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data. Genes Genom 41, 1301–1313 (2019). https://doi.org/10.1007/s13258-019-00859-x

Received: 27 February 2019
Accepted: 02 August 2019
Published: 19 August 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s13258-019-00859-x
