Abstract
Microarray gene expression data play a major role in predicting chronic disease at an early stage and help to identify the most appropriate drug for treating it. Such data are too voluminous to handle easily, and not all gene expressions are needed to predict a disease. Gene selection approaches pick only the genes that play a prominent role in detecting a disease and identifying a drug for it. To handle huge gene expression data, gene selection algorithms can be executed on parallel programming frameworks such as Hadoop MapReduce and Spark. Paediatric cancer is a life-threatening illness that affects children aged 0–14 years, and identifying child tumours at an early stage is essential to saving children's lives. The authors therefore investigate paediatric cancer gene data to identify the optimal genes that cause cancer in children. They propose executing a parallel Chi-Square gene selection algorithm on Spark, evaluating the selected genes using parallel logistic regression and support vector machine (SVM) for binary classification with the Spark Machine Learning library (Spark MLlib), and comparing their prediction and classification accuracy. The results show that parallel Chi-Square selection followed by parallel logistic regression and SVM provides better accuracy than that obtained with the complete set of gene expression data.
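The Chi-Square gene selection step mentioned above can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation, which this page does not show. It assumes expression values have already been discretised into categories (here "high" and "low"), and the gene names, labels, and helper functions `chi_square_score` and `select_top_genes` are hypothetical. Each gene is scored independently of the others, which is what makes the step straightforward to distribute; Spark MLlib provides a `ChiSqSelector` for this purpose, though whether the authors used it is not stated here.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one discretised gene and the class label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # contingency table counts
    feature_totals = Counter(feature_values)         # row totals
    label_totals = Counter(labels)                   # column totals
    score = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = f_count * c_count / n         # count expected under independence
            diff = observed.get((f, c), 0) - expected
            score += diff * diff / expected
    return score


def select_top_genes(expression, labels, k):
    """Rank genes by chi-square score and keep the k highest-scoring ones.

    expression maps a gene name to its list of discretised values, one per sample.
    """
    scores = {gene: chi_square_score(values, labels)
              for gene, values in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical): six samples, label 1 = tumour, 0 = normal.
labels = [1, 1, 1, 0, 0, 0]
expression = {
    "gene_a": ["high", "high", "high", "low", "low", "low"],   # separates classes perfectly
    "gene_b": ["high", "low", "high", "low", "high", "low"],   # weakly related to the label
    "gene_c": ["high", "high", "low", "low", "low", "low"],    # partially informative
}

top_genes, scores = select_top_genes(expression, labels, k=2)
print(top_genes)
```

The selected genes would then be the only features passed to the classifiers, so the logistic regression and SVM models train on a much smaller matrix than the complete gene set.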
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Lokeswari, Y.V., Jacob, S.G. (2017). Prediction of Child Tumours from Microarray Gene Expression Data Through Parallel Gene Selection and Classification on Spark. In: Behera, H., Mohapatra, D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 556. Springer, Singapore. https://doi.org/10.1007/978-981-10-3874-7_62
DOI: https://doi.org/10.1007/978-981-10-3874-7_62
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3873-0
Online ISBN: 978-981-10-3874-7
eBook Packages: Engineering, Engineering (R0)