DataCan: Robust Approach for Genome Cancer Data Analysis

  • Varun Goel (Email author)
  • Vishal Jangir
  • Venkatesh Gauri Shankar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1016)


Over the past twenty years, the biological sciences have driven active analytical research on high-dimensional data. Recently, many new approaches from data science and machine learning have emerged to handle ultrahigh-dimensional genome data. The variety of cancer data types, together with the many ongoing studies of similar cancers, adds to the complexity of the data. It is of considerable biological and clinical interest to understand what subtypes a cancer has, how patients' genomic profiles and survival rates vary among subtypes, whether a patient's survival can be predicted from his or her genomic profile, and how different genomic profiles correlate. Identifying the types of cancer mutations is of utmost importance, because they offer useful insight into disease pathogenesis and help advance therapy tailored to the individual patient. In this paper, we focus on finding cancer-causing genes and their specific mutations and on classifying the genes into 9 classes of cancer. This helps predict which genetic mutation causes which type of cancer. We used scikit-learn and NLTK to analyze what each class means by classifying all genetic mutations into 17 major mutation types (according to the dataset). The dataset comes in two formats: a CSV file containing the genes and their mutations, and a text file containing descriptions of these mutations. Our approach merged the two datasets and used Random Forest, with GridSearchCV and ten-fold cross-validation, to perform a supervised classification analysis, which yielded an accuracy score of 68.36%. The accuracy is limited because the genes and their variations do not follow the HGVS nomenclature, so converting the text to a numerical format lost some important features. Our findings suggest that classes 1, 4 and 7 contribute the most to causing cancer.
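The pipeline described in the abstract (merge the CSV and text data, convert text to numbers, then tune a Random Forest with GridSearchCV under ten-fold cross-validation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the column names (`ID`, `Gene`, `Variation`, `Class`, `Text`), the toy data, the TF-IDF step, and the parameter grid are all assumptions, and NLTK preprocessing is omitted.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy stand-ins for the two input files (NOT the paper's data).
# Three illustrative classes with 10 samples each, so ten folds are possible.
vocab = {
    1: "loss of function truncating",
    2: "gain of function activating",
    3: "neutral benign unchanged",
}
n_samples = 30

# Stand-in for the CSV file: genes, their variations, and the class label.
variants = pd.DataFrame({
    "ID": range(n_samples),
    "Gene": [f"GENE{i % 5}" for i in range(n_samples)],
    "Variation": [f"V{i}X" for i in range(n_samples)],
    "Class": [i % 3 + 1 for i in range(n_samples)],
})

# Stand-in for the text file: a free-text description per variant.
texts = pd.DataFrame({
    "ID": range(n_samples),
    "Text": [f"{vocab[i % 3 + 1]} mutation observed in sample {i}"
             for i in range(n_samples)],
})

# Merge the two sources on the shared (assumed) ID column.
merged = variants.merge(texts, on="ID", how="inner")
merged["document"] = (
    merged["Gene"] + " " + merged["Variation"] + " " + merged["Text"]
)

# Text-to-numeric conversion (TF-IDF is an assumption) followed by Random Forest.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Small illustrative grid; the paper does not state which values were searched.
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 5],
}

# Ten-fold stratified cross-validation, scored by accuracy.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(merged["document"], merged["Class"])

print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
```

On real data, the two DataFrames would instead be loaded from the CSV and text files, and the accuracy would depend heavily on how the gene and variation strings are encoded.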


Cancer, Genes, Random-Forest, SVM, Variations, Data analytics



Varun Goel is the corresponding author. It is our privilege to express our sincere thanks to Prof. Venkatesh Gauri Shankar (Assistant Professor) from Manipal University Jaipur for his helpful guidance and discussions on our data analysis methods. He provided us with various resources to support us during the implementation of this work.



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Varun Goel (1, Email author)
  • Vishal Jangir (1)
  • Venkatesh Gauri Shankar (1)
  1. Department of Information Technology, Manipal University Jaipur, Jaipur, India
