Wireless Personal Communications, Volume 102, Issue 3, pp 2099–2116

Machine Learning Based Big Data Processing Framework for Cancer Diagnosis Using Hidden Markov Model and GM Clustering

  • Gunasekaran Manogaran
  • V. Vijayakumar
  • R. Varatharajan
  • Priyan Malarvizhi Kumar
  • Revathi Sundarasekar
  • Ching-Hsien Hsu


Changes in DNA copy number are a form of genetic variation in the human genome. DNA copy number change is also linked to the progression of many emerging diseases, and it is an essential measure for diagnosing cancer. Array-based comparative genomic hybridization (CGH) is a major technique for measuring DNA copy number change across the genome, and it is widely used for continuous monitoring of the copy number of thousands of genes throughout the genome. Next-generation sequencing is an important method for studying the spread of infectious disease qualitatively and quantitatively. In recent years, DNA sequence data sets have become very large. Hence, a scalable machine learning approach is needed to address the issues that arise in DNA copy number change detection. In this paper, we use a Bayesian hidden Markov model (HMM) with Gaussian mixture (GM) clustering to model the DNA copy number change across the genome. The proposed Bayesian HMM with GM clustering approach is compared with existing approaches, including the Pruned Exact Linear Time (PELT) method, the binary segmentation method, and the segment neighborhood method. Experimental results demonstrate the effectiveness of our proposed change detection algorithm.
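To make the idea of HMM-based change detection concrete, the following is a minimal sketch, not the Bayesian HMM with GM clustering evaluated in the paper. It assumes three hidden copy number states (loss, normal, gain) with Gaussian emissions, and it fixes the state means, the shared standard deviation, and the transition probabilities by hand instead of inferring them. The function names, parameter values, and the short log-ratio profile are all hypothetical and chosen only for illustration. The most likely state sequence is decoded with the Viterbi algorithm, and a change point is reported wherever the decoded state switches.

```python
import math


def viterbi_gaussian(obs, means, sigma, stay_prob):
    """Most likely state path for an HMM with Gaussian emissions.

    Assumptions of this sketch: all states share one standard deviation
    `sigma`, the chain stays in its current state with probability
    `stay_prob` and otherwise moves to any other state with equal
    probability, and the initial distribution is uniform (so its constant
    log term is omitted).
    """
    k = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (k - 1))

    def log_emit(x, s):
        # Gaussian log-density up to a constant shared by all states.
        return -0.5 * ((x - means[s]) / sigma) ** 2

    def log_trans(r, s):
        return log_stay if r == s else log_move

    # score[s] = best log-probability of any path ending in state s.
    score = [log_emit(obs[0], s) for s in range(k)]
    back = []
    for x in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(k):
            best = max(range(k), key=lambda r: prev[r] + log_trans(r, s))
            ptr.append(best)
            score.append(prev[best] + log_trans(best, s) + log_emit(x, s))
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def change_points(path):
    """Indices where the decoded state differs from the previous one."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Hypothetical log2-ratio profile: normal, gained, then normal again.
profile = [0.02, -0.05, 0.04, 0.01, 0.55, 0.61, 0.48, 0.57, 0.03, -0.02, 0.05]
states = viterbi_gaussian(profile, means=[-0.5, 0.0, 0.5], sigma=0.15, stay_prob=0.95)
print(change_points(states))  # [4, 8]
```

A high `stay_prob` penalizes frequent switching, so isolated noisy probes are less likely to be reported as changes. In a Bayesian treatment the means, variance, and transition probabilities would carry priors and be inferred from the data rather than fixed as they are here.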


Big Data; Machine learning; Bayesian hidden Markov model; Gaussian mixture clustering; DNA copy number change; Comparative genomic hybridization



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. VIT University, Vellore, India
  2. School of Computing Science and Engineering, VIT University, Chennai, India
  3. Sri Ramanujar Engineering College, Chennai, India
  4. Priyadarshini Engineering College, Vellore, India
  5. Chung Hua University, Hsinchu, Taiwan
