Big Data Analytics with Machine Learning Tools

  • T. P. Fowdur
  • Y. Beeharry
  • V. Hurbungs
  • V. Bassoo
  • V. Ramnarain-Seetohul
Chapter
Part of the Studies in Big Data book series (SBD, volume 30)

Abstract

Big data analytics is currently a major topic in the research community. Tools and techniques for handling and analysing structured and unstructured data are emerging rapidly. However, most of these tools demand considerable expertise, both to understand their underlying concepts and to use them effectively. This chapter presents an in-depth overview of the corporate and open-source tools currently used for analysing and learning from big data. An overview of the most common platforms, namely IBM, HPE, SAP HANA, Microsoft Azure and Oracle, is given first. Emphasis is then placed on two open-source tools: H2O and Spark MLlib. H2O was developed by H2O.ai, a company launched in 2011, and MLlib is the open-source machine learning API of Apache Spark, a project of the Apache Software Foundation. Different classification algorithms are applied to mobile-health (M-Health) data using both tools. In H2O, the Random Forest, Naïve Bayes, and Deep Learning algorithms are used; in Spark MLlib, the Random Forest, Decision Tree, and Multinomial Logistic Regression classification algorithms are used. The illustrations demonstrate the workflows for developing, training, and testing mathematical models in order to obtain insights from M-Health data using open-source tools developed for handling big data.

Keywords

Open-source tools; H2O; Spark MLlib

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • T. P. Fowdur (1)
  • Y. Beeharry (2)
  • V. Hurbungs (3)
  • V. Bassoo (4)
  • V. Ramnarain-Seetohul (4)

  1. Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Mauritius, Réduit, Moka, Mauritius
  2. Faculty of Information, Communication and Digital Technologies, University of Mauritius, Réduit, Moka, Mauritius
  3. Department of Software and Information Systems, Faculty of Information, Communication and Digital Technologies, University of Mauritius, Réduit, Mauritius
  4. Department of Information Communication Technology, Faculty of Information, Communication and Digital Technologies, University of Mauritius, Réduit, Mauritius
