Big Data Analytics with Machine Learning Tools
Abstract
Big data analytics is currently a hot topic in the research community. Tools and techniques for handling and analysing structured and unstructured data are emerging rapidly. However, most of these tools demand considerable expertise to understand their concepts and to use them effectively. This chapter presents an in-depth overview of the corporate and open-source tools currently used for analysing and learning from big data. An overview of the most common platforms, such as IBM, HPE, SAP HANA, Microsoft Azure, and Oracle, is given first. Emphasis is then placed on two open-source tools: H2O and Spark MLlib. H2O was developed by H2O.ai, a company launched in 2011, and MLlib is an open-source API that is part of Apache Spark, a project of the Apache Software Foundation. Different classification algorithms have been applied to Mobile-Health (M-Health) data using both tools. In H2O, the Random Forest, Naïve Bayes, and Deep Learning algorithms were used. In Spark MLlib, the Random Forest, Decision Tree, and Multinomial Logistic Regression classification algorithms were used. The illustrations demonstrate the flows for developing, training, and testing mathematical models in order to obtain insights from M-Health data using open-source tools developed for handling big data.
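The develop, train, and test flow mentioned above can be illustrated with a minimal, self-contained sketch. It does not use H2O or Spark MLlib, and it is not the chapter's own code: it is a plain-Python Gaussian Naïve Bayes classifier, and the synthetic two-feature "sensor" readings, the three class labels, and all function names are assumptions made purely for illustration.

```python
import math
import random
from collections import defaultdict


def make_synthetic_data(n_per_class=200, seed=0):
    """Generate hypothetical 2-feature readings for 3 activity labels (not real M-Health data)."""
    rng = random.Random(seed)
    centers = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (0.0, 8.0)}  # assumed class centres
    rows = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            rows.append(([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)], label))
    rng.shuffle(rows)
    return rows


def train_test_split(rows, test_fraction=0.25):
    """Hold out the last `test_fraction` of the (already shuffled) rows for testing."""
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def fit_gaussian_nb(train):
    """Estimate a log prior and per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)
    model = {}
    for label, samples in by_class.items():
        n = len(samples)
        columns = list(zip(*samples))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(train)), means, variances)
    return model


def predict(model, features):
    """Return the class with the highest log posterior under the independence assumption."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for x, m, v in zip(features, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, rows):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for features, label in rows if predict(model, features) == label)
    return correct / len(rows)


rows = make_synthetic_data()
train, test = train_test_split(rows)
model = fit_gaussian_nb(train)
test_accuracy = accuracy(model, test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

In H2O or Spark MLlib the same three stages (split, fit, evaluate on held-out data) would be delegated to the library's distributed implementations. The sketch only makes the flow concrete on a toy scale.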
Keywords
Open-source tools, H2O, Spark MLlib