Abstract
The fast-evolving Big Data environment has spawned a myriad of tools, paradigms, and techniques to tackle different use cases in industry and science. Because of this abundance, practitioners and experts often find it difficult to analyze and select the right tool for their problems. In this chapter we present an introductory survey of the Big Data landscape, with the aim of providing algorithm designers with the knowledge needed to develop scalable and efficient machine learning solutions. We start by discussing the common technical concepts, paradigms, and technologies that form the foundation of frameworks such as Spark and Hadoop. Afterwards we analyze in depth the most popular Big Data frameworks and their main components. We then discuss other novel platforms for high-speed stream processing that are gaining increasing importance in industry. Finally, we compare two of the most relevant large-scale processing platforms today: Spark and Flink.
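The MapReduce paradigm underlying Hadoop, and which inspired the dataflow models of Spark and Flink, can be illustrated with a minimal single-machine word-count sketch. This is only a conceptual illustration: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are our own and not part of any framework's API, and a real engine would run the phases distributed across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all intermediate values by key, as the framework
    # does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here, summing counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data tools", "big data frameworks"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'frameworks': 1}
```

The key point is that the user only writes the map and reduce functions; the framework handles partitioning, shuffling, and fault tolerance, which is what makes the paradigm scale to large clusters.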
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Big Data: Technologies and Tools. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_2
DOI: https://doi.org/10.1007/978-3-030-39105-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer Science, Computer Science (R0)