Abstract
The fast-evolving Big Data environment has spawned a myriad of tools, paradigms, and techniques to tackle different use cases in industry and science. Because of this abundance, practitioners and experts often find it difficult to analyze and select the right tool for their problems. In this chapter we present an introductory survey of the Big Data landscape, with the aim of providing algorithm designers with the knowledge needed to develop scalable and efficient machine learning solutions. We start by discussing the common technical concepts, paradigms, and technologies that form the foundation of frameworks such as Spark and Hadoop. Afterwards we analyze in depth the most popular Big Data frameworks and their main components. We then discuss other novel platforms for high-speed stream processing that are gaining increasing importance in industry. Finally, we compare two of the most relevant large-scale processing platforms today: Spark and Flink.
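The MapReduce paradigm underlying Hadoop, and which inspired the dataflow models of Spark and Flink, can be illustrated with a minimal single-machine word-count sketch. This is only a conceptual illustration: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are our own and not part of any framework's API, and a real engine would run the phases distributed across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all intermediate values by key, as the framework
    # does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key (here, summing counts).
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data tools", "big data frameworks"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'frameworks': 1}
```

The key point is that the user only writes the map and reduce functions; the framework handles partitioning, shuffling, and fault tolerance, which is what makes the paradigm scale to large clusters.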
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this chapter
Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Big Data: Technologies and Tools. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_2
DOI: https://doi.org/10.1007/978-3-030-39105-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-39104-1
Online ISBN: 978-3-030-39105-8
eBook Packages: Computer Science, Computer Science (R0)