
Big Data: Technologies and Tools

Chapter in the book Big Data Preprocessing

Abstract

The fast-evolving Big Data environment has led to the emergence of a myriad of tools, paradigms, and techniques for tackling different use cases in industry and science. Precisely because of this abundance, practitioners and experts often find it difficult to analyze and select the right tool for their problem. In this chapter we present an introductory overview of the broad Big Data landscape, with the aim of giving algorithm designers the knowledge they need to develop scalable and efficient machine learning solutions. We start by discussing the common technical concepts, paradigms, and technologies that form the foundation of frameworks such as Spark and Hadoop. Afterwards, we analyze in depth the most popular Big Data frameworks and their main components. Next, we discuss other novel platforms for high-speed stream processing that are gaining increasing importance in industry. Finally, we compare two of the most relevant large-scale processing platforms today: Spark and Flink.
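The MapReduce paradigm that underlies Hadoop, and that Spark and Flink generalize, can be illustrated with the classic word-count example. The following is a didactic single-process sketch in plain Python (names like `map_phase` and `shuffle` are our own illustrative choices, not an API of any framework); in a real cluster the map, shuffle, and reduce phases run distributed across many nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit one (word, 1) pair per word occurrence.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all intermediate values by key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data tools", "big data frameworks"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
print(counts)  # each word mapped to its total occurrence count
```

The key property of the paradigm is that `map_phase` and `reduce_phase` operate on independent records and key groups, so each can be parallelized across machines without shared state.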





Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Luengo, J., García-Gil, D., Ramírez-Gallego, S., García, S., Herrera, F. (2020). Big Data: Technologies and Tools. In: Big Data Preprocessing. Springer, Cham. https://doi.org/10.1007/978-3-030-39105-8_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-39104-1

  • Online ISBN: 978-3-030-39105-8

  • eBook Packages: Computer Science, Computer Science (R0)
