Abstract
The first version of Spark was open sourced in 2010, and it went into Apache incubation in 2013. By early 2014, it was promoted to a top-level Apache project. It has already replaced Hadoop as the Big Data processing engine of choice in most organizations. This is a testament to its maturity and the richness of its design. Batch processing, iterative and interactive computation, stream processing, graph analytics, ETL, machine learning, and data warehousing: you name it, and Spark can already handle it. This chapter is a hands-on primer on Spark that sets the stage for the rest of the book.
Notes
- 1.
Matei Zaharia et al., “Spark: Cluster Computing with Working Sets,” Proceedings of HotCloud ’10 (USENIX Association, 2010).
- 2.
*Insert “speed” joke here.*
- 3.
Reynold Xin, “Shark, Spark SQL, Hive on Spark, and the Future of SQL on Spark,” Databricks, July 1, 2014, https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html .
- 4.
Reynold Xin, Michael Armbrust, and Davies Liu, “Introducing DataFrames in Spark for Large Scale Data Science,” Databricks, February 17, 2015, https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html .
- 5.
This requires passwordless key-based authentication between the master and all worker nodes. Alternatively, you can set SPARK_SSH_FOREGROUND and provide a password for each worker machine.
- 6.
A task may or may not correspond to a single transformation. This depends on the dependencies in a stage. Refer to Chapter 4 for details on dependencies.
- 7.
- 8.
- 9.
Spark uses HADOOP_CONF_DIR and YARN_CONF_DIR to access HDFS and talk to the YARN resource manager.
- 10.
Tom White, “What’s New in Apache Hadoop 0.21,” Cloudera, August 26, 2010, http://blog.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/ .
- 11.
Copyright information
© 2016 Zubair Nabi
About this chapter
Cite this chapter
Nabi, Z. (2016). Introduction to Spark. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_2
DOI: https://doi.org/10.1007/978-1-4842-1479-4_2
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-1480-0
Online ISBN: 978-1-4842-1479-4
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)