
Introduction to Spark

A chapter in Pro Spark Streaming

Abstract

The first version of Spark was open sourced in 2010, and the project entered Apache incubation in 2013. By early 2014, it had been promoted to a top-level Apache project. It has already replaced Hadoop MapReduce as the Big Data processing engine of choice in many organizations, a testament to the maturity and richness of its design. Batch processing, iterative and interactive computation, stream processing, graph analytics, ETL, machine learning, data warehousing: you name it, and Spark can handle it. This chapter is a hands-on primer on Spark that sets the stage for the rest of the book.


Notes

  1. Matei Zaharia et al., “Spark: Cluster Computing with Working Sets,” Proceedings of HotCloud ’10 (USENIX Association, 2010).

  2. *Insert “speed” joke here.*

  3. Reynold Xin, “Shark, Spark SQL, Hive on Spark, and the Future of SQL on Spark,” Databricks, July 1, 2014, https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html.

  4. Reynold Xin, Michael Armbrust, and Davies Liu, “Introducing DataFrames in Spark for Large Scale Data Science,” Databricks, February 17, 2015, https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.

  5. This requires passwordless key-based authentication between the master and all worker nodes. Alternatively, you can set SPARK_SSH_FOREGROUND and provide a password for each worker machine.
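As an illustrative sketch of this setup (the hostnames and paths are hypothetical; adjust them to your cluster), the standalone launch scripts read the worker list from conf/slaves and SSH into each host:

```shell
# Hypothetical worker list: one hostname per line in conf/slaves.
# Spark's standalone launch scripts SSH into each of these hosts.
echo "worker1.example.com" >> "$SPARK_HOME/conf/slaves"
echo "worker2.example.com" >> "$SPARK_HOME/conf/slaves"

# With passwordless key-based SSH in place, start the master and all workers:
"$SPARK_HOME/sbin/start-all.sh"

# Without passwordless SSH, launch the workers serially in the foreground
# and type each machine's password when prompted:
SPARK_SSH_FOREGROUND=yes "$SPARK_HOME/sbin/start-slaves.sh"
```

Setting SPARK_SSH_FOREGROUND to any non-empty value makes the scripts run each SSH command in the foreground so the password prompt is visible, rather than launching all workers in parallel.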

  6. A task may or may not correspond to a single transformation; whether it does depends on the dependencies within a stage. Refer to Chapter 4 for details on dependencies.

  7. www.scala-sbt.org/.

  8. https://github.com/sbt/sbt-assembly.

  9. Spark uses HADOOP_CONF_DIR and YARN_CONF_DIR to access HDFS and talk to the YARN resource manager.
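A minimal sketch of how these variables are used in practice (the Hadoop config path, application class, and jar name are hypothetical placeholders):

```shell
# Hypothetical Hadoop configuration directory; adjust to your cluster layout.
# Spark reads these to locate HDFS and the YARN resource manager.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# With the config directories exported, submit an application to YARN:
"$SPARK_HOME/bin/spark-submit" \
  --master yarn \
  --class com.example.MyApp \
  myapp-assembly.jar
```

Unlike standalone mode, no Spark master URL is given here: the YARN resource manager's address comes from the Hadoop configuration files that these environment variables point to.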

  10. Tom White, “What’s New in Apache Hadoop 0.21,” Cloudera, August 26, 2010, http://blog.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/.

  11. http://tachyon-project.org/.


Copyright information

© 2016 Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). Introduction to Spark. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_2
