
Introduction to Spark

A chapter in Pro Spark Streaming

Abstract

The first version of Spark was open sourced in 2010, and the project entered Apache incubation in 2013. By early 2014, it had been promoted to a top-level Apache project. It has already replaced Hadoop MapReduce as the Big Data processing engine of choice in many organizations, a testament to the maturity and richness of its design. Batch processing, iterative and interactive computation, stream processing, graph analytics, ETL, machine learning, data warehousing: you name it, and Spark can handle it. This chapter is a hands-on primer on Spark that sets the stage for the rest of the book.


Notes

  1. Matei Zaharia et al., “Spark: Cluster Computing with Working Sets,” Proceedings of HotCloud ’10 (USENIX Association, 2010).

  2. *Insert “speed” joke here.*

  3. Reynold Xin, “Shark, Spark SQL, Hive on Spark, and the Future of SQL on Spark,” Databricks, July 1, 2014, https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html.

  4. Reynold Xin, Michael Armbrust, and Davies Liu, “Introducing DataFrames in Spark for Large Scale Data Science,” Databricks, February 17, 2015, https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html.

  5. This requires passwordless key-based authentication between the master and all worker nodes. Alternatively, you can set SPARK_SSH_FOREGROUND and provide a password for each worker machine.
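As an illustrative sketch of this setup (the hostnames and paths are hypothetical; adjust them to your cluster), the standalone launch scripts read the worker list from conf/slaves and SSH into each host:

```shell
# Hypothetical worker list: one hostname per line in conf/slaves.
# Spark's standalone launch scripts SSH into each of these hosts.
echo "worker1.example.com" >> "$SPARK_HOME/conf/slaves"
echo "worker2.example.com" >> "$SPARK_HOME/conf/slaves"

# With passwordless key-based SSH in place, start the master and all workers:
"$SPARK_HOME/sbin/start-all.sh"

# Without passwordless SSH, launch the workers serially in the foreground
# and type each machine's password when prompted:
SPARK_SSH_FOREGROUND=yes "$SPARK_HOME/sbin/start-slaves.sh"
```

Setting SPARK_SSH_FOREGROUND to any non-empty value makes the scripts run each SSH command in the foreground so the password prompt is visible, rather than launching all workers in parallel.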

  6. A task may or may not correspond to a single transformation; whether it does depends on the dependencies within a stage. Refer to Chapter 4 for details on dependencies.

  7. www.scala-sbt.org/.

  8. https://github.com/sbt/sbt-assembly.

  9. Spark uses HADOOP_CONF_DIR and YARN_CONF_DIR to access HDFS and talk to the YARN resource manager.
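A minimal sketch of how these variables are used in practice (the Hadoop config path, application class, and jar name are hypothetical placeholders):

```shell
# Hypothetical Hadoop configuration directory; adjust to your cluster layout.
# Spark reads these to locate HDFS and the YARN resource manager.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# With the config directories exported, submit an application to YARN:
"$SPARK_HOME/bin/spark-submit" \
  --master yarn \
  --class com.example.MyApp \
  myapp-assembly.jar
```

Unlike standalone mode, no Spark master URL is given here: the YARN resource manager's address comes from the Hadoop configuration files that these environment variables point to.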

  10. Tom White, “What’s New in Apache Hadoop 0.21,” Cloudera, August 26, 2010, http://blog.cloudera.com/blog/2010/08/what%E2%80%99s-new-in-apache-hadoop-0-21/.

  11. http://tachyon-project.org/.


Copyright information

© 2016 Zubair Nabi

About this chapter

Cite this chapter

Nabi, Z. (2016). Introduction to Spark. In: Pro Spark Streaming. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-1479-4_2
