Encyclopedia of Big Data Technologies

2019 Edition
Editors: Sherif Sakr, Albert Y. Zomaya

Caching for SQL-on-Hadoop

  • Gene Pang
  • Haoyuan Li
Reference work entry
DOI: https://doi.org/10.1007/978-3-319-77525-8_249


Caching for SQL-on-Hadoop refers to techniques and systems that store data to provide faster access to it for Structured Query Language (SQL) engines running on the Apache Hadoop ecosystem.


The Apache Hadoop software project (Apache Hadoop 2018) has grown in popularity for distributed computing and big data. The Hadoop stack is widely used for storing large amounts of data and for large-scale, distributed, fault-tolerant processing of that data. The Hadoop ecosystem has helped organizations extract actionable insight from large volumes of collected data, a task that is difficult or infeasible with traditional data processing methods.

The main storage system for Hadoop is the Hadoop Distributed File System (HDFS), a distributed storage system that provides fault-tolerant and scalable storage. The main data processing framework for Hadoop is MapReduce, which is based on the Google MapReduce project (Dean and Ghemawat 2008). MapReduce...



  1. Alluxio (2018) Alluxio – open source memory speed virtual distributed storage. https://www.alluxio.org/. Accessed 19 Mar 2018
  2. Apache Drill (2018) Apache Drill. https://drill.apache.org. Accessed 19 Mar 2018
  3. Apache Hadoop (2018) Welcome to Apache Hadoop! http://hadoop.apache.org. Accessed 19 Mar 2018
  4. Apache Hadoop HDFS (2018) Centralized cache management in HDFS. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html. Accessed 19 Mar 2018
  5. Apache Hive (2018) Apache Hive. https://hive.apache.org. Accessed 19 Mar 2018
  6. Apache Hive LLAP (2018) LLAP. https://cwiki.apache.org/confluence/display/Hive/LLAP. Accessed 19 Mar 2018
  7. Apache Ignite (2018) Apache Ignite. https://ignite.apache.org/index.html. Accessed 19 Mar 2018
  8. Apache Impala (2018) Apache Impala. https://impala.apache.org. Accessed 19 Mar 2018
  9. Apache Spark SQL (2018) Spark SQL. https://spark.apache.org/sql/. Accessed 19 Mar 2018
  10. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
  11. Facebook (2018) Presto. https://prestodb.io/. Accessed 19 Mar 2018
  12. Floratou A et al (2016) Adaptive caching in big SQL using the HDFS cache. In: SoCC’16 proceedings of the seventh ACM symposium on cloud computing, Santa Clara, 5–7 Oct 2016
  13. Spark RDD (2018) RDD programming guide. http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence. Accessed 19 Mar 2018
  14. Spark SQL (2018) Spark SQL, dataframes and datasets guide. http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory. Accessed 19 Mar 2018

Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  1. Alluxio Inc., San Mateo, USA

Section editors and affiliations

  • Yuanyuan Tian (1)
  • Fatma Özcan (2)
  1. IBM Almaden Research Center, San Jose, USA
  2. IBM Research – Almaden, San Jose, USA