Caching for SQL-on-Hadoop
Caching for SQL-on-Hadoop refers to techniques and systems that store data to provide faster access to it for Structured Query Language (SQL) engines running on the Apache Hadoop ecosystem.
The Apache Hadoop software project (Apache Hadoop 2018) has grown in popularity for distributed computing and big data. The Hadoop stack is widely used for storing large amounts of data and for large-scale, distributed, fault-tolerant processing of that data. The Hadoop ecosystem has been important for organizations that need to extract actionable insight from large volumes of collected data, a task that is difficult or infeasible with traditional data processing methods.
The main storage system for Hadoop is the Hadoop Distributed File System (HDFS), a distributed storage system that provides fault-tolerant and scalable storage. The main data processing framework for Hadoop is MapReduce, which is based on the Google MapReduce project (Dean and Ghemawat 2008). MapReduce...
- Alluxio (2018) Alluxio – open source memory speed virtual distributed storage. https://www.alluxio.org/. Accessed 19 Mar 2018
- Apache Drill (2018) Apache Drill. https://drill.apache.org. Accessed 19 Mar 2018
- Apache Hadoop (2018) Welcome to Apache Hadoop! http://hadoop.apache.org. Accessed 19 Mar 2018
- Apache Hadoop HDFS (2018) Centralized cache management in HDFS. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html. Accessed 19 Mar 2018
- Apache Hive (2018) Apache Hive. https://hive.apache.org. Accessed 19 Mar 2018
- Apache Hive LLAP (2018) LLAP. https://cwiki.apache.org/confluence/display/Hive/LLAP. Accessed 19 Mar 2018
- Apache Ignite (2018) Apache Ignite. https://ignite.apache.org/index.html. Accessed 19 Mar 2018
- Apache Impala (2018) Apache Impala. https://impala.apache.org. Accessed 19 Mar 2018
- Apache Spark SQL (2018) Spark SQL. https://spark.apache.org/sql/. Accessed 19 Mar 2018
- Facebook (2018) Presto. https://prestodb.io/. Accessed 19 Mar 2018
- Floratou A et al (2016) Adaptive caching in big SQL using the HDFS cache. In: SoCC’16 proceedings of the seventh ACM symposium on cloud computing, Santa Clara, 5–7 Oct 2016
- Spark RDD (2018) RDD programming guide. http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence. Accessed 19 Mar 2018
- Spark SQL (2018) Spark SQL, dataframes and datasets guide. http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory. Accessed 19 Mar 2018