Introduction

Hadoop is a popular open source software framework that allows the distributed processing of large scale data sets [1]. It employs the MapReduce paradigm to divide the computation tasks into parts that can be distributed to a commodity cluster and therefore, provides horizontal scalability [2]–[9]. The MapReduce functions of Hadoop uses (key,value) pairs as data format. The input is retrieved in chunks from Hadoop Distributed File System (HDFS) and assigned to one of the mappers that will process data in parallel and produce the (k1,v1) pairs for the reduce step. Then, (k1,v1) pair goes through shuffle phase that assigns the same k1 pairs to the same reducer. The reducers gather the pairs with the same k1 values into groups and perform aggregation operations (see Figure 1). HDFS is the underlying file system of Hadoop. Due to its simplicity, scalability, fault-tolerance and efficiency Hadoop has gained significant support from both industry and academia; however, there are some limitations in terms of its interfaces and performance [10]. Querying the data with Hadoop as in a traditional RDBMS infrastructure is one of the most common problems that Hadoop users face. This affects a majority of users who are not familiar with the internal details of MapReduce jobs to extract information from their data warehouses.

Figure 1
figure 1

MapReduce tasks.

Hadoop Hive is an open source SQL-based distributed warehouse system which is proposed to solve the problems mentioned above by providing an SQL-like abstraction on top of Hadoop framework. Hive is an SQL-to-MapReduce translator with an SQL dialect, HiveQL, for querying data stored in a cluster [11]–[13]. When users want to benefit from both MapReduce and SQL, mapping SQL statements to MapReduce tasks can become a very difficult job [14]. Hive does this work by translating queries to MapReduce jobs, thereby exploiting the scalability of Hadoop while presenting a familiar SQL abstraction [15]. These attributes of Hive make it a suitable tool for data warehouse applications where large scale data is analyzed, fast response times are not required, and there is no need to update data frequently [4].

Since most data warehouse applications are implemented using SQL-based RDBMSs, Hive lowers the barrier to moving these applications to Hadoop, thus, people who already know SQL can easily use Hive. Similarly, Hive makes it easier for developers to port SQL-based applications to Hadoop. Since Hive is based on a query-at-a-time model and processes each query independently, issuing multiple queries in close time interval decreases performance of Hive due to its execution model. From this perspective, it is important to note that there has been no study, to date, that incorporates the Multiple-query optimization (MQO) technique for Hive to reduce the total execution time of a batch of queries [16]–[18].

Studies concerning MQO for traditional warehouses have shown that it is an efficient technique that dramatically increases the performance of time-consuming decision support queries [2],[19]–[21]. In order to improve the performance of Hadoop Hive in massively issued query environments, we propose SharedHive, which processes HiveQL queries as a batch and improves the total execution time by merging correlated queries before passing them to the Hive query optimizer [6],[15],[22]. By analyzing the common tasks of correlated HiveQL queries we merge them to a new set of insert queries with an optimization algorithm and execute as a batch. The developed model is introduced as a novel component for Hadoop Hive architecture.

In Related work Section, brief information is presented concerning the related work on MQO, SQL-to-MapReduce translators that are similar to Hive, and recent query optimization studies on MapReduce framework. SharedHive system architecture Section explains the traditional architecture of Hive and introduces our novel MQO component. The next Section (Sharing scan and computation tasks of HiveQL queries) explains the process of generating a set of merged insert queries from correlated queries. Experimental setup and results Section discusses the experiments conducted to evaluate the SharedHive framework for HiveQL queries that have different correlation levels. The Section before the conclusion presents the comparison of SharedHive with the other MapReduce-based MQO methods. Our concluding remarks are given in Conclusions and future work Section.

Related work

The MQO problem was introduced in the 1980s and finding an optimal global query plan using MQO was shown to be an NP-Hard problem [16],[23]. Since then, a considerable amount of work has been undertaken on RDBMSs and data analysis applications [24]–[26]. Mehta and DeWitt considered CPU utilization, memory usage, and I/O load variables in a study during planning multiple queries to determine the degree of intra-operator parallelism in parallel databases to minimize the total execution time of declustered join methods [27]. A proxy-based infrastructure for handling data intensive applications has been proposed by Beynon [28]; however, this infrastructure was not as scalable as a collection of distributed cache servers available at multiple back-ends. A data integration system that reduces the communication costs by a multiple query reconstruction algorithm is proposed by [29]. IGNITE [30] and QPipe [31] are important studies that use the micro machine concept for query operators to reduce the total execution time of a set of queries. A novel MQO framework is proposed for the existing SPARQL query engines [32]. A cascade-style optimizer for Scope, Microsoft’s system for massive data analysis, is designed in [33]. CoScan [34],[35] shows how sharing scan operations can benefit multiple MapReduce queries.

In recent years, a significant amount of research and commercial activity has focused on integrating MapReduce and structured database technologies [36]. Mainly there are two approaches, either adding MapReduce features to a parallel database or adding database technologies to MapReduce. The second approach is more attractive because no widely available open source parallel database system exists, whereas MapReduce is available as an open source project. Furthermore, MapReduce is accompanied by a plethora of free tools as well as having cluster availability and support. Hive [11], Pig [37], Scope [20], and HadoopDB [10],[38] are projects that provide SQL abstractions on top of MapReduce platform to familiarize the programmers with complex queries. SQL/MapReduce [39] and Greenplum [21] are recent projects that use MapReduce to process user-defined functions (UDF).

Recently, there have been interesting studies that apply MQO to MapReduce frameworks for unstructured data; for example MRShare [40] processes a batch of input queries as a single query. The optimal grouping of queries for execution is defined as an optimization problem based on MapReduce cost model. The experimental results reported for MRShare demonstrate its effectiveness. In spite of some initial MQO studies to reduce the execution time of MapReduce-based single queries [41], to our knowledge there is no study similar to ours that is related to the MQO of Hadoop Hive by using insert query statements.

SharedHive system architecture

In this section, we briefly present the architecture of SharedHive which is a modified version of Hadoop Hive with a new MQO component inserted on top of the Driver component of Hive (see Figure 2). Inputs to the driver which contains compiler, optimizer and executer are pre-processed by the added Multiple Query Optimizer component which analyzes incoming queries and produces a set of merged HiveQL insert queries. Finally, the remaining queries that don’t have any correlation with others are appended at the end of the correlated query sets. The system catalog and relational database structure (relations, attributes, partitions, etc.) are stored and maintained by Metastore. Once a HiveQL statement is submitted, it is maintained by Driver which controls the execution of tasks in order to answer the query. Compiler parses the query string and transforms the parse tree to a logical plan. Optimizer performs several passes over the logical plan and rewrites it. The physical plan generator creates a physical plan from the logical plan.

Figure 2
figure 2

Architecture of SharedHive with newly added multiple query optimizer component.

HiveQL statements are submitted via the Command Line Interface (CLI), the Web User Interface or the thrift interface. Normally, the query is directed to the driver component in conventional Hive architecture. In SharedHive, the MQO component (located after the client interface) receives the incoming queries before the driver component. The set of incoming queries are inspected, their common tables and intermediate common joins are detected, and merged to obtain a new set of HiveQL queries that answer all the incoming queries. The details of this process are explained in the next Section.

The new MQO component passes the new set of merged queries to the compiler component of Hive driver that produces a logical plan using information from the Metastore and optimizes this plan using a single rule-based optimizer. The execution engine receives a directed acyclic graph (DAG) of MapReduce and associated HDFS tasks, then executes them in accordance with the dependencies of the tasks. The new MQO component does not require any major changes in the system architecture of Hadoop Hive and can be easily integrated into Hive.

Sharing scan and computation tasks of HiveQL queries

In order to benefit from the common scan/join tasks of the input queries and reduce the number (i.e. total amount) of redundant tasks, SharedHive merges input queries into a new set of HiveQL insert queries and produces answers to each query as a separate HDFS file.

The problem of merging a set of queries can be formally described as:

Input: A set of HiveQL queries Q={q1,...,q n }.Output: A set of merged HiveQL queries Q={q 1 ,...,q m }, where mn.

Rewrite/combine the given input queries in such a way that the total execution time of query set Q is less than the total execution time of query set Q. If the execution time of query q i is represented with t i then

i = 1 m ( t i ) i = 1 n ( t i )
(1)

Given q i is the merged insert query corresponding to queries q j and q k then all of the output tuples and columns required by both queries must be produced by query q i preserving the predicate attributes of q j and q k .

The existing architecture of Hive produces several jobs that run in parallel to answer a query. The insert queries merged by SharedHive can combine the scan and/or intermediate join operations of the input queries in a new set of insert queries and gain performance increases by reducing the number of MapReduce tasks and the sizes of read/written HDFS files.

Unlike the traditional SQL statements, HiveQL join query statements are written in the FROM part of the query [15] such as

The example below shows how a merged HiveQL insert query for TPC-H queries Q1 and Q6 is constructed.

Merging TPC-H Queries Q1 and Q6 :

The underlying SQL-to-Mapreduce translator of Hive uses one operation to one job model [22] and opens a new job for each operation (table scan, join, group by, etc.) in a SQL statement. Significant performance increases can be obtained by reducing the number of MapReduce tasks of these jobs. Figure 3 presents MapReduce tasks of merged insert query (Q1+Q6) that reduces the scan operations.

Figure 3
figure 3

MapReduce tasks for merged insert query ( Q1 + Q6 ).

In the Appendix, three merged queries are presented to explain the merging process of HiveQL queries that share common input and output parts. The first one merges two Q1 queries that have different selection predicates, the second one merges two fully-correlated queries, Q14 and Q19, that share a common join operation and the third one merges two partially correlated queries Q1 and Q18[42].

HiveQL statements have a preprocessing overhead for MapReduce tasks that will be executed to complete a query and this causes high latencies that could cause short running queries to take longer time on Hive [43]. In addition to the emerging opportunities of using common table scan and join operations, SharedHive intends to decrease the preprocessing period of uncorrelated query MapReduce tasks.

In the merging process of SharedHive, each query is classified according to the shared tables and/or join operations in the FROM clause of HiveQL statements. The input queries are inserted into a data structure that maintains the groups of similar queries according to the largest sharing opportunity they have with other queries.

While grouping the queries, the highest precedence is given to (a) queries with fully-correlated FROM expressions, (b) queries with partially-correlated FROM expressions and (c) queries that have no correlation with the other queries (which are appended to the end of the set of merged queries) (see Algorithm ??). With this approach, the common scan/join tasks in merged insert queries are not executed repeatedly [15]. After the merging process, the optimized set of insert queries are passed to the query execution layer of Hive.

Experimental setup and results

In this section, experimental setup and the performance evaluation of the merged HiveQL insert queries are presented. TPC-H is chosen as our benchmark database and related decision support queries because they process high amounts of data [44]. We believe this is a good match for our experiments since Hadoop is also designed to process large amounts of data. 11 query sets are prepared from standard TPC-H queries to experimentally analyze performance of SharedHive under different workload scenarios. These query sets define three correlation categories for merged queries (uncorrelated, partially correlated, and fully correlated). Uncorrelated queries have nothing in common, partially-correlated queries share at least one table and zero or more join operations. Fully-correlated queries have exactly the same list of the tables/joins (where conditions have different selection predicates). Table 1 gives the selected set of queries and their correlation levels. Query sets 8 and 9 use single queries that are submitted several times with different selection predicates. Query set 8 executes no join operation so that it presents the performance gains of SharedHive with intensive scan sharing, whereas query set 9 includes common join operations that require communication between datanodes. Query sets 10 and 11 include queries that produce several merged insert queries.

Table 1 Sets of selected TPC-H queries and their correlation levels

Three different TPC-H decision support databases with sizes 1GB, 100GB and 1TB are used. Similar experimental settings are used in previous studies [22],[40].

The experiments are performed on a private Cloud server, 4U DELL PowerEdge R910 having 32 (64 with Hyper Threading) cores. Each core is Intel Xeon E7-4820 with 2.0GHz processing power. The server has 128GB DDR3 1600MHz virtualized memory and Broadcom Nextreme II 5709 1Gbps NICs. Operating system of the physical server is Windows Server 2012 Standard Edition. 20 Linux CentOS 6.4 virtual machines are installed on this server as guest operating systems. Each virtual machine has two 2.0GHz processors, 2GB RAM and 250GB disk storage. An additional master node is used as NameNode/JobTracker (4 processors, 8GB RAM and 500GB disk storage). The latest stabilized versions of Hadoop, release 1.2.1 and Hive version 0.12.0 are used [1],[11]. The splitsize of the files (HDFS block size) is 64MB, replication number is 2, maximum number of map tasks is 2, maximum number of reduce tasks is 2 and map output compression is disabled during the experiments.

In order to remove noise in performance measurements, the Cloud server is only dedicated to our experiment during the performance evaluation. Therefore, we believe that performance interference from external factors such as network congestion or OS-level contention on shared resources are minimized as much as possible. We observe that there were only negligible changes in the response time of the queries when we repeated our experiments three times.

Table 2 presents the response times of TPC-H queries (Q1, Q3, Q6, Q11, Q12, Q14, Q17, Q18, Q19, Q22) with 1GB, 100GB and 1TB database sizes. These results constitute baselines to compare the results of the merged HiveQL queries with single execution performance of Hive.

Table 2 Execution times (sec.) of single TPC-H queries

Tables 3 and 4 show the performance increases for the selected HiveQL query sets given in Table 1 that are merged and run to observe the effect of SharedHive on total response times. The percentage values show the reduction of the response time. Significant performance increases can be seen easily.

Table 3 Execution times of sequential and merged queries in seconds
Table 4 Execution times of sequential and merged query sets (8 and 9) in seconds

Although uncorrelated queries have nothing in common, their total execution times are observed to reduce by 0.2%-6.9% due to the improvement in HIVE query preprocessing overheads. Merging uncorrelated queries does not increase the performance when the database size reaches terabyte scale. The reductions in total execution times of partially correlated queries is higher than uncorrelated query sets (between 1.5%-20.8%). The highest benefits are observed in the fully correlated query sets (between 9.9%-39.9%). For query set 8 (single Q1 query submitted 8 times) the total query execution time is reduced from 26,716 to 3,985 seconds (85.1% reduction). The performance of mixed query sets depends on the correlation level of the queries they contain. Mixed query sets 10 and 11 execute their queries with 9.9% and 15.5% less execution times, respectively. During these experiments, the size of the intermediate tables that are written to the disks is considered carefully by SharedHive. If predicted overhead of writing intermediate results is larger than the expected improvement in response time, then queries are not merged. SharedHive is observed to reduce the number of MapReduce tasks and the sizes of read/written HDFS files as well. The results given in Table 5 present the effect of SharedHive for the number of MapReduce tasks and the sizes of read/written files of the given insert queries (having different correlation levels). As the correlation level of queries increases the number of MapReduce tasks and the sizes of read/written data also decreases substantially.

Table 5 Comparing the number of MapReduce tasks and the sizes of read/written HDFS files by Hive and SharedHive for different correlation level 100GB TPC-H data warehouse queries

The optimization time of SharedHive on analyzing and merging the queries is observed to be small. This is because of the small number of input queries and executing Algorithm ?? on them requires only examination of their FROM clauses which are parsed to identify similar expressions and rewriting the merged HiveQL query. This optimization does not take more than a few milliseconds.

In the last phase of our experiments, SharedHive is run on five different cluster sizes to observe its scalability with increasing number of datanodes. First, the merged insert query (Q14+Q19) is executed on the cluster using three different database sizes (1GB, 10GB and 100GB). It is observed that increasing the number of datanodes in the cluster improves the performance of the merged query reducing execution times by 29%, 81% and 88% in the database instances when the number of datanodes is increased from 1 to 20 (see Figure 4).

Figure 4
figure 4

The effect of increasing number of datanodes for merged query ( Q14 + Q19 ) with 1GB, 10GB and 100GB database sizes.

MQO component of SharedHive is an extension to Hive and welcomes any performance increase that is achieved on the HDFS layer either due to increase in the number of datanodes or balanced distribution of data files.

Comparison with other MapReduce-based MQO systems

SharedHive can perform the execution of the selected/correlated queries in shorter times than Hive by reducing the number of MapReduce tasks and the sizes of the files read/written by the tasks. The correlation detection mechanism of SharedHive is simple and does not find the number common rows and/or columns of queries with complex algorithms as in [19],[29]. The execution time performance gains are observed to be within the range of %1.5-85.1% in accordance with the correlation level of the queries. For repeatedly issued similar queries that have different predicates, SharedHive performs well. SharedHive benefits from underlying HDFS architecture therefore, its scalability is preserved and better performance is obtained when additional datanodes are introduced to Hadoop. The query results obtained by SharedHive have been compared with those of Hive and verified to be the same.

MRShare [40] is a recent MQO system developed for benefitting from multiple queries containing similar MapReduce tasks. It transforms a batch of queries into a new batch that will be executed more efficiently by merging jobs into groups and evaluating each group as a single query. MRShare optimizes queries that work on the same input table and does not consider sharing of join operations. However, SharedHive can merge queries containing joins into a new set of insert queries. MRShare shares scan tasks by creating a single job for multiple jobs and does not use temporary files (as it is done by SharedHive).

YSmart [22] is a correlation-aware MQO system similar to SharedHive. It detects and removes redundant MapReduce tasks of single complex queries but does not optimize multiple queries. The developers of YSmart present experimental results that significantly outperform conventional Hive for single queries. SharedHive does not provide any performance increase for single queries unless they are submitted several times (with different predicates). SharedHive works in the application layer of Hive by merging the query level operations, whereas MRShare and YSmart explore and eliminate redundant tasks in the MapReduce layer.

Apache Pig is the most mature MapReduce-based platform that supports a large number of sharing mechanisms among multiple queries [37]. Complex tasks consisting of multiple interrelated data transformations are explicitly encoded as data flow sequences; however, its query language, Pig Latin, is not compatible with standard SQL statements like SharedHive.

Conclusions and future work

In this study, we propose a multiple query optimization (MQO) framework, SharedHive, for improving the performance of MapReduce-based data warehouse Hadoop Hive queries. To our knowledge, this is the first work that aims at improving the performance of Hive with MQO techniques. In SharedHive, we detect common tasks of correlated TPC-H HiveQL queries and merge them into a new set of global Hive insert queries. With this approach, it has been experimentally shown that significant performance improvements can be achieved by reducing the number of MapReduce tasks and the total sizes of read/written files.

As future work, we plan to incorporate MQO functionality at MapReduce layer, similar to YSmart, into SharedHive. In this way, it will be possible to eliminate even more redundant MapReduce tasks in queries and improve the overall performance of naïve rule-based Hive query optimizer even further.

Appendix

A. Merging two Q1 queries that have different select predicates

B. Merging queries Q14 and Q19(Fully correlated FROM clauses)

C. Merging queries Q1 and Q18(Partially correlated FROM clauses)