Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Big Data Indexing

  • Mohamed Y. Eltabakh
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_255-1

Definitions

The major theme of this topic is building indexes, which are auxiliary data structures, on top of big datasets to speed up their retrieval and querying. The topic covers a wide range of index types along with a comparison of their structures and capabilities.

Overview

Big data infrastructures such as Hadoop increasingly support applications that manage structured or semi-structured data. In many applications, including scientific applications, weblog analysis, click streams, transaction logs, and airline analytics, at least partial knowledge of the data's structure is available. For example, some attributes (columns in the data) may have known data types and value domains, while little may be known about other attributes. This knowledge, even if only partial, can enable optimization techniques that would otherwise not be possible.

Query optimization is a core mechanism in data management systems. It enables executing users' queries efficiently without the users having to know how their queries will execute. A query optimizer figures out the best query plan to execute: in which order the query operators will run, where the data should be located and how it will move during processing, and which segments of a query can execute in parallel versus in sequence. Query optimization in big data is highly important, especially because (1) the datasets to be processed are getting very large, (2) the analytical queries are increasing in complexity and may take hours to execute if not carefully optimized, and (3) the pay-as-you-go cost models of cloud computing add additional urgency for optimized processing.

A typical query in big data applications may touch files on the order of hundreds of GBs or even TBs in size. These queries are typically very expensive, as they consume significant resources and require long periods of time to execute. For example, in transaction log applications, e.g., the transaction history of customer purchases, one query might retrieve all transactions from the last 2 months that exceed a certain dollar amount. Such a query may need to scan billions of records and go over TBs of data.

Indexing techniques are well established in database systems, especially relational databases (Chamberlin et al. 1974; Maier 1983; Stonebraker et al. 1990), as a means to optimize query processing. Examples of the standard indexing techniques are the B+-tree (Bayer and McCreight 1972), the R-tree (Guttman 1984), and hash-based indexes (Moro et al. 2009), along with their variations. However, porting these techniques and structures to big data infrastructures is not straightforward due to the unique characteristics of both the data itself and the underlying infrastructure processing the data. At the data level, the data is no longer assumed to be stored in relational tables. Instead, it arrives and is stored in the form of big batches of flat files. In addition, the data size exceeds what relational database systems can typically handle.

At the infrastructure level, on the other hand, the processing model no longer follows the relational model of query execution, which relies on connecting a set of query operators together to form a query tree. Instead, the MapReduce computing paradigm is entirely different, as it relies on the two rigid phases of map and reduce (Dean and Ghemawat 2008). Moreover, the access pattern of the data from the file system is also different. In relational databases, data records are read in the form of disk pages (a.k.a. disk blocks), which are very small in size (typically between 8 and 128 KB) and usually hold only a few data records (tens or at most hundreds). Thus, database systems can support record-level access. In contrast, in the Hadoop Distributed File System (HDFS), a single data block ranges from 64 MB to 1 GB and usually holds many records, so record-level access no longer applies. Even the feasible operations over the data differ from those supported in relational databases; for example, record updates and deletes are not allowed in the MapReduce infrastructure. All of these unique characteristics of big data fundamentally affect the design of appropriate indexing and preprocessing techniques.

Plain Hadoop has been found to be orders of magnitude slower than distributed database management systems when evaluating queries on structured data (Stonebraker et al. 2010; Abadi 2010). One of the main reasons for this slow performance is the lack of indexing in the Hadoop infrastructure. As a result, significant research efforts have been dedicated to designing indexing techniques suitable for the Hadoop infrastructure. These techniques range from record-level indexing (Dittrich et al. 2010, 2012; Jiang et al. 2010; Richter et al. 2012) to split-level indexing (Eltabakh et al. 2013; Gankidi et al. 2014), from user-defined indexes (Dittrich et al. 2010; Jiang et al. 2010; Gankidi et al. 2014) to system-generated and adaptive indexes (Richter et al. 2012; Dittrich et al. 2012; Eltabakh et al. 2013), and from single-dimension indexes (Dittrich et al. 2010, 2012; Jiang et al. 2010; Richter et al. 2012; Eltabakh et al. 2013) to multidimensional indexes (Liu et al. 2014; Eldawy and Mokbel 2015; Lu et al. 2014).

Table 1 compares several of the Hadoop-based indexing techniques with respect to different criteria. Record-level techniques aim to skip irrelevant records within each data split, but they may still touch all splits. In contrast, split-level techniques aim to skip entire irrelevant splits. The SpatialHadoop system provides both split-level global indexing and record-level local indexing, and thus it can skip irrelevant data at both granularities.
Table 1 Comparison of Hadoop-based indexing techniques

Technique                                Granularity   Dimensionality  DB-Hybrid  Defined by  Index location
Hadoop++ (Dittrich et al. 2010)          Record        1               No         Admin       HDFS
HAIL (Dittrich et al. 2012)              Record        1               No         Admin       HDFS
LIAH (Richter et al. 2012)               Record        1               No         System      HDFS
E3 (Eltabakh et al. 2013)                Split         1 & 2           No         System      DB
SpatialHadoop (Eldawy and Mokbel 2015)   Record/Split  m               No         Admin       HDFS
ScalaGist (Lu et al. 2014)               Record        m               No         Admin       HDFS
HadoopDB (Abouzeid et al. 2009)          Record        m               Yes        Admin       DB
Polybase Index (Gankidi et al. 2014)     Split         m               Yes        Admin       DB

Some techniques index only one attribute at a time (Dimensionality = 1), while others allow indexing multidimensional data (Dimensionality = m). Techniques like HadoopDB and Polybase Index inherit their multidimensional capabilities from the underlying DBMS. The E3 technique enables indexing pairs of values (from two attributes) but only for a limited subset of the possible values. Most techniques operate only on the HDFS data (DB-Hybrid = No), while HadoopDB and Polybase Index integrate a database system with HDFS to form a hybrid system.

In most of the proposed techniques, the system's admin decides which attributes to index. The only exceptions are the LIAH index, an adaptive index that automatically detects changes in the workload and accordingly creates (or deletes) indexes, and the E3 index, which automatically indexes all attributes, possibly in different ways depending on the data types and the workload. Finally, the index structure is stored either in HDFS along with its data, as in Hadoop++, HAIL, LIAH, SpatialHadoop, and ScalaGist; in a database system along with its data, as in HadoopDB; or in a database system while the data resides in HDFS, as in E3 and Polybase Index. The following sections cover a few of these techniques in more detail.

Target Queries: Indexing techniques target optimizing queries that involve selection predicates, which is the common theme for all techniques listed in Table 1. Yet, they may differ in how queries are expressed and in the mechanism by which the selection predicates are identified. For example, Hadoop++, HAIL, and LIAH allow expressing the query in Java while passing the selection predicates as arguments within the map-reduce job configuration (a sketch of this mechanism is given after the Jaql discussion below). A customized input format then receives these predicates (if any) and performs the desired filtering during execution. In contrast, the E3 framework is built on top of the Jaql high-level query language, and thus queries are expressed, compiled, and optimized using the Jaql engine (Beyer et al. 2011). An example query is as follows:

read( hdfs("docs.json") )
-> transform { author: $.meta.author,
               products: $.meta.product,
               Total: $.meta.Qty * $.meta.Price }
-> filter $.products == "XYZ";

Jaql applies selection push-down during query compilation whenever possible. As a result, in the given query the filter operator will be pushed before the transform operator, with the appropriate rewriting (e.g., the predicate is rewritten to reference the base field $.meta.product). The E3 framework can then detect this filtering operation directly after the read operation of the base file and thus can push the selection predicate into its customized input format to apply the filtering as early as possible.
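
For the Hadoop++/HAIL/LIAH family mentioned above, the predicates travel through the job configuration rather than a query language. The following minimal Java sketch illustrates that hand-off; the property keys and the TrojanIndexInputFormat class are hypothetical placeholders for illustration, not these systems' actual names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PredicatePushdownSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical keys: the indexed attribute and the predicate range.
        conf.set("index.filter.attribute", "amount");
        conf.set("index.filter.low", "1000");
        conf.set("index.filter.high", "5000");
        Job job = Job.getInstance(conf, "indexed-scan");
        // A custom input format would read these keys in createRecordReader()
        // and filter records (or skip splits) before the map function sees them:
        // job.setInputFormatClass(TrojanIndexInputFormat.class);
        // ... mapper class, input/output paths, and job submission elided ...
    }
}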

HadoopDB provides a front-end, called SMS, for expressing SQL queries on top of its data. SMS is an extension of Hive. In HadoopDB, queries are expressed identically to standard SQL, as in the following example:

SELECT pageURL, pageRank
  FROM Rankings
  WHERE pageRank > 10;

SpatialHadoop is designed for spatial queries, and thus it provides a high-level language and constructs for expressing these queries and operating on spatial objects, e.g., points and rectangles. For example, a query can be expressed as follows:

Objects = LOAD "points" AS (id:int, Location:POINT);
Result  = FILTER Objects BY Overlaps(Location,
                                     Rectangle(x1, y1, x2, y2));

ScalaGist enables building Gist indexes, e.g., B+-tree and R-tree, over HDFS data. A single query in ScalaGist can make use of multiple indexes at the same time. For example, given a table T with schema {x, …, (a1, a2)}, where x is a one-dimensional column and (a1, a2) is a two-dimensional column, the following query can use both a B+-tree index (on x) and an R-tree index (on (a1, a2)) during its evaluation:

SELECT *
  FROM T
 WHERE x <= 100
   AND a1 BETWEEN 10 AND 20
   AND a2 BETWEEN 30 AND 60;

Finally, the Polybase system enables expressing queries using standard SQL over HDFS data that are defined as external tables. First, users need to define the external table as in the following example:

CREATE EXTERNAL TABLE hdfsLineItem
( l_orderkey BIGINT NOT NULL,
  l_partkey  BIGINT NOT NULL,
  ... )
WITH ( LOCATION    = '/tpch1gb/lineitem.tbl',
       DATA_SOURCE = VLDB_HDP_Cluster,
       FILE_FORMAT = Text_Delimited );

And then, a query on the external table can be expressed as follows:

SELECT *
  FROM hdfsLineItem
  WHERE l_orderkey = 1;

Record-Level Nonadaptive Indexing

Hadoop++ (Dittrich et al. 2010) is an indexing technique built on top of the Hadoop infrastructure. Unlike techniques that require extensive changes to Hadoop's execution model to offer run-time optimizations, e.g., HadoopDB (Abouzeid et al. 2009; Abouzied et al. 2010), Hadoop++ relies on augmenting the data with indexing structures in a way that does not affect the execution mechanism of Hadoop. All index processing, e.g., creating the indexes, augmenting them to the data, and accessing them, is performed through pluggable user-defined functions (UDFs) that are already available within the Hadoop framework.

The basic idea of the Hadoop++ index, referred to as a trojan index, is illustrated in Fig. 1a. At loading time, the base data is partitioned using a map-reduce job. This job partitions the data on the attribute to be indexed, i.e., if attribute X is to be indexed, then depending on X's value in each record, the record is assigned to a specific split Id. This assignment is performed by the mapper function. The reducer function then receives all the records belonging to a specific split and creates the trojan index corresponding to that split.
Fig. 1 Hadoop++ Trojan Index and Trojan Join

The index is then augmented to the data split to form a bigger split, referred to as an indexed split, as depicted in the figure. Each indexed split also has a split header (H) and a split footer (F), which together hold the metadata of the indexed split, e.g., the split size, the number of records, and the smallest and largest indexed values within the split. In general, Hadoop++ can be configured to create several trojan indexes for the same data on different attributes. However, only one index can be the primary index, according to which the data records are sorted within each split. This primary index is referred to as the clustered index, while the additional indexes are non-clustered indexes.
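
The load-time construction can be pictured with the following self-contained Java sketch (an illustrative simplification under assumed toy data and a known domain for X, not the authors' code): the "map" step range-partitions records on the indexed attribute X, and the "reduce" step sorts each split and derives the min/max header metadata that a clustered trojan index would carry.

import java.util.*;

public class TrojanIndexBuild {
    record Rec(int x, String payload) {}

    public static void main(String[] args) {
        List<Rec> data = List.of(new Rec(42, "a"), new Rec(7, "b"),
                                 new Rec(93, "c"), new Rec(15, "d"));
        int numSplits = 2, domain = 100;   // assumed known domain of X

        // "Map" phase: assign each record to a split id based on X's value.
        Map<Integer, List<Rec>> splits = new TreeMap<>();
        for (Rec r : data)
            splits.computeIfAbsent(r.x() / (domain / numSplits),
                                   k -> new ArrayList<>()).add(r);

        // "Reduce" phase: sort each split on X, then emit the min/max header
        // metadata that would be stored with the split alongside its index.
        splits.forEach((id, recs) -> {
            recs.sort(Comparator.comparingInt(Rec::x));
            System.out.printf("split %d: min=%d max=%d records=%s%n",
                    id, recs.get(0).x(), recs.get(recs.size() - 1).x(), recs);
        });
    }
}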

At query time, given a query involving a selection predicate on one of the indexed attributes, the processing works as follows. First, a custom InputFormat function reads each indexed split (instead of the plain data splits) and consults the trojan index of that split w.r.t. the selection predicate. If there are multiple indexes, the appropriate one is selected based on the selection predicate. If none of the records satisfies the query predicate, the entire split is skipped, and the map function terminates without checking any record within this split. Otherwise, the trojan index points to the data records within the split that satisfy the query. If the trojan index is clustered, the data records within the given block are ordered according to the indexed attribute, and thus the retrieval of the records is faster and requires fewer I/Os.
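
The per-split, query-time logic can be sketched as follows (again a simplification, assuming a clustered index over integer keys): the header's min/max lets the map task return immediately for an irrelevant split, and the clustered sort order lets it stop scanning early.

import java.util.*;

public class TrojanIndexScan {
    record IndexedSplit(int min, int max, int[] sortedKeys) {}

    // Returns the matching keys, or an empty list if the whole split is skipped.
    static List<Integer> scan(IndexedSplit s, int lo, int hi) {
        if (hi < s.min() || lo > s.max()) return List.of(); // skip entire split
        List<Integer> out = new ArrayList<>();
        for (int k : s.sortedKeys()) {   // clustered: stop once past the range
            if (k > hi) break;
            if (k >= lo) out.add(k);
        }
        return out;
    }

    public static void main(String[] args) {
        IndexedSplit s = new IndexedSplit(10, 50, new int[]{10, 17, 23, 42, 50});
        System.out.println(scan(s, 20, 45)); // [23, 42]
        System.out.println(scan(s, 60, 90)); // []  -> map task returns immediately
    }
}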

It is worth highlighting that trojan indexes are categorized as local indexes, meaning that a local index is created for each data split, in contrast to building a single global index for the entire dataset. Local indexes have their advantages and disadvantages. One advantage is that the entire dataset does not need to be sorted, which matters because global sorting is prohibitively expensive in big data. One disadvantage, however, is that each indexed split has to be touched at query time: a mapper function has to be scheduled and initiated by Hadoop for each split, even if many of these splits are irrelevant to the query.

The Hadoop++ framework also provides a mechanism, called Trojan Join, to speed up the join operation between two datasets, say S and T (refer to Fig. 1b). The basic idea is to partition both datasets (at the same time) on the join key. This partitioning can be performed at loading time as a preprocessing step. The actual join does not take place during this partitioning phase. Instead, the corresponding data partitions from both datasets are only grouped together into bigger splits, referred to as Co-Partition Splits.

At query time, when S and T need to be joined, the join can take place as a map-only job, where each mapper is assigned one complete co-partition split. As such, each mapper can join the corresponding partitions included in its split. The join operation therefore becomes significantly less expensive, since the shuffling/sorting and reduce phases are eliminated (compared to the traditional map-reduce join operation in Hadoop). As highlighted in Fig. 1b, the individual splits from S or T within a single co-partition split may additionally have a trojan index built on them.
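
A minimal sketch of the map-only join follows (assuming, for brevity, unique join keys per partition and in-memory partitions): each mapper probes the T-side of its co-partition split with the S-side records, so no shuffle or reduce phase is involved.

import java.util.*;

public class CoPartitionJoin {
    // One co-partition split: the S- and T-partitions sharing a join-key range.
    record CoPartitionSplit(Map<Integer, String> sPart, Map<Integer, String> tPart) {}

    // Runs inside a single mapper: a local hash join over co-located partitions.
    static void mapJoin(CoPartitionSplit split) {
        for (Map.Entry<Integer, String> s : split.sPart().entrySet()) {
            String t = split.tPart().get(s.getKey());  // probe T's partition
            if (t != null)
                System.out.println(s.getKey() + ": " + s.getValue() + " | " + t);
        }
    }

    public static void main(String[] args) {
        mapJoin(new CoPartitionSplit(Map.of(1, "s1", 2, "s2"),
                                     Map.of(2, "t2", 3, "t3"))); // joins on key 2
    }
}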

The Hadoop++ framework is suitable for static indexing and joining. That is, at the time of loading the data into Hadoop, the system needs to know whether indexes are to be created (and on which attributes) and whether co-partitioning between specific datasets is to be performed. After loading the data, no additional indexes or co-partitionings can be created unless the entire dataset is reprocessed from scratch. Similarly, if new batches of files arrive and need to be appended to an existing indexed dataset, the entire dataset must be reloaded (and all indexes re-created) to accommodate the new batches.

Record-Level Adaptive Indexing

The work proposed in Dittrich et al. (2012) and Richter et al. (2012) overcomes some of the limitations of previous indexing techniques, e.g., Dittrich et al. (2010) and Jiang et al. (2010). The key limitations are the following. First, indexes have a high creation overhead: building them usually requires a preprocessing step, which can be expensive since it has to go over the entire dataset. Previous evaluations have shown that this upfront cost is usually redeemed after only a few queries that use the index; even so, reducing the creation overhead remains desirable. Second, there is the question of which attributes to index. In general, if the query workload changes, then different indexes may need to be created (or deleted) over time. The work in Dittrich et al. (2012) and Richter et al. (2012) addresses these two limitations.

HAIL (Hadoop Aggressive Indexing Library) (Dittrich et al. 2012) makes use of the fact that Hadoop, by default, creates three replicas of each data block (end users can alter this default to increase or decrease the number of replicas). In plain Hadoop, these replicas are exact mirrors of each other. HAIL, however, proposes to reorganize the data in each replica differently, e.g., each of the three replicas of the same data block can be sorted on a different attribute. As a result, a single file can have multiple clustered indexes at the same time. For example, as illustrated in Fig. 2, the 1st replica can have each of its splits sorted on attribute X, the 2nd replica on attribute Y, and the 3rd replica on attribute Z. These sort orders are local within each split. Given this ordering, a clustered trojan index as proposed in Dittrich et al. (2010) can be built on each replica independently.
Fig. 2 HAIL Indexing Framework

HAIL also proposes a replica-aware scheduling policy. In plain Hadoop, since all replicas are identical, the task-scheduling decision does not differentiate between them. In contrast, in HAIL the task scheduler takes the query predicates into account when selecting the target replica. For example, referring to Fig. 2, given a query involving a selection predicate on attribute Y, the HAIL scheduler will try to assign the map tasks to the splits of the 2nd replica. Otherwise, a full scan has to be performed on one of the other replicas, because their indexes cannot help in evaluating the given predicate.
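
The scheduling decision itself reduces to matching the predicate's attribute against the per-replica sort orders, as in this deliberately simplified sketch (the replica layout and the fallback policy are assumptions):

import java.util.*;

public class ReplicaAwareScheduler {
    // Per-block replicas, keyed by the attribute each replica is sorted on.
    static final List<String> REPLICA_SORT_ATTRS = List.of("X", "Y", "Z");

    static int chooseReplica(String predicateAttr) {
        int i = REPLICA_SORT_ATTRS.indexOf(predicateAttr);
        return i >= 0 ? i : 0; // arbitrary fallback: full scan, no index help
    }

    public static void main(String[] args) {
        System.out.println(chooseReplica("Y")); // 1: use the Y-sorted replica
        System.out.println(chooseReplica("W")); // 0: no matching index, full scan
    }
}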

The LIAH (Lazy Indexing and Adaptivity in Hadoop) framework (Richter et al. 2012) further extends the idea of HAIL by adaptively selecting the columns to be indexed under changing workloads and by lazily building these indexes as more queries execute in the system. LIAH builds a given index incrementally, starting by indexing a few splits and indexing more as more queries are executed, until the entire index is built. The strategy is to piggyback the index-creation task on other users' queries, thereby reducing the overhead of index creation.

Referring to Fig. 2, in LIAH the system may start without any indexes on the base data. Then, by automatically observing the query workload, the system decides that attributes X and Y are good candidates for indexing, e.g., because many queries have selection predicates on either of these two attributes. LIAH then incrementally sorts the splits of the 1st replica on X, and for each split whose data becomes sorted, the corresponding trojan index is built. The LIAH framework keeps track of which blocks have been indexed and which still need to be indexed (done progressively and piggybacked on future users' jobs). As more user jobs are submitted to the system and data blocks are read anyway, additional blocks get indexed. In this way, the overhead of index creation is distributed over many user queries.
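
The piggybacking idea can be summarized in a few lines of Java (a conceptual sketch; the per-job indexing budget and the bookkeeping structure are assumptions, and the actual sorting work is elided):

import java.util.*;

public class LazyAdaptiveIndexer {
    private final Set<Integer> indexedBlocks = new HashSet<>();
    private final int budgetPerJob;            // blocks to index per user job

    LazyAdaptiveIndexer(int budgetPerJob) { this.budgetPerJob = budgetPerJob; }

    // Called while a user job reads its blocks anyway.
    void onUserJob(List<Integer> blocksRead) {
        int done = 0;
        for (int b : blocksRead) {
            if (done == budgetPerJob) break;
            if (indexedBlocks.add(b))          // not yet indexed: sort + index it
                done++;                        // (actual sorting elided)
        }
        System.out.println("indexed so far: " + indexedBlocks);
    }

    public static void main(String[] args) {
        LazyAdaptiveIndexer liah = new LazyAdaptiveIndexer(2);
        liah.onUserJob(List.of(1, 2, 3, 4));   // indexes two blocks
        liah.onUserJob(List.of(1, 2, 3, 4));   // indexes the remaining two
    }
}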

Split-Level Indexing

Most of the indexing techniques previously proposed over the Hadoop infrastructure try to mimic the indexes in traditional databases in that they are record-level indexes. That is, their objective is to skip irrelevant records within each split. Although these techniques show improvements in query execution, they still incur unnecessarily high overhead. For example, imagine the extreme case where a queried value x appears in only very few splits of a given file. In this case, indexing techniques like Hadoop++ and HAIL still incur the overheads of starting a map task for each split, reading the split headers, searching the local index associated with the split, and then reading a few data records or terminating directly. These overheads are substantial in a map-reduce job, and eliminating them can improve performance.

The E3 framework proposed in Eltabakh et al. (2013) is based on this insight. Its objective is not to build a fine-grained record-level index but instead to be more Hadoop-compliant and build a coarse-grained split-level index that eliminates entire splits whenever possible. E3 proposes a suite of indexing mechanisms that work together to eliminate the splits irrelevant to a given query before execution, so map tasks start only for a potentially small subset of the splits (see Fig. 3a). E3 integrates four indexing mechanisms, namely, split-level statistics, inverted indexes, materialized views, and adaptive caching; each is beneficial in specific cases. The split-level statistics are calculated for each numeric and date field in the dataset.
Fig. 3 E3 indexing framework (Eltabakh et al. 2013)

The collected statistics include the min and max values of each such field in each split. If this min-max range is very sparse, a domain segmentation algorithm divides it into possibly many but tighter ranges to avoid false positives (a false positive occurs when the index indicates that a value may exist in a split while it is actually not present). The inverted index is built over the string fields in the dataset, i.e., each string value is added to the index and points to all splits containing this value. By combining these two types of indexes for a given query involving a selection predicate, E3 can identify which splits are relevant to the query, and mappers are triggered only for those splits.
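
The following self-contained sketch illustrates how the two index types might be combined for split elimination (the segment boundaries, field layout, and conjunctive query shape are illustrative assumptions, not E3's actual structures):

import java.util.*;

public class SplitEliminator {
    record Range(long lo, long hi) {
        boolean contains(long v) { return v >= lo && v <= hi; }
    }

    // Per split: domain-segmented min/max ranges for a numeric field.
    static final Map<Integer, List<Range>> NUM_SEGMENTS = Map.of(
            0, List.of(new Range(1, 5), new Range(900, 950)), // sparse -> 2 segments
            1, List.of(new Range(100, 200)));
    // Inverted index: string value -> splits that contain it.
    static final Map<String, Set<Integer>> INVERTED = Map.of("XYZ", Set.of(1));

    // Splits surviving a conjunctive predicate (numField = v AND strField = s).
    static Set<Integer> relevantSplits(long numValue, String strValue) {
        Set<Integer> out = new TreeSet<>();
        for (var e : NUM_SEGMENTS.entrySet())
            if (e.getValue().stream().anyMatch(r -> r.contains(numValue)))
                out.add(e.getKey());
        out.retainAll(INVERTED.getOrDefault(strValue, Set.of())); // intersect
        return out;
    }

    public static void main(String[] args) {
        System.out.println(relevantSplits(150, "XYZ")); // [1]: only split 1 runs
    }
}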

The other two mechanisms, materialized views and adaptive caching, are used in the cases where the indexes are mostly useless. One example provided in Eltabakh et al. (2013) is highlighted in Fig. 3b. It involves so-called "nasty values": values that are infrequent over the entire dataset but scattered over most of the data splits, e.g., each split has one or a few records with this value. In this case, the inverted index points to all splits and becomes almost useless.

The E3 framework handles these nasty values by copying their records into an auxiliary materialized view. For example, the base file A in Fig. 3b will now have an additional materialized-view file stored in HDFS that contains a copy of all records having the nasty value v. Identifying the nasty values and deciding which ones have higher priority to handle has been proven to be an NP-hard problem, and the authors have proposed an approximate greedy algorithm to solve it (Eltabakh et al. 2013).

The adaptive caching mechanism in E3 optimizes conjunctive predicates, e.g., (A = x and B = y), in the cases where each of x and y individually is frequent, but their combination in one record is very infrequent. In other words, neither x nor y is a nasty value, but their combination is a nasty pair. In this case, neither the indexes nor the materialized views are useful. Since it is prohibitively expensive to enumerate all pairs and identify the nasty ones up front, the E3 framework handles nasty pairs by observing the query execution and identifying them on the fly. For example, for the conjunctive predicates (A = x and B = y), E3 consults the indexes to select a subset of splits to read. It then observes the number of mappers that actually produce matching records. If this number is very small compared to the number of mappers triggered, E3 identifies (x, y) as a nasty pair. Consequently, (x, y) is cached along with pointers to its relevant splits.
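
The adaptive cache can be sketched as follows (the 10% productivity threshold is a hypothetical choice for illustration; the entry does not specify one):

import java.util.*;

public class NastyPairCache {
    private final Map<List<String>, Set<Integer>> cache = new HashMap<>();

    // After a run: compare productive mappers against triggered mappers.
    void observe(String a, String b, Set<Integer> triggered, Set<Integer> produced) {
        if (produced.size() < 0.1 * triggered.size())      // assumed threshold
            cache.put(List.of(a, b), produced); // next time, read only these splits
    }

    Optional<Set<Integer>> lookup(String a, String b) {
        return Optional.ofNullable(cache.get(List.of(a, b)));
    }

    public static void main(String[] args) {
        NastyPairCache c = new NastyPairCache();
        c.observe("x", "y",
                new HashSet<>(List.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)),
                Set.of(7));                    // 1 of 12 mappers was productive
        System.out.println(c.lookup("x", "y")); // Optional[[7]]
    }
}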

Hadoop-RDBMS Hybrid Indexing

There has been a long debate on whether Hadoop and database systems can coexist in a single working environment and whether such a strategy is beneficial. Several successful projects have built such an integration (Abouzeid et al. 2009; Gankidi et al. 2014; Floratou et al. 2014a,b; Balmin et al. 2013; Katsipoulakis et al. 2015; Tian et al. 2016).

HadoopDB is one of the early projects that bring the optimizations of relational database systems to Hadoop (Abouzeid et al. 2009). HadoopDB proposes major changes to Hadoop's infrastructure by replacing the HDFS storage layer with a database management layer. That is, the DataNode and TaskTracker on each slave node in the Hadoop cluster run an instance of a database system. This database instance replaces the HDFS layer, and thus the data on each slave node is stored and managed by the database engine.

HadoopDB pushes as much of the work as possible into the database engine, and as a result all the indexing capabilities and query optimizations of database systems automatically become accessible. The drawback of HadoopDB, however, is that managing dynamic scheduling and fault tolerance becomes more complicated. In addition, integrating structured and unstructured data in the same workflow becomes tricky.

Polybase (Gankidi et al. 2014) is another system that enables the integration of Hadoop and database engines. In Polybase, HDFS datasets are defined within the database system as external tables. Users' queries can then span both the data stored in the DBMS and the data stored in HDFS external tables. At execution time, part of a query can be translated to map-reduce jobs while another part executes as SQL.

The data flow between the two systems in Polybase takes place through custom InputFormats and database connectors. However, without efficient access plans for the external tables, these tables can easily become a bottleneck that slows down the entire execution plan. The work in Gankidi et al. (2014) proposes an indexing technique, called Polybase Split-Indexing, that creates B+-tree indexes on the HDFS datasets; these indexes reside within the database system and can be leveraged in several ways. For selection queries, they serve as early split-level filters that identify the relevant splits in HDFS. For join queries, they enable performing a semi-join within the database system before retrieving the HDFS data. Moreover, they can act as caches of hot HDFS data within the database system: if a query touches only the attributes within the index, the entire processing can be performed inside the database.

Conclusion

This entry covers various types of big data indexing techniques that speed up and enhance data retrieval and querying. These techniques range from record-level to split-level, from nonadaptive to adaptive, and from non-hybrid to hybrid. They also cover a wide range of data types, including relational-like data, spatial data, and semi-structured JSON data. The presented techniques are the backbone of numerous applications that run at scale over big data. And with the increasing complexity of these applications, in terms of the size of the collected or generated data, the heterogeneity of the data types, and the growing complexity of the applied analytics and querying, indexing techniques will play an even more important role in keeping processing performance within the realm of feasibility.

References

  1. Abadi DJ (2010) Tradeoffs between parallel database systems, Hadoop, and HadoopDB as platforms for petabyte-scale analysis. In: SSDBM, pp 1–3
  2. Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A (2009) HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. In: VLDB, pp 922–933
  3. Abouzied A, Bajda-Pawlikowski K, Huang J, Abadi DJ, Silberschatz A (2010) HadoopDB in action: building real world applications. In: SIGMOD conference, pp 1111–1114
  4. Balmin A, Beyer KS, Ercegovac V, McPherson J, Özcan F, Pirahesh H, Shekita EJ, Sismanis Y, Tata S, Tian Y (2013) A platform for extreme analytics. IBM J Res Dev 57(3/4):4
  5. Bayer R, McCreight E (1972) Organization and maintenance of large ordered indexes. Acta Informatica 1(3):173–189
  6. Beyer K, Ercegovac V, Gemulla R, Balmin A, Eltabakh MY, Kanne CC, Özcan F, Shekita E (2011) Jaql: a scripting language for large scale semi-structured data analysis. In: PVLDB, vol 4
  7. Chamberlin DD, Astrahan MM, Blasgen MW, Gray JN, King WF, Lindsay BG, Lorie R, Mehl JW et al (1974) A history and evaluation of System R. In: ACM computing practices, pp 632–646
  8. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1)
  9. Dittrich J, Quiané-Ruiz JA, Jindal A, Kargin Y, Setty V, Schad J (2010) Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). In: VLDB, vol 3, pp 518–529
  10. Dittrich J, Quiané-Ruiz J, Richter S, Schuh S, Jindal A, Schad J (2012) Only aggressive elephants are fast elephants. PVLDB 5(11):1591–1602
  11. Eldawy A, Mokbel MF (2015) SpatialHadoop: a MapReduce framework for spatial data. In: 31st IEEE international conference on data engineering (ICDE 2015), Seoul, 13–17 Apr 2015, pp 1352–1363
  12. Eltabakh MY, Özcan F, Sismanis Y, Haas P, Pirahesh H, Vondrak J (2013) Eagle-eyed elephant: split-oriented indexing in Hadoop. In: Proceedings of the 16th international conference on extending database technology (EDBT), pp 89–100
  13. Floratou A, Minhas UF, Özcan F (2014a) SQL-on-Hadoop: full circle back to shared-nothing database architectures. PVLDB 7(12):1295–1306
  14. Floratou A, Özcan F, Schiefer B (2014b) Benchmarking SQL-on-Hadoop systems: TPC or not TPC? In: Big data benchmarking – 5th international workshop (WBDB 2014), Potsdam, 5–6 Aug 2014, revised selected papers, pp 63–72
  15. Gankidi VR, Teletia N, Patel JM, Halverson A, DeWitt DJ (2014) Indexing HDFS data in PDW: splitting the data from the index. PVLDB 7(13):1520–1528
  16. Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data (SIGMOD'84), pp 47–57
  17. Jiang D, Ooi BC, Shi L, Wu S (2010) The performance of MapReduce: an in-depth study. Proc VLDB Endow, pp 472–483
  18. Katsipoulakis NR, Tian Y, Özcan F, Pirahesh H, Reinwald B (2015) A generic solution to integrate SQL and analytics for big data. In: EDBT, pp 671–676
  19. Liu Y, Hu S, Rabl T, Liu W, Jacobsen H, Wu K, Chen J, Li J (2014) DGFIndex for smart grid: enhancing hive with a cost-effective multidimensional range index. PVLDB 7(13):1496–1507. http://www.vldb.org/pvldb/vol7/p1496-liu.pdf
  20. Lu P, Chen G, Ooi BC, Vo HT, Wu S (2014) ScalaGist: scalable generalized search trees for MapReduce systems [innovative systems paper]. PVLDB 7(14):1797–1808
  21. Maier D (1983) Theory of relational databases. Computer Science Press, Rockville
  22. Moro MM, Zhang D, Tsotras VJ (2009) Hash-based indexing. In: Liu L, Özsu MT (eds) Encyclopedia of database systems. Springer, Boston, pp 1289–1290
  23. Richter S, Quiané-Ruiz J, Schuh S, Dittrich J (2012) Towards zero-overhead adaptive indexing in Hadoop. CoRR abs/1212.3480
  24. Stonebraker M, Rowe LA, Hirohama M (1990) The implementation of POSTGRES. TKDE 2(1):125–142
  25. Stonebraker M et al (2010) MapReduce and parallel DBMSs: friends or foes? Commun ACM 53(1):64–71. http://doi.acm.org/10.1145/1629175.1629197
  26. Tian Y, Özcan F, Zou T, Goncalves R, Pirahesh H (2016) Building a hybrid warehouse: efficient joins between data stored in HDFS and enterprise warehouse. ACM Trans Database Syst 41(4):21:1–21:38

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Worcester Polytechnic Institute, Worcester, USA

Section editors and affiliations

  • Yuanyuan Tian, IBM Almaden Research Center, San Jose, USA
  • Fatma Özcan, IBM Research – Almaden, San Jose, USA