Abstract
Hadoop MapReduce has evolved to an important industry standard for massive parallel data processing and has become widely adopted for a variety of use-cases. Recent works have shown that indexes can improve the performance of selective MapReduce jobs dramatically. However, one major weakness of existing approaches is high index creation costs. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs, and it dramatically improves runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines, static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. Thereby, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g., one for each physical replica. Still, in terms of upload time, HAIL matches or even improves over the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as byproduct of only four MapReduce jobs while incurring an overhead as low as 11 % for the very first of those job only. In our experiments, we show that HAIL improves job runtimes by up to 68\(\times \) over Hadoop. This article is an extended version of the VLDB 2012 paper (Dittrich et al. in PVLDB 5(11):1591–1602, 2012).
Similar content being viewed by others
Notes
A simple example of such a use-case would be a distributed grep.
The professor is aware that for some situations, the opposite is true.
Actually, it is a split. The difference does not matter here. We will get back to this in Sect. 4.2.
Alternatively, HAIL can also suggest an appropriate schema to users through schema analysis.
Alternatively, HAIL allows Bob to specify the selection predicate and the projected attributes in the job configuration class.
A Hadoop instance responsible to execute map and reduce tasks.
That was obtained from the HAILInputFormat via getSplits().
Notice that, all map tasks (even from different MapReduce jobs) running on the same node interact with the same AdaptiveIndexer instance.
Hence, the AdaptiveIndexer can end up by indexing data blocks from different MapReduce jobs at the same time.
It is worth noting that \(T_{idxOverhead}\) denotes only the additional runtime that a MapReduce job has due to adaptive indexing.
For this cluster type, we allocate an additional large node to run the namenode and jobtracker.
This is the time a map task takes to read and process its input.
Recall that, this query projects all attributes, which is indeed more beneficial for Hadoop++ as it uses a row layout.
Although HAIL is still indexing further blocks.
References
Abouzied, A., Abadi, D. J., Silberschatz, A.: Invisible loading: access-driven data transfer from raw Files into database systems. In: EDBT, pp. 1–10 (2013)
Agrawal, S., et al.: Database tuning advisor for Microsoft SQL server 2005. In: VLDB, pp. 1110–1121 (2004)
Ailamaki, A., et al.: Weaving relations for Cache performance. In: VLDB, pp. 169–180 (2001)
Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: Efficient query execution on raw data files. In: SIGMOD Conference, pp. 241–252 (2012)
Blanas, S., et al.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD, pp. 975–986 (2010)
Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: VLDB, pp. 499–510 (2006)
Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: ICDE, pp. 826–835 (2007)
Bruno, N., Chaudhuri, S.: Physical design refinement: the merge-reduce approach. ACM TODS 32(4), 28:1–28:41 (2007)
Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB (2010)
Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL server. In: VLDB, pp. 146–155 (1997)
Chaudhuri, S., Narasayya, V.R.: Self-tuning database systems: a decade of progress. In: VLDB, pp. 3–14 (2007)
Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. PVLDB 3(1—-2), 1459–1468 (2010)
Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. CACM 53(1), 72–77 (2010)
Dittrich, J., Quiané-Ruiz, J.-A.: Efficient parallel data processing in MapReduce workflows. PVLDB 5, 2014–2015 (2012)
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. PVLDB 5(11), 1591–1602 (2012)
Dittrich, J.-P., Fischer, P.M., Kossmann, D.: AGILE: Adaptive indexing for context-aware information filters. In: SIGMOD, pp. 215–226 (2005)
Eltabakh, M.Y., et al.: CoHadoop: flexible data placement and its exploitation in hadoop. PVLDB 4(9), 575–585 (2011)
Finkelstein, S.J., et al.: Physical database design for relational databases. ACM TODS 13(1), 91–128 (1988)
Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012)
Graefe G., Kuno, H.A.: Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp. 371–381 (2010)
http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
Hadoop Users. http://wiki.apache.org/hadoop/PoweredBy
Halim, F., et al.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11), 1111–1122 (2011)
Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results?. In: CIDR, pp. 57–68 (2011)
Idreos, S., et al.: Database cracking. In: CIDR, pp. 68–78 (2007)
Idreos, S., et al.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD, pp. 297–308 (2009)
Idreos, S., et al.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 586–597 (2011)
Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD Conference, pp. 413–424 (2007)
Jahani, E., et al.: Automatic optimization for MapReduce programs. PVLDB 4(6), 385–396 (2011)
Jiang, D., et al.: The performance of MapReduce: an in-depth study. PVLDB 3(1), 472–483 (2010)
Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: SOCC (2011)
Lin, J., et al.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce Workshop (2011)
Logothetis, D., et al.: In-situ MapReduce for log processing. In: USENIX (2011)
Lühring, M., et al.: Autonomous management of soft indexes. In: ICDE Workshop on Self-Managing Database Systems, pp. 450–458 (2007)
Olston, C.: Keynote: programming and debugging large-scale data processing workflows. In: SOCC (2011)
Pavlo, A., et al.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178 (2009)
Quiané-Ruiz, J.-A., Pinkel, C., Schad, J., Dittrich, J.: RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp. 589–600 (2011)
Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)
Schnaitter, K., et al.: COLT: continuous on-line tuning. In: SIGMOD, pp. 793–795 (2006)
Thusoo, A., et al.: Data warehousing and analytics infrastructure at facebook. In: SIGMOD, pp. 1013–1020 (2010)
White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2011)
Yang, H.-C., Parker, D.S.: Traverse: simplified indexing on large Map-Reduce-merge clusters. In: DASFAA, pp. 308–322 (2009)
Zaharia, M., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)
Acknowledgments
Research supported by the Cluster of Excellence on “Multimodal Computing and Interaction” and the Bundesministerium für Bildung und Forschung.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Richter, S., Quiané-Ruiz, JA., Schuh, S. et al. Towards zero-overhead static and adaptive indexing in Hadoop. The VLDB Journal 23, 469–494 (2014). https://doi.org/10.1007/s00778-013-0332-z
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-013-0332-z