Towards zero-overhead static and adaptive indexing in Hadoop

Richter, Stefan; Quiané-Ruiz, Jorge-Arnulfo; Schuh, Stefan; Dittrich, Jens

doi:10.1007/s00778-013-0332-z


Regular Paper
Published: 26 September 2013

Volume 23, pages 469–494 (2014)

The VLDB Journal



Abstract

Hadoop MapReduce has evolved into an important industry standard for massively parallel data processing and has been widely adopted for a variety of use cases. Recent work has shown that indexes can dramatically improve the performance of selective MapReduce jobs. However, one major weakness of existing approaches is their high index creation cost. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs, and it dramatically improves the runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines: static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. In doing so, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g., one for each physical replica. Still, in terms of upload time, HAIL matches or even improves on the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as a byproduct of only four MapReduce jobs while incurring an overhead as low as 11% for only the very first of those jobs. In our experiments, we show that HAIL improves job runtimes by up to 68× over Hadoop. This article is an extended version of the VLDB 2012 paper (Dittrich et al. in PVLDB 5(11):1591–1602, 2012).
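
As a rough illustration of the logical replication idea described in the abstract, the following toy sketch keeps several copies of one block, each sorted on a different attribute, and answers a range query from whichever copy is sorted on the queried attribute. It is only a sketch under stated assumptions, not HAIL's implementation: the record layout, the function names (`build_replica`, `upload`, `range_query`) and the use of an in-memory sorted key list in place of an on-disk clustered index are all hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy block of records: (visit_date, source_ip, ad_revenue).
# The schema and values are made up for illustration.
BLOCK = [
    ("2011-03-02", "10.0.0.7", 12.5),
    ("2011-01-15", "10.0.0.3", 3.0),
    ("2011-02-20", "10.0.0.9", 7.25),
    ("2011-01-01", "10.0.0.1", 1.0),
]


def build_replica(records, attr):
    """Build one logical replica: the same records, sorted on column `attr`.

    The sorted key list stands in for a clustered index (an assumption made
    to keep the sketch small).
    """
    rows = sorted(records, key=lambda r: r[attr])
    keys = [r[attr] for r in rows]
    return {"attr": attr, "rows": rows, "keys": keys}


def upload(records, index_attrs):
    """Create one replica per requested index attribute at 'upload' time."""
    return [build_replica(records, a) for a in index_attrs]


def range_query(replicas, attr, low, high):
    """Return (matching rows, access path) for low <= record[attr] <= high."""
    for rep in replicas:
        if rep["attr"] == attr:
            # A replica sorted on `attr` exists: binary-search its keys.
            start = bisect_left(rep["keys"], low)
            end = bisect_right(rep["keys"], high)
            return rep["rows"][start:end], "index scan"
    # No suitable replica: fall back to scanning any replica completely.
    rows = [r for r in replicas[0]["rows"] if low <= r[attr] <= high]
    return rows, "full scan"


replicas = upload(BLOCK, index_attrs=[0, 1, 2])
january, plan = range_query(replicas, 0, "2011-01-01", "2011-01-31")
```

Every replica holds the full record set, so any of them can serve a job or stand in for a failed copy; only the sort order differs. A query on an attribute without a matching replica still works, but it degrades to a full scan of one copy.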


Notes

A simple example of such a use-case would be a distributed grep.
The professor is aware that for some situations, the opposite is true.
Actually, it is a split. The difference does not matter here. We will get back to this in Sect. 4.2.
Alternatively, HAIL can also suggest an appropriate schema to users through schema analysis.
Alternatively, HAIL allows Bob to specify the selection predicate and the projected attributes in the job configuration class.
A Hadoop instance responsible for executing map and reduce tasks.
That was obtained from the HAILInputFormat via getSplits().
Notice that all map tasks (even from different MapReduce jobs) running on the same node interact with the same AdaptiveIndexer instance.
Hence, the AdaptiveIndexer can end up indexing data blocks from different MapReduce jobs at the same time.
It is worth noting that \(T_{idxOverhead}\) denotes only the additional runtime that a MapReduce job has due to adaptive indexing.
For this cluster type, we allocate an additional large node to run the namenode and jobtracker.
This is the time a map task takes to read and process its input.
Recall that this query projects all attributes, which is indeed more beneficial for Hadoop++ as it uses a row layout.
Although HAIL is still indexing further blocks.

References

Abouzied, A., Abadi, D.J., Silberschatz, A.: Invisible loading: access-driven data transfer from raw files into database systems. In: EDBT, pp. 1–10 (2013)
Agrawal, S., et al.: Database tuning advisor for Microsoft SQL server 2005. In: VLDB, pp. 1110–1121 (2004)
Ailamaki, A., et al.: Weaving relations for cache performance. In: VLDB, pp. 169–180 (2001)
Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: Efficient query execution on raw data files. In: SIGMOD Conference, pp. 241–252 (2012)
Blanas, S., et al.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD, pp. 975–986 (2010)
Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: VLDB, pp. 499–510 (2006)
Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: ICDE, pp. 826–835 (2007)
Bruno, N., Chaudhuri, S.: Physical design refinement: the merge-reduce approach. ACM TODS 32(4), 28:1–28:41 (2007)
Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB (2010)
Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL server. In: VLDB, pp. 146–155 (1997)
Chaudhuri, S., Narasayya, V.R.: Self-tuning database systems: a decade of progress. In: VLDB, pp. 3–14 (2007)
Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. PVLDB 3(1–2), 1459–1468 (2010)
Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. CACM 53(1), 72–77 (2010)
Dittrich, J., Quiané-Ruiz, J.-A.: Efficient parallel data processing in MapReduce workflows. PVLDB 5, 2014–2015 (2012)
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. PVLDB 5(11), 1591–1602 (2012)
Dittrich, J.-P., Fischer, P.M., Kossmann, D.: AGILE: Adaptive indexing for context-aware information filters. In: SIGMOD, pp. 215–226 (2005)
Eltabakh, M.Y., et al.: CoHadoop: flexible data placement and its exploitation in Hadoop. PVLDB 4(9), 575–585 (2011)
Finkelstein, S.J., et al.: Physical database design for relational databases. ACM TODS 13(1), 91–128 (1988)
Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012)
Graefe, G., Kuno, H.A.: Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp. 371–381 (2010)
http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
Hadoop Users. http://wiki.apache.org/hadoop/PoweredBy
Halim, F., et al.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11), 1111–1122 (2011)
Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results? In: CIDR, pp. 57–68 (2011)
Idreos, S., et al.: Database cracking. In: CIDR, pp. 68–78 (2007)
Idreos, S., et al.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD, pp. 297–308 (2009)
Idreos, S., et al.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 586–597 (2011)
Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD Conference, pp. 413–424 (2007)
Jahani, E., et al.: Automatic optimization for MapReduce programs. PVLDB 4(6), 385–396 (2011)
Jiang, D., et al.: The performance of MapReduce: an in-depth study. PVLDB 3(1), 472–483 (2010)
Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: SOCC (2011)
Lin, J., et al.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce Workshop (2011)
Logothetis, D., et al.: In-situ MapReduce for log processing. In: USENIX (2011)
Lühring, M., et al.: Autonomous management of soft indexes. In: ICDE Workshop on Self-Managing Database Systems, pp. 450–458 (2007)
Olston, C.: Keynote: programming and debugging large-scale data processing workflows. In: SOCC (2011)
Pavlo, A., et al.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178 (2009)
Quiané-Ruiz, J.-A., Pinkel, C., Schad, J., Dittrich, J.: RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp. 589–600 (2011)
Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)
Schnaitter, K., et al.: COLT: continuous on-line tuning. In: SIGMOD, pp. 793–795 (2006)
Thusoo, A., et al.: Data warehousing and analytics infrastructure at Facebook. In: SIGMOD, pp. 1013–1020 (2010)
White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2011)
Yang, H.-C., Parker, D.S.: Traverse: simplified indexing on large Map-Reduce-merge clusters. In: DASFAA, pp. 308–322 (2009)
Zaharia, M., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)


Acknowledgments

Research supported by the Cluster of Excellence on “Multimodal Computing and Interaction” and the Bundesministerium für Bildung und Forschung.

Author information

Authors and Affiliations

Information Systems Group, Saarland University, Saarbrücken, Germany
Stefan Richter, Stefan Schuh & Jens Dittrich
Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
Jorge-Arnulfo Quiané-Ruiz


Corresponding author

Correspondence to Stefan Richter.


About this article

Cite this article

Richter, S., Quiané-Ruiz, JA., Schuh, S. et al. Towards zero-overhead static and adaptive indexing in Hadoop. The VLDB Journal 23, 469–494 (2014). https://doi.org/10.1007/s00778-013-0332-z


Received: 11 February 2013
Revised: 19 June 2013
Accepted: 30 July 2013
Published: 26 September 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s00778-013-0332-z
