The VLDB Journal, Volume 23, Issue 3, pp 469–494

Towards zero-overhead static and adaptive indexing in Hadoop

  • Stefan Richter
  • Jorge-Arnulfo Quiané-Ruiz
  • Stefan Schuh
  • Jens Dittrich
Regular Paper

Abstract

Hadoop MapReduce has evolved into an important industry standard for massively parallel data processing and has become widely adopted for a variety of use cases. Recent work has shown that indexes can dramatically improve the performance of selective MapReduce jobs. However, one major weakness of existing approaches is their high index creation cost. We present HAIL (Hadoop Aggressive Indexing Library), a novel indexing approach for HDFS and Hadoop MapReduce. HAIL creates different clustered indexes over terabytes of data with minimal, often invisible costs, and it dramatically improves the runtimes of several classes of MapReduce jobs. HAIL features two different indexing pipelines: static indexing and adaptive indexing. HAIL static indexing efficiently indexes datasets while uploading them to HDFS. To do so, HAIL leverages the default replication of Hadoop and enhances it with logical replication. This allows HAIL to create multiple clustered indexes for a dataset, e.g., one for each physical replica. Still, in terms of upload time, HAIL matches or even improves on the performance of standard HDFS. Additionally, HAIL adaptive indexing allows for automatic, incremental indexing at job runtime with minimal runtime overhead. For example, HAIL adaptive indexing can completely index a dataset as a byproduct of only four MapReduce jobs while incurring an overhead as low as 11 % for the very first of those jobs only. In our experiments, we show that HAIL improves job runtimes by up to 68× over Hadoop. This article is an extended version of the VLDB 2012 paper (Dittrich et al. in PVLDB 5(11):1591–1602, 2012).
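
The idea of logical replication described in the abstract can be illustrated with a small sketch. It is not HAIL's API: in-memory lists stand in for the physical replicas of an HDFS block, and the function names (`build_replicas`, `select_range`) and attribute names (`id`, `visit_date`, `revenue`) are hypothetical. Each replica holds the same rows but is sorted on a different attribute, so a selection on that attribute can be answered with a binary search instead of a full scan.

```python
from bisect import bisect_left, bisect_right


def build_replicas(rows, index_attrs):
    """Return one 'replica' per attribute: the same rows, sorted on that attribute."""
    replicas = {}
    for attr in index_attrs:
        ordered = sorted(rows, key=lambda r: r[attr])
        keys = [r[attr] for r in ordered]  # sort keys, used for binary search
        replicas[attr] = (keys, ordered)
    return replicas


def select_range(replicas, attr, low, high):
    """Return rows with low <= row[attr] <= high, using a matching replica if any."""
    if attr in replicas:
        keys, ordered = replicas[attr]
        return ordered[bisect_left(keys, low):bisect_right(keys, high)]
    # No replica is sorted on attr: fall back to scanning any one replica.
    _, any_replica = next(iter(replicas.values()))
    return [r for r in any_replica if low <= r[attr] <= high]


# Hypothetical sample data; three replicas, as with HDFS default replication.
rows = [
    {"id": 3, "visit_date": 20120105, "revenue": 40},
    {"id": 1, "visit_date": 20120101, "revenue": 75},
    {"id": 2, "visit_date": 20120103, "revenue": 10},
]
replicas = build_replicas(rows, ["id", "visit_date", "revenue"])
hits = select_range(replicas, "revenue", 30, 80)
```

In this sketch a query on any of the three attributes finds a replica sorted on it, whereas a query on an unindexed attribute degrades to a scan, which is the case adaptive indexing is meant to address over time.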

Keywords

Hadoop · MapReduce · Indexing · Adaptive indexing · Big data · HDFS · Physical design

Notes

Acknowledgments

Research supported by the Cluster of Excellence on “Multimodal Computing and Interaction” and the Bundesministerium für Bildung und Forschung.

References

  1. Abouzied, A., Abadi, D.J., Silberschatz, A.: Invisible loading: access-driven data transfer from raw files into database systems. In: EDBT, pp. 1–10 (2013)
  2. Agrawal, S., et al.: Database tuning advisor for Microsoft SQL Server 2005. In: VLDB, pp. 1110–1121 (2004)
  3. Ailamaki, A., et al.: Weaving relations for cache performance. In: VLDB, pp. 169–180 (2001)
  4. Alagiannis, I., Borovica, R., Branco, M., Idreos, S., Ailamaki, A.: NoDB: efficient query execution on raw data files. In: SIGMOD, pp. 241–252 (2012)
  5. Blanas, S., et al.: A comparison of join algorithms for log processing in MapReduce. In: SIGMOD, pp. 975–986 (2010)
  6. Bruno, N., Chaudhuri, S.: To tune or not to tune? A lightweight physical design alerter. In: VLDB, pp. 499–510 (2006)
  7. Bruno, N., Chaudhuri, S.: An online approach to physical design tuning. In: ICDE, pp. 826–835 (2007)
  8. Bruno, N., Chaudhuri, S.: Physical design refinement: the merge-reduce approach. ACM TODS 32(4), 28:1–28:41 (2007)
  9. Cafarella, M.J., Ré, C.: Manimal: relational optimization for data-intensive programs. In: WebDB (2010)
  10. Chaudhuri, S., Narasayya, V.R.: An efficient cost-driven index selection tool for Microsoft SQL Server. In: VLDB, pp. 146–155 (1997)
  11. Chaudhuri, S., Narasayya, V.R.: Self-tuning database systems: a decade of progress. In: VLDB, pp. 3–14 (2007)
  12. Chen, S.: Cheetah: a high performance, custom data warehouse on top of MapReduce. PVLDB 3(1–2), 1459–1468 (2010)
  13. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. CACM 53(1), 72–77 (2010)
  14. Dittrich, J., Quiané-Ruiz, J.-A.: Efficient parallel data processing in MapReduce workflows. PVLDB 5, 2014–2015 (2012)
  15. Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). PVLDB 3(1), 518–529 (2010)
  16. Dittrich, J., Quiané-Ruiz, J.-A., Richter, S., Schuh, S., Jindal, A., Schad, J.: Only aggressive elephants are fast elephants. PVLDB 5(11), 1591–1602 (2012)
  17. Dittrich, J.-P., Fischer, P.M., Kossmann, D.: AGILE: adaptive indexing for context-aware information filters. In: SIGMOD, pp. 215–226 (2005)
  18. Eltabakh, M.Y., et al.: CoHadoop: flexible data placement and its exploitation in Hadoop. PVLDB 4(9), 575–585 (2011)
  19. Finkelstein, S.J., et al.: Physical database design for relational databases. ACM TODS 13(1), 91–128 (1988)
  20. Graefe, G., Halim, F., Idreos, S., Kuno, H.A., Manegold, S.: Concurrency control for adaptive indexing. PVLDB 5(7), 656–667 (2012)
  21. Graefe, G., Kuno, H.A.: Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp. 371–381 (2010)
  22.
  23.
  24. Halim, F., et al.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
  25. Halim, F., Idreos, S., Karras, P., Yap, R.H.C.: Stochastic database cracking: towards robust adaptive indexing in main-memory column-stores. PVLDB 5(6), 502–513 (2012)
  26. Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB 4(11), 1111–1122 (2011)
  27. Idreos, S., Alagiannis, I., Johnson, R., Ailamaki, A.: Here are my data files. Here are my queries. Where are my results? In: CIDR, pp. 57–68 (2011)
  28. Idreos, S., et al.: Database cracking. In: CIDR, pp. 68–78 (2007)
  29. Idreos, S., et al.: Self-organizing tuple reconstruction in column-stores. In: SIGMOD, pp. 297–308 (2009)
  30. Idreos, S., et al.: Merging what’s cracked, cracking what’s merged: adaptive indexing in main-memory column-stores. PVLDB 4(9), 586–597 (2011)
  31. Idreos, S., Kersten, M.L., Manegold, S.: Updating a cracked database. In: SIGMOD, pp. 413–424 (2007)
  32. Jahani, E., et al.: Automatic optimization for MapReduce programs. PVLDB 4(6), 385–396 (2011)
  33. Jiang, D., et al.: The performance of MapReduce: an in-depth study. PVLDB 3(1), 472–483 (2010)
  34. Jindal, A., Quiané-Ruiz, J.-A., Dittrich, J.: Trojan data layouts: right shoes for a running elephant. In: SOCC (2011)
  35. Lin, J., et al.: Full-text indexing for optimizing selection operations in large-scale data analytics. In: MapReduce Workshop (2011)
  36. Logothetis, D., et al.: In-situ MapReduce for log processing. In: USENIX (2011)
  37. Lühring, M., et al.: Autonomous management of soft indexes. In: ICDE Workshop on Self-Managing Database Systems, pp. 450–458 (2007)
  38. Olston, C.: Keynote: programming and debugging large-scale data processing workflows. In: SOCC (2011)
  39. Pavlo, A., et al.: A comparison of approaches to large-scale data analysis. In: SIGMOD, pp. 165–178 (2009)
  40. Quiané-Ruiz, J.-A., Pinkel, C., Schad, J., Dittrich, J.: RAFTing MapReduce: fast recovery on the RAFT. In: ICDE, pp. 589–600 (2011)
  41. Schad, J., Dittrich, J., Quiané-Ruiz, J.-A.: Runtime measurements in the cloud: observing, analyzing, and reducing variance. PVLDB 3(1), 460–471 (2010)
  42. Schnaitter, K., et al.: COLT: continuous on-line tuning. In: SIGMOD, pp. 793–795 (2006)
  43. Thusoo, A., et al.: Data warehousing and analytics infrastructure at Facebook. In: SIGMOD, pp. 1013–1020 (2010)
  44. White, T.: Hadoop: The Definitive Guide. O’Reilly, Sebastopol (2011)
  45. Yang, H.-C., Parker, D.S.: Traverse: simplified indexing on large Map-Reduce-merge clusters. In: DASFAA, pp. 308–322 (2009)
  46. Zaharia, M., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: EuroSys, pp. 265–278 (2010)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Stefan Richter (1, corresponding author)
  • Jorge-Arnulfo Quiané-Ruiz (2)
  • Stefan Schuh (1)
  • Jens Dittrich (1)

  1. Information Systems Group, Saarland University, Saarbrücken, Germany
  2. Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar