
Efficient Level-Based Top-Down Data Cube Computation Using MapReduce

Part of the Lecture Notes in Computer Science book series (TLDKS, volume 9260)

Abstract

The data cube is an essential part of OLAP (On-Line Analytical Processing), as it supports efficient multidimensional analysis of large data sets. Computing a data cube is expensive because a cube with d dimensions consists of 2^d cuboids, i.e., a number exponential in d. To build ROLAP (Relational OLAP) data cubes efficiently, many algorithms (e.g., GBLP, PipeSort, PipeHash, and BUC) have been developed; they share sort costs and input data scans and/or reduce computation time. Several parallel processing algorithms have also been proposed. Meanwhile, MapReduce has recently emerged as a framework for processing huge volumes of data, such as web-scale data, in a distributed/parallel manner using a large number of computers (e.g., several hundred or thousands). In the MapReduce framework, the degree of parallelism matters more for reducing total execution time than the elaborate strategies, such as sort sharing and computation reduction, that existing ROLAP algorithms use. In this paper, we propose two distributed parallel processing algorithms. The first, called MRLevel, takes advantage of the MapReduce framework. The second, called MRPipeLevel, is based on the existing PipeSort algorithm, one of the most efficient algorithms for top-down cube computation. (The top-down approach handles big data more effectively than alternatives such as bottom-up computation or specialized data structures, which depend on main-memory size.) The proposed MRLevel algorithm parallelizes cube computation while reducing the number of data scans by processing the cube lattice one level at a time. The MRPipeLevel algorithm builds on the advantages of MRLevel and further reduces the number of data scans through pipelining. We implemented both algorithms and evaluated their performance under the MapReduce framework. Through the experiments, we also identify factors for performance enhancement in MapReduce when processing very large data.
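The level-at-a-time idea behind MRLevel can be illustrated with a small, self-contained sketch in the MapReduce style. The sketch below is illustrative only, not the paper's implementation: the names (`DIMS`, `map_phase`, `reduce_phase`) and the toy data set are assumptions, and the map/reduce phases are simulated in-process rather than run on Hadoop. The key point it demonstrates is that a single scan of the input can feed every cuboid of one lattice level, since the mapper emits one record per cuboid at that level.

```python
# Illustrative sketch (not the paper's code): computing all cuboids of one
# lattice level with a single scan of the input, in the MapReduce style.
from itertools import combinations
from collections import defaultdict

DIMS = ("product", "store", "month")  # d = 3, so the full cube has 2^3 = 8 cuboids

def map_phase(rows, level):
    """For each input row, emit one ((cuboid, group-by key), measure) record
    per cuboid at `level`; one scan thus covers the whole level."""
    for row in rows:
        *attrs, measure = row
        for cuboid in combinations(range(len(DIMS)), level):
            # Dimensions outside the cuboid are rolled up to the ALL value "*".
            key = tuple(attrs[i] if i in cuboid else "*" for i in range(len(DIMS)))
            yield (cuboid, key), measure

def reduce_phase(pairs):
    """Sum measures per (cuboid, key), as a combiner/reducer would."""
    agg = defaultdict(int)
    for k, v in pairs:
        agg[k] += v
    return dict(agg)

rows = [("p1", "s1", "jan", 10), ("p1", "s2", "jan", 5), ("p2", "s1", "feb", 7)]
level1 = reduce_phase(map_phase(rows, 1))
# The (product)-cuboid entry for p1 aggregates 10 + 5 = 15.
```

A real MapReduce job would shuffle the emitted keys to reducers, so all cuboids of a level are aggregated in parallel from that one scan; the MRPipeLevel refinement described in the abstract additionally pipelines sorted output between levels, in the spirit of PipeSort.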

Keywords

  • Data cube
  • ROLAP
  • MapReduce
  • Hadoop
  • Distributed parallel computing



Acknowledgement

This research work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (2011-0011824).

Author information

Corresponding author

Correspondence to Jinho Kim.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Lee, S., Kim, J., Moon, Y.-S., Lee, W. (2015). Efficient Level-Based Top-Down Data Cube Computation Using MapReduce. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXI. Lecture Notes in Computer Science, vol. 9260. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47804-2_1


  • DOI: https://doi.org/10.1007/978-3-662-47804-2_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-47803-5

  • Online ISBN: 978-3-662-47804-2

  • eBook Packages: Computer Science (R0)