Scalable distributed data cube computation for large-scale multidimensional data analysis on a Spark cluster

Abstract

A data cube is a powerful analytical tool that stores all aggregate values over a set of dimensions. It gives users a simple and efficient means of performing complex data analysis and supports decision making. Because building a data cube takes a very long time, however, efficient methods for reducing the computation time are needed. Previous works have developed various algorithms for efficiently generating data cubes using MapReduce, a large-scale distributed parallel processing framework. However, MapReduce incurs the overhead of disk I/O and network traffic. To overcome these limitations, Spark was recently proposed as a memory-based parallel/distributed processing framework, and it has attracted considerable research attention owing to its high performance. In this paper, we propose two algorithms for efficiently building data cubes that fully leverage Spark’s mechanisms and properties: Resilient Distributed Top-Down Computation (RDTDC) and Resilient Distributed Bottom-Up Computation (RDBUC). The former computes the components (i.e., cuboids) of a data cube with a top-down approach; the latter uses a bottom-up approach. The RDTDC algorithm has three key functions. (1) It approximates the size of each cuboid using the cardinality, so that no additional Spark action is needed to determine cuboid sizes during top-down computation. Thus, a cuboid can be computed from the upper cuboid with the smaller size. (2) It creates an execution plan that is optimized to take the smaller cuboid as input. (3) Lastly, it reuses the results of cuboids already computed during top-down computation and computes the cuboids of several dimensions simultaneously. In addition, we propose RDBUC, a Spark algorithm based on the bottom-up approach that is widely used to compute Iceberg cubes, which maintain only the cells satisfying a minimum support condition.
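The top-down idea behind RDTDC can be illustrated with a small single-machine sketch. This is not the paper's Spark implementation: the function names (`aggregate`, `estimate_size`, `top_down_cube`), the use of SUM as the measure, and the size estimate (product of dimension cardinalities, capped by the row count) are assumptions made for illustration. The sketch shows functions (1) and (2) plus the reuse part of (3): each cuboid is aggregated from the already computed parent cuboid with the smallest estimated size, and no parent is ever counted to find its true size. Simultaneous computation of several cuboids is not modeled.

```python
from itertools import combinations
from math import prod


def aggregate(pairs):
    """Sum the measure per group key (stands in for a reduce-by-key step)."""
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, 0) + value
    return out


def estimate_size(cuboid, cardinality, n_rows):
    """Assumed estimate: product of dimension cardinalities, capped by row count."""
    return min(n_rows, prod(cardinality[d] for d in cuboid))


def top_down_cube(rows, dims, measure, cardinality):
    """Compute all 2^n cuboids top-down.

    rows: list of dicts; cardinality: known distinct-value count per dimension.
    Returns (cube, plan) where cube maps cuboid -> {cell: sum} and
    plan maps each cuboid to the parent cuboid it was computed from.
    """
    n_rows = len(rows)
    base = tuple(dims)
    # The base cuboid is the only one aggregated from the raw input.
    cube = {base: aggregate((tuple(r[d] for d in base), r[measure]) for r in rows)}
    plan = {}
    for k in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, k):
            # Candidate parents: computed cuboids with exactly one more dimension.
            parents = [p for p in cube if len(p) == k + 1 and set(child) <= set(p)]
            # Choose the parent by estimated size only; no counting pass is run.
            parent = min(parents, key=lambda p: estimate_size(p, cardinality, n_rows))
            plan[child] = parent
            idx = [parent.index(d) for d in child]
            # Reuse the parent's cells instead of rescanning the raw rows.
            cube[child] = aggregate(
                (tuple(key[i] for i in idx), value)
                for key, value in cube[parent].items()
            )
    return cube, plan
```

Calling `top_down_cube(rows, ["A", "B", "C"], "m", {"A": 2, "B": 5, "C": 3})` on rows that carry a numeric field `m` returns all eight cuboids together with the chosen parent of each one. With those cardinalities and enough rows that the cap does not apply, cuboid (A) would be derived from (A, C) rather than (A, B), because the estimate for (A, C) is smaller.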
The RDBUC algorithm incorporates two primary strategies. (1) It reduces the input size needed to compute aggregate values for a dimension combination (e.g., A, B, and C) by removing the input that does not satisfy the Iceberg cube condition at its lower dimension combination (e.g., A and B), which was computed earlier. (2) It uses a lazy materialization strategy that computes every combination of dimensions using only transformation operations, without any action operation, and then stores them all in a single action operation. To demonstrate the efficiency of the proposed algorithms, which use the lazy materialization strategy with only one action operation, we conducted extensive experiments and compared them with the cube() function, the built-in cube computation library of Spark SQL. The results showed that the proposed RDTDC and RDBUC algorithms outperformed Spark SQL cube().
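The two RDBUC strategies can be sketched on a single machine as follows. This is an illustrative analogy, not the paper's Spark code: the function name `buc_iceberg`, the use of COUNT as the measure, and the use of a Python generator to stand in for Spark's lazy transformations are all assumptions. A group that fails the minimum support at a lower dimension combination is dropped before any higher combination is expanded from it, and nothing is computed until the generator is consumed once at the end.

```python
from collections import defaultdict


def buc_iceberg(rows, dims, min_support):
    """Lazily yield (cuboid, cell, count) for every cell with count >= min_support.

    rows: list of dicts keyed by dimension name. COUNT is the assumed measure.
    """
    def expand(partition, start, cuboid, cell):
        for i in range(start, len(dims)):
            # Partition the current input on the next dimension.
            groups = defaultdict(list)
            for row in partition:
                groups[row[dims[i]]].append(row)
            for value, group in groups.items():
                if len(group) < min_support:
                    # Pruning: a cell below min_support cannot produce a
                    # qualifying cell in any higher dimension combination.
                    continue
                new_cuboid = cuboid + (dims[i],)
                new_cell = cell + (value,)
                yield new_cuboid, new_cell, len(group)
                # Only the surviving rows are passed on as input.
                yield from expand(group, i + 1, new_cuboid, new_cell)

    if len(rows) >= min_support:
        yield (), (), len(rows)
        yield from expand(rows, 0, (), ())


def materialize(cells):
    """Single consuming step (the analogue of one Spark action)."""
    return {(cuboid, cell): count for cuboid, cell, count in cells}
```

`buc_iceberg(rows, ["A", "B"], 2)` only builds the generator; the work happens when `materialize` consumes it. In Spark terms the generator plays the role of a chain of transformations and `materialize` plays the role of the one action that stores the result, although a Python generator is only a loose analogy for RDD lineage.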

Acknowledgements

This research was supported by a grant (17AUDP-B070719-05) from the Architecture & Urban Development Research Program funded by the Ministry of Land, Infrastructure and Transport of the Korean government, and by the Industrial Technology Innovation Program (Project#: 10052797) through the Korea Evaluation Institute of Industrial Technology (KEIT) funded by the Ministry of Trade, Industry and Energy.

Author information

Corresponding author

Correspondence to Suan Lee.

About this article

Cite this article

Lee, S., Kang, S., Kim, J. et al. Scalable distributed data cube computation for large-scale multidimensional data analysis on a Spark cluster. Cluster Comput 22, 2063–2087 (2019). https://doi.org/10.1007/s10586-018-1811-1

  • DOI: https://doi.org/10.1007/s10586-018-1811-1

Keywords

  • Distributed processing
  • Spark framework
  • Resilient distributed dataset
  • Data warehousing
  • On-line analytical processing
  • Multidimensional data cube
  • Iceberg cube