Distributed graph cube generation using Spark framework

Abstract

Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, which in turn enables interactive analysis of the statistical information contained in the graph. To support these OLAP functions efficiently, a graph cube is widely used; it maintains aggregate graphs for all combinations of dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. While previous approaches have used the MapReduce framework to cut down on this computation time, the more recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC computes them after first creating an execution plan. We also propose the Generate Multi-Dimension Table method to efficiently create a multidimensional graph table that represents the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed Spark SQL’s built-in DataFrame library as the size of the graphs increased.
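As a rough illustration of what a graph cube contains, the following single-machine Python sketch aggregates a toy attributed graph once per subset of dimensions, visiting the cuboids one after another. It is not the authors' GraphNaïve or GraphTDC implementation and does not use Spark; the function names (`aggregate_graph`, `naive_graph_cube`), the `gender` and `city` dimensions, the toy data, and the choice of node and edge counts as the measure are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def aggregate_graph(nodes, edges, dims):
    """Build one cuboid: group nodes by their values on `dims`,
    then count the nodes in each group and the edges between groups."""
    # Map each node id to its group key (a tuple of dimension values).
    key = {nid: tuple(attrs[d] for d in dims) for nid, attrs in nodes.items()}
    node_counts = Counter(key.values())
    # Edges are treated as directed (source group, target group) pairs.
    edge_counts = Counter((key[src], key[dst]) for src, dst in edges)
    return node_counts, edge_counts


def naive_graph_cube(nodes, edges, all_dims):
    """Compute every cuboid in sequence, one per subset of dimensions.
    The empty subset collapses the whole graph into a single group."""
    cube = {}
    for k in range(len(all_dims) + 1):
        for dims in combinations(all_dims, k):
            cube[dims] = aggregate_graph(nodes, edges, dims)
    return cube


# Hypothetical toy graph: three nodes, two dimensions, three directed edges.
nodes = {
    1: {"gender": "F", "city": "Seoul"},
    2: {"gender": "M", "city": "Seoul"},
    3: {"gender": "F", "city": "Busan"},
}
edges = [(1, 2), (2, 3), (1, 3)]

cube = naive_graph_cube(nodes, edges, ("gender", "city"))
gender_nodes, gender_edges = cube[("gender",)]
```

With two dimensions the cube holds 2^2 = 4 cuboids. Every cuboid here is recomputed from the source graph, which is the cost that grows quickly with the number of dimensions and the size of the graph.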

Acknowledgements

This research was supported by Korea Electric Power Corporation (Grant Number: R18XA05) and by the Industrial Technology Innovation Program (Project Number: 10052797), through the Korea Evaluation Institute of Industrial Technology (KEIT), funded by the Ministry of Trade, Industry and Energy.

Author information

Corresponding author

Correspondence to Suan Lee.

About this article

Cite this article

Kang, S., Lee, S. & Kim, J. Distributed graph cube generation using Spark framework. J Supercomput 76, 8118–8139 (2020). https://doi.org/10.1007/s11227-019-02746-4

  • DOI: https://doi.org/10.1007/s11227-019-02746-4

Keywords

  • Distributed parallel processing
  • Spark framework
  • Resilient distributed dataset
  • Graph cube
  • Data cube
  • Online analytical processing