Abstract
Graph OLAP is a technology that generates aggregates or summaries of a large-scale graph based on the properties (or dimensions) associated with its nodes and edges, and in turn enables interactive analyses of the statistical information contained in the graph. To efficiently support these OLAP functions, a graph cube is widely used, which maintains aggregate graphs for all dimensions of the source graph. However, computing the graph cube for a large graph requires an enormous amount of time. While previous approaches have used the MapReduce framework to cut down on this computation time, the recently developed Spark environment offers superior computational performance. To leverage the advantages of Spark, we propose the GraphNaïve and GraphTDC algorithms. GraphNaïve sequentially computes graph cuboids for all dimensions in a graph, while GraphTDC computes them after first creating an execution plan. We also propose the Generate Multi-Dimension Table method to efficiently create a multidimensional graph table to express the graph. Evaluation experiments demonstrated that the GraphTDC algorithm significantly outperformed Spark SQL’s built-in library DataFrame, as the size of graphs increased.
Similar content being viewed by others
References
Thomsen E (2002) OLAP solutions: building multidimensional information systems. Wiley, New York
Chaudhuri S, Dayal U (1997) An overview of data warehousing and OLAP technology. ACM Sigmod Rec 26:65–74
Beyer K and Ramakrishnan R (1999) Bottom-up computation of sparse and iceberg cube. In: ACM Sigmod Record
Gray J, Chaudhuri S, Bosworth A, Layman A, Reichart D, Venkatrao M, Pellow F, Pirahesh H (1997) Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min Knowl Discov 1:29–53
Zhao Y, Deshpande PM, Naughton JF (1997) An array-based algorithm for simultaneous multidimensional aggregates. In: ACM SIGMOD Record
Xin D, Han J, Li X, Wah BW (2003) Star-cubing: computing iceberg cubes by top-down and bottom-up integration. In: Proceedings of the 29th International Conference on Very Large Data Bases, vol 29
Xin D, Shao Z, Han J, Liu H (2006) C-cubing: efficient computation of closed cubes by aggregation-based checking. In: Proceedings of the 22nd International Conference on Data Engineering. ICDE’06
Ng RT, Wagner A, Yin Y (2001) Iceberg-cube computation with PC clusters. In: ACM SIGMOD record
Han J, Pei J, Dong G, Wang K (2001) Efficient computation of iceberg cubes with complex measures. In: ACM SIGMOD record
Fang M, Shivakumar N, Garcia-Molina H, Motwani R, Ullman JD (1998) Computing iceberg queries efficiently. In: International Conference on Very Large Databases (VLDB’98), New York, August 1998
Agarwal S, Agrawal R, Deshpande PM, Gupta A, Naughton JF, Ramakrishnan R, Sarawagi S (1996) On the computation of multidimensional aggregates. In: VLDB
Li X, Han J, Gonzalez H (2004) High-dimensional OLAP: a minimal cubing approach. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol 30
Wang Z, Chu Y, Tan K-L, Agrawal D, Abbadi AEI, Xu X (2013) Scalable data cube analysis over big data. arXiv preprint arXiv:1311.5663
Nandi A, Yu C, Bohannon P, Ramakrishnan R (2012) Data cube materialization and mining over mapreduce. IEEE Trans Knowl Data Eng 24:1747–1759
Lee S, Jo S, Kim J (2015) MRDataCube: data cube computation using MapReduce. In: 2015 International Conference on Big Data and Smart Computing (BigComp), pp 95–102
Milo T, Altshuler E (2016) An efficient MapReduce cube algorithm for varied DataDistributions. In: Proceedings of the 2016 International Conference on Management of Data
Lee S, Kang S, Kim J, Yu EJ (2018) Scalable distributed data cube computation for large-scale multidimensional data analysis on a Spark cluster. Clust Computing 1–25. https://doi.org/10.1007/s10586-018-1811-1
Yin M, Wu B, Zeng Z (2012) HMGraph OLAP: a novel framework for multi-dimensional heterogeneous network analysis. In: Proceedings of the Fifteenth International Workshop on Data Warehousing and OLAP
Qu Q, Zhu F, Yan X, Han J, Philip SY, Li H (2011) Efficient topological OLAP on information networks. In: International Conference on Database Systems for Advanced Applications
Li C, Yu PS, Zhao L, Xie Y, Lin W (2011) Infonetolaper: integrating infonetwarehouse and infonetcube with infonetolap. In: Proceedings of the VLDB Endowment, vol 4
Cook DJ, Holder LB (2006) Mining graph data. Wiley, New York
Chen C, Yan X, Zhu F, Han J, Philip SY (2008) Graph OLAP: towards online analytical processing on graphs. In: Eighth IEEE International Conference on Data Mining, ICDM’08, pp 103–112
Beheshti SMR, Benatallah B, Motahari-Nezhad HR, Allahbakhsh M (2012) A framework and a language for on-line analytical processing on graphs. In: International Conference on Web Information Systems Engineering
Zhao P, Li X, Xin D, Han J (2011) Graph cube: on warehousing and OLAP multidimensional networks. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data
Ghrab A et al (2015) A framework for building OLAP cubes on graphs. In: East European Conference on Advances in Databases and Information Systems. Springer, Cham
Bleco D, Yannis K (2018) Finding the needle in a haystack: entropy guided exploration of very large graph cubes. In: EDBT/ICDT Workshops
Azirani E et al (2015) Efficient OLAP operations for RDF analytics. In: 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW). IEEE
Wang Z, Fan Q, Wang H, Tan K-L, Agrawal D, El Abbadi A (2014) Pagrol: parallel graph olap over large-scale attributed graphs. In: 2014 IEEE 30th International Conference on Data Engineering (ICDE)
Denis B, Ghrab A, Skhiri S (2013) A distributed approach for graph-oriented multidimensional analysis. In: 2013 IEEE International Conference on Big Data
Spark A (2018) Apache Spark: unified analytics engine for big data. The Apache Software Foundation. http://spark.apache.org. Accessed 8 Jan 2019
Xin RS, Crankshaw D, Dave A, Gonzalez JE, Franklin MJ, Stoica I (2014) Graphx: unifying data-parallel and graph-parallel analytics. arXiv preprint arXiv:1402.2394
Shoro AG, Soomro TR (2015) Big data analysis: Apache spark perspective. Global J Comput Sci Technol
Shanahan JG, Dai L (2015) Large scale distributed data science using apache spark. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Carlini E, Dazzi P, Esposito A, Lulli A, Ricci L (2014) Balanced graph partitioning with apache spark. In: European Conference on Parallel Processing
Zadeh RB, Meng X, Ulanov A, Yavuz B, Pu L, Venkataraman S, Sparks E, Staple A, Zaharia M (2016) Matrix computations and optimization in apache spark. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Yang L et al (2018) Min-forest: fast reachability indexing approach for large-scale graphs on spark platform. In: International Conference on Web Services. Springer, Cham
Lee S et al (2018) TensorLightning: a traffic-efficient distributed deep learning on commodity Spark clusters. IEEE Access 6:27671–27680
Tian X et al (2017) Towards memory and computation efficient graph processing on spark. In: 2017 IEEE International Conference on Big Data. IEEE
Karim MR et al (2018) Mining maximal frequent patterns in transactional databases and dynamic data streams: a spark-based approach. Inf Sci 432:278–300
Jensen SK, Torben BP, Christian T (2018) ModelarDB: modular model-based time series management with spark and cassandra. Proc VLDB Endow 11(11):1688–1701
Kim J et al (2017) Optimized combinatorial clustering for stochastic processes. Cluster Comput 20(2):1135–1148
Alemi Mehdi, Haghighi Hassan, Shahrivari Saeed (2017) CCFinder: using Spark to find clustering coefficient in big graphs. J Supercomput 73(11):4683–4710
Hadoop A (2018) Apache Hadoop. The Apache Software Foundation. http://hadoop.apache.org. Accessed 8 Jan 2019
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation
Leskovec J, Sosič R (2016) Snap: a general-purpose network analysis and graph-mining library. ACM Trans Intell Syst Technol (TIST) 8(1):1
Mühleisen H, Bizer C (2012) Web data commons—extracting structured data from two large web corpora. In: CEUR Workshop Proceedings LDOW 2012: Linked Data on the Web, vol 937. CEUR-ws.org
Armbrust M, Xin RS, Lian C, Huai Y, Liu D, Bradley JK, Meng X, Kaftan T, Franklin MJ, Ghodsi A, Zaharia M (2015) Spark SQL: relational data processing in Spark. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, pp 1383–1394
Acknowledgements
This research was supported by Korea Electric Power Corporation. (Grant Number: R18XA05) and by the Industrial Technology Innovation Program (Project#: 10052797), through the Korea Evaluation Institute of Industrial Technology (Keit), funded by the Ministry of Trade, Industry and Energy.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Kang, S., Lee, S. & Kim, J. Distributed graph cube generation using Spark framework. J Supercomput 76, 8118–8139 (2020). https://doi.org/10.1007/s11227-019-02746-4
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-019-02746-4