Efficient parallel edge-centric approach for relaxed graph pattern matching

Abstract

Existing graph simulation algorithms for distributed graphs do not scale well because they rely on heavy message passing. They also depend on the quality of the graph partitioning, which can become a bottleneck given the natural skew of real-world data. As a result, their degree of parallelism is limited. In this paper, we propose an efficient parallel edge-centric approach for distributed graph pattern matching. We design a novel distributed data structure called ST that enables fine-grained parallelism and hence guarantees linear scalability. Based on ST, we develop a parallel graph simulation algorithm called PGSim. Furthermore, we propose PDSim, an edge-centric algorithm that efficiently evaluates dual simulation in parallel. PDSim combines ST and PGSim in a Split-and-Combine approach to accelerate the computation stages. We establish the effectiveness and efficiency of these proposals through theoretical guarantees and extensive experiments on massive graphs. The results confirm that our approach outperforms existing algorithms by more than an order of magnitude.
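For readers unfamiliar with the two matching models named in the abstract, the following is a minimal sequential sketch of the textbook fixpoint definitions of graph simulation and dual simulation. It is not ST, PGSim, or PDSim, whose details are not given here; the function name `simulation`, its parameters, and the toy graph are all assumptions made for illustration. The loop simply revisits pattern edges until no candidate set shrinks.

```python
from collections import defaultdict


def simulation(pattern_edges, pattern_labels, data_edges, data_labels, dual=False):
    """Return {pattern_vertex: set(data_vertices)} for the maximum simulation.

    Hypothetical helper for illustration only. With dual=True the parent
    condition of dual simulation is enforced as well. If any pattern vertex
    ends up without a match, every set is emptied (no match exists).
    """
    succ = defaultdict(set)  # data vertex -> its successors
    pred = defaultdict(set)  # data vertex -> its predecessors
    for u, v in data_edges:
        succ[u].add(v)
        pred[v].add(u)

    # Start from label equality, then prune until a fixpoint is reached.
    sim = {q: {v for v, lab in data_labels.items() if lab == qlab}
           for q, qlab in pattern_labels.items()}

    changed = True
    while changed:
        changed = False
        for q1, q2 in pattern_edges:
            # Child condition: a match of q1 needs a successor matching q2.
            keep = {v for v in sim[q1] if succ[v] & sim[q2]}
            if keep != sim[q1]:
                sim[q1], changed = keep, True
            if dual:
                # Parent condition: a match of q2 needs a predecessor matching q1.
                keep = {v for v in sim[q2] if pred[v] & sim[q1]}
                if keep != sim[q2]:
                    sim[q2], changed = keep, True

    if any(not matches for matches in sim.values()):
        return {q: set() for q in sim}
    return sim


# Toy example: pattern qa -> qb with labels A and B.
pattern_edges = [("qa", "qb")]
pattern_labels = {"qa": "A", "qb": "B"}
data_edges = [("a1", "b1")]
data_labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}

plain = simulation(pattern_edges, pattern_labels, data_edges, data_labels)
dual = simulation(pattern_edges, pattern_labels, data_edges, data_labels, dual=True)
print(plain)  # qa keeps only a1; qb keeps b1 and b2
print(dual)   # qb additionally loses b2, which has no A-labelled parent
```

In this toy graph, plain simulation drops `a2` because it has no B-labelled successor, while dual simulation also drops `b2` because it has no A-labelled predecessor. That difference is why dual simulation is the stricter of the two relaxed models.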

Acknowledgements

This work was supported by the Franco-Algerian program PHC Tassili BiGreen n°18 MDU 111 and by the DGRSDT grant FNRSDT N°253. The experiments presented in this work were carried out using the High Performance Computing Platform IBNBADIS provided by the Research Center on Scientific and Technical Information (CERIST, Algeria).

Author information

Corresponding author

Correspondence to Sarra Bouhenni.

About this article

Cite this article

Bouhenni, S., Yahiaoui, S., Nouali-Taboudjemat, N. et al. Efficient parallel edge-centric approach for relaxed graph pattern matching. J Supercomput (2021). https://doi.org/10.1007/s11227-021-03938-7

Keywords

  • Graph pattern matching
  • Subgraph matching
  • Graph simulation
  • Dual simulation
  • Massive graph
  • Parallel algorithm