Similarity Search in Large-Scale Graph Databases

  • Peixiang Zhao
Chapter

Abstract

Graphs are ubiquitous and play an essential role in modeling and representing complex structures in real-world networked applications. Given a graph database that comprises a large collection of graphs, fast and flexible search for structurally similar graphs is a fundamental requirement. In this chapter, we survey recent graph similarity search techniques, focusing on work based on the graph edit distance (GED) metric. State-of-the-art approaches to GED-based similarity search typically adopt a pruning-and-verification framework. They first exploit easy-to-compute lower bounds of the graph edit distance, and use graph indexing structures to evaluate these lower bounds efficiently between the query graph and the graphs in the database. Any graph whose lower bound already exceeds the query threshold cannot be an answer, so it is filtered out and excluded from further investigation. The costly GED verification is then performed only for the graphs that pass the lower-bound evaluation. We examine existing GED lower bounds, graph index structures, and similarity search algorithms in detail, and compare different similarity search methods in terms of index construction cost, similarity search performance, and applicability to real-world graph databases. Finally, we discuss future research directions related to similarity search and high-performance query processing in large-scale graph databases.
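The pruning-and-verification framework described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the function names (label_lower_bound, exact_ged, similarity_search), the graph representation (a list of vertex labels plus a list of undirected, unlabeled edges), the unit-cost edit model, and the simple label-count lower bound. The exact verification is a brute-force enumeration that is only practical for very small graphs, and no index structure is used.

```python
from collections import Counter
from itertools import permutations

# A graph is a pair (labels, edges): labels[i] is the label of vertex i,
# edges is a list of undirected, unlabeled pairs (u, v). Hypothetical format.

def label_lower_bound(g1, g2):
    """Cheap GED lower bound under unit costs (illustrative, not from the chapter)."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    # Vertices that can be matched without any relabel, insertion, or deletion.
    common = sum((Counter(labels1) & Counter(labels2)).values())
    vertex_ops = max(len(labels1), len(labels2)) - common
    # At least this many edge insertions or deletions are needed.
    edge_ops = abs(len(edges1) - len(edges2))
    return vertex_ops + edge_ops

def exact_ged(g1, g2):
    """Brute-force exact GED under unit costs; factorial time, tiny graphs only."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with dummy vertices (label None) so that every
    # vertex mapping is a permutation; a dummy models an insertion or deletion.
    pad1 = list(labels1) + [None] * (n - len(labels1))
    pad2 = list(labels2) + [None] * (n - len(labels2))
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    best = None
    for perm in permutations(range(n)):
        cost = sum(pad1[i] != pad2[perm[i]] for i in range(n))
        mapped = {frozenset(perm[v] for v in e) for e in e1}
        cost += len(mapped ^ e2)  # edges to delete plus edges to insert
        if best is None or cost < best:
            best = cost
    return best

def similarity_search(database, query, tau):
    """Return (indices of graphs with GED <= tau, number of graphs verified)."""
    answers, verified = [], 0
    for idx, g in enumerate(database):
        if label_lower_bound(query, g) > tau:
            continue  # pruned: the true GED cannot be <= tau
        verified += 1
        if exact_ged(query, g) <= tau:
            answers.append(idx)
    return answers, verified

query = (["C", "C", "O"], [(0, 1), (1, 2)])
database = [
    (["C", "C", "N"], [(0, 1), (1, 2)]),
    (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)]),
    (["N", "N", "N", "N", "N"], [(0, 1), (1, 2), (2, 3), (3, 4)]),
]
answers, verified = similarity_search(database, query, tau=1)
```

With tau=1 in this toy database, the lower bound alone rules out the second and third graphs, so the expensive exact computation runs only once. A real system would replace the linear scan with an index over the lower-bound features and the brute-force step with a much more careful exact GED algorithm.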

Keywords

Similarity Search, Inverted Index, Graph Database, Graph Edit, Inverted List
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science, Florida State University, Tallahassee, USA
