Abstract
Subgraph counting, a fundamental problem in network analysis, is to count the number of subgraphs in a data graph that match a given query graph by either homomorphism or subgraph isomorphism. The importance of subgraph counting derives from the fact that it provides insights into a large graph, in particular a labeled graph, when a collection of query graphs with different sizes and labels is issued. The problem is challenging. On the one hand, exact counting by enumerating subgraphs is NP-hard. On the other hand, approximate counting by subgraph isomorphism can only support small query graphs over unlabeled graphs. Another way to count subgraphs is to specify the count as an SQL query and estimate the cardinality of that query in an RDBMS. Existing approaches for cardinality estimation can support subgraph counting by homomorphism only to a limited extent, as it is difficult to deal with sampling failures when the query graph becomes large. A question that arises is how to support subgraph counting by machine learning (ML) and deep learning (DL). To devise an ML/DL solution, apart from the query graphs, another issue is how to handle large data graphs, as the existing DL approach for subgraph isomorphism counting can only support small data graphs. In addition, the ML/DL approaches proposed in the RDBMS context for approximate query processing and cardinality estimation cannot be used, as subgraph counting requires complex self-joins over a single relation, whereas existing approaches focus on joins over multiple relations. In this work, we propose an active learned sketch for subgraph counting (\(\textsf{ALSS}\)) with two main components: a learned sketch for subgraph counting and an active learner. The sketch is constructed by a neural network regression model, and the active learner performs model updates based on newly arriving test query graphs.
Our holistic learning framework supports both undirected and directed graphs, whose nodes and/or edges are associated with zero or more labels. We conduct extensive experimental studies to confirm the effectiveness and efficiency of \(\textsf{ALSS}\) on large real labeled graphs. Moreover, we show that \(\textsf{ALSS}\) can assist query optimizers in finding better query plans for complex multi-way self-joins.
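The abstract's observation that subgraph counting can be posed as estimating the cardinality of a multi-way self-join over a single edge relation can be made concrete with a small sketch. This is not the paper's method; the table name `E`, the in-memory SQLite setup, and the triangle query are illustrative assumptions.

```python
import sqlite3

# Hypothetical tiny data graph stored as a single edge relation E(src, dst).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE E (src INTEGER, dst INTEGER)")
# An undirected triangle on nodes {0, 1, 2}, stored with both edge directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)]
conn.executemany("INSERT INTO E VALUES (?, ?)", edges)

# Counting homomorphisms of a triangle query reduces to a 3-way self-join:
# each query edge becomes one occurrence of E, and join predicates encode
# the shared query nodes. The answer is exactly the query's cardinality.
(count,) = conn.execute("""
    SELECT COUNT(*)
    FROM E AS e1, E AS e2, E AS e3
    WHERE e1.dst = e2.src AND e2.dst = e3.src AND e3.dst = e1.src
""").fetchone()
print(count)  # 6: one homomorphism per ordering of the triangle's nodes
```

Even this 3-edge query already needs a 3-way self-join; larger query graphs produce one `E` occurrence per query edge, which is why accurate cardinality estimation for such joins is hard.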
Notes
Homomorphism allows two nodes in a query graph to be mapped to the same node in a data graph, whereas subgraph isomorphism does not.
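The footnote's distinction can be illustrated with a brute-force sketch (a toy example, not the paper's ALSS model; the graph, node names, and function names are invented for illustration):

```python
from itertools import product, permutations

def count_homomorphisms(query_edges, query_nodes, data_adj, data_nodes):
    # A homomorphism may map distinct query nodes to the same data node,
    # so we enumerate all (possibly repeating) assignments.
    total = 0
    for image in product(data_nodes, repeat=len(query_nodes)):
        phi = dict(zip(query_nodes, image))
        if all((phi[u], phi[v]) in data_adj for u, v in query_edges):
            total += 1
    return total

def count_subgraph_isomorphisms(query_edges, query_nodes, data_adj, data_nodes):
    # Subgraph isomorphism additionally requires the mapping to be injective,
    # so we enumerate only assignments without repeated data nodes.
    total = 0
    for image in permutations(data_nodes, len(query_nodes)):
        phi = dict(zip(query_nodes, image))
        if all((phi[u], phi[v]) in data_adj for u, v in query_edges):
            total += 1
    return total

# Toy data graph: a triangle on {0, 1, 2} with a self-loop at node 0;
# edges are stored symmetrically, so one directed check per query edge suffices.
data_adj = {(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (0, 0)}
query_nodes, query_edges = ["u", "v"], [("u", "v")]  # single-edge query u-v

hom = count_homomorphisms(query_edges, query_nodes, data_adj, [0, 1, 2])
iso = count_subgraph_isomorphisms(query_edges, query_nodes, data_adj, [0, 1, 2])
print(hom, iso)  # 7 6: only homomorphism counts mapping both u and v to node 0
```

The one extra homomorphism maps both query nodes onto the self-loop at node 0, which the injectivity requirement of subgraph isomorphism rules out.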
Acknowledgements
This work was supported by the Research Grants Council of Hong Kong, China, under No. 14203618, No. 14202919 and No. 14205520.
Cite this article
Zhao, K., Yu, J.X., Li, Q. et al. Learned sketch for subgraph counting: a holistic approach. The VLDB Journal 32, 937–962 (2023). https://doi.org/10.1007/s00778-023-00781-5