Abstract
Streaming graph analysis is gaining importance in various fields due to the natural dynamicity of many real-world graph applications. However, approximately counting triangles in real-world streaming graphs with duplicate edges under the sliding-window model remains an unsolved problem. In this paper, we propose the SWTC algorithm to address the approximate sliding-window triangle counting problem in streaming graphs. SWTC uses a fixed-length slicing strategy that addresses both the sample maintenance and cardinality estimation issues with bounded memory usage. We theoretically prove the superiority of our method in sample graph size and estimation accuracy under a given memory upper bound. To further improve the performance of our algorithm, we propose two optimization techniques: vision counting, which avoids computation peaks, and asynchronous grouping, which stabilizes the accuracy. Extensive experiments also confirm that our approach achieves higher accuracy than the baseline method under the same memory usage.
Notes
Depending on whether binary counting or weighted counting semantics are used, we need to either count the number of distinct edges or include the duplicate edges in the count.
Note that a bidirectional edge means a single edge marked as bidirectional rather than two edges with opposite directions. In the latter case, the motif includes 4 edges rather than 3.
References
Berry, J.W., Hendrickson, B., LaViolette, R.A., Phillips, C.A.: Tolerating the community detection resolution limit with edge weighting. Phys. Rev. E 83(5), 056119 (2011)
Eckmann, J.-P., Moses, E.: Curvature of co-links uncovers hidden thematic layers in the world wide web. Proc. Natl. Acad. Sci. USA 99(9), 5825–5829 (2002)
Becchetti, L., Boldi, P., Castillo, C., Gionis, A.: Efficient algorithms for large-scale local triangle counting. ACM Trans. Know. Dis. Data (TKDD) 4(3), 13 (2010)
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
Kang, U., Meeder, B., Papalexakis, E.E., Faloutsos, C.: Heigen: Spectral analysis for billion-scale graphs. IEEE Trans. Know. Data Eng. 26(2), 350–362 (2012)
Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B.Y., Dai, Y.: Uncovering social network sybils in the wild. ACM Trans. Know. Dis. Data (TKDD) 8(1), 1–29 (2014)
Li, Z., Lu, Y., Zhang, W.-P., Li, R.-H., Guo, J., Huang, X., Mao, R.: Discovering hierarchical subgraphs of k-core-truss. Data Sci. Eng. 3(2), 136–149 (2018)
Pavan, A., Tangwongsan, K., Tirthapura, S., Wu, K.-L.: Counting and sampling triangles from a graph stream. Proc. VLDB Endowment 6(14), 1870–1881 (2013)
Ahmed, N.K., Duffield, N., Neville, J., Kompella, R.: Graph sample and hold: A framework for big-graph analytics. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2014)
Wang, P., Qi, Y., Sun, Y., Zhang, X., Guan, X.: Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage. Proc. VLDB Endowment 11(2), 162–175 (2017)
Boykin, P.O., Roychowdhury, V.P.: Leveraging social networks to fight spam. Computer 38(4), 61–68 (2005)
Datar, M., Gionis, A., Indyk, P., Motwani, R.: Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)
Li, Y., Zou, L., Özsu, M.T., Zhao, D.: Time constrained continuous subgraph search over streaming graphs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1082–1093. IEEE, (2019)
Crouch, M.S., McGregor, A., Stubbs, D.: Dynamic graphs in the sliding-window model. In European Symposium on Algorithms, pages 337–348. Springer, (2013)
Qiu, X., Cen, W., Qian, Z., Peng, Y., Zhang, Y., Lin, X., Zhou, J.: Real-time constrained cycle detection in large dynamic graphs. Proc. VLDB Endowment 11(12), 1876–1888 (2018)
Jung, M., Lim, Y., Lee, S., Kang, U.: Furl: fixed-memory and uncertainty reducing local triangle counting for multigraph streams. Data Min. Know. Dis. 33(5), 1225–1253 (2019)
De Stefani, L., Epasto, A., Riondato, M., Upfal, E.: Triest: counting local and global triangles in fully dynamic streams with fixed memory size. ACM Trans. Know. Dis. Data (TKDD) 11(4), 1–50 (2017)
Shin, K., Oh, S., Kim, J., Hooi, B., Faloutsos, C.: Fast, accurate and provable triangle counting in fully dynamic graph streams. ACM Trans. Know. Dis. Data (TKDD) 14(2), 1–39 (2020)
Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: ACM SIGMOD International Conference on Management of Data, (2008)
Flajolet, P., Fusy, É., Gandouet, O., Meunier, F.: Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science, pages 137–156. Discrete Mathematics and Theoretical Computer Science, (2007)
Ting, D.: Streamed approximate counting of distinct elements: Beating optimal batch methods. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 442–451 (2014)
Lee, D., Shin, K., Faloutsos, C.: Temporal locality-aware sampling for accurate triangle counting in real graph streams. The VLDB Journal, pages 1–25 (2020)
Source code of swtc and the baseline method. https://github.com/StreamingTriangleCounting/TriangleCounting.git
Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 633–634. Society for Industrial and Applied Mathematics (2002)
Seidel, R., Aragon, C.R.: Randomized search trees. Algorithmica 16(4), 464–497 (1996)
Kac, M.: Statistical Independence in Probability, Analysis and Number Theory. Courier Dover Publications, New York (2018)
Duffield, N.G., Grossglauser, M.: Trajectory sampling for direct traffic observation. IEEE/ACM Trans. Netw. 9(3), 280–292 (2001)
Duffield, N.: Sampling for passive internet measurement: A review. Stat. Sci. 19(3), 472–498 (2004)
Aggarwal, C.C., Zhao, Y., Yu, P.S.: Outlier detection in graph streams. In 2011 IEEE 27th international conference on data engineering, pages 399–409. IEEE, (2011)
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., Murthy, R.: Hive-a petabyte scale data warehouse using hadoop. In: 2010 IEEE 26th international conference on data engineering (ICDE 2010), pages 996–1005. IEEE (2010)
Molina, M., Niccolini, S., Duffield, N.G.: A comparative experimental study of hash functions applied to packet sampling. In: Proc. of International Teletraffic Congress (ITC) (2005)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Slota, G.M., Madduri, K.: Complex network analysis using parallel approximate motif counting. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pages 405–414. IEEE, (2014)
Bressan, M., Chierichetti, F., Kumar, R., Leucci, S., Panconesi, A.: Motif counting beyond five nodes. ACM Trans. Know. Dis. Data (TKDD) 12(4), 1–25 (2018)
Bobhash function. http://burtleburtle.net/bob/hash/doobs.html
Murmurhash function. Published by Austin Appleby at https://github.com/aappleby/smhasher
Sedgewick, R.: Algorithms in C. Pearson Education, (2001)
Aphash and collection of other hash functions. http://www.partow.net/programming/hashfunctions/#RSHashFunction
Alon, N., Yuster, R., Zwick, U.: Finding and counting given length cycles. Algorithmica 17(3), 209–223 (1997)
Arifuzzaman, S., Khan, M., Marathe, M.: Patric: a parallel algorithm for counting triangles in massive networks. In ACM International Conference on Information & Knowledge Management, (2013)
Hu, X., Tao, Y., Chung, C.-W.: Massive graph triangulation. In: ACM SIGMOD International Conference on Management of Data, (2013)
Kim, J., Han, W.-S., Lee, S., Park, K., Yu, H.: OPT: a new framework for overlapped and parallel triangulation in large-scale graphs. (2014)
Park, H.-M., Myaeng, S.-H., Kang, U.: Pte: Enumerating trillion triangles on distributed systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1115–1124 (2016)
Bar-Yossef, Z., Kumar, R., Sivakumar, D.: Reductions in streaming algorithms, with an application to counting triangles in graphs. In: Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, pages 623–632. Society for Industrial and Applied Mathematics (2002)
Buriol, L.S., Frahling, G., Leonardi, S., Marchetti-Spaccamela, A., Sohler, C.: Counting triangles in data streams. In ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, (2006)
Jowhari, H., Ghodsi, M.: New streaming algorithms for counting triangles in graphs. In International Computing and Combinatorics Conference, pages 710–716. Springer, (2005)
Lim, Y., Kang, U.: Mascot: Memory-efficient and accurate sampling for counting local triangles in graph streams. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 685–694. ACM (2015)
Jha, M., Seshadhri, C., Pinar, A.: A space efficient streaming algorithm for triangle counting using the birthday paradox. (2013)
Tsourakakis, C.E., Kang, U., Miller, G.L., Faloutsos, C.: Doulion: counting triangles in massive graphs with a coin. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 837–846, (2009)
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Soft. (TOMS) 11(1), 37–57 (1985)
Gemulla, R., Lehner, W., Haas, P.J.: Maintaining bounded-size sample synopses of evolving datasets. The VLDB J. 17(2), 173–201 (2008)
Braverman, V., Ostrovsky, R., Zaniolo, C.: Optimal sampling from sliding windows. In Twenty-eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, (2009)
Cormode, G., Muthukrishnan, S., Yi, K., Zhang, Q.: Optimal sampling from distributed streams. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 77–86, (2010)
Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: Densification and shrinking diameters. ACM Trans. Know. Dis. Data (TKDD), 1(1):2–es, (2007)
Additional information
This work was supported by NSFC under Grant 61932001 and U20A20174.
Appendix
1.1 Influence of duplication ratio
To evaluate the influence of the duplication ratio of the streaming graph, we carry out experiments on a synthetic dataset, FF, generated by the Forest Fire model [54]. It includes 18,311,282 edges and 1M nodes, and originally contains no duplicate edges. We generate edge frequencies for it following a power-law distribution and vary the duplication ratio across experiments. The timestamps in this dataset are randomly generated. We formally define the duplication ratio as \(\frac{\text{total number of edges}}{\text{number of distinct edges}}-1\). The window length is set to 3M and the sample rate is set to \(4\%\). We use binary counting semantics in this experiment. The memory usage and the valid sample size do not change with the duplication ratio. The experimental result in Fig. 17 shows that MAPE and max error decrease as the duplication ratio increases, because with more duplicate edges, the number of distinct edges in the sliding window decreases, and the sample size becomes relatively large.
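The duplication ratio defined above can be computed directly from an edge list. The following is a minimal sketch rather than the code used in the experiments: the function name `duplication_ratio` and the toy stream are hypothetical, and edges are assumed to be ordered (source, destination) pairs, so that (a, b) and (b, a) count as different edges.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

# Assumption: an edge is an ordered (source, destination) pair.
Edge = Tuple[Hashable, Hashable]


def duplication_ratio(edges: Iterable[Edge]) -> float:
    """Return (total number of edges / number of distinct edges) - 1."""
    counts = Counter(edges)          # frequency of each distinct edge
    if not counts:
        raise ValueError("edge stream is empty")
    total = sum(counts.values())     # every occurrence, duplicates included
    distinct = len(counts)           # each edge counted once
    return total / distinct - 1


# Toy stream: 6 edge occurrences over 3 distinct edges.
stream = [("a", "b"), ("b", "c"), ("a", "b"), ("c", "a"), ("a", "b"), ("b", "c")]
print(duplication_ratio(stream))
```

A stream without duplicates has ratio 0, and the ratio grows as the same edges recur more often. With a fixed window length, a higher ratio therefore means fewer distinct edges in the window.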
About this article
Cite this article
Gou, X., Zou, L. Sliding window-based approximate triangle counting with bounded memory usage. The VLDB Journal 32, 1087–1110 (2023). https://doi.org/10.1007/s00778-023-00783-3
DOI: https://doi.org/10.1007/s00778-023-00783-3