
StruClus: Scalable Structural Graph Set Clustering with Representative Sampling

  • Till Schäfer
  • Petra Mutzel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10604)

Abstract

We present a structural clustering algorithm for large-scale datasets of small labeled graphs, utilizing a frequent subgraph sampling strategy. A set of representatives provides an intuitive description of each cluster, supports the clustering process, and helps to interpret the clustering results. The projection-based nature of the clustering approach allows us to bypass dimensionality and feature extraction problems that arise in the context of graph datasets reduced to pairwise distances or feature vectors. The algorithm achieves high-quality, human-interpretable clusterings while its runtime grows only linearly with the number of graphs. Furthermore, the approach is easy to parallelize and therefore suitable for very large datasets. Our extensive experimental evaluation on synthetic and real-world datasets demonstrates the superiority of our approach over existing structural and subspace clustering algorithms in terms of both runtime and quality.
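To make the idea of assigning graphs to clusters via representatives more concrete, the following is a minimal toy sketch and not the StruClus algorithm itself. It assumes each graph is flattened to a set of label pairs (one pair per edge), so that "the representative is contained in the graph" reduces to a subset test; a real structural approach would need a subgraph isomorphism test instead. The names `make_graph`, `contains`, and `assign` are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Edge = Tuple[str, str]   # an edge, reduced to the sorted pair of its endpoint labels
Graph = FrozenSet[Edge]  # toy stand-in: a graph as a set of such label pairs


def make_graph(edges: List[Edge]) -> Graph:
    """Normalize each edge so that ("O", "C") and ("C", "O") compare equal."""
    return frozenset((min(a, b), max(a, b)) for a, b in edges)


def contains(graph: Graph, pattern: Graph) -> bool:
    """Toy containment test (assumption): subset of label pairs.

    This ignores the actual graph structure; it only stands in for a
    proper subgraph isomorphism check.
    """
    return pattern <= graph


def assign(graphs: List[Graph],
           representatives: Dict[str, List[Graph]]) -> List[Optional[str]]:
    """Assign each graph to the cluster whose representatives it supports best.

    The score is the fraction of a cluster's representatives contained in
    the graph. Graphs that support no representative stay unassigned (None).
    For a fixed set of representatives this is a single pass over the graphs.
    """
    assignment: List[Optional[str]] = []
    for g in graphs:
        best_cluster, best_score = None, 0.0
        for cluster, reps in representatives.items():
            if not reps:
                continue  # a cluster without representatives cannot match
            score = sum(contains(g, r) for r in reps) / len(reps)
            if score > best_score:
                best_cluster, best_score = cluster, score
        assignment.append(best_cluster)
    return assignment
```

Because every graph is compared only against the current representatives and never against other graphs, the work in this sketch grows linearly with the number of graphs, and the loop over graphs could be split across workers without coordination.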

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Computer Science, TU Dortmund University, Dortmund, Germany
