
Journal of Intelligent Information Systems, Volume 42, Issue 2, pp 307–332

Finding the most descriptive substructures in graphs with discrete and numeric labels

  • Michael Davis
  • Weiru Liu
  • Paul Miller

Abstract

Many graph datasets are labelled with discrete and numeric attributes. Most frequent substructure discovery algorithms ignore numeric attributes; in this paper we show how they can be used to improve search performance and discrimination. Our thesis is that the most descriptive substructures are those which are normative both in terms of their structure and in terms of their numeric values. We explore the relationship between graph structure and the distribution of attribute values and propose an outlier-detection step, which is used as a constraint during substructure discovery. By pruning anomalous vertices and edges, more weight is given to the most descriptive substructures. Our method is applicable to multi-dimensional numeric attributes; we outline how it can be extended for high-dimensional data. We support our findings with experiments on transaction graphs and single large graphs from the domains of physical building security and digital forensics, measuring the effect on runtime, memory requirements and coverage of discovered patterns, relative to the unconstrained approach.
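The outlier-based pruning step described in the abstract can be illustrated with a minimal sketch. It assumes that each edge carries a tuple of numeric attributes, that anomalies are scored with a simplified Local Outlier Factor (exactly k neighbours, ties ignored), and that edges scoring above a fixed threshold are removed before substructure discovery. The function names, the edge format `(u, v, label, attrs)`, the sample values and the threshold of 2.0 are all illustrative assumptions, not the authors' implementation.

```python
import math


def lof_scores(points, k):
    """Simplified Local Outlier Factor for each point.

    Assumption: each point uses exactly its k nearest neighbours
    (ties at the k-distance are not expanded as in the full definition).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    knn = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbour.
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        mean_reach = sum(max(k_dist[j], dist[i][j]) for j in knn[i]) / k
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicates
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]


def prune_anomalous_edges(edges, k=2, threshold=2.0):
    """Keep only edges whose numeric attributes are not flagged as outliers.

    edges: list of (u, v, label, attrs) where attrs is a tuple of floats.
    """
    scores = lof_scores([attrs for (_, _, _, attrs) in edges], k)
    return [edge for edge, score in zip(edges, scores) if score <= threshold]


# Hypothetical edges sharing one label, each with a one-dimensional attribute.
edges = [
    ("a", "b", "door", (10.0,)),
    ("b", "c", "door", (10.5,)),
    ("c", "d", "door", (11.0,)),
    ("d", "e", "door", (9.5,)),
    ("e", "f", "door", (10.2,)),
    ("f", "g", "door", (50.0,)),  # numerically anomalous edge
]
kept = prune_anomalous_edges(edges)
```

In this sketch the pruned edge list would then be passed to an ordinary frequent substructure miner, so that the remaining edges are the ones that count towards pattern support.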

Keywords

Graph mining · Frequent substructure discovery · Constraint-based mining · Labelled graphs · Numeric attributes · Outlier detection

Notes

Acknowledgments

We would like to thank Erich Schubert at Ludwig-Maximilians Universität München for assistance with verifying our LOF implementation and providing us with the RP + PINN + LOF implementation ahead of its official release in ELKI.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Centre for Secure Information Technologies (CSIT), School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK
  2. Knowledge and Data Engineering Cluster, School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, UK
