Abstract
Many graph datasets are labelled with discrete and numeric attributes. Most frequent substructure discovery algorithms ignore numeric attributes; in this paper we show how they can be used to improve search performance and discrimination. Our thesis is that the most descriptive substructures are those which are normative both in terms of their structure and in terms of their numeric values. We explore the relationship between graph structure and the distribution of attribute values and propose an outlier-detection step, which is used as a constraint during substructure discovery. By pruning anomalous vertices and edges, more weight is given to the most descriptive substructures. Our method is applicable to multi-dimensional numeric attributes; we outline how it can be extended for high-dimensional data. We support our findings with experiments on transaction graphs and single large graphs from the domains of physical building security and digital forensics, measuring the effect on runtime, memory requirements and coverage of discovered patterns, relative to the unconstrained approach.
Notes
Sometimes overlapping substructures can contribute independently: Palacio (2005) discusses alternative strategies for handling overlapping substructures in Subdue.
Figure 3 shows the results for R-MAT random graphs, with 0, 1, . . . , 9 binary labels, i.e. 1–512 vertex partitions. The label values were assigned independently from a uniform distribution. Our experiments on real datasets (Section 5) verify that the complexity of substructure discovery increases with the homogeneity of vertices and edges, and that this holds when the independence assumption is removed.
Graphs are created from the August 21, 2009 version of the Enron corpus, https://www.cs.cmu.edu/~enron/. Identification of individuals and job roles is from “Ex-employee Status Report”, http://www.isi.edu/~adibi/Enron/Enron.htm. This spreadsheet contains 161 names, but two of them appear to be duplicates.
Environment for Developing KDD Applications Supported by Index Structures (Achtert et al. 2013), http://elki.dbs.ifi.lmu.de/
Graph Exchange XML Format, http://gexf.net/format/
We were unable to calculate RP + PINN + LOF for larger datasets due to memory constraints. The random projection (Achlioptas 2003) used by PINN is designed to be “database-friendly”; by changing the indexing method it would be possible to create a PINN implementation which creates its index on disk in order to process larger or higher-dimensional datasets.
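The “database-friendly” projection of Achlioptas (2003) replaces a dense Gaussian projection matrix with sparse entries drawn from {+1, 0, −1}, so it can be computed with additions and subtractions only. As an illustration (this is a generic sketch, not the PINN implementation; the function name and data layout are hypothetical):

```python
import random

def random_projection(rows, k, seed=0):
    """Project d-dimensional rows into k dimensions using the sparse
    matrix of Achlioptas (2003): each entry is +1 with prob. 1/6,
    0 with prob. 2/3, -1 with prob. 1/6, scaled by sqrt(3/k) so that
    squared Euclidean distances are preserved in expectation
    (Johnson-Lindenstrauss)."""
    rng = random.Random(seed)
    d = len(rows[0])
    s = (3.0 / k) ** 0.5
    # A 6-way choice gives the 1/6, 2/3, 1/6 entry probabilities.
    R = [[s * rng.choice((1, 0, 0, 0, 0, -1)) for _ in range(d)]
         for _ in range(k)]
    return [[sum(R[i][j] * x[j] for j in range(d)) for i in range(k)]
            for x in rows]
```

Because two thirds of the entries are zero, an index over the projected data can be built with roughly one third of the arithmetic of a dense projection, which is what makes a disk-resident index practical.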
References
Achtert, E., Kriegel, H.P., Schubert, E., Zimek, A. (2013). Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM international conference on management of data (SIGMOD).
Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4), 671–687. doi:10.1016/S0022-0000(03)00025-4.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data, 3rd edn. Chichester: Wiley Probability & Statistics.
Borgelt, C. (2006). Canonical forms for frequent graph mining. In R. Decker & H.J. Lenz (Eds.), GfKl, springer, studies in classification, data analysis, and knowledge organization (pp. 337–349).
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J. (2000). LOF: identifying density-based local outliers. In W. Chen, J.F. Naughton, P.A. Bernstein (Eds.), SIGMOD conference (pp. 93–104). ACM.
Chandola, V., Banerjee, A., Kumar, V. (2009). Anomaly detection: a survey. ACM Computing Surveys, 41(3), 15:1–15:58. doi:10.1145/1541880.1541882.
Cook, D.J., & Holder, L.B. (1994). Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1(1), 231–255.
Cook, D.J., & Holder, L.B. (2000). Graph-based data mining. IEEE Intelligent Systems, 15, 32–41.
Davis, M., Liu, W., Miller, P., Redpath, G. (2011). Detecting anomalies in graphs with numeric labels. In C. Macdonald, I. Ounis, I. Ruthven (Eds.), Proceedings of the 20th ACM conference on information and knowledge management (CIKM 2011) (pp. 1197–1202). ACM.
de Vries, T., Chawla, S., Houle, M. (2010). Finding local anomalies in very high dimensional space. In ICDM 2010 (pp. 128–137). IEEE.
Eberle, W., & Holder, L. (2011). Compression versus frequency for mining patterns and anomalies in graphs. In Ninth workshop on mining and learning with graphs (MLG 2011), SIGKDD, at the 17th ACM SIGKDD conference on knowledge discovery and data mining (KDD 2011). San Diego.
Eichinger, F., Böhm, K., Huber, M. (2008). Mining edge-weighted call graphs to localise software bugs. In W. Daelemans, B. Goethals, K. Morik (Eds.), ECML/PKDD (1), lecture notes in computer science (Vol. 5211, pp. 333–348). Springer.
Eichinger, F., Huber, M., Böhm, K. (2010). On the usefulness of weight-based constraints in frequent subgraph mining. In ICAI 2010, BCS SGAI (pp. 65–78). doi:10.1007/978-0-85729-130-1_5.
Fortin, S. (1996). The graph isomorphism problem. Tech. rep., Univ. of Alberta.
Fowler, J. H., & Christakis, N.A. (2008). Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the Framingham Heart Study. BMJ, 337, doi:10.1136/bmj.a2338.
Günnemann, S., Boden, B., Seidl, T. (2011). DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors. In Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, ECML PKDD ’11 (Part I, pp. 565–580). Springer.
Huan, J., Wang, W., Prins, J., Yang, J. (2004). SPIN: mining maximal frequent subgraphs from graph databases. In KDD 2004 (pp. 581–586). ACM.
Inokuchi, A., Washio, T., Motoda, H. (2000). An Apriori-based algorithm for mining frequent substructures from graph data. In PKDD 2000 (pp. 13–23). Springer.
Janssens, J., Flesch, I., Postma, E. (2009). Outlier detection with one-class classifiers from ML and KDD. In International conference on machine learning and applications (ICMLA ’09) (pp. 147–153). doi:10.1109/ICMLA.2009.16.
Jiang, C., Coenen, F., Zito, M. (2010). Frequent sub-graph mining on edge weighted graphs. In Proceedings of the 12th international conference on data warehousing and knowledge discovery, DaWaK’10 (pp. 77–88). Berlin: Springer.
Jin, W., Tung, A.K.H., Han, J. (2001). Mining top-n local outliers in large databases. In D. Lee, M. Schkolnick, F.J. Provost, R. Srikant (Eds.), KDD (pp. 293–298). ACM.
Jin, W., Tung, A.K.H., Han, J., Wang, W. (2006). Ranking outliers using symmetric neighborhood relationship. In W.K. Ng, M. Kitsuregawa, J. Li, K. Chang (Eds.), PAKDD, lecture notes in computer science (Vol. 3918, pp. 577–593). Springer.
Kim, M., & Leskovec, J. (2011). Modeling social networks with node attributes using the multiplicative attribute graph model. In F.G. Cozman & A. Pfeffer (Eds.), UAI (pp. 400–409). AUAI Press.
Kim, M., & Leskovec, J. (2012). Multiplicative attribute graph model of real-world networks. Internet Mathematics, 8(1–2), 113–160.
Klimt, B., & Yang, Y. (2004). The Enron corpus: a new dataset for email classification research. In J.F. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.), ECML, lecture notes in computer science (Vol. 3201, pp. 217–226). Springer.
Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A. (2009). Outlier detection in axis-parallel subspaces of high dimensional data. In T. Theeramunkong, B. Kijsirikul, N. Cercone, T.B. Ho (Eds.), PAKDD, lecture notes in computer science (Vol. 5476, pp. 831–838). Springer.
Kriegel, H.P., Kröger, P., Schubert, E., Zimek, A. (2011). Interpreting and unifying outlier scores. In SDM (pp. 13–24). SIAM/Omnipress.
Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. In ICDM 2001 (pp. 313–320). IEEE.
McLean, B., & Elkind, P. (2003). The smartest guys in the room: the amazing rise and scandalous fall of Enron. USA: Penguin Group.
Moser, F., Colak, R., Rafiey, A., Ester, M. (2009). Mining cohesive patterns from graphs with feature vectors. In SDM (pp. 593–604). SIAM.
Newman, M. (2010). Networks: an introduction. New York: Oxford University Press, Inc.
Nijssen, S., & Kok, J.N. (2004). A quickstart in frequent structure mining can make a difference. In W. Kim, R. Kohavi, J. Gehrke, W. DuMouchel (Eds.), KDD (pp. 647–652). ACM.
Palacio, M.A.P. (2005). Spatial data modeling and mining using a graph-based representation. PhD thesis, Department of Computer Systems Engineering. Puebla: University of the Americas.
Papadimitriou, S., Kitagawa, H., Gibbons, P.B., Faloutsos, C. (2003). LOCI: fast outlier detection using the local correlation integral. In U. Dayal, K. Ramamritham, T.M. Vijayaraman (Eds.), ICDE (pp. 315–326). IEEE Computer Society.
Schubert, E., Zimek, A., Kriegel, H.P. (2012). Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery. doi:10.1007/s10618-012-0300-z.
Tang, J., Chen, Z., Fu, A.W.C., Cheung, D.W.L. (2002). Enhancing effectiveness of outlier detections for low density patterns. In M.S. Cheng, P.S. Yu, B. Liu (Eds.), PAKDD, lecture notes in computer science (Vol. 2336, pp. 535–548). Springer.
Tenenbaum, J.B., de Silva, V., Langford, J.C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323. doi:10.1126/science.290.5500.2319.
Wang, C., Zhu, Y., Wu, T., Wang, W., Shi, B. (2005). Constraint-based graph mining in large database. In Y. Zhang, K. Tanaka, J.X. Yu, S. Wang, M. Li (Eds.), APWeb, lecture notes in computer science (Vol. 3399, pp. 133–144). Springer.
Wörlein, M., Meinl, T., Fischer, I., Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, J. Gama (Eds.), PKDD, lecture notes in computer science (Vol. 3721, pp. 392–403). Springer.
Yan, X., & Han, J. (2002). gSpan: graph-based substructure pattern mining. In ICDM 2002 (pp. 721–724). IEEE.
Yan, X., & Han, J. (2003). CloseGraph: mining closed frequent graph patterns. In KDD 2003 (pp. 286–295). ACM.
Zimek, A., Schubert, E., Kriegel, H.P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363–387.
Acknowledgments
We would like to thank Erich Schubert at Ludwig-Maximilians Universität München for assistance with verifying our LOF implementation and providing us with the RP + PINN + LOF implementation ahead of its official release in ELKI.
Appendix
The substructure discovery algorithms referred to throughout the paper are included here for convenience. The original references are Yan and Han (2002) for gSpan and Cook and Holder (2000) for Subdue.
1.1 A gSpan
To apply our method to gSpan, we calculate numeric anomalies for all vertices and edges by Definition 8 as a pre-processing step, then amend step 2 to “Remove infrequent and anomalous vertices and edges”.
Algorithm 1 GraphSet_Projection: search for frequent substructures
Require: Graph Transaction Database \(\mathbb {D}\), minSup
1: Sort the labels in \(\mathbb {D}\) by their frequency
2: Remove infrequent vertices and edges
3: Relabel the remaining vertices and edges
4: \(\mathbb {S}^{1} \leftarrow \) all frequent 1-edge graphs in \(\mathbb {D}\)
5: Sort \(\mathbb {S}^{1}\) in DFS lexicographic order
6: \(\mathbb {S} \leftarrow \mathbb {S}^{1}\)
7: for all edge \(e \in \mathbb {S}^{1}\) do
8:   Initialise s with e; set s.D to the graphs which contain e
9:   Subgraph_Mining(\(\mathbb {D}\), \(\mathbb {S}\), s)
10:  \(\mathbb {D} \leftarrow \mathbb {D} - e\)
11:  if \(|\mathbb {D}| < minSup\) then
12:    break
13: return Discovered Subgraphs \(\mathbb {S}\)
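The amended step 2 can be sketched as a stand-alone filter over the transaction database. This is an illustrative vertex-only sketch, not the paper's implementation: the data layout, the `scores` map (standing in for the Definition 8 anomaly scores, e.g. LOF over each vertex's numeric attributes) and the `threshold` parameter are all hypothetical.

```python
from collections import Counter

def prune(graphs, min_sup, scores, threshold):
    """Remove infrequent AND anomalous vertices before mining.
    Each graph is a dict {vertex_id: label}; `scores` maps
    (graph_index, vertex_id) -> numeric anomaly score. A label is
    frequent if it occurs in at least `min_sup` graphs. Edges incident
    to a removed vertex are dropped implicitly in this sketch."""
    support = Counter()
    for g in graphs:
        support.update(set(g.values()))  # count each label once per graph
    return [{v: lab for v, lab in g.items()
             if support[lab] >= min_sup
             and scores.get((i, v), 0.0) <= threshold}
            for i, g in enumerate(graphs)]
```

Pruning anomalous vertices before line 4 of Algorithm 1 shrinks \(\mathbb {S}^{1}\), so the saving propagates through every recursive call of Subgraph_Mining.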
1.2 B Subdue
To apply our method to Subdue, calculate numeric anomalies for all vertices and edges by Definition 8 as a pre-processing step (as above). Prune all anomalous vertices and edges from the graph before step 5.
Algorithm 2 Subdue: search for frequent substructures
Require: Graph, BeamWidth, MaxBest, MaxSubSize, Limit
1: let ParentList = {}
2: let ChildList = {}
3: let BestList = {}
4: let ProcessedSubs = 0
5: Create a substructure from each unique vertex label and its single-vertex instances; insert the resulting substructures in ParentList
6: while ProcessedSubs ≤ Limit and ParentList is not empty do
7:   while ParentList is not empty do
8:     let Parent = RemoveHead(ParentList)
9:     Extend each instance of Parent in all possible ways
10:    Group the extended instances into Child substructures
11:    for all Child do
12:      if SizeOf(Child) ≤ MaxSubSize then
13:        Evaluate the Child
14:        Insert Child in ChildList in order by value
15:        if Length(ChildList) > BeamWidth then
16:          Destroy the substructure at the end of ChildList
17:    let ProcessedSubs = ProcessedSubs + 1
18:    Insert Parent in BestList in order by value
19:    if Length(BestList) > MaxBest then
20:      Destroy the substructure at the end of BestList
21:  Switch ParentList and ChildList
22: return BestList
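Both appendices rely on a per-vertex numeric anomaly score; the acknowledgments note that the authors' experiments use LOF (Breunig et al. 2000). Definition 8 is not reproduced in this excerpt, so the following is a generic brute-force LOF sketch (quadratic in the number of points, and assuming no duplicate points, which would make the local reachability density infinite):

```python
def lof(points, k):
    """Local Outlier Factor (Breunig et al. 2000), brute force.
    Returns one score per point; scores near 1 indicate inliers,
    substantially larger scores indicate local outliers."""
    n = len(points)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    D = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # k nearest neighbours of each point (excluding the point itself).
    knn = [sorted(range(n), key=lambda j: D[i][j])[1:k + 1]
           for i in range(n)]
    kdist = [D[i][knn[i][-1]] for i in range(n)]
    # Reachability distance smooths D by the neighbour's k-distance.
    reach = lambda i, j: max(kdist[j], D[i][j])
    lrd = [k / sum(reach(i, j) for j in knn[i]) for i in range(n)]
    # LOF: average ratio of neighbours' density to the point's own.
    return [sum(lrd[j] for j in knn[i]) / (k * lrd[i]) for i in range(n)]
```

In the pre-processing step, any vertex whose score exceeds a chosen cut-off is treated as anomalous and pruned before line 5 of Algorithm 2.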
Cite this article
Davis, M., Liu, W. & Miller, P. Finding the most descriptive substructures in graphs with discrete and numeric labels. J Intell Inf Syst 42, 307–332 (2014). https://doi.org/10.1007/s10844-013-0299-7