
Patterns and anomalies in k-cores of real-world graphs with applications

  • Regular Paper
Knowledge and Information Systems

Abstract

What do the k-core structures of real-world graphs look like? What are the common patterns and the anomalies? How can we exploit them for applications? A k-core is the maximal subgraph in which all vertices have degree at least k. This concept has been applied to such diverse areas as hierarchical structure analysis, graph visualization, and graph clustering. Here, we explore pervasive patterns that are related to k-cores and emerge in graphs from diverse domains. Our discoveries are: (1) Mirror Pattern: coreness (i.e., the maximum k such that a vertex belongs to the k-core) is strongly correlated with degree; (2) Core-Triangle Pattern: degeneracy (i.e., the maximum k such that the k-core exists) obeys a 3-to-1 power law with respect to the count of triangles; (3) Structured Core Pattern: degeneracy–cores are not cliques but have non-trivial structures such as core–periphery and communities. Our algorithmic contributions show the usefulness of these patterns: (1) Core-A, which measures the deviation from the Mirror Pattern, successfully spots anomalies in real-world graphs; (2) Core-D, a single-pass streaming algorithm based on the Core-Triangle Pattern, accurately estimates degeneracy up to 12\(\times\) faster than its competitor; (3) Core-S, inspired by the Structured Core Pattern, identifies influential spreaders up to 17\(\times\) faster than its competitors with comparable accuracy.
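The definitions of k-core, coreness, and degeneracy given above can be made concrete with a small sketch. The Python code below is an illustration only, not the authors' implementation: it computes coreness by repeatedly peeling a vertex of minimum remaining degree, and the function names `coreness` and `degeneracy` are hypothetical. It favors readability over speed; reference [8] describes an O(m) algorithm for the same task.

```python
from collections import defaultdict


def coreness(edges):
    """Return {vertex: coreness} for an undirected simple graph.

    `edges` is an iterable of (u, v) pairs. Simple quadratic-time peeling,
    written for clarity rather than efficiency (an assumption of this sketch).
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel a vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        # The running maximum of peeled degrees is the coreness of v.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


def degeneracy(edges):
    """Largest k for which the k-core is non-empty (0 for an empty graph)."""
    return max(coreness(edges).values(), default=0)


# Toy graph: a 4-clique {a, b, c, d} plus a pendant vertex e attached to a.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"),
             ("b", "c"), ("b", "d"), ("c", "d"), ("a", "e")]
toy_core = coreness(toy_edges)
toy_degeneracy = degeneracy(toy_edges)
```

In the toy graph, vertex a has degree 4 but coreness 3, which shows that degree and coreness can differ for a vertex even though the Mirror Pattern says they are strongly correlated overall.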



Notes

  1. This paper is an extended version of [49].

  2. http://konect.uni-koblenz.de/networks/petster-friendships-hamster.

  3. http://konect.uni-koblenz.de/networks/petster-friendships-cat.

  4. http://konect.uni-koblenz.de/networks/friendster.

  5. http://www.caida.org/data/as-relationships.

  6. http://www.caida.org/tools/measurements/skitter.

  7. Spearman’s rank correlation coefficient \(\rho\) [52] is the standard (Pearson) correlation coefficient r computed on ranks. Here, \(\rho\) is r between the ranks of vertices by degree and their ranks by coreness. \(\rho\) is known to be more robust to outlying values than r. We ignored isolated vertices when computing \(\rho\).

  8. The fractional rank of an item is one plus the number of items greater than it plus half the number of items equal to it.

  9. Strength of core–periphery structure. The correlation between the adjacency matrix of the measured graph and that of a graph with perfect core–periphery structure. See [10] for details.

  10. Strength of community structure. The fraction of the edges within communities minus such fraction expected in a randomly connected graph. See [39] for details.

  11. Isolated vertices are ignored when we compute Spearman’s rank correlation coefficient \(\rho \).

  12. We used a machine with 2.67 GHz Intel Xeon E7-8837 CPUs and 1 TB of RAM.
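Notes 7 and 8 together describe how \(\rho\) is obtained: compute fractional ranks, then take the Pearson correlation of those ranks. A minimal Python sketch follows. The function names are hypothetical, and it reads "items equal to it" in Note 8 as excluding the item itself, which is an assumption (including it would shift every rank by the same constant and leave \(\rho\) unchanged).

```python
from math import sqrt


def fractional_ranks(values):
    """Fractional rank: 1 + (#items greater) + half of (#other items equal)."""
    ranks = []
    for x in values:
        greater = sum(1 for y in values if y > x)
        equal_others = sum(1 for y in values if y == x) - 1  # exclude x itself
        ranks.append(1 + greater + equal_others / 2)
    return ranks


def pearson(xs, ys):
    """Standard Pearson correlation coefficient r.

    Raises ZeroDivisionError if either input is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson's r between the fractional ranks."""
    return pearson(fractional_ranks(xs), fractional_ranks(ys))


# Hypothetical degrees and corenesses of five non-isolated vertices.
example_degree = [4, 3, 3, 3, 1]
example_coreness = [3, 3, 3, 3, 1]
example_rho = spearman_rho(example_degree, example_coreness)
```

To follow Notes 7 and 11, isolated vertices (degree 0) would be filtered out before calling `spearman_rho`.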

References

  1. Abello J, Resende MG, Sudarsky S (2002) Massive quasi-clique detection. In: Latin American symposium on theoretical informatics, Springer, pp 598–612

  2. Akoglu L, McGlohon M, Faloutsos C (2010) Oddball: spotting anomalies in weighted graphs. In: Pacific–Asia conference on knowledge discovery and data mining, Springer, pp 410–421

  3. Akoglu L, Tong H, Koutra D (2015) Graph based anomaly detection and description: a survey. Data Min Knowl Discov 29(3):626–688


  4. Albert R, Jeong H, Barabási AL (1999) Internet: diameter of the world-wide web. Nature 401(6749):130–131

  5. Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2006) Large scale networks fingerprinting and visualization using the \(k\)-core decomposition. Adv Neural Inf Process Syst 18:41


  6. Alvarez-Hamelin JI, Dall’Asta L, Barrat A, Vespignani A (2008) \(K\)-core decomposition of Internet graphs: hierarchies, self-similarity and measurement biases. Netw Heterog Media 3:371


  7. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform 4(1):2


  8. Batagelj V, Zaversnik M (2003) An O(m) algorithm for cores decomposition of networks. arXiv:cs/0310049

  9. Beutel A, Xu W, Guruswami V, Palow C, Faloutsos C (2013) Copycatch: stopping group attacks by spotting lockstep behavior in social networks. In: Proceedings of the 22nd international conference on world wide web, ACM, pp 119–130

  10. Borgatti SP, Everett MG (2000) Models of core/periphery structures. Soc Netw 21(4):375–395


  11. Bron C, Kerbosch J (1973) Algorithm 457: finding all cliques of an undirected graph. Commun ACM 16(9):575–577


  12. Brouwer AE, Haemers WH (2001) Spectra of graphs. Springer, Berlin


  13. Charikar M (2000) Greedy approximation algorithms for finding dense components in a graph. In: International Workshop on approximation algorithms for combinatorial optimization, Springer, pp 84–95

  14. Cheng J, Ke Y, Chu S, Özsu MT (2011) Efficient core decomposition in massive networks. In: 2011 IEEE 27th international conference on data engineering, IEEE, pp 51–62

  15. Cohen J (2008) Trusses: cohesive subgraphs for social network analysis. In: National security agency technical report, p 16

  16. Davis J, Goadrich M (2006) The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on machine learning, ACM, pp 233–240

  17. De Stefani L, Epasto A, Riondato M, Upfal E (2016) TRIÈST: counting local and global triangles in fully-dynamic streams with fixed memory size. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 825–834

  18. Erdős P (1963) On the structure of linear graphs. Israel J Math 1(3):156–160

  19. Farach-Colton M, Tsai MT (2014) Computing the degeneracy of large graphs. In: Latin American symposium on theoretical informatics, Springer, pp 250–260

  20. Freuder EC (1982) A sufficient condition for backtrack-free search. J ACM (JACM) 29(1):24–32


  21. Gehrke J, Ginsparg P, Kleinberg J (2003) Overview of the 2003 KDD cup. ACM SIGKDD Explor Newslett 5(2):149–151


  22. Giatsidis C, Malliaros F, Thilikos DM, Vazirgiannis M (2014) Corecluster: a degeneracy based graph clustering framework. In: Twenty-sixth annual conference on innovative applications of artificial intelligence, AAAI, pp 29–31

  23. Hall BH, Jaffe AB, Trajtenberg M (2001) The NBER patent citation data file: lessons, insights and methodological tools. doi:10.3386/w8498

  24. Hooi B, Song HA, Beutel A, Shah N, Shin K, Faloutsos C (2016a) Fraudar: bounding graph fraud in the face of camouflage. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 895–904

  25. Hooi B, Song HA, Papalexakis E, Agrawal R, Faloutsos C (2016b) Matrices, compression, learning curves: formulation, and the GROUPNTEACH algorithms. In: Pacific–Asia conference on knowledge discovery and data mining, Springer, pp 376–387

  26. Huang X, Lu W, Lakshmanan LV (2016) Truss decomposition of probabilistic graphs: semantics and algorithms. In: Proceedings of the 2016 ACM SIGMOD international conference on management of data, ACM, pp 77–90

  27. Jiang M, Beutel A, Cui P, Hooi B, Yang S, Faloutsos C (2015) A general suspiciousness metric for dense blocks in multimodal data. In: 2015 IEEE international conference on data mining, IEEE, pp 781–786

  28. Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 137–146

  29. Kitsak M, Gallos LK, Havlin S, Liljeros F, Muchnik L, Stanley HE, Makse HA (2010) Identification of influential spreaders in complex networks. Nat Phys 6(11):888–893


  30. Klimt B, Yang Y (2004) The enron corpus: a new dataset for email classification research. In: European conference on machine learning, Springer, pp 217–226

  31. Kwak H, Lee C, Park H, Moon S (2010) What is twitter, a social network or a news media?. In: Proceedings of the 19th international conference on world wide web, ACM, pp 591–600

  32. Leskovec J, Chakrabarti D, Kleinberg J, Faloutsos C (2005) Realistic mathematically tractable graph generation and evolution, using kronecker multiplication. In: European conference on principles of data mining and knowledge discovery, Springer, pp 133–145

  33. Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2009) Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math 6(1):29–123


  34. Lim Y, Kang U (2015) Mascot: memory-efficient and accurate sampling for counting local triangles in graph streams. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 685–694

  35. Luce RD (1950) Connectivity and generalized cliques in sociometric group structure. Psychometrika 15(2):169–90


  36. Macdonald B, Shakarian P, Howard N, Moores G (2012) Spreaders in the network sir model: an empirical study. arXiv preprint arXiv:1208.4269

  37. Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on internet measurement, ACM, pp 29–42

  38. Mokken RJ (1979) Cliques, clubs and clans. Qual Quant 13(2):161–173


  39. Newman ME (2006) Modularity and community structure in networks. Proc Nat Acad Sci 103(23):8577–8582


  40. Pandit S, Chau DH, Wang S, Faloutsos C (2007) Netprobe: a fast and scalable system for fraud detection in online auction networks. In: Proceedings of the 16th international conference on world wide web, ACM, pp 201–210

  41. Prakash BA, Sridharan A, Seshadri M, Machiraju S, Faloutsos C (2010) Eigenspokes: surprising patterns and scalable community chipping in large graphs. In: Pacific–Asia conference on knowledge discovery and data mining, Springer, pp 435–448

  42. Rossi MEG, Malliaros FD, Vazirgiannis M (2015) Spread it good, spread it fast: identification of influential nodes in social networks. In: Proceedings of the 24th international conference on world wide web (companion volume), ACM, pp 101–102

  43. Sarıyüce AE, Gedik B, Jacques-Silva G, Wu KL, Çatalyürek ÜV (2013) Streaming algorithms for \(k\)-core decomposition. Proc VLDB Endow 6(6):433–444

  44. Sarıyüce AE, Seshadhri C, Pinar A, Çatalyürek ÜV (2015) Finding the hierarchy of dense subgraphs using nucleus decompositions. In: Proceedings of the 24th international conference on world wide web, ACM, pp 927–937

  45. Schank T (2007) Algorithmic aspects of triangle-based network analysis. Ph.D. thesis, Universität Karlsruhe (TH), Fakultät für Informatik

  46. Seidman SB, Foster BL (1978) A graph theoretic generalization of the clique concept. J Math Sociol 6(1):139–154


  47. Seidman SB (1983) Network structure and minimum degree. Soc Netw 5(3):269–287


  48. Shin K, Hooi B, Faloutsos C (2016a) M-zoom: fast dense-block detection in tensors with quality guarantees. In: Joint European conference on machine learning and knowledge discovery in databases, Springer, pp 264–280

  49. Shin K, Eliassi-Rad T, Faloutsos C (2016b) Corescope: graph mining using \(k\)-core analysis—patterns, anomalies and algorithms. In: 2016 16th IEEE international conference on data mining, IEEE, pp 469–478

  50. Shin K, Hooi B, Jisu K, Faloutsos C (2017a) D-cube: dense-block detection in terabyte-scale tensors. In: Proceedings of the Tenth ACM international conference on web search and data mining, ACM, pp 681–690

  51. Shin K, Hooi B, Jisu K, Faloutsos C (2017b) Densealert: incremental dense-subtensor detection in tensor streams. arXiv preprint arXiv:1706.03374

  52. Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101


  53. Tsourakakis CE (2008) Fast counting of triangles in large real networks without counting: algorithms and laws. In: 2008 eighth IEEE international conference on data mining, IEEE, pp 608–617

  54. Tsourakakis CE, Kang U, Miller GL, Faloutsos C (2009) Doulion: counting triangles in massive graphs with a coin. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 837–846

  55. Van Loan CF (2000) The ubiquitous kronecker product. J Comput Appl Math 123(1):85–100


  56. Wang J, Cheng J (2012) Truss decomposition in massive networks. Proc VLDB Endow 5(9):812–823


  57. Wuchty S, Almaas E (2005) Peeling the yeast protein network. Proteomics 5(2):444–449


  58. Zhang S, Zhou D, Yildirim MY, Alcorn S, He J, Davulcu H, Tong H (2017) HiDDen: hierarchical dense subgraph detection with application to financial fraud detection. In: Proceedings of the 2017 SIAM international conference on data mining, SIAM, pp 570–578


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant Nos. CNS-1314632 and IIS-1408924. Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. Kijung Shin was supported by KFAS Scholarship. Tina Eliassi-Rad was supported by NSF CNS-1314603 and by DTRA HDTRA1-10-1-0120. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

Author information


Correspondence to Kijung Shin.

Appendices

Appendix A: Interpreting sparsity patterns

We explain sparsity patterns and how to interpret them. The sparsity pattern of a graph is a plot with the axes representing the rows and columns of the adjacency matrix. For each nonzero entry (i.e., edge in the graph), a point is plotted, thus displaying sparsity patterns in the adjacency matrix.

Figure 19a shows the sparsity pattern of the degeneracy–core of the Caida dataset. The rows in the plot correspond to vertices and are divided into two ranges, one for the core and one for the periphery. The vertices in the core are densely connected with each other, as seen in region A in Fig. 19b. The vertices in the periphery are well connected to the vertices in the core (regions B and C) but rarely connected to each other (region D). The vertices in the core are further divided into three communities, each of which corresponds to a range of the columns in Fig. 19a. The vertices in the same community are particularly well connected to each other, as seen in regions A1, A2, and A3 in Fig. 19c, which correspond to the sparsity patterns of the communities.

Fig. 19 Example sparsity pattern and its interpretation. a Sparsity pattern, b core–periphery in the sparsity pattern, c communities in the sparsity pattern
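As a rough illustration of how such a plot is built, the Python sketch below renders the adjacency matrix of a tiny, made-up graph as text, marking each nonzero entry with `#`. The function name `sparsity_pattern`, the toy edges, and the vertex ordering are all assumptions for illustration; they are not taken from the Caida dataset.

```python
def sparsity_pattern(edges, order):
    """Return the adjacency matrix as text rows, with vertices arranged in `order`.

    '#' marks a nonzero entry (an edge) and '.' marks a zero entry.
    The graph is treated as undirected, so the pattern is symmetric.
    """
    index = {v: i for i, v in enumerate(order)}
    n = len(order)
    grid = [["." for _ in range(n)] for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        grid[i][j] = "#"
        grid[j][i] = "#"
    return ["".join(row) for row in grid]


# Hypothetical toy graph: core {a, b, c} forms a triangle; periphery {d, e}
# attaches to the core only, never to each other.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("d", "a"), ("d", "b"), ("e", "a"), ("e", "c")]
toy_rows = sparsity_pattern(toy_edges, order=["a", "b", "c", "d", "e"])
print("\n".join(toy_rows))
```

With core vertices listed first, the top-left block plays the role of region A, the off-diagonal blocks play the roles of regions B and C, and the empty bottom-right block plays the role of region D.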

Appendix B: Core-D with a small number of samples

Figure 20 presents the accuracy of Core-D with different sample sizes on the two largest datasets. Even when the number of samples was smaller than the number of vertices, Core-D, especially the Overall Model, estimated degeneracy accurately and reliably. Thus, Core-D remains effective even when the available memory is smaller than n, the number of vertices.

Fig. 20 Core-D is nimble and accurate. Points and error bars represent the average accuracy and ± one standard deviation over ten runs, respectively. Core-D reliably estimates degeneracy even when the number of samples is smaller than the number of vertices. a Friendster, b Twitter

Appendix C: Measuring influence using SIR model simulation

To evaluate the influence of a vertex as a spreader, we simulate spreading processes using the SIR model [29], a widely used epidemic model. Initially, a vertex chosen as the seed is in the infectious state (I-state), while all other vertices are in the susceptible state (S-state). Each vertex in the I-state infects each of its neighbors in the S-state with probability \(\beta\) (the infection rate) and then enters the recovered state (R-state). This is repeated until no vertex is in the I-state. The influence of a seed is quantified by the number of vertices infected at any time during the process. To reduce random effects, we repeat the whole process 100 times and use the average number of infected vertices as the measure of influence. \(\beta\) is set close to the epidemic threshold \(\lambda_{1}^{-1}\), as in previous work [42].
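The procedure above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code. It assumes synchronous rounds (all currently infectious vertices act, then recover together), counts the seed itself among the infected vertices, and uses the hypothetical names `sir_spread` and `influence`. How \(\beta\) is chosen is left to the caller.

```python
import random


def sir_spread(adj, seed, beta, rng):
    """Run one SIR process; return the number of vertices ever infected.

    `adj` maps every vertex to an iterable of its neighbors.
    Assumption: the seed counts as infected, and rounds are synchronous.
    """
    infectious = [seed]
    ever_infected = {seed}  # vertices in the I-state or R-state
    while infectious:
        newly_infected = []
        for v in infectious:
            for w in adj[v]:
                # Only susceptible neighbors can be infected.
                if w not in ever_infected and rng.random() < beta:
                    ever_infected.add(w)
                    newly_infected.append(w)
        # Every vertex that was infectious this round now recovers.
        infectious = newly_infected
    return len(ever_infected)


def influence(adj, seed, beta, runs=100, rng=None):
    """Average number of infected vertices over `runs` independent simulations."""
    rng = rng or random.Random(0)  # fixed default seed for reproducibility
    return sum(sir_spread(adj, seed, beta, rng) for _ in range(runs)) / runs


# Hypothetical toy graph: a path 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
always = influence(path, seed=0, beta=1.0)  # every contact transmits
never = influence(path, seed=0, beta=0.0)   # no contact transmits
typical = influence(path, seed=1, beta=0.5)
```

The two extreme values of \(\beta\) serve as a sanity check: with \(\beta = 1\) the whole connected component of the seed is infected, and with \(\beta = 0\) only the seed is.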


Cite this article

Shin, K., Eliassi-Rad, T. & Faloutsos, C. Patterns and anomalies in k-cores of real-world graphs with applications. Knowl Inf Syst 54, 677–710 (2018). https://doi.org/10.1007/s10115-017-1077-6


  • DOI: https://doi.org/10.1007/s10115-017-1077-6
