Encyclopedia of Big Data Technologies

Living Edition
Editors: Sherif Sakr, Albert Zomaya

Link Analytics in Graphs

  • Peixiang Zhao
Living reference work entry
DOI: https://doi.org/10.1007/978-3-319-63962-8_320-1

Synonyms

Definitions

Link analytics is a set of specialized data analysis and graph mining techniques that discover, examine, and evaluate the relationships or interlinked structures of graphs.

Overview

Graph-structured data are ubiquitous (Aggarwal and Wang 2010; Cook and Holder 2006). A graph consists of vertices (or nodes), which represent physical, technological, conceptual, and societal entities or objects, and edges (or links), which capture connections, relationships, or dependencies between vertices in application-specific ways. Noteworthy examples of graphs and networked data include the World Wide Web, where webpages are vertices and hyperlinks are edges (Kleinberg et al. 1999), and social networks, where individuals are vertices and friendship relations are edges (Pitas 2015). In response to the growing popularity and wide applicability of graphs, a proliferation of link analysis techniques has emerged, focusing primarily on the modeling, quantification, mining, and evaluation of potentially useful, structure-enriched information from graphs, including, but not limited to, link-based ranking and link prediction.

Key Research Findings

Link-Based Ranking

Given a graph G, link-based ranking methods aim to rank the vertices of G by their structural significance, which is modeled and quantified primarily from the edges of G. Specifically, each vertex v of G is associated with (relative) quantitative assessment values indicating the significance of v throughout G. Representative link-based ranking methods include PageRank (Brin and Page 1998), Personalized PageRank (Jeh and Widom 2003; Page et al. 1998), HITS (Kleinberg 1999), and SimRank (Jeh and Widom 2002).

PageRank

PageRank is a link-based ranking algorithm whose objective is to assign a numerical score, called the PageRank score, to each vertex by exploiting the interlinked structure of the graph G. The PageRank score of a vertex v can be regarded as a “vote”, cast by all the other vertices of G, on how important v is: a link to v counts as a vote of support for v. A vertex with a high PageRank score is usually considered more “important” or more “influential” than a vertex with a low PageRank score. In principle, to compute the PageRank score of a vertex v, denoted as P(v), we consider all the vertices u that link to v, i.e., for each such u, there exists an edge (u, v) in the graph G. If deg(u) denotes the out-degree of u, that is, the number of its outgoing edges, then u contributes a \(\frac {1}{deg(u)}\) fraction of its PageRank score to that of v:
$$\displaystyle \begin{aligned} P(v) = \sum_{u:(u, v) \in G} \frac{P(u)}{deg(u)} \end{aligned} $$
(1)
Vertices with no outbound links need special treatment: their PageRank scores can be regarded as being divided evenly among all the other vertices of G. In addition, a damping factor d is introduced, and the PageRank formulation is refined as
$$\displaystyle \begin{aligned} P(v) = \frac{1-d}{N} + d \sum_{u:(u, v) \in G} \frac{P(u)}{deg(u)} \end{aligned} $$
(2)
where N is the number of vertices of G.

Under the random surfer model (Chebolu and Melsted 2008), PageRank is the stationary distribution of a random walk that starts from a random vertex of G and, in each step, either jumps to a uniformly random vertex with a predefined probability p = 1 − d or follows a random outgoing edge of the current vertex with probability 1 − p = d. PageRank scores can be approximated to a high degree of accuracy by the power method (Bahmani et al. 2010; Berkhin 2005). It has been reported that the PageRank algorithm, once run on a graph of 322 million edges, converged to within a tolerable limit in just 52 iterations (Brin and Page 1998).
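As an illustration of the power method applied to Eq. (2), the following is a minimal sketch rather than a reference implementation. The function name `pagerank`, the dictionary-based graph encoding `out_edges`, and the default d = 0.85 are assumptions made for this example. The score of a vertex without outbound links is spread uniformly over all N vertices, which is one common convention.

```python
def pagerank(out_edges, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration for Eq. (2) on a directed graph.

    out_edges maps each vertex to the list of vertices it links to
    (hypothetical encoding chosen for this sketch).
    """
    vertices = sorted(set(out_edges) | {v for ts in out_edges.values() for v in ts})
    n = len(vertices)
    score = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Score held by vertices without outbound links is spread evenly.
        dangling = sum(score[u] for u in vertices if not out_edges.get(u))
        new = {v: (1.0 - d) / n + d * dangling / n for v in vertices}
        for u in vertices:
            targets = out_edges.get(u, [])
            for v in targets:
                # u passes a 1/deg(u) share of its score along each out-edge.
                new[v] += d * score[u] / len(targets)
        delta = sum(abs(new[v] - score[v]) for v in vertices)
        score = new
        if delta < tol:
            break
    return score


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for vertex in sorted(ranks, key=ranks.get, reverse=True):
    print(vertex, round(ranks[vertex], 4))
```

In this toy graph, vertex d has no incoming edges, so its score stays at the baseline (1 − d)/N, while c, which is linked to by three vertices, ranks first.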

Personalized PageRank

Personalized PageRank is the personalized, egocentric version of the PageRank algorithm (Lofgren et al. 2014). Given a graph G and a starting vertex v, Personalized PageRank assigns a score to every other vertex u of G. This score models how much v is interested in u or how much v trusts u. From the random-walk perspective, Personalized PageRank is almost the same as PageRank, except that all the random jumps, made with a predefined probability p, lead to the starting vertex v for which we are personalizing the PageRank, as opposed to an arbitrary vertex of the graph G.

The exact personalized PageRank scores for all vertices of G with respect to a particular source vertex v can be computed by the power method (Maehara et al. 2014; Page et al. 1998), which involves costly matrix multiplication operations. Furthermore, materializing the personalized PageRank scores with respect to every possible source vertex v of G is clearly infeasible for large graphs. As a result, most existing methods have focused on approximate computation of the large personalized PageRank scores with accuracy guarantees (Wang et al. 2016; Fujiwara et al. 2013).
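The exact computation for a single source vertex can be sketched as a small variation of the PageRank power iteration in which every random jump returns to the source. The function name `personalized_pagerank`, the graph encoding, and the rule that a walk stuck at a vertex without outbound links also returns to the source are assumptions of this sketch; it is not one of the approximate methods cited above.

```python
def personalized_pagerank(out_edges, source, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration in which all random jumps return to `source`.

    out_edges maps each vertex to the list of vertices it links to
    (hypothetical encoding chosen for this sketch).
    """
    vertices = set(out_edges) | {v for ts in out_edges.values() for v in ts}
    if source not in vertices:
        raise ValueError("source vertex is not in the graph")
    score = {v: 0.0 for v in vertices}
    score[source] = 1.0
    for _ in range(max_iter):
        new = {v: 0.0 for v in vertices}
        for u in vertices:
            targets = out_edges.get(u, [])
            if targets:
                share = d * score[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Assumption: a walk with nowhere to go restarts at the source.
                new[source] += d * score[u]
        # With probability 1 - d the walk jumps back to the source.
        new[source] += 1.0 - d
        delta = sum(abs(new[v] - score[v]) for v in vertices)
        score = new
        if delta < tol:
            break
    return score


graph = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
ppr = personalized_pagerank(graph, "a")
print({v: round(s, 4) for v, s in sorted(ppr.items())})
```

Vertex d cannot be reached from the source a, so its personalized score is zero, whereas its global PageRank score would be positive. Running this once per source vertex is exactly the materialization cost that makes the exact approach impractical on large graphs.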

HITS

The Hyperlink-Induced Topic Search (HITS) algorithm is a link analysis method that was initially designed for webpage search. Given a user-specified query, a subgraph G of all relevant vertices to the query is first selected from the original graph. For each vertex v of G, two scores are further assigned to indicate the importance of v: authority and hub. Intuitively, v is considered an authority if it provides direct answers to specific information needs, and there exist many hub vertices linking to it. Likewise, v is considered a hub if it provides a good list of links to the high-quality vertices; that is, v points to many other authoritative vertices. As a result, authority and hub values are defined in terms of one another in a mutually recursive fashion: the authority of v is computed as the sum of the hub values for the vertices that point to v:
$$\displaystyle \begin{aligned} authority(v) = \sum_{u: (u, v) \in G} hub(u) \end{aligned} $$
(3)
and the hub value of v is the sum of the authority values for the vertices to which v points:
$$\displaystyle \begin{aligned} hub(v) = \sum_{u: (v, u) \in G} authority(u) \end{aligned} $$
(4)
The HITS algorithm iterates by updating the two scores for each vertex of G. In order to ensure convergence of the HITS algorithm, both the authority and hub scores are scaled and normalized within each iteration. In practice, once the authority and hub scores of the vertices no longer vary significantly between iterations, the HITS algorithm can be considered to have converged (Kleinberg 1999).
Consider the adjacency matrix A of the graph G, which is an N × N matrix, where N is the number of vertices of G. Because the edges of G are directed, A is not symmetric in general. The matrix entry \(A_{ij}\) (1 ≤ i, j ≤ N) is 1 if there exists an edge from vertex \(v_i\) to \(v_j\) and 0 otherwise. We further denote the hub vector \(\mathbf{hub} = (hub(v_1), \ldots, hub(v_N))\), where \(hub(v_i)\) (1 ≤ i ≤ N) is the hub score of the vertex \(v_i\). Likewise we denote the authority vector \(\mathbf{authority} = (authority(v_1), \ldots, authority(v_N))\). The HITS algorithm in matrix notation can therefore be formulated as
  1. compute \(\mathbf{hub} = A \times \mathbf{authority}\);
  2. compute \(\mathbf{authority} = A^T \times \mathbf{hub}\);
  3. iterate until convergence.
By substitution, it is immediate that
$$\displaystyle \begin{aligned} \mathbf{hub} = A \times A^T \times \mathbf{hub} \end{aligned} $$
(5)
and
$$\displaystyle \begin{aligned} \mathbf{authority} = A^T \times A \times \mathbf{authority} \end{aligned} $$
(6)
That is, up to the normalization applied in each iteration, \(\mathbf{hub}\) is an eigenvector of \(AA^T\) and \(\mathbf{authority}\) is an eigenvector of \(A^TA\). As a result, the HITS algorithm is actually a special case of the power method, and both the HITS and PageRank algorithms formalize link-based ranking as eigenvector problems of designated matrices (Ding et al. 2002).
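The mutually recursive updates of Eqs. (3) and (4) can be sketched as follows. The function name `hits`, the fixed iteration count, and the choice of the Euclidean norm for the per-iteration normalization are assumptions of this example; other normalizations are equally possible.

```python
import math


def hits(out_edges, iterations=50):
    """Iterate Eqs. (3) and (4), normalizing both vectors after each round.

    out_edges maps each vertex to the list of vertices it points to
    (hypothetical encoding chosen for this sketch).
    """
    vertices = set(out_edges) | {v for ts in out_edges.values() for v in ts}
    hub = {v: 1.0 for v in vertices}
    auth = {v: 1.0 for v in vertices}
    for _ in range(iterations):
        # Eq. (3): authority(v) is the sum of hub(u) over edges (u, v).
        auth = {v: 0.0 for v in vertices}
        for u, targets in out_edges.items():
            for v in targets:
                auth[v] += hub[u]
        # Eq. (4): hub(v) is the sum of authority(u) over edges (v, u).
        hub = {u: sum(auth[v] for v in out_edges.get(u, [])) for u in vertices}
        # Scale both vectors to unit Euclidean length to keep them bounded.
        for vec in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for v in vec:
                vec[v] /= norm
    return hub, auth


graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
print("hub:", {v: round(s, 3) for v, s in sorted(hub.items())})
print("authority:", {v: round(s, 3) for v, s in sorted(auth.items())})
```

Here a1 is pointed to by both h1 and h2 and therefore receives the highest authority score, while h1, which points to both authorities, receives the highest hub score.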

SimRank

Given a graph G, it is important to assess the similarity of vertices based purely on the interlinked graph topology. Among a number of similarity measures, SimRank has been recognized as one of the most widely adopted (Jiang et al. 2017; Tian and Xiao 2016). SimRank is based on the following intuitive argument: “two vertices are considered similar if they are referenced by similar vertices.” As the base case, we consider a vertex maximally similar to itself, to which we assign the SimRank score of 1. Furthermore, the similarity between two different vertices u and v is defined as:
$$\displaystyle \begin{aligned} s(u, v) = \frac{c}{|I(u)| \cdot |I(v)|} \sum_{a \in I(u), b \in I(v)} s(a, b) \end{aligned} $$
(7)
where I(v) denotes the set of neighboring vertices pointing to v and c ∈ (0, 1) is a decay factor. A solution to the SimRank equation can be reached by an iterative computation to a fixed-point (Lizorkin et al. 2008).
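The fixed-point iteration for Eq. (7) can be sketched as below. The function name `simrank`, the encoding `in_edges`, the default c = 0.8, and the fixed number of iterations are assumptions of this example. The sketch also assumes the convention of Jeh and Widom (2002) that s(u, v) = 0 whenever u or v has no in-neighbors, which is not stated in the text above. It computes all vertex pairs naively and is therefore suitable only for small graphs.

```python
def simrank(in_edges, c=0.8, iterations=10):
    """Naive all-pairs fixed-point iteration for Eq. (7).

    in_edges maps each vertex v to the list I(v) of vertices pointing to v
    (hypothetical encoding chosen for this sketch).
    """
    vertices = sorted(set(in_edges) | {a for ins in in_edges.values() for a in ins})
    # Base case: a vertex is maximally similar to itself.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in vertices for v in vertices}
    for _ in range(iterations):
        new = {}
        for u in vertices:
            for v in vertices:
                if u == v:
                    new[(u, v)] = 1.0
                    continue
                iu, iv = in_edges.get(u, []), in_edges.get(v, [])
                if not iu or not iv:
                    # Assumed convention: no in-neighbors means similarity 0.
                    new[(u, v)] = 0.0
                    continue
                total = sum(sim[(a, b)] for a in iu for b in iv)
                new[(u, v)] = c * total / (len(iu) * len(iv))
        sim = new
    return sim


in_edges = {"x": ["p"], "y": ["p"], "p": []}
scores = simrank(in_edges)
print(scores[("x", "y")], scores[("x", "p")])
```

In this example x and y are both referenced only by p, so s(x, y) = c · s(p, p) = c, while x and p share no referencing structure and obtain a score of 0.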
SimRank can also be interpreted in terms of coupled random walks. Consider any two vertices u and v of G, and suppose we start random walks from u and v, respectively, that proceed in lockstep, so that both walks have the same length t when they meet, for the first time, at a vertex w of G. Then the SimRank score s(u, v) is equivalent to the expected-f meeting distance (Jeh and Widom 2002):
$$ \displaystyle \begin{aligned} s(u, v) = \sum_{t=0}^{\infty} c^t \times \sum_{w \in G} \Pr (u, v, w) \end{aligned} $$
(8)
where \(\Pr (u, v, w)\) is the probability that two random walks originating from u and v, respectively, meet at w for the first time after exactly t steps.

Link Prediction

Real-world graphs are not static but dynamically evolving, with new vertices and edges added all the time. It is therefore fundamental and critical to understand the dynamics and evolution of graphs. In particular, consider two vertices u and v of a graph G, where u and v are not connected via an edge. An interesting problem that has fueled intensive research interest and found widely varying applications is to predict, given the current state of the graph G, the likelihood of a future connection between u and v, that is, a new edge (u, v) (Liben-Nowell and Kleinberg 2003). This problem is commonly referred to as the link prediction problem (Martínez et al. 2016; Duan et al. 2016; Barbieri et al. 2014; Hasan and Zaki 2011).

We can model the link prediction problem as a supervised classification task, where the current graph snapshot is used as the training data to build the link prediction model and the predictions of future links can be made afterward. This is a typical binary classification task, in that our main goal is to tell, given two nonadjacent vertices u and v, whether or not there will be an edge (u, v) in the near future. As a result, the existing supervised classification methods, including naive Bayes, neural networks, support vector machines (SVM), and k-nearest neighbors, can be used. The key challenge here is to select a set of appropriate, link-based features for classification.

The link-based features that are leveraged for link prediction are primarily based on the pure graph topology. Typically, they define vertex-wise similarity based on linked structures, or the ensembles of paths, between vertices (Liben-Nowell and Kleinberg 2003). The most well-known link-based features are discussed below.
  • Common Neighbors. Given two vertices u and v, the number of their common neighbors in G is defined as |N(u) ∩ N(v)|, where N(⋅) denotes the set of adjacent vertices of a given vertex. Intuitively speaking, if u and v share many common neighbors (e.g., friends), then u and v will connect with each other with high probability in the future;

  • Jaccard Coefficient. The Jaccard coefficient metric normalizes common neighbors as follows,
    $$\displaystyle \begin{aligned} \text{Jaccard}(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} \end{aligned} $$
    (9)
    Equivalently, the Jaccard coefficient can be interpreted as the probability of selecting a common neighbor of u and v from the union of the neighbors of u and v;
  • Preferential Attachment. In social networks, users who already have many friends tend to connect more in the future, which is commonly referred to as the rich-get-richer phenomenon. We can quantify the “richness” of a vertex by its degree, and the preferential attachment measure is thus defined as
    $$\displaystyle \begin{aligned} \text{P-A}(u, v) = |N(u)| \cdot |N(v)| \end{aligned} $$
    (10)
    Note that preferential attachment does not require any detailed information for vertex neighbors; therefore, it has the lowest computational complexity;
  • Adamic-Adar. The Adamic-Adar measure weighs the common neighbors of u and v with smaller degrees more heavily,
    $$\displaystyle \begin{aligned} \text{Adamic}(u, v) = \sum_{w \in N(u) \cap N(v)} \frac{1}{\log |N(w)|} \end{aligned} $$
    (11)
    Intuitively, if both u and v share a common neighbor w, which turns out to be a high-degree vertex connecting to many other vertices as well, the effect of influencing the future connection between u and v, in terms of w, should be dampened;
  • Shortest Path Distance. Empirically speaking, the shorter the distance between u and v in G, the higher the probability u and v will be directly connected in the future. Due to the small-world phenomenon (Watts and Strogatz 1998), however, most vertex pairs in real-world graphs are separated by fairly short distances. As a result, this path-based feature sometimes leads to poor link prediction performance;

  • Katz. The Katz measure can be viewed as a variant of shortest path distance:
    $$\displaystyle \begin{aligned} \text{Katz}(u, v) = \sum_{l=1}^{\infty} \beta^l \cdot |paths_{u, v}^l| \end{aligned} $$
    (12)
    where β ∈ (0, 1) is a damping parameter that penalizes long paths and must be small enough for the sum to converge, l denotes the length of a path, and \(paths_{u, v}^l\) is the set of all the paths of length l between u and v in G. Katz considers the ensemble of all paths between u and v and thus generally works much better than shortest path distance in link prediction. However, computing Katz in real-world graphs turns out to be very expensive;
  • Hitting Time. Given two vertices u and v in a graph G, the hitting time, H(u, v), is the expected number of steps a random walk starting at u takes to reach v. A shorter hitting time means that u and v are more similar, and thus there exists a higher probability that u and v will be linked together in the future. It is easy to estimate H(u, v) by simulating a sample of random walks. However, the estimate may have high variance. As a result, link prediction based on hitting time may result in poor prediction performance;

  • Rooted PageRank. The hitting time measure is sensitive to vertices that are far away from the pair u and v under examination, even if u and v are very close to each other in G. To alleviate this problem, we allow the random walk from u to v to periodically restart at u with a fixed probability α at each step. This way, distant parts of G that are far away from u and v will almost never be explored. This approach results in the Rooted PageRank measure, which is reminiscent of Personalized PageRank in the link-based ranking methods.
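Several of the features listed above can be computed directly from adjacency sets, as the following sketch illustrates for Eqs. (9) to (12). The function names `neighbor_scores` and `katz`, the dictionary-of-sets encoding of an undirected graph, and the truncation of the Katz sum at a maximum length `max_len` are assumptions of this example. The Katz sketch counts walks, in which vertices may repeat, as in the usual matrix-power formulation of the measure.

```python
import math


def neighbor_scores(adj, u, v):
    """Neighborhood-based features for two distinct vertices u and v.

    adj maps each vertex of an undirected graph to the set of its neighbors
    (hypothetical encoding chosen for this sketch).
    """
    nu, nv = adj[u], adj[v]
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(nu) * len(nv),
        # A common neighbor w is adjacent to both u and v, so |N(w)| >= 2
        # and the logarithm is strictly positive.
        "adamic_adar": sum(1.0 / math.log(len(adj[w])) for w in common),
    }


def katz(adj, u, v, beta=0.1, max_len=3):
    """Katz score of Eq. (12), truncated at walks of length max_len."""
    counts = {u: 1}  # counts[w] = number of walks of the current length from u to w
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for w, k in counts.items():
            for x in adj[w]:
                nxt[x] = nxt.get(x, 0) + k
        counts = nxt
        score += beta ** length * counts.get(v, 0)
    return score


adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
features = neighbor_scores(adj, "a", "d")
print(features)
print("katz:", katz(adj, "a", "d"))
```

For the nonadjacent pair (a, d), the common neighbors are b and c, and there are two walks of length 2 and two of length 3 between them. Such scores can serve as the feature vector of a candidate edge in the supervised classification setting described earlier.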

Besides the aforementioned link feature-based methods, there are also Bayesian probabilistic methods, probabilistic relational methods, and linear algebraic methods for link prediction. Readers are referred to the link prediction surveys for more technical details (Hasan and Zaki 2011).

Examples of Applications

Link analytics involves data analysis techniques used in network science to evaluate the relationships in graphs. It is essentially a kind of knowledge discovery that can be used in widely varying real-world, graph-structured applications, including search engine optimization, security analysis, and medical research.

The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. The use of hyperlinks for ranking web search results is probably one of the most noteworthy examples (Brin and Page 1998; Page et al. 1998). The PageRank and HITS algorithms have been key factors considered by web search engines in computing a composite score for a web page on any given query. Over the last decade, both have emerged as very effective measures of reputation for web graphs and social networks (Liu et al. 2017).

Because Personalized PageRank gives higher weights to nearby vertices, it has found many applications in different real-world graphs, including friend recommendation in social networks (Gupta et al. 2013; Backstrom and Leskovec 2011) and graph partitioning (Andersen et al. 2006). Personalized PageRank has also been used to rank items in bipartite graphs for recommendation (Bahmani et al. 2010). Furthermore, it has been applied in fields ranging from biology and chemistry to civil engineering (Fogaras et al. 2005).

SimRank is a general link-based similarity measure, which is computed solely based on the vertices’ structural context. SimRank has been successfully employed in many graph-based applications, such as sponsored search and web spam detection (Antonellis et al. 2008) and schema matching (Melnik et al. 2002).

Link prediction has found a series of applications in social networks. For instance, the “friend recommendation” feature of many social media websites suggests potential new friends to a user, or acquaintances whom the user already knows in real life but is not yet connected to. Beyond social network applications, link prediction has been used to find interactions between proteins (Airoldi et al. 2008). In the security domain, link prediction can help identify hidden groups of terrorists or criminals (Hasan and Zaki 2011).

Future Directions for Research

Link analytics has been a promising research direction in data science and graph mining and has generated a series of disruptive and influential techniques and methodologies that have significantly shaped the modern networked world. Several research frontiers for link analytics merit systematic and thorough study in the future:
  1. Real-world graphs and networks are oftentimes enormous in size and scale, making existing solutions for link analytics hard to adapt to the so-called big graphs. New link analysis principles and methodologies that can scale up or scale out on Internet-scale graphs will be of great importance in the Big Data era;
  2. Real-world graphs are typically generated from disparate, heterogeneous data sources, thus resulting in heterogeneous, multidimensional graph data. The synergy and unification of graph topology and heterogeneous contents will enhance both the effectiveness and efficiency of existing link analysis methods;
  3. Real-world graphs are not static but evolve dynamically at a fast pace. Enabling real-time and accurate link analytics on fast, dynamically evolving graphs will further spur extensive research interest and potential applications for dynamic graphs and graph streams.

Cross-References

References

  1. Aggarwal CC, Wang H (2010) Managing and mining graph data, 1st edn. Springer Publishing Company, Inc., Boston
  2. Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008) Mixed membership stochastic blockmodels. J Mach Learn Res 9:1981–2014
  3. Andersen R, Chung F, Lang K (2006) Local graph partitioning using pagerank vectors. In: Proceedings of the 47th annual IEEE symposium on foundations of computer science (FOCS’06), pp 475–486
  4. Antonellis I, Molina HG, Chang CC (2008) Simrank++: query rewriting through link analysis of the click graph. Proc VLDB Endow 1(1):408–421
  5. Backstrom L, Leskovec J (2011) Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the fourth ACM international conference on web search and data mining (WSDM’11), pp 635–644
  6. Bahmani B, Chowdhury A, Goel A (2010) Fast incremental and personalized pagerank. Proc VLDB Endow 4(3):173–184
  7. Barbieri N, Bonchi F, Manco G (2014) Who to follow and why: link prediction with explanations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’14), pp 1266–1275
  8. Berkhin P (2005) Survey: a survey on pagerank computing. Internet Math 2(1):73–120
  9. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the seventh international conference on World Wide Web (WWW’98), pp 107–117
  10. Chebolu P, Melsted P (2008) Pagerank and the random surfer model. In: Proceedings of the nineteenth annual ACM-SIAM symposium on discrete algorithms (SODA’08)
  11. Cook DJ, Holder LB (2006) Mining graph data. Wiley, New York
  12. Ding C, He X, Husbands P, Zha H, Simon HD (2002) Pagerank, hits and a unified framework for link analysis. In: Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR’02), pp 353–354
  13. Duan L, Aggarwal C, Ma S, Hu R, Huai J (2016) Scaling up link prediction with ensembles. In: Proceedings of the ninth ACM international conference on web search and data mining (WSDM’16), pp 367–376
  14. Fogaras D, Rácz B, Csalogány K, Sarlós T (2005) Towards scaling fully personalized pagerank: algorithms, lower bounds, and experiments. Internet Math 2(3):333–358
  15. Fujiwara Y, Nakatsuji M, Shiokawa H, Mishima T, Onizuka M (2013) Efficient ad-hoc search for personalized pagerank. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD’13), pp 445–456
  16. Gupta P, Goel A, Lin J, Sharma A, Wang D, Zadeh R (2013) WTF: The who to follow service at twitter. In: Proceedings of the 22nd international conference on World Wide Web (WWW’13), pp 505–514
  17. Hasan M, Zaki M (2011) A survey of link prediction in social networks. In: Aggarwal CC (ed) Social network data analytics. Springer, New York, pp 243–275
  18. Jeh G, Widom J (2002) Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’02), pp 538–543
  19. Jeh G, Widom J (2003) Scaling personalized web search. In: Proceedings of the 12th international conference on World Wide Web (WWW’03), pp 271–279
  20. Jiang M, Fu AWC, Wong RCW (2017) Reads: a random walk approach for efficient and accurate dynamic simrank. Proc VLDB Endow 10(9):937–948
  21. Kleinberg JM (1999) Authoritative sources in a hyperlinked environment. J ACM 46(5):604–632
  22. Kleinberg JM, Kumar R, Raghavan P, Rajagopalan S, Tomkins AS (1999) The web as a graph: measurements, models, and methods. In: Proceedings of the 5th annual international conference on computing and combinatorics (COCOON’99), pp 1–17
  23. Liben-Nowell D, Kleinberg J (2003) The link prediction problem for social networks. In: Proceedings of the twelfth international conference on information and knowledge management (CIKM’03), pp 556–559
  24. Liu Q, Xiang B, Yuan NJ, Chen E, Xiong H, Zheng Y, Yang Y (2017) An influence propagation view of pagerank. ACM Trans Knowl Discov Data 11(3):30:1–30:30
  25. Lizorkin D, Velikhov P, Grinev M, Turdakov D (2008) Accuracy estimate and optimization techniques for simrank computation. Proc VLDB Endow 1(1):422–433
  26. Lofgren PA, Banerjee S, Goel A, Seshadhri C (2014) FAST-PPR: scaling personalized pagerank estimation for large graphs. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’14), pp 1436–1445
  27. Maehara T, Akiba T, Iwata Y, Kawarabayashi Ki (2014) Computing personalized pagerank quickly by exploiting graph structures. Proc VLDB Endow 7(12):1023–1034
  28. Martínez V, Berzal F, Cubero JC (2016) A survey of link prediction in complex networks. ACM Comput Surv 49(4):69:1–69:33
  29. Melnik S, Garcia-Molina H, Rahm E (2002) Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the 18th international conference on data engineering (ICDE’02), pp 117–128
  30. Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: bringing order to the web. In: Proceedings of the 7th international World Wide Web conference (WWW’98), pp 161–172
  31. Pitas I (2015) Graph-based social media analysis. Chapman & Hall/CRC, Boca Raton
  32. Tian B, Xiao X (2016) Sling: A near-optimal index structure for simrank. In: Proceedings of the 2016 international conference on management of data (SIGMOD’16), pp 1859–1874
  33. Wang S, Tang Y, Xiao X, Yang Y, Li Z (2016) Hubppr: effective indexing for approximate personalized pagerank. Proc VLDB Endow 10(3):205–216
  34. Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):440–442

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, Florida State University, Tallahassee, USA

Section editors and affiliations

  • Hannes Voigt (1)
  • George Fletcher (2)
  1. Technische Universität Dresden, Dresden, Germany
  2. Department of Mathematics and Computer Science, Eindhoven University of Technology