In this paper, we aim to identify more behaviorally representative nodes in bipartite social networks. To this end, we propose a new similarity-based centrality measure, called HellRank. Since a similarity measure is usually, in some sense, the inverse of a distance metric, we first choose a suitable distance measure, namely the Hellinger distance (Sect. 3.1). We then apply this metric to bipartite networks and theoretically analyze its impact in such networks. Next, we generate a distance matrix on one side of the network. Finally, we compute the HellRank score of each node according to this matrix. As a result, nodes with high HellRank centrality are more behaviorally representative nodes in bipartite social networks.
Select a well-defined distance metric
When choosing a base metric, an important consideration is whether the measure is a well-defined mathematical metric. We want to introduce a similarity-based measure for each pair of nodes in the network; since similarity measures are, in some sense, the inverse of distance metrics, we choose a proper distance measure as the base metric. A true distance metric must satisfy several key properties. A metric with these properties induces topological structure (such as open and closed sets) on a space and leads to the study of more abstract topological spaces. Hunter (2012) introduced the following definition of a distance metric.
Definition 1
A metric space is a set X that has a notion of the distance function d(x, y) between every pair of points \(x, y \in X\). A well-defined distance metric d on a set X is a function \(d : X \times X \rightarrow \mathrm{I\!R}\) such that for all \(x, y, z \in X\), three properties hold:
1. Positive Definiteness: \(d(x, y) \ge 0\) and \(d(x, y) = 0\) if and only if \(x = y\);
2. Symmetry: \(d(x, y) = d(y, x)\);
3. Triangle Inequality: \(d(x,y) \le d(x,z) + d(z,y)\).
We define our distance function as the difference between the probability distributions of each pair of nodes, based on the f-divergence function, which is defined as follows.
Definition 2
An f-divergence is a function \(D_f(P||Q)\) that measures the difference between two probability distributions P and Q. For a convex function f with \(f(1) = 0\), the f-divergence of Q from P is defined as (Csiszár and Shields 2004):
$$\begin{aligned} D_f(P||Q)= & {} \int _{\varOmega } {f\left( \frac{dP}{dQ}\right) dQ} \end{aligned}$$
(10)
where \(\varOmega\) is a sample space, which is the set of all possible outcomes.
In this paper, we use one type of f-divergence metric, the Hellinger distance (closely related to the Bhattacharyya distance), which was introduced by Ernst Hellinger in 1909 (Nikulin 2001). In probability theory and information theory, the Kullback–Leibler divergence (Kullback and Leibler 1951) is a more common measure of the difference between two probability distributions; however, it satisfies neither the symmetry nor the triangle inequality condition (Van der Vaart 2000). Thus, it is not intuitively appropriate for expressing similarity in our problem. As a result, we choose the Hellinger distance to quantify the similarity between two probability distributions (Van der Vaart 2000). For two discrete probability distributions \(P=(p_1, \ldots , p_m)\) and \(Q=(q_1, \ldots , q_m)\), where m is the length of the vectors, the Hellinger distance is defined as:
$$\begin{aligned} D_H(P||Q)= & {} \frac{1}{\sqrt{2}} \sqrt{\sum _{i=1}^{m}(\sqrt{p_i}-\sqrt{q_i})^2} \end{aligned}$$
(11)
It is directly related to the Euclidean norm of the difference of the square roots of the vectors:
$$\begin{aligned} D_H(P||Q)= & {} \frac{1}{\sqrt{2}} \Vert {\sqrt{P}-\sqrt{Q}}\Vert _2 \end{aligned}$$
(12)
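As an illustration of Eqs. (11) and (12), the following minimal Python sketch (the function name and example vectors are ours, purely illustrative) computes the Hellinger distance directly from two non-negative vectors of equal length:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two equal-length non-negative vectors (Eqs. 11/12)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

# two discrete probability distributions over the same sample space
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])
print(hellinger(P, Q))  # small value: the two distributions are similar
```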
Applying Hellinger distance in bipartite networks
In this section, we apply the Hellinger distance to a bipartite network to measure the similarity of the nodes on one side of the network. Assume x is a node in a bipartite network with neighborhood N(x) and degree \(deg(x)=|N(x)|\). Suppose that the greatest node degree in the network is \(\varDelta\). Let \(l_i\) be the number of x’s neighbors with degree i, and let the vector \(L_x=(l_1,\dots ,l_{\varDelta })\) be the non-normalized distribution of \(l_i\) over all neighbors of x. We now introduce the Hellinger distance between two nodes x and y on one side of the bipartite network as follows:
$$\begin{aligned} d(x,y) = \sqrt{2}\ D_H (L_x\Vert L_y) \end{aligned}$$
(13)
The function d(x, y) represents the difference between the two distributions \(L_x\) and \(L_y\). To the best of our knowledge, this is the first work that introduces the Hellinger distance between each pair of nodes in a bipartite network, using the degree distribution of the neighbors of each node.
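As a concrete illustration of Eq. (13), the following hedged sketch (the helper functions and the dictionary representation of the bipartite adjacency are our own assumptions, not an implementation from the original work) builds the vectors \(L_x\) and evaluates d(x, y):

```python
import numpy as np

def degree_vector(node, adj, max_degree):
    """L_x: entry i-1 counts the node's neighbors (in V2) whose degree is i."""
    L = np.zeros(max_degree)
    for v in adj[node]:                                   # neighbors of node in V2
        deg_v = sum(v in nbrs for nbrs in adj.values())   # degree of v in the bipartite graph
        L[deg_v - 1] += 1
    return L

def node_distance(x, y, adj, max_degree):
    """d(x, y) = sqrt(2) * D_H(L_x || L_y), Eq. (13); the sqrt(2) factors cancel."""
    Lx = degree_vector(x, adj, max_degree)
    Ly = degree_vector(y, adj, max_degree)
    return np.linalg.norm(np.sqrt(Lx) - np.sqrt(Ly))

# toy bipartite network: V1 = {'x', 'y'}, V2 = {1, 2, 3}
adj = {'x': {1, 2, 3}, 'y': {3}}
print(node_distance('x', 'y', adj, max_degree=3))   # sqrt(2) ~ 1.41
```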
Theoretical analysis
In this section, we first consider the Hellinger distance over all positive real vectors and show that applying this distance to bipartite networks preserves its metricity (Lemma 1), according to Definition 1. We then derive an upper and a lower bound for the Hellinger distance between two nodes of a bipartite network.
Lemma 1
Hellinger distance for all positive real vectors is a well-defined distance metric function.
Proof
Based on the true metric properties in Definition 1, for two probability distribution vectors P and Q, positive definiteness and symmetry follow directly from the definition in (11):
$$\begin{aligned} D_H(P\Vert Q)\ge & {} 0 \end{aligned}$$
(14)
$$\begin{aligned} D_H(P\Vert Q)= & {} 0 \ \ \Leftrightarrow \ P=Q \end{aligned}$$
(15)
$$\begin{aligned} D_H(P\Vert Q)= & {} D_H(Q\Vert P) \end{aligned}$$
(16)
If R is a third probability distribution on the same space as P and Q, then by the triangle inequality for the 2-norm we have:
$$\begin{aligned} \frac{1}{\sqrt{2}} \Vert {\sqrt{P}-\sqrt{Q}}\Vert _2\le & {} \frac{1}{\sqrt{2}} ( \Vert {\sqrt{P}-\sqrt{R}}\Vert _2+ \Vert {\sqrt{R}-\sqrt{Q}}\Vert _2 ) \nonumber \\ \Rightarrow ~~ D_H(P\Vert Q)\le & {} D_H(P\Vert R)+ D_H(R\Vert Q) \end{aligned}$$
(17)
This shows that the triangle inequality also holds; hence, the Hellinger distance over all positive real vectors is a well-defined distance metric. \(\square\)
Using this distance measure, we are able to detect differences between the local structures of nodes; in other words, the distance expresses the similarity between the local structures of two nodes. If we normalize the vectors (i.e., make their elements sum to one), differences between the local structures of nodes may no longer be observed. For example, there is no distance between a node x with \(deg(x)=10\) whose neighbors all have degree 2 and a node y with \(deg(y)=1\) whose single neighbor has degree 2. Therefore, our distance measure with vector normalization is not suitable for comparing two nodes, as the short sketch below illustrates.
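A minimal numerical illustration of this point, using the example above (the code is ours and purely illustrative):

```python
import numpy as np

hell = lambda p, q: np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

# node x: degree 10, all ten neighbors have degree 2  ->  L_x = (0, 10)
# node y: degree 1, its single neighbor has degree 2  ->  L_y = (0, 1)
Lx, Ly = np.array([0.0, 10.0]), np.array([0.0, 1.0])
print(np.sqrt(2) * hell(Lx, Ly))                        # ~2.16: local structures differ
print(np.sqrt(2) * hell(Lx / Lx.sum(), Ly / Ly.sum()))  # 0.0: normalization hides the difference
```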
We further claim that if the difference between two nodes’ degrees is greater (or smaller) than a certain value, the distance between these nodes cannot be less (or more) than a corresponding value; in other words, their local structures cannot be more (or less) similar than that value allows. In the following theorem, we derive an upper and a lower bound for the Hellinger distance between two nodes on one side of a bipartite network in terms of their degree difference.
Theorem 1
If we have two nodes x and y on one side of a bipartite network, such that \(deg(x)=k_1\), \(deg(y)=k_2\), and \(k_1 \ge k_2\), then we have a lower bound for the distance between these nodes as:
$$\begin{aligned} d(x,y)\ge \sqrt{k_1}-\sqrt{k_2} \end{aligned}$$
(18)
and an upper bound as:
$$\begin{aligned} d(x,y)\le \sqrt{k_1+k_2} \end{aligned}$$
(19)
Proof
To prove the theorem, we use Lagrange multipliers. Suppose \(L_x=(l_1,\dots ,l_{\varDelta })\) and \(L_y=(h_1,\dots ,h_{\varDelta })\) are the positive real distribution vectors of nodes x and y. From (13), \(d(x,y) = \sqrt{2}\ D_H (L_x\Vert L_y)\), so one can minimize the distance between these nodes by solving \(\min \limits _{L_x,L_y} \sqrt{2}\ D_H(L_x\Vert L_y)\), which is equivalent to minimizing the square of their distance:
$$\begin{aligned} \min _{L_x,L_y} 2 D_H^2(L_x \Vert L_y)=\min _{L_x,L_y} \sum _{i=1}^{\varDelta }\left( \sqrt{l_i}-\sqrt{h_i}\right) ^2 \end{aligned}$$
The Lagrangian function can therefore be defined as follows:
$$\begin{aligned} F(L_x,L_y,\lambda _1,\lambda _2)= & {} {\sum _{i=1}^{\varDelta } \left( \sqrt{l_{i}}-\sqrt{h_{i}}\right) ^2} \\&+\lambda _1\left( k_1-\sum _{i=1}^{\varDelta } {l_{i}}\right) + \lambda _2\left( k_2-\sum _{i=1}^{\varDelta } {h_{i}}\right) \end{aligned}$$
Then, we take the first derivative with respect to \(l_i\):
$$\begin{aligned} \frac{\partial F}{\partial l_{i}} = 1 - \frac{\sqrt{h_{i}}}{\sqrt{l_{i}}}- \lambda _1 =0 ~~~\Rightarrow ~~~ h_{i} = {l_{i}}{(1-\lambda _1)^2} \end{aligned}$$
Due to \(\sum _{i=1}^{\Delta }l_i=k_1\) and \(\sum _{i=1}^{\Delta }h_i=k_2\), we have:
$$\begin{aligned} \sum _{i=1}^{\Delta }h_{i} = k_2 \rightarrow \sum _{i=1}^{\Delta } {l_{i}}{(1-\lambda _1)^2} = k_2 \rightarrow (1-\lambda _1) = \pm \sqrt{\frac{k_2}{k_1}} \end{aligned}$$
However, in order to satisfy \(\sqrt{h_i}= {\sqrt{l_i}}{(1-\lambda _1)}\), the term \(1-\lambda _1\) must be positive; thus:
$$\begin{aligned} h_{i} ={l_{i}}{(1-\lambda _1)^2} = l_{i} \frac{k_2}{k_1} \end{aligned}$$
Differentiating with respect to \(h_i\) leads to the same conclusion. Substituting this relation, the minimum of the objective is attained:
$$\begin{aligned} \min _{L_x,L_y} 2 D_H^2(L_x \Vert L_y)= & {} \sum _{i=1}^{\Delta }\left( \sqrt{l_i}-\sqrt{{\frac{k_2}{k_1}}}\sqrt{l_i}\right) ^2\\= & {} \sum _{i=1}^{\Delta }l_i\left( 1-\sqrt{{\frac{k_2}{k_1}}}\right) ^2\\=\,& {} (1-\sqrt{{\frac{k_2}{k_1}}})^2\sum _{i=1}^{\Delta }l_i\\= & {} \left( 1-\sqrt{{\frac{k_2}{k_1}}}\right) ^2 k_1 \ \ ~~~~~~~\Rightarrow \\ \min _{L_x,L_y} \sqrt{2}\ D_H(L_x \Vert L_y)=\, & {} \sqrt{k_1}\left( 1-\sqrt{{\frac{k_2}{k_1}}}\right) \\=\, & {} \sqrt{k_1}-\sqrt{k_2} \end{aligned}$$
Hence, the distance between any pair of nodes on one side of the bipartite network cannot be smaller than \(\sqrt{k_1}-\sqrt{k_2}\); as the difference between their degrees grows, so does this lower bound.
Now, we derive the upper bound stated in Equation (19). The following inequality holds for any nonnegative \(p_{i}, p_{j}\) and \(q_{i}, q_{j}\):
$$\begin{aligned}&\left( \sqrt{p_i}-\sqrt{q_i}\right) ^2+\left( \sqrt{p_j}-\sqrt{q_j}\right) ^2 \\&\quad \le \left( \sqrt{p_i+p_j}-0\right) ^2+\left( \sqrt{q_i+q_j}-0\right) ^2\\&\quad = p_i+p_j+q_i+q_j \end{aligned}$$
Setting \(p_i=l_i\), \(p_j=l_j\), \(q_i=h_i\), and \(q_j=h_j\), this inequality holds for any two pairs of elements in \(L_x\) and \(L_y\). Summing over all elements, we obtain:
$$\begin{aligned} d(x,y) \le \sqrt{\left( \sqrt{k_1}-0\right) ^2+\left( \sqrt{k_2}-0\right) ^2} = \sqrt{k_1+k_2} \end{aligned}$$
We conclude that the distance between any pair of nodes on one side of the bipartite network cannot exceed \(\sqrt{k_1+k_2}\), regardless of how large their degrees become. \(\square\)
As a result, we have found upper and lower bounds for the Hellinger distance between two nodes on one side of a bipartite network in terms of their degree difference. A numerical sanity check of these bounds is sketched below.
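The check below is an illustrative script of our own (the degrees, vector length, and sampling scheme are assumed): it draws random non-negative vectors with prescribed sums \(k_1 \ge k_2\) and verifies that every sampled distance lies between \(\sqrt{k_1}-\sqrt{k_2}\) and \(\sqrt{k_1+k_2}\):

```python
import numpy as np

rng = np.random.default_rng(0)
k1, k2, delta = 9.0, 4.0, 6            # assumed degrees (k1 >= k2) and vector length

for _ in range(10_000):
    Lx = rng.random(delta); Lx *= k1 / Lx.sum()    # random vector whose entries sum to k1
    Ly = rng.random(delta); Ly *= k2 / Ly.sum()    # random vector whose entries sum to k2
    d = np.linalg.norm(np.sqrt(Lx) - np.sqrt(Ly))  # d(x, y) = sqrt(2) * D_H(Lx || Ly)
    assert np.sqrt(k1) - np.sqrt(k2) - 1e-9 <= d <= np.sqrt(k1 + k2) + 1e-9
print("bounds of Theorem 1 hold for all sampled pairs")
```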
An example with probabilistic view
In this example, we analyze the similarity among nodes based on Hellinger distance information in an artificial network. We examine how the information required to find nodes similar to a specific node x can be obtained as the expected value and variance of the Hellinger distance. Suppose that in an artificial bipartite network with \(\vert V_1\vert = n_1\) nodes on one side and \(\vert V_2\vert = n_2\) nodes on the other side, nodes in \(V_1\) are connected to nodes in \(V_2\) following the Erdös–Rényi model G(m, p); in other words, a link exists between the two sides of the network with probability p. The distribution vector \(L_x\) of a node \(x\in V_1\) can be expressed in multinomial form as:
$$\begin{aligned} P\left( l_1,\dots ,l_{\varDelta }|deg(x)=k \right)=\, & {} P\left( L_x|deg(x)=k\right) \nonumber \\= & {} \left( \begin{array}{c} k \\ l_1,\dots ,l_{\varDelta } \end{array}\right) \prod {P_i^{l_i}} \end{aligned}$$
(20)
where \(P_i=\left( \begin{array}{c} n_2-1 \\ i-1 \end{array}\right) p^{i-1} (1-p)^{n_2-i}\) is the binomial probability \(B(n_2, p)\) that a neighbor of x has degree i.
According to the Poisson limit theorem (Johnson 2004), this binomial distribution converges to a Poisson distribution \(Pois(\lambda )\) with parameter \(\lambda =(n_2-1)p\) as \(n_2\) increases while \((n_2-1)p\) remains fixed. Therefore, the mean of the distribution \(P\left( L_x|deg(x)=k\right)\) is \(\mu =(kp_1,kp_2,\dots ,kp_{\varDelta })\). In addition, the degree distributions in the Erdös–Rényi model converge to Poisson distributions as \(n_1\) and \(n_2\) increase (\(\lambda =n_1 p\) for one side of the network and \(\lambda =n_2 p\) for the other).
As \(\varDelta\) increases, the mean of \(P\left( L_x|deg(x)=k\right)\) approaches k times a Poisson distribution; thus, the normalized \(L_x\) vector follows a Poisson distribution with parameter \(\lambda =(n_2-1)p\). To find a threshold for locating nodes similar and close to node x, we must obtain the expectation and variance of the Hellinger distance between x and the other nodes in \(V_1\). Before obtaining these values, we state the following lemma, which relates the Hellinger distance to the difference between the arithmetic mean and the geometric mean.
Lemma 2
Suppose P and Q are two probability distribution vectors \(P=(p_1, \ldots , p_m)\) and \(Q=(q_1, \ldots , q_m)\), such that P is \(k_1\) times a Poisson probability vector \(P_1\,\sim \,\mathrm {Poisson}(\lambda _1)\) and Q is \(k_2\) times a Poisson probability vector \(P_2\,\sim \,\mathrm {Poisson}(\lambda _2)\). The squared Hellinger distance between P and Q is given by:
$$\begin{aligned} D_H^2(P \Vert Q)= & {} \frac{k_1+k_2}{2} - \sqrt{k_1k_2}\ e^{-\frac{1}{2}\left( \sqrt{\lambda _1}-\sqrt{\lambda _2}\right) ^2} \end{aligned}$$
(21)
Proof
The squared Hellinger distance between two Poisson distributions \(P_1\) and \(P_2\) with rate parameters \(\lambda _1\) and \(\lambda _2\) is (Torgersen 1991):
$$\begin{aligned} D_H^2(P_1\Vert P_2)= 1-e^{-\frac{1}{2}\left( \sqrt{\lambda _1}-\sqrt{\lambda _2}\right) ^2} \end{aligned}$$
(22)
Therefore, since \(\sum _{i=1}^m p_i= k_1\) and \(\sum _{i=1}^m q_i=k_2\), the squared Hellinger distance between the probability vectors P and Q equals:
$$\begin{aligned} D_H^2(P \Vert Q)= & {} \frac{1}{2} \sum _{i=1}^m\left( \sqrt{p_i}-\sqrt{q_i}\right) ^2 \nonumber \\= & {} \frac{1}{2} \sum _{i=1}^m\left( p_i+q_i - 2\sqrt{p_i q_i}\right) \nonumber \\= & {} \frac{k_1+k_2}{2} - \sqrt{k_1k_2}\left( 1-D_H^2(P_1\Vert P_2)\right) \nonumber \\= & {} \frac{k_1+k_2}{2} - \sqrt{k_1k_2}\ e^{-\frac{1}{2}\left( \sqrt{\lambda _1}-\sqrt{\lambda _2}\right) ^2} \end{aligned}$$
(23)
\(\square\)
In particular, in the special case of \(\lambda _1=\lambda _2\), we have:
$$\begin{aligned} D_H^2(P\Vert Q) = \frac{k_1+k_2}{2} - \sqrt{k_1k_2} \end{aligned}$$
(24)
That is, the squared Hellinger distance equals the difference between the arithmetic mean and the geometric mean of \(k_1\) and \(k_2\). A numerical check of Eqs. (21) and (24) is sketched below.
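The following sketch (parameter values are assumed, and the Poisson tails are truncated at an arbitrary index m) compares the direct computation of Eq. (11) with the closed-form expression of Eq. (21):

```python
import numpy as np
from scipy.stats import poisson

lam1, lam2, k1, k2, m = 3.0, 5.0, 7.0, 4.0, 200     # illustrative parameters (assumed)
i = np.arange(m)
P = k1 * poisson.pmf(i, lam1)                       # P = k1 * Poisson(lam1)
Q = k2 * poisson.pmf(i, lam2)                       # Q = k2 * Poisson(lam2)

direct = 0.5 * np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)          # definition, Eq. (11)
closed = (k1 + k2) / 2 - np.sqrt(k1 * k2) * np.exp(
    -0.5 * (np.sqrt(lam1) - np.sqrt(lam2)) ** 2)               # closed form, Eq. (21)
print(direct, closed)   # the two values agree up to the truncation of the Poisson tails
```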
To calculate the second moment of the distance between node \(x \in V_1\) and any other node \(z \in V_1\) on the same side of the bipartite network, based on Lemma 2, we have:
$$\begin{aligned}&E_{z\in V_1}\left[ d^2(x,z)\right] \nonumber \\&\quad =E\left[ 2\ D_H^2(L_x\Vert L)\right] \nonumber \\&\quad = \sum _{i=1}^\infty \left( \frac{e^{-n_1p}(n_1p)^i}{i!}(k+i-2\sqrt{ki}) \right) \nonumber \\&\quad \simeq \sum _{i=1}^{n_2} \left( \frac{e^{-n_1p}(n_1p)^i}{i!}(k+i-2\sqrt{ki}) \right) \end{aligned}$$
(25)
where \(L=(L_z|z \in V_1)\) and the infinite sum can be approximated by its first \(n_2\) terms. Similarly, for the expected distance we have:
$$\begin{aligned} E\left[ \sqrt{2}\ D_H(L_x\Vert L)\right] \simeq \displaystyle \sum \limits _{i=1}^{n_2} \left( \frac{e^{-n_1p}(n_1p)^i}{i!}\sqrt{k+i-2\sqrt{ki}}\right) \end{aligned}$$
(26)
In addition, the variance can be obtained from these two moments:
$$\begin{aligned} Var_{z\in V_1}\left( d(x,z)\right) = E_{z\in V_1}\left[ d^2(x,z)\right] - \left( E_{z\in V_1}\left[ d(x,z)\right] \right) ^2 \end{aligned}$$
(27)
Hence, using these parameters, the required threshold for finding nodes similar to a specific node x can be obtained; a numerical sketch follows this paragraph. To extend our method to more complex and realistic networks, we can assume that the distribution \(L_x\) is a multiple of a Poisson (or any other) distribution vector with parameter \(\lambda _x\), where \(\lambda _x\) can be extracted either from information about the structure of the network or via an appropriate maximum likelihood estimation for node x. The threshold then becomes more realistic and consistent with the structure of real-world networks.
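A hedged sketch of this threshold computation under the Erdös–Rényi assumption (the network parameters are illustrative, and the Poisson rate follows the \(n_1 p\) form used in Eqs. (25) and (26)):

```python
import numpy as np
from scipy.stats import poisson

n1, n2, p, k = 1000, 800, 0.01, 12          # assumed network parameters; k = deg(x)
i = np.arange(1, n2 + 1)
w = poisson.pmf(i, n1 * p)                  # Poisson weight of a node z having degree i

second_moment = np.sum(w * (k + i - 2 * np.sqrt(k * i)))            # Eq. (25)
mean_distance = np.sum(w * np.sqrt(k + i - 2 * np.sqrt(k * i)))     # Eq. (26)
variance = second_moment - mean_distance ** 2                        # Eq. (27)

# nodes whose distance to x falls below, e.g., mean - std can be treated as similar to x
threshold = mean_distance - np.sqrt(variance)
print(mean_distance, variance, threshold)
```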
Generalization to weighted bipartite networks
The introduced distance metric can be extended to weighted networks. The generalized Hellinger distance between two nodes of a weighted bipartite network can be defined as:
$$\begin{aligned} d(x,y) = \sqrt{2} D_H(W_x\Vert W_y) \end{aligned}$$
(28)
where \(W_x=(w'_1,\dots ,w'_{\Delta })\), \(w'_i = \sum \nolimits _{\begin{array}{c} j \in N(x)\\ deg(j)=i \end{array}}{w_j}\), and \(w_j\) is the weight of the link between x and its neighbor j.
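For completeness, a short sketch of how the weighted vectors \(W_x\) of Eq. (28) might be assembled (the dictionary of edge weights keyed by node pairs is our own assumed representation):

```python
import numpy as np

def weighted_vector(x, adj, weights, max_degree):
    """W_x: entry i-1 sums the weights of x's links to neighbors whose degree is i (Eq. 28)."""
    W = np.zeros(max_degree)
    for j in adj[x]:
        deg_j = sum(j in nbrs for nbrs in adj.values())   # degree of neighbor j
        W[deg_j - 1] += weights[(x, j)]
    return W

adj = {'x': {1, 2, 3}, 'y': {3}}
weights = {('x', 1): 0.5, ('x', 2): 2.0, ('x', 3): 1.0, ('y', 3): 0.7}
Wx = weighted_vector('x', adj, weights, 3)
Wy = weighted_vector('y', adj, weights, 3)
print(np.linalg.norm(np.sqrt(Wx) - np.sqrt(Wy)))    # d(x, y) = sqrt(2) * D_H(W_x || W_y)
```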
Rank prediction via HellRank
In this section, we propose a new Hellinger-based centrality measure, called HellRank, for bipartite networks. Following Sect. 3.2, we compute the Hellinger distance between every pair of nodes on one side of a bipartite network and then generate an \(n_1\times n_1\) distance matrix (\(n_1\) is the number of nodes on that side of the network). The Hellinger distance matrix of the network G shown in Fig. 1 is as follows:
According to the well-defined metric properties (Sect. 3.1) and the ability to map into Euclidean space, we can cluster nodes based on their distances; that is, any pair of nodes with a small enough distance in the matrix can be placed in one cluster given a specific neighborhood radius. By taking the inverse of the average of the elements in each row of the distance matrix, we obtain the final similarity score (HellRank) of each node of the network:
$$\begin{aligned} HellRank(x)= & {} \frac{n_1}{\sum _{z \in V_1}{d(x,z)}} \end{aligned}$$
(29)
Let \(HellRank^* (x)\) be the normalized HellRank of node x, defined as:
$$\begin{aligned} HellRank^* (x)= & {} HellRank(x) . \min _{z \in V_1}\left( {HellRank(z)}\right) \end{aligned}$$
where ‘ . ’ denotes multiplication and \(\min _{z \in V_1}\left( {HellRank(z)}\right)\) is the minimum HellRank over all nodes.
A similarity measure is usually, in some sense, the inverse of a distance metric: it takes small values for dissimilar nodes and large values for similar nodes. Nodes on one side with higher similarity scores are more behaviorally representative of that side of the bipartite network; in other words, these nodes are more similar than others to the nodes of that side. HellRank thus indicates the structural similarity of each node to the other nodes of the network. For the network shown in Fig. 1, according to the Hellinger distance matrix, the normalized HellRank values of nodes A, B, C, and D are 0.71, 1, 0.94, and 0.52, respectively. Among all the centrality measures discussed in Sect. 2.2, only HellRank identifies node B as the most behaviorally representative node. Hence, sorting the nodes by their HellRank scores yields a better rank prediction for the nodes of the network. Nodes with high HellRank are more similar to the other nodes. In addition, nodes with lower scores identify very specific nodes that are probably very different from the others in the network; nodes with low HellRank are very dissimilar to the other nodes on that side of the bipartite network.
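Putting the pieces together, a minimal sketch of Eq. (29) applied to a precomputed distance matrix (the matrix below is an assumed toy example, not the matrix of Fig. 1, and the function name is ours):

```python
import numpy as np

def hellrank(D):
    """HellRank(x) = n1 / sum_z d(x, z), Eq. (29); D is the n1 x n1 Hellinger distance matrix."""
    n1 = D.shape[0]
    return n1 / D.sum(axis=1)

# assumed toy distance matrix for four nodes (symmetric, zero diagonal)
D = np.array([[0.0, 1.0, 1.2, 2.0],
              [1.0, 0.0, 0.6, 1.4],
              [1.2, 0.6, 0.0, 1.6],
              [2.0, 1.4, 1.6, 0.0]])
scores = hellrank(D)
print(scores)                 # larger score -> more behaviorally representative node
print(np.argsort(-scores))    # rank prediction: node indices sorted by HellRank
```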