5.1 Introduction

In previous chapters, we have introduced how to generate the hypergraph structure from observed data. After the hypergraph generation step, how to use this hypergraph for different applications becomes the key task. Hypergraphs have the potential to be used in many areas, such as social media analysis, medical and biological applications, and computer vision. We notice that most of these applications can be categorized into several typical tasks and follow similar application patterns. In this chapter, we introduce several typical hypergraph computation tasks, which can be used for different applications.

More specifically, four typical tasks, including label propagation, data clustering, cost-sensitive learning, and link prediction, are introduced in this chapter. The first typical task is label propagation, which is also one of the most widely used methods in machine learning. The objective of label propagation is to assign a label to each unlabeled sample. In general, label propagation on hypergraph propagates the label information from labeled vertices to unlabeled vertices through the structural information of the hyperedges. Random walk is a basic mechanism for information propagation and plays a fundamental role in this process. We then review hypergraph cuts and random-walk-based label propagation on hypergraphs. In this part, we introduce the label propagation process on a single hypergraph and on multi-hypergraphs [1, 2], respectively.

The second typical task is data clustering, which aims to group data into different clusters. We introduce how to conduct data clustering using hypergraph computation, where the hypergraph structure serves as guidance for the clustering criteria. Two types of hypergraph clustering methods are introduced, namely structural hypergraph clustering and attribute hypergraph clustering, according to the data information available in the hypergraph. In structural hypergraph clustering, only the structural information is used, while in attribute hypergraph clustering, each vertex is usually accompanied by attribute information from the real world. We introduce a hypergraph Laplacian smoothing filter and an embedding model designed specifically for hypergraph clustering tasks, named the adaptive hypergraph auto-encoder (AHGAE) [3].

The third typical task is cost-sensitive learning, which solves learning tasks in scenarios with different mis-classification costs, such as those arising from imbalanced data distributions. Here, we introduce two hypergraph computation methods, i.e., cost-sensitive hypergraph computation [4] and cost interval optimization for hypergraph computation [5]. First, we introduce a cost-sensitive hypergraph modeling method, in which the cost for different objectives is fixed in advance. As the exact cost value may not be easy to determine, we then introduce a cost interval optimization method, which only requires a cost interval rather than a precise cost value while modeling the high-order relations among the data.

The fourth typical task is link prediction, which predicts relationships among data and can be used in recommender systems and other applications. Here, link prediction on hypergraph aims to mine missing hyperedges or predict newly arriving hyperedges based on the observed hypergraph. We introduce a variational auto-encoder for heterogeneous hypergraph link prediction [6]. It aims to learn low-dimensional heterogeneous hypergraph embeddings based on a Bayesian deep generative strategy. The heterogeneous encoder generates the vertex embedding and the hyperedge embedding, and the hypergraph embedding is the combination of the two. The hypergraph decoder reconstructs the incidence matrix based on the vertex embedding and the hyperedge embedding, and the heterogeneous hypergraph is generated based on the reconstructed incidence matrix.

Part of the work introduced in this chapter has been published in [1,2,3,4,5,6].

5.2 Label Propagation on Hypergraph

This section mainly introduces the label propagation task on hypergraph. We first introduce the basic assumptions of the label propagation process. Given the set of vertices of a hypergraph, a subset of the vertices is labeled, while the other vertices are unlabeled. The task is to predict the labels of these unlabeled vertices given the known labels and the hypergraph structure. As shown in Fig. 5.1, the label propagation process propagates the label information from the labeled vertices to the unlabeled vertices.

Fig. 5.1 An illustration of the label propagation on hypergraphs

When propagating label information, vertices within the same hyperedge are more likely to have the same label, because they share similar attributes in some aspects. Under this assumption, the label propagation task can be transformed into a hypergraph cut problem. In a hypergraph cut, the goal is to make the cut hyperedges as sparse as possible while keeping each vertex set after the cut as dense as possible. After cutting the hypergraph, different sets of vertices receive different labels, which satisfies the above assumption. The hypergraph cut can be formulated as follows.

Suppose a vertex set \(S \subset \mathbb {V}\) and its complement \(\overline {S}\). A cut splits \(\mathbb {V}\) into S and \(\overline {S}\). A hyperedge e is cut if it is incident with vertices in both S and \(\overline {S}\). Define the hyperedge boundary ∂S as the set of cut hyperedges, i.e., \(\partial S = \{e\in \mathbb {E} \mid e \cap S \ne \varnothing , e\cap \overline {S} \ne \varnothing \}\), and the volume of S, vol(S), as the sum of the degrees of the vertices in S, i.e., \(vol(S) =\sum _{v\in S} {\mathbf {D}}_v(v)\). The volume of the hyperedge boundary is then defined as

$$\displaystyle \begin{aligned} vol(\partial S)=\sum_{e\in \partial S}w(e)\frac{|e\cap S||e\cap \overline{S}|}{{\mathbf{D}}_e(e)}. {} \end{aligned} $$
(5.1)

The derivation is shown as follows, and the details can be found in [7]. Suppose that each hyperedge e is expanded into a clique, i.e., a fully connected subgraph. To avoid confusion, the edges in the clique are called subedges. Then, the weight \(\frac {w(e)}{{\mathbf {D}}_e(e)}\) is assigned to each subedge. When the hyperedge e is cut, \(|e\cap S|\times |e\cap \overline {S}|\) subedges are cut, and the volume of the cut is the sum of the weights over these subedges. Recall that our goal is to make the cut hyperedges as sparse as possible while keeping each vertex set after the cut as dense as possible. Based on this goal, the partition objective is written as

$$\displaystyle \begin{aligned} \arg\min_{S \subset \mathbb{V}}c(S) = vol(\partial S)\left(\frac{1}{vol(S)} + \frac{1}{vol(\overline{S})}\right). {} \end{aligned} $$
(5.2)
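To make the cut objective concrete, the following minimal NumPy sketch (a toy example of our own, not taken from [7]) evaluates c(S) in Eq. (5.2) for a candidate partition, using the conventions \(d(v)=\sum_{e}w(e)\mathbf{H}(v,e)\) and \({\mathbf{D}}_e(e)=|e|\) implied by the surrounding derivation.

```python
# Evaluate the normalized hypergraph cut cost c(S) of Eq. (5.2) for a toy hypergraph.
import numpy as np

H = np.array([[1, 0],      # |V| x |E| incidence matrix (toy example)
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 2.0])   # hyperedge weights w(e)
d_v = H @ w                # vertex degrees d(v) = sum_e w(e) H(v, e)
d_e = H.sum(axis=0)        # hyperedge degrees D_e(e) = |e|

def cut_cost(S_mask):
    """c(S) = vol(dS) * (1/vol(S) + 1/vol(S_bar)), Eqs. (5.1)-(5.2)."""
    in_S = H[S_mask].sum(axis=0)           # |e ∩ S| for every hyperedge
    in_Sbar = H[~S_mask].sum(axis=0)       # |e ∩ S_bar|
    vol_boundary = np.sum(w * in_S * in_Sbar / d_e)
    vol_S, vol_Sbar = d_v[S_mask].sum(), d_v[~S_mask].sum()
    return vol_boundary * (1.0 / vol_S + 1.0 / vol_Sbar)

S = np.array([True, True, True, False, False])
print(cut_cost(S))
```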

There are many methods to propagate label information on a hypergraph, among which propagation based on random walks is the most widely used. The following describes label propagation by random walk, and an illustration is shown in Fig. 5.2. Suppose that the current position is \(u \in \mathbb {V}\). We first choose a hyperedge e among all hyperedges incident with u with probability proportional to w(e), and then we sample a vertex v ∈ e uniformly. Generalizing from typical random walks on graphs, we use P as the transition probability matrix of the random walk on a hypergraph, and the element p(u, v) is defined as follows:

$$\displaystyle \begin{aligned} p(u,v)=\sum_{e\in \mathbb{E}}w(e)\frac{\mathbf{H}(u,e)}{{\mathbf{D}}_v(u)}\frac{\mathbf{H}(v,e)}{{\mathbf{D}}_e(e)}. \end{aligned} $$
(5.3)
Fig. 5.2 An illustration of the hypergraph label propagation based on random walks

The formula can be organized into a matrix form as \(\mathbf {P}={\mathbf {D}}_v^{-1}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top \). The stationary distribution π of the random walk is defined as

$$\displaystyle \begin{aligned} \pi(v)=\frac{d(v)}{vol(\mathbb{V})}, {} \end{aligned} $$
(5.4)

where \({\mathbf {D}}_v(v)\) is denoted by d(v) for short, and vol(·) is the volume of a vertex set, e.g., \(vol(S) =\sum _{v\in S} d(v)\). The stationarity of π can be verified as follows:

$$\displaystyle \begin{aligned} \sum_{u\in \mathbb{V}}\pi(u)p(u,v) =& \sum_{u\in \mathbb{V}}\frac{d(u)}{vol({\mathbb{V}})}\sum_{e\in \mathbb{E}}w(e)\frac{\mathbf{H}(u,e)}{{\mathbf{D}}_v(u)}\frac{\mathbf{H}(v,e)}{{\mathbf{D}}_e(e)} \\ =& \frac{1}{vol(\mathbb{V})}\sum_{e\in \mathbb{E}}w(e)\frac{\mathbf{H}(v,e)}{{\mathbf{D}}_e(e)}\sum_{u\in \mathbb{V}}\mathbf{H}(u,e)\\ =&\frac{1}{vol(\mathbb{V})}\sum_{e\in \mathbb{E}}w(e)\mathbf{H}(v,e)=\frac{d(v)}{vol(\mathbb{V})}. \end{aligned} $$
(5.5)
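The transition matrix and the stationary distribution can also be checked numerically. The sketch below (a toy example of our own, assuming NumPy) builds \(\mathbf{P}={\mathbf{D}}_v^{-1}\mathbf{H}\mathbf{W}{\mathbf{D}}_e^{-1}{\mathbf{H}}^\top\) from Eq. (5.3) and verifies that π(v) = d(v)∕vol(𝕍) in Eq. (5.4) is indeed stationary.

```python
# Build the random-walk transition matrix and check the stationary distribution.
import numpy as np

H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 2.0])
d_v = H @ w                          # d(v) = sum_e w(e) H(v, e)
d_e = H.sum(axis=0)                  # delta(e) = |e|

P = np.diag(1 / d_v) @ H @ np.diag(w) @ np.diag(1 / d_e) @ H.T   # Eq. (5.3)
pi = d_v / d_v.sum()                 # stationary distribution of Eq. (5.4)

print(np.allclose(P.sum(axis=1), 1))  # True: each row of P sums to 1
print(np.allclose(pi @ P, pi))        # True: pi is stationary, as in Eq. (5.5)
```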

The objective function Eq. (5.2) can be written as

$$\displaystyle \begin{aligned} c(S)=\frac{vol(\partial S)}{vol(\mathbb{V})}\left(\frac{1}{vol(S)/vol(\mathbb{V})} + \frac{1}{vol(\overline{S})/vol(\mathbb{V})}\right), \end{aligned} $$
(5.6)

and then we arrive at

$$\displaystyle \begin{aligned} \frac{vol(S)}{vol(\mathbb{V})}=\sum_{v\in S}\frac{d(v)}{vol(\mathbb{V})}=\sum_{v\in S}\pi(v), \end{aligned} $$
(5.7)

where \(\frac {vol(S)}{vol(\mathbb {V})}\) is the probability that the random walk, under the stationary distribution, visits a vertex in S. Similarly, it can be shown that

$$\displaystyle \begin{aligned} \frac{vol(\partial S)}{vol(\mathbb{V})} = &\sum_{e\in \partial S}\frac{w(e)}{vol(\mathbb{V})}\frac{|e\cap S||e\cap \overline{S}|}{\delta(e)} \\ =& \sum_{e\in \partial S}\sum_{u\in e\cap S}\sum_{v\in e \cap \overline{S}}\frac{w(e)}{vol(\mathbb{V})}\frac{\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)}\\ =& \sum_{e\in \partial S}\sum_{u\in e\cap S}\sum_{v\in e \cap \overline{S}}w(e)\frac{d(u)}{vol(\mathbb{V})}\frac{\mathbf{H}(u,e)}{d(u)}\frac{\mathbf{H}(v,e)}{\delta(e)}\\ =& \sum_{u\in S}\sum_{v\in \overline{S}}\frac{d(u)}{vol(\mathbb{V})}\sum_{e\in \partial S}w(e)\frac{\mathbf{H}(u,e)}{d(u)}\frac{\mathbf{H}(v,e)}{\delta(e)}\\ =& \sum_{u\in S}\sum_{v\in \overline{S}}\pi(u)p(u,v), \end{aligned} $$
(5.8)

where the ratio \(\frac {vol(\partial S)}{vol(\mathbb {V})}\) is the probability that, under the stationary distribution, the random walk moves from a vertex in S to a vertex in \(\overline {S}\). It can be seen that the hypergraph normalized cut criterion searches for a cut such that the probability with which the random walk crosses different clusters is as small as possible, while the probability with which the random walk stays within the same cluster is as large as possible.

Let us review the objective function Eq. (5.2). This combinatorial problem is NP-complete, but it can be relaxed into the following optimization problem:

$$\displaystyle \begin{aligned} &\arg\min_{\mathbf f\in\mathbb{R}^{|V|}}\varOmega(\mathbf f)=\frac{1}{2}\sum_{e\in\mathbb{E}}\sum_{\{u,v\}\in e}\frac{w(e)}{\delta(e)}\left(\frac{\mathbf f(u)}{\sqrt{d(u)}} - \frac{\mathbf f(v)}{\sqrt{d(v)}}\right)^2,\\ &s.t.~~\sum_{v\in \mathbb{V}}\mathbf f^2(v)=1, \sum_{v\in \mathbb{V}}\mathbf f(v)\sqrt{d(v)}=0, {} \end{aligned} $$
(5.9)

where f is the to-be-learned score vector. Since the goal is label propagation, label information is available for some of the vertices, and the optimization problem becomes the following transductive inference problem:

$$\displaystyle \begin{aligned} \arg\min_{\mathbf f\in \mathbb{R}^{|\mathbb{V}|}}\{\varOmega(\mathbf f) + \lambda R_{emp}(\mathbf f)\}, \end{aligned} $$
(5.10)

where the regularizer term is Ω(f), the empirical loss term is \(R_{emp}(\mathbf f)=\|\mathbf f-\mathbf y\|{ }^2=\sum _{v\in \mathbb {V}}(\mathbf f(v) - \mathbf y(v))^2\), \(\mathbf y\in \mathbb {R}^{|\mathbb {V}|}\) is the label vector, and λ is the balance parameter. For example, if only the i-th vertex is labeled, the elements of y are all 0 except the i-th element, which is 1. The regularizer Ω(f) can be rewritten as

$$\displaystyle \begin{aligned} \varOmega(\mathbf f) =& \frac{1}{2}\sum_{e\in\mathbb{E}}\sum_{\{u,v\}\in e}\frac{w(e)}{\delta(e)}\left(\frac{\mathbf f(u)}{\sqrt{d(u)}} - \frac{\mathbf f(v)}{\sqrt{d(v)}}\right)^2\\ =& \sum_{e\in\mathbb{E}}\sum_{u,v\in \mathbb{V}}\frac{w(e)\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)}\left(\frac{\mathbf f^2(u)}{d(u)} - \frac{\mathbf f(u)\mathbf f(v)}{\sqrt{d(u)d(v)}}\right)\\ =& \sum_{u\in \mathbb{V}}\mathbf f^2(u)\sum_{e\in \mathbb{E}}\frac{w(e)\mathbf{H}(u,e)}{d(u)}\sum_{v\in \mathbb{V}}\frac{\mathbf{H}(v,e)}{\delta(e)}\\& - \sum_{e\in \mathbb{E}}\sum_{u,v\in \mathbb{V}}\frac{\mathbf f(u)\mathbf{H}(u,e)w(e)\mathbf{H}(v,e)\mathbf f(v)}{\sqrt{d(u)d(v)}\delta(e)}\\ =& \mathbf f^\top(\mathbf{I} - \varTheta) \mathbf f, \end{aligned} $$
(5.11)

where \(\varTheta ={\mathbf {D}}_v^{-\frac {1}{2}}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top {\mathbf {D}}_v^{-\frac {1}{2}}\). The hypergraph Laplacian is denoted by Λ = I − Θ. Therefore, the objective function can be rewritten as

$$\displaystyle \begin{aligned} \varOmega(\mathbf f)=\mathbf f^\top\varLambda \mathbf f. \end{aligned} $$
(5.12)

The optimization function can be turned into

$$\displaystyle \begin{aligned} \arg\min_{\mathbf f\in \mathbb{R}^{|\mathbb{V}|}} \{\mathbf f^\top\varLambda \mathbf f + \lambda\|\mathbf f-\mathbf y\|{}^2\}. {} \end{aligned} $$
(5.13)

There are two ways to solve the above problem. The first is to differentiate the objective function in Eq. (5.13) with respect to f and set the derivative to zero, which gives

$$\displaystyle \begin{aligned} \mathbf f=\left(\mathbf{I}+\frac{1}{\lambda}\varLambda\right)^{-1}\mathbf y. {} \end{aligned} $$
(5.14)

The second one is an iterative method. Similar to the iterative approach in [8], Eq. (5.13) can be efficiently solved by an iterative process, which is illustrated in Fig. 5.3. In each iteration, \(\mathbf f^{(t+1)}\) is obtained from the previous iterate \(\mathbf f^{(t)}\) and y as \(\mathbf f^{(t+1)} = \frac{1}{1+\lambda}\varTheta\mathbf f^{(t)} + \frac{\lambda}{1+\lambda}\mathbf y\), and the procedure is repeated until convergence.

Fig. 5.3 The iterative solution of Eq. (5.13). This figure is from [1]

This process converges to the solution in Eq. (5.14). To prove it, we first show that the eigenvalues of Θ lie in [−1, 1]. Since \(\varTheta ={\mathbf {D}}_v^{-1/2}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top {\mathbf {D}}_v^{-1/2}\) is similar to the transition matrix \(\mathbf {P}={\mathbf {D}}_v^{-1}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top \), which is row-stochastic and hence has spectral radius at most 1, the eigenvalues of Θ are in [−1, 1]. Therefore, (I ± Θ) are positive semi-definite.

The convergence of the iterative process is proved in [1]. Without loss of generality, we assume f (0) = y. From the iterative process, it can be obtained that

$$\displaystyle \begin{aligned} {\mathbf f^{(t)}}=&\left({{\lambda}\over{1+\lambda}}\right)\sum_{i=0}^{t-1}{{\left({{{1}\over{1+\lambda}}{{\varTheta}}}\right)^{i}}\mathbf y}{+\left({{{1}\over{1+\lambda}}{{\varTheta}}}\right)^{t}}\mathbf y\\=&(1-\zeta)\sum_{i=0}^{t-1}{{{(\zeta{{\varTheta}})}^{i}}\mathbf y}+{(\zeta{{\varTheta}})^{t}}\mathbf y, \end{aligned} $$
(5.15)

where \(\zeta =\frac {1}{1+\lambda }\). Since 0 < ζ < 1, and the eigenvalues of Θ are in [−1, 1], it can be derived that

$$\displaystyle \begin{aligned} \mathop{\lim}\limits_{t\to\infty}{(\zeta{{\varTheta}})^{t}}=0 \end{aligned} $$
(5.16)

and

$$\displaystyle \begin{aligned} \mathop{\lim}\limits_{t\to\infty}\sum_{i=0}^{t-1}{{(\zeta{{\varTheta}})}^{i}}={(\mathbf{I}-\zeta{{\varTheta}})^{-1}}. \end{aligned} $$
(5.17)

Then, it turns out

$$\displaystyle \begin{aligned} \mathbf f=\mathop{\lim}\limits_{t\to\infty}{\mathbf f^{(t)}}=(1-\zeta){(\mathbf{I}-\zeta{{\varTheta}})^{-1}}\mathbf y={\left({\mathbf{I}+{{1}\over{\lambda}}{{\varLambda}}}\right)^{-1}}\mathbf y. \end{aligned} $$
(5.18)

Therefore, the iterative process is proved to converge to the closed-form solution in Eq. (5.14).
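This equivalence can also be checked numerically. The following sketch (a toy example of our own, assuming NumPy) computes Eq. (5.14) directly and runs the iteration derived from Eq. (5.15); the two results coincide.

```python
# Compare the closed-form solution of Eq. (5.14) with the iterative process of Fig. 5.3.
import numpy as np

H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 2.0])
d_v = H @ w
d_e = H.sum(axis=0)

Dv_isqrt = np.diag(1 / np.sqrt(d_v))
Theta = Dv_isqrt @ H @ np.diag(w) @ np.diag(1 / d_e) @ H.T @ Dv_isqrt
Lap = np.eye(len(d_v)) - Theta                 # hypergraph Laplacian Lambda = I - Theta

lam = 0.5                                      # balance parameter lambda
y = np.array([1.0, 0, 0, 0, 0])                # vertex 0 is labeled
f_closed = np.linalg.solve(np.eye(len(d_v)) + Lap / lam, y)   # Eq. (5.14)

zeta = 1 / (1 + lam)
f = y.copy()
for _ in range(1000):                          # iterative process of Eq. (5.15)
    f = zeta * Theta @ f + (1 - zeta) * y

print(np.allclose(f, f_closed))                # True: both solutions coincide
```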

The random-walk-based method is the most commonly used approach in label propagation on hypergraphs. It has the advantages of being simple to implement and theoretically verifiable.

In many cases, different hypergraphs may be generated based on different criteria. Under such circumstances, we need to conduct label propagation on a multi-hypergraph. Here, we briefly introduce the cross-diffusion method on multi-hypergraph [2]. We assume that there are T hypergraphs, and the t-th hypergraph is denoted as \(\mathbb {G}^t=(\mathbb {V}^t, \mathbb {E}^t, {\mathbf {W}}^t)\), where \(\mathbb {V}^t\) is the vertex set, \(\mathbb {E}^t\) is the hyperedge set, and W t is a diagonal matrix representing the weights of the hyperedges.

The transition matrix is first generated for each hypergraph. The label propagation process on hypergraph is based on the assumption that local similarities can approximate long-range similarities, and therefore, similarities between nearby vertices are more informative than those between far-away vertices. The similarity matrix among the vertices of the t-th hypergraph is defined as follows:

$$\displaystyle \begin{aligned} {\varLambda^t}(u,v)=\sum_{e\in \mathbb{E}^t} \frac{{\mathbf{W}}^t(e) {\mathbf{H}}^t(u,e) {\mathbf{H}}^t(v,e) }{\delta (e)}, \end{aligned} $$
(5.19)

or in the matrix form:

$$\displaystyle \begin{aligned} \varLambda^t = {\mathbf{H}}^t{\mathbf{W}}^t{{\mathbf{D}}_e^{t}}^{-1}{{\mathbf{H}}^t}^\top. \end{aligned} $$
(5.20)

The transition matrix P t is the normalized similarity matrix:

$$\displaystyle \begin{aligned} {\mathbf{P}}^t(i,j)=\frac{\varLambda^t(i,j) }{ \sum_{w\in \mathbb{V}^t} \varLambda^t(i,w) } \end{aligned} $$
(5.21)

and

$$\displaystyle \begin{aligned} {\mathbf{P}}^t={{\mathbf{D}}^t}^{-1}\varLambda^t, \end{aligned} $$
(5.22)

where D t is a diagonal matrix with the i-th diagonal element \({\mathbf {D}}^t(i,i)=\sum _{j=1}^{|\mathbb {V}^t|}\varLambda ^t(i,j)\).

The element P t(i, j) of the transition matrix represents the probability of transitioning from vertex i to vertex j, and P t can be regarded as a Parzen window estimator on the hypergraph structure. After the generation of the transition matrices, the cross label propagation process is applied to the multi-hypergraph structure.

Denote Y 0 as the initial label matrix. For labeled vertices, the i-th row of Y 0 is the one-hot label of the i-th vertex, while for the unlabeled vertices, all elements of the i-th row are 0.5, indicating that there is no prior knowledge of the label. We denote the labeled part of the initial label matrix as \({\mathbf {Y}}_0^L\).

For simplicity, we assume the number of hypergraphs T is 2. The label propagation process for multi-hypergraph uses the output of one hypergraph as the input of the other hypergraph, and this procedure is repeated until the output converges. The process can be formulated as

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^1 &\leftarrow {\mathbf{P}}^1{\mathbf{Y}}_d^2, \end{aligned} $$
(5.23)
$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^{1L} &\leftarrow {\mathbf{Y}}_0^L \end{aligned} $$
(5.24)

and

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^2 &\leftarrow {\mathbf{P}}^2{\mathbf{Y}}_d^1, \end{aligned} $$
(5.25)
$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^{2L} &\leftarrow {\mathbf{Y}}_0^L, \end{aligned} $$
(5.26)

where \({\mathbf {Y}}_d^k\) denotes the label matrix of the k-th hypergraph after d rounds of label propagation, and \({\mathbf {Y}}_{d+1}^{kL}\) denotes its labeled part. This process is shown in Fig. 5.4.

Fig. 5.4 An illustration of the diffusion process on multi-hypergraph. This figure is from [2]

The overall label matrix is calculated by averaging the label matrices of all hypergraphs after convergence:

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{final} = \frac 1T \sum_{i=1}^T {\mathbf{Y}}_d^i. \end{aligned} $$
(5.27)

For more complicated scenarios where more than two hypergraphs are available, the label propagation process proceeds in the same way, with the output of one hypergraph used as the input of the other hypergraphs. A minimal code sketch of the two-hypergraph case is given below.
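The following NumPy sketch implements the cross diffusion of Eqs. (5.23)–(5.27) for two hypergraphs sharing the same vertex set; the toy incidence matrices, hyperedge weights, and number of iterations are our own illustrative choices, not values from [2].

```python
# Cross-diffusion label propagation on two hypergraphs, Eqs. (5.20)-(5.27).
import numpy as np

def transition(H, w):
    """P = D^{-1} H W D_e^{-1} H^T, Eqs. (5.20)-(5.22)."""
    d_e = H.sum(axis=0)
    Lam = H @ np.diag(w) @ np.diag(1 / d_e) @ H.T
    return Lam / Lam.sum(axis=1, keepdims=True)

def cross_diffusion(H1, w1, H2, w2, Y0, labeled, n_iter=50):
    P1, P2 = transition(H1, w1), transition(H2, w2)
    Y1, Y2 = Y0.copy(), Y0.copy()
    for _ in range(n_iter):
        Y1_new = P1 @ Y2                      # Eq. (5.23)
        Y2_new = P2 @ Y1                      # Eq. (5.25)
        Y1_new[labeled] = Y0[labeled]         # Eq. (5.24): clamp the labeled part
        Y2_new[labeled] = Y0[labeled]         # Eq. (5.26)
        Y1, Y2 = Y1_new, Y2_new
    return (Y1 + Y2) / 2                      # Eq. (5.27)

H1 = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
H2 = np.array([[1, 1], [1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
Y0 = np.full((5, 2), 0.5)                     # 0.5 means no prior label knowledge
Y0[0], Y0[4] = [1.0, 0.0], [0.0, 1.0]         # vertices 0 and 4 are labeled
print(cross_diffusion(H1, np.ones(2), H2, np.ones(2), Y0, labeled=np.array([0, 4])))
```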

This diffusion process can also be applied to a single hypergraph, and the corresponding framework is illustrated in Fig. 5.5.

Fig. 5.5 An illustration of the diffusion process on a single hypergraph

5.3 Data Clustering on Hypergraph

Data clustering is a typical machine learning task that aims to group data into clusters. In this section, we introduce hypergraph-based data clustering methods, which utilize the hypergraph structure to better discover the correlations behind the data. Hypergraph clustering methods can be divided into two types, structural hypergraph clustering and attribute hypergraph clustering, according to the data information available in the hypergraph. In structural hypergraph clustering, only the structural information is used. For example, the hypergraph spectral clustering method [7] extends graph spectral clustering and uses the hypergraph Laplacian to learn complex relations between the vertices of the hypergraph. Auto-encoder-based techniques [9] have also been applied to structural clustering. In attribute hypergraph clustering, each vertex is usually accompanied by attribute information from the real world. There are two assumptions as follows:

  • Vertices in the same hyperedge have similar attributes.

  • Vertices with similar features have similar attributes.

How to balance the graph structure information and the node feature information is a key focus of attributed graph clustering [10]. In the same way, hypergraph clustering can utilize the features, attributes, and structural information of vertices to conduct the data clustering task.

In this section, we introduce a hypergraph Laplacian smoothing filter and an embedding model called the adaptive hypergraph auto-encoder (AHGAE), which is designed specifically for hypergraph clustering tasks [3]. First, we describe the hypergraph Laplacian smoothing filter and derive its low-pass filtering property in the frequency domain. Then, we analyze the influence of each vertex on the attributes of its connected hyperedges and on the features of its neighboring vertices. Finally, we introduce the detailed procedure and framework of the adaptive hypergraph auto-encoder.

The hypergraph Laplacian smoothing filter, as shown in Fig. 5.6, first merges the vertex features into hyperedge features, and the feature of hyperedge e k is defined as

$$\displaystyle \begin{aligned} {\mathbf{E}}_{k}^{(t)} &=\frac{1}{\left|N\left(e_{k}\right)\right|} \sum_{v_{j} \in N\left(e_{k}\right)} {\mathbf{X}}_{j}^{(t)}=\sum_{v_{j} \in \mathbb{V}} \frac{h(j, k)}{d_{e}(k)} {\mathbf{X}}_{j}^{(t)}, \end{aligned} $$
(5.28)

where e k denotes the k-th hyperedge in the hyperedge set \(\mathbb {E}\), v i denotes the i-th vertex in the vertex set \(\mathbb {V}\), t denotes the filtering order, \(N\left (e_{k}\right )\) is the set of vertices in hyperedge e k, E k denotes the feature of hyperedge e k, and X j denotes the feature of vertex v j.

Fig. 5.6 An illustration of the hypergraph Laplacian smoothing filter. This figure is from [3]

After aggregating the vertex features to obtain the hyperedge features, we further aggregate the hyperedge features back to the vertices according to the hyperedge weights:

$$\displaystyle \begin{aligned} {\mathbf{X}}_{i}^{(t+1)} &=(1-\gamma) {\mathbf{X}}_{i}^{(t)}+\gamma \sum_{e_{k} \in N\left(v_{i}\right)} \frac{h(i, k) w(k)}{d_{v}(i)} {\mathbf{E}}_{k}^{(t)} \\ &=(1-\gamma) {\mathbf{X}}_{i}^{(t)}+\gamma \sum_{v_{j} \in \mathbb{V}} \sum_{e_{k} \in \mathbb{E}} \frac{h(i, k) w(k) h(j, k)}{d_{v}(i) d_{e}(k)} {\mathbf{X}}_{j}^{(t)},\\ {\mathbf{X}}^{(t+1)}&=(1-\gamma) {\mathbf{X}}^{(t)}+\gamma {\mathbf{D}}_{v}^{-1} \mathbf{H W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{X}}^{(t)}, \end{aligned} $$
(5.29)

where N(v) represents the set of hyperedges connected to vertex v, and γ ∈ [0, 1] is the weight coefficient of the filter. D v denotes the diagonal matrix of the vertex degrees, D e denotes the diagonal matrix of the hyperedge degrees, and H is the incidence matrix of the hypergraph. To obtain a symmetric operator whose spectral radius is no larger than 1, we can replace \({\mathbf {D}}_{\mathrm {v}}^{-1} \mathbf {HWD}_{e}^{-1} {\mathbf {H}}^\top \) with its symmetric normalized form:

$$\displaystyle \begin{aligned} {\mathbf{X}}^{(t+1)} &=(1-\gamma) {\mathbf{X}}^{(t)}+\gamma {\mathbf{D}}_{v}^{-1 / 2} \mathbf{H W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^\top {\mathbf{D}}_{v}^{-1 / 2} {\mathbf{X}}^{(t)} \\ &={\mathbf{X}}^{(t)}-\gamma\left(\mathbf{I}-{\mathbf{D}}_{v}^{-1 / 2} \mathbf{H W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^\top {\mathbf{D}}_{v}^{-1 / 2}\right) {\mathbf{X}}^{(t)}. \end{aligned} $$
(5.30)

Then, the multi-order hypergraph Laplacian smoothing filter can be written as

$$\displaystyle \begin{aligned} {\mathbf{X}}^{(t)}=(\mathbf{I}-\gamma \mathbf{L})^{t} \mathbf{X}. \end{aligned} $$
(5.31)

By the eigendecomposition of the hypergraph Laplacian \(\mathbf{L}=\mathbf{U} \boldsymbol{\varLambda} \mathbf{U}^{-1}\), the diagonal elements of the diagonal matrix Λ are the eigenvalues of L. The frequency response function is

$$\displaystyle \begin{aligned} p(\boldsymbol{\varLambda})=\operatorname{diag}\left(p\left(\lambda_{1}\right), \ldots, p\left(\lambda_{|\mathbb{V}|}\right)\right), \end{aligned} $$
(5.32)
$$\displaystyle \begin{aligned} p(\lambda)=1-\gamma \lambda, \gamma \in[0,1]. \end{aligned} $$
(5.33)

Since the eigenvalues of the hypergraph Laplacian satisfy λ ∈ [0, 1], p(Λ) is a positive semi-definite matrix, and the value of p(λ) decreases as λ increases. Therefore, the hypergraph Laplacian smoothing filter can effectively suppress high-frequency signals:

$$\displaystyle \begin{aligned} \mathbf{F}=\mathbf{U} p(\boldsymbol{\varLambda}) {\mathbf{U}}^{-1}=\mathbf{U}(\mathbf{I}-\gamma \boldsymbol{\varLambda}) {\mathbf{U}}^{-1}=\mathbf{I}-\gamma \mathbf{L}. \end{aligned} $$
(5.34)
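The following minimal NumPy sketch applies the multi-order smoothing filter of Eq. (5.31) using the symmetric normalized form of Eq. (5.30); the toy incidence matrix, the weight coefficient γ, and the filtering order t are our own illustrative assumptions.

```python
# Multi-order hypergraph Laplacian smoothing filter, Eqs. (5.30)-(5.31) and (5.34).
import numpy as np

def smooth(H, w, X, gamma=0.5, t=2):
    d_v = H @ w                                  # vertex degrees
    d_e = H.sum(axis=0)                          # hyperedge degrees
    Dv_isqrt = np.diag(1 / np.sqrt(d_v))
    Theta = Dv_isqrt @ H @ np.diag(w) @ np.diag(1 / d_e) @ H.T @ Dv_isqrt
    L = np.eye(H.shape[0]) - Theta               # hypergraph Laplacian
    F = np.eye(H.shape[0]) - gamma * L           # low-pass filter F = I - gamma*L, Eq. (5.34)
    for _ in range(t):                           # apply the filter t times, Eq. (5.31)
        X = F @ X
    return X

H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
X_sm = smooth(H, np.ones(2), np.random.rand(5, 8))
```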

Figure 5.7 illustrates how the relational reconstruction auto-encoder, applied after obtaining the smoothed feature matrix, learns low-dimensional vertex representations without losing information. First, the incidence matrix is used to generate the adjacency matrix:

$$\displaystyle \begin{aligned} \mathbf{A}=\varepsilon\left(\mathbf{H H}^{\top}\right), \end{aligned} $$
(5.35)
$$\displaystyle \begin{aligned} \varepsilon(x)=\left\{\begin{array}{ll} 1, & x>0 \\ 0, & x=0 \end{array}.\right. \end{aligned} $$
(5.36)
Fig. 5.7 The framework of the adaptive hypergraph auto-encoder. This figure is from [3]

A single fully connected layer is used to compress the filtered feature matrix:

$$\displaystyle \begin{aligned} \mathbf{Z}=\operatorname{scale}\left({\mathbf{X}}_{\mathbf{s m}} \boldsymbol{\varTheta}\right), \end{aligned} $$
(5.37)
$$\displaystyle \begin{aligned} \operatorname{scale}(\mathbf{x})=\frac{\mathbf{x}-\min (\mathbf{x})}{\max (\mathbf{x})-\min (\mathbf{x})}, \end{aligned} $$
(5.38)

where Z represents the vertex embedding matrix, which includes both structural and feature information, and Θ is a learnable parameter used to extract features from the vertices. scale(⋅) is a normalization function that rescales the vertex features to the range [0, 1]. The similarity matrix of the vertex features is then computed as

$$\displaystyle \begin{aligned} \mathbf{S}=\operatorname{sigmoid}\left(\mathbf{Z Z}^\top\right), \end{aligned} $$
(5.39)
$$\displaystyle \begin{aligned} \operatorname{sigmoid}(x)=\frac{1}{1+e^{-x}}. \end{aligned} $$
(5.40)

The inner product decoder is used to reconstruct each vertex and its neighbors, and the objective is to minimize the error between the adjacency matrix A and the similarity matrix S. However, using Eq. (5.35) to construct the adjacency matrix leads to a problem: the number of edges grows rapidly as the hyperedge degrees increase. To solve this problem, the elements of matrix A are weighted as

$$\displaystyle \begin{aligned} {\mathbf{W}}_{i j}=\left\{\begin{array}{cc} \frac{|\mathbb{V}|{}^{2}-\sum \sum {\mathbf{A}}_{i j}}{\sum \sum {\mathbf{A}}_{i j}} & , {\mathbf{A}}_{i j}=1 \\ 1 & , {\mathbf{A}}_{i j}=0 \end{array} .\right. \end{aligned} $$
(5.41)

The reconstruction loss can be calculated by using the weighted binary cross-entropy function:

$$\displaystyle \begin{aligned} L_{r e}=\frac{1}{|\mathbb{V}|{}^{2}} \sum_{i=1}^{|\mathbb{V}|} \sum_{j=1}^{|\mathbb{V}|}-{\mathbf{W}}_{i j}\left[{\mathbf{A}}_{i j} \log {\mathbf{S}}_{i j}+\left(1-{\mathbf{A}}_{i j}\right) \log \left(1-{\mathbf{S}}_{i j}\right)\right]. \end{aligned} $$
(5.42)

The relational reconstruction auto-encoder can be trained to produce the learned vertex embeddings, and the spectral clustering technique can be further used to obtain the final clustering results.
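A minimal NumPy sketch of the reconstruction objective in Eqs. (5.35)–(5.42) is given below. It only evaluates the loss with a random stand-in for the learnable parameter Θ; the actual AHGAE trains Θ by gradient descent, which is omitted here.

```python
# Relational reconstruction objective of AHGAE, Eqs. (5.35)-(5.42).
import numpy as np

def ahgae_loss(H, X_sm, dim=16, rng=np.random.default_rng(0)):
    A = (H @ H.T > 0).astype(float)                       # adjacency, Eqs. (5.35)-(5.36)
    n = A.shape[0]
    Theta = rng.standard_normal((X_sm.shape[1], dim))     # stand-in for the learnable layer
    Z = X_sm @ Theta                                      # Eq. (5.37)
    Z = (Z - Z.min()) / (Z.max() - Z.min())               # scale(.), Eq. (5.38)
    S = 1 / (1 + np.exp(-Z @ Z.T))                        # similarity, Eqs. (5.39)-(5.40)
    n_pos = A.sum()
    W = np.where(A == 1, (n**2 - n_pos) / n_pos, 1.0)     # edge weights, Eq. (5.41)
    bce = -W * (A * np.log(S) + (1 - A) * np.log(1 - S))  # weighted BCE, Eq. (5.42)
    return bce.mean()

H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
print(ahgae_loss(H, np.random.rand(5, 8)))
```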

5.4 Cost-Sensitive Learning on Hypergraph

Many machine learning applications involve cost-sensitive scenarios, where different types of errors in real-world tasks result in losses of varying severity. In diagnostic work, for example, misdiagnosing a patient as a healthy person is significantly more costly than classifying a healthy individual as a patient, as shown in Fig. 5.8. Similar cases also arise in software defect prediction, where misjudging a defective software module as a good one may break the software system and have disastrous repercussions. Cost-sensitive learning methods [11,12,13] have been developed to deal with such issues.

Fig. 5.8 A medical example of a cost-sensitive classification scenario

In many cases, the data from some categories may be abundant, while the data from other categories may be very limited. Such imbalanced data distributions lead to different costs for the classification performance of different categories. Under such circumstances, imbalanced learning [13, 14] has attracted much attention, which aims to learn an accurate predictor from imbalanced sample data. In traditional methods, sampling strategies [15, 16] are used to over-sample the minority class and under-sample the majority class to address the imbalanced sample problem. Another way is to conduct cost-sensitive learning, which focuses more on the minority class.

To address the cost-sensitive issue in hypergraph computation, in this section we introduce the cost-sensitive hypergraph computation framework [4] and cost interval optimization for hypergraph computation [5], respectively. First, we describe how to quantify cost in the hypergraph modeling procedure [4], in which a fixed cost value is provided for modeling, and thereafter, we illustrate how to use the cost-sensitive hypergraph computation approach to tackle imbalanced problems. As a precise cost value for mis-classification may not be available in practice, we then introduce the hypergraph computation method with cost interval optimization [5], which only requires a cost interval while modeling the high-order relations among the data. Figure 5.9 shows the frameworks of hypergraph computation under cost-sensitive scenarios, from traditional hypergraph modeling and hypergraph modeling with a cost matrix to hypergraph modeling with a cost interval.

Fig. 5.9 The frameworks of hypergraph computation under cost-sensitive scenarios

(1) Cost-Sensitive Hypergraph Computation

In this part, we introduce a cost-sensitive hypergraph computation method [4], and Fig. 5.10 shows the framework of this method. The framework consists of two stages to handle the cost-sensitive issue: in the first stage, the F-measure is used to calculate candidate cost information for cost-sensitive learning, and in the second stage, the hypergraph structure is utilized to model the high-order correlations among the data.

Fig. 5.10 The framework of cost-sensitive hypergraph computation. This figure is from [4]

First, we introduce hypergraph modeling with a cost matrix. In traditional hypergraph modeling, each vertex represents a subject, and the hyperedges connect related vertices. To introduce cost information into hypergraph modeling, a cost matrix is associated with each vertex, indicating different costs for mis-classification, as shown in Fig. 5.11 for a binary classification task. The definition of the cost matrix is as follows.

Fig. 5.11 An illustration of hypergraph modeling with cost matrix

As shown in Fig. 5.11, the cost matrix is a 2 × 2 matrix, including the true positive cost C TP, the true negative cost C TN, the false positive cost C FP, and the false negative cost C FN. The true positive cost and the true negative cost are usually 0 since they correspond to correct predictions. The preference of the cost-sensitive hypergraph for each class is controlled by assigning different values to the false positive cost and the false negative cost in the cost matrix. As a special case, if the false positive cost and the false negative cost are equal, the cost-sensitive hypergraph reduces to traditional hypergraph modeling.

We first generate candidate cost information and then apply the F-measure to optimize the cost for both binary and multi-class data. For a classifier h, we can define the error profile as

$$\displaystyle \begin{aligned} \varPsi(h)=\left(\mathrm{FN}_{1}(h), \mathrm{FP}_{1}(h), \ldots, \mathrm{FN}_{N_{c}}(h), \mathrm{FP}_{N_{c}}(h)\right), \end{aligned} $$
(5.43)

where N c represents the number of classes, and FN and FP represent the false negative and false positive probabilities. For simplicity, we let ψ 2k−1 represent the FN probability of the k-th class and ψ 2k represent the FP probability of the k-th class. The F-measure for binary classification can be defined as

$$\displaystyle \begin{aligned} F_{\beta}(\varPsi)=\frac{\left(1+\beta^{2}\right)\left(P_{1}-\psi_{1}\right)}{\left(1+\beta^{2}\right) P_{1}-\psi_{1}+\psi_{2}}, \end{aligned} $$
(5.44)

where P k represents the marginal probability of class k. Similarly, the micro-F-measure for multi-class classification can be defined as

$$\displaystyle \begin{aligned} m c F_{\beta}(\varPsi)=\frac{\left(1+\beta^{2}\right)\left(1-P_{1}-\sum_{k=2}^{C} \psi_{2 k-1}\right)}{\left(1+\beta^{2}\right)\left(1-P_{1}\right)-\sum_{k=2}^{C} \psi_{2 k-1}+\psi_{1}}. \end{aligned} $$
(5.45)

We can further divide the F-measure values in the region [0, 1] into a collection of equally spaced values F = {f i} to calculate the cost of various mis-classifications. The cost function Υ is then used to construct a cost vector for every f i. For binary classification, constraining the denominator of Eq. (5.44) to be positive and requiring F β(Ψ) ≤ f for a given F-measure value f ∈ F yields

$$\displaystyle \begin{aligned} \left(1+\beta^{2}-f\right) \psi_{1}+f \psi_{2}+\left(1+\beta^{2}\right) P_{1}(f-1) \geq 0 . \end{aligned} $$
(5.46)

Therefore, the costs of ψ 1 and ψ 2 can be allocated according to the coefficients 1 + β 2 − f and f, respectively, and the cost function can be written as follows:

$$\displaystyle \begin{aligned} \varUpsilon_{i}^{F_{\beta}}=\left\{\begin{array}{ll} 1+\beta^{2}-f, & \text{if sample from class 1} \\ f, & \text{if sample from class 2} \\ 0, & \text{otherwise } \end{array}\right. . \end{aligned} $$
(5.47)

Similarly, the cost function of multi-class classification can be written as follows:

$$\displaystyle \begin{aligned} \varUpsilon_{i}^{m l F_{\beta}}=\left\{\begin{array}{ll} 1+\beta^{2}-f, & \text{if sample from odd class and not from class 1} \\ f, & \text{if sample from class 1} \\ 0, & \text{otherwise} \end{array}\right. . \end{aligned} $$
(5.48)
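The candidate cost assignment of Eq. (5.47) can be sketched as follows (assuming NumPy); the grid size and the encoding of the two classes as labels 1 and 2 are our own illustrative choices.

```python
# Candidate cost vectors for binary classification, Eq. (5.47).
import numpy as np

def binary_cost_vectors(labels, beta=1.0, n_grid=10):
    """labels: array of 1s and 2s; returns one candidate cost vector per grid value f."""
    grid = np.linspace(0, 1, n_grid)
    costs = []
    for f in grid:
        cost = np.where(labels == 1, 1 + beta**2 - f,      # class 1: cost 1 + beta^2 - f
                        np.where(labels == 2, f, 0.0))      # class 2: cost f
        costs.append(cost)
    return grid, np.stack(costs)

grid, costs = binary_cost_vectors(np.array([1, 1, 2, 2, 2]))
```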

The cost obtained from F-measure optimization is incorporated into the optimization function to increase the efficacy of the hypergraph computation method on imbalanced data. We first regard each data sample as a vertex of the hypergraph and then apply the k-nearest-neighbor algorithm to construct the hyperedges. The cost-sensitive hypergraph differs in that it includes the cost matrix information of each vertex in addition to the original hypergraph correlation structure. With the training and testing samples represented by O, the cost-sensitive hypergraph computation objective can be expressed as

$$\displaystyle \begin{aligned} \arg\underset{\omega, \mathbf{W}}{ \min} &\Big \{\mu \varOmega(\omega) + \mathbb{R}_{emp}(\omega) + \lambda\varPhi(\mathbf{W}) \Big \},\\ s.t.~&\sum_{j=1}^N {\mathbf{W}}_{j,j} = 1, \forall~{\mathbf{W}}_{j,j}\geq 0,\\ \end{aligned} $$
(5.49)

where \(\varOmega(\omega) = (\mathbf{O}\omega)^\top\varDelta(\mathbf{O}\omega)\) represents the hypergraph Laplacian regularizer with the hypergraph Laplacian Δ, \(\mathbb {R}_{emp}(\omega )=\|\varUpsilon (\mathbf {O} \omega -\mathbf {y})\|{ }_{2}^{2}=\sum _{i=1}^{N}\left (\varUpsilon _{i, i}\left ({\mathbf {o}}_{i} \omega -{\mathbf {y}}_{i}\right )\right )^{2}\) is the empirical loss using cost information, where the element Υ i,i of the diagonal matrix Υ represents the cost of the i-th sample, \(\varPhi (\mathbf {W})=\|\mathbf {W}\|{ }_{\mathrm {F}}^{2}\) stands for the hypergraph regularization, ω represents the mapping vector to be learned, W is a diagonal matrix representing the hyperedge weights, and μ and λ are the trade-off hyperparameters. We first fix W to optimize ω, and then the optimization problem can be expressed as

$$\displaystyle \begin{aligned} \arg \min _{{\omega}}\left\{\|\varUpsilon(\mathbf{O} {\omega})-\mathbf{y}\|{}_{2}^{2}+\mu(\mathbf{O} {\omega})^{\top} \varDelta(\mathbf{O} {\omega})\right\}. \end{aligned} $$
(5.50)

The optimal ω can be obtained as

$$\displaystyle \begin{aligned} {\omega}=\left({\mathbf{O}}^{\top} {\varUpsilon}^{2} \mathbf{O}+\mu {\mathbf{O}}^{\top} \varDelta \mathbf{O}\right)^{-1}\left({\mathbf{O}}^{\top} {\varUpsilon} \mathbf{y}\right). \end{aligned} $$
(5.51)
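A minimal NumPy sketch of the closed-form update in Eq. (5.51) is given below; the inputs O, y, Υ, and Δ are assumed to be prepared as described above.

```python
# Closed-form solution for the mapping vector omega, Eq. (5.51).
import numpy as np

def update_omega(O, y, Upsilon, Delta, mu=0.1):
    """O: data matrix, y: labels, Upsilon: diagonal cost matrix, Delta: hypergraph Laplacian."""
    A = O.T @ Upsilon @ Upsilon @ O + mu * O.T @ Delta @ O
    return np.linalg.solve(A, O.T @ Upsilon @ y)
```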

Then, we fix ω to optimize W:

$$\displaystyle \begin{aligned} \arg\underset{\mathbf{W}}{ \min} &\Big \{\mu(\mathbf{O}\omega)^{\top} \varDelta(\mathbf{O} \omega)+\lambda\|\mathbf{W}\|{}_{{F}}^{2} \Big \}.\\ s.t.~&\sum_{j=1}^N {\mathbf{W}}_{j,j} = 1, \forall~{\mathbf{W}}_{j,j}\geq 0.\\ \end{aligned} $$
(5.52)

We can have W as

$$\displaystyle \begin{aligned} \mathbf{W}=\frac{\mu \varLambda^{\top} \varLambda\left({\mathbf{D}}_{e}\right)^{-1}-\eta \mathbf{I}}{2 \lambda}, \end{aligned} $$
(5.53)

where η can be calculated as \(\eta =\frac {\mu \varLambda \left ({\mathbf {D}}_{e}\right )^{-1} \varLambda ^{\top }-2 \lambda }{N}\), and Λ can be calculated as \(\varLambda =(\mathbf {O} {\omega })^{\top }\left ({\mathbf {D}}_{v}\right )^{-1 / 2} \mathbf {H}\). With the optimized mapping vector ω, a test sample ζ i obtains its classification result as γ = ζ i ω.

Each piece of candidate cost information c i generates a cost matrix Υ, which is then used to build a cost-sensitive hypergraph structure \(\mathbb {G}_i\). The model then selects the cost-sensitive hypergraph with the highest F-measure as the final choice.

(2) Cost Interval Optimization for Hypergraph Computation

As the cost value for cost-sensitive hypergraph modeling is not easy to determine in practice, in this part we introduce a cost interval optimization method for hypergraph computation [5], in which the fixed cost value is replaced by a cost interval, which is much easier to provide.

Given a hypergraph \(\mathbb {G}=(\mathbb {V}, \mathbb {E}, \mathbf {W})\), in order to optimize the overall cost by incorporating the mis-classification costs of the various categories into the hypergraph framework, the objective of the cost-sensitive hypergraph is composed of three components, i.e., the empirical loss using cost information, the hypergraph Laplacian regularizer, and the hypergraph regularization.

The empirical loss using cost information can be formulated as

$$\displaystyle \begin{aligned} \mathbb{R}_{emp}({\omega})=\|{\varPhi}(\mathbf{S} {\omega}-\mathbf{y})\|{}_{2}^{2}=\sum_{i=1}^{N_{v}}\left({\varPhi}_{i, i}\left({\mathbf{s}}_{i} {\omega}-{\mathbf{y}}_{i}\right)\right)^{2}, \end{aligned} $$
(5.54)

where ω represents the mapping vector, and Φ is a diagonal matrix representing mis-classification cost weights. The hypergraph Laplacian regularizer can be written as

$$\displaystyle \begin{aligned} \varOmega({\omega})&=\frac{1}{2} \sum_{e \in \mathbb{E}} \sum_{v_{i}, v_{j} \in \mathbb{V}} \frac{\mathbf{W}(e) \mathbf{H}\left(v_{i}, e\right) \mathbf{H}\left(v_{j}, e\right)}{\delta(e)}\left(\frac{{\mathbf{s}}_{i}{\omega}}{\sqrt{d\left(v_{i}\right)}}-\frac{{\mathbf{s}}_{j}{\omega}}{\sqrt{d\left(v_{j}\right)}}\right)^{2} \\ &=(\mathbf{S} {\omega})^{\top} \varDelta(\mathbf{S} {\omega}). \end{aligned} $$
(5.55)

To adjust the hyperedge weights and hence the classification ability of the hypergraph, the hypergraph regularization is written as \(\varPsi (\mathbf {W})=\|\mathbf {W}\|{ }_{{F}}^{2}\). Note that this term can be removed in applications where it is not required.

Combining the above three, the whole optimization task for cost-sensitive hypergraph computation can be written as

$$\displaystyle \begin{aligned} \arg\underset{\omega, \mathbf{W}}{ \min} &\Big \{\|{\varPhi}(\mathbf{S} {\omega}-\mathbf{y})\|{}_{2}^{2} + \mu(\mathbf{S} {\omega})^{\top} \varDelta(\mathbf{S} {\omega}) + \lambda\|\mathbf{W}\|{}_{\mathrm{F}}^{2} \Big \},\\ s.t.~&\sum_{j=1}^{N_e} {\mathbf{W}}_{j,j} = 1, \forall~{\mathbf{W}}_{j,j}\geq 0,\\ \end{aligned} $$
(5.56)

where μ and λ are the trade-off hyperparameters.

The precise cost of each category is required for cost-sensitive hypergraph computation, but this cost is frequently impossible to obtain; often it is only known that the cost lies within a cost interval [C min, C max]. Therefore, a simple idea is to try all values inside the cost interval and minimize the overall cost. However, this is inefficient given the possibly large cost interval. As the actual cost is difficult to establish, we need to find a surrogate cost c ∗ to guide the optimization procedure, and the surrogate classifier h is expected to perform as well as the classifier h t trained with the true cost. In this way, the problem can be formulated as

$$\displaystyle \begin{aligned} \min_{h,c^{*}} &~L(h, c^*),\\ s.t.&~p(L(h,c)<\theta)>1-\varphi, \forall~c\in [C_{min},C_{max}],C_{min}\leq c^{*}\leq C_{max}, \end{aligned} $$
(5.57)

where L(h, c) is the empirical risk, formulated as \(L(h, c)=\sum _{i=1}^{N_v} c\,I(\rho _i\neq y_i\wedge y_i=+)+I\left (\rho _{i} \neq y_i \wedge y_i=-\right )\), where ρ i = s i ω is the predicted label of the i-th sample in the test set, and +  and − represent the labels of the important class and the unimportant class, respectively.

The worst-case risk is considered first to guarantee that all constraints can be satisfied. The worst-case classifier h ∗ can be written as

$$\displaystyle \begin{aligned} h^{*}=\arg~\underset{h}{\min}~\underset{c}{\sup}~L(h,c) \end{aligned} $$
(5.58)

and

$$\displaystyle \begin{aligned} p\left(\underset{c}{\sup} L\left(h^{*}, c\right)<\theta\right)>1-\varphi. \end{aligned} $$
(5.59)

We then have \(p\left ( L\left (h^{*}, c\right )<\theta \right )>1-\varphi \) for any c. The worst-case risk is attained when the surrogate cost c equals C max. However, only a solution that meets the constraints can be acquired in this manner, and the cost cannot be guaranteed to be close to the true cost. As the average cost yields the smallest maximum distortion of the true risk, it is another good choice, which can be calculated as C mean = 0.5(C max + C min).

With the surrogate costs C max and C mean, cost interval optimization proceeds in two stages. In the first stage, C max is used as the surrogate cost, and a collection of cost-sensitive hypergraph structures with varying parameter values is learned. In the second stage, C mean is used as the surrogate cost to evaluate the overall cost on the validation set, and the hypergraph structure with the lowest overall cost is chosen as the final solution. A minimal code sketch of this procedure is given below.
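In the sketch below (assuming NumPy), the helper train_and_predict is hypothetical and stands in for solving Eq. (5.56) with a fixed surrogate cost and predicting on the validation set; empirical_risk implements L(h, c) as given under Eq. (5.57).

```python
# Two-stage cost interval optimization with surrogate costs C_max and C_mean.
import numpy as np

def empirical_risk(pred, y, c):
    """L(h, c): cost c per missed positive (important class), cost 1 per false alarm."""
    fn = np.sum((pred != y) & (y == 1))
    fp = np.sum((pred != y) & (y == -1))
    return c * fn + fp

def cost_interval_select(train_and_predict, y_valid, C_min, C_max, param_grid):
    """train_and_predict(cost, mu, lam) -> validation predictions (hypothetical helper)."""
    C_mean = 0.5 * (C_max + C_min)
    # stage 1: learn candidate structures with the worst-case surrogate cost C_max
    candidates = [(params, train_and_predict(C_max, *params)) for params in param_grid]
    # stage 2: keep the candidate with the lowest overall cost evaluated at C_mean
    best_params, _ = min(candidates, key=lambda it: empirical_risk(it[1], y_valid, C_mean))
    return best_params
```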

In this section, we have described cost-sensitive hypergraph computation methods. The imbalanced data issue is very common in many applications. Cost-sensitive hypergraph computation introduces a cost matrix into hypergraph modeling, and either a fixed cost value or a cost interval can be used in the learning process.

5.5 Link Prediction on Hypergraph

Link prediction is a fundamental task in network analysis. The objective of link prediction is to predict whether two vertices in a network may have a link. Link prediction has wide applications in different domains, such as social relation exploration [17, 18], protein interaction prediction [19, 20], and recommender systems [21, 22], and has attracted much attention in the past decades.

Link prediction on hypergraph aims to discover missing relations or predict newly arriving hyperedges based on the observed hypergraph, where hypergraph computation can be used to deeply exploit the underlying high-order correlations among the data. Unlike the link prediction task on the graph structure [23, 24], the hypergraph models the high-order correlation among the data, which is heterogeneous in many applications, since the vertices are of different types. For example, in a bibliographic network, a vertex can represent a paper, an author, or a venue, while a hyperedge represents the relation where a paper is written by multiple authors and published in a venue. These different types of vertices do not necessarily share the same representation space. From the view of a hyperedge event, a heterogeneous hypergraph contains two kinds of vertices, i.e., identifier vertices and slave vertices. An identifier vertex uniquely determines a hyperedge, while the slave vertices are the remaining vertices in the hyperedge. In this section, we introduce the Heterogeneous Hypergraph Variational Auto-encoder (HeteHG-VAE) method [6] for the heterogeneous hypergraph link prediction task.

The overview of HeteHG-VAE is shown in Fig. 5.12. HeteHG-VAE aims to learn low-dimensional heterogeneous hypergraph embeddings based on a Bayesian deep generative strategy. The input hypergraph is represented by the incidence matrix H, whose sub-matrices correspond to the sub-hypergraphs generated by the different types of slave vertices. The heterogeneous encoder projects the vertices and the hyperedges to the vertex embedding and the hyperedge embedding, respectively. The hypergraph embedding is the combination of the vertex embedding and the hyperedge embedding, which is used by the hypergraph decoder to reconstruct the incidence matrix.

Fig. 5.12 An illustration of the HeteHG-VAE method. This figure is from [6]

In the remainder of this section, we first introduce the variational evidence lower bound with its task-specific derivation. Then, the inference model, including the heterogeneous vertex encoder and the heterogeneous hyperedge encoder, is presented. Finally, the generative model and the link prediction method are introduced.

Denote \(\{ x_k \}_{k=1}^K\) as the observed data, where K is the total number of vertex types, \({\mathbf {Z}}^V_k\) as the latent vertex embedding, and Z E as the latent hyperedge embedding. HeteHG-VAE assumes that \(\mathbf Z^V_k\) and Z E are drawn i.i.d. from a Gaussian prior, i.e., \(\mathbf Z^V_k\sim p_0(\mathbf Z^V_k)\) and Z E ∼ p 0(Z E), and that x k is drawn from the conditional distribution \(x_k\sim p(x_k|\mathbf Z^V_k,Z^E;\lambda _k)\), where λ k is the parameter of the distribution. The objective of HeteHG-VAE is to maximize the log-likelihood of the observed data by optimizing λ k as follows:

$$\displaystyle \begin{aligned} {} &\log p(x_1, \cdots, x_K; \lambda) \\ &\quad =\log \int_{{\mathbf{Z}}^V_1}\cdots \int_{{\mathbf{Z}}^V_K}\int_{{\mathbf{Z}}^E} p(x_1, \cdots, x_K, {\mathbf{Z}}^V_1, \cdots, {\mathbf{Z}}^V_K, {\mathbf{Z}}^E;\lambda)d{\mathbf{Z}}^V_1\cdots d{\mathbf{Z}}^V_Kd{\mathbf{Z}}^E\\ &\quad \geq \mathbb{E}_q \left( \log \frac{ p(x_1, \cdots, x_K, {\mathbf{Z}}^V_1, \cdots, {\mathbf{Z}}^V_K, {\mathbf{Z}}^E;\lambda) }{ q({\mathbf{Z}}^V_1, \cdots, {\mathbf{Z}}^V_K, {\mathbf{Z}}^E|x_1, \cdots, x_K;\theta)} \right) \\ &\quad := \mathbb{L}(x_1, \cdots, x_K; \theta, \lambda), \end{aligned} $$
(5.60)

where q(⋅) is the variational posterior for the estimation of the true posterior \(p(\mathbf Z^V_1, \ldots , \mathbf Z^V_K, \mathbf Z^E|x_1, \ldots , x_K)\), which is inaccessible, and θ is the parameter to be estimated. Then, \(\mathbb {L}(x_1, \ldots , x_K; \theta , \lambda )\) is the evidence lower bound of the log marginal likelihood. Based on the evidence lower bound, an inference encoder is presented to parameterize q, and a generative decoder is used to parameterize p.

The inference encoder of HeteHG-VAE consists of two main parts, i.e., the heterogeneous vertex encoder and the heterogeneous hyperedge encoder. The heterogeneous vertex encoder first maps the observed data x k to a latent space \(\tilde {\mathbf Z}^V_k\), which can be written as

$$\displaystyle \begin{aligned} \tilde{\mathbf Z}^V_k = f^V(x_k\mathbf W^V_k+b^V_k), \end{aligned} $$
(5.61)

where \(\mathbf W^V_k\) and \(b^V_k\) are the to-be-learned weights of the model, and f V is a nonlinear activation function. Two separate linear layers then map the latent representation to the means \(\mu ^V_k\) and variances \(\sigma ^V_k\) of q:

$$\displaystyle \begin{aligned} & \mu^V_k = \tilde{\mathbf Z}^V_k\mathbf W^{V\mu}_k + b^{V\mu}_k, \end{aligned} $$
(5.62)
$$\displaystyle \begin{aligned} &\sigma^V_k = \tilde{\mathbf Z}^V_k\mathbf W^{V\sigma}_k + b^{V\sigma}_k, \end{aligned} $$
(5.63)

where \(\mathbf W^{V\mu }_k\), \( b^{V\mu }_k\), \(\mathbf W^{V\sigma }_k\), and \(b^{V\sigma }_k\) are learnable parameters. The vertex embedding \({\mathbf{Z}}^V_k\) is sampled from the Gaussian distribution \(\mathbb {N}(\mu ^V_k,\sigma ^V_k)\).
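A minimal NumPy sketch of the heterogeneous vertex encoder in Eqs. (5.61)–(5.63) with the usual VAE reparameterization is given below; the random weights, the tanh nonlinearity for f V, and reading σ as a log-variance are our own implementation assumptions, not details from [6].

```python
# Heterogeneous vertex encoder with reparameterized sampling, Eqs. (5.61)-(5.63).
import numpy as np

def vertex_encoder(x_k, dim=16, rng=np.random.default_rng(0)):
    n_in = x_k.shape[1]
    W_v, b_v = rng.standard_normal((n_in, dim)) * 0.1, np.zeros(dim)     # stand-ins
    W_mu, b_mu = rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)
    W_sig, b_sig = rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)

    Z_tilde = np.tanh(x_k @ W_v + b_v)           # Eq. (5.61), tanh assumed for f^V
    mu = Z_tilde @ W_mu + b_mu                   # Eq. (5.62)
    log_sigma = Z_tilde @ W_sig + b_sig          # Eq. (5.63), read as log-variance
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma) * eps    # sample Z^V_k ~ N(mu, sigma)
```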

The heterogeneous hyperedge encoder first maps the observed data x k to a latent space \(\tilde {\mathbf Z}^E_k\), which can be written as

$$\displaystyle \begin{aligned} \tilde{\mathbf Z}^E_k = f^E(x_k^\top \mathbf W^E_k+b^E_k), \end{aligned} $$
(5.64)

where \(\mathbf W^E_k\) and \(b^E_k\) are the to-be-learned weights of the model, and f E is a nonlinear activation function. Then, the importance of different types of vertices is learned by the hyperedge attention mechanism, which can be written as

$$\displaystyle \begin{aligned} \tilde{\alpha}_k = \text{Tanh} (\tilde{\mathbf Z}^E_k \mathbf W_k^{E\alpha} + b_k^{E\alpha} ) \mathbf P, \end{aligned} $$
(5.65)

where \(\mathbf W_k^{E\alpha }\), \(b_k^{E\alpha }\), and P are learnable parameters. The attention score α k is obtained by normalizing \(\tilde {\alpha }_k\), and the hyperedge embedding can be written as

$$\displaystyle \begin{aligned} \tilde{\mathbf Z}^E = \sum_{k=1}^K \alpha_k \tilde{\mathbf Z}^E_k. \end{aligned} $$
(5.66)

Similarly, two separate linear layers map the latent representation to the means μ E and variances σ E of the distribution q:

$$\displaystyle \begin{aligned} & \mu^E = \tilde{\mathbf Z}^E{\mathbf{W}}^{E\mu} + b^{E\mu}, \end{aligned} $$
(5.67)
$$\displaystyle \begin{aligned} &\sigma^E = \tilde{\mathbf Z}^E{\mathbf{W}}^{E\sigma} + b^{E\sigma}, \end{aligned} $$
(5.68)

where W , b , W , and b are learnable parameters. The vertex embedding is the sample from the Gaussian distribution \(\mathbb {N}(\mu ^E,\sigma ^E)\).

Each entry of the incidence matrix is sampled from a Bernoulli distribution parameterized by \(\mathbb {H}_{ij}\):

$$\displaystyle \begin{aligned} p(\mathbf H_{ij} | {\mathbf{Z}}^V_{k,i}, {\mathbf{Z}}^E_{k,j}; \lambda_k ) = Ber(\mathbb{H}_{ij}), \end{aligned} $$
(5.69)

where \(\mathbb {H}_{ij}\) is computed from the inner product of the vertex embedding and the hyperedge embedding:

$$\displaystyle \begin{aligned} \mathbb{H}_{ij} = Sigmoid (\mathbf Z^V_{k,i}(\mathbf Z^E_j)^\top). \end{aligned} $$
(5.70)

The likelihood of the connection among vertices could be obtained based on the vertex embedding and hyperedge embedding as follows:

$$\displaystyle \begin{aligned} p_{conn}(\mathbf Z^V_i,\mathbf Z^E_j)=|| \mathbf Z^V_i,\mathbf Z^E_j ||{}_2. \end{aligned} $$
(5.71)
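A minimal NumPy sketch of the hypergraph decoder in Eqs. (5.69)–(5.70) is given below; thresholding the reconstructed incidence matrix (the commented line) is one simple way to read off predicted hyperedges and is our own illustrative choice.

```python
# Hypergraph decoder: reconstruct the incidence matrix from the embeddings, Eq. (5.70).
import numpy as np

def reconstruct_incidence(Z_v, Z_e):
    logits = Z_v @ Z_e.T                       # inner products Z^V_i (Z^E_j)^T
    return 1 / (1 + np.exp(-logits))           # Bernoulli parameters of Eq. (5.69)

# Predicted hyperedges can then be read off by thresholding, e.g.
# H_pred = (reconstruct_incidence(Z_v, Z_e) > 0.5).astype(int)
```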

In this section, we have introduced the Heterogeneous Hypergraph Variational Auto-encoder method [6] for the task of link prediction on hypergraph, which captures the high-order correlations among the data while preserving the original low-order topology. Link prediction on hypergraph has shown superior performance in different experiments and can be further used in other applications.

5.6 Summary

In this chapter, we introduce four typical hypergraph computation tasks, including label propagation, data clustering, imbalanced learning, and link prediction. Label propagation on hypergraph predicts the labels of the vertices of a hypergraph, i.e., assigns a label to each unlabeled vertex, based on the labeled information. Data clustering on hypergraph divides the vertices of a hypergraph into several groups. Imbalanced learning on hypergraph considers imbalanced data distributions and introduces cost-sensitive hypergraph computation methods. Link prediction on hypergraph discovers missing relations or predicts newly arriving hyperedges based on the observed hypergraph. We note that these four tasks are typical ways to use hypergraph computation in practice. Other tasks can also be deployed under the hypergraph computation framework, such as data regression, data completion, and data generation. Following these typical hypergraph computation tasks, we can apply them in different applications, such as social media analysis and computer vision.