5.1 Introduction

In previous chapters, we have introduced how to generate the hypergraph structure from observed data. After the hypergraph generation step, how to use this hypergraph for different applications becomes the key task. Hypergraphs have the potential to be used in many areas, such as social media analysis, medical and biological applications, and computer vision. We notice that most of these applications can be categorized into several typical tasks and follow similar application patterns. In this chapter, we introduce several typical hypergraph computation tasks, which can be used for different applications.

More specifically, four typical tasks, including label propagation, data clustering, cost-sensitive learning, and link prediction, are introduced in this chapter. The first typical task is label propagation, which is also one of the most widely used methods in machine learning. The objective of label propagation is to assign a label to each unlabeled sample. In general, label propagation on hypergraph propagates the label information from labeled vertices to unlabeled vertices through the structural information of the hyperedges. Random walk is a basic mechanism for information propagation and plays a fundamental role in this process. We then review hypergraph cuts and random-walk-based label propagation on hypergraphs. In this part, we introduce the label propagation process on a single hypergraph and on multi-hypergraphs [1, 2], respectively.

The second typical task is data clustering, which aims to group data into different clusters. We introduce how to conduct data clustering using hypergraph computation, where the hypergraph structure serves as guidance for the clustering criteria. Two types of hypergraph clustering methods are introduced, namely structural hypergraph clustering and attribute hypergraph clustering, according to the data information available in the hypergraph. In structural hypergraph clustering, only the structural information is used, while in attribute hypergraph clustering, each vertex is usually accompanied by attribute information from the real world. We introduce a hypergraph Laplacian smoothing filter and an embedding model designed specifically for hypergraph clustering tasks, named the adaptive hypergraph auto-encoder (AHGAE) [3].

The third typical task is cost-sensitive learning, which solves learning tasks in scenarios with different mis-classification costs, such as those arising from imbalanced data distributions. Here, we introduce two hypergraph computation methods, i.e., cost-sensitive hypergraph computation [4] and cost interval optimization for hypergraph computation [5]. First, we introduce a cost-sensitive hypergraph modeling method, in which the cost for different objectives is fixed in advance. As the exact cost value may not be easy to determine, we then introduce a cost interval optimization method, which only requires a cost interval rather than a precise cost value while modeling the high-order relations among the data.

The fourth typical task is link prediction, which predicts relationships among data and can be used in recommender systems and other applications. Here, link prediction on hypergraph aims to mine missing hyperedges or predict newly arriving hyperedges based on the observed hypergraph. We introduce a variational auto-encoder for heterogeneous hypergraph link prediction [6]. It aims to learn low-dimensional heterogeneous hypergraph embeddings based on a Bayesian deep generative strategy. The heterogeneous encoder generates the vertex embedding and the hyperedge embedding, and the hypergraph embedding is the combination of the two. The hypergraph decoder reconstructs the incidence matrix based on the vertex embedding and the hyperedge embedding, and the heterogeneous hypergraph is generated based on the reconstructed incidence matrix.

Part of the work introduced in this chapter has been published in [1,2,3,4,5,6].

5.2 Label Propagation on Hypergraph

This section mainly introduces the label propagation task on hypergraph. We first introduce the basic assumptions of the label propagation process. Given the set of vertices of a hypergraph, a subset of the vertices is labeled, while the other vertices are unlabeled. The task is to predict the labels of these unlabeled vertices given the known labels and the hypergraph structure. As shown in Fig. 5.1, the label propagation process propagates the label information from the labeled vertices to the unlabeled vertices.

Fig. 5.1 An illustration of the label propagation on hypergraphs

When propagating label information, vertices within the same hyperedge are more likely to have the same label, because they share similar attributes in some aspects. Under this assumption, the label propagation task can be transformed into a hypergraph cut problem. In a hypergraph cut, the goal is to make the cut hyperedges as sparse as possible while keeping each vertex set after the cut as dense as possible. After cutting the hypergraph, different sets of vertices receive different labels, which satisfies the above assumption. The hypergraph cut can be formulated as follows.

Suppose a vertex set \(S \subset \mathbb {V}\) and its complement \(\overline {S}\). A cut splits \(\mathbb {V}\) into S and \(\overline {S}\). A hyperedge e is cut if it is incident with vertices in both S and \(\overline {S}\). Define the hyperedge boundary ∂S as the set of cut hyperedges, i.e., \(\partial S = \{e\in \mathbb {E} \mid e \cap S \ne \varnothing , e\cap \overline {S} \ne \varnothing \}\), and the volume of S, vol(S), as the sum of the degrees of the vertices in S, i.e., \(vol(S) =\sum _{v\in S} {\mathbf {D}}_v(v)\). The volume of the hyperedge boundary is then defined as

$$\displaystyle \begin{aligned} vol(\partial S)=\sum_{e\in \partial S}w(e)\frac{|e\cap S||e\cap \overline{S}|}{{\mathbf{D}}_e(e)}. {} \end{aligned} $$
(5.1)

The derivation is shown as follows, and the details can be found in [7]. Suppose that each hyperedge e is expanded into a clique, i.e., a fully connected subgraph. To avoid confusion, the edges in the clique are called subedges. Then, the weight \(\frac {w(e)}{{\mathbf {D}}_e(e)}\) is assigned to each subedge. When the hyperedge e is cut, \(|e\cap S|\times |e\cap \overline {S}|\) subedges are cut, and the volume of the cut is the sum of the weights over these subedges. Recall that our goal is to make the cut hyperedges as sparse as possible while keeping each vertex set after the cut as dense as possible. Based on this goal, the partition objective is written as

$$\displaystyle \begin{aligned} \arg\min_{S \subset \mathbb{V}}c(S) = vol(\partial S)\left(\frac{1}{vol(S)} + \frac{1}{vol(\overline{S})}\right). {} \end{aligned} $$
(5.2)
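To make the cut objective concrete, the following minimal NumPy sketch (a toy example of our own, not taken from [7]) evaluates c(S) in Eq. (5.2) for a candidate partition, using the conventions \(d(v)=\sum_{e}w(e)\mathbf{H}(v,e)\) and \({\mathbf{D}}_e(e)=|e|\) implied by the surrounding derivation.

```python
# Evaluate the normalized hypergraph cut cost c(S) of Eq. (5.2) for a toy hypergraph.
import numpy as np

H = np.array([[1, 0],      # |V| x |E| incidence matrix (toy example)
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 2.0])   # hyperedge weights w(e)
d_v = H @ w                # vertex degrees d(v) = sum_e w(e) H(v, e)
d_e = H.sum(axis=0)        # hyperedge degrees D_e(e) = |e|

def cut_cost(S_mask):
    """c(S) = vol(dS) * (1/vol(S) + 1/vol(S_bar)), Eqs. (5.1)-(5.2)."""
    in_S = H[S_mask].sum(axis=0)           # |e ∩ S| for every hyperedge
    in_Sbar = H[~S_mask].sum(axis=0)       # |e ∩ S_bar|
    vol_boundary = np.sum(w * in_S * in_Sbar / d_e)
    vol_S, vol_Sbar = d_v[S_mask].sum(), d_v[~S_mask].sum()
    return vol_boundary * (1.0 / vol_S + 1.0 / vol_Sbar)

S = np.array([True, True, True, False, False])
print(cut_cost(S))
```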

There are many methods to propagate label information on a hypergraph, among which propagation based on random walks is the most widely used. The following describes label propagation by random walk, and an illustration is shown in Fig. 5.2. Suppose that the current position is \(u \in \mathbb {V}\). We first choose a hyperedge e among all hyperedges incident with u with probability proportional to w(e), and then we sample a vertex v ∈ e uniformly. Generalizing from typical random walks on graphs, we use P as the transition probability matrix of the random walk on a hypergraph, and the element p(u, v) is defined as follows:

$$\displaystyle \begin{aligned} p(u,v)=\sum_{e\in \mathbb{E}}w(e)\frac{\mathbf{H}(u,e)}{{\mathbf{D}}_v(u)}\frac{\mathbf{H}(v,e)}{{\mathbf{D}}_e(e)}. \end{aligned} $$
(5.3)
Fig. 5.2 An illustration of the hypergraph label propagation based on random walks

The formula can be organized into a matrix form as \(\mathbf {P}={\mathbf {D}}_v^{-1}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top \). The stationary distribution π of the random walk is defined as

$$\displaystyle \begin{aligned} \pi(v)=\frac{d(v)}{vol(\mathbb{V})}, {} \end{aligned} $$
(5.4)

where \({\mathbf {D}}_v(v)\) is denoted by d(v) for short, and vol(·) is the volume of a vertex set, e.g., \(vol(S) =\sum _{v\in S} d(v)\). The stationarity of π can be verified as follows:

$$\displaystyle \begin{aligned} \sum_{u\in \mathbb{V}}\pi(u)p(u,v) =& \sum_{u\in \mathbb{V}}\frac{d(u)}{vol({\mathbb{V}})}\sum_{e\in \mathbb{E}}w(e)\frac{\mathbf{H}(u,e)}{{\mathbf{D}}_v(u)}\frac{\mathbf{H}(v,e)}{{\mathbf{D}}_e(e)} \\ =& \frac{1}{vol(\mathbb{V})}\sum_{e\in \mathbb{E}}w(e)\frac{\mathbf{H}(v,e)}{{\mathbf{D}}_e(e)}\sum_{u\in \mathbb{V}}\mathbf{H}(u,e)\\ =&\frac{1}{vol(\mathbb{V})}\sum_{e\in \mathbb{E}}w(e)\mathbf{H}(v,e)=\frac{d(v)}{vol(\mathbb{V})}. \end{aligned} $$
(5.5)
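The transition matrix and the stationary distribution can also be checked numerically. The sketch below (a toy example of our own, assuming NumPy) builds \(\mathbf{P}={\mathbf{D}}_v^{-1}\mathbf{H}\mathbf{W}{\mathbf{D}}_e^{-1}{\mathbf{H}}^\top\) from Eq. (5.3) and verifies that π(v) = d(v)∕vol(𝕍) in Eq. (5.4) is indeed stationary.

```python
# Build the random-walk transition matrix and check the stationary distribution.
import numpy as np

H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 2.0])
d_v = H @ w                          # d(v) = sum_e w(e) H(v, e)
d_e = H.sum(axis=0)                  # delta(e) = |e|

P = np.diag(1 / d_v) @ H @ np.diag(w) @ np.diag(1 / d_e) @ H.T   # Eq. (5.3)
pi = d_v / d_v.sum()                 # stationary distribution of Eq. (5.4)

print(np.allclose(P.sum(axis=1), 1))  # True: each row of P sums to 1
print(np.allclose(pi @ P, pi))        # True: pi is stationary, as in Eq. (5.5)
```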

The objective function Eq. (5.2) can be written as

$$\displaystyle \begin{aligned} c(S)=\frac{vol(\partial S)}{vol(\mathbb{V})}\left(\frac{1}{vol(S)/vol(\mathbb{V})} + \frac{1}{vol(\overline{S})/vol(\mathbb{V})}\right), \end{aligned} $$
(5.6)

and then we arrive at

$$\displaystyle \begin{aligned} \frac{vol(S)}{vol(\mathbb{V})}=\sum_{v\in S}\frac{d(v)}{vol(\mathbb{V})}=\sum_{v\in S}\pi(v), \end{aligned} $$
(5.7)

where \(\frac {vol(S)}{vol(\mathbb {V})}\) is the probability that the random walk, under the stationary distribution, visits a vertex in S. Similarly, it can be shown that

$$\displaystyle \begin{aligned} \frac{vol(\partial S)}{vol(\mathbb{V})} = &\sum_{e\in \partial S}\frac{w(e)}{vol(\mathbb{V})}\frac{|e\cap S||e\cap \overline{S}|}{\delta(e)} \\ =& \sum_{e\in \partial S}\sum_{u\in e\cap S}\sum_{v\in e \cap \overline{S}}\frac{w(e)}{vol(\mathbb{V})}\frac{\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)}\\ =& \sum_{e\in \partial S}\sum_{u\in e\cap S}\sum_{v\in e \cap \overline{S}}w(e)\frac{d(u)}{vol(\mathbb{V})}\frac{\mathbf{H}(u,e)}{d(u)}\frac{\mathbf{H}(v,e)}{\delta(e)}\\ =& \sum_{u\in S}\sum_{v\in \overline{S}}\frac{d(u)}{vol(\mathbb{V})}\sum_{e\in \partial S}w(e)\frac{\mathbf{H}(u,e)}{d(u)}\frac{\mathbf{H}(v,e)}{\delta(e)}\\ =& \sum_{u\in S}\sum_{v\in \overline{S}}\pi(u)p(u,v), \end{aligned} $$
(5.8)

where the ratio \(\frac {vol(\partial S)}{vol(\mathbb {V})}\) is the probability that, under the stationary distribution, the random walk moves from a vertex in S to a vertex in \(\overline {S}\). It can be seen that the hypergraph normalized cut criterion searches for a cut such that the probability with which the random walk crosses different clusters is as small as possible, while the probability with which the random walk stays within the same cluster is as large as possible.

Let us review the objective function Eq. (5.2). This combinatorial problem is NP-complete, but it can be relaxed into the following optimization problem:

$$\displaystyle \begin{aligned} &\arg\min_{\mathbf f\in\mathbb{R}^{|V|}}\varOmega(\mathbf f)=\frac{1}{2}\sum_{e\in\mathbb{E}}\sum_{\{u,v\}\in e}\frac{w(e)}{\delta(e)}\left(\frac{\mathbf f(u)}{\sqrt{d(u)}} - \frac{\mathbf f(v)}{\sqrt{d(v)}}\right)^2,\\ &s.t.~~\sum_{v\in \mathbb{V}}\mathbf f^2(v)=1, \sum_{v\in \mathbb{V}}\mathbf f(v)\sqrt{d(v)}=0, {} \end{aligned} $$
(5.9)

where f is the to-be-learned score vector. Since the goal is label propagation, label information is available for some of the vertices, and the optimization problem becomes the following transductive inference problem:

$$\displaystyle \begin{aligned} \arg\min_{\mathbf f\in \mathbb{R}^{|\mathbb{V}|}}\{\varOmega(\mathbf f) + \lambda R_{emp}(\mathbf f)\}, \end{aligned} $$
(5.10)

where the regularizer term is Ω(f), the empirical loss term is \(R_{emp}(\mathbf f)=\|\mathbf f-\mathbf y\|{ }^2=\sum _{v\in \mathbb {V}}(\mathbf f(v) - \mathbf y(v))^2\), \(\mathbf y\in \mathbb {R}^{|\mathbb {V}|}\) is the label vector, and λ is the balance parameter. For example, if only the i-th vertex is labeled, the elements of y are all 0 except the i-th element, which is 1. The regularizer Ω(f) can be rewritten as

$$\displaystyle \begin{aligned} \varOmega(\mathbf f) =& \frac{1}{2}\sum_{e\in\mathbb{E}}\sum_{\{u,v\}\in e}\frac{w(e)}{\delta(e)}\left(\frac{\mathbf f(u)}{\sqrt{d(u)}} - \frac{\mathbf f(v)}{\sqrt{d(v)}}\right)^2\\ =& \sum_{e\in\mathbb{E}}\sum_{u,v\in \mathbb{V}}\frac{w(e)\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)}\left(\frac{\mathbf f^2(u)}{d(u)} - \frac{\mathbf f(u)\mathbf f(v)}{\sqrt{d(u)d(v)}}\right)\\ =& \sum_{u\in \mathbb{V}}\mathbf f^2(u)\sum_{e\in \mathbb{E}}\frac{w(e)\mathbf{H}(u,e)}{d(u)}\sum_{v\in \mathbb{V}}\frac{\mathbf{H}(v,e)}{\delta(e)}\\& - \sum_{e\in \mathbb{E}}\sum_{u,v\in \mathbb{V}}\frac{\mathbf f(u)\mathbf{H}(u,e)w(e)\mathbf{H}(v,e)\mathbf f(v)}{\sqrt{d(u)d(v)}\delta(e)}\\ =& \mathbf f^\top(\mathbf{I} - \varTheta) \mathbf f, \end{aligned} $$
(5.11)

where \(\varTheta ={\mathbf {D}}_v^{-\frac {1}{2}}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top {\mathbf {D}}_v^{-\frac {1}{2}}\). The hypergraph Laplacian is denoted by Λ = I − Θ. Therefore, the objective function can be rewritten as

$$\displaystyle \begin{aligned} \varOmega(\mathbf f)=\mathbf f^\top\varLambda \mathbf f. \end{aligned} $$
(5.12)

The optimization function can be turned into

$$\displaystyle \begin{aligned} \arg\min_{\mathbf f\in \mathbb{R}^{|\mathbb{V}|}} \{\mathbf f^\top\varLambda \mathbf f + \lambda\|\mathbf f-\mathbf y\|{}^2\}. {} \end{aligned} $$
(5.13)

There are two ways to solve the above problem. The first is to differentiate the objective function in Eq. (5.13) with respect to f and set the derivative to zero, which gives

$$\displaystyle \begin{aligned} \mathbf f=\left(\mathbf{I}+\frac{1}{\lambda}\varLambda\right)^{-1}\mathbf y. {} \end{aligned} $$
(5.14)

The second one is an iterative method. Similar to the iterative approach in [8], Eq. (5.13) can be efficiently solved by an iterative process, which is illustrated in Fig. 5.3. In each iteration, \(\mathbf f^{(t+1)}\) is obtained from the previous iterate \(\mathbf f^{(t)}\) and y as \(\mathbf f^{(t+1)} = \frac{1}{1+\lambda}\varTheta\mathbf f^{(t)} + \frac{\lambda}{1+\lambda}\mathbf y\), and the procedure is repeated until convergence.

Fig. 5.3 The iterative solution of Eq. (5.13). This figure is from [1]

This process converges to the solution in Eq. (5.14). To prove it, we first show that the eigenvalues of Θ lie in [−1, 1]. Since \(\varTheta ={\mathbf {D}}_v^{-1/2}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top {\mathbf {D}}_v^{-1/2}\) is similar to the transition matrix \(\mathbf {P}={\mathbf {D}}_v^{-1}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top \), which is row-stochastic and hence has spectral radius at most 1, the eigenvalues of Θ are in [−1, 1]. Therefore, (I ± Θ) are positive semi-definite.

The convergence of the iterative process is proved in [1]. Without loss of generality, we assume f (0) = y. From the iterative process, it can be obtained that

$$\displaystyle \begin{aligned} {\mathbf f^{(t)}}=&\left({{\lambda}\over{1+\lambda}}\right)\sum_{i=0}^{t-1}{{\left({{{1}\over{1+\lambda}}{{\varTheta}}}\right)^{i}}\mathbf y}{+\left({{{1}\over{1+\lambda}}{{\varTheta}}}\right)^{t}}\mathbf y\\=&(1-\zeta)\sum_{i=0}^{t-1}{{{(\zeta{{\varTheta}})}^{i}}\mathbf y}+{(\zeta{{\varTheta}})^{t}}\mathbf y, \end{aligned} $$
(5.15)

where \(\zeta =\frac {1}{1+\lambda }\). Since 0 < ζ < 1, and the eigenvalues of Θ are in [−1, 1], it can be derived that

$$\displaystyle \begin{aligned} \mathop{\lim}\limits_{t\to\infty}{(\zeta{{\varTheta}})^{t}}=0 \end{aligned} $$
(5.16)

and

$$\displaystyle \begin{aligned} \mathop{\lim}\limits_{t\to\infty}\sum_{i=0}^{t-1}{{(\zeta{{\varTheta}})}^{i}}={(\mathbf{I}-\zeta{{\varTheta}})^{-1}}. \end{aligned} $$
(5.17)

Then, it turns out

$$\displaystyle \begin{aligned} \mathbf f=\mathop{\lim}\limits_{t\to\infty}{\mathbf f^{(t)}}=(1-\zeta){(\mathbf{I}-\zeta{{\varTheta}})^{-1}}\mathbf y={\left({\mathbf{I}+{{1}\over{\lambda}}{{\varLambda}}}\right)^{-1}}\mathbf y. \end{aligned} $$
(5.18)

Therefore, the iterative process is proved to converge to the closed-form solution in Eq. (5.14).
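This equivalence can also be checked numerically. The following sketch (a toy example of our own, assuming NumPy) computes Eq. (5.14) directly and runs the iteration derived from Eq. (5.15); the two results coincide.

```python
# Compare the closed-form solution of Eq. (5.14) with the iterative process of Fig. 5.3.
import numpy as np

H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1]], dtype=float)
w = np.array([1.0, 2.0])
d_v = H @ w
d_e = H.sum(axis=0)

Dv_isqrt = np.diag(1 / np.sqrt(d_v))
Theta = Dv_isqrt @ H @ np.diag(w) @ np.diag(1 / d_e) @ H.T @ Dv_isqrt
Lap = np.eye(len(d_v)) - Theta                 # hypergraph Laplacian Lambda = I - Theta

lam = 0.5                                      # balance parameter lambda
y = np.array([1.0, 0, 0, 0, 0])                # vertex 0 is labeled
f_closed = np.linalg.solve(np.eye(len(d_v)) + Lap / lam, y)   # Eq. (5.14)

zeta = 1 / (1 + lam)
f = y.copy()
for _ in range(1000):                          # iterative process of Eq. (5.15)
    f = zeta * Theta @ f + (1 - zeta) * y

print(np.allclose(f, f_closed))                # True: both solutions coincide
```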

The random-walk-based method is the most commonly used approach in label propagation on hypergraphs. It has the advantages of being simple to implement and theoretically verifiable.

In many cases, different hypergraphs may be generated based on different criteria. Under such circumstances, we need to conduct label propagation on a multi-hypergraph. Here, we briefly introduce the cross-diffusion method on multi-hypergraph [2]. We assume that there are T hypergraphs, and the t-th hypergraph is denoted as \(\mathbb {G}^t=(\mathbb {V}^t, \mathbb {E}^t, {\mathbf {W}}^t)\), where \(\mathbb {V}^t\) is the vertex set, \(\mathbb {E}^t\) is the hyperedge set, and W t is a diagonal matrix representing the weights of the hyperedges.

The transition matrix is first generated for each hypergraph. The label propagation process on hypergraph is based on the assumption that local similarities can approximate long-range similarities, and therefore, similarities between nearby vertices are more informative than those between far-away vertices. The similarity matrix among the vertices of the t-th hypergraph is defined as follows:

$$\displaystyle \begin{aligned} {\varLambda^t}(u,v)=\sum_{e\in \mathbb{E}^t} \frac{{\mathbf{W}}^t(e) {\mathbf{H}}^t(u,e) {\mathbf{H}}^t(v,e) }{\delta (e)}, \end{aligned} $$
(5.19)

or in the matrix form:

$$\displaystyle \begin{aligned} \varLambda^t = {\mathbf{H}}^t{\mathbf{W}}^t{{\mathbf{D}}_e^{t}}^{-1}{{\mathbf{H}}^t}^\top. \end{aligned} $$
(5.20)

The transition matrix P t is the normalized similarity matrix:

$$\displaystyle \begin{aligned} {\mathbf{P}}^t(i,j)=\frac{\varLambda^t(i,j) }{ \sum_{w\in \mathbb{V}^t} \varLambda^t(i,w) } \end{aligned} $$
(5.21)

and

$$\displaystyle \begin{aligned} {\mathbf{P}}^t={{\mathbf{D}}^t}^{-1}\varLambda^t, \end{aligned} $$
(5.22)

where D t is a diagonal matrix with the i-th diagonal element \({\mathbf {D}}^t(i,i)=\sum _{j=1}^{|\mathbb {V}^t|}\varLambda ^t(i,j)\).

The element P t(i, j) of the transition matrix represents the probability of transitioning from vertex i to vertex j, and P t can be regarded as a Parzen window estimator on the hypergraph structure. After the generation of the transition matrices, the cross label propagation process is applied to the multi-hypergraph structure.

Denote Y 0 as the initial label matrix. For labeled vertices, the i-th row of Y 0 is the one-hot label of the i-th vertex, while for the unlabeled vertices, all elements of the i-th row are 0.5, indicating that there is no prior knowledge of the label. We denote the labeled part of the initial label matrix as \({\mathbf {Y}}_0^L\).

For simplicity, we assume the number of hypergraphs T is 2. The label propagation process for multi-hypergraph uses the output of one hypergraph as the input of the other hypergraph, and this procedure is repeated until the output converges. The process can be formulated as

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^1 &\leftarrow {\mathbf{P}}^1{\mathbf{Y}}_d^2, \end{aligned} $$
(5.23)
$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^{1L} &\leftarrow {\mathbf{Y}}_0^L \end{aligned} $$
(5.24)

and

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^2 &\leftarrow {\mathbf{P}}^2{\mathbf{Y}}_d^1, \end{aligned} $$
(5.25)
$$\displaystyle \begin{aligned} {\mathbf{Y}}_{d+1}^{2L} &\leftarrow {\mathbf{Y}}_0^L, \end{aligned} $$
(5.26)

where \({\mathbf {Y}}_d^k\) denotes the label matrix of the k-th hypergraph after d rounds of label propagation, and \({\mathbf {Y}}_{d+1}^{kL}\) denotes its labeled part. This process is shown in Fig. 5.4.

Fig. 5.4 An illustration of the diffusion process on multi-hypergraph. This figure is from [2]

The overall label matrix is calculated by averaging the label matrices of all hypergraphs after convergence:

$$\displaystyle \begin{aligned} {\mathbf{Y}}_{final} = \frac 1T \sum_{i=1}^T {\mathbf{Y}}_d^i. \end{aligned} $$
(5.27)

For more complicated scenarios where more than two hypergraphs are available, the label propagation process proceeds in the same way, with the output of one hypergraph used as the input of the other hypergraphs. A minimal code sketch of the two-hypergraph case is given below.
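The following NumPy sketch implements the cross diffusion of Eqs. (5.23)–(5.27) for two hypergraphs sharing the same vertex set; the toy incidence matrices, hyperedge weights, and number of iterations are our own illustrative choices, not values from [2].

```python
# Cross-diffusion label propagation on two hypergraphs, Eqs. (5.20)-(5.27).
import numpy as np

def transition(H, w):
    """P = D^{-1} H W D_e^{-1} H^T, Eqs. (5.20)-(5.22)."""
    d_e = H.sum(axis=0)
    Lam = H @ np.diag(w) @ np.diag(1 / d_e) @ H.T
    return Lam / Lam.sum(axis=1, keepdims=True)

def cross_diffusion(H1, w1, H2, w2, Y0, labeled, n_iter=50):
    P1, P2 = transition(H1, w1), transition(H2, w2)
    Y1, Y2 = Y0.copy(), Y0.copy()
    for _ in range(n_iter):
        Y1_new = P1 @ Y2                      # Eq. (5.23)
        Y2_new = P2 @ Y1                      # Eq. (5.25)
        Y1_new[labeled] = Y0[labeled]         # Eq. (5.24): clamp the labeled part
        Y2_new[labeled] = Y0[labeled]         # Eq. (5.26)
        Y1, Y2 = Y1_new, Y2_new
    return (Y1 + Y2) / 2                      # Eq. (5.27)

H1 = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
H2 = np.array([[1, 1], [1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
Y0 = np.full((5, 2), 0.5)                     # 0.5 means no prior label knowledge
Y0[0], Y0[4] = [1.0, 0.0], [0.0, 1.0]         # vertices 0 and 4 are labeled
print(cross_diffusion(H1, np.ones(2), H2, np.ones(2), Y0, labeled=np.array([0, 4])))
```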

This diffusion process can also be applied to a single hypergraph, and the corresponding framework is illustrated in Fig. 5.5.

Fig. 5.5 An illustration of the diffusion process on a single hypergraph

5.3 Data Clustering on Hypergraph

Data clustering is a typical machine learning task that aims to group data into clusters. In this section, we introduce hypergraph-based data clustering methods, which utilize the hypergraph structure to better discover the correlations behind the data. Hypergraph clustering methods can be divided into two types, structural hypergraph clustering and attribute hypergraph clustering, according to the data information available in the hypergraph. In structural hypergraph clustering, only the structural information is used. For example, the hypergraph spectral clustering method [7] extends graph spectral clustering and uses the hypergraph Laplacian to learn complex relations between the vertices of the hypergraph. Auto-encoder-based techniques [9] have also been applied to structural clustering. In attribute hypergraph clustering, each vertex is usually accompanied by attribute information from the real world. There are two assumptions as follows:

  • Vertices in the same hyperedge have similar attributes.

  • Vertices with similar features have similar attributes.

How to balance the graph structure information and the node feature information is a key focus of attributed graph clustering [10]. In the same way, hypergraph clustering can utilize the features, attributes, and structural information of vertices to conduct the data clustering task.

In this section, we introduce a hypergraph Laplacian smoothing filter and an embedding model called the adaptive hypergraph auto-encoder (AHGAE), which is designed specifically for hypergraph clustering tasks [3]. First, we describe the hypergraph Laplacian smoothing filter and derive its low-pass filtering property in the frequency domain. Then, we analyze the influence of each vertex on the attributes of its connected hyperedges and on the features of its neighboring vertices. Finally, we introduce the detailed procedure and framework of the adaptive hypergraph auto-encoder.

The hypergraph Laplacian smoothing filter, as shown in Fig. 5.6, first merges the vertex features into hyperedge features, and the feature of hyperedge e k is defined as

$$\displaystyle \begin{aligned} {\mathbf{E}}_{k}^{(t)} &=\frac{1}{\left|N\left(e_{k}\right)\right|} \sum_{v_{j} \in N\left(e_{k}\right)} {\mathbf{X}}_{j}^{(t)}=\sum_{v_{j} \in \mathbb{V}} \frac{h(j, k)}{d_{e}(k)} {\mathbf{X}}_{j}^{(t)}, \end{aligned} $$
(5.28)

where e k denotes the k-th hyperedge in the hyperedge set \(\mathbb {E}\), v i denotes the i-th vertex in the vertex set \(\mathbb {V}\), t denotes the filtering order, \(N\left (e_{k}\right )\) is the set of vertices in hyperedge e k, E k denotes the feature of hyperedge e k, and X j denotes the feature of vertex v j.

Fig. 5.6 An illustration of the hypergraph Laplacian smoothing filter. This figure is from [3]

After aggregating the vertex features to obtain the hyperedge features, we further aggregate the hyperedge features back to the vertices according to the hyperedge weights:

$$\displaystyle \begin{aligned} {\mathbf{X}}_{i}^{(t+1)} &=(1-\gamma) {\mathbf{X}}_{i}^{(t)}+\gamma \sum_{e_{k} \in N\left(v_{i}\right)} \frac{h(i, k) w(k)}{d_{v}(i)} {\mathbf{E}}_{k}^{(t)} \\ &=(1-\gamma) {\mathbf{X}}_{i}^{(t)}+\gamma \sum_{v_{j} \in \mathbb{V}} \sum_{e_{k} \in \mathbb{E}} \frac{h(i, k) w(k) h(j, k)}{d_{v}(i) d_{e}(k)} {\mathbf{X}}_{j}^{(t)},\\ {\mathbf{X}}^{(t+1)}&=(1-\gamma) {\mathbf{X}}^{(t)}+\gamma {\mathbf{D}}_{v}^{-1} \mathbf{H W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{X}}^{(t)}, \end{aligned} $$
(5.29)

where N(v) represents the set of hyperedges connected to vertex v, and γ ∈ [0, 1] is the weight coefficient of the filter. D v denotes the diagonal matrix of the vertex degrees, D e denotes the diagonal matrix of the hyperedge degrees, and H is the incidence matrix of the hypergraph. To obtain a symmetric operator whose spectral radius is no larger than 1, we can replace \({\mathbf {D}}_{\mathrm {v}}^{-1} \mathbf {HWD}_{e}^{-1} {\mathbf {H}}^\top \) with its symmetric normalized form:

$$\displaystyle \begin{aligned} {\mathbf{X}}^{(t+1)} &=(1-\gamma) {\mathbf{X}}^{(t)}+\gamma {\mathbf{D}}_{v}^{-1 / 2} \mathbf{H W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^\top {\mathbf{D}}_{v}^{-1 / 2} {\mathbf{X}}^{(t)} \\ &={\mathbf{X}}^{(t)}-\gamma\left(\mathbf{I}-{\mathbf{D}}_{v}^{-1 / 2} \mathbf{H W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^\top {\mathbf{D}}_{v}^{-1 / 2}\right) {\mathbf{X}}^{(t)}. \end{aligned} $$
(5.30)

Then, the multi-order hypergraph Laplacian smoothing filter can be written as

$$\displaystyle \begin{aligned} {\mathbf{X}}^{(t)}=(\mathbf{I}-\gamma \mathbf{L})^{t} \mathbf{X}. \end{aligned} $$
(5.31)

By the eigendecomposition of the hypergraph Laplacian \(\mathbf{L}=\mathbf{U} \boldsymbol{\varLambda} \mathbf{U}^{-1}\), the diagonal elements of the diagonal matrix Λ are the eigenvalues of L. The frequency response function is

$$\displaystyle \begin{aligned} p(\boldsymbol{\varLambda})=\operatorname{diag}\left(p\left(\lambda_{1}\right), \ldots, p\left(\lambda_{|\mathbb{V}|}\right)\right), \end{aligned} $$
(5.32)
$$\displaystyle \begin{aligned} p(\lambda)=1-\gamma \lambda, \gamma \in[0,1]. \end{aligned} $$
(5.33)

Since the eigenvalues of the hypergraph Laplacian satisfy λ ∈ [0, 1], p(Λ) is a positive semi-definite matrix, and the value of p(λ) decreases as λ increases. Therefore, the hypergraph Laplacian smoothing filter can effectively suppress high-frequency signals:

$$\displaystyle \begin{aligned} \mathbf{F}=\mathbf{U} p(\boldsymbol{\varLambda}) {\mathbf{U}}^{-1}=\mathbf{U}(\mathbf{I}-\gamma \boldsymbol{\varLambda}) {\mathbf{U}}^{-1}=\mathbf{I}-\gamma \mathbf{L}. \end{aligned} $$
(5.34)
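The following minimal NumPy sketch applies the multi-order smoothing filter of Eq. (5.31) using the symmetric normalized form of Eq. (5.30); the toy incidence matrix, the weight coefficient γ, and the filtering order t are our own illustrative assumptions.

```python
# Multi-order hypergraph Laplacian smoothing filter, Eqs. (5.30)-(5.31) and (5.34).
import numpy as np

def smooth(H, w, X, gamma=0.5, t=2):
    d_v = H @ w                                  # vertex degrees
    d_e = H.sum(axis=0)                          # hyperedge degrees
    Dv_isqrt = np.diag(1 / np.sqrt(d_v))
    Theta = Dv_isqrt @ H @ np.diag(w) @ np.diag(1 / d_e) @ H.T @ Dv_isqrt
    L = np.eye(H.shape[0]) - Theta               # hypergraph Laplacian
    F = np.eye(H.shape[0]) - gamma * L           # low-pass filter F = I - gamma*L, Eq. (5.34)
    for _ in range(t):                           # apply the filter t times, Eq. (5.31)
        X = F @ X
    return X

H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
X_sm = smooth(H, np.ones(2), np.random.rand(5, 8))
```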

Figure 5.7 illustrates how the relational reconstruction auto-encoder, applied after obtaining the smoothed feature matrix, learns low-dimensional vertex representations without losing information. First, the incidence matrix is used to generate the adjacency matrix:

$$\displaystyle \begin{aligned} \mathbf{A}=\varepsilon\left(\mathbf{H H}^{\top}\right), \end{aligned} $$
(5.35)
$$\displaystyle \begin{aligned} \varepsilon(x)=\left\{\begin{array}{ll} 1, & x>0 \\ 0, & x=0 \end{array}.\right. \end{aligned} $$
(5.36)
Fig. 5.7 The framework of the adaptive hypergraph auto-encoder. This figure is from [3]

A single fully connected layer is used to compress the filtered feature matrix:

$$\displaystyle \begin{aligned} \mathbf{Z}=\operatorname{scale}\left({\mathbf{X}}_{\mathbf{s m}} \boldsymbol{\varTheta}\right), \end{aligned} $$
(5.37)
$$\displaystyle \begin{aligned} \operatorname{scale}(\mathbf{x})=\frac{\mathbf{x}-\min (\mathbf{x})}{\max (\mathbf{x})-\min (\mathbf{x})}, \end{aligned} $$
(5.38)

where Z represents the vertex embedding matrix, which includes both structural and feature information, and Θ is a learnable parameter used to extract features from the vertices. scale(⋅) is a normalization function that rescales the vertex features to the range [0, 1]. The similarity matrix of the vertex features is then computed as

$$\displaystyle \begin{aligned} \mathbf{S}=\operatorname{sigmoid}\left(\mathbf{Z Z}^\top\right), \end{aligned} $$
(5.39)
$$\displaystyle \begin{aligned} \operatorname{sigmoid}(x)=\frac{1}{1+e^{-x}}. \end{aligned} $$
(5.40)

The inner product decoder is used to reconstruct each vertex and its neighbors, and the objective is to minimize the error between the adjacency matrix A and the similarity matrix S. However, using Eq. (5.35) to construct the adjacency matrix leads to a problem: the number of edges grows rapidly as the hyperedge degrees increase. To solve this problem, the elements of matrix A are weighted as

$$\displaystyle \begin{aligned} {\mathbf{W}}_{i j}=\left\{\begin{array}{cc} \frac{|\mathbb{V}|{}^{2}-\sum \sum {\mathbf{A}}_{i j}}{\sum \sum {\mathbf{A}}_{i j}} & , {\mathbf{A}}_{i j}=1 \\ 1 & , {\mathbf{A}}_{i j}=0 \end{array} .\right. \end{aligned} $$
(5.41)

The reconstruction loss can be calculated by using the weighted binary cross-entropy function:

$$\displaystyle \begin{aligned} L_{r e}=\frac{1}{|\mathbb{V}|{}^{2}} \sum_{i=1}^{|\mathbb{V}|} \sum_{j=1}^{|\mathbb{V}|}-{\mathbf{W}}_{i j}\left[{\mathbf{A}}_{i j} \log {\mathbf{S}}_{i j}+\left(1-{\mathbf{A}}_{i j}\right) \log \left(1-{\mathbf{S}}_{i j}\right)\right]. \end{aligned} $$
(5.42)

The relational reconstruction auto-encoder can be trained to produce the learned vertex embeddings, and the spectral clustering technique can be further used to obtain the final clustering results.
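A minimal NumPy sketch of the reconstruction objective in Eqs. (5.35)–(5.42) is given below. It only evaluates the loss with a random stand-in for the learnable parameter Θ; the actual AHGAE trains Θ by gradient descent, which is omitted here.

```python
# Relational reconstruction objective of AHGAE, Eqs. (5.35)-(5.42).
import numpy as np

def ahgae_loss(H, X_sm, dim=16, rng=np.random.default_rng(0)):
    A = (H @ H.T > 0).astype(float)                       # adjacency, Eqs. (5.35)-(5.36)
    n = A.shape[0]
    Theta = rng.standard_normal((X_sm.shape[1], dim))     # stand-in for the learnable layer
    Z = X_sm @ Theta                                      # Eq. (5.37)
    Z = (Z - Z.min()) / (Z.max() - Z.min())               # scale(.), Eq. (5.38)
    S = 1 / (1 + np.exp(-Z @ Z.T))                        # similarity, Eqs. (5.39)-(5.40)
    n_pos = A.sum()
    W = np.where(A == 1, (n**2 - n_pos) / n_pos, 1.0)     # edge weights, Eq. (5.41)
    bce = -W * (A * np.log(S) + (1 - A) * np.log(1 - S))  # weighted BCE, Eq. (5.42)
    return bce.mean()

H = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]], dtype=float)
print(ahgae_loss(H, np.random.rand(5, 8)))
```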

5.4 Cost-Sensitive Learning on Hypergraph

Many machine learning applications involve cost-sensitive scenarios, where different types of errors in real-world tasks result in losses of varying severity. In diagnostic work, for example, misdiagnosing a patient as a healthy person is significantly more costly than classifying a healthy individual as a patient, as shown in Fig. 5.8. Similar cases also arise in software defect prediction, where misjudging a defective software module as a good one may break the software system and have disastrous repercussions. Cost-sensitive learning methods [11,12,13] have been developed to deal with such issues.

Fig. 5.8 A medical example of a cost-sensitive classification scenario

In many cases, the data from some categories may be abundant, while the data from other categories may be very limited. Such imbalanced data distributions lead to different costs for the classification performance of different categories. Under such circumstances, imbalanced learning [13, 14] has attracted much attention, which aims to learn an accurate predictor from imbalanced sample data. In traditional methods, sampling strategies [15, 16] are used to over-sample the minority class and under-sample the majority class to address the imbalanced sample problem. Another way is to conduct cost-sensitive learning, which focuses more on the minority class.

To address the cost-sensitive issue in hypergraph computation, in this section we introduce the cost-sensitive hypergraph computation framework [4] and cost interval optimization for hypergraph computation [5], respectively. First, we describe how to quantify cost in the hypergraph modeling procedure [4], in which a fixed cost value is provided for modeling, and thereafter, we illustrate how to use the cost-sensitive hypergraph computation approach to tackle imbalanced problems. As a precise cost value for mis-classification may not be available in practice, we then introduce the hypergraph computation method with cost interval optimization [5], which only requires a cost interval while modeling the high-order relations among the data. Figure 5.9 shows the frameworks of hypergraph computation under cost-sensitive scenarios, from traditional hypergraph modeling and hypergraph modeling with a cost matrix to hypergraph modeling with a cost interval.

Fig. 5.9 The frameworks of hypergraph computation under cost-sensitive scenarios

(1) Cost-Sensitive Hypergraph Computation

In this part, we introduce a cost-sensitive hypergraph computation method [4], and Fig. 5.10 shows the framework of this method. The framework consists of two stages to handle the cost-sensitive issue: in the first stage, the F-measure is used to calculate candidate cost information for cost-sensitive learning, and in the second stage, the hypergraph structure is utilized to model the high-order correlations among the data.

Fig. 5.10 The framework of cost-sensitive hypergraph computation. This figure is from [4]

First, we introduce hypergraph modeling with a cost matrix. In traditional hypergraph modeling, each vertex represents a subject, and the hyperedges connect related vertices. To introduce cost information into hypergraph modeling, a cost matrix is associated with each vertex, indicating different costs for mis-classification, as shown in Fig. 5.11 for a binary classification task. The definition of the cost matrix is as follows.

Fig. 5.11 An illustration of hypergraph modeling with cost matrix

As shown in Fig. 5.11, the cost matrix is a 2 × 2 matrix, including the true positive cost C TP, the true negative cost C TN, the false positive cost C FP, and the false negative cost C FN. The true positive cost and the true negative cost are usually 0 since they correspond to correct predictions. The preference of the cost-sensitive hypergraph for each class is controlled by assigning different values to the false positive cost and the false negative cost in the cost matrix. As a special case, if the false positive cost and the false negative cost are equal, the cost-sensitive hypergraph reduces to traditional hypergraph modeling.

We first generate candidate cost information and then apply the F-measure to optimize the cost for both binary and multi-class data. For a classifier h, we can define the error profile as

$$\displaystyle \begin{aligned} \varPsi(h)=\left(\mathrm{FN}_{1}(h), \mathrm{FP}_{1}(h), \ldots, \mathrm{FN}_{N_{c}}(h), \mathrm{FP}_{N_{c}}(h)\right), \end{aligned} $$
(5.43)

where N c represents the number of classes, and FN and FP represent the false negative and false positive probabilities. For simplicity, we let ψ 2k−1 represent the FN probability of the k-th class and ψ 2k represent the FP probability of the k-th class. The F-measure for binary classification can be defined as

$$\displaystyle \begin{aligned} F_{\beta}(\varPsi)=\frac{\left(1+\beta^{2}\right)\left(P_{1}-\psi_{1}\right)}{\left(1+\beta^{2}\right) P_{1}-\psi_{1}+\psi_{2}}, \end{aligned} $$
(5.44)

where P k represents the marginal probability of class k. Similarly, the micro-F-measure for multi-class classification can be defined as

$$\displaystyle \begin{aligned} m c F_{\beta}(\varPsi)=\frac{\left(1+\beta^{2}\right)\left(1-P_{1}-\sum_{k=2}^{C} \psi_{2 k-1}\right)}{\left(1+\beta^{2}\right)\left(1-P_{1}\right)-\sum_{k=2}^{C} \psi_{2 k-1}+\psi_{1}}. \end{aligned} $$
(5.45)

We can further divide the F-measure values in the region [0, 1] into a collection of equally spaced values F = {f i} to calculate the cost of various mis-classifications. The cost function Υ is then used to construct a cost vector for every f i. For binary classification, constraining the denominator of Eq. (5.44) to be positive and requiring F β(Ψ) ≤ f for a given F-measure value f ∈ F yields

$$\displaystyle \begin{aligned} \left(1+\beta^{2}-f\right) \psi_{1}+f \psi_{2}+\left(1+\beta^{2}\right) P_{1}(f-1) \geq 0 . \end{aligned} $$
(5.46)

Therefore, the costs of ψ 1 and ψ 2 can be allocated according to the coefficients 1 + β 2 − f and f, respectively, and the cost function can be written as follows:

$$\displaystyle \begin{aligned} \varUpsilon_{i}^{F_{\beta}}=\left\{\begin{array}{ll} 1+\beta^{2}-f, & \text{if sample from class 1} \\ f, & \text{if sample from class 2} \\ 0, & \text{otherwise } \end{array}\right. . \end{aligned} $$
(5.47)

Similarly, the cost function of multi-class classification can be written as follows:

$$\displaystyle \begin{aligned} \varUpsilon_{i}^{m l F_{\beta}}=\left\{\begin{array}{ll} 1+\beta^{2}-f, & \text{if sample from odd class and not from class 1} \\ f, & \text{if sample from class 1} \\ 0, & \text{otherwise} \end{array}\right. . \end{aligned} $$
(5.48)
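The candidate cost assignment of Eq. (5.47) can be sketched as follows (assuming NumPy); the grid size and the encoding of the two classes as labels 1 and 2 are our own illustrative choices.

```python
# Candidate cost vectors for binary classification, Eq. (5.47).
import numpy as np

def binary_cost_vectors(labels, beta=1.0, n_grid=10):
    """labels: array of 1s and 2s; returns one candidate cost vector per grid value f."""
    grid = np.linspace(0, 1, n_grid)
    costs = []
    for f in grid:
        cost = np.where(labels == 1, 1 + beta**2 - f,      # class 1: cost 1 + beta^2 - f
                        np.where(labels == 2, f, 0.0))      # class 2: cost f
        costs.append(cost)
    return grid, np.stack(costs)

grid, costs = binary_cost_vectors(np.array([1, 1, 2, 2, 2]))
```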

The cost obtained from F-measure optimization is incorporated into the optimization function to increase the efficacy of the hypergraph computation method on imbalanced data. We first regard each data sample as a vertex of the hypergraph and then apply the k-nearest-neighbor algorithm to construct the hyperedges. The cost-sensitive hypergraph differs in that it includes the cost matrix information of each vertex in addition to the original hypergraph correlation structure. With the training and testing samples represented by O, the cost-sensitive hypergraph computation objective can be expressed as

$$\displaystyle \begin{aligned} \arg\underset{\omega, \mathbf{W}}{ \min} &\Big \{\mu \varOmega(\omega) + \mathbb{R}_{emp}(\omega) + \lambda\varPhi(\mathbf{W}) \Big \},\\ s.t.~&\sum_{j=1}^N {\mathbf{W}}_{j,j} = 1, \forall~{\mathbf{W}}_{j,j}\geq 0,\\ \end{aligned} $$
(5.49)

where \(\varOmega(\omega) = (\mathbf{O}\omega)^\top\varDelta(\mathbf{O}\omega)\) represents the hypergraph Laplacian regularizer with the hypergraph Laplacian Δ, \(\mathbb {R}_{emp}(\omega )=\|\varUpsilon (\mathbf {O} \omega -\mathbf {y})\|{ }_{2}^{2}=\sum _{i=1}^{N}\left (\varUpsilon _{i, i}\left ({\mathbf {o}}_{i} \omega -{\mathbf {y}}_{i}\right )\right )^{2}\) is the empirical loss using cost information, where the element Υ i,i of the diagonal matrix Υ represents the cost of the i-th sample, \(\varPhi (\mathbf {W})=\|\mathbf {W}\|{ }_{\mathrm {F}}^{2}\) stands for the hypergraph regularization, ω represents the mapping vector to be learned, W is a diagonal matrix representing the hyperedge weights, and μ and λ are the trade-off hyperparameters. We first fix W to optimize ω, and then the optimization problem can be expressed as

$$\displaystyle \begin{aligned} \arg \min _{{\omega}}\left\{\|\varUpsilon(\mathbf{O} {\omega})-\mathbf{y}\|{}_{2}^{2}+\mu(\mathbf{O} {\omega})^{\top} \varDelta(\mathbf{O} {\omega})\right\}. \end{aligned} $$
(5.50)

The optimal ω can be obtained as

$$\displaystyle \begin{aligned} {\omega}=\left({\mathbf{O}}^{\top} {\varUpsilon}^{2} \mathbf{O}+\mu {\mathbf{O}}^{\top} \varDelta \mathbf{O}\right)^{-1}\left({\mathbf{O}}^{\top} {\varUpsilon} \mathbf{y}\right). \end{aligned} $$
(5.51)
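A minimal NumPy sketch of the closed-form update in Eq. (5.51) is given below; the inputs O, y, Υ, and Δ are assumed to be prepared as described above.

```python
# Closed-form solution for the mapping vector omega, Eq. (5.51).
import numpy as np

def update_omega(O, y, Upsilon, Delta, mu=0.1):
    """O: data matrix, y: labels, Upsilon: diagonal cost matrix, Delta: hypergraph Laplacian."""
    A = O.T @ Upsilon @ Upsilon @ O + mu * O.T @ Delta @ O
    return np.linalg.solve(A, O.T @ Upsilon @ y)
```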

Then, we fix ω to optimize W:

$$\displaystyle \begin{aligned} \arg\underset{\mathbf{W}}{ \min} &\Big \{\mu(\mathbf{O}\omega)^{\top} \varDelta(\mathbf{O} \omega)+\lambda\|\mathbf{W}\|{}_{{F}}^{2} \Big \}.\\ s.t.~&\sum_{j=1}^N {\mathbf{W}}_{j,j} = 1, \forall~{\mathbf{W}}_{j,j}\geq 0.\\ \end{aligned} $$
(5.52)

We can have W as

$$\displaystyle \begin{aligned} \mathbf{W}=\frac{\mu \varLambda^{\top} \varLambda\left({\mathbf{D}}_{e}\right)^{-1}-\eta \mathbf{I}}{2 \lambda}, \end{aligned} $$
(5.53)

where η can be calculated as \(\eta =\frac {\mu \varLambda \left ({\mathbf {D}}_{e}\right )^{-1} \varLambda ^{\top }-2 \lambda }{N}\), and Λ can be calculated as \(\varLambda =(\mathbf {O} {\omega })^{\top }\left ({\mathbf {D}}_{v}\right )^{-1 / 2} \mathbf {H}\). With the optimized mapping vector ω, a test sample ζ i obtains its classification result as γ = ζ i ω.

Each piece of candidate cost information c i generates a cost matrix Υ, which is then used to build a cost-sensitive hypergraph structure \(\mathbb {G}_i\). The model then selects the cost-sensitive hypergraph with the highest F-measure as the final choice.

(2) Cost Interval Optimization for Hypergraph Computation

As the cost value for cost-sensitive hypergraph modeling is not easy to determine in practice, in this part we introduce a cost interval optimization method for hypergraph computation [5], in which the fixed cost value is replaced by a cost interval, which is much easier to provide.

Given a hypergraph \(\mathbb {G}=(\mathbb {V}, \mathbb {E}, \mathbf {W})\), in order to optimize the overall cost by incorporating the mis-classification costs of the various categories into the hypergraph framework, the objective of the cost-sensitive hypergraph is composed of three components, i.e., the empirical loss using cost information, the hypergraph Laplacian regularizer, and the hypergraph regularization.

The empirical loss using cost information can be formulated as

$$\displaystyle \begin{aligned} \mathbb{R}_{emp}({\omega})=\|{\varPhi}(\mathbf{S} {\omega}-\mathbf{y})\|{}_{2}^{2}=\sum_{i=1}^{N_{v}}\left({\varPhi}_{i, i}\left({\mathbf{s}}_{i} {\omega}-{\mathbf{y}}_{i}\right)\right)^{2}, \end{aligned} $$
(5.54)

where ω represents the mapping vector, and Φ is a diagonal matrix representing mis-classification cost weights. The hypergraph Laplacian regularizer can be written as

$$\displaystyle \begin{aligned} \varOmega({\omega})&=\frac{1}{2} \sum_{e \in \mathbb{E}} \sum_{v_{i}, v_{j} \in \mathbb{V}} \frac{\mathbf{W}(e) \mathbf{H}\left(v_{i}, e\right) \mathbf{H}\left(v_{j}, e\right)}{\delta(e)}\left(\frac{{\mathbf{s}}_{i}{\omega}}{\sqrt{d\left(v_{i}\right)}}-\frac{{\mathbf{s}}_{j}{\omega}}{\sqrt{d\left(v_{j}\right)}}\right)^{2} \\ &=(\mathbf{S} {\omega})^{\top} \varDelta(\mathbf{S} {\omega}). \end{aligned} $$
(5.55)

To adjust the hyperedge weights and hence the classification ability of the hypergraph, the hypergraph regularization is written as \(\varPsi (\mathbf {W})=\|\mathbf {W}\|{ }_{{F}}^{2}\). Note that this term can be removed in applications where it is not required.

Combining the above three, the whole optimization task for cost-sensitive hypergraph computation can be written as

$$\displaystyle \begin{aligned} \arg\underset{\omega, \mathbf{W}}{ \min} &\Big \{\|{\varPhi}(\mathbf{S} {\omega}-\mathbf{y})\|{}_{2}^{2} + \mu(\mathbf{S} {\omega})^{\top} \varDelta(\mathbf{S} {\omega}) + \lambda\|\mathbf{W}\|{}_{\mathrm{F}}^{2} \Big \},\\ s.t.~&\sum_{j=1}^{N_e} {\mathbf{W}}_{j,j} = 1, \forall~{\mathbf{W}}_{j,j}\geq 0,\\ \end{aligned} $$
(5.56)

where μ and λ are the trade-off hyperparameters.

The precise cost of each category is required for cost-sensitive hypergraph computation, but this cost is frequently impossible to obtain; often it is only known that the cost lies within a cost interval [C min, C max]. Therefore, a simple idea is to try all values inside the cost interval and minimize the overall cost. However, this is inefficient given the possibly large cost interval. As the actual cost is difficult to establish, we need to find a surrogate cost c ∗ to guide the optimization procedure, and the surrogate classifier h is expected to perform as well as the classifier h t trained with the true cost. In this way, the problem can be formulated as

$$\displaystyle \begin{aligned} \min_{h,c^{*}} &~L(h, c^*),\\ s.t.&~p(L(h,c)<\theta)>1-\varphi, \forall~c\in [C_{min},C_{max}],C_{min}\leq c^{*}\leq C_{max}, \end{aligned} $$
(5.57)

where L(h, c) is the empirical risk, formulated as \(L(h, c)=\sum _{i=1}^{N_v} c\,I(\rho _i\neq y_i\wedge y_i=+)+I\left (\rho _{i} \neq y_i \wedge y_i=-\right )\), where ρ i = s i ω is the predicted label of the i-th sample in the test set, and +  and − represent the labels of the important class and the unimportant class, respectively.

The worst-case risk is considered first to guarantee that all constraints can be satisfied. The worst-case classifier h ∗ can be written as

$$\displaystyle \begin{aligned} h^{*}=\arg~\underset{h}{\min}~\underset{c}{\sup}~L(h,c) \end{aligned} $$
(5.58)

and

$$\displaystyle \begin{aligned} p\left(\underset{c}{\sup} L\left(h^{*}, c\right)<\theta\right)>1-\varphi. \end{aligned} $$
(5.59)

We then have \(p\left ( L\left (h^{*}, c\right )<\theta \right )>1-\varphi \) for any c. The worst-case risk is attained when the surrogate cost c equals C max. However, only a solution that meets the constraints can be acquired in this manner, and the cost cannot be guaranteed to be close to the true cost. As the average cost yields the smallest maximum distortion of the true risk, it is another good choice, which can be calculated as C mean = 0.5(C max + C min).

With the surrogate costs C max and C mean, cost interval optimization proceeds in two stages. In the first stage, C max is used as the surrogate cost, and a collection of cost-sensitive hypergraph structures with varying parameter values is learned. In the second stage, C mean is used as the surrogate cost to evaluate the overall cost on the validation set, and the hypergraph structure with the lowest overall cost is chosen as the final solution. A minimal code sketch of this procedure is given below.
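In the sketch below (assuming NumPy), the helper train_and_predict is hypothetical and stands in for solving Eq. (5.56) with a fixed surrogate cost and predicting on the validation set; empirical_risk implements L(h, c) as given under Eq. (5.57).

```python
# Two-stage cost interval optimization with surrogate costs C_max and C_mean.
import numpy as np

def empirical_risk(pred, y, c):
    """L(h, c): cost c per missed positive (important class), cost 1 per false alarm."""
    fn = np.sum((pred != y) & (y == 1))
    fp = np.sum((pred != y) & (y == -1))
    return c * fn + fp

def cost_interval_select(train_and_predict, y_valid, C_min, C_max, param_grid):
    """train_and_predict(cost, mu, lam) -> validation predictions (hypothetical helper)."""
    C_mean = 0.5 * (C_max + C_min)
    # stage 1: learn candidate structures with the worst-case surrogate cost C_max
    candidates = [(params, train_and_predict(C_max, *params)) for params in param_grid]
    # stage 2: keep the candidate with the lowest overall cost evaluated at C_mean
    best_params, _ = min(candidates, key=lambda it: empirical_risk(it[1], y_valid, C_mean))
    return best_params
```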

In this section, we have described cost-sensitive hypergraph computation methods. The imbalanced data issue is very common in many applications. Cost-sensitive hypergraph computation introduces a cost matrix into hypergraph modeling, and either a fixed cost value or a cost interval can be used in the learning process.

5.5 Link Prediction on Hypergraph

Link prediction is a fundamental task in network analysis. The objective of link prediction is to predict whether two vertices in a network may have a link. Link prediction has wide applications in different domains, such as social relation exploration [17, 18], protein interaction prediction [19, 20], and recommender systems [21, 22], and has attracted much attention in the past decades.

Link prediction on hypergraph aims to discover missing relations or predict newly arriving hyperedges based on the observed hypergraph, where hypergraph computation can be used to deeply exploit the underlying high-order correlations among the data. Unlike the link prediction task on the graph structure [23, 24], the hypergraph models the high-order correlation among the data, which is heterogeneous in many applications, since the vertices are of different types. For example, in a bibliographic network, a vertex can represent a paper, an author, or a venue, while a hyperedge represents the relation where a paper is written by multiple authors and published in a venue. These different types of vertices do not necessarily share the same representation space. From the view of a hyperedge event, a heterogeneous hypergraph contains two kinds of vertices, i.e., identifier vertices and slave vertices. An identifier vertex uniquely determines a hyperedge, while the slave vertices are the remaining vertices in the hyperedge. In this section, we introduce the Heterogeneous Hypergraph Variational Auto-encoder (HeteHG-VAE) method [6] for the heterogeneous hypergraph link prediction task.

The overview of HeteHG-VAE is shown in Fig. 5.12. HeteHG-VAE aims to learn low-dimensional heterogeneous hypergraph embeddings based on a Bayesian deep generative strategy. The input hypergraph is represented by the incidence matrix H, whose sub-matrices correspond to the sub-hypergraphs generated by the different types of slave vertices. The heterogeneous encoder projects the vertices and the hyperedges to the vertex embedding and the hyperedge embedding, respectively. The hypergraph embedding is the combination of the vertex embedding and the hyperedge embedding, which is used by the hypergraph decoder to reconstruct the incidence matrix.

Fig. 5.12 An illustration of the HeteHG-VAE method. This figure is from [6]

In the remainder of this section, we first introduce the variational evidence lower bound with its task-specific derivation. Then, the inference model, including the heterogeneous vertex encoder and the heterogeneous hyperedge encoder, is presented. Finally, the generative model and the link prediction method are introduced.

Denote \(\{ x_k \}_{k=1}^K\) as the observed data, where K is the total number of vertex types, \({\mathbf {Z}}^V_k\) as the latent vertex embedding, and Z E as the latent hyperedge embedding. HeteHG-VAE assumes that \(\mathbf Z^V_k\) and Z E are drawn i.i.d. from a Gaussian prior, i.e., \(\mathbf Z^V_k\sim p_0(\mathbf Z^V_k)\) and Z E ∼ p 0(Z E), and that x k is drawn from the conditional distribution \(x_k\sim p(x_k|\mathbf Z^V_k,Z^E;\lambda _k)\), where λ k is the parameter of the distribution. The objective of HeteHG-VAE is to maximize the log-likelihood of the observed data by optimizing λ k as follows:

$$\displaystyle \begin{aligned} {} &\log p(x_1, \cdots, x_K; \lambda) \\ &\quad =\log \int_{{\mathbf{Z}}^V_1}\cdots \int_{{\mathbf{Z}}^V_K}\int_{{\mathbf{Z}}^E} p(x_1, \cdots, x_K, {\mathbf{Z}}^V_1, \cdots, {\mathbf{Z}}^V_K, {\mathbf{Z}}^E;\lambda)d{\mathbf{Z}}^V_1\cdots d{\mathbf{Z}}^V_Kd{\mathbf{Z}}^E\\ &\quad \geq \mathbb{E}_q \left( \log \frac{ p(x_1, \cdots, x_K, {\mathbf{Z}}^V_1, \cdots, {\mathbf{Z}}^V_K, {\mathbf{Z}}^E;\lambda) }{ q({\mathbf{Z}}^V_1, \cdots, {\mathbf{Z}}^V_K, {\mathbf{Z}}^E|x_1, \cdots, x_K;\theta)} \right) \\ &\quad := \mathbb{L}(x_1, \cdots, x_K; \theta, \lambda), \end{aligned} $$
(5.60)

where q(⋅) is the variational posterior for the estimation of the true posterior \(p(\mathbf Z^V_1, \ldots , \mathbf Z^V_K, \mathbf Z^E|x_1, \ldots , x_K)\), which is inaccessible, and θ is the parameter to be estimated. Then, \(\mathbb {L}(x_1, \ldots , x_K; \theta , \lambda )\) is the evidence lower bound of the log marginal likelihood. Based on the evidence lower bound, an inference encoder is presented to parameterize q, and a generative decoder is used to parameterize p.

The inference encoder of HeteHG-VAE consists of two main parts, i.e., the heterogeneous vertex encoder and the heterogeneous hyperedge encoder. The heterogeneous vertex encoder first maps the observed data x k to a latent space \(\tilde {\mathbf Z}^V_k\), which can be written as

$$\displaystyle \begin{aligned} \tilde{\mathbf Z}^V_k = f^V(x_k\mathbf W^V_k+b^V_k), \end{aligned} $$
(5.61)

where \(\mathbf W^V_k\) and \(b^V_k\) are the to-be-learned weights of the model, and f V is a nonlinear activation function. Two separate linear layers then map the latent representation to the means \(\mu ^V_k\) and variances \(\sigma ^V_k\) of q:

$$\displaystyle \begin{aligned} & \mu^V_k = \tilde{\mathbf Z}^V_k\mathbf W^{V\mu}_k + b^{V\mu}_k, \end{aligned} $$
(5.62)
$$\displaystyle \begin{aligned} &\sigma^V_k = \tilde{\mathbf Z}^V_k\mathbf W^{V\sigma}_k + b^{V\sigma}_k, \end{aligned} $$
(5.63)

where \(\mathbf W^{V\mu }_k\), \( b^{V\mu }_k\), \(\mathbf W^{V\sigma }_k\), and \(b^{V\sigma }_k\) are learnable parameters. The vertex embedding \({\mathbf{Z}}^V_k\) is sampled from the Gaussian distribution \(\mathbb {N}(\mu ^V_k,\sigma ^V_k)\).
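A minimal NumPy sketch of the heterogeneous vertex encoder in Eqs. (5.61)–(5.63) with the usual VAE reparameterization is given below; the random weights, the tanh nonlinearity for f V, and reading σ as a log-variance are our own implementation assumptions, not details from [6].

```python
# Heterogeneous vertex encoder with reparameterized sampling, Eqs. (5.61)-(5.63).
import numpy as np

def vertex_encoder(x_k, dim=16, rng=np.random.default_rng(0)):
    n_in = x_k.shape[1]
    W_v, b_v = rng.standard_normal((n_in, dim)) * 0.1, np.zeros(dim)     # stand-ins
    W_mu, b_mu = rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)
    W_sig, b_sig = rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim)

    Z_tilde = np.tanh(x_k @ W_v + b_v)           # Eq. (5.61), tanh assumed for f^V
    mu = Z_tilde @ W_mu + b_mu                   # Eq. (5.62)
    log_sigma = Z_tilde @ W_sig + b_sig          # Eq. (5.63), read as log-variance
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_sigma) * eps    # sample Z^V_k ~ N(mu, sigma)
```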

The heterogeneous hyperedge encoder first maps the observed data x k to a latent space \(\tilde {\mathbf Z}^E_k\), which can be written as

$$\displaystyle \begin{aligned} \tilde{\mathbf Z}^E_k = f^E(x_k^\top \mathbf W^E_k+b^E_k), \end{aligned} $$
(5.64)

where \(\mathbf W^E_k\) and \(b^E_k\) are the to-be-learned weights of the model, and f E is a nonlinear activation function. Then, the importance of different types of vertices is learned by the hyperedge attention mechanism, which can be written as

$$\displaystyle \begin{aligned} \tilde{\alpha}_k = \text{Tanh} (\tilde{\mathbf Z}^E_k \mathbf W_k^{E\alpha} + b_k^{E\alpha} ) \mathbf P, \end{aligned} $$
(5.65)

where \(\mathbf W_k^{E\alpha }\), \(b_k^{E\alpha }\), and P are learnable parameters. The attention score α k is obtained by normalizing \(\tilde {\alpha }_k\), and the hyperedge embedding can be written as

$$\displaystyle \begin{aligned} \tilde{\mathbf Z}^E = \sum_{k=1}^K \alpha_k \tilde{\mathbf Z}^E_k. \end{aligned} $$
(5.66)

Similarly, two separate linear layers map the latent representation to the means μ E and variances σ E of the distribution q:

$$\displaystyle \begin{aligned} & \mu^E = \tilde{\mathbf Z}^E{\mathbf{W}}^{E\mu} + b^{E\mu}, \end{aligned} $$
(5.67)
$$\displaystyle \begin{aligned} &\sigma^E = \tilde{\mathbf Z}^E{\mathbf{W}}^{E\sigma} + b^{E\sigma}, \end{aligned} $$
(5.68)

where W , b , W , and b are learnable parameters. The vertex embedding is the sample from the Gaussian distribution \(\mathbb {N}(\mu ^E,\sigma ^E)\).

Each entry of the incidence matrix is sampled from a Bernoulli distribution parameterized by \(\mathbb {H}_{ij}\):

$$\displaystyle \begin{aligned} p(\mathbf H_{ij} | {\mathbf{Z}}^V_{k,i}, {\mathbf{Z}}^E_{k,j}; \lambda_k ) = Ber(\mathbb{H}_{ij}), \end{aligned} $$
(5.69)

where \(\mathbb {H}_{ij}\) is computed from the inner product of the vertex embedding and the hyperedge embedding:

$$\displaystyle \begin{aligned} \mathbb{H}_{ij} = Sigmoid (\mathbf Z^V_{k,i}(\mathbf Z^E_j)^\top). \end{aligned} $$
(5.70)

The likelihood of the connection among vertices could be obtained based on the vertex embedding and hyperedge embedding as follows:

$$\displaystyle \begin{aligned} p_{conn}(\mathbf Z^V_i,\mathbf Z^E_j)=|| \mathbf Z^V_i,\mathbf Z^E_j ||{}_2. \end{aligned} $$
(5.71)
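A minimal NumPy sketch of the hypergraph decoder in Eqs. (5.69)–(5.70) is given below; thresholding the reconstructed incidence matrix (the commented line) is one simple way to read off predicted hyperedges and is our own illustrative choice.

```python
# Hypergraph decoder: reconstruct the incidence matrix from the embeddings, Eq. (5.70).
import numpy as np

def reconstruct_incidence(Z_v, Z_e):
    logits = Z_v @ Z_e.T                       # inner products Z^V_i (Z^E_j)^T
    return 1 / (1 + np.exp(-logits))           # Bernoulli parameters of Eq. (5.69)

# Predicted hyperedges can then be read off by thresholding, e.g.
# H_pred = (reconstruct_incidence(Z_v, Z_e) > 0.5).astype(int)
```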

In this section, we have introduced the Heterogeneous Hypergraph Variational Auto-encoder method [6] for the task of link prediction on hypergraph, which captures the high-order correlations among the data while preserving the original low-order topology. Link prediction on hypergraph has shown superior performance in different experiments and can be further used in other applications.

5.6 Summary

In this chapter, we introduce four typical hypergraph computation tasks, including label propagation, data clustering, imbalanced learning, and link prediction. Label propagation on hypergraph predicts the labels of the vertices of a hypergraph, i.e., assigns a label to each unlabeled vertex, based on the labeled information. Data clustering on hypergraph divides the vertices of a hypergraph into several groups. Imbalanced learning on hypergraph considers imbalanced data distributions and introduces cost-sensitive hypergraph computation methods. Link prediction on hypergraph discovers missing relations or predicts newly arriving hyperedges based on the observed hypergraph. We note that these four tasks are typical ways to use hypergraph computation in practice. Other tasks can also be deployed under the hypergraph computation framework, such as data regression, data completion, and data generation. Following these typical hypergraph computation tasks, we can apply them in different applications, such as social media analysis and computer vision.