6.1 Introduction

The hypergraph structure models high-order and complex correlations among data, and thus the quality of the topology plays an important role in learning tasks on hypergraphs. As shown in the previous chapter, there are implicit and explicit methods for generating a hypergraph from observed data. However, the generated hypergraph may contain redundant, missing, and noisy connections due to disturbances in the process of data collection and hypergraph construction. In other words, there may exist biases between the generated hypergraph and the ground truth structure. Under such circumstances, it is essential to optimize the hypergraph structure so that it fits the ground truth high-order correlation more accurately. The quality of a hypergraph can be directly quantified by comparison with the ground truth structure if available, or indirectly evaluated by the performance of downstream applications. Most existing hypergraph computation methods rely on a static hypergraph structure, such as the k-NN-based method [1], the cluster-based method [2], and the sparse representation-based method [3]. These methods may suffer from the inaccurate hypergraph structures that exist in practice. In this chapter, we introduce hypergraph structure evolution methods under the dynamic hypergraph structure learning mechanism. Hypergraph structure evolution can be divided into two main categories, i.e., hypergraph component optimization and hypergraph structure optimization. The problem of hypergraph structure evolution is usually integrated with the learning process and formulated as a bi-level optimization problem. Part of the work introduced in this chapter has been published in [4,5,6,7].

6.2 Hypergraph Component Optimization

Besides the main structure of a hypergraph, i.e., the incidence matrix, a hypergraph is also composed of a group of components, such as the weights of hyperedges, vertices, and even sub-hypergraphs, which play an important role in the hypergraph structure. Hypergraph component optimization aims to explore the optimal components of the hypergraph, i.e., hyperedge weights, vertex weights, and sub-hypergraph weights. The hyperedge weights represent the strength of each high-order correlation among data, while the vertex weights represent the importance of different samples in the structure. In many cases, we may construct multiple hypergraphs using multi-modal data or different criteria, which can be regarded as sub-hypergraphs. The sub-hypergraph weights measure the importance of different sub-hypergraphs in the overall structure. The optimization procedure adjusts the hyperedge weights, the vertex weights, and the sub-hypergraph weights during the training process in order to improve the performance on downstream applications.

6.2.1 Hyperedge Weight Optimization

The hyperedge is a basic component of the hypergraph, representing the high-order complex correlation among data. The initial hypergraph usually assigns an identical weight to all hyperedges. However, hyperedges actually have different effects on a given task. The hyperedge weights indicate how much different hyperedges contribute to the whole structure. In this section, we introduce hyperedge weight learning methods [4], in which the weights of hyperedges are adaptively adjusted during the training process, so that the importance of different hyperedges can be automatically modulated.

We assume that there are n hyperedges in the hypergraph, denoted by \(\left \{e_{1}, e_{2}, \ldots , e_{n}\right \}\). The weights of the hyperedges are collected in the n × 1 vector \(w=\left [w_{1}, w_{2}, \ldots , w_{n}\right ]^\top \). There is usually a constraint that the hyperedge weights sum to one, i.e., \(\sum _{i=1}^{n} w_{i}=1\). We use F to denote the output of hypergraph learning. The problem of learning hyperedge weights can then be formulated mathematically as a dual-optimization problem

$$\displaystyle \begin{aligned} \begin{gathered} \arg \underset{\mathbf{F},w}{\min} \varPsi(\mathbf{F}):=\left\{\varOmega(\mathbf{F})+\lambda R_{\mathrm{emp}}(\mathbf{F})+\mu\varPhi(w)\right\} {},\\ s.t. \ \sum_{e \in \mathbb{E}} \mathbf{W}(e)=1. \end{gathered} \end{aligned} $$
(6.1)

Here, Ω(F) and R emp(F) are the regularizer and empirical loss of F, respectively. Φ(w) is the regularizer on w. λ and μ are the scalars controlling the relative importance of these three terms.

The general formulation can be implemented by specifying the functions Ω(⋅), R emp(⋅), and Φ(⋅). As described earlier, F is the to-be-learned label matrix in the node classification task. The regularizer Ω(F) can be defined as F ⊤ Δ F, where Δ is the Laplacian matrix. The empirical loss R emp(F) in the general form can be instantiated by the difference between the learned F and the observed labels Y of the training data, i.e., the least-squares residual. The regularizer on w is an ℓ 2-norm. The general formulation can be written as

$$\displaystyle \begin{aligned} \begin{gathered} \arg \underset{\mathbf{F},w}{\min} \varPsi(\mathbf{F}):=\left\{{\mathbf{F}}^{\top} \varDelta \mathbf{F}+\lambda\|\mathbf{F}-\mathbf{Y}\|{}^{2}+\mu \sum_{i=1}^{n} w_{i}^{2}\right\}, {} \\ \textit { s.t. } \sum_{i=1}^{n} w_{i}=1. \end{gathered} \end{aligned} $$
(6.2)

The aim of the learning process is to search the optimal solution of F and w to minimize the cost function in Eq. (6.2).
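To make the notation concrete, the following sketch (ours, not from [4]; the helper name hypergraph_theta is hypothetical) assembles \(\varTheta = {\mathbf {D}}_v^{-1/2}\mathbf {H}\mathbf {W}{\mathbf {D}}_e^{-1}{\mathbf {H}}^\top {\mathbf {D}}_v^{-1/2}\) and Δ = I − Θ from an incidence matrix H and hyperedge weights w:

```python
import numpy as np

def hypergraph_theta(H, w):
    """Assemble Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} and Delta = I - Theta.

    H : (n_vertices, n_edges) incidence matrix.
    w : (n_edges,) hyperedge weights.
    """
    dv = H @ w                      # vertex degrees d(v) = sum_e w(e) H(v, e)
    de = H.sum(axis=0)              # hyperedge degrees delta(e)
    Dv_isqrt = np.diag(1.0 / np.sqrt(dv))
    Theta = Dv_isqrt @ H @ np.diag(w) @ np.diag(1.0 / de) @ H.T @ Dv_isqrt
    Delta = np.eye(H.shape[0]) - Theta
    return Theta, Delta
```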

There are two variables to be optimized in Eq. (6.2), which can be handled by the alternating optimization algorithm: in each step, one of the two variables F and w is optimized while the other is held fixed. The details of the alternating optimization strategy are introduced as follows.

Given the initial hyperedge weights, the first step is fixing w and optimizing F. The sub-problem is written as

$$\displaystyle \begin{aligned} \arg \min_{\mathbf{F}} \varPsi(\mathbf{F})=\arg \min_{\mathbf{F}}\left\{{\mathbf{F}}^{\top} \varDelta \mathbf{F}+\lambda\|\mathbf{F}-\mathbf{Y}\|{}^{2}\right\}. \end{aligned} $$
(6.3)

A closed-form solution of Eq. (6.3) is known from traditional hypergraph learning. The solution is written as

$$\displaystyle \begin{aligned} \mathbf{F} &=\left(\mathbf{I}+\frac{1}{\lambda} \varDelta\right)^{-1} \mathbf{Y} \notag \\ &=\left(\mathbf{I}+\frac{1}{\lambda}(\mathbf{I}-\varTheta)\right)^{-1} \mathbf{Y} \notag \\ &=\frac{\lambda}{\lambda+1}\left(\mathbf{I}-\frac{1}{\lambda+1} \varTheta\right)^{-1} \mathbf{Y} . {} \end{aligned} $$
(6.4)

Let \(\zeta =\frac {1}{\lambda +1}\), and Eq. (6.4) can be rewritten as

$$\displaystyle \begin{aligned} \mathbf{F}=(1-\zeta)(\mathbf{I}-\zeta \varTheta)^{-1} \mathbf{Y} . \end{aligned} $$
(6.5)

With the updated F, the next step is fixing F while optimizing w, and the sub-problem about w is

$$\displaystyle \begin{aligned} \begin{gathered} \arg \min _{w} \varPsi(\mathbf{F})=\arg \min _{w}\left\{{\mathbf{F}}^{\top} \varDelta \mathbf{F}+\mu \sum_{i=1}^{n} w_{i}^{2}\right\}, \\ \textit { s.t. } \sum_{i=1}^{n} w_{i}=1, \mu>0. \end{gathered} \end{aligned} $$
(6.6)

The method of Lagrange multipliers is employed here, and the sub-problem becomes

$$\displaystyle \begin{aligned} &\arg \min _{w, \eta} {\mathbf{F}}^{\top} \varDelta \mathbf{F}+\mu \sum_{i=1}^{n} w_{i}^{2}+\eta\left(\sum_{i=1}^{n} w_{i}-1\right) \notag \\ &=\quad \arg \min _{w, \eta} {\mathbf{F}}^{\top}\left(\mathbf{I}-{\mathbf{D}}_{v}^{-\frac{1}{2}} \mathbf{H} \mathbf{W D}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-\frac{1}{2}}\right) \mathbf{F} +\mu \sum_{i=1}^{n} w_{i}^{2}+\eta\left(\sum_{i=1}^{n} w_{i}-1\right). \end{aligned} $$
(6.7)

Let \(\varGamma ={\mathbf {D}}_{v}^{-\frac {1}{2}} \mathbf {H}\), and it can be shown that

$$\displaystyle \begin{aligned} \eta=\frac{{\mathbf{F}}^{\top} \varGamma {\mathbf{D}}_{e}^{-1} \varGamma^{\top} \mathbf{F}-2 \mu}{n} \end{aligned} $$
(6.8)

and

$$\displaystyle \begin{aligned} w_{i}=\frac{1}{n}-\frac{{\mathbf{F}}^{\top} \varGamma {\mathbf{D}}_{e}^{-1} \varGamma^{\top} \mathbf{F}}{2 n \mu}+\frac{{\mathbf{F}}^{\top} \varGamma_{i} {\mathbf{D}}_{e}^{-1}(i, i) \varGamma_{i}^{\top} \mathbf{F}}{2 \mu}. \end{aligned} $$
(6.9)

Here, Γ i denotes the i-th column of Γ.

In this way, F and w are alternately updated until convergence, and the optimal values of F and w are finally obtained. We note that the above method is a typical way to optimize the hyperedge weights using the ℓ 2-norm; other methods can learn the hyperedge weights under different constraints.
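As an illustration of the full alternating scheme, here is a minimal sketch (our own simplified implementation, reusing the hypergraph_theta helper above; it runs a fixed number of iterations instead of testing convergence, and the ℓ 2 formulation does not by itself enforce nonnegative weights):

```python
import numpy as np

def alternating_hyperedge_weights(H, Y, lam=1.0, mu=1.0, n_iter=10):
    """Alternate the closed-form updates of F (Eq. 6.5) and w (Eq. 6.9).

    Y : (n_vertices, n_classes) label matrix of the training data.
    """
    n_v, n_e = H.shape
    w = np.full(n_e, 1.0 / n_e)          # uniform initial hyperedge weights
    zeta = 1.0 / (lam + 1.0)
    for _ in range(n_iter):
        # F-step: F = (1 - zeta) (I - zeta * Theta)^{-1} Y with w fixed
        Theta, _ = hypergraph_theta(H, w)
        F = (1.0 - zeta) * np.linalg.solve(np.eye(n_v) - zeta * Theta, Y)
        # w-step: closed-form update from the Lagrangian conditions
        dv = H @ w
        de = H.sum(axis=0)
        Gamma = np.diag(1.0 / np.sqrt(dv)) @ H   # Gamma = Dv^{-1/2} H
        GF = Gamma.T @ F                         # row i is Gamma_i^T F
        s = (GF ** 2).sum(axis=1) / de           # F^T Gamma_i De^{-1}(i,i) Gamma_i^T F
        w = 1.0 / n_e - s.sum() / (2 * n_e * mu) + s / (2 * mu)
    return F, w
```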

6.2.2 Vertex Weight Optimization

Early hypergraph computation methods mainly focus on the weights of hyperedges and may not take the importance of vertices into account. However, the vertex set of a hypergraph may suffer from heterogeneity, class imbalance, and outliers, resulting in performance degeneration of the learning process. Therefore, it is necessary to consider the weights of vertices to define the impact of different subjects during the learning process. For example, with imbalanced data, vertices belonging to the minority class may require larger weights, and vice versa. In this part, we introduce the vertex-weighted hypergraph learning method [5], which updates the vertex weights during the learning process.

The aim of the vertex-weighted hypergraph learning algorithm is to emphasize the vertices with distinguishable information and to down-weight the redundant vertices that bring in bias and noise instead of useful information. On the basis of learning hyperedge weights, the vertex-weighted learning algorithm further considers the vertex weights. Here, let {v 1, v 2, …, v n} denote all n vertices in the hypergraph, and let u i represent the weight of vertex v i. Let U denote the diagonal matrix of vertex weights. The overall cost function is similar to that for learning hyperedge weights, but with the impact of U simultaneously taken into consideration. The general formulation is written as

$$\displaystyle \begin{aligned} \begin{gathered} \arg \underset{\mathbf{F},w}{\operatorname{min}} \varPsi_{\mathbf{U}}(\mathbf{F}):=\left\{\varOmega_{\mathbf{U}}(\mathbf{F})+\lambda R_{\mathrm{emp}}(\mathbf{F})+\mu\varPhi(w)\right\} {},\\ \textit { s.t. } \mathbf{W}(e) \geq 0, \sum_{e \in \mathbb{E}} \mathbf{H}(v,e)\mathbf{W}(e)={\mathbf{D}}_{v}(v). \end{gathered} \end{aligned} $$
(6.10)

The key point of vertex weight optimization is to design a reasonable vertex weighting scheme that scores the importance of each subject during the learning process. First, the pairwise distances between vertices are calculated based on the features. Let d ij denote the distance between vertices v i and v j, and let \(\hat {d}_i\) denote the mean distance between v i and all other training vertices with the same label. The vertex weight is then defined as

$$\displaystyle \begin{aligned} u_i = \frac{\hat{d}_i}{\sum_{j=1}^{n_{train}}\hat{d}_j}, \end{aligned} $$
(6.11)

where n train denotes the number of training samples. It is noted that only the training data are labeled and further weighted. The unlabeled vertices are initialized with an identical weight. Normalization is then applied to the vertex weights. This weighting scheme can assign higher weights to vertices that are far from other intra-class vertices and vice versa. Therefore, the importance of repeated/close samples is relatively smaller than the outliers during the hypergraph learning process.
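A possible implementation of this weighting scheme is sketched below (a sketch under the stated definitions; the function and variable names are our own, and features and labels are assumed to be given as numpy arrays):

```python
import numpy as np
from scipy.spatial.distance import cdist

def vertex_weights(X_train, labels):
    """Eq. (6.11): weight each training vertex by its mean distance to the
    other training vertices that share its label, then normalize."""
    labels = np.asarray(labels)
    D = cdist(X_train, X_train)            # pairwise distances d_ij
    d_hat = np.zeros(len(labels))
    for i, y in enumerate(labels):
        same = labels == y
        same[i] = False                    # exclude the vertex itself
        d_hat[i] = D[i, same].mean()       # mean intra-class distance
    return d_hat / d_hat.sum()             # normalized weights u_i
```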

Since the hypergraph structure is updated with vertex weights, the hypergraph structure regularizer differs from the initial one. As stated already, the hypergraph regularizer is defined based on the cut cost. Here, the cut cost is related not only to the hyperedge weights but also to the vertex weights. In general, the higher the weights of the two vertices connected by a hyperedge, the higher the cut cost. Therefore, the regularizer of the hypergraph structure is rewritten as

$$\displaystyle \begin{aligned} \varOmega(\mathbf{F})&=\sum_{k=1}^{C} \sum_{e \in \mathbb{E}} \sum_{u, v \in \mathbb{V}} \frac{\mathbf{W}(e) \mathbf{U}(u) \mathbf{H}(u, e) \mathbf{U}(v) \mathbf{H}(v, e)}{2 \delta(e)}\left(\frac{\mathbf{F}(u, k)}{\sqrt{d(u)}}-\frac{\mathbf{F}(v, k)}{\sqrt{d(v)}}\right)^{2} \notag\\ &=\sum_{k=1}^{C} \sum_{e \in \mathbb{E}} \sum_{u, v \in \mathbb{V}} \frac{\mathbf{W}(e) \mathbf{U}(u) \mathbf{H}(u, e) \mathbf{U}(v) \mathbf{H}(v, e)}{\delta(e)}\notag \\&\quad \times\left(\frac{\mathbf{F}(u, k)^{2}}{d(u)}-\frac{\mathbf{F}(u, k) \mathbf{F}(v, k)}{\sqrt{d(u) d(v)}}\right) \notag \\ &=\sum_{k=1}^{C}\left\{\sum_{u \in \mathbb{V}} \mathbf{U}(u) \mathbf{F}(u, k)^{2} \sum_{e \in \mathbb{E}} \frac{\mathbf{W}(e) \mathbf{H}(u, e)}{d(u)} \sum_{v \in \mathbb{V}} \frac{\mathbf{H}(v, e) \mathbf{U}(v)}{\delta(e)}\right. \notag\\ &\quad \left.-\sum_{e \in \mathbb{E}} \sum_{u, v \in \mathbb{V}} \frac{\mathbf{F}(u, k) \mathbf{U}(u) \mathbf{H}(u, e) \mathbf{W}(e) \mathbf{H}(v, e) \mathbf{U}(v) \mathbf{F}(v, k)}{\sqrt{d(u) d(v)} \delta(e)}\right\} \notag\\ &=\sum_{k=1}^{C} \mathbf{F}(:, k)^{\top} \boldsymbol{\varDelta}_{\mathbf{U}} \mathbf{F}(:, k)\notag\\ &={\mathbf{F}}^{\top} \boldsymbol{\varDelta}_{\mathbf{U}} \mathbf{F}. \end{aligned} $$
(6.12)

Here, F(:, k) is the k-th column of F and C is the number of data categories. Δ U is the vertex-weighted hypergraph Laplacian, which can be defined as

$$\displaystyle \begin{aligned} \boldsymbol{\varDelta}_{\mathbf{U}} = \mathbf{U} -{ \varTheta} = \mathbf{U} - {\mathbf{D}}_{v}^{-1/2} \mathbf{U} \mathbf{H W} {{\mathbf{D}}_{e}}^{-1} {\mathbf{H}}^{\top} {\mathbf{UD}_{v}}^{-1/2}. \end{aligned} $$
(6.13)
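For reference, a sketch of the construction in Eq. (6.13), assuming the degree matrices are computed from the incidence matrix as before (names are ours):

```python
import numpy as np

def vertex_weighted_laplacian(H, w, u):
    """Eq. (6.13): Delta_U = U - Dv^{-1/2} U H W De^{-1} H^T U Dv^{-1/2}."""
    dv = H @ w                              # vertex degrees
    de = H.sum(axis=0)                      # hyperedge degrees
    U = np.diag(u)
    Dv_isqrt = np.diag(1.0 / np.sqrt(dv))
    Theta_U = (Dv_isqrt @ U @ H @ np.diag(w) @ np.diag(1.0 / de)
               @ H.T @ U @ Dv_isqrt)
    return U - Theta_U
```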

Compared with the traditional hypergraph Laplacian \(\boldsymbol {\varDelta } = \mathbf {I} - {\mathbf {D}}_{v}^{-\frac {1}{2}} \mathbf {H W} {{\mathbf {D}}_{e}}^{-1} {\mathbf {H}}^{\top } {\mathbf { D}_{v} }^{-\frac {1}{2}}\), the hypergraph Laplacian with weighted vertices takes different weights of vertices into consideration during the evaluation of the cost on the hypergraph structure. Therefore, the learning task can be further defined as

$$\displaystyle \begin{aligned} \begin{gathered} \arg \underset{\mathbf{F},\mathbf{W}}{\min} \varPsi(\mathbf{F}):=\left\{{\mathbf{F}}^{\top}\boldsymbol{\varDelta}_{\mathbf{U}} \mathbf{F}+\lambda\|\mathbf{F}-\mathbf{Y}\|{}^{2}+\mu \sum_{e \in \mathbb{E}} \mathbf{W}(e)^{2}\right\}, \\ s.t. \ \mathbf{W}(e) \geq 0, \sum_{e \in \mathbb{E}} \mathbf{H}(v, e) \mathbf{W}(e)={\mathbf{D}}_{v}(v). \end{gathered} \end{aligned} $$
(6.14)

The above optimization problem can be solved by the alternating optimization algorithm. The sub-problem about F has the same closed-form solution as in traditional hypergraph learning. The sub-problem about W is written as

$$\displaystyle \begin{aligned} \begin{gathered} \arg \underset{\mathbf{W}}{\min} \varPsi(\mathbf{F}):=\left\{{\mathbf{F}}^{\top}\boldsymbol{\varDelta}_{\mathbf{U}} \mathbf{F}+\mu \sum_{e \in \mathbb{E}} \mathbf{W}(e)^{2}\right\}, \\ s.t.\ \mathbf{W}(e) \geq 0, \sum_{e \in \mathbb{E}} \mathbf{H}(v, e) \mathbf{W}(e)={\mathbf{D}}_{v}(v). \end{gathered} \end{aligned} $$
(6.15)

The above optimization task can be solved via quadratic programming, since it is convex in W. Through vertex weight optimization, the vertex-weighted hypergraph structure takes the contribution of each vertex to the whole hypergraph structure into consideration, and thus it can model the high-order relevance among objects more accurately. During the learning process, the impact of low-quality training samples on the structure and the subsequent classification task decreases continuously, while high-quality training data, which account for a minority, are given greater importance. The additional vertex weights lead to an optimal hypergraph Laplacian that measures data correlation better than the traditional one and consequently improve the classification performance.
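One way to pose this quadratic program is sketched below (ours, using scipy; it exploits the fact that, with D v fixed by the constraint and D e fixed by H, the term F ⊤ Δ U F is linear in the hyperedge weights):

```python
import numpy as np
from scipy.optimize import minimize

def solve_hyperedge_weights_qp(H, F, u, dv, mu=1.0):
    """Convex sub-problem of Eq. (6.15): minimize -q.w + mu * ||w||^2
    subject to w >= 0 and sum_e H(v,e) w(e) = dv(v) for every vertex."""
    n_v, n_e = H.shape
    de = H.sum(axis=0)
    A = np.diag(u / np.sqrt(dv)) @ H                     # A = Dv^{-1/2} U H
    # Linear coefficients: q_e = ||A[:, e]^T F||^2 / delta(e)
    q = np.array([(A[:, e] @ F) @ (A[:, e] @ F) / de[e] for e in range(n_e)])
    res = minimize(lambda w: -q @ w + mu * w @ w,
                   x0=np.full(n_e, dv.sum() / H.sum()),  # rough initial guess
                   bounds=[(0.0, None)] * n_e,
                   constraints={'type': 'eq', 'fun': lambda w: H @ w - dv},
                   method='SLSQP')
    return res.x
```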

6.2.3 Sub-hypergraph Weight Optimization

Given multiple sub-hypergraphs that jointly formulate the correlation among data, it is important to measure how these sub-hypergraphs contribute to the main task. Sub-hypergraph weight optimization adjusts the importance of the sub-hypergraphs, which model the complex correlation among multi-modal data. In this part, we introduce inductive multi-hypergraph learning (iMHL) [7], which learns the projection of the model and adjusts the weights of the sub-hypergraphs during the training process simultaneously: it models the high-order correlation of the multi-modal data with the multi-hypergraph, fuses the sub-hypergraphs with learned modality weights, and learns the mapping from the data to the labels under the supervised setting. Given testing data, the learned projection can be used to predict the corresponding labels. The framework of iMHL is illustrated in Fig. 6.1, where both offline training and online classification are supported by the inductive learning process, which can handle newly coming data efficiently.

Fig. 6.1 The framework of the inductive multi-hypergraph learning method. This figure is from [7]

Here, we denote by m the total number of sub-hypergraphs and by \(\mathbb {G}_i=( \mathbb {V}_i, \mathbb {E}_i, {\mathbf {W}}_i )\) the i-th hypergraph for the i-th modality. The projection matrices M i are combined according to the sub-hypergraph weights and are used to map the data to the labels for prediction. The combination weights ω = [ω 1, ⋯ , ω m] are also to be optimized; they represent the weights of the corresponding modalities, subject to \(\sum _{i=1}^m \omega _i = 1\) and ω ≥ 0.

The loss function \(\bar {\varPsi }\) for learning all M i can be formulated as

$$\displaystyle \begin{aligned} \bar{\varPsi} = \sum_{i=1}^m \omega_i \{ \varOmega({\mathbf{M}}_i) + \lambda R_{emp}({\mathbf{M}}_i) + \mu \varPhi({\mathbf{M}}_i) \} + \eta \varGamma(\boldsymbol{\omega}), \end{aligned} $$
(6.16)

which consists of two main parts, i.e., the summation of the cost of each sub-hypergraph and the regularization on the sub-hypergraph weights ω. Φ(M) is the regularizer on the projection matrix. We assume that the vertices with similar labels are connected strongly, and Ω(M) can then be written as

$$\displaystyle \begin{aligned} \varOmega({\mathbf{M}}) &= \frac 12 \sum_{k=1}^c \sum_{e\in\mathbb{E}}\sum_{u,v\in\mathbb{V}}\frac{\mathbf{W}(e) \mathbf{H}(u,e) \mathbf{H}(v,e)} {\delta(e)} \left( \frac{{\mathbf{X}}^\top \mathbf{M} (u,k) }{\sqrt{d(u)}} - \frac{{\mathbf{X}}^\top \mathbf{M} (v,k) }{\sqrt{d(v)}} \right)^2 \notag\\ {} &= \text{tr} ( {\mathbf{M}}^\top\mathbf{X}\varDelta {\mathbf{X}}^\top \mathbf{M} ), \end{aligned} $$
(6.17)

where Δ denotes the normalized hypergraph Laplacian,

$$\displaystyle \begin{aligned} \varDelta &= \mathbf{I} - {\mathbf{D}}_{v}^{-1/2} \mathbf{HW}{\mathbf{D}}_{e}^{-1}{\mathbf{H}}^\top {\mathbf{D}}_{v}^{-1/2}. \end{aligned} $$
(6.18)

The empirical loss term R emp(M) can be written as

$$\displaystyle \begin{aligned} R_{emp}(\mathbf{M}) = || {\mathbf{X}}^{\top}\mathbf{M} - \mathbf{Y} ||{}^2. \end{aligned} $$
(6.19)

Φ(M) can be formulated as the ℓ 2,1-norm of M,

$$\displaystyle \begin{aligned} \varPhi(\mathbf{M}) = ||\mathbf{M}||{}_{2,1}, \end{aligned} $$
(6.20)

which produces row sparsity for more informative features. Γ(ω) is the ℓ 2-norm of the sub-hypergraph weights

$$\displaystyle \begin{aligned} \varGamma(\omega) = ||\omega||{}^2, \end{aligned} $$
(6.21)

which aims to learn the optimal weights for each sub-hypergraph.

The inductive multi-hypergraph learning task can be formulated as

$$\displaystyle \begin{aligned} \begin{gathered} \arg \min_{{\mathbf{M}}_i, \omega \geq 0} \sum_{i=1}^m \omega_i \left( \varOmega({\mathbf{M}}_i) + \lambda R_{emp}({\mathbf{M}}_i) + \mu \varPhi({\mathbf{M}}_i) \right ) + \eta \varGamma(\boldsymbol{\omega}), \\ s.t.\ \sum_{i=1}^m \omega_i = 1. \end{gathered} {} \end{aligned} $$
(6.22)

It can be observed that Eq. (6.22) splits into m + 1 independent sub-problems: each M i is optimized individually, and the combination weights ω are then optimized to fuse all sub-hypergraphs.

The optimization of M i, shown below, can be solved by an iterative algorithm:

$$\displaystyle \begin{aligned} \arg\min_{{\mathbf{M}}_i} \varOmega({\mathbf{M}}_i) + \lambda R_{emp}({\mathbf{M}}_i) + \mu \varPhi({\mathbf{M}}_i). \end{aligned} $$
(6.23)

The optimization problem of ω can then be written as

$$\displaystyle \begin{aligned} \begin{gathered} {} \arg \min_{\omega \geq 0} \sum_{i=1}^m \omega_i \left( \varOmega({\mathbf{M}}_i) + \lambda R_{emp}({\mathbf{M}}_i) + \mu \varPhi({\mathbf{M}}_i) \right ) + \eta ||\boldsymbol{\omega}||{}^2, \\ s.t.\ \sum_{i=1}^m \omega_i = 1. \end{gathered} \end{aligned} $$
(6.24)

We denote Υ i = Ω(M i) + λR emp(M i) + μΦ(M i), and Eq. (6.24) can be simplified to

$$\displaystyle \begin{aligned} \begin{gathered} {} \arg \min_{\omega \geq 0} \sum_{i=1}^m \omega_i \varUpsilon_i + \eta ||\boldsymbol{\omega}||{}^2, \\ s.t.\ \sum_{i=1}^m \omega_i = 1. \end{gathered} \end{aligned} $$
(6.25)

The method of Lagrange multipliers can be applied to solve Eq. (6.25), which can be formulated as

$$\displaystyle \begin{aligned} \arg \min_{\boldsymbol{\omega}, \zeta}\sum_{i=1}^m \omega_i \varUpsilon_i + \eta ||\boldsymbol{\omega}||{}^2 + \zeta \left( \sum_{i=1}^m \omega_i - 1 \right). \end{aligned} $$
(6.26)

Then, we have

$$\displaystyle \begin{aligned} \zeta &= \frac{-\sum_{i=1}^m \varUpsilon_i - 2 \eta}{m} \end{aligned} $$
(6.27)

and

$$\displaystyle \begin{aligned} \omega_i &= \frac 1m + \frac{\sum_{i=1}^m \varUpsilon_i}{2m\eta} - \frac{\varUpsilon_i}{2\eta}. \end{aligned} $$
(6.28)

Given a testing sample \(x^t = \{ x_1^t, \cdots , x_m^t \}\) with features from each modality, the corresponding label can be predicted by

$$\displaystyle \begin{aligned} C(x^t) = \arg\max_k \sum_{i=1}^m \omega_i {x_i^t}^\top{\mathbf{M}}_i. \end{aligned} $$
(6.29)

The overall algorithm is shown in Fig. 6.2. The optimization of sub-hypergraph weights is effective because incorporating multi-modal data via multiple sub-hypergraphs makes it flexible to investigate the contributions of different data or information sources to the learning process.
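As a concrete illustration, the closed-form weight update of Eq. (6.28) and the prediction rule of Eq. (6.29) can be sketched as follows (ours; the clipping and renormalization that keep ω nonnegative and summing to one are a pragmatic choice, not part of [7]):

```python
import numpy as np

def update_subhypergraph_weights(upsilon, eta=1.0):
    """Eq. (6.28): omega_i = 1/m + sum_j Upsilon_j/(2*m*eta) - Upsilon_i/(2*eta)."""
    upsilon = np.asarray(upsilon, dtype=float)
    m = len(upsilon)
    omega = 1.0 / m + upsilon.sum() / (2 * m * eta) - upsilon / (2 * eta)
    omega = np.clip(omega, 0.0, None)      # enforce omega >= 0
    return omega / omega.sum()             # renormalize to sum to one

def predict_label(x_t, Ms, omega):
    """Eq. (6.29): fuse the per-modality scores x_i^T M_i with weights omega_i."""
    scores = sum(w * (x @ M) for x, M, w in zip(x_t, Ms, omega))
    return int(np.argmax(scores))
```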

Fig. 6.2 The workflow of the sub-hypergraph weight optimization method

6.3 Hypergraph Structure Optimization

Although the above component optimization methods can modify the weights of hyperedges, vertices, or sub-hypergraphs, it is not easy to precisely adjust inappropriate or wrong connections, since the incidences between vertices and hyperedges cannot be changed, i.e., the incidence matrix of the hypergraph is fixed. To address this challenge and further optimize the hypergraph structure, it is imperative to investigate how to finely optimize the hypergraph structure and dynamically learn the high-order relationship. This can be regarded as finding the optimal hypergraph structure in a hypergraph space, as shown in Fig. 6.3.

Fig. 6.3 An illustration of hypergraph structure evolution

In this part, we introduce the dynamic hypergraph structure learning method [6], whose framework is shown in Fig. 6.4. Different from the above component optimization methods, this method directly optimizes the incidence matrix H.

Fig. 6.4 The framework of the dynamic hypergraph structure learning method. This figure is from [6]

The output F and the incidence matrix H are jointly optimized by the dual-optimization method. The objective function of the joint learning can be formulated as

$$\displaystyle \begin{aligned} \arg \underset{\mathbf{F}, 0 \preceq \mathbf{H} \preceq 1}{\operatorname{min}} \varPsi(\mathbf{F}):=\left\{\varOmega(\mathbf{F}, \mathbf{H})+\lambda \mathbb{R}_{\text{emp }}(\mathbf{F})+\mu \varPhi(\mathbf{H})\right\}. {} \end{aligned} $$
(6.30)

There are three terms in the objective function, explained as follows:

  • First, Ω(F, H) is the regularizer related to F and H. The output F consists of the to-be-learned label vectors of the vertices. Therefore, the labels are expected to be smooth on the hypergraph structure, and the commonly used regularizer of hypergraph smoothness can be written as

    $$\displaystyle \begin{aligned} \varOmega(\mathbf{F}, \mathbf{H}) = \operatorname{tr}\left({\mathbf{F}}^{\top}\left(\mathbf{I}-{\mathbf{D}}_{v}^{-1/2} \mathbf{H} \mathbf{W D}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-1/2}\right) \mathbf{F}\right). \end{aligned} $$
    (6.31)

    However, the regularizer in the previous methods is a function only of F, while H is a fixed parameter. Here, the regularizer is a function of both F and H.

  • Second, the empirical loss \(\mathbb {R}_{\text{emp }}(\mathbf {F})\) is the ℓ 2-norm of the difference between F and Y.

  • Third, Φ(H) is the regularizer related only to H, which additionally constrains H to satisfy prior knowledge. For instance, given the feature information of the data, the hypergraph structure is expected to preserve smoothness not just in the label space but in the feature space as well. Let X denote the features of vertices, and the regularizer can be formulated as

    $$\displaystyle \begin{aligned} \varPhi(\mathbf{H}) = \operatorname{tr}\left({\mathbf{X}}^{\top}\left(\mathbf{I}-{\mathbf{D}}_{v}^{-1/2} \mathbf{H} \mathbf{W D}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-1/2}\right) \mathbf{X}\right). \end{aligned} $$
    (6.32)

To summarize, the general objective function in Eq. (6.30) for dynamic hypergraph structure learning is instantiated as

$$\displaystyle \begin{aligned} \arg \underset{\mathbf{F}, 0 \preceq \mathbf{H} \preceq 1}{\operatorname{min}} \varPsi(\mathbf{F})&:= \operatorname{tr}\left(\left(\mathrm{I}-{\mathbf{D}}_{v}^{-1/2} \mathbf{H} \mathbf{W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-1/2}\right)\left(\mathbf{F F}^{\top}+\mu \mathbf{X} {\mathbf{X}}^{\top}\right)\right)\\ &\quad +\lambda\|\mathbf{F}-\mathbf{Y}\|{}^{2}. \end{aligned} $$
(6.33)

Similar to the previous methods, the alternating optimization algorithm is applied to solve the dual-optimization problem. The sub-problem about F has the same closed-form solution as traditional hypergraph learning [8].

The key difference from the previous methods lies in the sub-problem about H, which is written as

$$\displaystyle \begin{aligned} \arg \min _{0 \preceq \mathbf{H} \preceq 1} \mathbb{Q}(\mathbf{H}) &=\varOmega(\mathbf{H})+\mu \varPhi(\mathbf{H}) \notag \\ &=\operatorname{tr}\left(\left(\mathrm{I}-{\mathbf{D}}_{v}^{-1/2} \mathbf{H} \mathbf{W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-1/2}\right) \mathbf{K}\right), {} \end{aligned} $$
(6.34)

where K = FF ⊤ + μ XX ⊤. The projected gradient method is employed here because Eq. (6.34) is a complex function of H with a bound constraint. The gradient is derived as

$$\displaystyle \begin{aligned} \nabla \mathbb{Q}(\mathbf{H})=&\mathbf{J}\left(\mathbf{I} \otimes {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-1/2} \mathbf{K} {\mathbf{D}}_{v}^{-1/2} \mathbf{H}\right) \mathbf{W} {\mathbf{D}}_{e}^{-2}\notag\\ &+{\mathbf{D}}_{v}^{-3/2} \mathbf{H} \mathbf{W} {\mathbf{D}}_{e}^{-1} {\mathbf{H}}^{\top} {\mathbf{D}}_{v}^{-1/2} \mathbf{K} \mathbf{J} \mathbf{W} -2 {\mathbf{D}}_{v}^{-1/2} {\mathbf{K D} _ { \mathbf{v }} }^{-1/2} \mathbf{H} \mathbf{W D}_{e}^{-1}, \end{aligned} $$
(6.35)

where J = 11 ⊤. The detailed derivation process can be found in [6]. The step size of learning H is set as α. Since H is required to be in the range of [0, 1], the projection P on the feasible set is conducted after each update. Therefore, H is updated by

$$\displaystyle \begin{aligned} {\mathbf{H}}_{k+1}=\boldsymbol{P}\left[{\mathbf{H}}_{k}-\alpha \nabla \mathbb{Q}\left({\mathbf{H}}_{k}\right)\right], \end{aligned} $$
(6.36)

where

$$\displaystyle \begin{aligned} \boldsymbol{P}\left[h_{i j}\right]=\begin{cases} h_{i j} & \text{ if } 0 \leq h_{i j} \leq 1 \\ 0 & \text{ if } h_{i j}<0 \\ 1 & \text{ if } h_{i j}>1 \end{cases} . \end{aligned} $$
(6.37)

In this way, we can alternately optimize F and H until the objective function converges.
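A simplified sketch of one update is given below (ours; it treats the degree matrices as constants within the step and therefore keeps only the last term of the full gradient in Eq. (6.35), which also accounts for the dependence of D v and D e on H; see [6] for the exact gradient):

```python
import numpy as np

def project_unit_interval(H):
    """Eq. (6.37): element-wise projection onto [0, 1]."""
    return np.clip(H, 0.0, 1.0)

def dynamic_hypergraph_step(H, F, X, w, mu=1.0, alpha=0.1):
    """One projected-gradient update of the incidence matrix H (Eq. 6.36),
    with Dv and De held fixed within the step (a simplification)."""
    K = F @ F.T + mu * (X @ X.T)            # K = F F^T + mu X X^T
    dv = H @ w
    de = H.sum(axis=0)
    Dv_isqrt = np.diag(1.0 / np.sqrt(dv))
    S = Dv_isqrt @ K @ Dv_isqrt
    grad = -2.0 * S @ H @ np.diag(w / de)   # -2 Dv^{-1/2} K Dv^{-1/2} H W De^{-1}
    return project_unit_interval(H - alpha * grad)
```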

The dynamic hypergraph structure learning method can consistently outperform traditional hypergraph learning. This is due to the fact that the dynamic hypergraph structure can fit the data better and formulate the high-order correlation more effectively. Furthermore, both the feature and the label information are used for the hypergraph structure optimization, so the learned hypergraph structure is smooth in both the feature space and the label space. In other words, vertices with the same labels have stronger high-order connections, which benefits the downstream task. We also note that the above dynamic hypergraph structure optimization method has relatively high computational complexity, as it optimizes the whole incidence matrix H.

6.4 Incremental Learning on Growing Data

Most of the existing methods consider static structures with fixed sets of vertices and edges, while the data are generally dynamic in real-world applications. Under such circumstances, vertices and connections can be added or removed, and the vertex attributes and connection weights change during the dynamic procedure. Generally, there are two typical ways of dynamic structure learning, i.e., using recurrent architectures [9, 10] and capturing temporal patterns [11, 12]. However, efficient learning on temporally growing structures, where the vertex and edge sets expand over time, has not been explored yet. Taking a citation network as an example, new publications and citation links are continuously added to the network.

The incremental subgraph is the subgraph consisting of the newly appeared vertices and the related new edges in the given growing graph at each time step. The edges connecting vertices from the same incremental subgraph are denoted as intra-edges, while the edges connecting vertices from different incremental subgraphs are denoted as inter-edges. The incremental learning method aims to update the model based on the incremental subgraphs at each time step and to perform consistently on the entire graph. The challenge of the incremental graph learning method is how to design an efficient strategy to update the model with incremental data while maintaining the performance on the whole dataset.

The main differences between incremental graph learning and existing incremental learning methods are as follows:

  • Incremental learning on growing graphs should store the observed vertices, which may be connected with the newly coming vertices, while existing incremental learning methods often drop old samples in some scenarios.

  • Considering the effect of the inter-edges on training, it is also essential to use previous data when updating models with newly appeared data.

There are two straightforward solutions for incremental graph learning. First, static graph learning methods can be directly applied to the whole graph at each time step, which suffers from a high computation cost. Second, the model can learn only from the incremental subgraph, which biases it toward the newly coming subgraphs and loses the information of the inter-edges.

In this section, we introduce incremental graph learning (IGL) on growing data. During training, a graph \(\mathbb {G}^L_t\) with smaller vertex and edge sets is generated from the growing graph \(\{\mathbb {G}_t\}\) for updating the current model, which can be implemented by existing GNN methods for the specified graph learning task and can perform on the entire observed graph at any time. Vertices and edges within restricted numbers are selected from the old graph and combined with the new subgraph into \(\mathbb {G}^L_t\). Therefore, \(\mathbb {G}^L_t\) is unbiased with respect to the entire graph, and enough inter-edges are preserved. The overview of IGL is shown in Fig. 6.5.

Fig. 6.5 The framework of the incremental graph learning method

To address the issues of subgraph bias and missing inter-edges, the following conditions should be considered when generating the learning graph.

Unbiased Estimation of Neighboring Aggregation

To alleviate the bias of subgraph, the aggregation results of vertices in \(\mathbb {G}^L_t\) should be unbiased estimations of them in the entire graph, i.e., \(\forall \mathbf {v} \in \mathbb {V}_t\),

$$\displaystyle \begin{aligned} \mathbb{E}\left[agg\left(\mathbf{v}, \mathbb{N}_{t}(\mathbf{v}) \cap \mathbb{V}^L_{t}\right)\right]=agg\left(\mathbf{v}, \mathbb{N}_{t}(\mathbf{v})\right), \end{aligned} $$
(6.38)

where \(agg(\mathbf {v}, \mathbb {N})\) is the aggregator function of the GNN that aggregates vertex embeddings from \(\mathbb {N}\) to v, and \(\mathbb {N}_t(\mathbf {v}) = \{\mathbf {u} \in \mathbb {V}_t ~|~ (\mathbf {u}, \mathbf {v}) \in \mathbb {E}_t\}\) is the neighborhood set of vertex v. Thus, \(\mathbb {N}_{t}(\mathbf {v}) \cap \mathbb {V}^L_{t}\) represents the sampled neighboring vertices in \(\mathbb {G}^L_t\).

Preservation of Inter-edges

Since missing inter-edges may seriously affect training, we aim to preserve more edges of \(\mathbb {E}^{inter}_t\) in \(\varDelta \mathbb {E}^L_t\), which can be formulated as

$$\displaystyle \begin{aligned} \begin{gathered} {} \max_{\varDelta \mathbb{E}^L_t} |\varDelta \mathbb{E}^L_t \cap \mathbb{E}^{inter}_t|,\\ s.t.~~|\varDelta \mathbb{E}^L_t| \le E_{max}. \end{gathered} \end{aligned} $$
(6.39)

The edge preservation can be posed as the deterministic optimization problem in Eq. (6.39) or as a sampling problem that gives priority to vertices with higher degrees, so that \(P(\mathbf {u} \in \mathbb {V}^L_t) \propto |\{(\mathbf {u}, \mathbf {v}) \in \mathbb {E}^{inter}_t\;|\;\mathbf {v} \in \mathbb {V}^{new}_t\}|\).

IGL is based on the unbiased and edge-preserved conditions. In the presentation of the method, we follow the memory constraint V max and set \(E_{max} = (|\mathbb {V}^{new}_t| + V_{max})^2 - |\mathbb {E}^{intra}_t|\) by default. The generated edges can be uniformly sampled if a smaller E max is required. The sample-based strategy selects a subgraph from the previous graph for learning, while the cluster-based strategy constructs a cluster graph that balances the unbiased and edge-preserved conditions. The two strategies are illustrated in Fig. 6.6.

Fig. 6.6 An illustration of the sample-based and cluster-based strategies

(1) Sample-Based Strategy

The strategy of sampling a representative subgraph from the previous data based on the required conditions is studied first. We assume that a subset \(\varDelta \mathbb {V}^L_t\) of size V max is sampled from \(\mathbb {V}_{t-1}\), and all the related edges are preserved, i.e.,

$$\displaystyle \begin{aligned} &\varDelta \mathbb{V}^L_t = Sample~(\mathbb{G}_{t-1}, V_{max}),\\ &\varDelta \mathbb{E}^L_t = \{(\mathbf{u}, \mathbf{v}) \in \mathbb{E}^{inter}_t ~|~ \mathbf{u} \in \varDelta \mathbb{V}^L_t, \mathbf{v} \in \mathbb{V}^{new}_t\}, \end{aligned} $$
(6.40)

where Sample() denotes the sampling function. Considering the required conditions, we explore the following pragmatic methods for appropriate sampling:

  • Random selection serves the unbiased condition by uniformly selecting V max vertices from \(\mathbb {V}_{t-1}\). However, it cannot preserve enough edges for efficient training, especially in sparse graphs.

  • Random jump is a traversal-based sampling method, which we adapt in the following steps. Starting from any vertex in \(\mathbb {V}^{new}_t\), we either randomly walk to a neighboring vertex in \(\mathbb {V}_{t-1}\) with probability p and select it, or randomly jump to a vertex in \(\mathbb {V}^{new}_t\) with probability (1 − p). This process is repeated until the sampled set is full. It has been proved that the probability of sampling a vertex tends to be proportional to its degree, which serves the edge-preserved condition.

  • Degree-based selection serves the edge-preserved condition by sampling vertices with priority given to those connected with more inter-edges. Let \(D_t(\mathbf {u}) = |\{(\mathbf {u}, \mathbf {v}) \in \mathbb {E}_t\}|\) be the degree of u, and define \(D_t^{new} (\mathbf {u}) = \frac {D_t(\mathbf {u}) - D_{t-1}(\mathbf {u})}{D_t(\mathbf {u})}, \forall \mathbf {u} \in \mathbb {V}_{t-1}\) as the new degree of vertices, measuring their closeness to the new subgraph through inter-edges. We then select the top-V max vertices in \(\mathbb {V}_{t-1}\) by their new degrees.

The above methods each take only part of the required conditions into consideration. It can be proved that, except in the ideal case where all the vertices in \(\mathbb {V}_{t-1}\) connect with the same number of vertices in \(\mathbb {V}^{new}_t\), the sampling in Eq. (6.40) satisfies the two required conditions only when all the vertices have been sampled, i.e., joint training. A sketch of the degree-based selection is given below.
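The sketch assumes the degrees of the old vertices in \(\mathbb {G}_t\) and \(\mathbb {G}_{t-1}\) are given as aligned arrays (ours, not from the original work):

```python
import numpy as np

def degree_based_selection(deg_t, deg_prev, V_max):
    """Rank the old vertices by their 'new degree' D_t^{new}(u), i.e., the
    fraction of their edges gained at step t, and keep the top V_max."""
    deg_t = np.asarray(deg_t, dtype=float)
    new_degree = (deg_t - np.asarray(deg_prev)) / np.maximum(deg_t, 1.0)
    return np.argsort(-new_degree)[:V_max]  # indices of the selected vertices
```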

(2) Cluster-Based Strategy

The sample-based strategy selects a subgraph from the previous graph for learning. However, in such a process, \(\mathbb {G}_{t-1}\) is not completely covered, and some important vertices might be dropped; the selected subgraph then cannot fully communicate with the new subgraph. We therefore relax the assumption that \(\mathbb {G}^L_t\) must be a subgraph of \(\mathbb {G}_t\) and construct a cluster graph instead. Technically, we first arrange the vertices in \(\mathbb {V}_{t-1}\) into K cluster sets \(\{\mathbb {C}^{t-1}_i\}_{i=1}^K\) with centers \(\{{\mathbf {c}}^{t-1}_{i}\}_{i=1}^K\) given by the average values of the clusters, and we set the number of clusters K = V max. The cluster graph is then defined as

$$\displaystyle \begin{aligned} \varDelta \mathbb{V}^L_t &= \{{\mathbf{c}}^{t-1}_{1}, ..., {\mathbf{c}}^{t-1}_{K}\},\\ \varDelta \mathbb{E}^L_t &= \{({\mathbf{c}}^{t-1}_i, \mathbf{v})~|~\mathbf{v} \in \mathbb{V}_{t}^{new}, \exists~\mathbf{u} \in \mathbb{C}_i^{t-1}, (\mathbf{u}, \mathbf{v}) \in \mathbb{E}^{inter}_t\} ~\cup \\ &\quad \{({\mathbf{c}}^{t-1}_i, {\mathbf{c}}^{t-1}_j)~|~\exists~{\mathbf{u}}_1 \in \mathbb{C}_i^{t-1}, {\mathbf{u}}_2 \in \mathbb{C}_j^{t-1}, ({\mathbf{u}}_1, {\mathbf{u}}_2) \in \mathbb{E}_{t-1}\}, \end{aligned} $$
(6.41)

which means that the cluster centers are added as new cluster vertices, and the edges connecting to any vertex in \(\mathbb {V}_{t-1}\) are directly transferred to the corresponding cluster vertex. It is noted that the two edge sets in Eq. (6.41) correspond to \(\mathbb {E}^{inter}_t\) and \(\mathbb {E}_{t-1}\), respectively.

Due to the continued growth of the graph, direct clustering on the entire graph is time-consuming. For an approximate but efficient clustering with balanced sizes, we first cluster the new vertices \(\mathbb {V}^{new}_t\) into cluster sets \(\{\varDelta \mathbb {C}^t_i\}_{i=1}^K\) with centers \(\{\hat {\mathbf {c}}_i^t\}_{i=1}^K\). The bipartite matching algorithm is applied to optimize a bijective matching function \(m(\cdot ) : \{1,...,K\} \rightarrow \{1,...,K\}\) for the objective \(\min _{m(\cdot )} \varSigma _{k=1}^{K}\|{\mathbf {c}}_k^{t-1} - \hat {\mathbf {c}}_{m(k)}^{t}\|{ }^2_2\), which assigns new clusters to the closest old clusters. Then, we merge the clusters as \(\mathbb {C}^t_k = \mathbb {C}^{t-1}_k \cup \varDelta \mathbb {C}^t_{m(k)}\) and update the values of the centers \({\mathbf {c}}_{k}^t\).
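The matching-and-merging step can be sketched as follows (ours; the Hungarian algorithm from scipy plays the role of the bipartite matching, and the centers of matched clusters are merged by a size-weighted average):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_and_merge_centers(C_old, C_new, sizes_old, sizes_new):
    """Match new cluster centers to old ones (minimum total squared distance)
    and merge each matched pair; merging the member sets is analogous.

    C_old, C_new : (K, d) arrays of cluster centers.
    sizes_old, sizes_new : (K,) arrays of cluster sizes.
    """
    cost = cdist(C_old, C_new, metric='sqeuclidean')  # ||c_k^{t-1} - c_j^t||^2
    rows, cols = linear_sum_assignment(cost)          # bijective matching m(.)
    total = sizes_old[rows] + sizes_new[cols]
    merged = (C_old[rows] * sizes_old[rows, None]
              + C_new[cols] * sizes_new[cols, None]) / total[:, None]
    return merged, cols                               # updated centers and m(k)
```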

In a word, incremental graph learning (IGL) is a general framework for efficient learning on growing graphs in an incremental manner, which has the following advantages. First, IGL is well suited to real-world applications, since dynamic graphs commonly appear in practice. Second, the sample-based and cluster-based strategies significantly improve the efficiency as a large-scale graph grows. However, only the addition of vertices and edges is considered, while deletion is ignored, which limits the applicability of the method. More general dynamic patterns are worth studying in future work.

6.5 Summary

In this chapter, we introduce hypergraph structure evolution methods, i.e., hyperedge weight optimization, vertex weight optimization, sub-hypergraph weight optimization, dynamic hypergraph learning, and techniques for incremental learning on growing graphs. Hyperedge weight optimization adjusts the weight of each hyperedge according to its contribution, while vertex weight optimization considers the different importance of vertices on the hypergraph. The sub-hypergraph weight optimization method further combines multiple hypergraphs for multi-modal data with learned weights. Dynamic hypergraph learning optimizes the hypergraph structure by modifying inappropriate connections, which can partially solve the missing and incorrect connection issue. Finally, we introduce the incremental learning method on growing graphs, which can update the data structure under the incremental scenario.

It is noted that the optimization of the hypergraph, either of its components or of its structure, brings extra computational cost and can lead to high computational complexity in practice. How to effectively and efficiently adjust the hypergraph structure is still a challenging problem, which requires further investigation in the future.