11.1 Introduction

Hypergraphs have demonstrated excellent performance in modeling high-order relationships among data and have been applied in several fields. In computer vision, this property is also promising for a wide range of tasks, and much research has focused on how to use hypergraph modeling to solve visual problems. On one hand, hypergraphs can model high-order relationships among images within one class or across classes, after which hypergraph-based label propagation can be conducted, which is useful for visual classification and retrieval. On the other hand, relations can be modeled among the elements within a visual object to exploit its structural information.

In this chapter, we discuss three typical applications of hypergraph computation in computer vision, i.e., visual classification [1,2,3,4,5,6], 3D object retrieval [2, 7,8,9,10,11,12], and tag-based social image retrieval [13,14,15,16,17]. In these applications, the vertices represent visual objects, and a hypergraph is constructed to formulate the high-order correlations among all the samples under some metric. In this hypergraph, some vertices are labeled, and the predictions for the remaining vertices can be obtained by a label propagation procedure; visual classification and retrieval problems can be solved in this way. The elements within one sample, such as the pixels in an image, can also be used to construct hypergraphs; the properties of each element can then be learned by hypergraph computation, during which semantic information is also learnt. Part of the work introduced in this chapter has been published in [1, 2, 13].

11.2 Visual Classification

Visual classification is the most widely studied application area of hypergraphs in computer vision. Visual data have a strong clustering characteristic, i.e., visual objects under one label show a clustered distribution in the feature space. This property is fully consistent with the hypothesis of hypergraph-based semi-supervised learning, and therefore hypergraph-based semi-supervised learning is theoretically well suited for image classification; a large number of studies have demonstrated its good performance [1, 2]. While there are many applications of hypergraph computation for image classification, they almost all follow the same process. It starts with hypergraph modeling of the visual data: after features are extracted by some feature extractor, the hypergraph is built from the nearest neighbor relationships of the visual features in Euclidean space, and label propagation on the hypergraph is then adopted to achieve classification. We use the example of multi-view 3D object classification to introduce the process in detail.

First, view-based 3D object classification needs to be introduced. Each 3D object can be represented by a set of views. Compared with the model-based representation, the multi-view representation is more flexible, has less computational overhead, and still offers good representation capability. Classification of 3D objects is illustrated in Fig. 11.1. After obtaining the multi-view 3D object data, the first step is to extract features. There are many feature extraction methods for multi-view 3D objects, such as MVCNN [18], Zernike moments, etc. After obtaining the features of each group of views and of each image in them, hyperedges can be constructed by k-NN with the Euclidean distance as the metric, as sketched below. In fact, if several different features are used, multiple hypergraphs can be constructed, i.e., each hypergraph is constructed based on one feature. If m features are used, m hypergraphs can be generated, denoted by \(\mathbb {G}_1=(\mathbb {V}_1, \mathbb {E}_1, {\mathbf {W}}_1), \mathbb {G}_2=(\mathbb {V}_2, \mathbb {E}_2, {\mathbf {W}}_2), \dots , \mathbb {G}_m=(\mathbb {V}_m, \mathbb {E}_m, {\mathbf {W}}_m)\). After obtaining the multiple hypergraphs, a weight \(\omega_i\), \(i = 1, \dots, m\), is assigned to each hypergraph \(\mathbb {G}_i\), which constitutes a weight vector \(\omega\). Up to this point, we obtain m hypergraphs with weights from the multi-view 3D dataset.

Fig. 11.1 An illustration of the view-based 3D object classification framework. This figure is from [1]
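To make this step concrete, here is a minimal numpy sketch (ours, not the authors' released code) of k-NN hyperedge construction: each vertex spawns one hyperedge containing itself and its k nearest neighbors in Euclidean feature space. The feature variable names in the trailing comment are illustrative assumptions.

```python
import numpy as np

def knn_incidence(X, k=5):
    """X: (n, d) feature matrix; returns the (n, n) incidence matrix H,
    where H[v, e] = 1 if vertex v belongs to hyperedge e."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    H = np.zeros((n, n))
    for e in range(n):
        # Hyperedge e gathers its centroid vertex and the k nearest neighbors.
        nbrs = np.argsort(d2[e])[: k + 1]  # includes e itself (distance 0)
        H[nbrs, e] = 1.0
    return H

# One incidence matrix per feature type, e.g. MVCNN and Zernike features
# (hypothetical variable names):
# H_list = [knn_incidence(X_mvcnn), knn_incidence(X_zernike)]
```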

Transductive Hypergraph Computation

After obtaining the multiple hypergraphs, we can infer the label of each vertex by the standard formulation of hypergraph-based semi-supervised learning. The pipeline is shown in Fig. 11.2a. Note that since we are using multi-modal data, different modalities may contribute differently to the classification, so we also have to take the modality weights into account when computing the classification results and update these weights during the learning process. The method of weight updating is described in the next section; the focus here is to establish the idea of hypergraph processing of multi-modal features, sketched below.

Fig. 11.2 The general frameworks of transductive and inductive multi-hypergraph computation algorithms. (a) tMHL: transductive multi-hypergraph computation. (b) iMHL: inductive multi-hypergraph computation. This figure is from [1]
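As a concrete illustration, the following sketch implements the standard transductive solution \(\mathbf F = (\mathbf I - \xi \varTheta)^{-1}\mathbf Y\) on a weighted combination of the per-modality structure matrices. Treating the weight vector omega as fixed is an assumption here, since its update is deferred to the next section.

```python
import numpy as np

def theta_matrix(H, w=None):
    """Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} for one hypergraph."""
    n_v, n_e = H.shape
    w = np.ones(n_e) if w is None else w
    dv = np.maximum(H @ w, 1e-12)           # vertex degrees
    de = np.maximum(H.sum(axis=0), 1e-12)   # hyperedge degrees
    inv_sqrt_dv = 1.0 / np.sqrt(dv)
    return (inv_sqrt_dv[:, None] * H) @ np.diag(w / de) @ (H.T * inv_sqrt_dv[None, :])

def transductive_mhl(H_list, omega, Y, lam=1.0):
    """H_list: one incidence matrix per modality; omega: modality weights;
    Y: (n, c) one-hot labels with zero rows for unlabeled vertices."""
    n = Y.shape[0]
    theta = sum(o * theta_matrix(H) for o, H in zip(omega, H_list))
    xi = 1.0 / (1.0 + lam)
    F = np.linalg.solve(np.eye(n) - xi * theta, Y)  # propagated scores
    return F.argmax(axis=1)  # predicted class of every vertex
```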

Inductive Hypergraph Computation

In real-world visual classification tasks, transductive hypergraph computation can only be updated globally, i.e., the whole model must be re-run when new data arrive, and its high time complexity can hardly meet the efficiency requirements of visual classification. To solve this problem, inductive hypergraph computation is introduced, which learns both a projection from data to labels and the weight vector of the multiple hypergraphs. It can also achieve real-time inference for newly added data, as shown in Fig. 11.2b. It is described in the following.

In inductive hypergraph computation, a projection matrix M is learned, and the prediction for the unlabeled data is computed by M.

The objective function for learning M is illustrated as

$$\displaystyle \begin{aligned} \arg \min \limits _{{{\mathbf{M}}}}~\left \{{ {\varOmega \left ({{{{\mathbf{M}}}} }\right) + \lambda {\mathbb R_{emp}}\left ({{{{\mathbf{M}}}} }\right) + \mu \varPhi \left ({{{{\mathbf{M}}}}}\right)} }\right \}\!.\end{aligned} $$
(11.1)

Under the assumption that vertices connected by one or more hyperedges are more likely to share the same label, the hypergraph Laplacian regularizer Ω(M) is defined as follows:

$$\displaystyle \begin{aligned} \varOmega \left ({{\mathbf{M}} }\right)=&\frac {1}{2}\sum \limits _{k = 1}^{c} \sum \limits _{e \in {\mathbb E}} \sum \limits _{u,v \in {\mathbb V}} \frac {{ {\mathbf{W}}\left ({e }\right){\mathbf{H}}\left ({{u,e} }\right){\mathbf{H}}\left ({{v,e} }\right)}}{\delta (e)}\vartheta \\=&\mathrm {tr}\left ({{ {\mathbf{M}}^{\top} {\mathbf{X}}\varDelta {\mathbf{X}}^{\top} {\mathbf{M}}}}\right),\end{aligned} $$
(11.2)

where \(\vartheta =\left (\frac {({\mathbf {X}}^\top \mathbf {M})(u,k)}{\sqrt {d(u)}}-\frac {({\mathbf {X}}^\top \mathbf {M})(v,k)}{\sqrt {d(v)}}\right )^2\). It can be noted that Ω(M) is in quadratic form of M. The empirical loss term \(\mathbb {R}_{emp}(\mathbf {M})\) is defined as

$$\displaystyle \begin{aligned} {\mathbb R_{emp}}\left ({{\mathbf{M}} }\right) = || {\mathbf{X}}^{\top} {\mathbf{M}}- {\mathbf{Y}}||{}^{2}.\end{aligned} $$
(11.3)

Φ(M) is an \(\ell_{2,1}\) norm regularizer, used to avoid overfitting of M. Meanwhile, it encourages row sparsity in M, so that only the informative rows are kept. It is defined as

$$\displaystyle \begin{aligned} \varPhi ({\mathbf{M}}) = || {\mathbf{M}}||{}_{2,1}.\end{aligned} $$
(11.4)

The objective function of the inductive hypergraph computation task can be written as

$$\displaystyle \begin{aligned} \arg \min \limits _{\mathbf{M}}~\left \{{ {\mathrm {tr}\left ({{ {\mathbf{M}}^{\top} {\mathbf{X}}\varDelta {\mathbf{X}}^{\top} {\mathbf{M}}}}\right) + \lambda || {\mathbf{X}}^{\top} {\mathbf{M}}- {\mathbf{Y}}||{}^{2} + \mu || {\mathbf{M}}||{}_{2,1} } }\right \}. \end{aligned} $$
(11.5)

Note that the regularizer Φ(M) is convex and non-smooth. Therefore, the objective function can be relaxed to the following:

$$\displaystyle \begin{aligned} \arg \min \limits _{{ {\mathbf{M}}, {\mathbf{U}}}}\left \{{ { \mathrm {tr}\left ({{ {\mathbf{M}}^{\top} {\mathbf{X}}\varDelta {\mathbf{X}}^{\top} {\mathbf{M}}}}\right) + \lambda || {\mathbf{X}}^{\top} {\mathbf{M}}- {\mathbf{Y}}||{}^{2} + \mu \mathrm {tr}\left ({{ {\mathbf{M}}^{\top} {\mathbf{U}} {\mathbf{M}} }}\right) } }\right \},\end{aligned} $$
(11.6)

where U is a diagonal matrix, and its elements are defined as

$$\displaystyle \begin{aligned}{\mathbf{U}}_{i,i} = \frac {1}{2|| {\mathbf{M}}\left ({i,:}\right)||{}_{2}},\quad i = 1, {\dots },d. \end{aligned} $$
(11.7)

To solve this optimization problem, U is first set to the identity matrix, and the iteratively reweighted least squares method is adopted. More specifically, each variable is updated alternately with the other fixed until convergence is achieved. First, U is fixed, and we take the derivative of the objective with respect to M. The closed-form solution is

$$\displaystyle \begin{aligned} {\mathbf{M}}= \lambda \left ({{ {\mathbf{X}}\varDelta {\mathbf{X}}^{\top} + \lambda {\mathbf{X}} {\mathbf{X}}^{\top} + \mu {\mathbf{U}} } }\right)^{-1} {\mathbf{X}} {\mathbf{Y}}. \end{aligned} $$
(11.8)

Then M is fixed, while U is updated by Eq. (11.7). The procedure is repeated until both U and M converge.

Given a testing sample \(x^{t}\), the prediction for \(x^{t}\) can be obtained by

$$\displaystyle \begin{aligned} C(x^{t}) = \arg \max \limits _{k}~\left({{x^{t}}^{\top} {\mathbf{M}}}\right)_{k}.\end{aligned} $$
(11.9)
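The following numpy sketch (a minimal re-implementation under the formulation above, not the reference code) alternates the closed-form update of Eq. (11.8) with the reweighting of Eq. (11.7), and ends with the prediction rule of Eq. (11.9):

```python
import numpy as np

def inductive_mhl(X, Y, Delta, lam=1.0, mu=0.1, n_iter=50, eps=1e-8):
    """X: (d, n) features, Y: (n, c) labels, Delta: (n, n) hypergraph Laplacian."""
    d = X.shape[0]
    U = np.eye(d)  # U is initialized as the identity matrix
    for _ in range(n_iter):
        # Eq. (11.8): closed-form solution for M with U fixed.
        A = X @ Delta @ X.T + lam * (X @ X.T) + mu * U
        M = lam * np.linalg.solve(A, X @ Y)
        # Eq. (11.7): refresh the diagonal of U with M fixed.
        row_norms = np.maximum(np.linalg.norm(M, axis=1), eps)
        U = np.diag(1.0 / (2.0 * row_norms))
    return M

# Eq. (11.9): a new sample x_t (a length-d vector) is assigned the class
# with the largest projected score:
# pred = (x_t @ M).argmax()
```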

Hypergraph computation can achieve good results in visual classification problems; in particular, inductive hypergraph computation enables real-time online classification while maintaining good classification performance.

11.3 3D Object Retrieval

3D object retrieval aims to find, given a 3D query, similar 3D objects in a database. Usually, each 3D object can be described by several different types of data, such as multiple views, point clouds, meshes, or voxels. The main task of 3D object retrieval is to define an appropriate measure of the similarity between each pair of 3D objects; how to define such measures is therefore the key to 3D object retrieval. Traditional methods mainly focus on either representation learning for each type of data or a distance metric for specific features. It is noted that the correlations among 3D objects are very complex, where both pairwise and beyond-pairwise correlations exist. To achieve better 3D object retrieval performance, it is important to take such high-order correlations among 3D objects into consideration. In this retrieval task, each vertex denotes a 3D object in the database, and thus the number of vertices equals the number of objects in the database.

Hypergraphs can be used for such correlation modeling in 3D object retrieval. We introduce the hypergraph computation method [2] for 3D object retrieval here; the framework is shown in Fig. 11.3. First, a group of hypergraphs is generated, and a learning process is then conducted for similarity measurement.

Fig. 11.3 An illustration of the hypergraph computation method for 3D object retrieval using multiple views. This figure is from [2]

We take the multi-view representation as an example. All views of the 3D objects are first grouped into clusters. The objects that have views in one cluster are then connected by a hyperedge (note that a hyperedge can connect multiple vertices in a hypergraph). As a result, a hypergraph can be generated, in which vertices represent the objects in the database. A hyperedge's weight is determined by the visual similarities between pairs of views in the corresponding cluster. Multiple hypergraphs can be generated by varying the number of clusters; these hypergraphs encode the relationships between objects at various granularities. When two 3D objects are connected by more and stronger hyperedges, they have higher similarity. This information can then be used for 3D object retrieval.

To generate a 3D object hypergraph, each object is treated as a vertex in the hypergraph \(\mathbb {G} = (\mathbb {V}, \mathbb {E}, \mathbf {W})\). The generated hypergraph has n vertices if there are n objects in the database. Each view of these 3D objects can be represented by pre-defined features, which can differ across tasks. Given these features, the K-means clustering method can be used to group the views into clusters, and each cluster yields a hyperedge connecting the objects whose views fall in it. Two diagonal matrices \({\mathbf{D}}_v\) and \({\mathbf{D}}_e\) represent the vertex and hyperedge degrees, respectively, and an incidence matrix H is generated. The weight of a hyperedge e can be measured by

$$\displaystyle \begin{aligned} w(e) = \sum_{x_a, x_b \in e} \text{exp} \left( -\frac{d(x_a, x_b)^2}{\sigma^2} \right), \end{aligned} $$
(11.10)

where \(d(x_a, x_b)\) is the distance between \(x_a\) and \(x_b\), two views in the same view cluster, and can be calculated using the Euclidean distance. The parameter σ is empirically set to the median distance over all pairs of these views. The hypergraph generation procedure is shown in Fig. 11.4 and sketched in code below.

Fig. 11.4 An illustration of hypergraph construction for the 3D object hypergraph. (a) Views of different visual objects. (b) Hyperedge construction by view clusters. This figure is from [2]
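A hedged sketch of this construction follows: views are clustered, each cluster becomes a hyperedge over the objects owning its views, and Eq. (11.10) assigns the weight, summed here over unordered view pairs. The tiny k-means routine is purely illustrative; any clustering implementation would do.

```python
import numpy as np

def kmeans_labels(V, k, n_iter=100, seed=0):
    """A bare-bones K-means, returning the cluster label of each view."""
    rng = np.random.default_rng(seed)
    C = V[rng.choice(len(V), k, replace=False)]
    for _ in range(n_iter):
        lbl = ((V[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        C = np.array([V[lbl == j].mean(0) if (lbl == j).any() else C[j]
                      for j in range(k)])
    return lbl

def view_cluster_hypergraph(V, owner, n_obj, k):
    """V: (n_views, d) view features; owner: (n_views,) int array,
    owner[i] is the index of the object that view i belongs to."""
    owner = np.asarray(owner)
    lbl = kmeans_labels(V, k)
    d2 = ((V[:, None] - V[None]) ** 2).sum(-1)
    # sigma: median distance over all view pairs, as stated above.
    sigma2 = np.median(np.sqrt(d2[np.triu_indices(len(V), 1)])) ** 2
    H, w = np.zeros((n_obj, k)), np.zeros(k)
    for e in range(k):
        idx = np.where(lbl == e)[0]
        H[np.unique(owner[idx]), e] = 1.0  # objects with a view in cluster e
        # Eq. (11.10): sum of Gaussian affinities over view pairs in cluster e.
        w[e] = sum(np.exp(-d2[a, b] / sigma2)
                   for i, a in enumerate(idx) for b in idx[i + 1:])
    return H, w
```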

Let \(\mathbb {G}_1 = (\mathbb {V}_1, \mathbb {E}_1, \mathbf W_1)\), \(\mathbb {G}_2 = (\mathbb {V}_2, \mathbb {E}_2, \mathbf W_2)\), ⋯, \(\mathbb {G}_{n_g} = (\mathbb {V}_{n_g} , \mathbb {E}_{n_g} , \mathbf W_{n_g} )\) denote the \(n_g\) hypergraphs, and let \(\{{\mathbf {D}}_{v_1}, {\mathbf {D}}_{v_2},\ldots , {\mathbf {D}}_{v_{n_g}} \}\), \(\{{\mathbf {D}}_{e_1}, {\mathbf {D}}_{e_2},\ldots , {\mathbf {D}}_{e_{n_g}} \}\), and \(\{{\mathbf {H}}_1, {\mathbf {H}}_2,\ldots , {\mathbf {H}}_{n_g} \}\) be the corresponding vertex degree matrices, hyperedge degree matrices, and incidence matrices, respectively. The retrieval results are based on the fusion of these hypergraphs. The weight of the i-th hypergraph is denoted by \(\alpha_i\), where \(\sum ^{n_g}_{i=1} \alpha _i = 1\) and \(\alpha _i \geq 0\).

It is possible to consider retrieval as a one-class classification problem [19]. As a result, we formulate the transductive inference as a regularization problem, \(\arg \min _{\mathbf f} \left \{ \varOmega (\mathbf f) + \lambda \mathbb {R}_{emp}(\mathbf f) \right \}\), where the regularizer term Ω(f) is defined by

$$\displaystyle \begin{aligned} \frac{1}{2} \sum_{i=1}^{n_g} \alpha_i \sum_{e \in \mathbb{E}_i} \sum_{u, v \in \mathbb{V}_i} \frac{w_i(e) {\mathbf{H}}_i(u, e) {\mathbf{H}}_i(v, e)}{\delta_i(e)} \times \left( \frac{\mathbf f(u)}{\sqrt{d_i(u)}} - \frac{\mathbf f(v)}{\sqrt{d_i(v)}} \right)^2 , \end{aligned} $$
(11.11)

where the vector f represents the relevance scores to be learned, and \(\delta_i(e)\) and \(d_i(u)\) denote the hyperedge and vertex degrees in the i-th hypergraph.

In this way, the similarity between each object and the query can be calculated based on the learned relevance scores, as sketched below. It is noted that the features used in this method can be selected according to the task itself, and multiple types of representations can also be used. Given multiple features for the same data, or different features for multi-modal data, we can generate the hypergraph(s) using the methods introduced in Chap. 4.
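For instance, assuming the fused Laplacian \(\varDelta = \sum_i \alpha_i \varDelta_i\) and the empirical loss \(\|\mathbf f - \mathbf y\|^2\), the regularization problem has the closed-form solution \(\mathbf f = (\mathbf I + \varDelta/\lambda)^{-1}\mathbf y\), which the following sketch uses to rank the database against a query object:

```python
import numpy as np

def retrieve(Delta_list, alpha, query_idx, lam=1.0):
    """Delta_list: per-hypergraph Laplacians; alpha: fusion weights."""
    n = Delta_list[0].shape[0]
    Delta = sum(a * D for a, D in zip(alpha, Delta_list))  # fused Laplacian
    y = np.zeros(n)
    y[query_idx] = 1.0  # the query object is the only positive sample
    f = np.linalg.solve(np.eye(n) + Delta / lam, y)  # relevance scores
    return np.argsort(-f)  # database objects ranked by relevance
```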

11.4 Tag-Based Social Image Retrieval

User-generated tags are widely associated with social images and describe the content of the images. Benefiting from such rich content, these tags are useful for social image retrieval tasks. Figure 11.5 shows some examples of social images associated with tags.

Fig. 11.5 Some examples of social images with associated tags. This figure is from [13]

The main challenge of applying such tags to social image retrieval is that the tags are noisy, which makes it hard to mine the true relations among tags and images; moreover, using the tags and the images separately leads to sub-optimal image retrieval. In this section, we introduce a visual–textual joint relevance learning approach using hypergraph computation [13]. Figure 11.6 illustrates the visual–textual joint relevance learning method on a hypergraph for tag-based social image retrieval. In this method, features for both the images and the tags are first extracted, and a hypergraph is constructed based on these features. Then, the hypergraph learning method is performed, and the learned semantic similarity can be used for tag-based social image retrieval.

Fig. 11.6 The framework of the visual–textual joint relevance learning method on hypergraph for tag-based social image retrieval. This figure is from [13]

In this example, the bag-of-visual-words feature is selected for image representation. For the i-th image, the visual content is represented by the bag-of-visual-words \(f_i^{bow}\), while the corresponding tags are represented by the bag-of-textual-words \(f_i^{tag}\). Then, the visual-content-based hyperedges and the tag-based hyperedges are constructed, respectively: the visual-content-based hyperedges connect the images that share the same visual word, and the tag-based hyperedges connect the images that share the same tag word. Figure 11.7 provides examples of the hyperedge generation process using textual information and visual information, respectively. The overall hypergraph therefore has \(n_e = n_c + n_t\) hyperedges, where \(n_c\) denotes the number of visual words and \(n_t\) denotes the number of tag words; a construction sketch is given below. After the construction of the hypergraph, the images sharing more visual words or tags are connected by more hyperedges, which can be used for further processing. Figure 11.8 further shows the connections between two social images based on the textual and the visual information, respectively.
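A minimal sketch of this incidence construction, assuming each image is already encoded as a row of visual-word counts and a row of binary tag indicators:

```python
import numpy as np

def joint_incidence(bow, tags):
    """bow: (n, n_c) visual-word counts; tags: (n, n_t) binary tag matrix."""
    H_visual = (bow > 0).astype(float)  # image in hyperedge j iff word j occurs
    H_tag = (tags > 0).astype(float)    # image in hyperedge j iff tag j is attached
    return np.hstack([H_visual, H_tag])  # (n, n_c + n_t) incidence matrix
```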

Fig. 11.7 Two examples of hyperedge generation. (a) shows hyperedges based on the textual information, in which the social images with the same textual words are connected by a hyperedge. (b) shows hyperedges based on the visual information, in which the social images with the same visual words are connected by a hyperedge. This figure is from [13]

Fig. 11.8 An example of connections between two images from textual and visual directions. This figure is from [13]

Denoting by f the relevance score vector, by y the ground truth relevance, and by w the weight vector of the hyperedges, the hypergraph computation can be formulated as

$$\displaystyle \begin{aligned} \begin{gathered} \arg\min_{\mathbf f, \mathbf w} \varPhi(\mathbf f) = \arg\min_{\mathbf f, \mathbf w} \left\{ \mathbf f^\top\varDelta \mathbf f + \lambda ||\mathbf f-\mathbf y||{}^2 + \mu \sum_{i=1}^{n_e} \mathbf w(i)^2 \right\}, \\ s.t.~~\sum_{i=1}^{n_e} \mathbf w(i) = 1, \end{gathered} \end{aligned} $$
(11.12)

where λ and μ are weighting parameters. The first term in Eq. (11.12) is the regularizer on the hypergraph structure, which guarantees smoothness over the hypergraph. The second term is the empirical loss between the relevance score vector and the ground truth. The last term is the squared \(\ell_2\) norm of the hyperedge weights, which is used to learn a better combination of the different hyperedges. This optimization task can be solved by alternating optimization. First, w is fixed, and f is optimized by

$$\displaystyle \begin{aligned} \arg\min_{\mathbf f}\varPhi(\mathbf f) = \arg\min_{\mathbf f} \left\{ \mathbf f^\top \varDelta \mathbf f + \lambda ||\mathbf f-\mathbf y||{}^2 \right\}, \end{aligned} $$
(11.13)

from which we can have

$$\displaystyle \begin{aligned} \mathbf f = \frac{1}{1-\xi} (\mathbf I-\xi\varTheta)^{-1}\mathbf y, \end{aligned} $$
(11.14)

where \(\xi =\frac 1{1+\lambda }\), Θ = I − Δ.

Then, f is fixed, and w is optimized by

$$\displaystyle \begin{aligned} \begin{gathered} \arg\min_{\mathbf{w}} \varPhi(\mathbf f) = \arg\min_{\mathbf w} \left\{ \mathbf f^\top\varDelta \mathbf f + \mu\sum_{i=1}^{n_e}\mathbf w(i)^2 \right\}, \\ s.t.~~\sum_{i=1}^{n_e}\mathbf w(i) = 1, \quad \mu>0. \end{gathered} \end{aligned} $$
(11.15)

The method of Lagrange multipliers can be applied here, yielding

$$\displaystyle \begin{aligned} \mathbf w(i) = \frac 1{n_e} - \frac{ \mathbf f^\top\varGamma \mathbf D_e^{-1}\varGamma^\top \mathbf f }{2n_e\mu} + \frac{ \mathbf f^\top \varGamma_i \mathbf D_e^{-1}(i,i)\varGamma_i^\top \mathbf f }{2\mu}, \end{aligned} $$
(11.16)

where \(\varGamma = \mathbf D_v^{-\frac 12}\mathbf H\) and \(\varGamma_i\) represents the i-th column of Γ.
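The following numpy sketch (ours, under the definitions above) alternates the two closed-form updates, Eq. (11.14) for f and Eq. (11.16) for w; the values of lam and mu are illustrative:

```python
import numpy as np

def joint_relevance(H, y, lam=1.0, mu=0.5, n_iter=20):
    """H: (n, n_e) incidence matrix; y: (n,) ground truth relevance."""
    n, n_e = H.shape
    w = np.ones(n_e) / n_e                 # start from uniform hyperedge weights
    de = np.maximum(H.sum(axis=0), 1e-12)  # hyperedge degrees
    for _ in range(n_iter):
        dv = np.maximum(H @ w, 1e-12)      # vertex degrees under current w
        Gamma = H / np.sqrt(dv)[:, None]   # Gamma = Dv^{-1/2} H
        Theta = Gamma @ np.diag(w / de) @ Gamma.T
        # Eq. (11.14): update f with w fixed, xi = 1 / (1 + lam).
        xi = 1.0 / (1.0 + lam)
        f = np.linalg.solve(np.eye(n) - xi * Theta, y) / (1.0 - xi)
        # Eq. (11.16): s[i] = f^T Gamma_i De^{-1}(i, i) Gamma_i^T f.
        s = (Gamma.T @ f) ** 2 / de
        w = 1.0 / n_e - s.sum() / (2 * n_e * mu) + s / (2 * mu)
    return f, w
```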

The semantic relevance between an image \(x_i\) and the query tag \(t_q\) is estimated by

$$\displaystyle \begin{aligned} s(x_i,t_q)=\frac 1{n_i}\sum_t s_{tag} (t_q,t), \end{aligned} $$
(11.17)

which is the average similarity between \(t_q\) and all \(n_i\) tags associated with \(x_i\); \(s_{tag}\) can be calculated as

$$\displaystyle \begin{aligned} s_{tag}(t_1,t_2) = e^{-FD(t_1,t_2)}, \end{aligned} $$
(11.18)

where FD represents the Flickr distance [20].

Given these similarities between each image and the query tag, the retrieval results can be obtained accordingly, as in the short sketch below. We also note that the features used in this application can be changed according to the requirements of different tasks.
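As a small sketch of Eqs. (11.17) and (11.18), assuming a precomputed Flickr distance is available as a callable fd (a hypothetical lookup, not part of the original method's code):

```python
import math

def semantic_relevance(image_tags, t_q, fd):
    """image_tags: the tags of one image; fd(t1, t2): Flickr distance lookup."""
    # Eq. (11.18) turns each distance into a similarity; Eq. (11.17)
    # averages over the n_i tags attached to the image.
    sims = [math.exp(-fd(t_q, t)) for t in image_tags]
    return sum(sims) / len(sims)
```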

11.5 Summary

In this chapter, we have introduced applications of hypergraph computation in computer vision, including visual classification, 3D object retrieval, and tag-based social image retrieval. For classification and retrieval tasks, hypergraphs can model the high-order relationships among samples in the feature space, and the tasks can then be solved by hypergraph-based label propagation methods. The success of hypergraphs in computer vision is due to the fact that the feature correlations of visual data are complex and hard to capture with pairwise correlation methods. Hypergraph computation can be further applied in other computer vision tasks, such as visual registration, visual segmentation, gaze estimation, etc.