Abstract
In this chapter, the applications of hypergraph computation in computer vision are introduced. Computer vision is one of the areas in which hypergraph computation is most widely used. Hypergraphs can be constructed by modeling the high-order relationships among samples, or among the elements within a sample, and computer vision tasks can then be solved by hypergraph computation procedures. More specifically, three typical applications are presented, i.e., visual classification, 3D object retrieval, and tag-based social image retrieval, in which hypergraphs are used to model high-order relationships among samples and visual problems are solved by hypergraph computation. For example, in social image retrieval, hypergraphs model the high-order relationships among social images based on both visual and textual information.
11.1 Introduction
Hypergraphs have demonstrated excellent performance in modeling high-order relationships in data and have been applied in several fields. In computer vision, this property is also promising for a wide range of tasks, and much research focuses on how to use hypergraph modeling to solve visual problems. On the one hand, hypergraphs can model the high-order relationships of images within one class or across classes, and hypergraph-based label propagation can then be conducted, which is useful for visual classification and retrieval. On the other hand, relations can be modeled among the elements within a visual object to exploit its structural information.
In this chapter, we discuss three typical applications of hypergraph computation in computer vision, i.e., visual classification [1,2,3,4,5,6], 3D object retrieval [2, 7,8,9,10,11,12], and tag-based social image retrieval [13,14,15,16,17]. In these applications, the vertices represent visual objects, and a hypergraph is constructed to formulate the high-order correlations among all the samples under some metric. Some vertices in this hypergraph are labeled, and predictions for the remaining vertices can be obtained by the label propagation procedure; visual classification and retrieval problems can be solved in this way. The elements within one sample, such as the pixels in an image, can also be used to construct hypergraphs, and the properties of each element, including its semantic information, can then be learned by hypergraph computation. Part of the work introduced in this chapter has been published in [1, 2, 13].
11.2 Visual Classification
Visual classification is the area of computer vision in which hypergraphs are most widely used. Visual data have a strong clustering characteristic, i.e., visual objects under one label show a clustered distribution in the feature space. This property is fully consistent with the hypothesis of hypergraph-based semi-supervised learning, which is therefore theoretically well suited for image classification, and a large number of studies have demonstrated its good performance [1, 2]. While there are many applications of hypergraph computation to image classification, almost all follow the same process. It starts with hypergraph modeling of the visual data: after features are extracted by some feature extractor, the hypergraph is modeled on the nearest-neighbor relationships of the visual features in Euclidean space, and label propagation on the hypergraph is then adopted to achieve classification. We use the example of multi-view 3D object classification to introduce the process in detail.
First, view-based 3D object classification needs to be introduced. Each 3D object can be represented by a set of views. Compared with model-based representation, the multi-view representation is more flexible, has less computational overhead, and still offers good representation capability. Classification of 3D objects is illustrated in Fig. 11.1. After obtaining the multi-view 3D object data, the first step is to extract features. There are many feature extraction methods for multi-view 3D objects, such as MVCNN [18], Zernike moments, etc. After obtaining the features of each group of views and of each image in them, hyperedges can be constructed by k-NN with the Euclidean distance as the metric. If several different features are used, multiple hypergraphs can be constructed, i.e., each hypergraph is built on one feature. If m features are used, m hypergraphs can be generated, denoted by \(\mathbb {G}_1=(\mathbb {V}_1, \mathbb {E}_1, {\mathbf {W}}_1), \mathbb {G}_2=(\mathbb {V}_2, \mathbb {E}_2, {\mathbf {W}}_2), \dots , \mathbb {G}_m=(\mathbb {V}_m, \mathbb {E}_m, {\mathbf {W}}_m)\). After obtaining the multiple hypergraphs, a weight \(\omega_i\), i = 1, …, m, is assigned to each hypergraph \(\mathbb {G}_i\), which constitutes a weight vector ω. Up to this point, we have obtained m weighted hypergraphs from the multi-view 3D dataset.
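As a minimal sketch of this construction step, the following snippet builds one k-NN hypergraph incidence matrix per feature type. The function name, the choice of one hyperedge per vertex (a common k-NN construction), and the value of k are illustrative assumptions, not fixed by the chapter.

```python
import numpy as np

def knn_hypergraph(features, k=5):
    """Build an n x n incidence matrix H from one feature matrix.

    Each vertex spawns one hyperedge connecting it to its k nearest
    neighbours in Euclidean space, so column j is the hyperedge
    centred on vertex j (a common k-NN construction; illustrative).
    """
    n = features.shape[0]
    # pairwise squared Euclidean distances
    sq = np.sum(features ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * features @ features.T
    H = np.zeros((n, n))
    for j in range(n):
        # the vertex itself plus its k nearest neighbours
        neighbours = np.argsort(dist[:, j])[: k + 1]
        H[neighbours, j] = 1.0
    return H

# one hypergraph per feature type (m features -> m hypergraphs)
feats = [np.random.rand(20, 32), np.random.rand(20, 64)]
hypergraphs = [knn_hypergraph(f, k=4) for f in feats]
```

With real data, `feats` would hold the MVCNN or Zernike features of the n objects, one matrix per feature type.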
Transductive Hypergraph Computation
After obtaining multiple hypergraphs, we can derive the label of each vertex by the formulation of hypergraph-based semi-supervised learning. The pipeline is shown in Fig. 11.2a. Note that since multi-modal data are used, different modalities may contribute differently to classification, so the influence of the modal weights must be taken into account when calculating the classification results and updating the weights during the computation. The method of weight updating is described in the next section; the focus here is to establish the idea of hypergraph processing of multi-modal features.
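A sketch of this transductive step, assuming the standard hypergraph semi-supervised closed form \(\mathbf F = (\mathbf I - \xi \boldsymbol \Theta)^{-1}\mathbf Y\) with \(\boldsymbol \Theta = \mathbf D_v^{-1/2}\mathbf H\mathbf W\mathbf D_e^{-1}\mathbf H^\top \mathbf D_v^{-1/2}\) and the m hypergraphs combined through the weights ω; all function and parameter names are ours, not the chapter's.

```python
import numpy as np

def theta(H, W=None):
    """Normalised adjacency Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    n, m = H.shape
    w = np.ones(m) if W is None else W
    Dv = (H * w).sum(axis=1)          # vertex degrees
    De = H.sum(axis=0)                # hyperedge degrees
    inv_sqrt_Dv = np.diag(1.0 / np.sqrt(Dv))
    return inv_sqrt_Dv @ H @ np.diag(w / De) @ H.T @ inv_sqrt_Dv

def transductive_labels(Hs, omegas, Y, lam=1.0):
    """Label propagation over a weighted combination of hypergraphs.

    Hs: list of incidence matrices, omegas: per-hypergraph weights,
    Y: n x c label matrix with rows of zeros for unlabeled vertices.
    """
    n = Y.shape[0]
    Theta = sum(w * theta(H) for w, H in zip(omegas, Hs))
    xi = 1.0 / (1.0 + lam)
    F = np.linalg.solve(np.eye(n) - xi * Theta, Y)
    return F.argmax(axis=1)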
Inductive Hypergraph Computation
In real-world visual classification tasks, transductive hypergraph computation can only be updated globally, and its high time complexity can hardly meet the efficiency requirements of visual classification. To address this problem, inductive hypergraph computation is introduced, which learns both the projection of data to labels and the weight vector of the multiple hypergraphs. It also achieves real-time inference for newly added data, as shown in Fig. 11.2b. It is described in the following.
In inductive hypergraph computation, a projection matrix M is learned, and the prediction for the unlabeled data is computed by M.
The objective function for learning M is illustrated as
Under the assumption that vertices connected by one or more hyperedges are more likely to share the same label, the hypergraph Laplacian regularizer Ω(M) is defined as follows:
where \(\vartheta =\left (\frac {{\mathbf {X}}^\top \mathbf {M}(u,k)}{\sqrt {d(u)}}-\frac {{\mathbf {X}}^\top \mathbf {M}(v,k)}{\sqrt {d(v)}}\right )^2\). Note that Ω(M) is in quadratic form of M. The empirical loss term \(\mathbb {R}_{emp}(\mathbf {M})\) is defined as
Φ(M) is an \(\ell_{2,1}\) norm regularizer, used to avoid overfitting of M. It also makes the rows of the matrix sparser, so that the retained rows are more informative. It is defined as
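As a sketch, the standard definition of the \(\ell_{2,1}\) norm, consistent with the description above, reads

```latex
\varPhi(\mathbf{M}) = \|\mathbf{M}\|_{2,1}
  = \sum_{i=1}^{d} \bigl\| \mathbf{M}(i,:) \bigr\|_2
  = \sum_{i=1}^{d} \sqrt{\sum_{j} \mathbf{M}(i,j)^2},
```

where \(\mathbf{M}(i,:)\) denotes the i-th row of M. Summing the row norms drives whole rows toward zero, which is the row-sparsity effect mentioned above.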
The objective function of inductive hypergraph computation task can be written as
Note that the regularizer Φ(M) is convex and non-smooth. Therefore, the objective function can be relaxed to the following:
where U is a diagonal matrix, and its elements are defined as
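A common choice for these diagonal entries in iteratively reweighted least squares treatments of the \(\ell_{2,1}\) norm, offered here as a sketch consistent with the update of U referenced below as Eq. (11.7), is

```latex
\mathbf{U}(i,i) = \frac{1}{2\,\|\mathbf{M}(i,:)\|_2},
```

so that \(\mathrm{tr}(\mathbf{M}^\top \mathbf{U}\mathbf{M})\) acts as a smooth surrogate for \(\|\mathbf{M}\|_{2,1}\).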
To solve this optimization problem, U is first set to the identity matrix, and the iteratively reweighted least squares method is adopted: each variable is updated alternately with the other fixed until convergence is achieved. First, U is fixed, and the objective is differentiated with respect to M. The closed-form solution is
Then M is fixed, while U is updated by Eq. (11.7). The procedure is repeated until both U and M converge.
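The alternating procedure can be sketched as follows. This assumes the common quadratic forms \(\varOmega(\mathbf M) = \mathrm{tr}(\mathbf M^\top \mathbf X \boldsymbol\Delta \mathbf X^\top \mathbf M)\) and \(\mathbb R_{emp}(\mathbf M) = \|\mathbf X^\top \mathbf M - \mathbf Y\|_F^2\), and the IRLS update \(\mathbf U(i,i) = 1/(2\|\mathbf M(i,:)\|_2)\); it is an illustration under those assumptions, not the chapter's exact formulation.

```python
import numpy as np

def irls_inductive(X, Delta, Y, lam=1.0, mu=0.1, iters=30):
    """Alternating (IRLS) solver sketch for the projection matrix M.

    X: d x n feature matrix, Delta: n x n hypergraph Laplacian,
    Y: n x c label matrix. Assumed quadratic objective, see lead-in.
    """
    d = X.shape[0]
    U = np.eye(d)                       # start from the identity, as in the text
    M = np.zeros((d, Y.shape[1]))
    for _ in range(iters):
        # fix U, solve the closed form for M
        A = X @ Delta @ X.T + lam * X @ X.T + mu * U
        M = np.linalg.solve(A, lam * X @ Y)
        # fix M, update U with 1 / (2 * row norm of M), guarded against zero
        row_norms = np.linalg.norm(M, axis=1)
        U = np.diag(1.0 / (2.0 * np.maximum(row_norms, 1e-12)))
    return M

def predict(M, x_t):
    """Label prediction for a new sample x_t via the learned projection."""
    return int(np.argmax(x_t @ M))
```

The `predict` helper mirrors the inductive step described next: once M is learned, a new sample only needs one projection, which is what gives the real-time inference behavior.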
Given a testing sample \(x_t\), its prediction can be obtained by
Hypergraph computation can achieve good results in visual classification problems, and inductive hypergraph computation in particular achieves real-time online classification while maintaining good classification performance.
11.3 3D Object Retrieval
3D object retrieval targets finding similar 3D objects in a database, given a 3D query. Usually, each 3D object can be described by several different types of data, such as multiple views, point clouds, meshes, or voxels. The main task of 3D object retrieval is to define an appropriate measure of the similarity between each pair of 3D objects, and how to define such a measure is therefore the key issue. Traditional methods mainly focus on either representation learning for each type of data or distance metrics for specific features. However, the correlations among 3D objects are very complex: both pairwise and beyond-pairwise correlations exist. To achieve better 3D object retrieval performance, it is important to take such high-order correlations among 3D objects into consideration. In this retrieval task, each vertex denotes a 3D object in the database, so the number of vertices equals the number of objects in the database.
A hypergraph can be used for such correlation modeling in 3D object retrieval. We introduce the hypergraph computation method [2] for 3D object retrieval here; the framework is shown in Fig. 11.3. First, a group of hypergraphs is generated, and a learning process is then conducted for similarity measurement.
We take the multi-view representation as an example. All views of the 3D objects are first grouped into clusters. Objects with views in one cluster are then connected by a hyperedge (note that a hyperedge can connect multiple vertices in a hypergraph). As a result, a hypergraph is generated in which the vertices represent the objects in the database. A hyperedge's weight is determined by the visual similarity between the pairs of views in its cluster. Multiple hypergraphs can be generated by varying the number of clusters; these hypergraphs encode the relationships between objects at various granularities. When two 3D objects are connected by more and stronger hyperedges, they have higher similarity, and this information can then be used for 3D object retrieval.
To generate a 3D object hypergraph, each object is treated as a vertex in the hypergraph \(\mathbb {G} = (\mathbb {V}, \mathbb {E}, \mathbf {W})\). The generated hypergraph has n vertices if there are n objects in the database. Each view of these 3D objects can be represented by pre-defined features, which may differ across tasks. Given these features, K-means clustering can be used to group the views into clusters, and each cluster corresponds to a hyperedge connecting the objects whose views fall in it. Two diagonal matrices \(\mathbf{D}_v\) and \(\mathbf{D}_e\) represent the vertex and hyperedge degrees, respectively, and an incidence matrix H is generated. The weight of a hyperedge e can be measured by
where \(d(x_a, x_b)\) is the distance between \(x_a\) and \(x_b\), two views in the same view cluster, which can be calculated as the Euclidean distance. The parameter σ is empirically set to the median of the distances between all pairs of views. The hypergraph generation procedure is shown in Fig. 11.4.
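A sketch of this weighting, assuming the common Gaussian-kernel form in which the similarities \(\exp(-d(x_a,x_b)^2/\sigma^2)\) are accumulated over all view pairs in the cluster; whether the pair similarities are summed or averaged is our assumption, and the names are illustrative.

```python
import numpy as np
from itertools import combinations

def hyperedge_weight(cluster_views, sigma):
    """Weight of the hyperedge for one view cluster: accumulated
    Gaussian similarity over all pairs of views in the cluster
    (one common reading of the weighting described above)."""
    w = 0.0
    for a, b in combinations(range(len(cluster_views)), 2):
        d = np.linalg.norm(cluster_views[a] - cluster_views[b])
        w += np.exp(-d ** 2 / sigma ** 2)
    return w

# sigma set empirically to the median pairwise distance, as in the text
views = [np.random.rand(16) for _ in range(50)]
dists = [np.linalg.norm(a - b) for a, b in combinations(views, 2)]
sigma = float(np.median(dists))
w = hyperedge_weight(views[:5], sigma)
```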
Let \(\mathbb {G}_1 = (\mathbb {V}_1, \mathbb {E}_1, \mathbf W_1)\), \(\mathbb {G}_2 = (\mathbb {V}_2, \mathbb {E}_2, \mathbf W_2)\), ⋯, and \(\mathbb {G}_{n_g} = (\mathbb {V}_{n_g} , \mathbb {E}_{n_g} , \mathbf W_{n_g} )\) denote the \(n_g\) hypergraphs, and let \(\{{\mathbf {D}}_{v_1}, {\mathbf {D}}_{v_2},\ldots , {\mathbf {D}}_{vn_g} \}\), \(\{{\mathbf {D}}_{e_1}, {\mathbf {D}}_{e_2},\ldots , {\mathbf {D}}_{en_g} \}\), and \(\{{\mathbf {H}}_1, {\mathbf {H}}_2,\ldots , {\mathbf {H}}_{n_g} \}\) be the corresponding vertex degree matrices, hyperedge degree matrices, and incidence matrices, respectively. The retrieval results are based on the fusion of these hypergraphs. The weight of the i-th hypergraph is denoted by \(\alpha_i\), where \(\sum ^{n_g}_{i=1} \alpha _i = 1\) and \(\alpha_i \geq 0\).
Retrieval can be regarded as a one-class classification problem [19]. As a result, we formulate the transductive inference as a regularization problem, \(\arg\min_{\mathbf f} \{\varOmega(\mathbf f) + \lambda \mathbb{R}_{emp}(\mathbf f)\}\), where the regularizer term Ω(f) is defined by
where vector f represents the relevance score to be learned.
In this way, the similarity between each object and the query can be calculated based on the relevance score. It is noted that the feature used in this method can be selected based on the task itself, and multiple types of representations can also be used here. Given multiple features for the same data, or different features for multi-modal data, we can generate the hypergraph(s) using the method introduced in Chap. 4.
11.4 Tag-Based Social Image Retrieval
User-generated tags are widely associated with social images and describe their content. Benefiting from this rich content, the tags are useful for social image retrieval tasks. Figure 11.5 shows some examples of social images associated with tags.
The main challenges of applying such tags to social image retrieval are that heavy noise makes it hard to mine the true relations among the tags and images, and that using the tags and images separately leads to sub-optimal retrieval. In this section, we introduce a visual–textual joint relevance learning approach using hypergraph computation [13]. Figure 11.6 illustrates this method for tag-based social image retrieval. In this method, the features of both the images and the tags are first extracted, and the hypergraph is constructed based on these features. Hypergraph learning is then performed, and the learned semantic similarity can be used for tag-based social image retrieval.
In this example, the bag-of-visual-words feature is selected for image representation. For the i-th image, the visual content is represented by bag-of-visual-words \(f_i^{bow}\), while for the corresponding tags, the bag-of-textual-words representation \(f_i^{tag}\) is employed. Then, the visual-content-based hyperedges and the tag-based hyperedges are constructed, respectively: the visual-content-based hyperedges connect the images that share a visual word, and the tag-based hyperedges connect the images that share a tag word. Figure 11.7 provides examples of the hyperedge generation process using textual and visual information, respectively. The overall hypergraph therefore has \(n_e = n_c + n_t\) hyperedges, where \(n_c\) denotes the number of visual words and \(n_t\) the number of tag words. After the construction of the hypergraph, images sharing more visual words or tags are connected by more hyperedges, which can be used for further processing. Figure 11.8 further shows the connections between two social images based on the textual and the visual information, respectively.
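This joint construction can be sketched directly from occurrence matrices; one hyperedge per visual word plus one per tag word gives the \(n_e = n_c + n_t\) columns described above. The function and variable names are ours.

```python
import numpy as np

def joint_incidence(bow, tags):
    """Incidence matrix with one hyperedge per visual word and one per
    tag word: H[i, e] = 1 if image i contains word e.

    bow:  n x n_c matrix of visual-word counts per image.
    tags: n x n_t matrix of tag-word occurrences per image.
    Returns an n x (n_c + n_t) binary incidence matrix.
    """
    return np.hstack([(bow > 0).astype(float), (tags > 0).astype(float)])

bow = np.array([[1, 0, 2], [0, 3, 1]])   # 2 images, 3 visual words
tags = np.array([[1, 1], [0, 1]])        # 2 images, 2 tag words
H = joint_incidence(bow, tags)
```

Columns where both rows are 1 (here the shared visual word and the shared tag) are exactly the hyperedges connecting the two images, matching Fig. 11.8's intuition.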
Denoting by f the relevance score vector, by y the ground-truth relevance, and by w the weight vector of the hyperedges, the hypergraph computation can be formulated as
where λ and μ are the weighting parameters. The first term in Eq. (11.12) is the regularizer on the hypergraph structure, which guarantees smoothness over the hypergraph. The second term is the empirical loss between the relevance score vector and the ground truth. The last term is the \(\ell_2\) norm of the hyperedge weights, which is used to learn a better combination of the different hyperedges. This optimization task can be solved by alternating optimization. First, w is fixed, and f is optimized by
from which we can have
where \(\xi =\frac 1{1+\lambda }\), Θ = I − Δ.
Then, f is fixed, and w is optimized by
The Lagrangian can be applied here, and we have
where \(\varGamma = \mathbf D_v^{-\frac 12}\mathbf H\) and Γ i represents the i-th column of Γ.
The semantic relevance between an image \(x_i\) and the query tag \(t_q\) is estimated by
which denotes the average similarity between \(t_q\) and all tags associated with \(x_i\), where \(s_{tag}\) can be calculated as
where FD represents the Flickr distance [20].
Given these similarities between each image and the query tag, we can have the retrieval results accordingly. We also note that the features used in this application can be changed with respect to the requirement of different tasks.
11.5 Summary
In this chapter, we have introduced the applications of hypergraph computation in computer vision, including visual classification, 3D object retrieval, and tag-based social image retrieval. For classification and retrieval tasks, hypergraphs can be used to model the high-order relationships among samples in the feature space and to solve the problem by hypergraph-based label propagation methods. The success of hypergraphs in computer vision stems from the fact that the feature correlations of visual data are so complex that they are hard to explore with pairwise correlation methods. Hypergraph computation can be further applied to other computer vision tasks, such as visual registration, visual segmentation, gaze estimation, etc.
References
Z. Zhang, H. Lin, X. Zhao, R. Ji, Y. Gao, Inductive multi-hypergraph learning and its application on view-based 3D object classification. IEEE Trans. Image Process. 27(12), 5957–5968 (2018)
Y. Gao, M. Wang, D. Tao, R. Ji, Q. Dai, 3-D object retrieval and recognition with hypergraph analysis. IEEE Trans. Image Process. 21(9), 4290–4303 (2012)
J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification. IEEE Trans. Image Process. 21(7), 3262–3272 (2012)
D. Di, C. Zou, Y. Feng, H. Zhou, R. Ji, Q. Dai, Y. Gao, Generating hypergraph-based high-order representations of whole-slide histopathological images for survival prediction. IEEE Trans. Pattern Anal. Mach. Intell. 1–16 (2022). https://doi.org/10.1109/TPAMI.2022.3209652
D. Di, S. Li, J. Zhang, Y. Gao, Ranking-based survival prediction on histopathological whole-slide images, in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, (2020), pp. 428–438
D. Di, J. Zhang, F. Lei, Q. Tian, Y. Gao, Big-hypergraph factorization neural network for survival prediction from whole slide image. IEEE Trans. Image Process. 31, 1149–1160 (2022)
J. Bai, B. Gong, Y. Zhao, F. Lei, C. Yan, Y. Gao, Multi-scale representation learning on hypergraph for 3D shape retrieval and recognition. IEEE Trans. Image Process. 30, 5327–5338 (2021)
G.Y. An, Y. Huo, S.E. Yoon, Hypergraph propagation and community selection for objects retrieval, in Proceedings of the Advances in Neural Information Processing Systems, (2021), pp. 3596–3608
D. Pedronette, L. Valem, J. Almeida, R. Torres, Multimedia retrieval through unsupervised hypergraph-based manifold ranking. IEEE Trans. Image Process. 28(12), 5824–5838 (2019)
L. Nong, J. Wang, J. Lin, H. Qiu, L. Zheng, W. Zhang, Hypergraph wavelet neural networks for 3D object classification. Neurocomputing 463, 580–595 (2021)
S. Bai, X. Bai, Q. Tian, L.J. Latecki, Regularized diffusion process on bidirectional context for object retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 41(5), 1213–1226 (2019)
F. Chen, B. Li, L. Li, 3D object retrieval with graph-based collaborative feature learning. J. Visual Commun. Image Represen. 28, 261–268 (2019)
Y. Gao, M. Wang, Z. Zha, J. Shen, X. Li, X. Wu, Visual-textual joint relevance learning for tag-based social image search. IEEE Trans. Image Process. 22(1), 363–376 (2013)
Y. Wang, L. Zhu, X. Qian, J. Han, Joint hypergraph learning for tag-based image retrieval. IEEE Trans. Image Process. 27(9), 4437–4451 (2018)
L. Chen, Y. Gao, Y. Zhang, S. Wang, B. Zheng, Scalable hypergraph-based image retrieval and tagging system, in Proceedings of the 34th IEEE International Conference on Data Engineering (2018), pp. 257–268
N. Bouhlel, G. Feki, C.B. Amar, Visual re-ranking via adaptive collaborative hypergraph learning for image retrieval, in Proceedings of the Advances in Information Retrieval - 42nd European Conference on IR Research (2020), pp. 511–526
Y. Chu, C. Feng, C. Guo, Social-guided representation learning for images via deep heterogeneous hypergraph embedding, in Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (2018), pp. 1–6
H. Su, S. Maji, E. Kalogerakis, E. Learned-Miller, Multi-view convolutional neural networks for 3D shape recognition, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 945–953
Y. Huang, Q. Liu, S. Zhang, D. Metaxas, Image retrieval via probabilistic hypergraph ranking, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 3376–3383
L. Wu, X. Hua, N. Yu, W. Ma, S. Li, Flickr distance: a relationship measure for visual concepts. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 863–875 (2012)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this chapter
Dai, Q., Gao, Y. (2023). Hypergraph Computation for Computer Vision. In: Hypergraph Computation. Artificial Intelligence: Foundations, Theory, and Algorithms. Springer, Singapore. https://doi.org/10.1007/978-981-99-0185-2_11
Print ISBN: 978-981-99-0184-5
Online ISBN: 978-981-99-0185-2