
1 Introduction

Empowered by ubiquitous access to computing devices and the Internet, an ever-growing amount of digital images has emerged [25]. In light of this, image retrieval is an active research topic that aims at retrieving images relevant to a user query from a large database of digital images [11, 14, 21, 26]. Until recently, most popular search engines (e.g., Flickr) were built upon the textual information associated with images [4, 7, 24]. Nevertheless, such approaches cannot comprehensively describe the rich content of images since they totally ignore the visual information [10]. Besides, they suffer from the fact that textual information is often noisy, ambiguous and language-dependent [8, 12]. As a consequence, the retrieved results may be noisy and irrelevant, which degrades the retrieval performance [17, 24]. To tackle these issues, visual re-ranking has been introduced to refine the text-based retrieval results using the visual information [4, 19, 32, 35]. Namely, it attempts to boost the rank of images relevant to the textual query [24].

Recently, hypergraph learning has been widely used in many applications for its capability in capturing complex relationships among samples [4, 15, 23]. In the case of visual re-ranking, the textual results are taken as vertices and the re-ranking problem is formulated as transductive learning on the hypergraph [2, 9]. The potential of hypergraph learning is essentially determined by the hypergraph construction scheme [22]. Most previous hypergraph learning methods adopt a neighborhood-based strategy to build the hypergraph, in which textual results are taken as vertices and each vertex is linked to its k nearest neighbors by a hyperedge. While intuitive, this strategy suffers from the following drawbacks: (1) it is sensitive to noise; (2) it lacks the ability to discover the real neighborhood structure; and (3) the parameter k is fixed as a global parameter for all samples, regardless of their local data distribution. To tackle these issues, recent works have proposed to leverage regularized regression models, namely sparse representation and ridge regression, for hypergraph construction [22]. Compared to the neighborhood-based hypergraph, the sparse hypergraph achieves superior performance in revealing the local data structure and handling noisy data. However, it cannot discover the samples related to a hyperedge centroid as thoroughly as possible. Moreover, the sparsity constraint makes the hypergraph construction computationally expensive [41].

Recently, ridge regression has gained considerable attention not only for its effectiveness in data representation but also for its computational efficiency [41]. In contrast to sparse representation, which encourages competition between samples to represent a datum, ridge regression attempts to include all samples in the representation process; this is why the framework is often called collaborative representation. Owing to these desirable properties, in this paper we put a particular emphasis on collaborative representation and propose an adaptive collaborative hypergraph learning method (referred to as ACR-HG) for visual re-ranking. The proposed data representation technique adaptively preserves the locality structure and discards irrelevant/outlier samples with respect to a test sample by integrating a distance regularizer on the representation coefficients.
At the feature level, we impose a weight matrix on the representation errors to adaptively highlight the important features and reduce the effect of redundant/noisy ones. Moreover, to enhance the interpretability of the representation, a nonnegativity constraint is added so that the representation coefficients can directly reveal the similarity among samples. In this way, we obtain a more informative, higher-quality hypergraph which not only captures the grouping information but also reveals the local neighborhood structure and exhibits more discriminative power and robustness to noisy data. Extensive experiments on the public MediaEval benchmarks demonstrate that our re-ranking method achieves consistently superior results compared to state-of-the-art methods.

2 Related Works

In recent years, many visual re-ranking methods have been proposed in the literature. According to the statistical analysis model used, they can be classified into supervised and unsupervised methods. The former cast re-ranking as a classification problem that aims at separating relevant from irrelevant images using data from the initial results as training samples. For instance, the authors of [30] built a supervised classification model using expert annotations to assign a relevance score to each image. The latter assume that relevant samples are likely to be closer to each other than to irrelevant ones, and aim at discovering and mining patterns using pair-wise similarities. Broadly, there are two main approaches. The first is to leverage clustering to group images with respect to their visual closeness. For instance, hierarchical clustering is applied in [1] and [29] to cluster samples by relevance, and the authors of [28] apply a graph-based clustering method where a similarity graph is initially built to represent relationships among images. The second approach is to adopt graph-based learning for its effectiveness in modeling the intrinsic structure within data. VisualRank, proposed by Jing and Baluja [20], is the most popular graph-based re-ranking method. It applies a random walk on an affinity graph where images are taken as nodes and their visual similarities as probabilistic hyper-links. In [39], a manifold ranking process is applied over the data manifold with the aim of naturally finding the most relevant images. Although promising results have been achieved, how to represent complex and high-order relationships hidden in data remains the performance bottleneck for graph-based re-ranking. As a generalization of graph learning, hypergraph learning has received increasing attention in recent years owing to its ability to model complex data structures in a more flexible and elegant way [3, 23]. In visual re-ranking, hypergraph learning is widely used for relevance estimation. For instance, in [2], the authors construct a k-nearest-neighbor hypergraph based on the visual similarity between images; hypergraph ranking is then performed to learn the images' relevance scores. Although efficient, this method suffers from some drawbacks. First, the neighborhood strategy cannot capture the local data distribution of each datum since it uses a fixed number of neighbors k for all samples [35]. Second, the neighborhood strategy is very sensitive to noisy data due to the use of the Euclidean distance as similarity measure [22, 37]. To address these limitations, some researchers have proposed to exploit regression models for data representation. The most widely used model is sparse representation (SR), in which each sample is represented as a linear combination of the remaining samples [15, 36]. Compared to the neighborhood-based hypergraph, the sparse hypergraph achieves superior performance in revealing the local data structure and handling noisy data. However, it cannot discover the samples related to a hyperedge centroid as thoroughly as possible. Moreover, the sparsity constraint makes the hypergraph construction computationally expensive. Recently, collaborative representation has gained considerable attention not only for its effectiveness in data representation but also for its computational efficiency [41].
Therefore, in this paper, we put a particular emphasis on the collaborative representation and we propose an adaptive collaborative hypergraph learning for visual re-ranking.

3 The Proposed Hypergraph Model for Visual Re-Ranking

3.1 Adaptive Collaborative Representation

For clarity, we first introduce some important notations used throughout this paper. The matrix \(X=\left[ x_{1},...,x_{N} \right] \in \mathbb {R}^{d\times N} \) is a collection of N data samples, where \(x_i \in \mathbb {R}^{d} \) denotes the i-th data sample. \(||Z||_F\) is the Frobenius norm of matrix Z. \(\mathbf {1}\) denotes a matrix or a vector whose elements are all equal to 1, and \(\odot \) denotes the element-wise multiplication. For a scalar v, we define \((v)_+\) as \((v)_+=max(v,0)\) [27].

Problem Formulation. Conventionally, the collaborative representation aims to solve the following least square problem:

$$\begin{aligned} \underset{Z}{\arg \min }\ \left\| X-XZ \right\| _{F}^{2} + \lambda \left\| Z \right\| _{F}^{2} \end{aligned}$$
(1)
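For reference, problem (1) admits the well-known closed-form solution \(Z=(X^TX+\lambda I)^{-1}X^TX\). A minimal NumPy sketch of this baseline (function name ours) is:

```python
import numpy as np

def collaborative_representation(X, lam=0.1):
    """Closed-form solution of Eq. (1): Z = (X^T X + lam*I)^{-1} X^T X."""
    N = X.shape[1]
    G = X.T @ X                                  # Gram matrix of the N samples
    return np.linalg.solve(G + lam * np.eye(N), G)
```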

In this paper, we propose an adaptive collaborative representation formulated as follows:

$$\begin{aligned} \underset{Z,W}{\arg \min }\ \left\| W^{1/2}\odot (X-XZ) \right\| _{F}^{2} + \frac{\beta }{2} \left\| W \right\| ^{2}_{F} + \lambda \left\| Z \right\| _{F}^{2} +\gamma \,{tr}(D^{T}Z) \nonumber \\ \text { s.t. } W \ge 0,\ W^T\mathbf {1} =\mathbf {1},\ Z\ge 0,\ diag(Z)=0,\ Z\mathbf {1} =\mathbf {1} \end{aligned}$$
(2)

Specifically, the objective function contains the following terms:

  1. The self-representation term: It represents the reconstruction error between the estimated and the real data. Many references have pointed out that redundant/noisy features are likely to have large reconstruction errors [23, 40]. Based on this assumption, we regularize the reconstruction errors by a nonnegative weight matrix W. Hence, we adaptively highlight the important features while reducing the effect of redundant/noisy ones.

  2. The \(\ell _2\)-regularizer on the weight matrix: This term, together with the constraint \(W^T\mathbf {1} =\mathbf {1}\), is imposed to avoid the trivial solution for W, as in [42].

  3. The regularization term on the representation matrix: It shrinks the representation coefficients towards zero by imposing an \(\ell _2\)-regularizer on their magnitudes. As a result, all samples collaborate in the representation of a test sample, since their coefficients never become exactly zero.

  4. The locality-preserving term: The collaborative representation does not consider the data locality, which has been observed to be critical for many learning tasks [34]. For this purpose, we incorporate a locality-preserving term in our model so that (1) the local structure is preserved (i.e., close samples have close representations) and (2) irrelevant/outlier samples are discarded. Mathematically, each element of the distance matrix D is defined as \(d_{ij}=\Vert x_i-x_j \Vert _{2}^{2}\) (see the sketch after this list).

  5. Finally, we add the following constraints on the representation matrix Z:

    • \(Z\ge 0\): A non-negative representation coefficient \(z_{ij}\) can directly reveal the similarity between the samples \(x_i\) and \(x_j\) [45].

    • \(diag(Z)=0\): this constraint prevents a sample from being represented as a linear combination of itself.

    • \(Z\mathbf {1} =\mathbf {1}\): the sum of each row of Z is set equal to 1, which ensures that all samples are involved in the joint representation.
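As a small illustration of the locality-preserving term in item 4, the distance matrix D can be computed in vectorized form; the following sketch (function name ours) assumes the samples are stored as columns of X:

```python
import numpy as np

def pairwise_sq_distances(X):
    """Distance matrix D of problem (2): d_ij = ||x_i - x_j||_2^2, columns of X are samples."""
    sq = np.sum(X ** 2, axis=0)                        # squared norm of each sample
    D = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.maximum(D, 0.0)                          # clip tiny negatives from round-off
```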

The ADMM-Based Optimization. There are two unknown variables in problem (2), i.e., Z and W. To make problem (2) separable, we introduce auxiliary variables as follows:

$$\begin{aligned} \underset{Z,W}{\arg \min }\ \left\| W^{1/2}\odot E \right\| _{F}^{2} + \frac{\beta }{2} \left\| W \right\| ^{2}_{F} + \lambda \left\| J \right\| _{F}^{2} +\gamma \,{tr}(D^{T}Z) \nonumber \\ \text { s.t. } W \ge 0,\ W^T\mathbf {1} =\mathbf {1},\ Z\ge 0,\ diag(Z)=0,\ Z\mathbf {1} =\mathbf {1},\ E=X-XZ,\ J=Z \end{aligned}$$
(3)

Considering problem (3) as a two-block optimization problem, we adopt the alternating direction method of multipliers (ADMM) to solve it [38]. Thus, we define the augmented Lagrangian function as:

$$\begin{aligned} \mathfrak {L}(Z,W,E,J,C_1,C_2)= \left\| W^{1/2}\odot E \right\| _{F}^{2} + \frac{\beta }{2} \left\| W \right\| ^{2}_{F}+ \lambda \left\| J \right\| _{F}^{2} +\gamma \,{tr}\left( D^{T}Z\right) \nonumber \\ +\frac{\mu }{2}\left( \left\| X-XZ-E+\frac{C_1}{\mu } \right\| ^2_F+\left\| Z-J+\frac{C_2}{\mu } \right\| ^2_F \right) \end{aligned}$$
(4)

where \(C_1\), \(C_2\) are the Lagrangian multipliers and \(\mu \) is a penalty parameter.

Then, we solve for each unknown variable in turn while fixing the others.

Step 1: The variable W is obtained by minimizing the following problem while fixing the other variables:

$$\begin{aligned} \underset{W}{\min }\ \left\| W^{1/2}\odot E \right\| _{F}^{2} + \frac{\beta }{2}\left\| W\right\| ^{2}_{F} \ \text { s.t. } W\ge 0,\ W^T\mathbf {1}=\mathbf {1} \end{aligned}$$
(5)

Solving problem (5) is equivalent to solving:

$$\begin{aligned} \underset{ w_{ij}\ge 0,\sum _{j}w_{ij}=1}{min}\sum _{i,j}\left( w_{ij}+\frac{e_{ij}^2}{\beta }\right) ^2 \end{aligned}$$
(6)

Since problem (6) is independent for each i, it can be rewritten in vector form [27]:

$$\begin{aligned} \underset{w_i\ge 0,w_i^T{{\varvec{1}}}=1}{min}\left\| w_i+\frac{h_i}{\beta } \right\| _2^2 \end{aligned}$$
(7)

where \(H=E \odot E\) and \(h_i\) denotes the \(i\)-th row of H.

The associated Lagrangian function is:

$$\begin{aligned} \mathfrak {L}(w_i,c,m_i)=\frac{1}{2}\left\| w_i+\frac{h_i}{\beta } \right\| _2^2-c(w_i^T{\varvec{{1}}}-1)-m_i^T w_i \end{aligned}$$
(8)

where c and \(m_i\) are the Lagrangian multipliers associated to the boundary constraints on \(w_i\).

Given that \(m_{ij}w_{ij}=0\) according to the KKT conditions [42], we have:

$$\begin{aligned} w_{ij}=\left( c-\frac{h_{ij}}{\beta } \right) _+ \end{aligned}$$
(9)

Finally, we update the Lagrangian multiplier c according to the constraint \(w_i^T{\varvec{{1}}}=1\) as follows:

$$\begin{aligned} \sum _{j=1}^{N}\left( c-\frac{h_{ij}}{\beta } \right) =1\Rightarrow c=\frac{1}{N}+\frac{1}{N\beta }\sum _{j=1}^{N}h_{ij} \end{aligned}$$
(10)
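Following the row-wise formulas of Eqs. (9)–(10), and treating W as having the same shape as the error matrix E, the W-update might be sketched as follows (function name ours; the clipping dropped when deriving Eq. (10) is restored here by \((\cdot )_+\)):

```python
import numpy as np

def update_W(E, beta):
    """Step 1: closed-form W-update of Eqs. (9)-(10), applied row by row."""
    H = E * E                                                 # H = E ⊙ E
    N = H.shape[1]
    c = 1.0 / N + H.sum(axis=1, keepdims=True) / (N * beta)   # Eq. (10), one c per row
    return np.maximum(c - H / beta, 0.0)                      # Eq. (9): w_ij = (c - h_ij/beta)_+
```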

Step 2: We can obtain the error matrix E by solving the following problem:

$$\begin{aligned} \underset{E}{\min }\ \left\| W^{1/2} \odot E\right\| ^{2}_{F} + \frac{\mu }{2}\left\| E-G\right\| ^2_F \;\; \text {where} \;\; G=X-XZ+\frac{C_1}{\mu } \end{aligned}$$
(11)

Problem (11) is equivalent to:

$$\begin{aligned} \sum _{i,j}\underset{e_{ij}}{min}\left( e_{ij}-\frac{\mu g_{ij}}{\mu +2w_{ij}}\right) ^2 \end{aligned}$$
(12)

Then, the optimal solution for each element \(e_{ij}\) is:

$$\begin{aligned} e_{ij}=\frac{\mu g_{ij}}{\mu +2w_{ij}} \end{aligned}$$
(13)

Step 3: We can obtain the matrix J by solving the following problem:

$$\begin{aligned} \underset{J}{\min }\ \lambda \left\| J \right\| _{F}^{2} + \frac{\mu }{2}\left\| Z-J+\frac{C_2}{\mu }\right\| ^2_F \end{aligned}$$
(14)

The closed-form solution of J can be obtained by setting the derivative of (14) w.r.t. J to zero:

$$\begin{aligned} J^*=\frac{\mu G}{\mu +2\lambda } \;\; \text {where} \;\; G=Z+\frac{C_2}{\mu } \end{aligned}$$
(15)

Step 4: The variable Z can be obtained by solving the following problem:

$$\begin{aligned} \underset{Z}{\min }\ \gamma \,{tr}(D^{T}Z)+ \frac{\mu }{2}\left( \left\| M_1-XZ\right\| ^2_F+ \left\| Z-M_2\right\| ^2_F \right) \nonumber \\ \text { s.t. } Z\ge 0,\ diag(Z)=0,\ Z\mathbf {1} =\mathbf {1} \end{aligned}$$
(16)

where \(M_1=X-E+\frac{C_1}{\mu }\) and \(M_2=J-\frac{C_2}{\mu }\)

Consider first the following unconstrained problem:

$$\begin{aligned} \underset{Z}{argmin}\ \gamma {tr}(D^{T}Z)+ \frac{\mu }{2}\left( ||M_1-XZ||^2_F+ ||Z-M_2||^2_F \right) \end{aligned}$$
(17)

The problem (17) has a closed-form solution obtained by setting its derivative equal to zero:

$$\begin{aligned} \widehat{Z}=\left( X^TX+I \right) ^{-1}\left( X^TM_1+M_2-\frac{\gamma }{\mu }D \right) \end{aligned}$$
(18)

Then, the optimal solution Z of the problem (16) can be obtained more efficiently by solving the following problem:

$$\begin{aligned} \underset{Z\ge 0,\ diag(Z)=0,\ Z\mathbf {1} =\mathbf {1}}{\min }\ \left\| Z-\widehat{Z}\right\| ^2_F \Leftrightarrow \underset{z_{ij}\ge 0,\ z_{ii}=0,\ \sum _{j}z_{ij}=1}{\min }\left( z_{ij}-\widehat{z}_{ij} \right) ^2 \end{aligned}$$
(19)

We obtain the optimal solution for each row \(z_{i}\) as in problem (6):

$$\begin{aligned} z_{i}=\left( \eta _{i}I_f^T+\bar{z_{i}} \right) _+ \end{aligned}$$
(20)

where \(I_f\) is a column vector whose elements are all equal to one except the \(i\)-th, which is set to zero. \(\bar{z}_{i}\) is defined element-wise as:

$$\begin{aligned} \bar{z}_{ij} =\left\{ \begin{array}{cc} \widehat{z}_{ij} &{} i\ne j\\ 0 &{} otherwise \end{array}\right. \end{aligned}$$
(21)

\(\eta _i\) is the Lagrangian multiplier which is calculated as:

$$\begin{aligned} \eta _i=\frac{1-\bar{z}_{i}{\varvec{1}}}{N-1} \end{aligned}$$
(22)
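Putting Eqs. (18)–(22) together, the Z-update could be sketched as follows (names ours; as in the derivation above, the \((\cdot )_+\) clipping is ignored when computing \(\eta _i\)):

```python
import numpy as np

def update_Z(XtX_I_inv, X, M1, M2, D, gamma, mu):
    """Step 4: closed form of Eq. (18) followed by the projection of Eqs. (20)-(22)."""
    Z_hat = XtX_I_inv @ (X.T @ M1 + M2 - (gamma / mu) * D)    # Eq. (18)
    N = Z_hat.shape[0]
    Z_bar = Z_hat.copy()
    np.fill_diagonal(Z_bar, 0.0)                              # Eq. (21): zero the diagonal
    eta = (1.0 - Z_bar.sum(axis=1)) / (N - 1)                 # Eq. (22), one eta per row
    Z = np.maximum(Z_bar + eta[:, None], 0.0)                 # Eq. (20): shift then clip
    np.fill_diagonal(Z, 0.0)                                  # enforce diag(Z) = 0
    return Z
```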

Step 5: We update the Lagrangian multipliers and the penalty parameter as follows, respectively:

$$\begin{aligned} C_1=C_1+\mu \left( X-XZ-E \right) \end{aligned}$$
(23)
$$\begin{aligned} C_2=C_2+\mu \left( Z-J \right) \end{aligned}$$
(24)
$$\begin{aligned} \mu =min(\mu _{max},\mu \rho ) \end{aligned}$$
(25)
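For completeness, the five steps can be assembled into the overall ADMM loop sketched below; it reuses the update_W and update_Z sketches given above, and the initialization, iteration count and the \(\rho \), \(\mu _{max}\) values are illustrative choices rather than tuned settings:

```python
import numpy as np

def acr_admm(X, D, beta=1.0, lam=0.1, gamma=0.1,
             mu=1.0, rho=1.1, mu_max=1e6, n_iter=100):
    """Sketch of the ADMM solver for problem (2), following Steps 1-5 (Eqs. (5)-(25))."""
    d, N = X.shape
    Z = np.zeros((N, N))
    J = np.zeros((N, N))
    E = X - X @ Z                     # residual; equals X at initialization
    W = np.full((d, N), 1.0 / N)      # uniform feature weights to start
    C1 = np.zeros((d, N))
    C2 = np.zeros((N, N))
    XtX_I_inv = np.linalg.inv(X.T @ X + np.eye(N))        # pre-computed once (Step 4)
    for _ in range(n_iter):
        W = update_W(E, beta)                             # Step 1: Eqs. (9)-(10)
        G = X - X @ Z + C1 / mu
        E = mu * G / (mu + 2.0 * W)                       # Step 2: Eq. (13)
        J = mu * (Z + C2 / mu) / (mu + 2.0 * lam)         # Step 3: Eq. (15)
        M1 = X - E + C1 / mu
        M2 = J - C2 / mu
        Z = update_Z(XtX_I_inv, X, M1, M2, D, gamma, mu)  # Step 4: Eqs. (18)-(22)
        C1 = C1 + mu * (X - X @ Z - E)                    # Eq. (23)
        C2 = C2 + mu * (Z - J)                            # Eq. (24)
        mu = min(mu_max, mu * rho)                        # Eq. (25)
    return Z, W
```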

Convergence and Computational Complexity. In this section, we first analyze the computational complexity of the proposed representation model. Clearly, the most computationally demanding step in the ADMM-based optimization is Step 4, which involves matrix multiplication and matrix inversion; the inversion costs \( O (N^3)\) for an \(N\times N\) matrix. Fortunately, the term \(\left( X^TX+I \right) ^{-1}\) can be pre-computed before the iteration loop since it is independent of all variables. The first two steps are computed efficiently since they reduce to element-wise operations, and the third step mainly involves matrix additions. Hence, their computational complexities are negligible compared to the fourth step.

3.2 The Proposed Hypergraph Construction Scheme

In this work, we assume that the representation vectors corresponding to two similar samples should be close, since both samples can be similarly represented using the remaining ones. More formally, we measure the similarity between two data samples as follows:

$$\begin{aligned} A(i,j)= z_i \cdot z_j \end{aligned}$$
(26)

In terms of hypergraph, such information is very useful to characterize the incidence relations between hyperedges and their vertices:

$$\begin{aligned} h(v_{i},e_{j})={\left\{ \begin{array}{ll} A \left( i,j\right) , \;\; \text {if} \;\; z_{ij} \ge \theta \\ 0, \;\; otherwise\end{array}\right. } \end{aligned}$$
(27)

Here, we set \(\theta \) as the mean value of \(\left\{ z_{ik} \right\} _{k=1}^{N}\). According to this formulation, each vertex \(v_{i}\) is associated with a hyperedge \(e_{j}\) based on whether it has contributed prominently to the representation of its centroid \(v_{j}\). Moreover, for each centroid, the number of neighbors is adaptively selected; hence, its distinctive neighborhood structure is well preserved.
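Under this formulation, the similarity matrix A of Eq. (26) and the incidence matrix of Eq. (27) could be assembled as follows (a sketch under our reading that \(\theta \) is computed per vertex, one mean per row of Z):

```python
import numpy as np

def build_incidence(Z):
    """Similarity of Eq. (26) gated by the adaptive threshold of Eq. (27)."""
    A = Z @ Z.T                                # Eq. (26): A(i,j) = z_i . z_j
    theta = Z.mean(axis=1, keepdims=True)      # theta_i = mean of {z_ik}_{k=1..N}
    H = np.where(Z >= theta, A, 0.0)           # v_i joins e_j only if z_ij >= theta_i
    return H
```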

3.3 The Hypergraph-Based Re-Ranking

In this work, we formulate the visual re-ranking problem as a transductive learning framework on the adaptive collaborative hypergraph model \(G = (V, E, \omega )\):

$$\begin{aligned} \arg \underset{f}{min}\left\{ \varOmega (f)+\mu R_{emp}(f) \right\} \end{aligned}$$
(28)

where the vector f contains the relevance scores to be learned.

Following the Zhou’ works [44], the regularization term can be written as follows:

$$\begin{aligned} \varOmega (f)=f^{T}(I-\varTheta ) f =f^{T}\left( I-D_{v}^{-1/2}HWD_{e}^{-1}H^{T}D_{v}^{-1/2} \right) f \end{aligned}$$
(29)

The empirical loss \( R_{emp}(f) \) guarantees that the final ranking scores stay close to the initial ones. It is defined as:

$$\begin{aligned} R_{emp}(f)= \Vert f-y \Vert ^{2}=\sum _{v_{i} \in V}(f(v_{i})-y(v_{i}))^{2} \end{aligned}$$
(30)

where the initial ranking vector y is uniformly defined as:

$$\begin{aligned} y_{i}=1-\frac{i}{N} \end{aligned}$$
(31)

By substituting (29) and (30) into (28) and setting the derivative of (28) with respect to f to 0, we have

$$\begin{aligned} (I-\varTheta )f+\mu (f-y)=0 \Rightarrow f=\frac{\mu }{1+\mu }\left( I-\frac{\varTheta }{1+\mu }\right) ^{-1}y \end{aligned}$$
(32)
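Given the incidence matrix H and the initial ranking y of Eq. (31), the closed-form solution of Eq. (32) could be computed as below (a sketch; unit hyperedge weights are assumed by default, since the weighting scheme is not detailed here):

```python
import numpy as np

def hypergraph_rank(H, y, mu=1.0, w=None):
    """Relevance scores of Eq. (32): f = mu/(1+mu) * (I - Theta/(1+mu))^{-1} y."""
    N = H.shape[0]
    w = np.ones(H.shape[1]) if w is None else w          # hyperedge weights (assumed uniform)
    Dv = H @ w                                           # vertex degrees
    De = H.sum(axis=0)                                   # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-12)))
    Theta = Dv_inv_sqrt @ H @ np.diag(w / np.maximum(De, 1e-12)) @ H.T @ Dv_inv_sqrt  # Eq. (29)
    return (mu / (1.0 + mu)) * np.linalg.solve(np.eye(N) - Theta / (1.0 + mu), y)

# Initial ranking of Eq. (31) for N text-based results:
# y = 1.0 - np.arange(1, N + 1) / N
```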
Table 1. Description of databases

4 Experiments

4.1 Experimental Settings

In this section, we conduct visual re-ranking experiments on four public databases designed within the MediaEval 2014 [16] and MediaEval 2016 [18] competitions and listed in Table 1. In particular, the MediaEval 2014 benchmark consists of information for 153 one-concept location queries (e.g., buildings, museums, roads, bridges, sites, monuments, etc.) with about 300 photos per location [16]. The MediaEval 2016 benchmark consists of 135 complex and general-purpose multi-concept queries (e.g., animals at zoo, sunset in the city, accordion player, etc.) [18]. We choose these databases for the following reasons: (1) they consist of real-world images (i.e., images initially retrieved from Flickr in response to a textual query); (2) they are publicly available; and (3) annotations were carried out by experts [17].

We use convolutional neural network (CNN) based descriptors to represent the images of all databases, owing to their impressive performance in image retrieval [43]. In all experiments, we followed the rules of the MediaEval competitions. In particular, a photo is considered relevant if it is a common photo representation of the query [16, 18]. Experiments were carried out for different cut-off points, \(X \in \left\{ 5, 10, 20, 30, 40, 50 \right\} \). For performance evaluation, we adopt the precision P@20, as the official ranking for both the MediaEval 2014 and MediaEval 2016 benchmarks was set to a cut-off of 20 images [16, 18]. For fair comparison, we conducted all experiments on the same platform, i.e., Matlab running on Windows 7, with an Intel(R) Core(TM) i7-4500U 3.40 GHz processor and 8 GB memory. Moreover, we manually tuned the parameters of all other methods to obtain their optimal results.

4.2 Performance Comparison with State-of-the-art Methods

This experiment is conducted to compare our method with the methods that achieved the best performance during the MediaEval competitions; we select only the visual-based ones. Comparison results are reported in Table 2. First, it can be observed that our method achieves a consistent improvement over the Flickr baseline on all databases. For example, at a cut-off point \(X=20\), the precision gains of ACR-HG over Flickr are \(6.67\%\), \(8.29\%\), \(10.07\%\) and \(6.49\%\) on Landmark-30, Landmark-123, General-65 and General-70, respectively. Second, our method almost always outperforms the other methods on all databases. For example, on Landmark-123, the precision of our method is \(P@20=0.8894\), while the other methods achieve 0.769 (TUW) [28], 0.7561 (SocSens) [31] and 0.748 (PeRCeiVe) [29]. On the General-70 database, which is a complex and general-purpose multi-concept database, we achieve \(P@20=0.7921\) compared to \(P@20=0.5437\) achieved by the best team (LAPI) [6]. Our method, which not only models the complex and high-order relationships among visual samples via the hypergraph but also captures the overall contextual information by means of the collaborative representation, achieves the best performance among the compared methods. This clearly demonstrates the validity of our method for visual re-ranking, not only for landmark image retrieval but also for multi-topic image retrieval.

Table 2. Performance comparison to state-of-the-art re-ranking methods.
Table 3. Performance comparison to graph/hypergraph-based methods

4.3 Performance Comparison for Hypergraph Learning

In this experiment, we aim to validate the superiority of our hypergraph model over the conventional graph/hypergraph models. Results are showed in Table 3. From the results, the following observations can be drawn:

  • Despite their ability to refine the initial retrieval results, graph-based re-ranking methods are almost always outperformed by the hypergraph-based ones. This demonstrates that, in contrast to the graph model, the hypergraph model has an inherent ability to capture the local group information and the latent high-order relationships among samples.

  • The experimental results also reveal the good robustness and discriminative power of representation-based hypergraph learning compared to neighborhood-based hypergraph learning. On the different databases, representation-based hypergraph ranking achieves higher precision than hypergraph ranking based on neighborhood relationships. In particular, our method consistently and significantly achieves the best relevance improvement among the representation-based hypergraph ranking methods.

  • The adaptive collaborative representation brings more robustness and discriminative power to the hypergraph than the plain collaborative representation. For instance, the precision gains of ACR-HG over CR-HG are \(1.17\%\), \(1.66\%\), \(3.57\%\) and \(4.22\%\) on Landmark-30, Landmark-123, General-70 and General-65, respectively. One explanation is that the adaptive collaborative representation imposes a locality-preserving regularizer on the representation coefficients, which enables it to capture both the global and local structures of the data during hypergraph learning.

Fig. 1. Evolution curve of relevance for different landmark query topics

4.4 Performance Evaluation per Topic Class

The aim of this experiment is to investigate the performance stability of our method for different query topics. Comparison results are presented in Figs. 1 and 2. We find that our method outperforms Flickr for almost all query topics. The experimental results also reveal that the relevance of retrieval results is higher for landmark queries than for complex queries. One explanation is that non-relevant images are likely to arise when the query is ambiguous or involves multiple topics. For example, the query 'baby in stroller' may give rise to images that contain an empty stroller. Another interesting observation is that the retrieval performance is degraded for some queries (e.g., 'baby in stroller'). This can be attributed to the fact that a high relevance score for a non-relevant image will be propagated to its visually similar neighbors, since only the visual information is used for building the hypergraph.

Fig. 2. Evolution curve of relevance for different general multi-concept query topics

5 Conclusion

In this paper, we proposed a novel hypergraph-based visual re-ranking method to enhance the performance of text-based image retrieval. At the core of our method is the data representation. In particular, we proposed a novel representation technique, called adaptive collaborative representation, to build a more informative hypergraph. By constraining the self-representation term with a weight matrix, the effect of redundant and useless features can be adaptively minimized so that a more robust hypergraph can be constructed. In addition, our data representation technique has the advantage of simultaneously capturing both the global and local structures of the data during hypergraph learning by introducing a locality-preserving term. Based on the obtained representation matrix, we showed how to generate consistent hyperedge connections and hyperedge weights. Finally, transductive learning is performed on the constructed hypergraph to learn the images' relevance scores. Experimental results on public MediaEval benchmarks demonstrate that our method achieves consistently superior results compared to state-of-the-art re-ranking methods.