
1 Introduction

With the rapid development of information technology, large-scale and high-dimensional image data have become widespread on the Internet. A variety of efforts have been made to deal with large-scale similarity search, which has been shown to be useful for many practical applications (e.g. computer vision [3, 25, 37], machine learning [9, 27, 39], and data mining [44]). Hashing techniques [1, 5, 6, 16, 32, 34, 35, 38] are a popular approach for encoding high-dimensional data as low-dimensional binary codes, which benefit from computational and storage efficiency. Learning-based hashing [10, 11, 20, 23, 29, 45], which mines data properties and semantic affinities, shows better performance than data-independent hashing methods [8].

Fig. 1. Illustration of our approach. Unlike most existing learning-based hashing methods (top), which address the differentiation difficulty by continuous relaxation, our method (bottom) formulates the non-smooth part as a stochastic policy, where sampled binary codes are encouraged to earn maximum rewards for similarity preservation. The network is trained directly via policy gradient

Most previous learning-based hashing methods encode data samples with shallow architectures [11, 20, 29], which learn a single projection matrix to map similar samples close to each other in the learned Hamming space. While encouraging performance has been obtained, most of them are limited in capturing non-linear feature representations and suffer from scalability issues. Recently, deep learning based hashing methods [17, 42] have been proposed to learn discriminative feature representations and non-linear hash mappings, and have shown state-of-the-art performance on various scalable image retrieval datasets. However, the binary constraint makes the optimization non-smooth and discrete, which is a challenging problem in these methods and prevents deep hashing from being learned in a truly end-to-end manner. By continuous relaxation, the non-smooth optimization can be transformed into a continuous one that can be solved by standard gradient methods, but this leads to a deviation from the optimal binary codes. While many methods have been proposed to control the quantization error, they still cannot learn exactly binary hash codes within the optimization procedure. Hence, the sub-optimality of the learned binary codes may lead to substantial performance loss.

In this paper, we present a relaxation-free deep hashing method via policy gradient (PGDH) for scalable image search. Figure 1 shows the key idea of our proposed method. Specifically, we formulate the non-smooth part of the hashing network as sampling with a stochastic policy, so that the relaxation procedure used in most previous hashing methods can be removed. We directly generate binary codes and maximize the expectation of rewards for similarity preservation, which yields more effective gradients and binary hash codes, and the differentiation issue of discrete optimization is naturally addressed. Extensive evaluations on three benchmark datasets show that our method significantly improves over the state of the art.

2 Related Work

A variety of learning-based hashing methods have been proposed in recent years, which can be mainly classified into unsupervised hashing and supervised hashing.

Unsupervised hashing methods learn binary codes by exploiting data properties such as distributions and manifold structures. For example, spectral hashing (SH) [40] formulated hash learning as a graph partitioning problem and approximately solved it under the assumption of a uniform data distribution. Anchor graph hashing (AGH) [26] approximated neighborhoods by using a tractable graph-based method. Deep hashing (DH) [21] employed a multi-layer neural network to learn hash functions that preserve the nonlinear relationship of samples. Iterative quantization (ITQ) [9] minimized the quantization loss by seeking a rotation matrix in an iterative manner. Manifold hashing (MH) [31] learned binary embeddings from cluster centers and mapped data into a low-dimensional manifold. Discrete graph hashing (DGH) [24] presented a tractable alternating optimization method for similarity preservation in the discrete code space.

Supervised hashing methods learn binary codes by exploiting the label information of samples and have shown superior performance to unsupervised approaches. For example, kernelized supervised hashing (KSH) [25] utilized the equivalence between code inner products and Hamming distances, aiming to keep the inner products of hash codes consistent with the pairwise supervision. Fast supervised hashing [19] employed boosted decision trees and iteratively performed alternating optimization on a subset of binary codes. Supervised discrete hashing (SDH) [30] formulated the discrete optimization objective by introducing an auxiliary variable and used a kernel-based hashing function to learn binary codes. The supervised extension of deep hashing [21] learned multi-layer functions by considering the label information of samples. Recent advances in deep learning [12, 15, 33] show that deep convolutional networks learn robust and powerful feature representations for complex data, which has led to great success in many computer vision applications. Hence, it is natural to leverage deep learning to obtain compact binary codes. For example, CNNH [42] adopted a two-stage strategy in which the first stage learned hash codes and the second stage learned a deep-network-based hash function to obtain the codes. DNNH [17] improved the two-stage CNNH with a simultaneous feature learning and hash coding pipeline so that representations and hash codes can be optimized in a joint learning procedure. DSH [22] improved DNNH by adding a max-margin loss and a quantization loss that jointly preserve pairwise similarity and control the quantization error. HashNet [2] gradually approximated the non-smooth sign activation with a smoothed activation by a continuation method.

3 Approach

3.1 Overview of General Relaxation Framework

Given a training set of N points (images) \(\varvec{X} = \{\varvec{x}_i\}_{i=1}^{N}\), each sample is represented by either a D-dimensional feature vector or raw pixels. A set of pairwise labels \(\varvec{\mathcal {S}}\) \(= \{s_{ij}\}\) is provided, where \(s_{ij} = 1\) if \(\varvec{x}_{i}\) and \(\varvec{x}_{j}\) are similar while \(s_{ij} = -1\) if \(\varvec{x}_{i}\) and \(\varvec{x}_{j}\) are dissimilar. For supervised hashing, \(\varvec{\mathcal {S}}\) can be constructed from semantic labels of data points or the relevance feedback from click-through data. We aim to learn a mapping function \(f: \varvec{x} \mapsto \varvec{b} \in \{-1,1\}^K\) from the input space to the Hamming space \(\{-1,1\}^K\), where each data point \(\varvec{x}\) is encoded as a compact K-bit binary hash code. The binary codes \(\varvec{B} = \{\varvec{b}_i\}_{i=1}^{N}\) should preserve some notion of similarity in \(\mathcal {S}\). Hence, the hashing learning problem can be generally formulated as follows:

$$\begin{aligned} \min _{f,\varvec{B}} \mathcal {L}(\varvec{B}), \quad \text {s.t.} \quad \varvec{B} \in \{-1,1\}^{N \times K} \end{aligned}$$
(1)

where \(\mathcal {L}\) is the predefined loss function with similarity preservation.

To directly optimize the problem in Eq. (1) with the discrete constraint on \(\varvec{B}\), we need to adopt the sign function \(\varvec{b} = \text {sgn}(\varvec{h})\) as the activation function to convert the continuous representation \(\varvec{h}\) into the binary hash code \(\varvec{b}\). However, the sign function is non-differentiable at zero and has zero gradient for all non-zero inputs, which makes standard back-propagation infeasible. As a result, it is inappropriate to directly solve the discrete optimization problem with standard gradient-based methods. Most existing hashing methods relax this intractable optimization problem in two main ways: (1) continuous relaxation with a quantization penalty, and (2) approximating the sign function with a sigmoid or tanh relaxation [2, 17]. For the first strategy, these methods derive an optimization problem \(\mathcal {L}(\varvec{H})\) from the hashing objective \(\mathcal {L}(\varvec{B})\) by continuous relaxation and control the quantization loss between \(\varvec{B}\) and \(\varvec{H}\), denoted as \(\mathcal {Q}(\varvec{B},\varvec{H})\). The objective of these methods can usually be reformulated as:

$$\begin{aligned} \min _{f,\varvec{H},\varvec{B}} \mathcal {L}(\cdot ) + \mathcal {Q}(\varvec{B},\varvec{H}), \quad \text {s.t.} \quad \varvec{B} \in \{-1,1\}^{N \times K} \end{aligned}$$
(2)

where \(\mathcal {L}(\cdot )\) indicates \(\mathcal {L}(\varvec{H})\) for continuous optimization [18] or \(\mathcal {L}(\varvec{B})\) for discrete optimization [22]. However, since the underlying discrete optimization is NP-hard and \(\mathcal {Q}(\varvec{B},\varvec{H})\) cannot be minimized to zero, there still exists a gap between \(\varvec{B}\) and \(\varvec{H}\). Thus such relaxed optimization problems usually yield only a local minimum.

For the second strategy, the non-smooth sign function is approximated by a continuation method, which gradually converges to the original hash learning objective. However, as the approximation is sharpened to keep the outputs close to binary, the activation inevitably becomes more non-smooth, which slows down or even suppresses convergence and makes the learning model difficult to optimize.
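As a minimal illustration of the underlying differentiation issue, the sign activation propagates zero gradients while a smooth surrogate such as tanh yields usable gradients but non-binary outputs; the snippet below is a sketch of this effect, not part of the proposed method:

```python
import torch

h = torch.randn(4, requires_grad=True)   # continuous network outputs

# Hard binarization: sgn(h) has zero gradient almost everywhere,
# so no learning signal reaches the network parameters.
torch.sign(h).sum().backward()
print(h.grad)                             # tensor of zeros

# Continuous relaxation: tanh provides gradients but non-binary outputs,
# which is the source of the quantization gap discussed above.
h.grad = None
torch.tanh(h).sum().backward()
print(h.grad)                             # non-zero gradients
```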

3.2 Relaxation-Free Deep Hashing via Policy Gradient

In this section, we propose a new architecture for deep learning to hash with policy gradient, inspired by the REINFORCE algorithm [41]. The architecture of our proposed framework contains: (1) a convolutional neural network (CNN) for learning deep representations of images, and (2) a fully-connected policy layer with a sigmoid activation function that transforms each feature representation into a K-dimensional vector, where each dimension represents the probability of taking the corresponding binary action. The proposed end-to-end learning framework can be viewed as an agent that interacts with an external environment (images in our case). The aim of the agent is to achieve the maximum possible similarity preservation, which serves as the reward to the agent.

We define a policy as \(\varvec{\pi }(\varvec{x}_i,\theta )=\{\pi _{\varvec{x}_i,\theta }^{(k)}\}_{k=1:K}\), which is parametrized by the network parameters \(\theta \) for the i-th input \(\varvec{x}_i \). The policy generates a sequence of actions \(\varvec{a}_{i}=\{a_{i,k}\}_{k=1:K} \sim P_{\theta }(\varvec{x}_{i})\), where \(a_{i,k} \in \{0,1\}\) is a binary action value. \(\pi _{\varvec{x}_i,\theta }^{(k)}\) outputs only the probability of the hash code being \(+1\), which differs from most existing reinforcement learning methods that predict a probability distribution over all possible actions (e.g. a softmax probability). Hence, the probability distribution in our method can be formulated as follows:

$$\begin{aligned} p(a_{i,k})= {\left\{ \begin{array}{ll} \pi _{\varvec{x}_i,\theta }^{(k)},&{} \text {if } a_{i,k} = 1\\ 1-\pi _{\varvec{x}_i,\theta }^{(k)},&{} \text {if } a_{i,k} = 0\\ \end{array}\right. } \end{aligned}$$
(3)
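As a minimal PyTorch sketch of Eq. (3), each bit is treated as an independent Bernoulli variable whose success probability is the sigmoid output of the policy layer; the tensor shapes and variable names below are illustrative assumptions:

```python
import torch

pi = torch.sigmoid(torch.randn(8, 64))     # stand-in for policy outputs pi^(k), shape (m, K)

a = torch.bernoulli(pi)                     # sampled actions a_{i,k} in {0, 1}
p = torch.where(a == 1, pi, 1 - pi)         # per-bit probabilities p(a_{i,k}) of Eq. (3)
log_prob = torch.log(p).sum(dim=1)          # log-probability of each K-bit action sequence
b = 2 * (a - 0.5)                           # corresponding hash codes in {-1, +1}
```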

Having generated action \(\varvec{a}_{i}\), the agent observes a reward \(r(\varvec{a}_i)\) that is related to the similarity preservation. The reward is computed by an evaluation metric by comparing the similarity relationship in the Hamming space with ground-truth similarity function \(\mathcal {S}\).

We adopt a minibatch-based strategy for learning and sample a minibatch of points from the whole training set in each iteration. For each mini-batch with m training samples, we aim to utilize the global information by maximizing the preserved similarity between each binary code \(\varvec{b}_i = 2(\varvec{a}_i-0.5)\) and the codebook \(\varvec{C} = \{\varvec{\hat{b}}_j\}_{j=1}^N\) of all the training points in the Hamming space. For a pair of binary codes \(\varvec{b}_i \) and \(\varvec{\hat{b}}_j\), the Hamming distance \(dist_{H}(\cdot ,\cdot )\) can be expressed via the inner product \(\langle \cdot ,\cdot \rangle \) as \(dist_H(\varvec{b}_i ,\varvec{\hat{b}}_j) = \frac{1}{2}(K - \langle \varvec{b}_i ,\varvec{\hat{b}}_j\rangle )\). The weighted reward for learning effective hash codes can be written as follows:

$$\begin{aligned} r(\varvec{a}_i) = -\frac{1}{2}\sum _{j=1}^{N} \hat{s}_{ij} (K - \varvec{b}_i^T \varvec{\hat{b}}_j) \nonumber \\ \text {s.t.} \quad \varvec{b}_i ,\varvec{\hat{b}}_j \in \{-1, +1\}^K \end{aligned}$$
(4)

where

$$\begin{aligned} \hat{s}_{ij}= {\left\{ \begin{array}{ll} \beta ,&{} \text {if } s_{ij} = 1\\ \beta - 1, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

is the weighted similarity measurement that compensates for the imbalance between positive and negative pairs. The parameter \(\beta \) allows different weights on the positive and negative pairs. Note that the codebook \(\varvec{C}\) is updated more slowly than the learning model \(\theta \) during training, which will be discussed later.
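A possible vectorized implementation of the reward in Eqs. (4)-(5) is sketched below; the function name and tensor layout are our assumptions rather than the authors' code:

```python
import torch

def weighted_reward(b, codebook, s, beta):
    """Weighted reward of Eq. (4) with the similarity weighting of Eq. (5).

    b        : (m, K) sampled codes of the minibatch, entries in {-1, +1}
    codebook : (N, K) codes of all training points, entries in {-1, +1}
    s        : (m, N) pairwise labels, 1 for similar pairs, -1 for dissimilar
    beta     : weight placed on positive pairs
    """
    s = s.float()
    K = b.size(1)
    s_hat = torch.where(s == 1,
                        torch.full_like(s, beta),
                        torch.full_like(s, beta - 1.0))   # Eq. (5)
    inner = b @ codebook.t()                              # (m, N) inner products
    return -0.5 * (s_hat * (K - inner)).sum(dim=1)        # one reward per sample, Eq. (4)
```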

The goal of training is to minimize the negative expected reward of the minibatch:

$$\begin{aligned} \mathcal {L}(\theta ) = -\sum _i\mathbb {E}_{\varvec{a}_i \sim P_{\theta }(\varvec{x}_i)}[r(\varvec{a}_i)] \end{aligned}$$
(6)

Note that in our framework the environment consists of images and is not determined by previous states or actions. Strictly speaking, this formulation is therefore not a full reinforcement learning setting in which state transitions are clearly defined. Here we only focus on optimizing the hash learning model under the guidance of rewards related to similarity preservation.

Policy Gradient with REINFORCE: In our proposed hash learning method, the reward r is non-differentiable with respect to the network parameters. In order to compute \(\nabla _{\theta }\mathcal {L}(\theta )\), we use the REINFORCE algorithm, which computes the expected gradient of the non-differentiable reward function as follows:

$$\begin{aligned} \nabla _{\theta }\mathcal {L}(\theta ) = -\sum _i \mathbb {E}_{\varvec{a}_i \sim P_{\theta }(\varvec{a}_i|\varvec{x}_i)}[r(\varvec{a}_i)\nabla _{\theta } \log (P_{\theta }(\varvec{a}_i|\varvec{x}_i))] \end{aligned}$$
(7)

The expected gradient can be approximated using Monte Carlo sampling. We denote the set of T Monte Carlo samples drawn for \(\varvec{a}_i\) as:

$$\begin{aligned} \mathcal {A}_{i} = \{\varvec{a}_i^1,\varvec{a}_i^2,...,\varvec{a}_i^T\} = MC^{P_{\theta }(\varvec{a}_i|\varvec{x}_i)}(T) \end{aligned}$$
(8)

For training examples in a minibatch, the expected policy gradient can be computed as:

$$\begin{aligned} \nabla _{\theta }\mathcal {L}(\theta ) \approx -\frac{1}{T}\sum _i \sum _t[r(\varvec{a}_i^t)\nabla _{\theta } \log (P_{\theta }(\varvec{a}_i^t|\varvec{x}_i))] \end{aligned}$$
(9)

where the log probability in Eq. (9) can be calculated as the negative binary cross-entropy over the Bernoulli distribution in Eq. (3).

REINFORCE with a Baseline: The above gradient estimator is simple but suffers from high variance because of the difficulty of credit assignment. To reduce the variance of the gradient estimate, we again approximate the expected gradient with the widely used baseline method in policy gradient. For each training minibatch:

$$\begin{aligned} \nabla _{\theta }\mathcal {L}(\theta ) \approx -\frac{1}{T}\sum _i \sum _t[(r(\varvec{a}_i^t)-r')\nabla _{\theta } \log (P_{\theta }(\varvec{a}_i^t|\varvec{x}_i))] \end{aligned}$$
(10)

where the baseline \(r'\) should be a value that is independent of the action. Adding such a baseline term does not change the expectation of the gradient but reduces the variance of the gradient estimate. Here we choose the average of all rewards in each mini-batch as the baseline. The binary codes that preserve more similarity information with respect to the codebook \(\varvec{C}\) than the baseline receive positive rewards, while those that preserve less similarity information are penalized by negative rewards. We then update the network parameters as:

$$\begin{aligned} \theta \leftarrow \theta - \lambda \nabla _{\theta }\mathcal {L}(\theta ) \end{aligned}$$
(11)

where \(\lambda \) denotes the learning rate.
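Under these definitions, Eqs. (9)-(11) can be realized as a surrogate loss whose automatic gradient equals the baseline-corrected policy gradient up to a constant factor absorbed into the learning rate. The sketch below uses the binary cross-entropy over Eq. (3) to obtain the log-probabilities; the names and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def pg_loss(pi, actions, rewards):
    """Surrogate loss whose gradient follows Eq. (10), up to a constant factor.

    pi      : (m, K) Bernoulli probabilities from the policy layer
    actions : (T, m, K) Monte Carlo samples of the actions, entries in {0, 1}
    rewards : (T, m) rewards r(a_i^t) of the sampled codes
    """
    baseline = rewards.mean()                      # mini-batch average reward r'
    advantage = (rewards - baseline).detach()      # centred rewards, no gradient flows here

    # log P_theta(a | x) = negative binary cross-entropy over the Bernoulli of Eq. (3)
    log_prob = -F.binary_cross_entropy(
        pi.unsqueeze(0).expand_as(actions), actions, reduction='none'
    ).sum(dim=2)                                   # (T, m)

    # Negative expected reward; minimizing this with a gradient step realizes Eq. (11)
    return -(advantage * log_prob).mean()
```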

During the learning process, the codebook \(\varvec{C}\) is updated more slowly than the model for training stability and performance. The codebook update can be formulated as:

$$\begin{aligned} \hat{\varvec{b}}_{j}= 2(\hat{\varvec{a}}_j-0.5), \quad \hat{\varvec{a}}_j \sim P_{\theta ^{-}}(\varvec{x}_{j}) \end{aligned}$$
(12)

This strategy is motivated by [28], which introduces a target network \(\theta ^{-}\) with a slower update rate than the online network \(\theta \) to achieve more stable performance.

In summary, Algorithm 1 shows full details of the proposed method.

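The overall training procedure can be sketched as follows in PyTorch-style pseudocode; the identifiers and the exact codebook refresh schedule under the target parameters \(\theta ^{-}\) are illustrative assumptions rather than the authors' implementation:

```python
import copy
import torch

def train_pgdh(model, loader, optimizer, codebook, sim, beta=0.6, T=10, R=5, epochs=50):
    """Hedged sketch of the PGDH training loop (cf. Algorithm 1); names are illustrative.

    model    : CNN backbone + sigmoid policy layer producing pi (Eq. 3)
    codebook : (N, K) current codes of all training points, entries in {-1, +1}
    sim      : (N, N) pairwise labels, 1 for similar pairs, -1 for dissimilar
    """
    target = copy.deepcopy(model)                    # slowly updated parameters theta^-
    step = 0
    for _ in range(epochs):
        for images, idx in loader:                   # idx: indices into the training set
            pi = model(images)                       # (m, K) Bernoulli probabilities
            actions = torch.stack([torch.bernoulli(pi) for _ in range(T)])  # (T, m, K)
            codes = 2 * (actions - 0.5)              # map {0, 1} -> {-1, +1}

            rewards = torch.stack([                  # reward of each MC sample, Eqs. (4)-(5)
                weighted_reward(codes[t], codebook, sim[idx], beta) for t in range(T)
            ])                                       # (T, m)

            loss = pg_loss(pi, actions, rewards)     # Eqs. (9)-(10)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            step += 1
            if step % R == 0:                        # delayed codebook update, Eq. (12)
                target.load_state_dict(model.state_dict())
                with torch.no_grad():                # refresh the codes of the current batch
                    codebook[idx] = 2 * (torch.bernoulli(target(images)) - 0.5)
    return model, codebook
```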

3.3 Out-of-Sample Extensions

Having completed the learning procedure, we have only generated optimized hash codes for the training points by maximizing the expectation of rewards. It remains to be specified how to perform out-of-sample extension, i.e., how to generate hash codes for points that are not in the training set. To address this, we perform the out-of-sample extension in two ways: Deterministic and Stochastic.

Deterministic Generation: Denoting a data point that is not in the training set as \(\varvec{x}_q\), we feed it into our proposed architecture and obtain a vector of K values \(\pi _{\varvec{x}_q,\theta }\), each of which represents the probability of the binary action 1 (the sigmoid activation ranges from 0 to 1). We can directly obtain the binary code in the deterministic way:

$$\begin{aligned} b_q^{k} = {\left\{ \begin{array}{ll} +1,&{} \text {if } \pi _{\varvec{x}_q,\theta }^{(k)} > 0.5\\ -1, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(13)

Stochastic Generation: Having obtained the probability vector, we can write the stochastic code generation function as:

$$\begin{aligned} b_q^k = {\left\{ \begin{array}{ll} +1,&{} \text {with probability} \quad \pi _{\varvec{x}_q,\theta }^{(k)}\\ -1, &{} \text {with probability} \quad 1 - \pi _{\varvec{x}_q,\theta }^{(k)} \end{array}\right. } \end{aligned}$$
(14)

The stochastic way seems more appealing than the deterministic one, but in practice the performance differs only slightly once the learning model converges. In our experiments, we report the performance using deterministic generation, and we also investigate both ways of generating hash codes.
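Both generation schemes of Eqs. (13)-(14) reduce to a few lines; the sketch below assumes a `model` that outputs the probability vector:

```python
import torch

def encode(model, images, stochastic=False):
    """Out-of-sample hash codes from the learned policy, Eqs. (13)-(14)."""
    with torch.no_grad():
        pi = model(images)                   # (m, K) bit-wise probabilities
        if stochastic:
            a = torch.bernoulli(pi)          # sample each bit, Eq. (14)
        else:
            a = (pi > 0.5).float()           # threshold at 0.5, Eq. (13)
    return 2 * (a - 0.5)                     # codes in {-1, +1}
```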

4 Experiments

4.1 Datasets and Experimental Settings

We conduct extensive empirical evaluation on three widely used public benchmark datasets: CIFAR-10 [14], NUS-WIDE [43] and ImageNet [4]. CIFAR-10 contains 60,000 manually single-labeled color images belonging to 10 classes (6,000 images per class). Following the same setting as in [36], we construct the query set by randomly sampling 1,000 images with 100 images per category and use the remaining 59,000 images to form the database. We then uniformly select 500 images per class from the database to form the training set. NUS-WIDE is a public Web image dataset of 269,648 images collected from Flickr. This is a multi-label dataset; each image is associated with one or multiple labels from 81 given concepts. We follow the settings in [42, 46] and use the subset of 195,834 images that are associated with the 21 most frequent concepts, where each concept covers at least 5,000 images. We randomly sample 2,100 images with 100 images per category to form the test set and use the remaining images as the database. We uniformly sample 500 images per category from the database to form the training set. ImageNet is a large dataset for visual recognition which contains over 1.2M images in the training set and 50K images in the validation set, covering 1,000 categories. Following the same setting as in [2], we randomly select 100 categories, use all the images of these categories in the training set as the database and all the images in the validation set as the queries. To train the hashing methods, we randomly select 100 images per category from the database as the training points.

Following the same evaluation protocol as previous work [22], the similarity information, which is constructed from image labels, is used both as the ground truth for evaluation and for constructing the pairwise similarity matrix for training. For both single-label and multi-label datasets, we define the ground-truth semantic neighbors as images sharing at least one label. Note that by constructing the training data in this way, all three datasets exhibit the data imbalance problem caused by the imbalance of positive and negative pairs, which can be used to evaluate the effect of the weighted rewards controlled by \(\beta \).

We evaluate the retrieval performance of generated binary codes with the following metrics: mean average precision (MAP), precision-recall (P-R) curve, precision at top retrieved samples (P@N), and Hamming lookup precision within a Hamming radius \(r=2\) (HLP@2). We choose to evaluate the performance over binary codes with lengths of 16, 32, 48, and 64 bits. Note that for the ImageNet dataset we calculate the MAP@1000 as each category has only 1,300 images, and for NUS-WIDE we adopt MAP@5000.
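For reference, MAP over Hamming ranking can be computed as in the generic sketch below; this is our own illustrative helper, not the evaluation script used in the experiments, and the other metrics are derived from the same Hamming distances:

```python
import torch

def mean_average_precision(query_codes, db_codes, relevance, topk=None):
    """MAP over Hamming ranking; a generic sketch with illustrative names.

    query_codes : (q, K) query codes in {-1, +1}
    db_codes    : (n, K) database codes in {-1, +1}
    relevance   : (q, n) 1 if a database item shares a label with the query, else 0
    topk        : e.g. 1000 for ImageNet (MAP@1000) or 5000 for NUS-WIDE (MAP@5000)
    """
    K = query_codes.size(1)
    dist = 0.5 * (K - query_codes @ db_codes.t())        # Hamming distances
    order = dist.argsort(dim=1)                          # ranking, nearest first
    if topk is not None:
        order = order[:, :topk]
    rel = torch.gather(relevance, 1, order).float()      # relevance in ranked order
    cum = rel.cumsum(dim=1)
    ranks = torch.arange(1, rel.size(1) + 1, dtype=torch.float)
    ap = (cum / ranks * rel).sum(dim=1) / rel.sum(dim=1).clamp(min=1)
    return ap.mean().item()
```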

In our implementation of PGDH, we adopt the AlexNet network structure and implement it in the PyTorch framework. We initialize the first seven layers of PGDH by copying the parameters of convolutional layers \(conv1-conv5\) and fully-connected layers \(fc6-fc7\) from the model pre-trained on ImageNet and fine-tune these layers. We initialize the final policy layer with a Gaussian distribution and train it from scratch. In the training phase, we use Adam [13] with an initial learning rate of 0.005 and a batch size of 128. For parameter tuning, we evenly split the training set into ten parts to cross-validate the parameters. We fix the number of Monte Carlo samples T to 10 in each iteration and the codebook update interval R to 5.
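A sketch of how such a network can be assembled in PyTorch is given below; the use of torchvision's pre-trained AlexNet weights and the standard deviation of the Gaussian initialization are our assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class PGDHNet(nn.Module):
    """AlexNet backbone (conv1-conv5, fc6-fc7) plus a sigmoid policy layer (a sketch)."""
    def __init__(self, K=64):
        super().__init__()
        alexnet = models.alexnet(pretrained=True)          # ImageNet pre-trained weights
        self.features = alexnet.features                   # conv1-conv5
        self.avgpool = alexnet.avgpool
        self.fc67 = nn.Sequential(*list(alexnet.classifier.children())[:-1])  # fc6-fc7
        self.policy = nn.Linear(4096, K)                    # final policy layer, from scratch
        nn.init.normal_(self.policy.weight, std=0.01)       # Gaussian initialization (assumed std)
        nn.init.zeros_(self.policy.bias)

    def forward(self, x):
        x = self.avgpool(self.features(x)).flatten(1)
        return torch.sigmoid(self.policy(self.fc67(x)))     # K bit-wise probabilities

model = PGDHNet(K=64)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
```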

4.2 Results and Analysis

Comparison with State-of-the-Art Methods: We compare the proposed PGDH with twelve state-of-the-art hashing methods, including unsupervised methods: LSH [8], SH [40], ITQ [9]; supervised methods: KSH [25], CCA-ITQ [9], FastH [19], SDH [30]; and supervised deep methods: CNNH [42], DNNH [17], DPSH [18], DSH [22], HashNet [2]. We report their results by running the source code provided by the respective authors and training the models ourselves, except for DNNH, whose source code is not available. For conventional hashing methods, we use \(DeCAF_7\) [7] features as input. For deep hashing methods, we directly use raw images as input and resize them to fit the adopted network. Note that we adopt the AlexNet architecture for all deep hashing methods for a fair comparison.

Table 1. Comparison of the retrieval performance of all compared hashing methods in terms of mean average precision (MAP, %) on the three image datasets for code lengths of 16, 32, 48, and 64 bits
Fig. 2. The experimental results of PGDH and comparison methods on the CIFAR-10 dataset under three evaluation metrics

Table 1 shows the overall retrieval performance of the different hashing methods in terms of MAP at different code lengths. We can observe that our proposed PGDH outperforms all compared methods. Compared with the best deep learning based competitor, PGDH consistently improves the MAP by around 3%. The significant performance improvement is attributed to the effective binary codes obtained via policy gradient instead of the general relaxation framework. Note that our PGDH also utilizes the weighted reward function to address the data imbalance problem, which is ignored by many existing methods. We also observe that the recently proposed HashNet outperforms the other deep learning methods (e.g. DSH and DPSH) because HashNet tackles the optimization difficulty with a continuation method and the data imbalance problem with weighted maximum likelihood. Compared with the best conventional hashing methods, PGDH also boosts the performance by a large margin. Note that the deep hashing methods consistently outperform the conventional hash learning methods on all three datasets by a large margin, even though the conventional ones utilize CNN features, which suggests that the end-to-end learning scheme is advantageous.

Fig. 3. The experimental results of PGDH and comparison methods on the NUS-WIDE dataset under three evaluation metrics

Fig. 4. The experimental results of PGDH and comparison methods on the ImageNet dataset under three evaluation metrics

The performance on the CIFAR-10, NUS-WIDE and ImageNet datasets in terms of precision-recall (PR) curves for 64-bit binary codes is shown in Figs. 2(a), 3(a) and 4(a). Here we only show PR curves for the deep learning based hashing methods to evaluate the effectiveness of hash learning. The results show that PGDH outperforms all the compared methods by large margins. PGDH achieves much higher precision at the same recall level than the compared methods, which suggests that effective hash codes are learnt via policy gradient. This property is desirable in practical precision-first image retrieval systems, where a high probability of finding true neighbors is more important.

The performance on the three datasets in terms of the average precision with respect to different numbers of top retrieved results (P@N) of the deep learning methods for 64-bit binary codes is shown in Figs. 2(b), 3(b) and 4(b). Note that the maximum of N is set to 1,000 here for consistency across all three datasets. From the figures, we can see that PGDH consistently provides higher precision than the compared hashing methods for the same number of retrieved samples. This indicates that more semantic neighbors are retrieved, which is desirable in practical use.

The performance in terms of Hamming lookup precision within Hamming radius 2 (HLP@2) for the deep learning based hashing methods at different bit lengths on the three datasets is shown in Figs. 2(c), 3(c) and 4(c). This evaluation metric measures the precision of the retrieved results falling into the buckets within Hamming radius 2. The results validate the compactness of the binary codes learnt by PGDH. We also observe that the best performance is achieved at a moderate code length. This is because longer binary codes make the data distribution in the Hamming space sparser, so that fewer samples fall within the given Hamming ball.

Fig. 5. Effects of the number of Monte Carlo samples in terms of MAP with 16, 32, 48 and 64-bit binary codes on the CIFAR-10 dataset

Fig. 6. Effects of the frequency of codebook update by setting R as 1, 5 and 40 in terms of MAP with 64-bit binary codes on CIFAR-10 (Color figure online)

Investigation on Samples: We study the effect of the number of Monte Carlo samples in the optimization procedure by changing the parameter T in PGDH. Note that training a minibatch costs more time as T increases. We report the results for different values of T selected from \(\{2, 5, 8, 10, 12, 15, 20\}\) in Fig. 5 in terms of MAP on the CIFAR-10 dataset. The results show that when T is small, the search quality degrades because reliable gradient estimates cannot be obtained without enough Monte Carlo samples. We also observe that the performance saturates as T is enlarged further. As a trade-off between search quality and training efficiency, we fix T as 10 during training.

Investigation on Codebook Update: We study the effect of the frequency of the codebook update during training by changing the interval parameter R in PGDH. Figure 6 shows the MAP evolution over the first 60 epochs of training with respect to R on the CIFAR-10 dataset with the code length set to 64 bits. The network is hard to optimize and MAP stays very low during training (red curve) when we update the codebook \(\varvec{C}\) every iteration (\(R=1\)). When we update the codebook \(\varvec{C}\) once per epoch (\(R = 40\)), the network can be trained stably but MAP rises very slowly (green curve). The best performance (blue curve) is achieved at a moderate value of \(R = 5\).

Fig. 7. Loss of search quality in MAP (red bars) due to conversion from continuous features to 64-bit binary codes on the CIFAR-10 dataset (Color figure online)

Fig. 8. Time cost to encode one new image for different hashing methods on the CIFAR-10 dataset with 64-bit binary codes

Table 2. MAP (%) for different code generation schemes related to different training epochs on the CIFAR-10 dataset with 64-bit binary codes
Table 3. MAP (%) for different \(\beta \) on the three datasets with 64-bit binary codes

Deterministic vs. Stochastic: We investigate the deterministic and stochastic generation schemes during the testing phase. Table 2 shows the MAP performance of the 64-bit codes generated in these two ways at different epochs on the CIFAR-10 dataset. We can observe that the performance differs considerably during the early epochs. This is because the stochastic way generates binary codes by sampling, which affects the results if the model has not yet converged. We also observe that the MAP values differ only slightly once the learning model converges as training proceeds. Although the stochastic way seems more appealing in PGDH, it takes more time for code generation during testing because of the sampling operation.

Investigation on Weighted Rewards: We investigate the effect of the weighted rewards on dealing with the imbalance problem. The weight is controlled by the parameter \(\beta \) in Eq. (5). When \(\beta \) approaches 1, the algorithm relies almost entirely on the positive pairs to learn hash codes; when \(\beta \) is close to 0, it relies almost entirely on the negative pairs. With the adopted definition of semantic similarity and the construction of the datasets, the imbalance problem substantially deteriorates the performance of hashing methods. Table 3 shows the variation of MAP with respect to \(\beta \) on the three datasets with the code length set to 64 bits. The retrieval performance improves when \(\beta > 0.5\), which shows the effect of introducing weighted rewards in our method.

Comparison of Search Quality Degradation: A crucial advantage of PGDH over the compared methods is that PGDH directly learns effective compact binary codes via policy gradient, while the compared methods relax the discrete objective to accommodate gradient-based algorithms. Intuitively, searching with binary codes using the Hamming distance is inferior to searching with continuous features using the Euclidean distance, due to the information loss caused by binarization. The search quality loss in terms of MAP due to binarization is shown in Fig. 7. Note that since PGDH directly outputs binary codes, we only show the absolute MAP value for PGDH. From the figure, we see that DNNH (9% degradation), DPSH (3.85% degradation), DSH (1.56% degradation) and HashNet (0.9% degradation) suffer from MAP loss, while our PGDH even surpasses the search quality that the compared methods achieve with their continuous features. In other words, PGDH learns more effective binary codes that are more accurate than those of all other methods.

Comparison of Encoding Time: The time needed to generate the binary code for a new sample is an important factor when evaluating a retrieval system in practical use. In this part, we compare the encoding time of our PGDH with (1) five deep learning based hashing methods, CNNH, DNNH, DPSH, DSH and HashNet, and (2) three conventional hashing methods, ITQ, CCA-ITQ and SDH, covering unsupervised and supervised hashing with linear and non-linear hash functions. For the deep hashing methods, which directly take raw images as input, we report the encoding time on GPUs. For the conventional hashing methods, we take into consideration both the time cost of deep feature extraction on GPUs and the time cost of hash encoding on CPUs. Figure 8 shows the comparison of the encoding time of the involved hashing methods on a logarithmic scale on the CIFAR-10 dataset with 64-bit binary codes. Our computing platform is equipped with a 4.0 GHz Intel CPU, 32 GB RAM, and an NVIDIA GTX 1080Ti. Although HashNet and DSH are faster than our PGDH because of the higher computational efficiency of their Caffe implementations, we can easily convert the trained PyTorch model into a Caffe version in the test phase to accelerate encoding while keeping the retrieval performance.

5 Conclusion

In this paper, we have proposed a new relaxation-free framework for deep hashing via policy gradient. We modified the non-smooth part of the hashing network into sampling from a stochastic policy to address the back-propagation difficulty. We directly generated binary codes through the network and maximized the expectation of the rewards related to similarity preservation. We trained the proposed network via policy gradient, which naturally avoids the differentiation difficulty of discrete optimization and leads to more effective binary codes. We have conducted extensive experiments that validate the superiority of the proposed PGDH over state-of-the-art hashing methods.