1 Introduction

The focus of this work is multi-codebook quantization (MCQ), an approach to vector compression analogous to k-means clustering, where cluster centres arise from the combinatorial combination of entries in multiple codebooks. Modern systems for very large-scale approximate nearest neighbour (ANN) search typically rely on a data structure that shortlists candidates, followed by search using the compressed representation obtained from a variant of MCQ [5, 16, 17, 32].

Systems for efficient large-scale search in high-dimensional spaces have important applications to prominent problems in machine learning and computer vision. For example, Mussman et al. [24] use Gumbel variables to randomly perturb nearest neighbour queries and accelerate learning and inference in log-linear models. Douze et al. [11] use a large-scale similarity graph constructed via MCQ to improve learning in deep “low-shot” models. Guo et al. [14] use an MCQ-based system to achieve state-of-the-art performance in maximum inner product search (MIPS), and accelerate large-scale recommender systems. Finally, Blalock and Guttag [7] use MCQ to reduce memory usage and accelerate large-scale data mining applications.

MCQ is an optimization problem over two latent variables that approximate a given dataset: the codebooks and the codes (i.e., the assignments of the data to those codebooks). The error of this approximation provides a bound for Euclidean distance and dot-product approximations in ANN and MIPS. Therefore, finding optimization methods that achieve low-error solutions is crucial for improving the performance of MCQ applications.

In similarity search, our goal is often to tackle very large datasets, so it is important that the optimization techniques scale gracefully – consider, for example, the case where one wants to index one billion vectors using the classical inverted file (IVF) [17]. An IVF partitions the dataset into K disjoint cells and learns a quantizer for each subset. Typically, \(K \in \{2^{12}, 2^{13}\}\), so one has to run the training method 4 096–8 192 times. If a method takes one hour to run, then one has to wait roughly 6–12 months for training to complete. On the other hand, if a method has a running time of one minute, then the total wait time is reduced to roughly 3–6 days. Unfortunately, running time is often not reported in recent work on MCQ. Here, we focus on characterizing recent MCQ methods in terms of their running time vs accuracy trade-off, and introduce novel improvements in both speed and accuracy to LSQ, a state-of-the-art MCQ method.

Problem formulation. MCQ is the task of finding a set of codes B and (multiple) codebooks C that minimize quantization error on a given dataset X. Our objective is to determine

$$\begin{aligned} \min _{C,B} ||X-CB ||_F^2, \end{aligned}$$
(1)

where \(X \in \mathbb {R}^{d \times n}\) contains n d-dimensional vectors, and \(C = [C_1, C_2, \dots , C_m] \in \mathbb {R}^{d \times mh}\) is composed of m subcodebooks \(C_i \in \mathbb {R}^{d \times h}\), with d dimensions and h entries each. Finally, \(B = [\mathbf {b}_1, \mathbf {b}_2, \dots , \mathbf {b}_n] \in \{0,1\}^{mh \times n}\) contains n binary codes, each with m entries \(\mathbf {b}_{i,j} \in \{0,1\}^h\) that select one entry from a different codebook \(\mathbf {b}_i = [\mathbf {b}_{i,1}, \mathbf {b}_{i,2}, \dots , \mathbf {b}_{i,m}]^\top \); in other words, \(||\mathbf {b}_{i,j} ||_0 = 1\) and \(||\mathbf {b}_{i,j} ||_1 = 1\).
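To make this layout concrete, the following minimal sketch (ours, not the paper's implementation; names are hypothetical) reconstructs database vectors from m subcodebooks and codes stored compactly as an \(m \times n\) matrix of indices, which is equivalent to the product CB with the binary B defined above.

```julia
# Minimal sketch (ours, for illustration): reconstruct database vectors from
# m subcodebooks and codes stored as an m × n matrix of indices in 1:h.
# This is equivalent to computing CB with the binary B defined above.
function reconstruct(C::Vector{Matrix{Float64}},   # m subcodebooks, each d × h
                     codes::Matrix{<:Integer})     # m × n code indices
    m, n = size(codes)
    d = size(C[1], 1)
    Xhat = zeros(d, n)
    for i in 1:n, j in 1:m
        Xhat[:, i] .+= C[j][:, codes[j, i]]        # x̂ᵢ = Σⱼ Cⱼ b_{i,j}
    end
    return Xhat
end
```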

MCQ is useful for large-scale ANN search because, in this representation, the Euclidean distance between a query vector \(\mathbf {q}\) and a compressed database vector \(\mathbf {x}_i \approx \hat{\mathbf {x}}_i = \sum _{j=1}^m C_j\mathbf {b}_{i,j}\), can be computed using the expansion

$$\begin{aligned} ||\mathbf {q} - \hat{\mathbf {x}}_i ||_2^2&= ||\mathbf {q} ||_2^2 - 2 \cdot \sum _{j=1}^m \langle \mathbf {q}, C_j\mathbf {b}_{i,j} \rangle + ||\hat{\mathbf {x}}_i ||_2^2. \end{aligned}$$
(2)

When searching for nearest neighbours, the first term can be ignored, as it is constant for all database vectors; the second term can be computed with m lookups in precomputed dot-product tables, and it is typical to use one extra codebook to quantize the third (scalar) term. Note that, when MCQ is used to approximate dot-products (e.g., in MIPS) or convolutions, it is not necessary to store the norm of the encoded vector, and the full memory budget can be used to improve the quality of the approximation.
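As an illustration of this query procedure, the sketch below (again ours; names and signatures are hypothetical) computes approximate distances with m precomputed dot-product tables, assuming the squared norms of the encoded vectors are stored separately.

```julia
# Sketch of asymmetric distance computation following Eq. 2 (names are ours).
function adc_distances(q::Vector{Float64},
                       C::Vector{Matrix{Float64}},  # m subcodebooks, d × h each
                       codes::Matrix{<:Integer},    # m × n code indices
                       sqnorms::Vector{Float64})    # ‖x̂ᵢ‖² per database vector
    m, n = size(codes)
    tables = [C[j]' * q for j in 1:m]    # m dot-product tables, h entries each
    dists = copy(sqnorms)                # third term of Eq. 2
    for i in 1:n, j in 1:m
        dists[i] -= 2 * tables[j][codes[j, i]]   # second term: m table lookups
    end
    return dists                         # ‖q‖² omitted; constant for all vectors
end
```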

We typically set \(h=256\) [3, 12, 17, 21, 27, 34], which means that each index into a codebook can be stored using 8 bits. Thus, if we use \(m=\{7, 15\}\) codebooks, and set aside an extra table for storing the norm of the approximation with \(h=256\) entries as well, the memory used per vector is only 64 (resp. 128) bits.

2 Related Work

Early work in MCQ adopted orthogonal codebooks [12, 17, 26], which considerably simplifies the problem and leads to very scalable solutions, at the expense of accuracy. More recent work has focused on non-orthogonal codebooks, which improve accuracy but incur higher computational costs, deterring their wider adoption. For example, the recently released FAISS library [18] implements only orthogonal MCQ techniques. In this work, we aim to better characterize and understand recent work in MCQ, with the goal of accelerating and improving the performance of non-orthogonal MCQ techniques.

Non-orthogonal MCQ. Chen et al. [9] introduced non-orthogonal codebooks for MCQ and proposed residual vector quantization (RVQ), a greedy optimization method that runs k-means on each codebook in a sequential manner. Later, Ai et al. [1] and Martinez et al. [22] independently proposed enhanced RVQ (ERVQ) and stacked quantizers (SQ), respectively: a refinement of RVQ that obtains lower quantization error while maintaining the same encoding complexity.

Babenko and Lempitsky [3] proposed additive quantization (AQ), which uses an expectation-maximization (EM)-like approach for optimization. The authors used beam search for updating the codes, and a conjugate gradient method for the codebook update step. Although its authors were unaware of RVQ, this paper has proven influential due to its insights into, and proper characterization of, the hard combinatorial problems that arise in non-orthogonal MCQ. Later, Martinez et al. [21] introduced local search quantization (LSQ), an encoding method based on iterated local search (ILS) [19] with iterated conditional modes (ICM), which improves upon the accuracy vs computation trade-off of the beam search method of AQ, leading to overall higher recall. Initialization consists of OPQ followed by a simpler version of optimized tree quantization (OTQ) [4].

Zhang et al. [34] proposed composite quantization (CQ), which minimizes quantization error but also penalizes the deviation of cross-codebook terms from an (also latent) constant. The method is also EM-like, and the authors use ICM for the encoding step, and the L-BFGS [25] solver for the codebook update step. Initialization consists of PQ followed by unconstrained MCQ (Expression 1).

Finally, Ozan et al. [27] introduced competitive quantization (CompQ), a method that updates the codebooks with stochastic gradient descent (SGD), and updates the codes using beam search within a search space whose size is controlled by a hyperparameter that trades off accuracy and computation.

3 Comparative Performance Evaluation

While recent work has used different experimental setups, fortunately all studies have reported results on the SIFT1M dataset at 64 bits. Thus, we first focus on comparing the three methods that report the best results on this dataset: CQ [34] (R@1 of 0.290), LSQ [21] (R@1 of 0.298) and CompQ [27] (R@1 of 0.352). We measure all our timings on a desktop with a 4-core (8-thread) Intel i7-7700K CPU @ 4.20 GHz, 32 GB of RAM and an NVIDIA Titan Xp GPU.

LSQ vs composite quantization (CQ). For LSQ, we use as a starting point the publicly available implementation due to Martinez and Clement, written in Julia [6]. For CQ, we use the recently released implementation due to Zhang. This release is written in multithreaded C++, and uses the heavily optimized libraries MKL (for matrix operations) and libLBFGS (for codebook update).

Table 1. Comparison between CQ and LSQ on SIFT1M using 64 bits.

We let CQ use \(m=8\) codebooks and LSQ use \(m=7\) codebooks, plus an extra codebook for the database norms. This means that both methods have the same query time and use the same amount of memory. We run both methods for 30 iterations, and use all the default hyperparameters as provided in their respective code releases. To make the comparison fairer, we have ported the OTQ and LSQ encoding routines to C++ with OpenMP multithreading. These routines are called from Julia, and we leave the rest of the code untouched.

The results reported by Zhang et al. [34] on SIFT1M were trained on the base set. SIFT1M is provided with a learn set, and the more common protocol is to learn the model parameters exclusively on the learn set [1, 3, 9, 12, 17, 21, 22, 26, 27, 35]. Thus, we also run the method limiting its parameter learning to the learn set.

We report the results of our experiments in Table 1. LSQ achieves slightly higher recall than CQ when CQ is trained on the base set, but LSQ runs roughly \(20\times \) faster overall. The running time of CQ decreases drastically when we train it on the learn set, but the learned parameters do not generalize well to the base set (R@1 of 0.162). On the LabelMe22K and MNIST datasets (which traditionally do not have a learn partition), we have observed that LSQ consistently achieves higher recall than CQ with roughly \(10\times \) faster running times. From these results, we conclude that LSQ is faster, more accurate, and more sample-efficient (i.e., it requires less training data) than CQ.

LSQ vs competitive quantization (CompQ). Since there is no publicly available implementation of CompQ, we have tried to reproduce the reported results ourselves, with moderate success. We have not, for example, been able to reproduce the transform coding initialization reported in the paper, but have instead used RVQ, which was reported to achieve slightly worse results. We obtained a R@1 of 0.346 using a beam search width of 32, and training for 250 epochs (the parameters of the best reported result). The small difference in recall may be attributed to our different initialization.

However, the largest barrier to experimentation on our side is that our CompQ implementation, written in Julia, takes roughly 40 min per epoch to run. This means that our experiment on SIFT1M with 64 bits took almost one week to finish. We contacted the CompQ authors, and they mentioned using a multithreaded C++ implementation with pinned memory, an ad-hoc sort implementation, and special handling of threads. Their implementation takes 551 s per epoch, or about 38 h (\({\sim }1.5\) days) in total for 250 epochs on a desktop with a 10-core Xeon E5 2650 v3 @2.3 GHz CPU. We compare CompQ to LSQ using our multithreaded C++ implementation (same as in Table 1). We also use \(m=8\) codebooks in total, which controls for query time and memory use with respect to CompQ. We train for 25 iterations in total, and again use all the default parameters of LSQ.

Table 2. Comparison between CompQ and LSQ on SIFT1M using 64 bits.

LSQ and CompQ live on opposite sides of the parallelism spectrum: while CompQ uses stochastic gradient descent (SGD) with a batch size of 1, and is thus primarily sequential, LSQ is EM-like, and therefore very amenable to parallelization. This means that CompQ is unlikely to benefit from a GPU implementation, as GPUs require fairly large batch sizes to deliver higher throughput than CPUs (in fact, using large batch sizes to accelerate the training of large deep neural networks with SGD is an active area of research [13, 29]). To further explore the consequences of this algorithmic trade-off, we used the publicly available CUDA implementation of LSQ encoding [23] to accelerate training and base-set encoding. We also implemented OTQ encoding in CUDA.

We report the results of our experiments in Table 2. Our C++ implementation of LSQ is about \(150 \times \) faster than CompQ and, when using a GPU, LSQ achieves roughly a \(500 \times \) speedup over CompQ. However, the recall of LSQ lags 0.012 behind that of CompQ. Further research into CompQ may focus on finding ways to increase its batch size, so that it can leverage modern GPUs.

Improving LSQ: Desiderata. In the light of these results, we suggest the following criteria to improve LSQ:

  (a) First, we would like to make LSQ more accurate, so that it can narrow the gap with (and ideally, surpass) CompQ in terms of recall.

  (b) Second, we would like to maintain the parallelism of LSQ, because it is a distinctive feature that makes it fast in practice.

  (c) Finally, because LSQ is faster than its competitors, we want to find ways to trade off running time for accuracy. To make this trade-off more attractive in practice, we would also like to decrease LSQ’s overall running time.

Next, we propose improvements to LSQ that satisfy all these criteria.

4 Lower Running Time with a Fast Codebook Update

While benchmarking LSQ using a GPU, we noticed that the codebook update step is the most computationally expensive part of LSQ. This is somewhat counterintuitive, because encoding has historically been identified as the bottleneck in MCQ [3, 21]. However, recent hardware and algorithmic improvements have upended this idea. In particular, out of the 2.8 min of training time for LSQ with \(m=8\) codebooks and 25 iterations (last row of Table 2), 2.34 min are spent updating the codebook C. Thus, decreasing the running time of the codebook update step would significantly decrease the overall running time of LSQ. Formally, the codebook update step amounts to determining

$$\begin{aligned} \min _C ||X - CB ||_F^2; \end{aligned}$$
(3)

the current state-of-the-art method for this step was originally proposed by Babenko and Lempitsky [3], who noticed that finding C corresponds to a least-squares problem where C can be found independently in each dimension. Since B can be seen as a very sparse matrix, the authors proposed using iterative conjugate gradient (CG) methods in this step. This has the additional advantage that B can be reused for the d problems that finding C decomposes into. We have identified two problems with this approach:

  1. Explicit sparse matrix construction is inefficient. CG APIs typically require that B be represented as an explicit sparse matrix. Although efficient data structures for sparse matrices exist (e.g., the compressed sparse row format in SciPy), in practice B is stored as an \(m \times n\) uint8 matrix of code indices. We would like to use this representation directly and avoid building an additional data structure.

  2. Failure to exploit the binary nature of B. The matrix B is composed exclusively of ones and zeros (i.e., it is binary). Data structures for sparse matrices are commonly designed for the general case where the non-zero entries are arbitrary real numbers, leaving room for additional optimization.

Direct codebook update. We now introduce a method for fast codebook update, which takes advantage of these two observations. First, we note that it is possible to use a direct method instead of iterative CG, by rewriting Expression 3 as a regularized least-squares problem:

$$\begin{aligned} \min _C ||X - CB ||^2_F + \lambda ||C ||^2_F. \end{aligned}$$
(4)

In this case, the optimal solution can be obtained by taking the derivative with respect to C and setting it to zero

$$\begin{aligned} C = XB^\top ( B B^\top + \lambda I )^{-1}. \end{aligned}$$
(5)
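In detail, Eq. 5 follows from the first-order optimality condition of Expression 4,

$$\begin{aligned} \frac{\partial }{\partial C} \left( ||X - CB ||^2_F + \lambda ||C ||^2_F \right) = -2(X - CB)B^\top + 2\lambda C = 0, \end{aligned}$$

which rearranges to \(C(BB^\top + \lambda I) = XB^\top \).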

While we are not interested in a regularized solution, we can still benefit from this formulation by setting \(\lambda \) to a very small value (\(\lambda = 10^{-4}\) in our experiments), which simply renders the solution numerically stable. A crucial advantage of this formulation is that the matrix \(B B^\top + \lambda I \in \mathbb {R}^{mh \times mh}\) is square, symmetric, positive-definite and fairly compact; notably, its size is independent of n. Furthermore, thanks to regularization, \(B B^\top + \lambda I\) is guaranteed to be full-rank. Thus, matrix inversion can be performed directly with the help of a Cholesky decomposition in \(O(m^3h^3)\) time. Because matrix inversion is efficient, the bottleneck of our method lies in computing \(BB^{\top } \in \mathbb {N}^{mh \times mh}\), as well as \(XB^\top \in \mathbb {R}^{d \times mh}\). We exploit the structure in B to accelerate both operations.
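A direct rendering of this update, written as a sketch under our own naming (and, for brevity, taking B as an explicit sparse matrix rather than exploiting its structure as described next), might look as follows.

```julia
using LinearAlgebra, SparseArrays

# Sketch of the direct codebook update of Eq. 5 (names are ours). For brevity,
# B is taken here as an explicit mh × n sparse binary matrix; the two
# computations described next avoid materializing it.
function update_codebook(X::Matrix{Float64}, B::SparseMatrixCSC; λ = 1e-4)
    G = Matrix(B * B') + λ * I          # mh × mh, symmetric positive-definite
    P = X * B'                          # d × mh
    F = cholesky(Symmetric(G))          # full rank thanks to the λI regularizer
    Ct = F \ copy(P')                   # solve (BB' + λI) Cᵀ = B Xᵀ
    return copy(Ct')                    # C = XB'(BB' + λI)⁻¹
end
```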

Computing \(\varvec{BB}^\top \). By indexing B across each codebook, \(B = [ B_1, \cdots , B_m ]^\top \), \(BB^\top \) can be written as a block-symmetric matrix composed of \(m^2\) blocks of size \(h \times h\) each:

$$\begin{aligned} BB^{\top }= \begin{bmatrix} B_1B_1^\top&B_1B_2^\top&\dots&B_1B_m^\top \\ B_2B_1^\top&B_2B_2^\top&\dots&B_2B_m^\top \\ \vdots&\vdots&\ddots&\vdots \\ B_mB_1^\top&B_mB_2^\top&\dots&B_mB_m^\top \\ \end{bmatrix}. \end{aligned}$$
(6)

Here, the diagonal blocks \(B_NB_N^\top \) are diagonal matrices themselves, and since B is binary, their entries are a histogram of the codes in \(B_{N}\). Moreover, the off-diagonal blocks are the transpose of their symmetric counterparts: \(B_NB_M^\top = (B_MB_N^\top )^{\top }\), and can be computed as bivariate histograms of the codes in \(B_M\) and \(B_N\). Using these two observations, this method takes \(O(m^2n)\) time, while computing \(BB^\top \) naïvely would take \(O(m^2h^2n)\).
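A sketch of this histogram-based computation, under our own naming and with the codes stored as an \(m \times n\) matrix of indices, is shown below.

```julia
# Sketch (ours) of the histogram-based computation of BB' in Eq. 6.
# Runs in O(m²n) time, plus O(m²h²) to allocate the output.
function compute_BBt(codes::Matrix{<:Integer}, h::Int)
    m, n = size(codes)
    BBt = zeros(m * h, m * h)
    for a in 1:m, b in a:m
        block = zeros(h, h)              # bivariate histogram of codes (a, b);
        for i in 1:n                     # diagonal when a == b
            block[codes[a, i], codes[b, i]] += 1
        end
        ra = (a - 1) * h + 1 : a * h
        rb = (b - 1) * h + 1 : b * h
        BBt[ra, rb] = block
        BBt[rb, ra] = block'             # transpose fills the symmetric block (b, a)
    end
    return BBt
end
```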

Computing \(\varvec{XB}^\top \). We again take advantage of the structure of B to accelerate this step. \(XB^\top \) can be written as a matrix of m blocks of size \(d \times h\) each,

$$\begin{aligned} XB^\top = [XB_1^\top , XB_2^\top , \dots , XB_m^\top ]. \end{aligned}$$
(7)

Each block \(XB_i^\top \) can be computed by treating the \(B_i^\top \) columns as binary vectors that select the columns of X to sum together. This method takes O(mnd) time, while computing \(XB^\top \) naively would take O(mhnd).
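The corresponding sketch for \(XB^\top \) (again with our own naming) simply scatters each column of X into the codebook entry it selects.

```julia
# Sketch (ours) of the accelerated XB' computation of Eq. 7: each column of X
# is added to the column of XB' selected by its code, giving O(mnd) time.
function compute_XBt(X::Matrix{Float64}, codes::Matrix{<:Integer}, h::Int)
    d, n = size(X)
    m = size(codes, 1)
    XBt = zeros(d, m * h)
    for i in 1:n, j in 1:m
        k = (j - 1) * h + codes[j, i]    # column of XB' selected by code (j, i)
        @views XBt[:, k] .+= X[:, i]
    end
    return XBt
end
```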

Codebook update in CQ. Zhang et al. [34] propose a formulation similar to Eq. 5 for the codebook update, which they use to warm-start the CQ optimization process, but they do not introduce regularization. Since \(BB^\top \) is not guaranteed to have full rank, the authors use the SVD to compute its inverse, disregarding solution components associated with small singular values. They also do not exploit the sparsity of B when computing the other terms of the solution. In our experiments, their method takes more than a minute to run, while our solution runs in well under a second.

5 Higher Recall with Stochastic Relaxations

Our goal in this Section is to make LSQ more accurate, while maintaining the high level of parallelism and speed that it already enjoys in practice. To this end, we note that LSQ is fast because its optimization process is EM-like, which allows it to take advantage of highly parallel architectures. However, a well-known problem with such EM-like approaches is their tendency to converge to local minima. We also note that MCQ is analogous to k-means clustering (with combinatorial codebooks). Many years of research into k-means have resulted in a number of improvements to the original Lloyd’s algorithm (e.g., k-means++ initialization [2], or cluster closures for faster encoding [31]), so we look into the literature for methods that may be adapted to improve MCQ.

5.1 Stochastic Relaxations

A stochastic relaxation (SR), as formalized by Zeger et al. [33] in the early 1990s, is a method that defines an approximation to simulated annealing, with the idea of improving the quality of an approximation at reasonable computational costs. The idea was originally proposed to improve k-means clustering, and here we revisit and adapt it for MCQ.

Broadly defined, simulated annealing (SA) is a classical stochastic local search (SLS) technique that iterates over three major steps: (1) define an optimization state s, (2) create a new state \(s'\) by randomly perturbing the current state: \(s' = \pi (s)\), and (3) decide whether to accept or reject the new state as the basis for the next perturbation (for a broad review of the subject, see [15]). The acceptance probability in Step 3 is controlled by a parameter traditionally called temperature, which is typically slowly decreased over many iterations of Steps 2 and 3. (Various temperature schedules have been proposed and used in the many applications of simulated annealing). A stochastic relaxation modifies some of the typical SA steps in order to make them more computationally efficient. We now define these three steps for our method.

Defining an SA state: A functional view of MCQ. As a first step, we formally define an optimization state in MCQ. Expression 1 is defined over two latent variables, C and B. We assume that the optimization state is fully determined by a single variable, either C or B, which specifies the other via a pre-defined function. Thus, we define

  • an encoder function \(\mathcal {C}(X,B) \rightarrow C\), and

  • a decoder function \(\mathcal {D}(X, C) \rightarrow B\).

In our case, \(\mathcal {C}\) amounts to the codebook-update step, for which we adopt the method described in Sect. 4. Similarly, \(\mathcal {D}\) amounts to updating the codes B; in this case, we simply adopt the encoding method of LSQ. We have defined the optimization state of MCQ in two ways, which will give rise to two SR methods. The first method, called SR-C, uses the encoder function \(\mathcal {C}\), and the second method, called SR-D, uses the decoder function \(\mathcal {D}\).

Perturbing the SA state. The next step is to define a way to perturb the SA state at time-step i. We define two perturbation methods, one for SR-C and one for SR-D. Since we have defined the state as fully-determined given either variable via a proxy function, we can perturb the state by simply perturbing the corresponding function used in SR-C or SR-D. We define the functions

  • \(\,\mathcal {C}^* := \mathcal {C}(\pi _{\mathcal {C}}(X,i), B) \rightarrow C\) for SR-C, and

  • \(\mathcal {D}^* := \mathcal {D}(X, \pi _{\mathcal {D}}(C,i)) \rightarrow B\) for SR-D.

\(\pi _{\mathcal {C}}(X,i) \rightarrow X + T(i) \cdot \epsilon \) amounts to adding noise \(\epsilon \) to X, according to a predefined temperature schedule T(i). We choose to sample the noise from a zero-mean Gaussian whose diagonal covariance matches the per-dimension variance of X; in other words, \(\epsilon \sim N(\mathbf {0}, \varSigma )\), where \(\varSigma = diag(cov(X))\).

A major difference between k-means and MCQ is that, in MCQ, we use multiple codebooks. This difference is particularly important in SR-D, where the noise affects C, which represents m different codebooks. Since the centroids are obtained by summing one entry from each codebook, perturbing C amounts to perturbing each centroid m times. We thus define the perturbation function for SR-D slightly differently: \(\pi _{\mathcal {D}}(C,i) \rightarrow C + (T(i)/m) \cdot \epsilon \). In other words, we scale the noise by a factor of 1/m in SR-D.

Temperature schedule. In simulated annealing, it is common to gradually reduce the temperature, which controls the probability of accepting a new state (the so-called Metropolis-Hastings criterion). In SR, following Zeger et al. [33], we instead use the temperature to control the amount of noise added in each time-step. We use the schedule

$$\begin{aligned} T(i)&\rightarrow \left( 1 - (i/I) \right) ^p, \end{aligned}$$
(8)

where I is the total number of iterations, i is the current iteration, and \(p \in (0, 1]\) is a tunable hyper-parameter. We have found that \(p=0.5\) produces good results, and we use this value in all our experiments.
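For concreteness, the following sketch (function names and argument layout are ours) implements the schedule of Eq. 8 together with the SR-C and SR-D perturbations described above.

```julia
using Statistics

# Sketch (ours) of the temperature schedule of Eq. 8 and the SR perturbations.
temperature(i, niters; p = 0.5) = (1 - i / niters)^p

function perturb_data(X, i, niters)                 # π_C(X, i), used by SR-C
    σ = sqrt.(vec(var(X, dims = 2)))                # per-dimension std. dev. of X
    return X .+ temperature(i, niters) .* σ .* randn(size(X)...)
end

function perturb_codebooks(C, i, niters, m)         # π_D(C, i), used by SR-D
    σ = sqrt.(vec(var(C, dims = 2)))                # per-dimension std. dev. of C
    return C .+ (temperature(i, niters) / m) .* σ .* randn(size(C)...)
end
```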

Algorithm 1. EM-like optimization for MCQ: starting from an initial solution, the codebook-update step (line 5) and the encoding step (line 6) are alternated for a fixed number of iterations.

Acceptance criterion. The final building block of SA is an acceptance criterion, which decides whether the new (perturbed) state will be accepted or rejected. Following Zeger et al., we always accept the new state. As we will show, this simple criterion gives excellent results in practice.

Recap. To summarize, we have introduced two algorithms that define crude approximations to simulated annealing: SR-C and SR-D. These approximations are extremely simple to implement. To highlight this simplicity, we summarize the EM-like approach to MCQ in Algorithm 1; notice that

  • SR-C follows Algorithm 1 exactly, except that line 5 is replaced by \(C \leftarrow {{\mathrm{\arg \!\min }}}_C ||\pi _{\mathcal {C}}(X, i) - CB ||_F^2\), and

  • SR-D follows Algorithm 1 exactly, except that line 6 is replaced by \(B \leftarrow {{\mathrm{\arg \!\min }}}_B ||X - \pi _{\mathcal {D}}(C, i)B ||_F^2\).

In other words, SR-C and SR-D amount to adding noise in different parts of the EM-like MCQ optimization pipeline, while the workhorse functions that perform the codebook-update and encoding steps remain unchanged. This has multiple advantages. On one hand, it means that we can fully maintain the parallelism of LSQ. On the other hand, if better codebook-update or encoding functions are found in the future, they can be seamlessly integrated into our pipelines. Finally, we note that our methods involve only minimal computational overhead, as they only require computing the covariance of either X (which can be computed once and re-used many times in SR-C), or C, which is a compact variable whose size is independent of n. In practice, this overhead is negligible: \({<}0.1\) s for SR-C, and \({<}0.01\) s for SR-D. We refer to the combination of SR and our fast codebook update as LSQ++.
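To make the modification to Algorithm 1 explicit, here is a crude sketch (ours) of the resulting training loop; `codebook_update` and `encode` are passed in as functions standing in for the codebook update of Sect. 4 and LSQ encoding, and the perturbation helpers are the sketches given earlier.

```julia
# Sketch (ours) of Algorithm 1 with the SR-C / SR-D modifications.
function train_sr(X, C, codes, codebook_update, encode;
                  niters = 25, method = :SRD, m = size(codes, 1))
    for i in 1:niters
        if method == :SRC
            C = codebook_update(perturb_data(X, i, niters), codes)   # line 5, perturbed
            codes = encode(X, C)                                     # line 6
        else
            C = codebook_update(X, codes)                            # line 5
            codes = encode(X, perturb_codebooks(C, i, niters, m))    # line 6, perturbed
        end
    end
    return C, codes
end
```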

6 Experimental Evaluation

We quantify the impact of our codebook update method by measuring the time it saves per LSQ iteration (i.e., between lines 4 and 8 in Algorithm 1), and with a head-to-head large-scale evaluation against conjugate gradient (CG) methods. We also measure the impact of SR-C and SR-D by reporting recall@N.

Datasets. We evaluate our contributions on five datasets. The first two datasets are LabelMe22K [28] and MNIST. These datasets were originally created for classification, and have only two partitions (training/test). We learn both B and C on the training set, and use the test set as queries. LabelMe22K has \(d=512\) dimensions, 20 019 training vectors, and 2 000 queries. MNIST has \(d=784\) dimensions, 60 000 training vectors, and 10 000 queries.

The other three datasets are SIFT1M [17], Deep1M and VGG (called “Convnet1M” in [21]). SIFT1M is a classical retrieval dataset of SIFT [20] features. We have put together the Deep1M dataset by sampling from the 10-million-example set provided with the recently introduced Deep1B dataset [5]. These vectors come from the last convolutional layer of a GoogLeNet v3 [30] network, and have been PCA-projected to 96 dimensions. The VGG dataset consists of vectors from the CNN-M-128 network of Chatfield et al. [8] evaluated on Imagenet [10] images. These datasets have three partitions: train, query and base. We follow the standard protocol, which uses the train set to learn the codebooks C, and then uses those codebooks to encode the base set (i.e., obtain B); we then use the query set to find approximate nearest neighbours in the compressed base set [1, 3, 9, 12, 17, 21, 22, 26, 27]. SIFT1M and VGG have \(d=128\) dimensions, and Deep1M has \(d=96\) dimensions. All three datasets have 100 000 training vectors, 1 million base vectors, and 10 000 queries.

Table 3. Total time per LSQ/LSQ++ iteration, depending on how we update C (CG or Cholesky), and how we update B (using a C++ or a CUDA implementation).

6.1 Fast Codebook Update

We show the time savings due to our codebook update method in Table 3. Our method saves anywhere from 2.3 s (Deep1M, 64 bits) to 16 s (SIFT1M, 128 bits) of training time per iteration. This has a bigger impact when encoding is GPU-accelerated, where it results in 2.2–5.6\(\times \) speedups in practice.

Large-scale experiments. In Fig. 1, we show a “stress-test” comparison between our method for fast codebook update and CG, using dataset sizes of \(n=\{10^4, 10^5, 10^6, 10^7\}\). We take the first n training vectors from the SIFT1B [17] and Deep1B [5] datasets, and generate a random B. This is an especially easy case for CG, which takes only 2–3 iterations to converge. Even in this case, our method is orders of magnitude faster than previous work, and stays under 10 s in all cases, while CG takes up to 700 s for \(n=10^7\). Our method is only slower on small training sets, due to the cost of matrix inversion, which is independent of n.

Fig. 1. Time for codebook update as a function of dataset size with up to \(10^7\) vectors.

6.2 Stochastic Relaxations

To evaluate our second contribution, we report recall@1, which represents the empirical probability, computed over the query set, that the actual nearest neighbour of the query is returned as the first retrieved entry. We run every method ten times on each dataset and report the average result to account for the randomness in recall.
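For reference, recall@N can be computed with a few lines; the sketch below is ours (names are hypothetical), and assumes ranked retrieval lists and the true nearest neighbour of each query are given.

```julia
# Sketch (ours): fraction of queries whose true nearest neighbour appears
# among the first N retrieved entries.
function recall_at_N(retrieved::Vector{Vector{Int}},   # ranked database ids per query
                     groundtruth::Vector{Int},         # true NN id per query
                     N::Int)
    hits = count(q -> groundtruth[q] in retrieved[q][1:N], eachindex(retrieved))
    return hits / length(retrieved)
end
```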

We compare our contributions against the classical orthogonal MCQ methods PQ [17] and OPQ [12, 26], as well as the more recent RVQ [9], ERVQ [1, 22], CQ [34], and LSQ [21]. All methods use the same memory budget (64 or 128 bits per vector), the same codebook size of \(h=256\), and require the same number of table lookups to approximate a distance, so their query times are comparable as well. We run all the methods for 25 iterations.

Fig. 2. Recall@1 as a function of time in the MNIST and LabelMe datasets.

6.2.1 Recall@1.

Figures 2 and 3 show the recall@1 obtained by different methods as a function of time. We observe that SR-D obtains higher recall than LSQ on all datasets, for both 64 and 128 bits, except for SIFT1M at 128 bits. Our fast codebook update makes optimization faster than the original LSQ in all cases.

SR-C shows a more interesting behaviour. When using 64 bits, the method either gives a small boost to LSQ, or has a small detrimental effect (LabelMe22K). However, when using 128 bits, the method underperforms LSQ on all datasets except Deep1M and VGG. We find this result rather interesting, as it suggests that SR-C is better suited for deep features, which currently dominate a number of machine learning and computer vision applications. However, its performance on more classical benchmarks is somewhat disappointing.

We also note that, once we account for query time by dedicating one codebook to store the database norms, RVQ [9] and ERVQ/SQ [1, 22] tend to perform worse than PQ and OPQ – the only exception being the Deep1M and VGG datasets again. Previous work controlled only for memory use (with increased query time), so this detail was not obvious from previous benchmarks.

Fig. 3. Recall@1 as a function of time in the SIFT1M, Deep1M and VGG datasets.

Finally, we also observe that CQ fails to generalize when trained on the learn set, as is the standard protocol for SIFT1M. The method, however, performs well on LabelMe and MNIST, which do not have a separate learn set. This is in line with our preliminary analysis, and suggests that CQ needs more training data (which implies more training time) to generalize well.

Software. For our experiments, we wrote Rayuela.jl, a library that implements PQ, OPQ, OTQ, RVQ, ERVQ, CompQ, LSQ and LSQ++ in Julia, with C++ and CUDA bindings for OTQ and LSQ/LSQ++ encoding – we do not include CQ, because we want to release our library under an MIT licence, and the CQ code, released under GPLv2, does not allow for stricter sublicensing. We believe that Rayuela.jl is the most comprehensive library of MCQ methods to date. Rayuela.jl is available at https://github.com/una-dinosauria/Rayuela.jl.

Comparison to CompQ. In Table 4, we update the benchmark against CompQ. Out of the box, LSQ++ (with SR-D) manages to reduce the gap to CompQ by half, from 0.012 to 0.006, and is also faster due to the faster codebook update.

We then iteratively double the computational budget of LSQ++ (trading off computation for accuracy), and bring the difference in recall down to 0.001 with 100 training iterations and 128 ILS iterations for base encoding. Doubling the budget of this final step puts our method above CompQ by 0.001 in R@1. Even under these circumstances, LSQ++ is still \(200\times \) faster than CompQ.

Table 4. Comparison between CompQ, LSQ and LSQ++ on SIFT1M using 64 bits.

7 Conclusions

We have benchmarked recent non-orthogonal MCQ algorithms and have found that (1) LSQ [21] is considerably faster than its competitors, (2) LSQ lags in accuracy behind CompQ, and (3), when using a GPU, the computational bottleneck of LSQ is, somewhat counterintuitively, the codebook update step.

Based on these observations, we have introduced two stochastic relaxation methods for MCQ that provide inexpensive approximations to simulated annealing, a technique widely used for hard combinatorial problems. One of these methods (SR-D) consistently improves the recall of LSQ at negligible computational cost. We have also introduced a method for fast codebook updates that results in faster training. Both of our contributions can be used as out-of-the-box improvements on top of LSQ and are simple to implement. Furthermore, these two contributions widen the gap in running time between LSQ and its competitors, and make up for the difference in accuracy between LSQ and CompQ [27].