# Adaptive Euclidean maps for histograms: generalized Aitchison embeddings


## Abstract

Learning distances that are specifically designed to compare histograms in the probability simplex has recently attracted the attention of the machine learning community. Learning such distances is important because most machine learning problems involve bags of features rather than simple vectors. Ample empirical evidence suggests that the Euclidean distance in general and Mahalanobis metric learning in particular may not be suitable to quantify distances between points in the simplex. We propose in this paper a new contribution to address this problem by generalizing a family of embeddings proposed by Aitchison (J R Stat Soc 44:139–177, 1982) to map the probability simplex onto a suitable Euclidean space. We provide algorithms to estimate the parameters of such maps by building on previous work on metric learning approaches. The criterion we study is not convex, and we consider alternating optimization schemes as well as accelerated gradient descent approaches. These algorithms lead to representations that outperform alternative approaches to compare histograms in a variety of contexts.

## Keywords

Metric learning for histograms · Aitchison geometry · Probability simplex · Embeddings

## 1 Introduction

Defining a distance to compare objects of interest is an important problem in machine learning. Many metric learning algorithms have been proposed to tackle this problem by considering labeled datasets, most of which exploit the simple and intuitive framework of Mahalanobis distances (Xing et al. 2002; Schultz and Joachims 2003; Kwok and Tsang 2003; Goldberger et al. 2004; Shalev-Shwartz et al. 2004; Globerson and Roweis 2005). Among these contributions, two algorithms are particularly popular in applications: the Large Margin Nearest Neighbor (LMNN) approach described by Weinberger et al. (2006), Weinberger and Saul (2008, 2009), and the Information-Theoretic Metric Learning (ITML) approach proposed by Davis et al. (2007).

Among such objects of interest, histograms—the normalized representation for bags of features—play a fundamental role in many applications, from computer vision (Julesz 1981; Sivic and Zisserman 2003; Perronnin et al. 2010; Vedaldi and Zisserman 2012), natural language processing (Salton and McGill 1983; Salton 1989; Baeza-Yates and Ribeiro-Neto 1999; Joachims 2002; Blei et al. 2003; Blei and Lafferty 2006, 2009) and speech processing (Doddington 2001; Campbell et al. 2003; Campbell and Richardson 2007) to bioinformatics (Erhan et al. 1980; Burge et al. 1992; Leslie et al. 2002). Mahalanobis distances can be used as such on histograms or bags of features, but fail to incorporate the geometrical constraints of the probability simplex (non-negativity, normalization) in their definition. To address this issue, Cuturi and Avis (2011) and Kedem et al. (2012) have recently proposed to learn the parameters of distances specifically designed for histograms, namely the transportation distance and a generalized variant of the \(\chi ^{2}\) distance, respectively.

We propose in this work a new approach to compare histograms that builds upon older work by Aitchison (1982). In a series of influential papers and monographs, Aitchison and Shen (1980), Aitchison and Lauder (1985), Aitchison (1982, 1986, 2003) proposed to study different maps from the probability simplex onto a Euclidean space of suitable dimension. These maps are constructed such that they preserve the geometric characteristics of the probability simplex, yet make subsequent analysis easier by relying only upon Euclidean tools, such as Euclidean distances, quadratic forms and ellipses. Our goal in this paper is to follow this line of work and propose suitable maps from the probability simplex to a Euclidean space of suitable dimension. However, rather than relying on a few mappings defined a priori such as those proposed in Aitchison (1982), we propose to *learn* such maps directly in a supervised fashion using Mahalanobis metric learning.

We build upon our earlier contribution (Le and Cuturi 2013) and provide new insights on the empirical behavior of our method, notably in terms of convergence speed and parameter sensitivity. We also consider the adaptive restart heuristic (O’Donoghue and Candès 2013) and show that it can prove beneficial. Source code for our tools can be obtained at http://github.com/lttam/GenAitEmb.

This paper is organized as follows: after providing some background on Aitchison embeddings in Sect. 2, we propose a generalization of Aitchison embeddings in Sect. 3. In Sect. 4, we propose algorithms to learn the parameters of such embeddings using training data. We also review related work in Sect. 5, before providing experimental evidence in Sect. 6 that our approach improves upon other adaptive metrics on the probability simplex. Finally, we provide some observations on the empirical behavior of our algorithms in Sect. 7 before concluding this paper in Sect. 8.

## 2 Aitchison embeddings

Aitchison (1982) argues that histograms should be compared through *the relative values of their coordinates* rather than their absolute values. Therefore, Aitchison makes the point that comparing histograms directly with Euclidean distances is not appropriate, since the Euclidean distance can only measure the arithmetic difference between coordinates. Given two points \(\mathbf {x}\) and \(\mathbf {z}\) in the simplex, Aitchison proposes to focus explicitly on the log-ratio of \(\mathbf {x}_i\) and \(\mathbf {z}_i\) for each coordinate \(i\), which can be expressed as the arithmetic difference of their logarithms, \(\log \mathbf {x}_i - \log \mathbf {z}_i\).
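To make this point concrete, the following snippet (an illustration, not part of our experimental code) contrasts the Euclidean distance with a plain log-ratio distance on two pairs of histograms whose coordinates differ by the same arithmetic amount but by very different relative amounts:

```python
import numpy as np

# Two pairs of histograms in the simplex: each pair differs in one coordinate
# by the same arithmetic amount (0.01), but the relative change is a ~10%
# shift in the first pair versus a doubling in the second.
x1 = np.array([0.10, 0.60, 0.30]); z1 = np.array([0.11, 0.59, 0.30])
x2 = np.array([0.01, 0.69, 0.30]); z2 = np.array([0.02, 0.68, 0.30])

eucl = lambda a, b: np.linalg.norm(a - b)
log_ratio = lambda a, b: np.linalg.norm(np.log(a) - np.log(b))

# The Euclidean distance cannot tell the two pairs apart ...
print(eucl(x1, z1), eucl(x2, z2))
# ... while the log-ratio distance finds the second pair much farther apart.
print(log_ratio(x1, z1), log_ratio(x2, z2))
```

The log-ratio view is thus sensitive to relative, rather than absolute, changes in coordinates.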

### 2.1 Additive log-ratio embedding

Aitchison (1982) first considers the *additive log-ratio embedding* (**alr**), which maps a vector \(\mathbf {x}\) from the probability simplex \({\mathbb {S}}_{d}\) onto \({\mathbb {R}}^{d-1}\) by taking the log-ratio of its first \(d-1\) coordinates to its last one, \(\mathbf{alr}(\mathbf{x}) = \left[\log \tfrac{\mathbf{x}_i}{\mathbf{x}_d}\right]_{1 \le i \le d-1}\) (Eq. 1).
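For illustration, a minimal implementation of the standard **alr** map, with a small constant `eps` (our choice here, not part of Aitchison's definition) guarding against zero coordinates:

```python
import numpy as np

def alr(x, eps=1e-9):
    """Additive log-ratio embedding: S_d -> R^{d-1}, taking the last
    coordinate as the reference component. eps guards against zeros."""
    x = np.asarray(x, dtype=float) + eps
    return np.log(x[:-1] / x[-1])

x = np.array([0.2, 0.3, 0.5])
print(alr(x))  # [log(0.2/0.5), log(0.3/0.5)]
```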

### 2.2 Centered log-ratio embedding

The *centered log-ratio embedding* (\(\mathbf{clr}\)) maps \(\mathbf{x}\in {\mathbb {S}}_{d}\) onto \({\mathbb {R}}^{d}\) by subtracting from each log-coordinate the mean of all log-coordinates, \(\mathbf{clr}(\mathbf{x}) = \left[\log \mathbf{x}_i - \tfrac{1}{d}\sum _{j=1}^{d}\log \mathbf{x}_j\right]_{1 \le i \le d}\) (Eq. 2).

### 2.3 Isometric log-ratio embedding

Egozcue et al. (2003) propose the *isometric log-ratio embedding* (\({\mathbf{ilr}}\)), which composes the centered log-ratio map with an orthonormal basis of the hyperplane of vectors whose coordinates sum to zero, yielding an isometry from \({\mathbb {S}}_{d}\) onto \({\mathbb {R}}^{d-1}\) (Eq. 3).
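For illustration, the sketch below implements the centered log-ratio map together with an \({\mathbf{ilr}}\)-style map. The orthonormal basis is obtained here from an SVD rather than the specific basis of Eq. (3), but any orthonormal basis of the zero-sum hyperplane yields the same pairwise distances:

```python
import numpy as np

def clr(x, eps=1e-9):
    # Centered log-ratio: subtract the mean of the log-coordinates.
    lx = np.log(np.asarray(x, dtype=float) + eps)
    return lx - lx.mean()

def ilr(x, eps=1e-9):
    d = len(x)
    # Orthonormal basis of the hyperplane {v : sum(v) = 0} in R^d,
    # taken here as the null space of the all-ones row vector.
    _, _, vt = np.linalg.svd(np.ones((1, d)))
    V = vt[1:]                      # (d-1) x d, orthonormal rows summing to 0
    return V @ clr(x, eps)

x = np.array([0.2, 0.3, 0.5])
# ilr is an isometry on clr coordinates: the two norms agree.
print(np.linalg.norm(ilr(x)), np.linalg.norm(clr(x)))
```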

Aitchison’s original definitions (1982, 1986, 2003) do not explicitly consider the regularization coefficient \(\varepsilon \). In that literature, histograms are either assumed to have strictly positive values, or the problem is dismissed by stating that all values can be regularized by a very small constant (Aitchison and Lauder 1985, p. 132; 1986, §11.5). We consider this constant \(\varepsilon \) explicitly here because it forms the basis of the embeddings we propose in the next section.

## 3 Generalized Aitchison embeddings

Rather than settling for a particular weight matrix—such as those defined in Eqs. (1), (2) or (3)—and defining a regularization constant \(\varepsilon \) arbitrarily, we introduce in the definition below a family of mappings that instead leverage these parameters to define a flexible generalization of Aitchison’s maps. In the following, \({\mathcal {S}}_d^+\) is the cone of symmetric positive semidefinite matrices of size \(d\times d\).

### **Definition 1**

Our goal is to *learn* \(\mathbf {P}\) and \(\mathbf {b}\) such that histograms mapped following \(\mathfrak {a}\) can be efficiently discriminated using the Euclidean distance. The Euclidean distance between the images of two histograms \(\mathbf {x}\) and \(\mathbf {z}\) under the embedding \(\mathfrak {a}\) then serves as the learned distance \(d_{\mathfrak {a}}\) between them.
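As an illustration, the sketch below assumes the embedding takes the form \(\mathfrak {a}(\mathbf {x}) = \mathbf {P}\log (\mathbf {x} + \mathbf {b})\)—a pseudo-count shift followed by a log transform and a linear map. The function names and the exact parameterization are ours, not a verbatim transcription of Definition 1:

```python
import numpy as np

def gen_aitchison(x, P, b):
    """Sketch of a generalized Aitchison embedding, assuming the form
    a(x) = P log(x + b): the histogram is shifted by a learned pseudo-count
    vector b > 0, log-transformed element-wise, then mapped linearly by P."""
    return P @ np.log(x + b)

def d_a(x, z, P, b):
    # Euclidean distance between the images of x and z under the embedding.
    return np.linalg.norm(gen_aitchison(x, P, b) - gen_aitchison(z, P, b))

# With P = I and a vanishing pseudo-count, d_a reduces to the plain
# log-ratio distance between strictly positive histograms.
P, b = np.eye(3), np.full(3, 1e-9)
x, z = np.array([0.2, 0.3, 0.5]), np.array([0.25, 0.25, 0.5])
print(d_a(x, z, P, b))
```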

## 4 Learning generalized Aitchison embeddings

### 4.1 Criterion

Let \(D=\{\left( \mathbf {x_i}, y_{i}\right) _{1 \le i \le N}\}\) be a dataset of labeled points in the simplex, where each \(\mathbf {x_i}\in {\mathbb {S}}_{d}\) and each \(y_i\in \{1,\cdots ,L\}\) is a label. We follow the approach of Weinberger et al. (2006), Weinberger and Saul (2009) to define a criterion for optimizing the parameters \((\mathbf {Q}, \mathbf {b})\). Weinberger et al. propose a large margin approach to nearest neighbor classification: given a training set \(D\), their criterion considers, for a single reference point \(\mathbf {x_{i}}\), the cumulative distance to its closest neighbors that belong to the same class, corrected by a coefficient which takes into account whether points from a different class are in the immediate neighborhood of \(\mathbf {x_{i}}\). Taken together over the entire dataset, these two factors promote metric parameters which ensure that each point’s immediate neighborhood is mostly composed of points that share its label.

These ideas can be formalized with the following notation. Let \(\kappa \) be an integer. Given a pair of parameters \((\mathbf {Q},\mathbf {b})\), consider the geometry induced by \(d_{\mathfrak {a}}\). For each point \(\mathbf {x_{i}}\) in the dataset, there exist \(\kappa \) neighbors of \(\mathbf {x_{i}}\) which share its label. We single out these indices by introducing the binary relation \(j \leadsto i\) for two indices \(1 \le i \ne j\le N\). The notation \(j \leadsto i\) means that the \(j\)-th point is among those closest neighbors with the same class (namely \(y_i=y_j\)). The set of indices \(j\) such that \(j \leadsto i\) is called the set of *target neighbors* of the \(i\)-th point. Note that \(j \leadsto i\) does not imply \(i \leadsto j\).
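The target-neighbor relation can be computed directly from these definitions. The following sketch (with a generic `dist` argument standing in for \(d_{\mathfrak {a}}\), and names of our choosing) returns, for every point \(i\), the indices \(j\) with \(j \leadsto i\):

```python
import numpy as np

def target_neighbors(X, y, kappa, dist):
    """For each point i, return the indices j with j ~> i: the kappa nearest
    points sharing i's label under `dist`. The relation is not symmetric."""
    N = len(X)
    targets = []
    for i in range(N):
        same = [j for j in range(N) if j != i and y[j] == y[i]]
        same.sort(key=lambda j: dist(X[i], X[j]))
        targets.append(same[:kappa])
    return targets

X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = np.array([0, 0, 0, 1])
euclid = lambda a, b: np.linalg.norm(a - b)
T = target_neighbors(X, y, kappa=1, dist=euclid)
# Point 1 is a target neighbor of point 2, but not vice versa (asymmetry);
# point 3 has no same-class neighbors at all.
print(T)  # [[1], [0], [1], []]
```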

### 4.2 Alternating optimization

Unlike the original LMNN formulation, optimization problem (6) is not convex because of the introduction of the pseudo-count vector \(\mathbf {b}\). Although the objective is still convex with respect to \(\mathbf {Q}\), it is non-convex with respect to \(\mathbf {b}\). We first consider a naive approach which alternately updates \(\mathbf {Q}\) and \(\mathbf {b}\). This approach is summarized in Algorithm 1 and detailed below.

When \(\mathbf {b}\) is fixed, optimization problem (6) reduces to a Mahalanobis metric learning problem: once each training vector \(\mathbf {x}\) is mapped to \(\log \left( \mathbf {x} + \mathbf {b} \right) \), problem (6) can be solved with an LMNN solver.

### 4.3 Projected subgradient descent with Nesterov acceleration

We propose in this section a more straightforward approach to minimizing Problem (6) which bypasses the cost of running many iterations of the LMNN solver. We use projected subgradient descent with Nesterov's acceleration scheme (Nesterov 1983, 2004) to optimize the parameters \((\mathbf {Q}, \mathbf {b})\) in Problem (6) directly. Our experiments show that this approach is considerably faster and equally efficient in terms of classification accuracy.
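The following generic sketch illustrates the scheme for a single matrix variable: a Nesterov look-ahead step, a (sub)gradient step, and a projection onto the PSD cone \({\mathcal {S}}_d^+\). The names `project_psd` and `accelerated_pgd` are ours, and the code is a simplified stand-in for our solver, which updates \((\mathbf {Q}, \mathbf {b})\) jointly:

```python
import numpy as np

def project_psd(Q):
    """Project a symmetric matrix onto the PSD cone by clipping its
    negative eigenvalues to zero (Frobenius-norm projection)."""
    w, U = np.linalg.eigh((Q + Q.T) / 2)
    return (U * np.maximum(w, 0)) @ U.T

def accelerated_pgd(grad, Q0, step, iters):
    """Projected gradient descent with Nesterov momentum (generic sketch)."""
    Q, Q_prev = Q0, Q0
    for t in range(1, iters + 1):
        Y = Q + (t - 1) / (t + 2) * (Q - Q_prev)   # look-ahead point
        Q_prev = Q
        Q = project_psd(Y - step * grad(Y))         # step, then project
    return Q

# Example: the nearest PSD matrix to A in Frobenius norm clips A's
# negative eigenvalues, i.e. the minimizer of ||Q - A||_F^2 over S_d^+.
A = np.array([[2.0, 0.0], [0.0, -1.0]])
Q_hat = accelerated_pgd(lambda Q: 2 * (Q - A), np.zeros((2, 2)), 0.4, 300)
print(Q_hat)  # close to diag(2, 0)
```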

### 4.4 Low-rank approaches

### 4.5 Adaptive restart

The projected subgradient descent with Nesterov acceleration presented in Sect. 4.3 does not guarantee a monotone decrease of the objective value. Indeed, it has been observed that Nesterov's acceleration scheme may create ripples in the objective value when plotted against the iteration count. This phenomenon happens when the momentum built up by the acceleration scheme exceeds a critical value (the optimal momentum value described by Nesterov (1983, 2004)), which damages convergence speed. To overcome this, we adopt the heuristic of O’Donoghue and Candès (2013), which sets the momentum back to zero whenever an increase in the objective is detected. Whenever \({\mathcal {F}}_{t} > {\mathcal {F}}_{t-1}\) at some time \(t\), this heuristic erases the memory of previous iterations, resets the algorithm counter to \(0\) and uses the current iterate as a warm start.
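The heuristic amounts to a few extra lines in the accelerated loop, sketched here on a generic smooth objective (a function-value restart; this sketch omits the PSD projection and joint \((\mathbf {Q}, \mathbf {b})\) update of our actual solver):

```python
import numpy as np

def gd_nes_restart(f, grad, x0, step, iters):
    """Gradient descent with Nesterov momentum and function-value adaptive
    restart (O'Donoghue & Candes 2013): when the objective increases, reset
    the momentum counter and warm-start from the current iterate."""
    x, x_prev, k = x0, x0, 1
    f_prev = f(x0)
    for _ in range(iters):
        y = x + (k - 1) / (k + 2) * (x - x_prev)    # momentum look-ahead
        x_next = y - step * grad(y)
        if f(x_next) > f_prev:
            # ripple detected: erase momentum memory, restart from x
            k, x_prev = 1, x
            x_next = x - step * grad(x)
        else:
            k, x_prev = k + 1, x
        f_prev = f(x_next)
        x = x_next
    return x

# Example on a simple quadratic: converges to its minimizer a.
a = np.array([1.0, 2.0])
x_star = gd_nes_restart(lambda x: 0.5 * np.sum((x - a) ** 2),
                        lambda x: x - a, np.zeros(2), 0.5, 100)
print(x_star)  # close to [1, 2]
```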

## 5 Related work

The logarithm is a popular transformation in text processing, used to correct for the *burstiness* of feature counts (Salton 1989; Baeza-Yates and Ribeiro-Neto 1999; Rennie et al. 2003; Lewis et al. 2004; Madsen et al. 2005), by mapping each count through an element-wise logarithm.

In addition to the logarithm, Hellinger’s embedding, which considers the element-wise square-root vector of a histogram (\(\mathbf {x} \mapsto \sqrt{\mathbf {x}}\)), is particularly popular in computer vision (Perronnin et al. 2010; Vedaldi and Zisserman 2012). This embedding was also considered an adequate representation to learn Mahalanobis metrics in the probability simplex, as argued by Cuturi and Avis (2014, §6.2.1). Other explicit feature maps, such as \(\chi ^{2}\), intersection and Jensen-Shannon, are also benchmarked in Vedaldi and Zisserman (2012).
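For completeness, the Hellinger map is a one-liner; a convenient property is that it sends the probability simplex onto the unit sphere, so Euclidean distances between images stay well-scaled:

```python
import numpy as np

# Hellinger embedding: element-wise square root. The Euclidean distance
# between images is proportional to the Hellinger distance between the
# histograms, and every histogram is mapped onto the unit sphere.
hellinger_map = lambda x: np.sqrt(x)

x = np.array([0.2, 0.3, 0.5])
z = np.array([0.1, 0.4, 0.5])
print(np.linalg.norm(hellinger_map(x) - hellinger_map(z)))
print(np.linalg.norm(hellinger_map(x)))  # = 1 since x sums to 1
```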

## 6 Experiments

### 6.1 Experimental setting and implementation notes

*Datasets* We evaluate our algorithms on 12 benchmark datasets of various sizes. Table 1 displays their properties and relevant parameters. These datasets cover problems such as scene classification, image classification with a single label or multiple labels, handwritten digit classification and text classification. We follow recommended configurations for these datasets; when none are provided, we randomly generate five folds for each run. Additionally, we repeat the experiments at least 3 times to obtain averaged results, except for the PASCAL VOC 2007 and MirFlickr datasets, where we use a predefined train and test set.

Properties of datasets and their corresponding experimental parameters

| Dataset | #Train | #Test | #Class | Feature | Rep | #Dim | #Run |
|---|---|---|---|---|---|---|---|
| MIT Scene | 800 | 800 | 8 | SIFT | BoF | 800 | 5 |
| UIUC Scene | 1,500 | 1,500 | 15 | SIFT | BoF | 800 | 5 |
| DSLR | 409 | 89 | 31 | SURF | BoF | 800 | 5 |
| WEBCAM | 646 | 149 | 31 | SURF | BoF | 800 | 5 |
| AMAZON | 2,262 | 551 | 31 | SURF | BoF | 800 | 5 |
| OXFORD Flower | 680 | 680 | 17 | SIFT | BoF | 400 | 5 |
| CALTECH-101 | 3,060 | 2,995 | 102 | SIFT | BoF | 400 | 3 |
| PASCAL VOC 2007 | 5,011 | 4,952 | 20 | Dense hue | BoF | 100 | 1 |
| MirFlickr | 12,500 | 12,500 | 38 | Dense hue | BoF | 100 | 1 |
| MNIST | 5,000 | 5,000 | 10 | Normalized intensity | – | 784 | 5 |
| 20 News Group | 600 | 19,397 | 20 | BoW | LDA | 200 | 5 |
| Reuters | 500 | 9,926 | 10 | BoW | LDA | 200 | 5 |
*Parameters of the proposed algorithms* We set the target neighborhood size \(\kappa = 3\) as the default parameter setting of the LMNN solver.^{1} We note that the number of target neighbors \(\kappa \) need not be equal to the parameter \(k\) of \(k\)-nearest neighbor classification: in our experiments, \(\kappa \) is fixed while \(k\) varies. We also set the regularization \(\mu = 1\) as in LMNN (Weinberger and Saul 2009), while the regularization \(\lambda \) is set to \(\kappa N\) (recall that \(N\) is the size of the training set), guided by preliminary experiments. The step size \(t_0\) in the subgradient descent update is chosen from the set \(\frac{1}{\kappa N}\{0.01, 0.05, 0.1, 0.5\}\) by cross validation. For the alternating optimization (Algorithm 1), we set \(t_{\max } = 20\) iterations (in our experiments, we observe that this number is generous, since 6–10 iterations usually suffice for most datasets, as shown in Figs. 6, 7). The projected subgradient descent with Nesterov acceleration (PSGD-NES) takes fewer than 500 iterations to converge (usually about 300 iterations, as illustrated in Figs. 10, 11), so we set \(t_{\max } = 500\) for the PSGD-NES algorithm.

*Dense SIFT features for images* Dense SIFT features are computed by applying a SIFT descriptor to \(16 \times 16\) patches centered on each pixel of an image, as in Le et al. (2011), rather than at key points (Lowe 2004) or on a grid of points (Lazebnik et al. 2006). Additionally, before computing dense SIFT, we convert images to grayscale to improve robustness. We obtained dense SIFT features using the LabelMe toolbox^{2} (Russell et al. 2008).

### 6.2 Metrics and metric learning methods

We consider LMNN metric learning for histograms using: their original representation; the \({\mathbf{ilr}}\) representation (Sect. 2, Eq. (3)); their Hellinger map. We also include the simple Euclidean distance in our benchmarks. To illustrate the fact that learning the pseudo-count vector \(\mathbf {b}\) results in significant performance improvements, we also conduct experiments with an algorithm that learns \(\mathbf {Q}\) through LMNN but only considers a uniform pseudo-count vector \(\alpha \mathbf {1}\), with \(\alpha \) chosen in \(\{0.0001, 0.001, 0.01, 0.1, 1\}\) by cross validation on the training fold. We call this approach Log-LMNN.

### 6.3 Scene classification

We conduct experiments on the MIT Scene^{3} and UIUC Scene^{4} datasets. In these datasets, we randomly select 100 training and 100 test points from each class. Histograms are obtained using dense SIFT features with a bag-of-features representation (BoF), where the number of visual words is set to 800. We repeat experiments 5 times on each dataset, splitting randomly into training and test sets.

### 6.4 Handwritten digits classification

We also perform experiments on handwritten digits classification with the MNIST^{5} dataset. A feature vector for each point is constructed from the normalized intensity level of each pixel. We randomly choose 500 points per class for the training set and 500 disjoint points per class for the test set, repeating 5 times to average results. The middle graph in Fig. 2 illustrates that the generalized Aitchison embedding also outperforms the other alternative embeddings.

### 6.5 Text classification

We also carry out experiments on text classification with the 20 News Groups^{6} and Reuters^{7} (the 10 largest classes) datasets. In these datasets, we compute a bag of words (BoW) for each document, and then use topic modelling (LDA) to reduce the dimension of histograms with the *gensim* toolbox,^{8} obtaining a histogram of topics for each document (Blei et al. 2003; Blei and Lafferty 2009). We randomly choose 30 points per class in 20 News Groups and 50 points per class in Reuters for training, and use the remaining points for testing. We randomly generate 5 different train and test sets for each dataset and average results.

Averaged percentage of zero-elements in a histogram (sparseness) of single-label datasets

| Dataset | Sparseness (%) |
|---|---|
| MIT Scene | 20.04 |
| UIUC Scene | 20.33 |
| DSLR | 39.58 |
| WEBCAM | 64.44 |
| AMAZON | 83.20 |
| OXFORD Flower | 1.12 |
| CALTECH-101 | 13.15 |
| MNIST | 80.68 |
| 20 News Group | 98.01 |
| Reuters | 98.00 |

### 6.6 Single-label object classification

*DSLR, AMAZON and WEBCAM* These datasets^{9} are split into five folds. Each point is a histogram of visual words obtained by a BoF representation on SURF features (Bay et al. 2006), where the code-book size is set to \(800\). We repeat experiments 5 times on each dataset with different random splits and average results.

*OXFORD FLOWER* ^{10} We randomly choose 40 flower images of each class for training and use the rest for testing. We construct histograms using a BoF representation with \(400\) visual words on dense SIFT features, and repeat experiments 5 times on different random splits to obtain averaged results. The fourth graph in Fig. 3 shows that the proposed embedding outperforms the original histogram representation by more than 30 %, and also improves by about 15 % over the \({\mathbf{ilr}}\) embedding as well as the Hellinger representation with LMNN. As shown in Table 2, this dataset is highly dense, with only about 1 % zero-elements per histogram. This suggests that our approaches might work better on dense datasets.

*CALTECH-101* We randomly choose 30 images for training and up to 50 other images for testing. We use a BoF representation with 400 visual words on dense SIFT features to construct a histogram for each image. The rightmost graph in Fig. 3 shows averaged results on 3 different random splits of the CALTECH-101^{11} dataset, illustrating again the interest of our approach.

### 6.7 Multi-label object classification

We conduct experiments on the PASCAL VOC 2007^{12} and MirFlickr^{13} datasets. We follow the guidelines to define the train and test sets. Histograms for each image are built in these datasets from a BoF representation with 100 visual words on a dense hue feature. We then employ a one-versus-all strategy for \(k\)-NN classification and compute the average precision on each dataset. Figure 4 illustrates that the proposed embedding again outperforms the original, \({\mathbf{ilr}}\) and Hellinger representations with LMNN. Additionally, the Hellinger distance performs better than LMNN and comparably to Log-LMNN on these datasets.

### 6.8 Low-rank embeddings

## 7 Experimental behavior of the algorithm

### 7.1 Convergence speed

The naive alternating optimization has a computational cost about one order of magnitude larger than a direct application of LMNN. This factor appears because we run the LMNN solver multiple times. The burden of optimizing the pseudo-count vector is small, because the gradient has a closed form for each pair in the objective function. We only need to run a few iterations of the LMNN algorithm, using a warm start, when alternating. Our experiments show that 6–10 alternating iterations suffice for these datasets, but each iteration is costly. These results show the interest of using Nesterov's acceleration scheme here, and even suggest adopting the adaptive restart heuristic of O’Donoghue and Candès (2013).

### 7.2 Sensitivity to parameters

*Target neighbors* Figures 8 and 9 illustrate the effect of the number of target neighbors \(\kappa \) on the results of our algorithms. We evaluate \(\kappa \in \{1, 3, 5, 7, 9\}\) for single-label datasets, except that we omit \(\kappa \in \{7, 9\}\) for DSLR and \(\kappa = 9\) for WEBCAM due to the size of the smallest class in these datasets. These results suggest that the number of target neighbors has a large impact and should remain low, both for the computational cost and for the performance of the algorithm. Figures 8 and 9 also show that \(\kappa = 3\) is an appropriate choice for the evaluated datasets.

*Average test accuracy over iteration count* Figures 10 and 11 show the average test accuracy as a function of the iteration count. The accuracy curves seem to increase monotonically with the iteration count, suggesting that our algorithms do not overfit the training data in these evaluations.

## 8 Conclusion

We have shown that a generalized family of embeddings for histograms, coupled with different procedures to estimate its parameters, can be effective to represent histograms in Euclidean spaces. Our variations outperform other common approaches such as the Hellinger map or Aitchison’s original embeddings. Rather than using an alternating optimization scheme that relies on LMNN solvers, our results indicate that a simple accelerated subgradient method provides the best results, both in performance and in computational time. Other variations, such as learning a low-rank embedding or using the adaptive restart heuristic for PSGD-NES, can also prove beneficial, depending on the dataset.


## Notes

### Acknowledgments

TL acknowledges the support of the MEXT scholarship 123353. MC acknowledges the support of the Japanese Society for the Promotion of Science Grant 25540100.

## References

- Aitchison, J. (1982). The statistical analysis of compositional data. *Journal of the Royal Statistical Society*, *44*, 139–177.
- Aitchison, J. (1986). *The statistical analysis of compositional data*. London: Chapman and Hall.
- Aitchison, J. (2003). A concise guide to compositional data analysis. In *CDA workshop*.
- Aitchison, J., & Lauder, I. J. (1985). Kernel density estimation for compositional data. *Applied Statistics*, *34*, 129–137.
- Aitchison, J., & Shen, S. M. (1980). Logistic-normal distributions: Some properties and uses. *Biometrika*, *67*, 261–272.
- Aizawa, A. (2003). An information-theoretic perspective of tf-idf measures. *Information Processing and Management*, *39*(1), 45–65.
- Baeza-Yates, R., & Ribeiro-Neto, B. (1999). *Modern information retrieval* (Vol. 463). New York: ACM Press.
- Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In *European conference on computer vision* (pp. 404–417).
- Blei, D., & Lafferty, J. (2006). Correlated topic models. In *Advances in Neural Information Processing Systems* (pp. 147–154).
- Blei, D., & Lafferty, J. (2009). Topic models. In A. Srivastava & M. Sahami (Eds.), *Text mining: Classification, clustering, and applications*. Boca Raton, FL: Chapman & Hall/CRC Press.
- Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. *Journal of Machine Learning Research*, *3*, 993–1022.
- Burge, C., Campbell, A. M., & Karlin, S. (1992). Over- and under-representation of short oligonucleotides in DNA sequences. *Proceedings of the National Academy of Sciences*, *89*(4), 1358–1362.
- Campbell, W. M., & Richardson, F. S. (2007). Discriminative keyword selection using support vector machines. In *Advances in Neural Information Processing Systems*.
- Campbell, W. M., Campbell, J. P., Reynolds, D. A., Jones, D. A., & Leek, T. R. (2003). Phonetic speaker recognition with support vector machines. In *Advances in Neural Information Processing Systems*.
- Cuturi, M., & Avis, D. (2014). Ground metric learning. *Journal of Machine Learning Research*, *15*(1), 533–564.
- Cuturi, M., & Avis, D. (2011). Ground metric learning. arXiv preprint arXiv:1110.2306.
- Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I. S. (2007). Information-theoretic metric learning. In *International conference on machine learning* (pp. 209–216).
- Doddington, G. (2001). Speaker recognition based on idiolectal differences between speakers. In *Eurospeech* (pp. 2521–2524).
- Egozcue, J. J., Pawlowsky-Glahn, V., Mateu-Figueras, G., & Barceló-Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. *Mathematical Geology*, *35*(3), 279–300.
- Erhan, S., Marzolf, T., & Cohen, L. (1980). Amino-acid neighborhood relationships in proteins. Breakdown of amino-acid sequences into overlapping doublets, triplets and quadruplets. *International Journal of Bio-Medical Computing*, *11*(1), 67–75.
- Globerson, A., & Roweis, S. T. (2005). Metric learning by collapsing classes. In *Advances in Neural Information Processing Systems* (pp. 451–458).
- Goldberger, J., Roweis, S. T., Hinton, G. E., & Salakhutdinov, R. (2004). Neighbourhood components analysis. In *Advances in Neural Information Processing Systems*.
- Joachims, T. (2002). *Learning to classify text using support vector machines: Methods, theory and algorithms*. Berlin: Springer.
- Julesz, B. (1981). Textons, the elements of texture perception, and their interactions. *Nature*, *290*(5802), 91–97.
- Kedem, D., Tyree, S., Weinberger, K. Q., Sha, F., & Lanckriet, G. (2012). Nonlinear metric learning. In *Advances in Neural Information Processing Systems* (pp. 2582–2590).
- Kwok, J. T., & Tsang, I. W. (2003). Learning with idealized kernels. In *International conference on machine learning* (pp. 400–407).
- Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. *Computer Vision and Pattern Recognition*, *2*, 2169–2178.
- Le, T., & Cuturi, M. (2013). Generalized Aitchison embeddings for histograms. In *Asian conference on machine learning* (pp. 293–308).
- Le, T., Kang, Y., Sugimoto, A., Tran, S., & Nguyen, T. (2011). Hierarchical spatial matching kernel for image categorization. In *International conference on image analysis and recognition* (pp. 141–151).
- Leslie, C. S., Eskin, E., & Noble, W. S. (2002). The spectrum kernel: A string kernel for SVM protein classification. In *Pacific symposium on biocomputing* (Vol. 7, pp. 566–575).
- Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004). RCV1: A new benchmark collection for text categorization research. *Journal of Machine Learning Research*, *5*, 361–397.
- Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. *International Journal of Computer Vision*, *60*(2), 91–110.
- Madsen, R. E., Kauchak, D., & Elkan, C. (2005). Modeling word burstiness using the Dirichlet distribution. In *International conference on machine learning* (pp. 545–552).
- Nesterov, Y. (1983). A method of solving a convex programming problem with convergence rate O(1/k²). In *Soviet Mathematics Doklady* (Vol. 27, pp. 372–376).
- Nesterov, Y. (2004). *Introductory lectures on convex optimization: A basic course*. Berlin: Springer.
- O’Donoghue, B., & Candès, E. (2013). Adaptive restart for accelerated gradient schemes. *Foundations of Computational Mathematics*, *13*, 1–18.
- Perronnin, F., Sánchez, J., & Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In *Computer vision and pattern recognition* (pp. 2297–2304).
- Rennie, J. D., Shih, L., Teevan, J., & Karger, D. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In *International conference on machine learning* (Vol. 3, pp. 616–623).
- Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. *International Journal of Computer Vision*, *77*(1–3), 157–173.
- Salton, G. (1989). *Automatic text processing: The transformation, analysis, and retrieval of information by computer*. Reading: Addison-Wesley.
- Salton, G., & McGill, M. J. (1983). *Introduction to modern information retrieval*. New York: McGraw-Hill.
- Schultz, M., & Joachims, T. (2003). Learning a distance metric from relative comparisons. In *Advances in Neural Information Processing Systems* (Vol. 16, p. 41).
- Shalev-Shwartz, S., Singer, Y., & Ng, A. Y. (2004). Online and batch learning of pseudo-metrics. In *International conference on machine learning* (p. 94).
- Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In *International conference on computer vision*.
- Torresani, L., & Lee, K. (2006). Large margin component analysis. In *Advances in Neural Information Processing Systems* (pp. 1385–1392).
- Vedaldi, A., & Zisserman, A. (2012). Efficient additive kernels via explicit feature maps. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *34*(3), 480–492.
- Weinberger, K. Q., & Saul, L. K. (2008). Fast solvers and efficient implementations for distance metric learning. In *International conference on machine learning* (pp. 1160–1167).
- Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. *Journal of Machine Learning Research*, *10*, 207–244.
- Weinberger, K. Q., Blitzer, J., & Saul, L. (2006). Distance metric learning for large margin nearest neighbor classification. In *Advances in Neural Information Processing Systems* (pp. 1473–1480).
- Xing, E. P., Ng, A. Y., Jordan, M. I., & Russell, S. J. (2002). Distance metric learning with application to clustering with side-information. In *Advances in Neural Information Processing Systems*.