Output Fisher embedding regression
Abstract
We investigate the use of Fisher vector representations in the output space in the context of structured and multiple output prediction. A novel, general and versatile method called output Fisher embedding regression is introduced. Based on a probabilistic modeling of training output data and the minimization of a Fisher loss, it requires solving a preimage problem in the prediction phase. For Gaussian Mixture Models and State-Space Models, we show that the preimage problem enjoys a closed-form solution with an appropriate choice of the embedding. Numerical experiments on a wide variety of tasks (time series prediction, multi-output regression and multi-class classification) highlight the relevance of the approach for learning under limited supervision, such as learning with a handful of data per label and weakly supervised learning.
Keywords
Fisher vector, Structured output prediction, Output kernel regression, Small data regime, Weak supervision
1 Introduction
Recent years have witnessed an explosion of interest in predicting outputs with some structure, whether explicit (Bakir et al. 2007; Nowozin and Lampert 2011) or implicit (Vinyals et al. 2015; Lebret et al. 2015), in various areas such as computer vision, natural language processing and bioinformatics. It is generally acknowledged that learning to predict complex outputs raises important difficulties: (i) the output space does not enjoy a vector space structure, (ii) the output structure has to be taken into account in the model itself and in an appropriate loss function, (iii) prediction and learning phases are usually very expensive, (iv) supervision is a demanding job and data are expensive to label, (v) as a consequence, training data might be partially or weakly labeled.
Regarding the first two issues, significant progress has been made through the abundant literature on structured output prediction. The so-called energy-based learning methods (Lafferty et al. 2001; Tsochantaridis et al. 2005; LeCun and Huang 2005; Chen et al. 2015) build a score function on input/output pairs and make a prediction by extracting the output that obtains the best score. Interestingly, some of these approaches have been extended to also take into account the hidden structure that links inputs to outputs (see for instance Yu and Joachims 2009; Zhu et al. 2010). Regarding point (iv), an extension of structured output learning with latent output variables has also been developed in Vezhnevets et al. (2012) to address weakly supervised learning in the context of semantic segmentation of images.
However, although very elegant, these methods are very expensive both at the prediction and the learning stage.
Another line of research, mainly represented by the family of Output Kernel Regression (OKR) tools (Cortes et al. 2005; Geurts et al. 2006; Brouard et al. 2011; Kadri et al. 2013; Brouard et al. 2016) but also by closely related approaches based on an output distance (Kocev et al. 2013; Pugelj and Džeroski 2011), has been explored to avoid the cost of maximizing a score for each prediction. In the case of OKR methods, the structured output prediction problem is converted into a regression problem with an output feature map associated with a kernel. Thus, learning does not require solving the prediction problem in the original output space and is therefore less demanding in terms of computational load. Yet, as in energy-based methods, the prediction stage is expensive: a solution has to be found in the original output space (Kadri et al. 2013) and this is generally addressed by solving a preimage problem (Honeine and Richard 2011). A closely related approach emphasizing the role of the output feature map has also been very recently studied from the angle of regularization in Ciliberto et al. (2016).
Other works have been developed around the prediction of unstructured or semi-structured outputs such as phrases and sentences. They mainly concern automatic captioning of images or videos, for which the most relevant methods combine discriminant approaches and generative models in a single architecture (Vinyals et al. 2015) or hierarchically (Lebret et al. 2015).
Now, regarding the last two issues, (iv) and (v), there exist many fields of application where labeling requires experts and which would benefit from methods that are able to cope with less training data. Video Labeling (Ponce-López et al. 2016) and Drug Activity Prediction (Su et al. 2010) are examples of such involved tasks. In this work, we take inspiration from the field of Structured Output Prediction to define a class of methods able to take into account explicit as well as implicit structure of output data in supervised tasks. We expect that accounting for the hidden structure of the output helps when learning with a handful of labeled data, a small data regime often called one-shot learning (Fei-Fei et al. 2006) in the case of classification tasks. We propose to address the problem of complex output prediction by learning the relationship between an input and an output, the latter seen as a realization of a generative probabilistic model. For this purpose, we introduce a novel, general and versatile method called output Fisher embedding regression (OFER), which relies on the minimization of a surrogate loss built on a Fisher embedding and on a closed-form solution for the corresponding preimage problem. OFER relies on three main steps. First, a generative parametric probabilistic model of the training outputs is estimated and each training output is embedded into a vector space using the well-known Fisher vector, i.e. the gradient in the parameters of the generative model introduced in the seminal work of Amari (1998). Specifically, we use a well-chosen linear transformation of the Fisher vector as the new target, which we call the Output Fisher Embedding. The training dataset is therefore transformed and, in the second step, a vector-valued function is learned to predict the Output Fisher Embedding from the input. Finally, to make a prediction in the original output space, one has to solve a preimage problem in the third step.
By choosing an appropriate linear transformation of the Fisher vector, we show that the preimage problem admits a closed-form solution in the case of Gaussian Mixture models while the Fisher Gaussian State-Space models enjoy a recursive solution based on Viterbi algorithms. Therefore, for outputs that can be seen as realizations of one of these two models, prediction is not expensive and point (iii) is no longer an issue. Moreover, when using a Fisher embedding of a mixture model, a single training instance brings information about the prediction of outputs belonging to the same mixture component. This makes the approach well adapted to learning from a handful of training data, a variant of one-shot learning (Fei-Fei et al. 2006). Knowing how to learn under this "small data regime" is especially useful in application fields characterized by point (iv), when supervision is expensive. Finally, the interpretability of each term of the Fisher embedding in the case of the Gaussian mixture model also opens the door to a variant of weakly supervised learning. To our knowledge, Output Fisher Embedding is the first use of Fisher vectors in output embeddings, whereas Fisher kernels (inner products of Fisher vectors) (Jaakkola and Haussler 1998; Hofmann 2000; Siolas and d'Alché-Buc 2002; Perronnin et al. 2010) have been widely exploited to handle input data.
The paper is structured as follows. In Sect. 2, we introduce the OFER approach in the case of full supervision. Section 3 introduces Output Fisher Embeddings for Gaussian Mixture models and State-Space Models while Sect. 4 presents the solutions of the corresponding preimage problems. Section 5 discusses the supervised learning problem and introduces a scenario for weakly supervised learning. Section 6 presents an extended experimental study on synthetic datasets and on a large variety of tasks with 5 different real datasets. Conclusions and perspectives are drawn in Sect. 7. The Appendix contains additional experimental results and an empirical study of computation times for all tested methods in all tasks.
2 Output Fisher embedding regression
2.1 Brief reminder about Fisher scores and kernels
2.2 Regression with output Fisher embeddings
1. Defining an output feature map, here called the Output Fisher embedding, \({\phi _{Fisher}^{M}:\mathcal {Y}{\rightarrow } \mathbb {R}^p},\) indexed by a matrix M of size \((p \times m)\) and defined as: \(\phi _{Fisher}^{M}(y)= M \phi _{Fisher}(y) = M\nabla _{\hat{\theta }} \log p_{\hat{\theta }}(y)\), where:

\(p_{\theta }(y)\) is the probability density of a parametric probabilistic model of parameter \(\theta \in \mathbb {R}^m\) and \(\hat{\theta }\) is an estimate of \(\theta \) obtained from the training output set \(\mathcal {Y}_{\ell }=\{y_1, \ldots , y_{\ell }\}\). Note that in some cases, a larger dataset including \(\mathcal {Y}_{\ell }\) may be used instead of \(\mathcal {Y}_{\ell }\).

An appropriate choice of the linear transformation M allows one to simplify the preimage problem.

2. Training a learning algorithm on a class \(\mathcal {H}\) of functions \(h:\mathcal {X}\rightarrow \mathbb {R}^p\) and dataset \(\mathcal {S}_{\ell }\) to approximate the relationship between \({x \in \mathcal {X}}\) and \(\phi _{Fisher}^{M}(y)\) by minimizing
$$\begin{aligned} \lambda _0 {\varOmega }(h) + \frac{1}{2\ell }\sum _{i=1}^{\ell } L_{Fisher}^M(y_i,h(x_i)), \end{aligned}$$
where \(L_{Fisher}^M: \mathcal {Y}\times \mathbb {R}^p \rightarrow \mathbb {R}^+\) is a surrogate loss defined as: \({L_{Fisher}^M(y_i,h(x_i)) = \Vert \phi _{Fisher}^{M}(y_i) - h(x_i) \Vert ^2}\).
3. Given x, making a prediction in the original output space \(\mathcal {Y}\) by solving a preimage problem: \(y^{*}\in \arg \min _{y \in \mathcal {Y}} L_{Fisher}^M(y,h(x)) \).
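The three steps above can be sketched end to end on a toy problem. The following is a minimal illustration only, under assumptions that are not taken from the paper: the generative model is a single isotropic Gaussian, M keeps only the mean block of the unnormalized score, h is a linear ridge regressor, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_isotropic_gaussian(Y):
    """Step 1a: estimate N(mu, sigma2 * I) from the training outputs."""
    mu = Y.mean(axis=0)
    sigma2 = ((Y - mu) ** 2).mean()
    return mu, sigma2

def embed(Y, mu, sigma2):
    """Step 1b: mean block of the score, grad_mu log p(y) = (y - mu) / sigma2."""
    return (Y - mu) / sigma2

def fit_ridge(X, Z, lam):
    """Step 2: minimize lam * ||W||^2 + 1/(2 n) * sum_i ||z_i - W^T x_i||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + 2 * n * lam * np.eye(d), X.T @ Z)

def preimage(H, mu, sigma2):
    """Step 3: solve (y - mu) / sigma2 = h(x) for y in closed form."""
    return mu + sigma2 * H

rng = np.random.default_rng(0)
B = rng.normal(size=(3, 2))        # hypothetical ground-truth linear map
c = np.array([1.0, -2.0])          # hypothetical output offset
X_train = rng.normal(size=(200, 3))
X_train -= X_train.mean(axis=0)    # centred inputs, so no intercept is needed
Y_train = X_train @ B + c + 0.01 * rng.normal(size=(200, 2))

mu, sigma2 = fit_isotropic_gaussian(Y_train)
W = fit_ridge(X_train, embed(Y_train, mu, sigma2), lam=1e-8)

X_test = rng.normal(size=(20, 3))
Y_pred = preimage(X_test @ W, mu, sigma2)
Y_true = X_test @ B + c
print(np.abs(Y_pred - Y_true).max())
```

With this choice of embedding the preimage step is a single affine map, so no iterative optimization is needed at prediction time.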
In contrast, OFER directly exploits an explicit and finite-dimensional feature map for the output kernel, the Output Fisher embedding. The vector-valued function h in Step 2 can be computed using any learning method devoted to vector-valued functions, or any learning method devoted to scalar-valued functions using parallel coordinate-wise computations, as opposed to using the kernel trick in the output space.
Now, the size of matrix M (in particular the value p), involved in the surrogate Fisher loss, directly influences the time complexity of the learning phase. Strictly speaking, the Fisher vector is a very sparse representation that is unnecessarily long in the context of prediction. In this paper, we show that the case \(M=I\), which yields a classic preimage problem with no analytical solution, can be replaced by a proper choice of \(M \ne I\) in order to obtain a closed-form solution that is easier to compute (see Sect. 4) and has better empirical performance.
3 Fisher embeddings
In this paper, we illustrate the principle of output Fisher embedding regression with two generative models: the Gaussian Mixture Model, to take into account the hidden components of random vectors and the Gaussian State Space Model, to represent hidden processes in time series. In the following, we assume that the generic parameter \(\theta \) is known. A building block of these models is the Gaussian vector for which we briefly present the Fisher feature map.
3.1 Gaussian vector
Let us assume that \(Y \sim \mathcal {N}(\mu ,{\varSigma })\), where \(\mu \in \mathbb {R}^d\) and \({\varSigma }\) is a \(d\times d\) positive definite matrix, with \(d\in \mathbb {N}^*\). Denote by \(p_{\theta }(y) = \frac{1}{(2\pi )^{d/2}|{\varSigma }|^{1/2}}e^{-\frac{1}{2}(y-\mu )^T {\varSigma }^{-1}(y-\mu )}\) the probability density. Then, the Fisher feature map takes the following form:
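As a hedged illustration, the score of a multivariate Gaussian with respect to \(\mu \) and \({\varSigma }\) can be computed and checked numerically. The sketch below uses the standard unnormalized gradients, \({\varSigma }^{-1}(y-\mu )\) and \(\frac{1}{2}({\varSigma }^{-1}(y-\mu )(y-\mu )^T{\varSigma }^{-1} - {\varSigma }^{-1})\), treating the entries of \({\varSigma }\) as free parameters; any normalization or parametrization used in the paper may differ, and the function names are hypothetical.

```python
import numpy as np

def gaussian_log_density(y, mu, Sigma):
    """Log-density of N(mu, Sigma) evaluated at y."""
    d = len(mu)
    diff = y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def gaussian_fisher_score(y, mu, Sigma):
    """Gradients of the log-density with respect to mu and Sigma."""
    P = np.linalg.inv(Sigma)                      # precision matrix
    diff = y - mu
    grad_mu = P @ diff                            # vector of size d
    grad_Sigma = 0.5 * (P @ np.outer(diff, diff) @ P - P)  # d x d matrix
    return grad_mu, grad_Sigma

mu = np.array([0.5, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
y = np.array([1.0, 0.0])
grad_mu, grad_Sigma = gaussian_fisher_score(y, mu, Sigma)
print(grad_mu)
print(grad_Sigma)
```

The two blocks have sizes d and d x d, which matches the dimensions of the error terms discussed for the Gaussian preimage problem.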
3.2 Gaussian mixture model
3.3 Gaussian statespace model
4 Prediction by solving the preimage problem
4.1 Preimage for the Gaussian Fisher embedding
If the prediction h(x) were perfect, we would have \(h(x) = \phi _{Fisher}(y^{true})\); the minimum \(L_{Fisher}^{I}(\hat{y},h(x)) = 0\) would then be attained by the closed-form solution \(\hat{y}= \sigma h_1(x) + \mu \), and \(\hat{y}\) would be equal to the exact solution \(y^{true}\). However, in practice, the prediction h(x) is not perfect and the previous equation is not satisfied. Instead, there exists a pair of non-null vectors \((\epsilon _1, \epsilon _2) \in \mathbb {R}^d \times \mathbb {R}^{d^2}\) such that \(h_1(x) = \varphi _{\mu }(y^{true}) + \epsilon _1\) and \(h_2(x) = \varphi _{{\varSigma }}(y^{true}) + \epsilon _2\). Therefore, in general, the minimum of the loss function is greater than 0 and, moreover, the loss is nonconvex. The best we can do to minimize it is to apply a gradient descent method to find a local minimum. The final prediction \(\hat{y}\) is thus affected by two sources of error here: the error \(\epsilon =(\epsilon _1,\epsilon _2)\) due to the lack of accuracy of h, and the error introduced by the approximate solution of this preimage problem.
4.2 Preimage for the Gaussian mixture model Fisher embedding
\(\Vert \varphi _{\pi }(y^*) - h_{1}(x)\Vert ^2 + \sum _{j=1}^{C} \Vert \varphi _{\mu ,j}(y^*) - h_{2,j}(x)\Vert ^2 =0\),
implies that: \({\left\{ \begin{array}{ll} \varphi _{\pi }(y^*) = h_{1}(x)\\ \forall j = 1, \ldots , C, \varphi _{\mu ,j}(y^*) = h_{2,j}(x) \end{array}\right. }\)
4.3 Preimage for the Gaussian state space model Fisher embedding
5 Learning to predict output Fisher embeddings
5.1 Supervised learning
5.2 Learning in the small data regime with OFER-GMM
We would like to emphasize the relevance of our approach for learning with a handful of labeled data, a regime we call small data regime learning. As stated in Sect. 2.1, the Output Fisher Embedding provides richer information than the output alone. A labeled training example \((x_i,y_i)\), converted into \((x_i, \phi _{Fisher}^{M}(y_i))\), conveys additional information about its structure. A simple example is the Fisher vector associated with a mixture model for encoding a vector y. Let us focus on \(\varphi _{\pi }(y)\). The mixture component k to which y belongs with the highest probability corresponds to a high absolute value of the kth coordinate of \(\varphi _{\pi }(y)\). Then, if two outputs \(y_i\) and \(y_j\) belong to the same component, learning to predict \(\varphi _{\pi }(y_i)\) will also bring information about predicting \(\varphi _{\pi }(y_j)\) and vice versa. Therefore, our framework has appealing features for learning using only a handful of labeled examples per class if the target problem is classification, or a small training set if the task is regression. This is empirically tested in the experimental section.
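The behaviour of \(\varphi _{\pi }(y)\) described above can be illustrated on a toy mixture. This is a sketch under assumptions that are not from the paper: the mixture is isotropic with a shared variance, the weights are treated as unconstrained parameters so that the kth coordinate is \(\mathcal {N}(y;\mu _k,\sigma ^2 I)/p(y)\), and the function name and data are hypothetical.

```python
import numpy as np

def gmm_weight_score(y, weights, means, sigma2):
    """Gradient of log p(y) w.r.t. the mixture weights of an isotropic GMM.

    Coordinate k equals N(y; mu_k, sigma2 * I) / p(y), computed in log space.
    """
    d = means.shape[1]
    sq_dist = ((y - means) ** 2).sum(axis=1)
    log_comp = -0.5 * (d * np.log(2 * np.pi * sigma2) + sq_dist / sigma2)
    log_mix = np.logaddexp.reduce(np.log(weights) + log_comp)
    return np.exp(log_comp - log_mix)

weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [5.0, 5.0]])
y_i = np.array([0.2, -0.1])    # close to component 0
y_j = np.array([-0.3, 0.4])    # also close to component 0
y_k = np.array([5.1, 4.8])     # close to component 1

phi_i = gmm_weight_score(y_i, weights, means, sigma2=1.0)
phi_j = gmm_weight_score(y_j, weights, means, sigma2=1.0)
phi_k = gmm_weight_score(y_k, weights, means, sigma2=1.0)
print(phi_i, phi_j, phi_k)
```

Outputs drawn from the same component share nearly the same target on these coordinates, which is why a single labeled example is informative about its whole component.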
5.3 Weakly supervised learning in the case of OFER-GMM

A subsample \(\{x_1, \ldots , x_{\ell }\}\) of the training input data set has been accurately labeled by an expert supervisor, providing the corresponding training sample \(\mathcal {Y}_{\ell }=\{y_1, \ldots , y_{\ell } \}\).

The training output sample \(\mathcal {Y}_{\ell }\) is used to estimate the parameters of a (Gaussian) mixture model with C components.

A matrix M of the form described in Sect. 4.2 is defined, yielding an interpretable Output Fisher embedding where the first \(C\) coefficients encode the membership of the given y to each of the \(C\) components of the mixture.
The remaining training input data are presented to non-expert supervisors who are invited to indicate to which output component the presented input data can be associated, and who therefore only fill in the first \(C\) coefficients of the Fisher vector instead of providing the full corresponding output object. From this weak supervision, one can define an additional training set with weak labels. To describe this set, we follow the notations of Sect. 4.2 and introduce the matrix \(A=\left[ \begin{array}{cc}I_{C} &{} \mathbf 0 \\ \hline \mathbf 0 &{} \mathbf 0 _{p-C} \end{array} \right] \), and \(A^{c}= I_{p}-A\), its complement. Then the additional weak training set \(\mathcal {W}_{w}\) is defined as:
$$\begin{aligned} \mathcal {W}_{w}= \left\{ \left( x_i, A\phi _{Fisher}^{M}(y_i)\right) , i=\ell +1, \ldots , \ell +w\right\} , \end{aligned}$$
where each \(y_i, i=\ell +1, \ldots , \ell + w\) is supposed to be the true unknown output associated with each \(x_i, i=\ell +1, \ldots , \ell +w\).
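The role of A and its complement can be sketched numerically. The snippet below builds both matrices and applies them to a toy embedding; the `weak_loss` helper, which penalizes only the coordinates kept by A, is a hypothetical illustration of how weak labels could enter a loss and is not the paper's stated objective.

```python
import numpy as np

def weak_label_masks(C, p):
    """Return A = blockdiag(I_C, 0_{p-C}) and its complement A_c = I_p - A."""
    A = np.zeros((p, p))
    A[:C, :C] = np.eye(C)
    return A, np.eye(p) - A

def weak_loss(target, prediction, A):
    """Squared error restricted to the coordinates kept by A (hypothetical)."""
    return float(np.sum((A @ (target - prediction)) ** 2))

C, p = 2, 5
A, A_c = weak_label_masks(C, p)

phi = np.array([2.0, 0.0, 0.3, -0.1, 0.7])   # toy embedding of one output
weak_target = A @ phi                        # only the first C coefficients survive
print(weak_target)
print(weak_loss(weak_target, np.array([1.5, 0.5, 9.0, 9.0, 9.0]), A))
```

Because the mask zeroes the last p - C coordinates, predictions are free on those coordinates for weakly labeled examples.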
6 Numerical results
6.1 Context of the empirical study
In this section, we first explore the relevance of Output Fisher embedding regression on synthetic datasets and then investigate its behaviour on real datasets in three contexts: fully supervised learning, small data regime learning and weakly supervised learning.
To show the versatility of our framework, we study and compare OFER in a wide variety of real tasks. We first explore a structured output prediction task, related to Kogan et al. (2009), where the goal is to predict a short time series from an input text. Then, OFER is tested on supervised learning tasks without explicit structure. Numerical results are provided on two multiple output regression tasks for which labels are usually difficult to obtain: a Video Score Prediction task extracted from Ponce-López et al. (2016) and a Drug Activity Prediction task presented in Su et al. (2010). Finally, we turn an image classification problem into a Word Prediction task in an appropriate semantic space and present numerical results on two datasets, namely Caltech101^{1} and AWA.^{2} In all numerical results, unless otherwise noted, train/test splits are generated 10 times and the average performance is reported with the empirical standard deviation. We use bold font to point out the best performance in a row when it is statistically significant based on Student's t-tests with an \(\alpha \)-risk equal to \(10^{-3}\). For the sake of space, an empirical comparison of computation time in learning and test phases is presented and discussed in the Appendix.
6.1.1 Generative model estimation
Gaussian models are chosen isotropic in all experiments. The parameter \(\theta \) is estimated using a procedure based on the maximum likelihood criterion. More precisely, an Expectation-Maximization (EM) algorithm is applied to estimate the Gaussian mixture model while Kalman filtering/EM is used to estimate the State-Space Model.
6.1.2 OFER learning
In all experiments, the function h in Eq. 13 is learned using OFER implemented through Kernel Ridge Regression, invoked appropriately for multiple outputs in both the fully and weakly supervised cases. The kernel k in \(K(x,x')=Ik(x,x')\) is linear by default unless otherwise specified. For comparison in regression and time series prediction, we used three baselines: multi-output Kernel Ridge Regression (mKRR), multi-output Random Forest for regression (mRF) and variants of Input Output Kernel Regression (IOKR). For the classification task, we used multi-class SVM, a Multilayer perceptron (MLP) and IOKR. For each problem at hand, we specify which output kernel was used. IOKR is associated with a preimage problem, which is solved for each test point using a gradient descent implemented in the open-source library SciPy: https://www.scipy.org/. Apart from the synthetic dataset experiments, OFER is always used with the reduced Fisher Embedding explained in Sect. 4.
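With a kernel of the form \(K(x,x')=Ik(x,x')\), kernel ridge regression for vector-valued targets shares one Gram matrix across all output coordinates, so fitting jointly or coordinate by coordinate gives the same solution. The sketch below shows this in dual form with plain NumPy; it is not the scikit-learn code used in the experiments, the function names are hypothetical, and `lam` is assumed to absorb any factor depending on the number of training points.

```python
import numpy as np

def linear_kernel(X1, X2):
    """Gram matrix of the linear kernel k(x, x') = x^T x'."""
    return X1 @ X2.T

def fit_krr(K, Z, lam):
    """Dual coefficients (K + lam * I)^{-1} Z, shared by all output columns."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), Z)

def predict_krr(K_new_train, coef):
    """Predictions for new points from their kernel values with the training set."""
    return K_new_train @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))     # toy inputs
Z = rng.normal(size=(50, 3))     # toy embedded targets with p = 3
K = linear_kernel(X, X)

coef_joint = fit_krr(K, Z, lam=0.1)
coef_split = np.column_stack(
    [fit_krr(K, Z[:, [j]], lam=0.1) for j in range(Z.shape[1])]
)

X_new = rng.normal(size=(5, 4))
pred = predict_krr(linear_kernel(X_new, X), coef_joint)
print(np.abs(coef_joint - coef_split).max())
```

This decoupling is what allows the embedding dimension p to be handled by parallel coordinate-wise computations.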
6.1.3 Implementation
All the code is written in Python and calls the appropriate functions of the scikit-learn library^{3}, except for the Kalman filtering/EM procedure, which relies on PyKalman.^{4}
6.1.4 Time complexity
An empirical comparison of computation time in learning and test phases has been conducted for all real tasks and all the tested methods. For the sake of space, it is presented and discussed in the Appendix.
6.2 Numerical results on synthetic datasets
1. Does OFER with reduced Fisher embedding perform better than OFER with full Fisher embedding in the Gaussian context?
2. Does OFER with reduced Fisher embedding perform better than OFER with full Fisher embedding in the GMM context?
3. How does OFER-GMM performance evolve with a growing number of training data and a growing number of clusters?
Table 1 Average relative root mean squared errors (RRMSE) on test set obtained by OFER with the full Gaussian Fisher preimage solution and the reduced Gaussian Fisher preimage solution over 10 train/test splits of dataset \(\mathcal {D}_1\)

Average RRMSE on test set (mean ± std)

| Embedding / # Train | 10 | 50 | 100 |
| --- | --- | --- | --- |
| \(\phi _{Fisher}\) | 1.15 ± 0.02 | 1.01 ± 0.01 | 1.01 ± 0.01 |
| \(\phi _{Fisher}^{M}\) | \(\mathbf{1.09\pm 0.02}\) | \(\mathbf{0.89 \pm 0.004}\) | \(\mathbf{0.83 \pm 0.001}\) |
Table 2 Average training and test computation times for OFER-GMM with full and reduced Fisher embeddings over 10 train/test splits of dataset \(\mathcal {D}_1\)

Computation time (s) of Table 1

| Embedding | # Train | Training time | Test time |
| --- | --- | --- | --- |
| \(\phi _{Fisher}\) | 10 | \(3.35\times 10^{-4} \pm 7.01 \times 10^{-5}\) | \(57.1 \pm 14.7\) |
| \(\phi _{Fisher}\) | 50 | \(8.20\times 10^{-4} \pm 2.20 \times 10^{-4}\) | \(80.4 \pm 16.0\) |
| \(\phi _{Fisher}\) | 100 | \(2.45\times 10^{-3} \pm 5.75 \times 10^{-4}\) | \(67.1 \pm 21.7\) |
| \(\phi _{Fisher}^{M}\) | 10 | \(\mathbf{3.50 \times 10^{-4} \pm 8.74 \times 10^{-5}}\) | \(\mathbf{6.19 \times 10^{-7} \pm 5.74 \times 10^{-7}}\) |
| \(\phi _{Fisher}^{M}\) | 50 | \(\mathbf{7.60 \times 10^{-4} \pm 1.96 \times 10^{-4}}\) | \(\mathbf{1.19 \times 10^{-6} \pm 1.14 \times 10^{-6}}\) |
| \(\phi _{Fisher}^{M}\) | 100 | \(\mathbf{1.64 \times 10^{-3} \pm 4.06 \times 10^{-4}}\) | \(\mathbf{5.72 \times 10^{-7} \pm 5.22 \times 10^{-7}}\) |
Table 3 Relative root mean squared errors (RRMSE) obtained by OFER-GMM with the full Fisher preimage solution and the reduced Gaussian Fisher preimage solution on dataset \(\mathcal {D}_1\) for 3 values of \(C=2,4,10\)

| # modes | # Train | RRMSE ± std, \(\phi _{Fisher}\) | RRMSE ± std, \(\phi _{Fisher}^{M}\) | Test time (s), \(\phi _{Fisher}\) | Test time (s), \(\phi _{Fisher}^{M}\) |
| --- | --- | --- | --- | --- | --- |
| 2-GMM | 10 | 1.17 ± 0.09 | \(\mathbf{1.13 \pm 0.11 }\) | 124.01 ± 31.27 | \(\mathbf{0.014^*}\) |
| 2-GMM | 50 | 1.14 ± 0.06 | \(\mathbf{1.03 \pm 0.002}\) | 80.54 ± 22.68 | \(\mathbf{0.015^*}\) |
| 2-GMM | 100 | 1.18 ± 0.08 | \(\mathbf{1.03 \pm 0.002}\) | 83.30 ± 31.66 | \(\mathbf{0.016^*}\) |
| 4-GMM | 10 | \(1.50 \pm 0.19\) | \({ 1.66 \pm 0.20}\) | 276.54 ± 84.17 | \(\mathbf{0.023^*}\) |
| 4-GMM | 50 | 1.44 ± 0.08 | \(\mathbf{1.19 \pm 0.02}\) | 289.91 ± 90.78 | \(\mathbf{0.028^*}\) |
| 4-GMM | 100 | 1.33 ± 0.10 | \(\mathbf{1.19 \pm 0.01}\) | 290.70 ± 27.07 | \(\mathbf{0.025^*}\) |
| 10-GMM | 10 | 1.84 ± 1.04 | \({ 1.34 \pm 0.02}\) | 1857.38 ± 232.84 | \(\mathbf{0.12^*}\) |
| 10-GMM | 50 | 1.79 ± 1.23 | \({ 1.59 \pm 0.19}\) | 2193.76 ± 1336.08 | \(\mathbf{0.13^*}\) |
| 10-GMM | 100 | 1.65 ± 0.61 | \(\mathbf{1.27 \pm 0.01}\) | 2494.62 ± 860.46 | \(\mathbf{0.13^*}\) |
Table 4 OFER-GMM relative root mean squared error on test set according to the number of training examples and the number of components in GMMs for dataset \(\mathcal {D}_2\)

RRMSE on test set: mean ± std

| # training ex | 12 | 18 | 30 | 60 | 120 | 180 | 300 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mKRR | \(0.94 \pm 0.46\) | \(0.76 \pm 0.61\) | \(1.05 \pm 0.36\) | \(0.82 \pm 0.26\) | \(0.53\pm 0.09\) | \(0.40 \pm 0.15\) | \(0.35 \pm 0.07\) |
| OFER 1-GMM | \(0.85 \pm 0.61\) | \(0.32 \pm 0.61\) | \(0.94 \pm 0.33\) | \(0.71 \pm 0.24\) | \(0.53 \pm 0.09\) | \(0.39 \pm 0.14\) | \(0.35 \pm 0.08\) |
| OFER 2-GMM | \(0.93 \pm 0.61\) | \(0.80 \pm 0.63\) | \(1.07 \pm 0.40\) | \(0.66 \pm 0.25\) | \(0.64\pm 0.11\) | \(0.41 \pm 0.18\) | \(0.35 \pm 0.12\) |
| OFER 3-GMM | \(1.12 \pm 1.09\) | \(0.97 \pm 1.11\) | \(1.13 \pm 0.36\) | \(0.93 \pm 0.31\) | \(0.63 \pm 0.13\) | \(0.42 \pm 0.18\) | \(0.34 \pm 0.11\) |
Table 1 reports experiments on dataset \(\mathcal {D}_1\) and shows that OFER with the reduced Gaussian Fisher embedding leads to better performance than OFER with the full Gaussian embedding. We also see in Table 2 that the computation times in training and test phases differ. This is mostly due to the resolution of the preimage problem. The preimage with the reduced Fisher vector is both faster and more accurate. Similar conclusions can be drawn from Table 3, where OFER-GMM with reduced Fisher embedding exhibits better performance than OFER-GMM with full Fisher embedding.
For the last question, we generated a dataset \(\mathcal {D}_2\) similarly to \(\mathcal {D}_1\), with output data generated in \(\mathbb {R}^{5}\) from 10 clusters. Table 4 presents the relative root mean squared error obtained by OFER-GMM with \(C=1,2,3\), using an increasing number of training samples taken from a total training set containing \(50\%\) of the dataset \(\mathcal {D}_2\) and a fixed test set containing the remaining \(50\%\) of the samples. Multiple KRR is used here as a baseline. The results are averaged over 10 different splits. First, if we consider Table 4 row by row, we see that OFER-GMM behaves consistently, with a test error decreasing with the number of training data. Second, if we read the table column by column, we see that the impact of a larger number of components in the mixture depends on the number of available training data. In regression, a sufficient number of training data is needed to benefit from a more complex output model. It is also the case that a low-complexity GMM Fisher embedding brings a significant improvement when the training datasets are really small (Table 5).
Table 5 Comparison of IOKR2, OFER-GSSM, mKRR, and mRF on the time series dataset of different sizes

Relative root mean squared error, \(\pm std \le 10^{-3}\)

| Train size | IOKR2 | OFER-GSSM | mKRR | mRF |
| --- | --- | --- | --- | --- |
| 100 | 2.42 | 2.45 | 2.45 | \(\mathbf{2.37}\) |
| 200 | 2.44 | 2.46 | 2.41 | \(\mathbf{2.37}\) |
| 500 | 2.36 | \(\mathbf{2.26}\) | 2.32 | 2.34 |
| 1000 | 2.36 | \(\mathbf{2.24 }\) | 2.27 | 2.31 |
Table 6 Comparison of OFER and competitors on the text-to-time-series task in terms of 5-CV relative root mean squared error

5-CV relative root mean squared error, ± std \(\le 10^{-3}\)

| Year/coordinate | 2001 | 2002 | 2003 | 2004 | 2005 | Global |
| --- | --- | --- | --- | --- | --- | --- |
| mKRR | 1.27 | \(\mathbf{1.33}\) | 1.47 | 1.30 | 1.71 | 1.42 |
| mRF | 1.19 | 1.37 | 1.41 | 1.32 | 1.20 | 1.30 |
| IOKR | 1.16 | 1.34 | 1.32 | 1.30 | 1.24 | 1.27 |
| IOKR-DTW | 1.20 | 1.38 | 1.33 | 1.32 | 1.15 | 1.28 |
| IOKR-GAK | 1.20 | 1.38 | 1.32 | 1.29 | \(\mathbf{1.13}\) | 1.26 |
| OFER-GSSM | \(\mathbf{1.11}\) | 1.37 | \(\mathbf{1.31}\) | \(\mathbf{1.28}\) | 1.17 | \(\mathbf{1.25}\) |
6.3 From text to time series
6.3.1 Task description
In finance, institutional actors provide legal reports such as the so-called "10-K forms" to evaluate the inner risk that a given company bears in the near future (usually 6 months ahead). Predicting the market risk indicator for a company from its legal reports is of high interest since it provides a means to take into account the dependency of the market on unquantifiable objects through a numerical indicator. Let us recall the definition of the short-return log-volatility at time t for a given company j: \(y_{t,j} = \log \left( \sqrt{\sum _{i=1}^{\tau }(r_{t-i,j}-\bar{r}_j)^2}\right) ,\) where \(r_{i,j} = \frac{p_{i,j}-p_{i-1,j}}{p_{i-1,j}}\) is the return at time i for an asset of company j, and \(p_{i,j}\) is the price of the asset. The output is the log-volatility over the time period from \(t-\tau \) to t.
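The log-volatility target can be sketched in a few lines. This is an illustration under two assumptions that are not stated explicitly in the text: deviations from the mean are squared, as in the usual definition of volatility, and \(\bar{r}_j\) is taken to be the mean return over the window. The function names and the price series are hypothetical.

```python
import math

def simple_returns(prices):
    """Return r_i = (p_i - p_{i-1}) / p_{i-1} for consecutive prices."""
    return [(p1 - p0) / p0 for p0, p1 in zip(prices, prices[1:])]

def log_volatility(rets):
    """log(sqrt(sum_i (r_i - mean)^2)) over the returns of one window."""
    mean = sum(rets) / len(rets)
    sum_sq = sum((r - mean) ** 2 for r in rets)
    return math.log(math.sqrt(sum_sq))

prices = [100.0, 102.0, 101.0, 103.0]   # toy price series for one asset
rets = simple_returns(prices)
y = log_volatility(rets)
print(rets)
print(y)
```

Note that a window with constant returns has zero dispersion, for which the logarithm is undefined; real data would need a guard for that case.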
To test our approach, we use the dataset described in Kogan et al. (2009), for which we modify both inputs and outputs to get a set of reports as input and a time series as output. The dataset \(\mathcal {S}=\{(x_i,y_i)\}\) contains 500 companies; for each one we have the 10-K form reports from 2001 to 2005 and the short-return log-volatility following each period with no overlap. We design a new learning task that consists of predicting a short-return log-volatility time series of length 5 from the five 10-K form text reports corresponding to the 5 years, from 2001 to 2005. Given a company i, each yearly text report is represented using the classic TF-IDF with a dictionary of size \(70\times 10^{3}\), yielding an input vector \(x_i\) of size \(35\times 10^{4}\), while \(y_i\) is a vector of size 5.
6.3.2 Fully supervised learning
To apply OFER, we model the short output training time series with a linear state-space model. We use \(L_{Fisher}^M\), and the block-diagonal matrix M is written as \(M = diag(M_0,M_1,M_2,\ldots )\), as defined in Eqs. 11 and 12. OFER based on a Gaussian State-Space Model is called OFER-GSSM. As announced, we compare OFER-GSSM with mKRR, mRF and IOKR. In mKRR, IOKR and OFER-GSSM, the kernel k was chosen linear, giving rise to \(K(x,x') = Id\times k(x,x') = Id \times (x^Tx')\) with our notations of Eq. 13. As for the output kernel of IOKR, we have made three choices: a Gaussian kernel (IOKR) to simply take into account nonlinear relationships between outputs, and two kernels devoted to time-series similarity: FastDTW introduced in Salvador and Chan (2004) (IOKR-DTW) and the Global Alignment Kernel defined in Cuturi et al. (2007) (IOKR-GAK). The last two kernels take into account the time series structure and provide a fair comparison with our method, whereas mKRR, mRF and IOKR do not take into account the hidden structure of the output time series. Note that in the single-output regression problem solved in Kogan et al. (2009), KRR exhibits the same performance as Support Vector Regression (SVR). In Table 6, 5-fold Cross-Validation Relative Root Mean Squared Errors (5-CV RRMSE) are reported with standard deviations. Standard deviations are omitted whenever they are smaller than \(10^{-3}\). We observe that OFER-GSSM globally outperforms the other methods. We also give the details of the 5-CV RRMSE, coordinate by coordinate.
Moreover, Table 16 in the Appendix, which reports the computation time of all methods for the training and test phases, shows that OFER-GSSM exhibits a test time nearly as short as the fastest method (mRF) for a training phase that costs half of the training time of mRF. OFER-GSSM provides here the best trade-off between accuracy and computational time.
6.4 Multi-output regression tasks
6.4.1 Scores prediction in personality videos
Task description
The second dataset is the development set of the First Impression Chalearn Challenge^{5} (Ponce-López et al. 2016). It contains 6000 videos of 15-second interviews annotated with the Big 5 Traits, which correspond to 5 continuous scores widely used in the personality analysis literature (John and Srivastava 1999). These annotations were obtained by fitting a Bradley-Terry-Luce model on pairwise preferences given by annotators from Amazon Mechanical Turk. For each video, the task is then to predict 5 personality scores belonging to the interval [0, 1] that have been inferred from a set of pairwise preferences among experts.
In this study, we only focus on videos for which at least one of the personality scores is far from the mean. We expect that these scores correspond to a strong agreement during the annotation step and could have been given by an expert. Relying on this assumption, we create two datasets of different sizes by rejecting the samples for which no personality score belongs to the interval \([0,1-t]\cup [t,1]\), for different values of the agreement threshold t. In the following experiments, we choose the thresholds \(t_1 = 0.9\) and \(t_2 = 0.75\). The two choices lead to the construction of two training datasets \(S_1\) and \(S_2\), of size 100 and 1000 respectively, with \(S_1 \subset S_2\). The remaining set of 5000 examples is split in two. One half is used as a common test set and the other as a set from which we draw data with weak labels.
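The selection rule can be sketched as follows, reading the acceptance region as \([0,1-t]\cup [t,1]\), which is the reading under which the stricter threshold \(t_1 = 0.9\) yields the smaller set. The function name and the toy scores are hypothetical.

```python
def select_confident(scores, t):
    """Indices of samples with at least one score in [0, 1 - t] or [t, 1]."""
    return [
        i for i, sample in enumerate(scores)
        if any(v <= 1 - t or v >= t for v in sample)
    ]

# Toy data: five videos, each with 5 personality scores in [0, 1].
scores = [
    (0.95, 0.5, 0.5, 0.5, 0.5),   # one very high score
    (0.5, 0.8, 0.5, 0.5, 0.5),    # one moderately high score
    (0.5, 0.5, 0.5, 0.5, 0.5),    # all scores near the middle
    (0.05, 0.6, 0.4, 0.5, 0.5),   # one very low score
    (0.4, 0.6, 0.2, 0.5, 0.5),    # one moderately low score
]

S1 = select_confident(scores, t=0.9)
S2 = select_confident(scores, t=0.75)
print(S1, S2)
```

Lowering t enlarges the acceptance region, so every sample kept at the stricter threshold is also kept at the looser one.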
Following the baseline proposed during the challenge, we encode each video by taking its first frame and computing the fc8 feature map representation of the BVLC CaffeNet model.^{6} We use (Gaussian) Kernel Ridge Regression (mKRR) in the baseline approach as well as in OFER-GMM. We also use as baselines the Input and Output Kernel Regression (IOKR) and the multi-output Random Forest (mRF). For IOKR, we took the linear kernel as both input and output kernel, which proved to be the best choice for this task. All the hyperparameters are selected using a 5-CV procedure.
Table 7 Comparison of (Gaussian) mKRR, mRF, IOKR and OFER-GMM on the two video-to-score prediction tasks

RRMSE on test set: mean ± std

| Task | mKRR | IOKR | mRF | OFER-GMM |
| --- | --- | --- | --- | --- |
| Task 1 | 0.44 \(\pm ~ 4\times 10^{-2}\) | 0.44 \(\pm ~ 6\times 10^{-2}\) | 0.43 \(\pm ~ 2\times 10^{-2}\) | 0.44 \(\pm ~ 4\times 10^{-2}\) |
| Task 2 | 0.42 \(\pm ~ 4\times 10^{-2}\) | 0.42 \(\pm ~ 4\times 10^{-2}\) | 0.43 \(\pm ~ 2\times 10^{-2}\) | 0.42 \(\pm ~ 4\times 10^{-2}\) |
We see evidence that the four approaches perform equally well, meaning that in this fully supervised context, OFER-GMM does not bring any advantage. The relative ease of the task may explain this result.
Impact of the number of weakly supervised examples on weak OFER-GMM measured by RRMSE on test sets. GMM is omitted for the sake of clarity

RRMSE on test set: mean ± std

| Task | OFER-GMM | wOFER+644 | wOFER+1288 | wOFER+2500 |
| --- | --- | --- | --- | --- |
| Task 1 | 0.44 \(\pm 4\times 10^{-2}\) | 0.36 \(\pm 2\times 10^{-3}\) | 0.31 \(\pm 1\times 10^{-4}\) | \(\mathbf{0.29} \pm \mathbf{1}\times \mathbf{10}^{\mathbf{-4}}\) |
| Task 2 | 0.42 \(\pm 4\times 10^{-2}\) | 0.35 \(\pm 2\times 10^{-3}\) | 0.28 \(\pm 1\times 10^{-3}\) | \(\mathbf{0.127} \pm \mathbf{1}\times \mathbf{10}^{\mathbf{-4}}\) |
Table 8 presents results obtained with a Gaussian mixture model whose number of components and all other hyperparameters have been selected by grid search with a cross-validation procedure on the training set. Adding weakly supervised examples that only provide the membership to one of the two components of the Gaussian mixture model drastically improves the performance of OFER. The information conveyed by these examples consolidates the learned function.
6.4.2 Drug activity prediction
Comparison of (Gaussian) mKRR, mRF, IOKR and OFER-GMM on drug activity prediction

RRMSE on test set: mean ± std (%)

| Train size | mKRR | IOKR | mRF | OFER-GMM |
| --- | --- | --- | --- | --- |
| 10 | 0.24 ± 0.006 | 0.24 ± 0.006 | 0.23 ± 0.033 | \(\mathbf{0.22} \pm \mathbf{0.009}\) |
| 20 | 0.22 ± 0.004 | 0.22 ± 0.004 | 0.22 ± 0.018 | \(\mathbf{0.21} \pm \mathbf{0.008}\) |
| 100 | 0.22 ± 0.003 | 0.22 ± 0.003 | 0.21 ± 0.007 | \(\mathbf{0.20} \pm \mathbf{0.004}\) |
| 300 | 0.20 ± 0.002 | 0.20 ± 0.002 | 0.19 ± 0.004 | \(\mathbf{0.19} \pm \mathbf{0.002}\) |
| 500 | 0.19 ± 0.002 | 0.19 ± 0.002 | 0.19 ± 0.003 | 0.19 ± 0.001 |
Impact of the number of weakly supervised examples on weak OFER-GMM measured by RRMSE on test sets. GMM is omitted for the sake of clarity

RRMSE on test set: mean ± std

| Train size | OFER-GMM | wOFER+100 | wOFER+500 | wOFER+800 |
| --- | --- | --- | --- | --- |
| 10 | 0.22 ± 0.009 | \(0.20^*\) | \(0.20^*\) | \(\mathbf{0.19}^{*}\) |
| 20 | 0.21 ± 0.008 | \(0.20^*\) | \(0.19^*\) | \(0.19^{*}\) |
| 100 | 0.20 ± 0.004 | \(0.20^*\) | \(0.19^*\) | \(\mathbf{0.18}^{*}\) |
| 300 | 0.19 ± 0.002 | \(0.18^*\) | \(0.18^*\) | \(\mathbf{0.17}^{*}\) |
| 500 | 0.19 ± 0.001 | \(0.18^*\) | \(0.17^*\) | \(\mathbf{0.16}^{*}\) |
For this problem, OFER yields slightly better performance in terms of RRMSE and also smaller standard deviations. Notably, the number of components of the mixture model selected by cross-validation is 6, which corresponds to 6 clear levels of scores when observing the data.
6.5 Multiclass classification
Task description
We now address an image multiclass classification problem cast into a word prediction task. The idea is that object classes have semantics and that, by replacing classes by names (words), we allow the learning algorithm to take advantage of these semantics. We considered two real datasets: the first one is the well-known Caltech101 image dataset, consisting of images of 101 classes, and the second one is the Animals With Attributes dataset (AWA, Akata et al. 2016), consisting of images of 50 animal classes. In both cases, the final goal is to automatically predict the name of the animal/object present in a given image. Note that for the AWA dataset, our purpose is not to work on information transfer but only to have a dataset where the names of the objects might be meaningful to cluster. We therefore do not use the attribute information about animals studied in Akata et al. (2016). We also emphasize that the binary nature of the Attribute Label Embedding and Hierarchical Label Embedding proposed in Akata et al. (2016) does not lead to a dense output space where our method applies. The semantic representation of each class is built in three steps.

First, we built the “Wiki corpus” by scraping the wiki pages containing the textual descriptions of the 101 (resp. 50) classes. We have kept only the introductory paragraph.

Second, for each word of the “Wiki corpus” we retrieve its GloVe (Pennington et al. 2014) semantic representation, that is, a 50-dimensional vector for each word. We also convert the corpus into a TF-IDF matrix.

Third, we represent each class as a weighted combination of the GloVe vectors of its words, with the TF-IDF values as weights.
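The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the TF-IDF variant (relative term frequency times \(\log(N/\mathrm{df})\)), the unnormalized weighted sum, the helper names `tfidf_weights` and `class_embedding`, and the 2-dimensional toy vectors standing in for the 50-dimensional GloVe vectors are all hypothetical choices, not the exact pipeline used in the experiments.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {word: tf-idf weight} dict per tokenized, non-empty document."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def class_embedding(doc_weights, word_vectors, dim):
    """TF-IDF-weighted sum of the word vectors of one class description."""
    embedding = [0.0] * dim
    for word, weight in doc_weights.items():
        vector = word_vectors.get(word)
        if vector is None:  # words without a pretrained vector are skipped
            continue
        for k in range(dim):
            embedding[k] += weight * vector[k]
    return embedding


# Toy stand-ins for the introductory paragraphs and the pretrained vectors.
descriptions = {
    "cat": ["small", "furry", "animal"],
    "whale": ["large", "marine", "animal"],
}
word_vectors = {
    "small": [1.0, 0.0], "furry": [0.0, 1.0], "animal": [1.0, 1.0],
    "large": [-1.0, 0.0], "marine": [0.0, -1.0],
}

names = list(descriptions)
weights = tfidf_weights([descriptions[name] for name in names])
embeddings = {
    name: class_embedding(w, word_vectors, dim=2)
    for name, w in zip(names, weights)
}
print(embeddings)
```

In this toy corpus the word shared by both descriptions receives a zero TF-IDF weight, so each class vector is driven only by its distinctive words.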
Learning in small data regime
Our framework (OFER-GMM) is compared to two simple multiclass classifiers, SVM with a one-versus-all strategy (mSVM) and multiclass Random Forest (mRF), and to a multiple-output Kernel Ridge Regression with linear kernel (SemKRR) working in the semantic output space \(\mathcal {Y}\). Let us first report the performance of multiclass Random Forest: on Caltech101, from 1 to 10 examples per class, the classification accuracy of mRF did not exceed 1.6% with a std of 0.45. On AWA, the mean accuracy varies from 10 to 18%, which is again very low and far from the performance of the other methods.
For both datasets, multiclass Random Forest has been tested but it suffers from the limited number of training examples per label. We also made comparisons with a method devoted to structured outputs: Input Output Kernel Regression (IOKR). IOKR is implemented here using an identity-decomposable operator-valued kernel defined on the input space and a scalar-valued output kernel, here taken to be a Gaussian kernel on the semantic space \(\mathcal {Y}\). As the number of target classes is limited, the minimization problem involved in the preimage problem is solved exactly.
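Solving the preimage problem exactly over a finite set of classes amounts to an exhaustive search. The sketch below assumes that the prediction in the output feature space is a weighted combination \(\sum_i w_i(x)\,\psi(y_i)\) of the embedded training outputs; with a Gaussian output kernel, \(k(y,y)=1\), so minimizing the feature-space distance is equivalent to maximizing \(\sum_i w_i(x)\,k(y_i,y)\) over the candidate classes. The weights, class vectors and the function name `decode` are hypothetical.

```python
import math


def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2) between two vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def decode(weights, train_outputs, candidates, gamma=1.0):
    """Return the candidate label maximizing sum_i w_i * k(y_i, y)."""
    def score(label):
        y = candidates[label]
        return sum(
            w * gaussian_kernel(y_i, y, gamma)
            for w, y_i in zip(weights, train_outputs)
        )
    return max(candidates, key=score)


# Toy semantic vectors of the candidate classes (hypothetical values).
candidates = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.2],
    "whale": [-1.0, 0.0],
}
# Semantic vectors of three training outputs and the weights w_i(x)
# that a learned model would assign to them for a new input x.
train_outputs = [candidates["cat"], candidates["whale"], candidates["whale"]]
weights = [0.1, 0.6, 0.5]

print(decode(weights, train_outputs, candidates))  # prints "whale"
```

The cost of this search grows linearly with the number of candidate classes, which is why it is only practical when that number is limited.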
We also tried to compare our method with methods from the deep learning community (Vinyals et al. 2016; Sohn et al. 2015). In Sohn et al. (2015), the authors developed a deep conditional generative model for structured output variables using Gaussian latent variables (CVAE). However, their approach estimates distributions from very large datasets to capture global regularities. In contrast, we designed OFER as a framework tailored for small to medium-sized datasets that also allows the user to take advantage of weak annotations. Consequently, an empirical comparison with CVAE is not really meaningful, since we do not work on the same type of problems. When trying to apply CVAE in the small data regime, the algorithm does not converge to a solution and gives an accuracy close to 0%.
Nonetheless, in order to have a deep learning baseline, we attempted to fine-tune the pretrained VGG19 network, and obtained our best results with all the convolutional layers frozen. We refer to this solution as MLP.
Comparison between OFER-GMM, SemKRR, SemIOKR, Multilayer Perceptron (MLP) and SVM on Caltech101 with a growing number of labeled examples per class

Classification accuracy on test set: mean ± std (%), Caltech101

| #ex/class | mSVM | MLP | SemIOKR | SemKRR | OFER-GMM |
| --- | --- | --- | --- | --- | --- |
| 1 | 9.61 ± 3.98 | 10.80 ± 3.78 | 13.40 ± 2.22 | 14.83 ± 4.02 | \(\mathbf{38.22} \pm \mathbf{2.87}\) |
| 2 | 23.24 ± 2.03 | 12.73 ± 4.46 | 18.80 ± 2.10 | 19.96 ± 2.68 | \(\mathbf{43.89} \pm \mathbf{2.24}\) |
| 3 | 33.89 ± 1.79 | 17.25 ± 3.18 | 22.51 ± 1.81 | 22.71 ± 2.33 | \(\mathbf{46.33} \pm \mathbf{2.44}\) |
| 4 | 42.23 ± 1.8 | 25.14 ± 1.89 | 22.83 ± 1.41 | 24.52 ± 1.93 | \(\mathbf{48.41} \pm \mathbf{2.25}\) |
| 5 | 47.63 ± 2.87 | 27.04 ± 2.45 | 24.90 ± 1.27 | 25.91 ± 1.28 | \(\mathbf{49.40} \pm \mathbf{2.09}\) |
| 7 | \(\mathbf{55.19 \pm 2.43}\) | 35.84 ± 3.48 | 26.84 ± 0.92 | 27.42 ± 1.59 | 50.39 ± 2.04 |
| 10 | \(\mathbf{58.55 \pm 1.84}\) | 44.93 ± 3.80 | 31.27 ± 1.84 | 29.49 ± 1.39 | 50.49 ± 1.07 |
Comparison between OFER-GMM, SemKRR, SemIOKR, MLP, mSVM and mRF on the AWA dataset with a growing number of labeled examples per class

Classification accuracy on test set: mean ± std (%), AWA

| #ex/class | mSVM | MLP | SemIOKR | SemKRR | OFER-GMM |
| --- | --- | --- | --- | --- | --- |
| 1 | \(1.91 \pm 0.13\) | \(22.18 \pm 3.98\) | \(0.26 \pm 0.17\) | \(21.85 \pm 3.04\) | \(\mathbf{23.00} \pm \mathbf{3.11}\) |
| 2 | \(2.12 \pm 0.15\) | \(30.62 \pm 3.59\) | \(6.43 \pm 1.15\) | \(29.77 \pm 2.33\) | \(\mathbf{31.05} \pm \mathbf{2.17}\) |
| 3 | \(6.33 \pm 1.12\) | \(34.06 \pm 3.65\) | \(14.93 \pm 1.56\) | \(33.21 \pm 2.02\) | \(\mathbf{34.79} \pm \mathbf{2.11}\) |
| 4 | \(12.87 \pm 0.70\) | \(38.02 \pm 2.31\) | \(20.81 \pm 1.94\) | \(37.01 \pm 1.97\) | \(\mathbf{37.91} \pm \mathbf{2.07}\) |
| 5 | \(18.53 \pm 1.69\) | \(39.36 \pm 2.04\) | \(26.26 \pm 1.74\) | \(39.97 \pm 1.75\) | \(\mathbf{40.74} \pm \mathbf{1.70}\) |
| 7 | \(26.85 \pm 2.24\) | \(41.36 \pm 1.86\) | \(33.82 \pm 1.37\) | \(42.89 \pm 0.93\) | \(\mathbf{43.24} \pm \mathbf{0.58}\) |
| 10 | \(37.36 \pm 1.14\) | \(43.88 \pm 1.80\) | \(42.61 \pm 1.3\) | \(44.21 \pm 1.00\) | \(\mathbf{45.23} \pm \mathbf{1.00}\) |
| 11 | \(40.55 \pm 0.56\) | \(44.50 \pm 2.38\) | \(44.91 \pm 1.4\) | \(45.67 \pm 0.94\) | \(\mathbf{46.26} \pm \mathbf{1.08}\) |
| 14 | \(47.00 \pm 1.00\) | \(46.25 \pm 1.57\) | \(\mathbf{49.7 \pm 0.89}\) | 47.08 ± 1.24 | 47.38 ± 1.29 |
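These results show that OFER outperforms the other baselines when only a few labeled examples per class are available: up to 5 examples per class on Caltech101 and up to 11 on AWA. With more labeled examples, mSVM takes the lead on Caltech101 and SemIOKR on AWA.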
Weakly supervised learning
Impact of weak supervision (wOFER+ size of the weakly supervised dataset) on OFER for the classification task, using weakly supervised datasets of growing size (700, 1500, 2500)

Classification accuracy on test set: mean ± std (%), Caltech101

| #ex/class | mSVM | OFER-GMM | wOFER+700 | wOFER+1500 | wOFER+2500 |
| --- | --- | --- | --- | --- | --- |
| 1 | \(9.61 \pm 3.98\) | \(38.22 \pm 2.87\) | \(39.19 \pm 0.20\) | \(39.90 \pm 0.20\) | \(\mathbf{40.15} \pm \mathbf{0.20}\) |
| 2 | \(23.24 \pm 2.03\) | \(43.89 \pm 2.24\) | \(44.35 \pm 0.01\) | \(44.82 \pm 0.02\) | \(\mathbf{45.16} \pm \mathbf{0.01}\) |
| 3 | \(33.89 \pm 1.79\) | \(46.33 \pm 2.44\) | \(47.01 \pm 0.01\) | \(47.56 \pm 0.01\) | \(\mathbf{47.83} \pm \mathbf{0.01}\) |
| 4 | \(42.23 \pm 1.8\) | \(48.41 \pm 2.25\) | \(48.83 \pm 0.01\) | \(49.32 \pm 0.01\) | \(\mathbf{50.24} \pm \mathbf{0.01}\) |
| 5 | \(47.63 \pm 2.87\) | \(49.40 \pm 2.09\) | \(50.32 \pm 0.01\) | \(51.13 \pm 0.01\) | \(\mathbf{51.65} \pm \mathbf{0.01}\) |
| 7 | \(\mathbf{55.19} \pm \mathbf{2.43}\) | 50.39 ± 2.04 | 51.04 ± 0.01 | 51.63 ± 0.01 | 52.35 ± 0.01 |
| 10 | \(\mathbf{58.55} \pm \mathbf{1.84}\) | 50.49 ± 1.07 | 51.14 ± 0.01 | 52.13 ± 0.01 | 52.83 ± 0.01 |
Impact of weak supervision (wOFER+ size of the weakly supervised dataset) on OFER for AWA, using weakly supervised datasets of growing size (700, 1500, 2500)

Classification accuracy on test set: mean ± std (%), AWA

| #ex/class | MLP | OFER-GMM | wOFER+700 | wOFER+1500 | wOFER+2500 |
| --- | --- | --- | --- | --- | --- |
| 1 | 22.18 ± 3.98 | 23.00 ± 3.11 | 24.19 ± 0.10 | 24.90 ± 0.09 | \(\mathbf{25.67} \pm \mathbf{0.10}\) |
| 2 | 30.62 ± 3.59 | 31.05 ± 2.17 | 32.71 ± 0.01 | 33.36 ± 0.01 | \(\mathbf{34.04} \pm \mathbf{0.01}\) |
| 3 | 34.06 ± 3.65 | 34.79 ± 2.11 | 36.22 ± 0.01 | 37.03 ± 0.01 | \(\mathbf{37.63} \pm \mathbf{0.01}\) |
| 4 | 38.02 ± 2.31 | 37.91 ± 2.07 | 38.53 ± 0.01 | 39.32 ± 0.01 | \(\mathbf{40.14} \pm \mathbf{0.01}\) |
| 5 | 39.36 ± 2.04 | 40.74 ± 1.70 | 41.35 ± 0.01 | 42.22 ± 0.02 | \(\mathbf{43.16} \pm \mathbf{0.01}\) |
| 7 | 41.36 ± 1.86 | 43.24 ± 0.58 | 44.00 ± 0.01 | 44.96 ± 0.01 | \(\mathbf{45.83} \pm \mathbf{0.01}\) |
| 10 | 43.88 ± 1.80 | 45.23 ± 1.00 | 46.13 ± 0.01 | 47.18 ± 0.01 | \(\mathbf{47.84} \pm \mathbf{0.01}\) |
| 11 | 44.50 ± 2.38 | 46.26 ± 1.08 | 47.37 ± 0.01 | 48.02 ± 0.01 | \(\mathbf{48.94} \pm \mathbf{0.01}\) |
| 14 | 46.25 ± 1.57 | 47.38 ± 1.29 | 49.04 ± 0.01 | 49.63 ± 0.01 | \(\mathbf{50.35} \pm \mathbf{0.01}\) |
7 Conclusion
Output Fisher embedding regression is a general framework able to account for explicit and implicit structure in output data by exploiting Fisher vectors in the output space. We have shown that the corresponding preimage problems admit closed-form solutions in the case of Gaussian Mixture Models and Gaussian State-Space Models, allowing for a fast prediction phase compared to Structured Output Prediction methods such as IOKR.
OFER has been applied to a wide variety of tasks, showing its versatility. Beyond Structured Output Regression such as time series prediction, OFER is also of interest for regression and multiclass classification tasks when the training dataset is small. OFER-GMM, based on the Fisher embedding of a Gaussian mixture model, exhibits especially interesting behaviour in multiclass classification tasks when learning from a handful of data per label. The Fisher embedding seems to remedy the lack of examples per label, surpassing the performance of the other approaches, including those that take advantage of a semantic encoding. It is also important to note the flexibility of OFER-GMM, whose number of components, selected by cross-validation, adapts to the observed outputs.
Additionally, the interpretability of OFER-GMM outputs opens the door to learning under weak supervision, avoiding the cost of expert labeling. This is observed in multi-output regression as well as in multiclass classification tasks.
We identify at least two attractive use cases for OFER. The first one is in Structured Output Prediction, when one wants to avoid an expensive prediction phase while accounting for the structure. The second one is in the small data regime, for any problem that can be cast into multi-output regression where the training outputs can be clustered.
Future work first concerns improving the whole approach by jointly learning the parametric probabilistic model and the predictive function. Second, this paper has focused on vectorial outputs and time series, but the framework is more general. An attractive direction is to apply this approach to more complex outputs in applications where the training dataset is often very limited, such as bioinformatics and chemoinformatics.
Acknowledgements
The authors are very grateful to Slim Essid, Chloé Clavel (LTCI, Télécom ParisTech) and Zoltán Szabó (CMAP, Ecole Polytechnique) for fruitful discussions about this work. Moussab Djerrab is supported by the Télécom ParisTech Machine Learning for Big Data Chair.
References
Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2016). Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 1425–1438.
Álvarez, M. A., Rosasco, L., & Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3), 195–266.
Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Bakir, G., Hofmann, T., Schölkopf, B., Smola, A., Taskar, B., & Vishwanathan, S. (2007). Predicting structured data. Cambridge: MIT Press.
Brouard, C., d’Alché-Buc, F., & Szafranski, M. (2011). Semi-supervised penalized output kernel regression for link prediction. In International conference on machine learning (ICML) (pp. 593–600).
Brouard, C., d’Alché-Buc, F., & Szafranski, M. (2016). Input output kernel regression. Journal of Machine Learning Research, 17(176), 1–48.
Chen, L., Schwing, A. G., Yuille, A. L., & Urtasun, R. (2015). Learning deep structured models. In Proceedings of the 32nd international conference on machine learning, ICML 2015 (pp. 1785–1794).
Ciliberto, C., Rosasco, L., & Rudi, A. (2016). A consistent regularization approach for structured prediction. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems 29 (pp. 4412–4420). New York: Curran Associates Inc.
Cortes, C., Mohri, M., & Weston, J. (2005). A general regression technique for learning transductions. In International conference on machine learning (ICML) (pp. 153–160).
Cuturi, M., Vert, J., Birkenes, Ø., & Matsui, T. (2007). A kernel for time series based on global alignments. In IEEE international conference on acoustics, speech and signal processing (ICASSP), 2007.
Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 594–611.
Geurts, P., Wehenkel, L., & d’Alché-Buc, F. (2006). Kernelizing the output of tree-based methods. In International conference on machine learning (ICML) (pp. 345–352).
Geurts, P., Wehenkel, L., & d’Alché-Buc, F. (2007). Gradient boosting for kernelized output spaces. In Machine learning, proceedings of the twenty-fourth international conference (ICML 2007), Corvallis, Oregon, USA, June 20–24, 2007 (pp. 289–296).
Hofmann, T. (2000). Learning the similarity of documents: An information-geometric approach to document retrieval and categorization. In S. A. Solla, T. K. Leen, & K. Müller (Eds.), Advances in neural information processing systems 12 (pp. 914–920). MIT Press.
Honeine, P., & Richard, C. (2011). Preimage problem in kernel-based machine learning. IEEE Signal Processing Magazine, 28(2), 77–88.
Hou, Y., Hsu, W., Lee, M. L., & Bystroff, C. (2003). Efficient remote homology detection using local structure. Bioinformatics, 19(17), 2294.
Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. In M. J. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems 11 (pp. 487–493). Cambridge: MIT Press.
John, O. P., & Srivastava, S. (1999). The big five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of Personality: Theory and Research, 2(1999), 102–138.
Kadri, H., Ghavamzadeh, M., & Preux, P. (2013). A generalized kernel approach to structured output learning. In International conference on machine learning (ICML) (pp. 471–479).
Kocev, D., Vens, C., Struyf, J., & Dzeroski, S. (2013). Tree ensembles for predicting structured outputs. Pattern Recognition, 46(3), 817–833.
Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., & Smith, N. A. (2009). Predicting risk from financial reports with regression. In Proceedings of the human language technologies: NAACL’09.
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the international conference on machine learning (ICML).
Lebret, R., Pinheiro, P. H. O., & Collobert, R. (2015). Phrase-based image captioning. In International conference on machine learning (ICML) (pp. 2085–2094).
LeCun, Y., & Huang, F. (2005). Loss functions for discriminative training of energy-based models. In Proceedings of the 10th international workshop on artificial intelligence and statistics (AIStats’05).
Micchelli, C. A., & Pontil, M. A. (2005). On learning vector-valued functions. Neural Computation, 17, 177–204.
Nowozin, S., & Lampert, C. H. (2011). Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4), 185–365.
Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with Fisher vectors on a compact feature set. In Proceedings of the IEEE international conference on computer vision (pp. 1817–1824).
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical methods in natural language processing (EMNLP) (pp. 1532–1543).
Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In Proceedings of the 11th European conference on computer vision: Part IV, ECCV’10 (pp. 143–156). Berlin, Heidelberg: Springer.
Ponce-López, V., Chen, B., Oliu, M., Corneanu, C., Clapés, A., Guyon, I., Baró, X., Escalante, H. J., & Escalera, S. (2016). Chalearn lap 2016: First round challenge on first impressions - dataset and results. In Computer vision–ECCV 2016 workshops (pp. 400–418). Berlin: Springer.
Pugelj, M., & Džeroski, S. (2011). Predicting structured outputs k-nearest neighbours method. In Proceedings of the 14th international conference on discovery science, DS’11 (pp. 262–276). Berlin, Heidelberg: Springer.
Salvador, S., & Chan, P. (2004). Fastdtw: Toward accurate dynamic time warping in linear time and space. In KDD workshop on mining temporal and sequential data. Citeseer.
Siolas, G., & d’Alché-Buc, F. (2002). Mixtures of probabilistic pcas and Fisher kernels for word and document modeling. In ICANN 2002 (pp. 769–776).
Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems 28 (pp. 3483–3491). Curran Associates, Inc.
Su, H., Heinonen, M., & Rousu, J. (2010). Structured output prediction of anti-cancer drug activity. In International conference on pattern recognition in bioinformatics (PRIB) (pp. 38–49). Berlin: Springer.
Sydorov, V., Sakurada, M., & Lampert, C. H. (2014). Deep fisher kernels - End to end learning of the Fisher kernel GMM parameters. In 2014 IEEE conference on computer vision and pattern recognition, CVPR (pp. 1402–1409).
Tsochantaridis, I., Joachims, T., Hofmann, T., & Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
Vezhnevets, A., Ferrari, V., & Buhmann, J. M. (2012). Weakly supervised structured output learning for semantic segmentation. In 2012 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 845–852). IEEE.
Vinyals, O., Blundell, C., Lillicrap, T. P., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. CoRR. arXiv:1606.04080.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In The IEEE conference on computer vision and pattern recognition (CVPR), June 2015.
Yu, C. J., & Joachims, T. (2009). Learning structural svms with latent variables. In A. P. Danyluk, L. Bottou, & M. L. Littman (Eds.), Proceedings of the 26th annual international conference on machine learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009 (Vol. 382, pp. 1169–1176). ACM.
Zhu, L., Chen, Y., Yuille, A. L., & Freeman, W. T. (2010). Latent hierarchical structural learning for object detection. In The twenty-third IEEE conference on computer vision and pattern recognition, CVPR 2010 (pp. 1062–1069).