1 Introduction

Recent years have witnessed an explosion of interest in predicting outputs with some structure, whether explicit (Bakir et al. 2007; Nowozin and Lampert 2011) or implicit (Vinyals et al. 2015; Lebret et al. 2015), in areas such as computer vision, natural language processing and bioinformatics. It is generally acknowledged that learning to predict complex outputs raises important difficulties: (i) the output space does not enjoy a vector space structure, (ii) the output structure has to be taken into account both in the model itself and in an appropriate loss function, (iii) the prediction and learning phases are usually very expensive, (iv) supervision is demanding and data are expensive to label, and (v) as a consequence, training data might be partially or weakly labeled.

Regarding the first two issues, significant progress has been made through the abundant literature on structured output prediction. The so-called energy-based learning methods (Lafferty et al. 2001; Tsochantaridis et al. 2005; LeCun and Huang 2005; Chen et al. 2015) build a score function on input/output pairs and make a prediction by extracting the output that obtains the best score. Interestingly, some of these approaches have been extended to also take into account the hidden structure that links inputs to outputs (see for instance Yu and Joachims 2009; Zhu et al. 2010). Regarding point (iv), an extension of structured output learning with latent output variables has also been developed in Vezhnevets et al. (2012) to address weakly supervised learning in the context of semantic segmentation of images.

However, although very elegant, these methods are very expensive at both the prediction and the learning stages.

Another line of research, mainly represented by the family of Output Kernel Regression (OKR) tools (Cortes et al. 2005; Geurts et al. 2006; Brouard et al. 2011; Kadri et al. 2013; Brouard et al. 2016) but also by close approaches based on an output distance (Kocev et al. 2013; Pugelj and Džeroski 2011), has been explored to avoid the cost of maximizing a score for each prediction. In the case of OKR methods, the structured output prediction problem is converted into a regression problem with an output feature map associated with a kernel. Thus, learning does not require solving the prediction problem in the original output space and is therefore less demanding in terms of computational load. Yet, as in energy-based methods, the prediction stage is expensive: a solution has to be found in the original output space (Kadri et al. 2013), and this is generally addressed by solving a pre-image problem (Honeine and Richard 2011). A closely related approach emphasizing the role of the output feature map has also recently been studied from the angle of regularization in Ciliberto et al. (2016).

Other works have been developed around the prediction of unstructured or semi-structured outputs such as phrases and sentences. They mainly concern automatic captioning in images or videos for which the most relevant methods combine discriminant approaches and generative models in a single architecture (Vinyals et al. 2015) or hierarchically (Lebret et al. 2015).

Now regarding the last two issues, (iv) and (v), there exist many fields of application where labeling requires experts and which would benefit from methods able to cope with less training data. Video Labeling (Ponce-López et al. 2016) and Drug Activity Prediction (Su et al. 2010) are examples of such involved tasks. In this work, we take inspiration from the field of Structured Output Prediction to define a class of methods able to take into account the explicit as well as the implicit structure of output data in supervised tasks. We expect that accounting for the hidden structure of the output helps when learning with a handful of labeled data, a small data regime often called one-shot learning (Fei-Fei et al. 2006) in the case of classification tasks. We propose to address the problem of complex output prediction by learning the relationship between an input and an output, the latter seen as a realization of a generative probabilistic model. For this purpose, we introduce a novel, general and versatile method called output Fisher embedding regression (OFER), based on the minimization of a surrogate loss built on a Fisher embedding and on a closed-form solution for the corresponding pre-image problem. OFER relies on three main steps. First, a generative parametric probabilistic model of the training outputs is estimated and each training output is embedded into a vector space using the well-known Fisher vector, i.e. the gradient of the log-likelihood with respect to the parameters of the generative model, introduced in the seminal work of Amari (1998). Specifically, we use a well-chosen linear transformation of the Fisher vector as the new target and call it the Output Fisher Embedding. The training dataset is transformed accordingly and, in the second step, a vector-valued function is learned to predict the Output Fisher Embedding from the input. Eventually, to make a prediction in the original output space, a pre-image problem is solved in the third step. By selecting an appropriate linear transformation of the Fisher vector, we show that the pre-image problem admits a closed-form solution in the case of Gaussian Mixture models, while Fisher embeddings of Gaussian State-Space models enjoy a recursive solution based on Viterbi-like algorithms. Therefore, for outputs that can be seen as realizations of one of these two models, prediction is not expensive and point (iii) is no longer an issue. Moreover, when using the Fisher embedding of a mixture model, a single training instance brings information about the prediction of outputs belonging to the same mixture component. This makes the approach well suited to learning from a handful of training data, a variant of one-shot learning (Fei-Fei et al. 2006). Knowing how to learn under this ”small data regime” is especially useful in application fields characterized by point (iv), where supervision is expensive. Eventually, the interpretability of each term of the Fisher embedding in the case of a Gaussian mixture model also opens the door to a variant of weakly supervised learning. To our knowledge, Output Fisher Embedding is the first use of Fisher vectors as output embeddings, whereas Fisher kernels (inner products of Fisher vectors) (Jaakkola and Haussler 1998; Hofmann 2000; Siolas and d’Alché-Buc 2002; Perronnin et al. 2010) have been widely exploited to handle input data.

The paper is structured as follows. In Sect. 2, we introduce the OFER approach in the case of full supervision. Section 3 introduces Output Fisher Embeddings for Gaussian Mixture models and State-Space models, while Sect. 4 presents the solutions of the corresponding pre-image problems. Section 5 discusses the supervised learning problem and introduces a scenario for weakly supervised learning. Section 6 presents an extended experimental study on synthetic datasets and on a large variety of tasks with 5 different real datasets. Conclusions and perspectives are drawn in Sect. 7. The Appendix contains additional experimental results and an empirical study of computation times for all tested methods on all tasks.

2 Output Fisher embedding regression

2.1 Brief reminder about Fisher scores and kernels

Fisher vectors allow one to encode any realization \(z\in \mathbb {R}^d\) of a parametric probabilistic model as a feature vector. Given a model \(p_{\theta }\) of parameter vector \(\theta \), the Fisher vector is defined as:

$$\begin{aligned} \phi _{Fisher}(z) = \nabla _{\theta } \log p_{\theta }(z). \end{aligned}$$

A Fisher vector encodes in each of its coordinates the contribution of each parameter to the value of the probability density at z, \(p_{\theta }(z)\). We call the function \(\phi _{Fisher}:\mathcal {Y}\rightarrow \mathbb {R}^m\) the Fisher feature map. In supervised learning, Fisher vectors have been associated with kernel machines through the so-called Fisher kernel to improve the encoding of the input characteristics:

$$\begin{aligned} k_{Fisher}(z,z')= \phi _{Fisher}(z)^TI^{-1}\phi _{Fisher}(z'), \end{aligned}$$

where \(I= \mathbb {E}_{z\sim p_{\theta }}[\phi _{Fisher}(z)\phi _{Fisher}(z)^T]\) is the Fisher information matrix, i.e. the covariance matrix of the Fisher score. Referring to the work by Amari in differential geometry and the idea of natural gradient (Amari 1998), \(k_{Fisher}\) is indeed the proper inner product to compare two samples while taking into account the manifold of the parameter space of their probability distribution. In practice, \(I^{-1}\) is either empirically estimated from the training data or even omitted, giving rise to the naïve Fisher kernel: \( k_{Fisher}(z,z')= \phi _{Fisher}(z)^T\phi _{Fisher}(z')\). Support Vector Machines (SVM) based on (naïve) Fisher kernels have been shown to provide accurate classification in many fields such as remote protein homology detection (Hou et al. 2003), document categorization (Hofmann 2000; Siolas and d’Alché-Buc 2002) or, more recently, image classification (Perronnin et al. 2010; Oneata et al. 2013). A recent work (Sydorov et al. 2014) has also bridged Fisher Kernel SVMs and Convolutional Neural Networks, emphasizing the similarity between the two approaches for image classification. In this work, we propose a completely different use of Fisher vectors, in order to embed outputs with some structure in an appropriate feature space.
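As an illustration, the following minimal Python sketch (ours, not part of the original method) computes the analytic Fisher score of a univariate Gaussian \(\mathcal {N}(\mu ,\sigma ^2)\) and the corresponding naïve Fisher kernel, with the information matrix omitted:

```python
import numpy as np

def fisher_score_gaussian(z, mu=0.0, sigma=1.0):
    """Gradient of log N(z; mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = (z - mu) / sigma ** 2
    d_sigma = (z - mu) ** 2 / sigma ** 3 - 1.0 / sigma
    return np.array([d_mu, d_sigma])

def naive_fisher_kernel(z1, z2, mu=0.0, sigma=1.0):
    """Naive Fisher kernel: plain inner product of the two Fisher scores."""
    return fisher_score_gaussian(z1, mu, sigma) @ fisher_score_gaussian(z2, mu, sigma)

print(naive_fisher_kernel(0.3, -0.7))   # scalar similarity between two samples
```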

2.2 Regression with output Fisher embeddings

Denote \(\mathcal {X}\) (resp. \(\mathcal {Y}\)) the input set (resp. output set) of a prediction problem. Assume that both \(\mathcal {X}\) and \(\mathcal {Y}\) are sample spaces. Let us call \(\mathcal {S}_{\ell } =\{(x_i,y_i), i=1, \ldots , \ell \}\), the training set. Inspired by the framework of Output Kernel Regression (Cortes et al. 2005; Geurts et al. 2006), Input Output Kernel Regression (Brouard et al. 2016) and the recent work of Ciliberto et al. (2016), Output Fisher Embedding Regression (OFER) aims at encoding the output data into a vector space where the inner product is relevant, helping the learning task at hand. This transformation of the output variable yields a new surrogate loss, able to take into account the inherent structure of the output.

Fig. 1 Steps of output Fisher embedding regression

The method, depicted in Fig. 1, relies on three steps:

  1.

    Defining an output feature map, called here the Output Fisher embedding, \({\phi _{Fisher}^{M}:\mathcal {Y}{\rightarrow } \mathbb {R}^p},\) indexed by a matrix M of size \((p \times m)\) and defined as: \(\phi _{Fisher}^{M}(y)= M \phi _{Fisher}(y) = M\nabla _{\hat{\theta }} \log p_{\hat{\theta }}(y)\), where:

    • \(p_{\theta }(y)\) is the probability density of a parametric probabilistic model of parameter \(\theta \in \mathbb {R}^m\) and \(\hat{\theta }\) is an estimate of \(\theta \) obtained from the training output set \(\mathcal {Y}_{\ell }=\{y_1, \ldots , y_{\ell }\}\). Note that in some cases, a larger dataset including \(\mathcal {Y}_{\ell }\) may be used instead of \(\mathcal {Y}_{\ell }\).

    • An appropriate choice of the linear transformation M allows one to simplify the pre-image problem.

  2.

    Training a learning algorithm on a class \(\mathcal {H}\) of functions \(h:\mathcal {X}\rightarrow \mathbb {R}^p\) and dataset \(\mathcal {S}_{\ell }\) to approximate the relationship between \({x \in \mathcal {X}}\) and \(\phi _{Fisher}^{M}(y)\) by minimizing

    $$\begin{aligned} \lambda _0 {\varOmega }(h) + \frac{1}{2\ell }\sum _{i=1}^{\ell } L_{Fisher}^M(y_i,h(x_i)), \end{aligned}$$

    where \(L_{Fisher}^M: \mathcal {Y}\times \mathbb {R}^p \rightarrow \mathbb {R}^+\) is a surrogate loss defined as:

    \({L_{Fisher}^M(y_i,h(x_i)) = \Vert \phi _{Fisher}^{M}(y_i) - h(x_i) \Vert ^2}\).

  3.

    Given x, making a prediction in the original output space \(\mathcal {Y}\) by solving a pre-image problem: \(~y^{*}\in \arg \min _{y \in \mathcal {Y}} L_{Fisher}^M(y,h(x)) \).

OFER can be seen as an instance of the general framework of Output Kernel Regression with the output Fisher kernel \(k_{Fisher}\). Let us recall that in Output Kernel Regression (OKR), one makes use of the kernel trick in the output space, allowing one to predict complex outputs with a squared loss applied in the output feature space. If h is the function to be learned, for a given pair (x, y), the local square loss \(L_{OKR}(y,h(x))= \Vert \phi (y) - h(x)\Vert ^2_{\mathcal {F}_{Y}}\) can be computed using the kernel \(k(y,y')=\langle \phi (y),\phi (y') \rangle _{\mathcal {F}_{Y}}\) as soon as h(x) can be written in terms of the \(\phi (y_i),i \in \{1, \ldots , n\}\). This is the case for the following methods: (i) Kernel Dependency Estimation (KDE) (Cortes et al. 2005), the first example of how to use two scalar-valued kernels, one defined on the input space and the other on the output space, (ii) Input Output Kernel Regression (Brouard et al. 2011, 2016), which can be seen as a generalization of KDE where the input kernel is an operator-valued kernel and the output kernel a scalar-valued one, (iii) all Output Kernel Tree-based approaches such as OK3 (Geurts et al. 2006) and OKBoost (Geurts et al. 2007), which extend tree-based methods for regression to kernelized outputs. However, all these methods require solving a difficult pre-image problem.

In contrast, OFER directly exploits an explicit and finite-dimensional feature map for the output kernel, the Output Fisher embedding. Instead of using the kernel trick in the output space, the vector-valued function h in Step 2 can be learned using any method devoted to vector-valued functions, or by any method devoted to scalar-valued functions applied coordinate-wise in parallel.

Now, the size of the matrix M (in particular the value p), involved in the surrogate Fisher loss, directly influences the time complexity of the learning phase. Strictly speaking, the Fisher vector is a very sparse representation that is unnecessarily long in the context of prediction. In this paper, we show that the case \(M=I\), which yields a classic pre-image problem with no analytical solution, can be replaced by a proper choice of \(M \ne I\) in order to obtain a closed-form solution, easier to compute (see Sect. 4) and with better empirical performance.

3 Fisher embeddings

In this paper, we illustrate the principle of output Fisher embedding regression with two generative models: the Gaussian Mixture Model, to take into account the hidden components of random vectors and the Gaussian State Space Model, to represent hidden processes in time series. In the following, we assume that the generic parameter \(\theta \) is known. A building block of these models is the Gaussian vector for which we briefly present the Fisher feature map.

3.1 Gaussian vector

Let us assume that \(Y \sim \mathcal {N}(\mu ,{\varSigma })\), with \(\mu \in \mathbb {R}^d\) and \({\varSigma }\) a \(d\times d\) positive definite matrix, \(d\in \mathbb {N}^*\). Denote \(p_{\theta }(y) = \frac{1}{(2\pi )^{d/2}|{\varSigma }|^{1/2}}e^{-\frac{1}{2}(y-\mu )^T {\varSigma }^{-1}(y-\mu )}\) the probability density. Then, the Fisher feature map takes the following form:

\(\forall y \in \mathcal {Y}\),

$$\begin{aligned} \begin{aligned} \phi _{Fisher}(y) = \begin{pmatrix} \phi _{\mu }(y)\\ \phi _{{\varSigma }}(y) \end{pmatrix} = \begin{pmatrix} {\varSigma }^{-1}(y-\mu )\\ vec\left( -{\varSigma }^{-1}+ \phi _{\mu }(y)\phi _{\mu }(y)^T\right) \end{pmatrix}. \\ \end{aligned} \end{aligned}$$
(1)

Let us notice that \(\phi _{{\varSigma }}(y)\) can be directly derived from the value of \(\phi _{\mu }(y)\). Figure 2 intends to provide a geometric intuition of the Fisher embedding.
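A minimal NumPy sketch of Eq. 1 (illustrative only; the function name is ours) may help fix ideas:

```python
import numpy as np

def gaussian_fisher_embedding(y, mu, Sigma):
    """Fisher feature map of Eq. 1 for y ~ N(mu, Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    phi_mu = Sigma_inv @ (y - mu)                        # block associated with the mean
    phi_Sigma = -Sigma_inv + np.outer(phi_mu, phi_mu)    # block associated with the covariance
    return np.concatenate([phi_mu, phi_Sigma.ravel()])   # vec(.) stacking

mu, Sigma = np.zeros(2), 0.5 * np.eye(2)
print(gaussian_fisher_embedding(np.array([0.3, -0.1]), mu, Sigma).shape)  # (d + d^2,) = (6,)
```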

Fig. 2 Gaussian Fisher embeddings for 500 samples of \(\mathcal {N}(0,0.5)\) and of \(\mathcal {N}\left( \left( {\begin{array}{c}0\\ 0\end{array}}\right) ,5\times Id \right) \). The left-hand figure shows the histogram of a 1D Gaussian distribution and the central figure plots the corresponding Fisher vectors for each sample. The right figure is the same plot for a 2D Gaussian. In both figures, red highlights the points lying in the central interval \([\pm \, \sigma ]\) (Color figure online)

3.2 Gaussian mixture model

Let us now consider a Gaussian Mixture Model in \(\mathbb {R}^d\) with the following probability density \(p_{\theta }\):

$$\begin{aligned} p_{\theta }(y) = \sum _{i=1}^{C} \pi _i p_{\theta _i}(y), \end{aligned}$$

where each component has a Gaussian density: \(p_{\theta _i}(y) = \mathcal {N}(\mu _i,{\varSigma }_i) \) and \(\theta = ((\pi _j,\theta _j),j=1, \ldots , C)\). Hence, the Fisher vector writes as:

$$\begin{aligned} \begin{aligned} \phi _{Fisher}(y) =&\begin{pmatrix} \varphi _{\pi }(y)\\ \varphi _{\mu }(y)\\ \varphi _{{\varSigma }}(y) \end{pmatrix}, \end{aligned} \end{aligned}$$

where \(\varphi _{\pi }(y) \in \mathbb {R}^C\), \(\varphi _{\mu }(y) \in \mathbb {R}^{Cd}\) and \(\varphi _{{\varSigma }}(y) \in \mathbb {R}^{Cd^2}\). We have:

$$\begin{aligned} \varphi _{\pi }(y)= & {} \nabla _{\pi }(\log p_{\theta }(y)) \\= & {} \left( \frac{p_{\theta _{1}}(y)}{p_{\theta }(y)}, \ldots , \frac{p_{\theta _{C}}(y)}{p_{\theta }(y)} \right) ^T, \end{aligned}$$

and, using the previous notation for the Gaussian Fisher embedding:

$$\begin{aligned} \varphi _{\mu ,j}(y)= & {} \nabla _{\mu _j}(\log p_{\theta }(y))= \pi _j\frac{p_{\theta _j}(y)}{p_{\theta }(y)} \phi _{\mu _j}(y),~\forall j=1, \ldots , C \\ \varphi _{{\varSigma },j}(y)= & {} vec\left( \pi _j\frac{p_{\theta _j}(y)}{p_{\theta }(y)} \phi _{{\varSigma }_j}(y)\right) ,~\forall j=1, \ldots , C. \end{aligned}$$

The jth coefficient of the vector \(\varphi _{\pi }(y)\) is non-negative and represents the contribution of y to the jth component. If y did not contribute to this component at all, the value of the derivative would be equal to zero. As for the other parts of the Fisher vector, only the derivatives related to the component that most likely generated y have an absolute value far from 0. The Fisher vector is therefore generally very sparse for a GMM.
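For concreteness, a possible NumPy/SciPy implementation of this GMM Fisher vector (a sketch under our own naming) reads:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_fisher_embedding(y, pis, mus, Sigmas):
    """Stacks (phi_pi, phi_mu_1..C, vec(phi_Sigma_1)..vec(phi_Sigma_C)) for a GMM."""
    C = len(pis)
    dens = np.array([multivariate_normal.pdf(y, mus[j], Sigmas[j]) for j in range(C)])
    p_y = float(np.dot(pis, dens))          # mixture density p_theta(y)
    phi_pi = dens / p_y                     # gradient w.r.t. the mixing weights
    phi_mu, phi_Sigma = [], []
    for j in range(C):
        Sj_inv = np.linalg.inv(Sigmas[j])
        g_mu = Sj_inv @ (y - mus[j])        # Gaussian phi_{mu_j}(y)
        w = pis[j] * dens[j] / p_y          # pi_j p_{theta_j}(y) / p_theta(y)
        phi_mu.append(w * g_mu)
        phi_Sigma.append((w * (-Sj_inv + np.outer(g_mu, g_mu))).ravel())
    return np.concatenate([phi_pi, *phi_mu, *phi_Sigma])
```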

3.3 Gaussian state-space model

Gaussian State-Space Models allow one to represent partially observed, continuous-valued stochastic processes. They are defined by the two following equations:

$$\begin{aligned} m_t= & {} Am_{t-1}+Q\eta _t ~~~~~\text{(State } \text{ equation) },\\ z_t= & {} Cm_t+Ru_t ~~~~~\text{(Observation } \text{ equation) },\\ \end{aligned}$$

where \(z_t\in \mathbb {R}^{p}\) is the observation at time t and \(m_t \in \mathbb {R}^{d}\) is the hidden state at time t. \(A\in \mathbb {R}^{d\times d}\), \(Q\in \mathbb {R}^{d\times d}\), \(C\in \mathbb {R}^{p\times d}\) and \(R\in \mathbb {R}^{p\times p}\) are matrices to be estimated. The noise \(\eta _t\) (resp. \(u_t\)) is assumed to be independently and identically distributed according to \(\mathcal {N}(0,I_{d})\) (resp. \(\mathcal {N}(0,I_{p})\)). At each time step, the observations and the hidden states are thus distributed as follows:

$$\begin{aligned} m_{t}~\sim ~&\mathcal {N}(Am_{t-1},Q), \\ z_{t}~\sim ~&\mathcal {N}(Cm_t,R). \end{aligned}$$

From these assumptions, we derive the Fisher vector for \(y^T=(z_1, \ldots , z_T, m_1, \ldots , m_T)\):

$$\begin{aligned} \phi _{Fisher}(y)^T = (\phi _{0,1}(y),\phi _{1,1}(y),\ldots ,\phi _{1,T}(y), \phi _{2,1}(y),\ldots ,\phi _{2,T}(y)), \end{aligned}$$
(2)

where \(\phi _{0,1}(y)= \nabla _{\mu ,{\varSigma }} \log p(m_1)\) is the Fisher vector corresponding to the initial hidden state, \(\phi _{1,t}(y) =\nabla _{(Am_{t-1},Q)}\log (p(m_t|m_{t-1})),~t=2,\ldots , T\), is the Fisher vector corresponding to the transition probability density between hidden states, and \(\phi _{2,t}(y)=\nabla _{(Cm_{t},R)}\log (p(z_t|m_{t}))\), for \(t=2,\ldots ,T\), is the Fisher vector for the emission probability density.

4 Prediction by solving the pre-image problem

Now let us assume that \(\theta \) is known and the function h has been learned. Let us consider an input/output pair \((x,y^{true})\). In order to predict the value of \(y^{true}\), we want to find \(y^*\) such that:

$$\begin{aligned} \text {Find}~y^* \in \arg \min _{y \in \mathcal {Y}} \Vert M \phi _{Fisher}(y) - h(x) \Vert ^2. \end{aligned}$$
(3)

In the following, we study the pre-image problem and show sufficient conditions on the input representation h(x) such that the minimization problem in Eq. 3 admits a unique closed-form solution in the case of the Gaussian and Gaussian Mixture models, and can be solved recursively in the case of a Gaussian State-Space Model. As a preliminary step, we study the simple case of a Gaussian model, which is not of interest in itself but is a building block for the two other pre-image problems.

4.1 Pre-image for the Gaussian Fisher embedding

In the case of the Gaussian Fisher embedding, we have \(h(x)=\begin{pmatrix} h_1(x) \\ h_2(x)\end{pmatrix}\) with \(h_1(x) \in \mathbb {R}^d\) and \(h_2(x) \in \mathbb {R}^{d^2}\). From this representation h(x), we want to make a prediction \(y^*\) of \(y^{true}\) by solving the pre-image problem for the full Fisher embedding:

$$\begin{aligned} \begin{aligned} \hat{y}\in \underset{y\in \mathcal {Y}}{argmin~}L_{Fisher}^{I}(y,h(x)), \end{aligned} \end{aligned}$$
(4)

where \(L_{Fisher}^{I}(y,h(x)) = \Vert {\varSigma }^{-1}(y-\mu )-h_1(x)\Vert ^2 +\Vert vec({\varSigma }^{-1}(y-\mu )(y-\mu )^T{\varSigma }^{-1}-{\varSigma }^{-1})-h_2(x)\Vert ^2\).

If the prediction h(x) were perfect, we would have \(h(x) = \phi _{Fisher}(y^{true})\); then the minimum \(L_{Fisher}^{I}(\hat{y},h(x)) = 0\) would be attained by the closed-form solution \(\hat{y}= {\varSigma } h_1(x) + \mu \), and \(\hat{y}\) would be equal to the exact solution \(y^{true}\). However, in practice, the prediction h(x) is not perfect and the previous equation is not satisfied. Instead we have: \(\exists (\epsilon _1, \epsilon _2) \in \mathbb {R}^d \times \mathbb {R}^{d^2}\), a pair of non-null vectors, such that \(h_1(x) = \varphi _{\mu }(y^{true}) + \epsilon _1\) and \(h_2(x) = \varphi _{{\varSigma }}(y^{true}) + \epsilon _2\). Therefore, in general, the minimum of the loss function is greater than 0 and, moreover, the loss is non-convex. The best we can do to minimize it is to apply a gradient descent method to find a local minimum. The final prediction \(\hat{y}\) is thus affected by two sources of error: the error \(\epsilon =(\epsilon _1,\epsilon _2)\) due to the lack of accuracy of h, and the error driven by the approximate solution of this pre-image problem.

Another approach consists in taking advantage of the re-parameterization of \(\phi _{Fisher}\) pointed out in Sect. 3.1. We discard \(\varphi _{{\varSigma }}(y)\), which can be reconstructed from a proxy of \(\varphi _{\mu }(y)\) if needed, and we now learn \(g_1:\mathcal {X}\rightarrow \mathbb {R}^d\) by solving:

$$\begin{aligned} g_1 = \arg \min _{f} \lambda _0 {\varOmega }(f) + \frac{1}{2\ell }\sum _{i=1}^{\ell }\Vert \phi _{Fisher}^{M}(y_i) - f(x_i)\Vert ^2 \end{aligned}$$
(5)

with

$$\begin{aligned} \phi _{Fisher}^{M}(y) = \varphi _{\mu }(y)= {\varSigma }^{-1}(y-\mu ), \end{aligned}$$

and M is the matrix that projects the complete \(\phi _{Fisher}(y)\) onto its first d coordinates. In this case, to solve the pre-image problem for a given x, we only have to find the solution of the following problem:

$$\begin{aligned} \begin{aligned} y^{*}\in \underset{y\in \mathcal {Y}}{argmin~} \Vert \phi _{Fisher}^{M}(y) - g_1(x)\Vert ^2 \end{aligned} \end{aligned}$$
(6)

that admits a closed-form solution:

$$\begin{aligned} y^{*}= {\varSigma }g_1(x) + \mu . \end{aligned}$$
(7)

If one assumes that \(g_1(x) = \varphi _{\mu }(y^{true}) + \epsilon _1'\), the solution \(y^{*}\) is thus only affected by the prediction error \(\epsilon _1'\). More precisely, we have:

$$\begin{aligned} \Vert y^{true} - y^{*}\Vert = \Vert {\varSigma }\epsilon _1'\Vert \end{aligned}$$
(8)

We expect that the error of the reduced Fisher embedding, \({\varSigma }\epsilon _1'\), is easier to control than the two other errors appearing in the full Fisher embedding pre-image problem, for which no analytical form is available. In Sect. 6.2, we show empirical comparisons in Table 1 on simulated data to illustrate this claim. The reduced variant of the Fisher embedding is the one used in the real-data experiments.
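The closed-form pre-image of Eq. 7 is straightforward to implement; the short sketch below (ours) also checks the round-trip property when \(g_1(x)\) is exact:

```python
import numpy as np

def gaussian_reduced_preimage(g1_x, mu, Sigma):
    """Closed-form pre-image of Eq. 7: y* = Sigma g1(x) + mu."""
    return Sigma @ g1_x + mu

# round-trip check: if g1(x) equals the reduced embedding of y_true, then y* = y_true
mu = np.zeros(2)
Sigma = np.array([[0.5, 0.1], [0.1, 0.5]])
y_true = np.array([1.0, -2.0])
g1_x = np.linalg.solve(Sigma, y_true - mu)           # phi_mu(y_true) = Sigma^{-1}(y_true - mu)
assert np.allclose(gaussian_reduced_preimage(g1_x, mu, Sigma), y_true)
```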

4.2 Pre-image for the Gaussian mixture model Fisher embedding

Assume a Gaussian Mixture model with \(C\) components in \(\mathbb {R}^d\). Again, for any input vector x we denote \(h(x)= (h_1(x), h_2(x),h_3(x))\), where \(h_1(x)\) is a C-dimensional vector, \(h_2(x)\) is a Cd block vector that can be decomposed into C vectors \(h_{2,j}(x), j=1, \ldots , C\) of dimension d, and \(h_3(x)\) is another block vector that can be decomposed into C vectors \(h_{3,j}(x), j=1, \ldots , C\) of dimension \(d^2\). In the case \(M= I\), the pre-image for the Gaussian Mixture Model Fisher embedding consists in minimizing the following expression with respect to \(y\in \mathcal {Y}\):

$$\begin{aligned} \begin{aligned} L_{Fisher}^{I}(y,h(x))&= \Vert \varphi _{\pi }(y) - h_{1}(x)\Vert ^2 + \sum _{j=1}^{C} \Vert \varphi _{\mu ,j}(y ) - h_{2,j}(x)\Vert ^2 \\&+ \sum _{j=1}^{C}\Vert \varphi _{{\varSigma },j}(y)- h_{3,j}(x)\Vert ^2. \end{aligned} \end{aligned}$$
(9)

Similarly to the Gaussian Fisher embedding, we notice that, when omitting the third term, we get a problem with a closed-form solution. Indeed:

\(\Vert \varphi _{\pi }(y^*) - h_{1}(x)\Vert ^2 + \sum _{j=1}^{C} \Vert \varphi _{\mu ,j}(y^*) - h_{2,j}(x)\Vert ^2 =0\),

implies that: \({\left\{ \begin{array}{ll} \varphi _{\pi }(y^*) = h_{1}(x)\\ \forall j = 1, \ldots , C, \varphi _{\mu ,j}(y^*) = h_{2,j}(x) \end{array}\right. }\)

which leads to the following system of equations:

$$\begin{aligned} {\left\{ \begin{array}{ll} \varphi _{\pi }(y^*) = h_{1}(x)\\ \sum _{j=1}^C \varphi _{\mu ,j}(y^*) = \sum _{j=1}^C h_{2,j}(x) \end{array}\right. }. \end{aligned}$$

According to the definition of \(\varphi _{\pi }\) and \(\varphi _{\mu }\) and to Eq. 1, we have:

$$\begin{aligned} {\left\{ \begin{array}{ll} \forall j = 1, \ldots , C, \frac{p_{\theta _{j}}(y^*)}{p_{\theta }(y^*)}= h_{1,j}(x) \\ \sum _{j=1}^C \pi _j\frac{p_{\theta _j}(y^*)}{p_{\theta }(y^*)} {\varSigma }_j^{-1}(y^*-\mu _j) = \sum _{j=1}^C h_{2,j}(x) \end{array}\right. }, \end{aligned}$$

with the following closed-form solution:

$$\begin{aligned} y^{*}= \left( \sum _{j=1}^{C}\pi _j h_{1,j}(x) {\varSigma }^{-1}_j \right) ^{-1} \left( \sum _{j=1}^{C} h_{2,j}(x) + \left( \sum _{j=1}^{C}\pi _j h_{1,j}(x) {\varSigma }^{-1}_j\mu _j \right) \right) . \end{aligned}$$
(10)

To exploit the hidden structure of output data, we therefore propose to use the following (reduced) Fisher embedding for the GMM:

$$\begin{aligned} \phi _{Fisher}^{M}(y) = (\varphi _{\pi }(y), \sum _{j=1}^C \varphi _{\mu ,j}(y))^T = M \phi _{Fisher}(y), \end{aligned}$$

with the block matrix \(M= \begin{pmatrix} M_1 &{} 0 &{} 0\\ 0 &{} M_2 &{} 0\\ 0 &{} 0 &{} 0\\ \end{pmatrix}\), where \(M_1=I_C\) and \(M_2\), a block matrix of size \(d \times dC\), is defined as \(M_2= (I_d~ I_d \ldots I_d)\). This embedding still takes into account the mixture structure while drastically reducing the dimension of the outputs from \(C(1+d+d^2)\) to (\(C+d\)) and allowing for a closed-form pre-image. As for the Gaussian Fisher embedding, the prediction error \(\Vert y^{true}- y^{*}\Vert \) associated with \(y^{*}\) only comes from the learning phase, while the (local) minimizer of the full Fisher pre-image loss in Eq. 9 conveys two sources of error: the one induced by the learning of h and the one resulting from the approximate resolution of the full pre-image problem itself. This remark is illustrated in Sect. 6.2, where Tables 3 and 4 present an empirical comparison of the performance of OFER with the full Fisher embedding for GMM and with the reduced Fisher embedding on toy datasets.
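A minimal implementation of the closed-form pre-image of Eq. 10 could look as follows (a sketch; variable names are ours, and h2_sum denotes the predicted proxy of \(\sum _{j} \varphi _{\mu ,j}(y)\)):

```python
import numpy as np

def gmm_reduced_preimage(h1, h2_sum, pis, mus, Sigmas):
    """Closed-form pre-image of Eq. 10 for the reduced GMM Fisher embedding.

    h1:     predicted proxy of phi_pi(y), shape (C,)
    h2_sum: predicted proxy of sum_j phi_{mu,j}(y), shape (d,)
    """
    C, d = len(pis), len(mus[0])
    A = np.zeros((d, d))
    b = np.asarray(h2_sum, dtype=float).copy()
    for j in range(C):
        Sj_inv = np.linalg.inv(Sigmas[j])
        A += pis[j] * h1[j] * Sj_inv             # sum_j pi_j h_{1,j}(x) Sigma_j^{-1}
        b += pis[j] * h1[j] * (Sj_inv @ mus[j])  # sum_j pi_j h_{1,j}(x) Sigma_j^{-1} mu_j
    return np.linalg.solve(A, b)                 # y* of Eq. 10
```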

4.3 Pre-image for the Gaussian state space model Fisher embedding

The pre-image problem for the Gaussian State Space Model embedding benefits from the previous solution and is sequentially solved over time. The solution at step \(t-1\) is leveraged to solve the problem at step t:

$$\begin{aligned} \begin{aligned} L_m(m_t)&= \Vert M_1\phi _{1,t}(m_t) - h_{1,t}(x)\Vert ^2 \\ L_z(z_t)&= \Vert M_2\phi _{2,t}(z_t) - h_{2,t}(x)\Vert ^2, \end{aligned} \end{aligned}$$

where the predicted vector is \(h(x) = (h_0(x),h_{1,1}(x),\ldots , h_{1,T}(x),h_{2,1}(x),\ldots \), \(h_{2,T}(x))\). As a matter of simplification, we chose a reduced form \(\phi _{Fisher}^{M}\) with \(M=diag(M_0,\) \(M_1,\ldots ,M_1,M_2 \ldots , M_2)\). For the matrices \(M_1\) and \(M_2\), as done in the Gaussian case, we choose a projection on the first d and p dimensions, respectively:

$$\begin{aligned} M_1\phi _{1,t}(m_t) =&\frac{\partial \log (p(m_t|m_{t-1}))}{\partial Am_{t-1}} ~=~Q^{-1}(m_t - Am_{t-1}), \end{aligned}$$
(11)
$$\begin{aligned} M_2\phi _{2,t}(z_t) =&\frac{\partial \log (p(z_t|m_t))}{\partial Cm_t} ~=~ R^{-1}(z_t - Cm_t). \end{aligned}$$
(12)

Hence, at each step, the pre-image problem becomes:

$$\begin{aligned} \begin{aligned} L_m(m_t)&= \Vert Q^{-1}(m_t-A\hat{m}_{t-1}) - h_{1,t}(x)\Vert ^2 \\ L_z(z_t)&= \Vert R^{-1}(z_t-C\hat{m}_{t}) - h_{2,t}(x)\Vert ^2, \end{aligned} \end{aligned}$$

where \(\hat{m}_t\) and \(\hat{z}_t\) represent the pre-image solutions at time t. For \(\hat{m}_0\), we solve the first pre-image step depending on the probability distribution of the initial state (Gaussian or GMM). Then, Algorithm 1 is applied to sequentially solve the following problem: \( \underset{(m,z)}{argmin}~\sum _{t=1}^{T}(L_m(m_t)+ L_z(z_t))\).

Algorithm 1 Sequential pre-image for the Gaussian State-Space Model Fisher embedding
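A minimal sketch of the recursion performed by Algorithm 1 (our reading, assuming each quadratic term is zeroed in closed form at every step, i.e. \(\hat{m}_t = A\hat{m}_{t-1} + Q h_{1,t}(x)\) followed by \(\hat{z}_t = C\hat{m}_t + R h_{2,t}(x)\), with \(\hat{m}_0\) obtained from the initial-state pre-image) is:

```python
import numpy as np

def gssm_sequential_preimage(h1, h2, m0_hat, A, Q, C, R):
    """Sequentially minimizes L_m and L_z over t = 1..T (sketch of Algorithm 1)."""
    m_hat, z_hat = [], []
    m_prev = m0_hat
    for h1_t, h2_t in zip(h1, h2):        # h1, h2: lists of predicted blocks h_{1,t}(x), h_{2,t}(x)
        m_t = A @ m_prev + Q @ h1_t       # zeroes L_m(m_t)
        z_t = C @ m_t + R @ h2_t          # zeroes L_z(z_t)
        m_hat.append(m_t)
        z_hat.append(z_t)
        m_prev = m_t
    return np.array(m_hat), np.array(z_hat)
```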

5 Learning to predict output Fisher embeddings

5.1 Supervised learning

Supervised learning of Output Fisher Embeddings (Step 2 of OFER) can be achieved by solving a ridge regression problem within a class of vector-valued functions \(\mathcal {H}\). Given a positive scalar \(\lambda _0\), we solve:

$$\begin{aligned} \underset{h\in \mathcal {H}}{{\text {arg min}}}~\lambda _0 \Vert h\Vert ^2_{\mathcal {H}} + \frac{1}{2\ell } \sum _{i=1}^{\ell } L_{Fisher}^M(y_i,h(x_i)). \end{aligned}$$
(13)

In this work, \(\mathcal {H}\) was chosen as the Reproducing Kernel Hilbert Space associated with the matrix-valued kernel \(K: \mathcal {X}\times \mathcal {X}\rightarrow \mathcal {L}(\mathbb {R}^p)\). Matrix-valued kernels provide a generalization of scalar-valued kernels (Micchelli and Pontil 2005; Álvarez et al. 2012) to build vector-valued functions. In this work, we chose \(K(x,x')= I_{p} \times k(x,x')\), where k is the classic scalar-valued Gaussian kernel. This choice simply yields p independent Kernel Ridge Regression models.
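In practice, with this choice of kernel, Step 2 reduces to a standard multi-output kernel ridge regression; a minimal scikit-learn sketch (with placeholder data and hyperparameters of our own) is given below.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

ell, n, p, lambda0 = 200, 30, 12, 1e-2
X = np.random.randn(ell, n)        # input matrix (ell x n)
Phi = np.random.randn(ell, p)      # placeholder for the output Fisher embeddings phi^M(y_i)

# With K(x, x') = I_p k(x, x'), Eq. 13 decouples into p independent kernel ridge
# regressions; scikit-learn handles the multi-output case directly.
h = KernelRidge(kernel="rbf", gamma=0.1, alpha=lambda0)
h.fit(X, Phi)
print(h.predict(X[:5]).shape)      # (5, p): predicted output Fisher embeddings
```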

5.2 Learning in the small data regime with OFER-GMM

We would like to emphasize the relevance of our approach for learning with a handful of labeled data, a regime we call small data regime learning. As stated in Sect. 2.1, the Output Fisher Embedding provides richer information than the output alone. A labeled training example \((x_i,y_i)\), converted into \((x_i, \phi _{Fisher}^{M}(y_i))\), conveys additional information about its structure. A simple example is the Fisher vector associated with a mixture model encoding a vector y. Let us focus on \(\varphi _{\pi }(y)\). The mixture component k to which y belongs with the highest probability corresponds to a high absolute value of the kth coordinate of the vector \(\varphi _{\pi }(y)\). Then, if two outputs \(y_i\) and \(y_j\) belong to the same component, learning to predict \(\varphi _{\pi }(y_i)\) also brings information about predicting \(\varphi _{\pi }(y_j)\) and vice-versa. Therefore, our framework has appealing features for learning with only a handful of labeled examples per class if the target problem is classification, or with a small training set if the task is regression. This is empirically tested in the experimental section.

5.3 Weakly supervised learning in the case of OFER-GMM

We notice that, for a given y, each coordinate of the Fisher vector is interpretable, as it reflects some structure of the output data. Taking the simple example of a mixture model, the Output Fisher Embedding encodes the cluster structure of the data, since the first coordinates of the vector \(\varphi _{\pi }(y)\) indicate the membership to each cluster. This remark has inspired a new scenario for weakly supervised learning. We imagine that we have access to two kinds of supervisors: expert supervisors, who are able to label data precisely with their exact class, and non-expert supervisors, who are only able to associate one cluster with a given data point. Note that the cluster does not even have to be named: the content of the clusters (for instance, images) can be presented to the non-expert supervisor. To recap, the scenario for weakly supervised learning writes as follows:

  • A subsample \(\{x_1, \ldots , x_{\ell }\}\) of the training input data set has been accurately labeled by an expert supervisor, providing the corresponding training sample \(\mathcal {Y}_{\ell }=\{y_1, \ldots , y_{\ell } \}\).

  • The training output sample \(\mathcal {Y}_{\ell }\) is used to estimate the parameters of a (Gaussian) mixture model with C components.

  • A matrix M of the form described in Sect. 4.2 is defined, yielding an interpretable Output Fisher embedding where the first \(C\) coefficients encode the membership of the given y to each of the \(C\) components of the mixture.

  • The remaining training input data are presented to non-expert supervisors, who are invited to indicate to which output component the presented input data can be associated, and who therefore only fill in the first \(C\) coefficients of the Fisher vector instead of providing the full corresponding output object. From this weak supervision, one can define an additional training set with weak labels. To describe this set, we follow the notations of Sect. 4.2 and introduce the matrix \(A=\left[ \begin{array}{c|c}I_{C} &{} \mathbf 0 \\ \hline \mathbf 0 &{} \mathbf 0 _{p-C} \end{array} \right] \), and \(A^{c}= I_{p}-A\), its complement. Then the additional weak training set \(\mathcal {W}_{w}\) is defined as:

    $$\begin{aligned} \mathcal {W}_{w}= \left\{ \left( x_i, A\phi _{Fisher}^{M}(y_i)\right) , i=\ell +1, \ldots , \ell +w\right\} , \end{aligned}$$

    where each \(y_i, i=\ell +1, \ldots , \ell + w\) is supposed to be the true unknown output associated with each \(x_i, i=\ell +1, \ldots , \ell +w\).

The learning task is now enriched in such a way that both kinds of examples, fully labeled and weakly labeled, can be used during the training phase. Interestingly for the implementation, the following equality holds:

$$\begin{aligned} \begin{aligned} L_{Fisher}^M(y,h(x))&=\Vert \phi _{Fisher}^{M}(y) - h(x)\Vert ^2 =\Vert (A+A^c)\phi _{Fisher}^{M}(y) - (A+A^c)h(x)\Vert ^2 \\&= \Vert A\phi _{Fisher}^{M}(y) - Ah(x)\Vert ^2\ +\Vert A^c\phi _{Fisher}^{M}(y) - A^c h(x)\Vert ^2 \\&=L_{Fisher,A}^M(y,h(x)) + L_{Fisher,A^c}^M(y,h(x)). \end{aligned} \end{aligned}$$

Thus, the new learning task expresses as:

$$\begin{aligned} \begin{aligned} \underset{h\in \mathcal {H}}{{\text {arg min}}}~\frac{1}{2\ell } \sum _{i=1}^{\ell } L_{Fisher,A}^M(y_i,h(x_i)) + \lambda _2 \sum _{i=\ell +1}^{\ell +w} L_{Fisher,A}^M(y_i,h(x_i)) \\ +\, \frac{1}{2\ell } \sum _{i=1}^{\ell } L_{Fisher,A^c}^M(y_i,h(x_i)) + \lambda _0 \Vert h\Vert ^2_{\mathcal {H}}. \end{aligned} \end{aligned}$$
(14)
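For illustration, the two kinds of loss terms in Eq. 14 can be evaluated by simply masking the coordinates of the embedding; the sketch below (ours) assumes that phi_y and h_x are plain NumPy vectors of dimension p. For a weakly labeled example, only the first term is computable, since only the first C coordinates of its embedding are provided.

```python
import numpy as np

def split_fisher_losses(phi_y, h_x, C):
    """Returns (L^M_{Fisher,A}, L^M_{Fisher,A^c}) for one example."""
    diff = phi_y - h_x
    loss_A = np.sum(diff[:C] ** 2)    # masked by A: first C coordinates (cluster memberships)
    loss_Ac = np.sum(diff[C:] ** 2)   # masked by A^c: remaining coordinates
    return loss_A, loss_Ac
```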

6 Numerical results

6.1 Context of the empirical study

In this section, we first explore the relevance of Output Fisher Embedding Regression on synthetic datasets and then investigate its behaviour on real datasets in three contexts: fully supervised learning, small data regime learning and weakly supervised learning.

To show the versatility of our framework, we study and compare OFER on a wide variety of real tasks. A structured output prediction task, related to Kogan et al. (2009), where the goal is to predict a short time series from an input text, is first explored. Then, OFER is tested on supervised learning tasks without explicit structure. Numerical results are provided on two multi-output regression tasks for which labels are usually difficult to obtain: a Video Score Prediction task extracted from Ponce-López et al. (2016) and a Drug Activity Prediction task presented in Su et al. (2010). Eventually, we turn an image classification problem into a Word Prediction task in an appropriate semantic space and present numerical results on two datasets, namely Caltech101 and AWA. In all numerical results, unless otherwise noted, train/test splits are generated 10 times and the average performance with the empirical standard deviation is presented. We use bold font to point out the best performance in a row when it is statistically significant based on Student’s t-tests with an \(\alpha \)-risk equal to \(10^{-3}\). For the sake of space, an empirical comparison of computation times in the learning and test phases is presented and discussed in the Appendix.

6.1.1 Generative model estimation

Gaussian models are chosen isotropic in all experiments. The parameter \(\theta \) is estimated using a procedure based on the maximum likelihood criterion. More precisely, an Expectation-Maximization (EM) algorithm is applied to estimate the Gaussian mixture model, while Kalman filtering/EM is used to estimate the State-Space Model.

6.1.2 OFER learning

In all experiments, the function h in Eq. 13 is learned using OFER implemented through Kernel Ridge Regression, applied to multiple outputs in the case of fully and weakly supervised learning. The kernel k in \(K(x,x')=Ik(x,x')\) is linear by default unless otherwise specified. For the purpose of comparison in regression and time series prediction, we used three baselines: multi-output Kernel Ridge Regression (m-KRR), multi-output Random Forest regression (m-RF) and variants of Input Output Kernel Regression (IOKR). As for the classification task, we used multiclass SVM and Multilayer Perceptron (MLP) as well as IOKR. For each problem at hand, we specify which output kernel was used. IOKR is associated with a pre-image problem, which is solved for each test point using a gradient descent implemented in the open-source library SciPy: https://www.scipy.org/. Apart from the synthetic dataset experiments, OFER is always used with the reduced Fisher embedding explained in Sect. 4.

6.1.3 Implementation

All the code is written in Python and calls the appropriate functions of the scikit-learn library, except for the Kalman filtering/EM procedure, which relies on pykalman.

6.1.4 Time complexity

An empirical comparison of computation times in the learning and test phases has been conducted for all real tasks and all the tested methods. For the sake of space, it is presented and discussed in the Appendix.

6.2 Numerical results on synthetic datasets

We report numerical results on synthetic datasets in order to answer three questions:

  1.

    Does OFER with reduced Fisher embedding perform better than OFER with full Fisher embedding in the Gaussian context?

  2.

    Does OFER with reduced Fisher embedding perform better than OFER with full Fisher embedding in the GMM context?

  3.

    How does OFER-GMM performance evolve with a growing number of training data and a growing number of clusters?

We generated a synthetic dataset \(\mathcal {D}_1\) by using a Gaussian mixture model (4 clusters) in \(\mathbb {R}^{10}\) to get the outputs and by computing the projection of outputs by kernel PCA to obtain the corresponding input representation in a 5-dimensional space. When OFER is based on the full Fisher embedding, each pre-image problem is solved by applying a stochastic gradient descent algorithm.

Table 1 Average relative root mean squared errors (RRMSE) on the test set obtained by OFER with the full Gaussian Fisher pre-image solution and the reduced Gaussian Fisher pre-image solution over 10 train/test splits of dataset \(\mathcal {D}_1\)
Table 2 Average training and test computation times for OFER-GMM with full and reduced Fisher embeddings over 10 train/test splits of dataset \(\mathcal {D}_1\)
Table 3 Relative root mean squared errors (RRMSE) obtained by OFER-GMM with the full Fisher pre-image solution and the reduced Gaussian Fisher pre-image solution on dataset \(\mathcal {D}_1\) for 3 values of \(C\)=2,4,10
Table 4 OFER-GMM relative root mean squared error on the test set according to the number of training examples and the number of components in GMMs for dataset \(\mathcal {D}_2\)

Table 1 reports experiments on dataset \(\mathcal {D}_1\) and shows that OFER with reduced Gaussian Fisher embedding leads to better performance than OFER with the full Gaussian embedding. We also see in Table 2 that the computation times in training and test phases differ. This is mostly due to the resolution of the pre-image problem. The pre-image with the reduced Fisher vector is both faster and more accurate. Similar conclusions can be drawn from Table 3 where OFER-GMM with reduced Fisher embedding exhibits better performance than OFER-GMM with full Fisher embedding.

For the last question, we generated a dataset \(\mathcal {D}_2\) similarly to \(\mathcal {D}_1\), with output data generated in \(\mathbb {R}^{5}\) from 10 clusters. Table 4 presents the relative root mean squared error obtained by OFER-GMM with \(C=1,2,3\) and using an increasing number of training samples taken from a total training set containing \(50\%\) of the dataset \(\mathcal {D}_2\), with a fixed test set containing the \(50\%\) remaining samples. Multi-output KRR is used here as a baseline. The results are averaged over 10 different splits. First, if we consider Table 4 row by row, we see that OFER-GMM behaves consistently, with a test error decreasing with the number of training data. Second, if we observe the table column by column, we see that the impact of a larger number of components in the mixture depends on the number of available training data. In regression, we need a sufficient number of training data if one wants to benefit from a more complex output model. It is also the case that a low-complexity GMM Fisher embedding brings a significant improvement when the training datasets are really small (Table 5).

Eventually, another set of numerical results on a simulated dataset is presented to show the performance of OFER on sequential output prediction. Time series of length 10 were generated using a Gaussian state space model. Each time series comes with both latent data and observed data. The parameters of the model have been chosen to keep steady-state time series. Once the dataset has been generated, we cut each time series into two parts. The first half plays the role of input data and the second half that of output data. As learning algorithms, we have chosen multi-output Kernel Ridge Regression (m-KRR), multi-output Random Forest (m-RF) and Input Output Kernel Regression with the Global Alignment Kernel as the output kernel (IOKR2), as in Table 6. Hence the goal is to compare OFER-GSSM with these algorithms on the prediction task. Hyperparameters have been selected through a 5-fold cross-validation step. We express the results in terms of Relative Root Mean Squared Error (RRMSE) on the latent space and the observed data. Again, we omit the standard deviations since they are smaller than \(10^{-3}\). Experiments have been repeated 10 times.

Table 5 Comparison of IOKR2,OFER-GSSM, m-KRR, and m-RF on the time series dataset of different sizes
Table 6 Comparison of OFER and competitors on the text-to-time-series task in terms of 5-CV relative root mean squared error

6.3 From text to time series

6.3.1 Task description

In finance, institutional actors provide legal reports such as the so-called ”10-K forms” to evaluate the inner risk that a given company bears in the near future (usually 6 months ahead). Predicting the market risk indicator of a company from its legal reports is of high interest, since it provides a means to take into account the dependency of the market on unquantifiable objects through a numerical indicator. Let us recall the definition of the short-return log-volatility at time t for a given company j: \(y_{t,j} = \log \left( \sqrt{\sum _{i=1}^{\tau }(r_{t-i,j}-\bar{r}_j)^2}\right) ,\) where \(r_{i,j} = \frac{p_{i,j}-p_{i-1,j}}{p_{i-1,j}}\) is the return at time i for an asset of company j, and \(p_{i,j}\) is the price of the asset. The output is the log-volatility over the time period from \(t-\tau \) to t.
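For reference, the target can be computed from raw prices as in the short sketch below (ours; we assume \(\bar{r}_j\) is the mean return over the same window of length \(\tau \)).

```python
import numpy as np

def log_volatility(prices, tau):
    """Short-return log-volatility at the last time step of a price series."""
    returns = np.diff(prices) / prices[:-1]                 # r_i = (p_i - p_{i-1}) / p_{i-1}
    window = returns[-tau:]
    return np.log(np.sqrt(np.sum((window - window.mean()) ** 2)))

prices = np.array([10.0, 10.2, 9.9, 10.1, 10.4, 10.3, 10.5])
print(log_volatility(prices, tau=4))
```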

To test our approach, we use the dataset described in Kogan et al. (2009), for which we modify both inputs and outputs to get a set of reports as input and a time series as output. The dataset \(\mathcal {S}=\{(x_i,y_i)\}\) contains 500 companies; for each one, we have the 10-K form reports from 2001 to 2005 and the short-return log-volatility following each period, with no overlap. We design a new learning task that consists of predicting a short-return log-volatility time series of length 5 from the 5 10-K form text reports corresponding to the 5 years, from 2001 to 2005. Given a company i, each yearly text report is represented using the classic TF-IDF with a dictionary of size \(70\times 10^{3}\), yielding an input vector \(x_i\) of size \(35\times 10^{4}\), while \(y_i\) is a vector of size 5.

6.3.2 Fully supervised learning

To apply OFER, we model the short output training time series with a linear state-space model. We use \(L_{Fisher}^M\), where the diagonal block matrix M writes as \(M = diag(M_0,M_1,\ldots ,M_1,M_2,\ldots ,M_2)\) as defined in Eqs. 11 and 12. OFER based on a Gaussian State-Space Model is called OFER-GSSM. As announced, we compare OFER-GSSM with m-KRR, m-RF and IOKR. In m-KRR, IOKR and OFER-GSSM, the kernel k was chosen linear, giving rise to \(K(x,x') = Id\times k(x,x') = Id \times (x^Tx')\) with the notations of Eq. 13. As for the output kernel of IOKR, we made three choices: a Gaussian kernel (IOKR), to simply take into account nonlinear relationships between outputs, and two kernels devoted to time-series similarity: FastDTW, introduced in Salvador and Chan (2004) (IOKR-DTW), and the Global Alignment Kernel, defined in Cuturi et al. (2007) (IOKR-GAK). The last two kernels take into account the time series structure and provide a fair comparison to our method, while neither m-KRR, m-RF nor IOKR take into account the hidden structure of the output time series. Note that in the single-output regression problem solved in Kogan et al. (2009), KRR exhibits the same performance as Support Vector Regression (SVR). In Table 6, 5-fold cross-validation Relative Root Mean Squared Errors (5-CV RRMSE) are reported with standard deviations; the latter are omitted whenever they are smaller than \(10^{-3}\). We observe that OFER-GSSM globally outperforms the other methods. We also give the details of the 5-CV RRMSE, coordinate by coordinate.

Moreover, Table 16 in the Appendix, which reports the computation times of all methods for the training and test phases, shows that OFER-GSSM exhibits a test time nearly as short as that of the fastest method (m-RF), for a training phase that costs half of the training time of m-RF. OFER-GSSM provides here the best trade-off between accuracy and computational time.

6.4 Multi-output regression tasks

6.4.1 Scores prediction in personality videos

Task description

The second dataset is the development set of the First Impression ChaLearn Challenge (Ponce-López et al. 2016). It contains 6000 videos of 15-second interviews annotated with the Big 5 Traits, which correspond to 5 continuous scores widely used in the personality analysis literature (John and Srivastava 1999). These annotations were obtained by fitting a Bradley-Terry-Luce model on pairwise preferences given by annotators from Amazon Mechanical Turk. For each video, the task is then to predict 5 personality scores belonging to the interval [0, 1] that have been inferred from a set of pairwise preferences among experts.

In this study, we only focus on videos for which at least one of the personality scores is far from the mean. We expect that these scores correspond to a strong agreement during the annotation step and could have been given by an expert. Relying on this assumption, we create two datasets of different sizes by rejecting the samples for which no personality score belongs to the interval: \([0,t]\cup [1-t,1]\) for different values of the agreement threshold t. In the following experiments, we choose the thresholds \(t_1 = 0.9\) and \(t_2 = 0.75\). The two choices lead to the construction of two training datasets \(S_1\) and \(S_2\), respectively of size 100 and 1000, with \(S_1 \subset S_2\). The remaining set of 5000 examples is split in two. One will be used as a common test set and the other as a set from which we draw data with weak labels.

Following the baseline proposed during the challenge, we encode the videos by taking the first frame of each video and encoding it with the fc8 feature map representation of the BVLC CaffeNet model. We use (Gaussian) Kernel Ridge Regression (m-KRR) in the baseline approach as well as in OFER-GMM. We also use as baselines Input Output Kernel Regression (IOKR) and multi-output Random Forest (m-RF). For IOKR, we took the linear kernel as both input and output kernel, which proved to be the best choice for this task. All the hyperparameters are selected using a 5-CV procedure.

Fully supervised learning In the following, we assess the performance of OFER-GMM on the two proposed vectorial regression tasks. The Relative Root Mean Squared Errors presented in Table 7 are computed with respect to the true Big 5 Traits.

Table 7 Comparison of (Gaussian) m-KRR, m-RF, IOKR and OFER-GMM on the two video-to-score prediction tasks

We see that the four approaches perform equally well, meaning that in this fully supervised context OFER-GMM does not bring any advantage. The relative ease of the task may explain this result.

Weakly supervised learning We now investigate the behaviour of OFER-GMM in the context of weakly supervised learning.

Table 8 Impact of the number of weakly supervised examples on weak OFER-GMM measured by RRMSE on test sets. GMM is omitted for the sake of clarity

Table 8 presents results obtained with a Gaussian mixture model whose number of components and all other hyperparameters have been selected using a grid search strategy with a cross-validation procedure on the training set. Adding weakly supervised examples, which only provide the membership to one of the two components of the Gaussian Mixture model, drastically improves the performance of OFER. The information conveyed by these examples consolidates the learned function.

6.4.2 Drug activity prediction

We present here another multi-output regression task in order to assess OFER. We have chosen a dataset related to Drug Activity Prediction, introduced by Su et al. (2010) and also studied as an application of IOKR in Brouard et al. (2016). In this problem, the main goal is to predict the activities of molecules on cancer cell lines. The input set \(\mathcal {X}\) corresponds to a set of 2303 molecules, where each molecule is represented as a graph labeled with atoms. In all the experiments, we directly used a representation of the training data as a Gram matrix with the Tanimoto kernel (Brouard et al. 2016). The output data are a set of 59 activity scores for each molecule. We have chosen the ”Non-zero Activity” version of the dataset, which only contains molecules active on at least one cell line. The dataset has been split 10 times into train/test sets (60%/40%), and we train our model with training sets of different sizes (10, 20, 100, 300, 500). The remaining part of the training dataset is used to build the weakly labeled sets of size 100, 500 and 800. As done in the Personality Video problem, results are presented in terms of RRMSE with a standard deviation computed over 10 repetitions. Baseline methods are the same as those used in Table 7, except for the IOKR baseline, whose output kernel is linear here. Tables 9 and 10 sum up the results we obtained for this problem.

Table 9 Comparison of (Gaussian) m-KRR, RF, IOKR and OFER-GMM on drug activity prediction
Table 10 Impact of the number of weakly supervised examples on weak OFER-GMM measured by RRMSE on test sets. GMM is omitted for the sake of clarity

For this problem, OFER enables us to get slightly better performance in terms of RRMSE and smaller standard deviations. A notable fact is that the number of components in the mixture model selected by cross-validation is 6, which corresponds to 6 clear levels of scores when observing the data.

6.5 Multi-class classification

Task description

We now propose to address an image multiclass classification problem cast into a word prediction task. The idea is that object classes have semantics and, by replacing classes by names (words), we allow the learning algorithm to take advantage of this semantics. We considered two real datasets: the first one is the well-known Caltech101 image dataset, consisting of images of 101 classes, and the second one is the Animals With Attributes dataset (AWA, Akata et al. 2016), consisting of images of 50 animal classes. In both cases, the final goal is to automatically predict the name of the animal/object present in a given image. Note that for the AWA dataset, our purpose is not to work on information transfer but only to have a dataset where the names of the objects might be meaningful to cluster. Hence we do not use the attribute information about animals at all, as studied in Akata et al. (2016). We also emphasize that the binary nature of the Attribute Label Embedding and Hierarchical Label Embedding proposed in Akata et al. (2016) does not lead to a dense output space where our method applies.

In both applications, we aim at showing that the Output Fisher Embedding allows us to encode meaningful dependencies in the output space, taking into account the semantic properties of the words describing the categories. The scheme proposed in Fig. 1 is slightly modified into Fig. 3 to support two steps in the pre-image process.

Fig. 3 Scheme of the structured learning task with the output kernel trick. The numbering describes the order of computation used to set up the whole approach

Following the new scheme, \(\mathcal {X}\) is the input feature space based on the fc7 feature map of the VGG19 pretrained on ILSVRC2014. The target set for Caltech101 is now \(\mathcal {Z}= \{ \text {cat}, \text {boat}, \ldots \}\) and \(\mathcal {Z}=\{ \text {lion}, \text {tiger}, \ldots \}\) for AWA. Each class in \(\mathcal {Z}\) is seen as a symbol and not a word. An intermediary semantic vector space Y is defined using the following three-step feature construction:

  • First, we built the “Wiki corpus” by scraping the Wikipedia pages containing the textual descriptions of the 101 (resp. 50) classes. We kept only the introductory paragraph.

  • Second, for each word of the ”Wiki corpus”, we retrieve the GLOVE (Pennington et al. 2014) semantic representation, that is to say a 50-dimensional vector for each word. We also convert the corpus into a TF-IDF matrix.

  • Third, we represent each class as a combination of the GLOVE vectors of its words, weighted by their TF-IDF scores.

Using this semantic encoding of words, \(\mathcal {Y}\) is a 50-dimensional vector space. Now the Fisher embedding applies to vectors of \(\mathcal {Y}\), transforming the 50-dimensional vectors of \(\mathcal {Y}\) into \(50 + C\)-dimensional vectors of \(\mathcal {F}_{\mathcal {Y}}\), denoting by C the number of GMM components. Once the GMM and the function h are learned, the prediction for a given input image x is obtained by first computing the image h(x), then computing the pre-image \(\hat{y}\) in \(\mathcal {Y}\) and, eventually, picking the class in \(\mathcal {Z}\) whose semantic representation in \(\mathcal {Y}\) is the closest to the prediction \(\hat{y}\). For all experiments, the evaluation metric is the classification accuracy of the multi-class classifier. Therefore, in the case of M classes, a random classifier would have an accuracy of \(\frac{1}{M}\).
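A schematic view of the semantic encoding and of the final decoding step is given below (our sketch; whether the weighted combination is normalized is not specified in the text).

```python
import numpy as np

def class_embeddings(tfidf, glove):
    """One 50-d semantic vector per class: TF-IDF-weighted sum of GLOVE word vectors.

    tfidf: (n_classes x vocab) TF-IDF matrix of the Wiki corpus
    glove: (vocab x 50) GLOVE vectors of the vocabulary words
    """
    return tfidf @ glove

def predict_class(y_hat, class_vecs):
    """Picks the class whose semantic representation is closest to the pre-image y_hat."""
    return int(np.argmin(np.linalg.norm(class_vecs - y_hat, axis=1)))
```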

Learning in small data regime

Our framework (OFER-GMM) is compared to two simple multiclass classifiers, SVM (m-SVM) with a one-versus-all strategy and multiclass Random Forest (m-RF), as well as to a multi-output Kernel Ridge Regression with a linear kernel (Sem-KRR) working in the semantic output space \(\mathcal {Y}\). Let us first report the performance of multiclass Random Forest: on Caltech101, from 1 to 10 examples per class, the classification accuracy of m-RF did not exceed 1.6% with a standard deviation of 0.45. On AWA, the mean accuracy varies from 10 to 18%, which is again very low and far from the performance of the other methods.

On both datasets, the multiclass Random Forest thus suffers from the limited number of training examples per label. We also compared with a method devoted to structured outputs, Input Output Kernel Regression (IOKR). IOKR is implemented here with an identity-decomposable operator-valued kernel defined on the input space and a scalar-valued output kernel, taken here as a Gaussian kernel on the semantic space \(\mathcal {Y}\). As the number of target classes is limited, the minimization problem involved in the pre-image step is solved exactly by enumeration, as sketched below.
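Assuming, as is standard for IOKR, that the prediction in the output feature space writes as an expansion \(g(x)=\sum _i c_i(x)\,\psi (y_i)\), the exact pre-image over the finite set of class embeddings can be obtained by enumeration. The following sketch illustrates this; `coeffs`, `Y_train`, `Y_candidates` and `gamma` are hypothetical names.

```python
# Sketch of the exact IOKR pre-image by enumeration over the class set,
# assuming a Gaussian output kernel and an expansion g(x) = sum_i c_i(x) psi(y_i).
import numpy as np

def gaussian_k(A, y, gamma):
    # k_y(a, y) for every row a of A
    return np.exp(-gamma * np.sum((A - y) ** 2, axis=-1))

def iokr_preimage(x, coeffs, Y_train, Y_candidates, gamma):
    c = coeffs(x)  # expansion coefficients c_i(x)
    # For the Gaussian kernel k_y(y, y) = 1, so minimizing the squared
    # feature-space distance amounts to maximizing sum_i c_i(x) k_y(y_i, y).
    scores = np.array([c @ gaussian_k(Y_train, y, gamma) for y in Y_candidates])
    return Y_candidates[int(np.argmax(scores))]
```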

We also tried to compare our method with methods from the deep learning community (Vinyals et al. 2016; Sohn et al. 2015). In Sohn et al. (2015), the authors developed a deep conditional generative model for structured output variables using Gaussian latent variables (CVAE). However, their approach estimates distributions from very large datasets to capture global regularities, whereas OFER is designed for small to medium-sized datasets and allows the user to take advantage of weak annotations. Consequently, a direct empirical comparison with CVAE is not meaningful, since we do not work on the same type of problems: when applied in the small data regime, the algorithm does not converge to a useful solution and gives an accuracy close to 0%.

Nonetheless, in order to have a deep learning baseline, we attempted to fine-tune the pretrained VGG19 network but obtained our best results with all the convolutional layers frozen. We refer to this baseline as MLP.

A 5-fold cross-validation on the training set was performed to select the hyperparameters, among which the number of GMM components for OFER-GMM (from 1 to 10). We repeat the experiments 10 times (10 different pairs of training and test sets) to get a mean accuracy and a standard deviation on the test sets. Each test set contains 5472 examples for Caltech101 and 9143 for AWA. Tables 11 and 12 present the performance of these classifiers in terms of classification accuracy on Caltech101 and AWA, respectively. A first observation is that all four methods that make use of the semantic embedding, namely MLP, Sem-IOKR, Sem-KRR and OFER-GMM, outperform m-SVM when the number of labeled examples per class is very low: typically \(\# ex < 7\) for Caltech101 and \(\# ex < 14\) for AWA. In each case, this threshold corresponds to a training set of roughly 700 examples. Replacing class indices by class names therefore seems relevant in this application: the lack of labeled data for a given class is compensated by the information conveyed by the labels of semantically close classes. In contrast, with a relatively large number of labeled examples per class (10), OFER-GMM is outperformed by m-SVM on Caltech101, and it is not worth applying the OFER framework in this case. The results on AWA are less clear-cut due to the large standard deviations; OFER-GMM outperforms its competitors when the number of examples per class is below 14 and, in particular, beats Sem-KRR in every setting.
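As an illustration of the model selection protocol described above, the following sketch selects the number of GMM components by 5-fold cross-validation. The function `fit_and_score_ofer` is a hypothetical stand-in for training OFER-GMM with a given number of components and returning the validation accuracy; it does not reproduce the actual training code.

```python
# Sketch of 5-fold model selection over the number of GMM components C.
import numpy as np
from sklearn.model_selection import KFold

def select_n_components(X, Y, fit_and_score_ofer, candidates=range(1, 11)):
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    mean_acc = {}
    for C in candidates:
        scores = [fit_and_score_ofer(C, X[tr], Y[tr], X[va], Y[va])
                  for tr, va in kf.split(X)]
        mean_acc[C] = np.mean(scores)
    return max(mean_acc, key=mean_acc.get)  # C with highest mean accuracy
```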

Table 11 Comparison between OFER-GMM, Sem-KRR, Sem-IOKR, Multi-layer Perceptron (MLP) and SVM on Caltech101 with a growing number of labeled examples per class
Table 12 Comparison between OFER-GMM, Sem-KRR, Sem-IOKR, MLP, m-SVM and m-RF on the AWA dataset with a growing number of labeled examples per class

Overall, these results show that OFER outperforms the other baselines in the small data regime considered here.

Weakly supervised learning

In the following experiments, we measure how the use of weak supervision impacts the performance of OFER-GMM. We follow exactly the same scenario as previously (growing training set size and same constant-size test set), except that we use an additional weakly supervised dataset of 700, 1500 or 2500 instances. The output of these weakly labeled examples is a vector whose size is the number of GMM components and which only encodes the component to which the output belongs (see the sketch below). The hyperparameter \(\lambda _2\), which leverages the importance of the weakly supervised term, is chosen by cross-validation. Table 13 (resp. Table 14) gathers the performance of the weak variants of OFER-GMM compared to the best competitor: m-SVM for Caltech101 and MLP for AWA. As expected, adding weakly labeled data generally improves the classification accuracy and also drastically reduces the standard deviation. The information brought by the membership to a GMM component, even if very weak, allows information to diffuse across classes of the same component. More insights on the impact of the hyperparameter \(\lambda _{2}\) on accuracy can be found in Table 15 in the Appendix.
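As an illustration, the weak targets described above can be encoded as one-hot vectors over the GMM components, with \(\lambda _2\) weighting the corresponding term in the training criterion. The sketch below is only illustrative and does not reproduce the exact wOFER objective; all names are hypothetical.

```python
# Illustrative encoding of the weak supervision: only the GMM component index
# of an output is known, so the target is a one-hot vector of size C.
import numpy as np

def weak_target(component_index, n_components):
    t = np.zeros(n_components)
    t[component_index] = 1.0   # membership of the output's mixture component
    return t

def combined_loss(pred_full, target_full, pred_weak, target_weak, lam2):
    # Fully supervised squared loss plus a lambda_2-weighted weak term
    # (a schematic combination, not the paper's exact objective).
    full = np.mean(np.sum((pred_full - target_full) ** 2, axis=1))
    weak = np.mean(np.sum((pred_weak - target_weak) ** 2, axis=1))
    return full + lam2 * weak
```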

Table 13 Impact of weak supervision (wOFER+ data weak size) on OFER for the classification task using growing sizes of weakly supervised dataset (700, 1500, 2500)
Table 14 Impact of weak supervision (wOFER+ data weak size) on OFER for AWA using growing sizes of weakly supervised dataset (700,1500, 2500)

7 Conclusion

Output Fisher embedding regression is a general framework able to account for explicit and implicit structure in output data by exploiting Fisher vectors in the output space. We have shown that the corresponding pre-image problems admit closed-form solutions in the case of Gaussian Mixture Models and Gaussian State-Space models, allowing for a fast prediction phase compared to Structured Output Prediction methods such as IOKR.

OFER has been applied to a wide variety of tasks, showing its versatility. Beyond structured output regression tasks such as time series prediction, OFER is also of interest for regression and multiclass classification tasks when the training dataset is small. OFER-GMM, based on the Fisher embedding of a Gaussian mixture model, exhibits a particularly interesting behaviour in multiclass classification when learning from a handful of data per label. The Fisher embedding seems to remedy the lack of examples per label, outperforming the other approaches, including those that take advantage of a semantic encoding. It is also worth noting the flexibility of OFER-GMM, whose number of components, selected by cross-validation, adapts to the observed outputs.

Additionally, the interpretability of OFER-GMM outputs opens the door to learning under weak supervision, avoiding the cost of expert labeling. This is observed in multi-output regression as well as in multiclass classification tasks.

We identify at least two attractive use cases for OFER. The first one is Structured Output Prediction when one wants to avoid an expensive prediction phase while still accounting for the structure. The second one is the small data regime, for any problem that can be cast into multi-output regression where the training outputs can be clustered.

A first line of future work concerns improving the whole approach by jointly learning the parametric probabilistic model and the predictive function. Second, this paper has focused on vectorial outputs and time series, but the framework is more general: an attractive direction is to apply it to more complex outputs in applications where the training dataset is often very limited, such as bioinformatics and chemoinformatics.