1 Introduction

Estimating object pose is an important building block in systems aiming to understand complex scenes and has a long history in computer vision [1, 2]. Whereas early systems achieved low accuracy, recent advances in deep learning and the collection of extensive datasets have led to high-performing systems that can be deployed in useful applications [3,4,5].

However, the reliability of object pose regression depends on the quality of the image provided to the system. Key challenges are low resolution due to the distance of an object from the camera, blur due to motion of the camera or the object, and sensor noise in poorly lit scenes (see Fig. 1).

We would like to predict object pose in a way that captures uncertainty. Probability is the right way to capture uncertainty [6], and in this paper we therefore propose a novel model for object pose regression whose predictions are fully probabilistic. Figure 1 depicts example outputs of the proposed system. Moreover, instead of assuming a fixed form for the predictive density, we allow for flexible multimodal distributions, specified by a deep neural network.

The value of quantified uncertainty in the form of probabilistic predictions is two-fold: first, a high prediction uncertainty is a robust way to diagnose poor inputs to the system; second, given accurate probabilities, we can summarize them into improved point estimates using Bayesian decision theory.

More generally, an accurate representation of uncertainty is especially important when a computer vision system becomes part of a larger system, such as when providing an input signal to an autonomous control system. If uncertainty is not well-calibrated, or, even worse, not taken into account at all, then the consequences of decisions made by the system cannot be accurately assessed, resulting in poor decisions at best, and dangerous actions at worst.

Fig. 1.

Pose estimation predictions (pan angle) on images from the IDIAP, TownCentre and PASCAL3D+ datasets. Our model predicts complex multimodal distributions on the circle (truncated by the outer circle for better viewing). For difficult and ambiguous images our model reports high uncertainty (bottom row).

In the following we present our method and make the following contributions:

  • We demonstrate the importance of probabilistic regression for the application of object pose estimation;

  • We propose a novel efficient probabilistic deep learning model for the task of circular regression;

  • We show on a number of challenging pose estimation datasets (including the PASCAL3D+ benchmark [7]) that the proposed probabilistic method outperforms purely discriminative approaches in terms of predictive likelihood, and shows competitive performance in terms of the angular deviation losses classically used for these tasks.

2 Related Work

Estimation of object orientation arises in different applications, and in this paper we focus on the two most prominent tasks: head pose estimation and object class orientation estimation. Although these tasks are closely related, they have been studied mostly in isolation, with methods typically applied exclusively to one of them. We will therefore discuss them separately, despite the fact that our model applies to both tasks.

Head pose estimation has been a subject of extensive research in computer vision for a long time [2, 8], and existing systems vary greatly in terms of feature representation and proposed classifiers. The input to pose estimation systems typically consists of 2D head images [9,10,11], and often one has to cope with low resolution images [8, 12,13,14]. Additional modalities such as depth [15] and motion [14, 16] information have been exploited and provide useful cues, but these are not always available. Information about the full body image can also be used for joint head and body pose prediction [17,18,19]. Notably, the work of [18] also promotes a probabilistic view and fuses body and head orientation within a tracking framework. Finally, the output of a facial landmark detector can be used as an intermediate step [20, 21].

Existing head pose estimation models are diverse and include manifold learning approaches [22,23,24,25], energy-based models [19], linear regression based on HOG features [26], regression trees [15, 27] and convolutional neural networks [5]. A number of probabilistic methods for head pose analysis exist in the literature [18, 28, 29], but none of them combines a probabilistic framework with learnable hierarchical feature representations from deep CNN architectures. At the same time, deep probabilistic models have shown an advantage over purely discriminative models in other computer vision tasks, e.g., depth estimation [30]. To the best of our knowledge, our work is the first to apply a deep probabilistic approach to the angular orientation regression task.

An early dataset and benchmark for estimating the rotation of general object classes was proposed in [31]. Over the years the complexity of the data increased, from object rotations [31] and images of cars in different orientations [32] to Pascal3D [33]. The work of [33] assigned a separate Deformable Part Model (DPM) component to a discrete set of viewpoints. The works of [34, 35] then proposed different 3D DPM extensions which allowed viewpoint estimation as an integral part of the model. However, both [34] and [35] do not predict a continuous angular estimate but only a discrete number of bins.

More recent approaches make use of CNN models but still do not take a probabilistic approach [3, 4]. The work of [36] investigates the use of a synthetic rendering pipeline to overcome the scarcity of detailed training data; training on a combination of synthetic and real examples allows them to outperform previous results. The model in [36] predicts angles and constructs a loss function that penalizes geodesic and \(\ell _1\) distance. Closest to our approach, [37] also utilizes the von Mises distribution to build the regression objective. However, similarly to [5], the shape of the predicted distribution remains fixed, with only the mean of a single von Mises density being predicted. In contrast, in this work we advocate the use of the complete likelihood as a principled probabilistic training objective.

The recent work of [38] draws a connection between viewpoints and object keypoints. Viewpoint estimation is, however, again framed as a classification problem over Euler angles, from which a rotation matrix relative to a canonical viewpoint is obtained. Another substitution of the angular regression problem was proposed in a series of works [39,40,41], where a CNN is trained to predict the 2D image locations of virtual 3D control points and the actual 3D pose is then computed by solving a perspective-n-point (PnP) problem that recovers rotations from 2D–3D correspondences. Additionally, many works phrase angular prediction as a classification problem [3, 36, 38], which limits the granularity of the prediction and requires both the design of a loss function and a means to select the number of discrete labels. A benefit of a classification model is that components like the softmax loss can be re-used and also interpreted as an uncertainty estimate. In contrast, our model mitigates these problems: the likelihood principle suggests a direct way to train parameters, and moreover ours is the only model in this class that conveys an uncertainty estimate.

3 Review of Biternion Networks

We build on the Biternion network method for pose estimation from [5] and briefly review its basic ideas here. Biternion networks regress angular data and currently define the state of the art on a number of challenging head pose estimation datasets.

A key problem in regressing angular orientations is their periodicity, which prevents a straightforward application of standard regression methods, including CNN models with common loss functions. Consider a ground truth value of \(0^{\circ }\): both predictions \(1^{\circ }\) and \(359^{\circ }\) should then incur the same absolute loss. Applying the mod operator is no simple fix, since it results in a discontinuous loss function that complicates optimization; the loss function must instead be designed to cope with this periodicity of the target value. Biternion networks overcome this difficulty by using a different parameterization of angles and the cosine loss between angles.

3.1 Biternion Representation

Beyer et al. [5] propose an alternative representation of an angle \(\phi \) using the two-dimensional sine and cosine components \(\varvec{y} = (\cos {\phi }, \sin {\phi })\).

This biternion representation is inspired by quaternions, which are popular in computer graphics systems. It is easy to predict a \((\cos , \sin )\) pair with a fully-connected layer followed by a normalization layer, that is,

$$\begin{aligned} f_{BT}(\varvec{x}; \varvec{W}, \varvec{b})= \frac{\varvec{W}\varvec{x}+\varvec{b}}{|| \varvec{W}\varvec{x}+\varvec{b} ||} = (\cos {\phi }, \sin {\phi }) = \varvec{y}_{pred}, \end{aligned}$$
(1)

where \(\varvec{x} \in \mathbb {R}^n\) is an input, \(\varvec{W} \in \mathbb {R}^{2 \times n}\), and \(\varvec{b} \in \mathbb {R}^{2}\). A Biternion network is then a convolutional neural network with the layer (1) as its final operation, outputting a two-dimensional vector \(\varvec{y}_{pred}\). We use VGG-style [42] and InceptionResNet [43] networks in our experiments and provide a detailed description of the network architectures in Sect. 6.1. Given recent developments in network architectures, it is likely that other network topologies may perform better than the selected backbones. We leave this for future work; our contributions are orthogonal to the choice of the base model.
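To make this concrete, here is a minimal PyTorch sketch of the output layer (1); PyTorch is our framework choice for illustration, and the class name and feature dimensionality are ours rather than prescribed by [5].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiternionLayer(nn.Module):
    """Final layer of a Biternion network, Eq. (1): a linear map to 2D
    followed by L2 normalization onto the unit circle."""

    def __init__(self, in_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, 2)  # computes Wx + b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize so the output is a valid (cos phi, sin phi) pair.
        return F.normalize(self.linear(x), p=2, dim=-1)
```

For example, appending BiternionLayer(512) to a backbone that emits 512-dimensional features yields the two-dimensional \(\varvec{y}_{pred}\).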

3.2 Cosine Loss Function

The cosine distance is chosen in [5] as a natural candidate to measure the difference between the predicted and ground truth Biternion vectors. It reads

$$\begin{aligned} L_{cos}(\varvec{y}_{pred} , \varvec{y}_{true})= & {} 1 - \frac{\varvec{y}_{pred} \cdot \varvec{y}_{true}}{ ||\varvec{y}_{pred}|| \cdot ||\varvec{y}_{true}||} = 1 - \varvec{y}_{pred} \cdot \varvec{y}_{true}, \end{aligned}$$
(2)

where the last equality is due to \(|| \varvec{y} ||^2 = \cos ^2{\phi } + \sin ^2{\phi } = 1\).
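For completeness, a sketch of the loss (2) and of the Biternion encoding of ground truth angles in the same PyTorch setting; both helpers assume unit-norm predictions as produced by the layer above, so the normalization in (2) can be dropped.

```python
import torch

def angle_to_biternion(phi: torch.Tensor) -> torch.Tensor:
    # Encode angles (radians) as unit vectors (cos phi, sin phi).
    return torch.stack([torch.cos(phi), torch.sin(phi)], dim=-1)

def cosine_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    # Eq. (2) for unit-norm Biternion vectors of shape (batch, 2).
    return (1.0 - (y_pred * y_true).sum(dim=-1)).mean()
```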

The combination of the Biternion angle representation and the cosine loss solves the problem of regressing angular values, allowing for a flexible deep network with angular output. We take this state-of-the-art model and generalize it into a family of probabilistic models of gradually increasing flexibility.

4 Probabilistic Models of Circular Data

We utilize the von Mises (vM) distribution as the basic building block of our probabilistic framework; it is the canonical choice for a distribution on the unit circle [44]. Compared to the standard Gaussian, its benefit is that its support is an interval of length \(2\pi \), which allows it to truthfully model the domain of the data, that is, angles on a circle.

Fig. 2.

Left: examples of the von Mises probability density function for different concentration parameters \(\kappa \). Center, right: predicted \({{\mathrm{\mathcal {VM}}}}\) distributions for two images from the CAVIAR dataset. We plot the predicted density on the viewing circle; for comparison we also include the 2D plot (better visible in the zoomed PDF version). The prediction for the center image is very certain, while the one on the right is more uncertain about the viewing angle.

We continue with a brief formal definition. In Sect. 4.1 we describe a simple way to convert the output of Biternion networks into a \({{\mathrm{\mathcal {VM}}}}\) density that requires no network architecture change or re-training, only the selection of the model variance. We will then use this approach as a baseline for more advanced probabilistic models. Section 4.2 slightly extends the original Biternion network by introducing an additional network output unit that models the uncertainty of our angle estimate and allows optimization of the log-likelihood of the \({{\mathrm{\mathcal {VM}}}}\) distribution.

The von Mises distribution \({{\mathrm{\mathcal {VM}}}}(\mu , \kappa )\) is a close approximation of a normal distribution on the unit circle. Its probability density function is

$$\begin{aligned} p(\phi ; \mu , \kappa ) = \frac{\exp {(\kappa \cos {(\phi - \mu )})}}{2\pi I_0(\kappa )}, \end{aligned}$$
(3)

where \(\mu \in [0,2\pi )\) is the mean value, \(\kappa \in \mathbb {R}_+\) is a measure of concentration (a reciprocal measure of dispersion, so \(1/\kappa \) is analogous to \(\sigma ^2\) in a normal distribution), and \(I_0(\kappa )\) is the modified Bessel function of order 0. We show examples of \({{\mathrm{\mathcal {VM}}}}\)-distributions with \(\mu =\pi \) and varying \(\kappa \) values in Fig. 2 (left).
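For illustration, the density (3) is straightforward to evaluate with NumPy/SciPy; the \(\kappa \) values below are illustrative choices, as used for plots like Fig. 2 (left).

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of order 0

def von_mises_pdf(phi: np.ndarray, mu: float, kappa: float) -> np.ndarray:
    # Eq. (3); angles in radians, kappa > 0.
    return np.exp(kappa * np.cos(phi - mu)) / (2.0 * np.pi * i0(kappa))

phi = np.linspace(0.0, 2.0 * np.pi, 361)
for kappa in (0.5, 1.0, 4.0, 16.0):      # larger kappa = more concentrated
    density = von_mises_pdf(phi, np.pi, kappa)
```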

4.1 Von Mises Biternion Networks

A conceptually simple way to turn the Biternion network from Sect. 3 into a probabilistic model is to take its predicted value as the center value of the \({{\mathrm{\mathcal {VM}}}}\) distribution,

$$\begin{aligned} p_{\theta }(\phi | \varvec{x}; \kappa ) = \frac{\exp {(\kappa \cos {(\phi - \mu _{\theta }(\varvec{x}))})}}{2\pi I_0(\kappa )}, \end{aligned}$$
(4)

where \(\varvec{x}\) is an input image, \(\theta \) are parameters of the network, and \(\mu _{\theta }(\varvec{x})\) is the network output. To arrive at a probability distribution, we may regard \(\kappa > 0\) as a hyper-parameter. For fixed network parameters \(\theta \) we can select \(\kappa \) by maximizing the log-likelihood of the observed data,

$$\begin{aligned} \kappa ^* = \mathop {{{\mathrm{argmax}}}}\limits _{\kappa } \sum _{i=1}^N{\log p_{\theta }(\phi ^{(i)}| \varvec{x}^{(i)}; \kappa )}, \end{aligned}$$
(5)

where N is the number of training samples. The model (4) with \(\kappa ^*\) will serve as the simplest probabilistic baseline in our comparisons, referred to as the fixed \(\kappa \) model in the experiments.
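A simple way to carry out the selection (5) is a grid search over \(\kappa \), using the angles predicted by the already-trained network; the grid range below is an assumption of ours, not a value from the experiments.

```python
import numpy as np
from scipy.special import i0e  # exponentially scaled Bessel function

def select_kappa(phi_true: np.ndarray, mu_pred: np.ndarray) -> float:
    """Grid search for the kappa* of Eq. (5). log I0(kappa) is computed
    stably as log(i0e(kappa)) + kappa to avoid overflow."""
    best_kappa, best_ll = None, -np.inf
    for kappa in np.linspace(0.1, 100.0, 1000):
        log_i0 = np.log(i0e(kappa)) + kappa
        ll = np.sum(kappa * np.cos(phi_true - mu_pred)
                    - np.log(2.0 * np.pi) - log_i0)
        if ll > best_ll:
            best_kappa, best_ll = kappa, ll
    return best_kappa
```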

4.2 Maximizing the von Mises Log-Likelihood

Using a single scalar \(\kappa \) for every possible input in model (4) is clearly a restrictive assumption: model certainty should depend on factors such as image quality, lighting conditions, etc. For example, Fig. 2 (center, right) depicts two low resolution images from a surveillance camera that are part of the CAVIAR dataset [13]. In the center image facial features like eyes and ears are distinguishable, which allows a model to be more certain than for the blurrier image on the right.

We therefore extend the simple model by replacing the single constant \(\kappa \) with a function \(\kappa _{\theta }(\varvec{x})\), predicted by the Biternion network (Fig. 3),

$$\begin{aligned} p_{\theta }(\phi | \varvec{x}) = \frac{\exp {(\kappa _{\theta }(\varvec{x}) \cos {(\phi - \mu _{\theta }(\varvec{x}))})}}{2\pi I_0(\kappa _{\theta }(\varvec{x}))}. \end{aligned}$$
(6)

We train (6) by maximizing the log-likelihood of the data,

$$\begin{aligned} \log {\mathcal {L}(\theta | \varvec{X}, \varPhi )} = \sum _{i=1}^N{\kappa _{\theta }(\varvec{x}^{(i)})\cos {(\phi ^{(i)} - \mu _{\theta }(\varvec{x}^{(i)}))}} - \sum _{i=1}^N{\log 2 \pi I_0(\kappa _{\theta }(\varvec{x}^{(i)}))}. \end{aligned}$$
(7)
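A minimal PyTorch sketch of the negative of (7) as a per-batch training loss follows; computing \(\log I_0(\kappa )\) via the exponentially scaled Bessel function is an implementation choice of ours for numerical stability.

```python
import math
import torch

def von_mises_nll(phi: torch.Tensor, mu: torch.Tensor,
                  kappa: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood (7), averaged over the batch. mu can be
    recovered from the Biternion output via torch.atan2(y[..., 1], y[..., 0]);
    kappa > 0 is assumed, e.g. enforced by a softplus on the raw output."""
    log_i0 = torch.log(torch.special.i0e(kappa)) + kappa  # stable log I0
    log_lik = kappa * torch.cos(phi - mu) - math.log(2.0 * math.pi) - log_i0
    return -log_lik.mean()
```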

Note that when \(\kappa \) is held constant in (7), the second sum in \(\log {\mathcal {L}(\theta | \varvec{X}, \varPhi )}\) is constant and therefore we recover the Biternion cosine objective (2) up to constants \(C_1\), \(C_2\),

$$\begin{aligned} \log {\mathcal {L}(\theta | \varvec{\varvec{X}}, \varPhi , \kappa )} = C_1 \sum _{i=1}^N{\cos {\big (\phi ^{(i)} - \mu _{\theta }(\varvec{x}^{(i)})\big )}} + C_2. \end{aligned}$$

The sum has the equivalent form,

$$\begin{aligned} \sum _{i=1}^N{\cos {\big (\phi ^{(i)} - \mu _{\theta }(\varvec{x}^{(i)})\big )}}= & {} \sum _{i=1}^N{\big [\cos {\phi ^{(i)}} \cos {\mu _{\theta }(\varvec{x}^{(i)})} + \sin {\phi ^{(i)}}\sin {\mu _{\theta }(\varvec{x}^{(i)})} \big ]} \end{aligned}$$
(8)
$$\begin{aligned}= & {} \sum _{i=1}^N{\varvec{y}_{\phi ^{(i)}} \cdot \varvec{y}_{\mu _{\theta }(\varvec{x}^{(i)})} }, \end{aligned}$$
(9)

where \(\varvec{y}_{\phi } = (\cos {\phi }, \sin {\phi })\) is the Biternion representation of an angle. Note that this derivation shows that the loss function in [5] corresponds to optimizing the von Mises log-likelihood for the fixed value of \(\kappa =1\). This offers an interpretation of Biternion networks as a probabilistic model.

Fig. 3.

The single mode von Mises model (VGG backbone variant). A BiternionVGG network regresses both the mean and the concentration parameter of a single vM distribution.

The additional degree of freedom to learn \(\kappa _{\theta }(\varvec{x})\) as a function of \(\varvec{x}\) allows us to capture the desired image-dependent uncertainty as can be seen in Fig. 2.

However, like the Gaussian distribution, the von Mises distribution makes a specific assumption about the shape of the density. We now show how to overcome this limitation by using a mixture of von Mises distributions.

5 Mixture of von Mises Distributions

The model described in Sect. 4.2 is unimodal and cannot capture ambiguities in the image. However, for blurry images like the ones in Fig. 2 we may want to distribute probability mass over a few high probability hypotheses: for example, the model could predict that a person is looking sideways but be unable to determine with certainty whether it is to the left or to the right. In this section we present two models that are able to capture multimodal beliefs while retaining a calibrated uncertainty measure.

5.1 Finite Mixture of von Mises Distributions

One common way to construct a complex distribution is to combine multiple simple distributions into a mixture. We introduce K component distributions and a K-dimensional probability vector representing the mixture weights. Each component is a von Mises distribution. We can then write our density function as

$$\begin{aligned} p_{\theta }(\phi | \varvec{x}) = \sum _{j=1}^{K}{\pi _j(\varvec{x}, \theta ) \, p_j(\phi | \varvec{x}, \theta )}, \end{aligned}$$
(10)

where \(p_j(\phi | \varvec{x}, \theta ) = {{\mathrm{\mathcal {VM}}}}(\phi | \mu _j, \kappa _j)\) for \(j=1,\dots ,K\) are the K component distributions, and the mixture weights \(\pi _j(\varvec{x}, \theta )\) satisfy \(\sum _j \pi _j(\varvec{x},\theta ) = 1\). We denote all parameters by the vector \(\theta \), which contains component-specific parameters as well as parameters shared across all components.

Fig. 4.

The finite \({{\mathrm{\mathcal {VM}}}}\) mixture model. A VGG network predicts K mean and concentration values and the mixture coefficients \(\pi \). This allows the model to capture multimodality in the output.

To predict the mixture in a neural network framework, we need \(K \times 3\) output units for the von Mises component parameters (two for the Biternion representation of each mean \(\mu _j(\varvec{x},\theta )\) and one for each \(\kappa _j(\varvec{x},\theta )\)), as well as K units for the probability vector \(\pi _j(\varvec{x},\theta )\), obtained via a softmax operation to ensure positive mixture weights that sum to one.

The finite von Mises mixture model then takes the form

$$\begin{aligned} p_{\theta }(\phi | \varvec{x}) = \sum _{j=1}^{K}{\pi _j(\varvec{x}, \theta ) \, \frac{\exp {\Big (\kappa _{j}(\varvec{x}, \theta ) \cos {\big (\phi - \mu _{j}(\varvec{x}, \theta )\big )}\Big )}}{2\pi I_0\big (\kappa _{j}(\varvec{x}, \theta )\big )}}. \end{aligned}$$
(11)

Similarly to the single von Mises model, we train by directly maximizing the log-likelihood of the observed data, \(\sum _{i=1}^N \log p_{\theta }(\phi ^{(i)} | \varvec{x}^{(i)})\). No specific training schemes or architectural tweaks were needed to avoid redundancy in the mixture components. Empirically, we observe that the model learns to set the mixture weights \(\pi _j\) of redundant components close to zero, and also learns an ordering of the components (e.g., that a particular output component j should carry a high mixture weight).

We show an overview of the model in Fig. 4.
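As a sketch, the negative log-likelihood of the mixture (11) can be computed with a log-sum-exp over components for numerical stability; the tensor shapes and the softmax parameterization of the weights follow the description above, while variable names are ours.

```python
import math
import torch
import torch.nn.functional as F

def vm_mixture_nll(phi: torch.Tensor, mu: torch.Tensor,
                   kappa: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. (11).
    Shapes: phi (B,); mu, kappa, logits (B, K)."""
    log_i0 = torch.log(torch.special.i0e(kappa)) + kappa   # stable log I0
    log_comp = (kappa * torch.cos(phi.unsqueeze(-1) - mu)
                - math.log(2.0 * math.pi) - log_i0)        # log p_j(phi | x)
    log_pi = F.log_softmax(logits, dim=-1)                 # log pi_j(x)
    return -torch.logsumexp(log_pi + log_comp, dim=-1).mean()
```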

5.2 Infinite Mixture (CVAE)

To extend the model from a finite to an infinite mixture, we follow the variational autoencoder (VAE) approach [45, 46] and introduce a vector-valued latent variable \(\varvec{z}\). The resulting model is depicted in Fig. 5. The continuous latent variable is the input to a decoder network \(p(\phi |\varvec{x},\varvec{z})\) which predicts the parameters (mean and concentration) of a single von Mises component. We define our density function as the infinite sum (integral) over all latent variable choices, weighted by a learned distribution \(p(\varvec{z}|\varvec{x})\),

$$\begin{aligned} p_{\theta }(\phi | \varvec{x}) = \int {p(\phi |\varvec{x},\varvec{z}) \, p(\varvec{z}|\varvec{x}) d\varvec{z}}, \end{aligned}$$
(12)

where \( p_{\theta }(\phi | \varvec{x}, \varvec{z}) = {{\mathrm{\mathcal {VM}}}}(\mu (\varvec{x}, \theta ), \kappa (\varvec{x}, \theta ))\) and \(p_{\theta }(\varvec{z}| \varvec{x}) = \mathcal {N}(\mu _1(\varvec{x}, \theta ), \sigma _1^2(\varvec{x}, \theta ))\). The log-likelihood \(\log p_{\theta }(\phi | \varvec{x})\) for this model is no longer tractable, preventing simple maximum likelihood training. Instead we use the variational autoencoder framework of [45, 46] in the form of the conditional VAE (CVAE) [47]. The CVAE formulation uses an auxiliary variational density \(q_{\theta }(\varvec{z}|\varvec{x}, \phi ) = \mathcal {N}(\mu _2(\varvec{x}, \phi , \theta ), \sigma _2^2(\varvec{x}, \phi , \theta ))\) and, instead of the log-likelihood, optimizes a variational lower bound,

$$\begin{aligned} \log { p_{\theta }(\phi | \varvec{\varvec{x}})}= & {} \log {\int { p_{\theta }(\phi |\varvec{x},\varvec{z}) \, p_{\theta }(\varvec{z}|\varvec{x}) d\varvec{z}}} \end{aligned}$$
(13)
$$\begin{aligned}\ge & {} \mathbb {E}_{z \sim q_{\theta }(\varvec{z}|\varvec{x},\phi )}\left[ \log \frac{p_{\theta }(\phi | \varvec{x},\varvec{z}) \, p_{\theta }(\varvec{z}|\varvec{x})}{ q_{\theta }(\varvec{z}|\varvec{x},\phi )} \right] =: \mathcal {L}_{\text {ELBO}}(\theta | \varvec{x},\phi ). \end{aligned}$$
(14)

We refer to [45,46,47,48] for more details on VAEs.

Fig. 5.

The infinite mixture model (CVAE). An encoder network predicts a distribution \(q(\varvec{z}|\varvec{x})\) over latent variables \(\varvec{z}\), and a decoder network \(p(\phi |\varvec{x},\varvec{z})\) defines individual mixture components. Integrating over \(\varvec{z}\) yields an infinite mixture of von Mises distributions. In practice we approximate this integration using a finite number of Monte Carlo samples \(\varvec{z}^{(j)} \sim q(\varvec{z}|\varvec{x})\).

The CVAE model is composed of multiple deep neural networks: an encoder network \(q_{\theta }(\varvec{z}|\varvec{x}, \phi )\), a conditional prior network \( p_{\theta }(\varvec{z}| \varvec{x})\), and a decoder network \( p_{\theta }(\phi | \varvec{x}, \varvec{z})\). Like before, we use \(\theta \) to denote the entirety of trainable parameters of all three model components. We show an overview of the model in Fig. 5. The model is trained by maximizing the variational lower bound (14) over the training set \((\varvec{X},\varPhi )\), where \(\varvec{X}=(\varvec{x}^{(1)},\dots ,\varvec{x}^{(N)})\) are the images and \(\varPhi = (\phi ^{(1)},\dots ,\phi ^{(N)})\) are the ground truth angles. We maximize

$$\begin{aligned} \hat{\mathcal {L}}_{\text {CVAE}}(\theta | \varvec{X}, \varPhi )= & {} \frac{1}{N} \sum _{i=1}^N \hat{\mathcal {L}}_{\text {ELBO}}(\theta | \varvec{x}^{(i)}, \phi ^{(i)}), \end{aligned}$$
(15)

where we use \(\hat{\mathcal {L}}_{\text {ELBO}}\) to denote the Monte Carlo approximation to (14) using S samples. We can optimize (15) efficiently using stochastic gradient descent.

To evaluate the log-likelihood during testing, we use the importance-weighted sampling technique proposed in [49] to derive a stronger bound on the marginal likelihood,

$$\begin{aligned} \log p_{\theta }(\phi | \varvec{x})\ge & {} \log \frac{1}{S} \sum _{j=1}^{S} \frac{ p_{\theta }(\phi | \varvec{x}, \varvec{z}^{(j)}) \, p_{\theta }(\varvec{z}^{(j)} | \varvec{x}) }{q_{\theta }(\varvec{z}^{(j)} | \varvec{x}, \phi )},\end{aligned}$$
(16)
$$\begin{aligned} \varvec{z}^{(j)}\sim & {} q_{\theta }(\varvec{z}^{(j)} | \varvec{x}, \phi ) \qquad j=1,\dots ,S. \end{aligned}$$
(17)
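A sketch of the importance-weighted estimate (16); each input holds per-sample log-densities for S draws \(\varvec{z}^{(j)} \sim q_{\theta }(\varvec{z} | \varvec{x}, \phi )\), evaluated by the decoder, prior and encoder networks respectively, whose exact interfaces we leave unspecified here.

```python
import math
import torch

def iw_log_likelihood(log_p_phi: torch.Tensor, log_p_z: torch.Tensor,
                      log_q_z: torch.Tensor) -> torch.Tensor:
    # Eq. (16): log-mean-exp of importance weights; all inputs shaped (B, S).
    log_w = log_p_phi + log_p_z - log_q_z
    return torch.logsumexp(log_w, dim=-1) - math.log(log_w.shape[-1])
```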

Simplified CVAE. In our experiments we also investigate a variant of the above model where \( p_{\theta }(\varvec{z} | \varvec{x}) = q_{\theta }(\varvec{z} | \varvec{x}, \phi ) = p(\varvec{z}) = \mathcal {N}(0, I)\). Compared to the full CVAE framework, this model, which we refer to as the simplified CVAE (sCVAE) in the experiments, sacrifices the adaptive input-dependent density of the latent variable \(\varvec{z}\) for faster training and test-time inference as well as optimization stability. In this case the KL-divergence term \(KL\big (q_{\theta } \parallel p_{\theta } \big )\) in \(\hat{\mathcal {L}}_{\text {ELBO}}\) becomes zero, and we train on a Monte Carlo estimate of the log-likelihood of the data:

$$\begin{aligned} \hat{\mathcal {L}}_{\text {sCVAE}}(\theta | \varvec{X}, \varPhi )= & {} \frac{1}{N} \sum _{i=1}^N \log {\Big ( \frac{1}{S} \sum _{j=1}^S p_{\theta }(\phi ^{(i)} | \varvec{x^{(i)}}, \varvec{z}^{(j)}) \Big )}, \end{aligned}$$
(18)
$$\begin{aligned}&\quad \!\!\varvec{z}^{(j)} \sim p(\varvec{z}) = \mathcal {N}(0, I), j=1,\dots ,S. \end{aligned}$$
(19)
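A sketch of the Monte Carlo estimate (18); the decoder interface `decoder(x, z) -> (mu, kappa)` and the latent dimensionality are hypothetical placeholders for our architecture.

```python
import math
import torch

def von_mises_log_pdf(phi, mu, kappa):
    # Log of the density (3), stable in kappa.
    return (kappa * torch.cos(phi - mu) - math.log(2.0 * math.pi)
            - (torch.log(torch.special.i0e(kappa)) + kappa))

def scvae_log_likelihood(decoder, phi, x, S: int = 50, z_dim: int = 8):
    """Eq. (18): log-mean-exp over S latent samples z ~ N(0, I)."""
    log_p = []
    for _ in range(S):
        z = torch.randn(x.shape[0], z_dim)
        mu, kappa = decoder(x, z)
        log_p.append(von_mises_log_pdf(phi, mu, kappa))
    return torch.logsumexp(torch.stack(log_p), dim=0) - math.log(S)
```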

In some applications it is necessary to make a single best guess about the pose, that is, to summarize the posterior \(p(\phi |\varvec{x})\) into a single point prediction \(\hat{\phi }\). We now discuss an efficient way to do this.

5.3 Point Prediction

To obtain an optimal single point prediction we utilize Bayesian decision theory [6, 50, 51] and minimize the expected loss,

$$\begin{aligned} \hat{\phi }_{\varDelta } = \mathop {{{\mathrm{argmin}}}}\limits _{\phi \in [0,2\pi )} \, \mathbb {E}_{\phi ' \sim p(\phi |\varvec{x})}\left[ \varDelta (\phi ,\phi ')\right] , \end{aligned}$$
(20)

where \(\varDelta : [0,2\pi ) \times [0,2\pi ) \rightarrow \mathbb {R}_+\) is a loss function. We use the loss \(\varDelta _{\text {AAD}}(\phi ,\phi ')\), which measures the absolute angular deviation (AAD). Following [50], we approximate (20) empirically by drawing S samples \(\{\phi _j\}\) from \(p_{\theta }(\phi | \varvec{x})\) and computing

$$\begin{aligned} \hat{\phi }_{\varDelta } = \mathop {{{\mathrm{argmin}}}}\limits _{j=1,\ldots ,S} \, \frac{1}{S} \sum _{k=1}^S \varDelta (\phi _j, \phi _{k}). \end{aligned}$$
(21)
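A NumPy sketch of (21) with the AAD loss: each sampled angle is scored against all others and the sample with the lowest expected deviation is returned.

```python
import numpy as np

def point_prediction(samples: np.ndarray) -> float:
    """Empirical minimizer (21) over S samples phi_j ~ p(phi | x), radians."""
    diff = np.abs(samples[:, None] - samples[None, :])
    aad = np.minimum(diff, 2.0 * np.pi - diff)  # wrap-around angular deviation
    return float(samples[np.argmin(aad.mean(axis=1))])
```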

We now evaluate our models both in terms of uncertainty as well as in terms of point prediction quality.

6 Experiments

This section presents experimental results on several challenging head and object pose regression tasks. Section 6.1 introduces the experimental setup, including the datasets used, the network architectures, and the training setup. In Sect. 6.2 we present and discuss qualitative and quantitative results on the datasets of interest.

6.1 Experimental Setup

Network Architecture and Training. We use two types of network architectures [42, 43] in our experiments and the Adam optimizer [52], performing random search [53] for the best hyper-parameter values. We refer to the supplementary material and the corresponding project repository for further details.

Head Pose Datasets. We evaluate all methods, together with the non-probabilistic BiternionVGG baseline, on three headpose datasets that are diverse in terms of image quality and precision of the provided ground truth: IDIAP head pose [9], TownCentre [54] and CAVIAR [13] coarse gaze estimation. The IDIAP head pose dataset contains 66295 head images stemming from a video recording of a few people in a meeting room. Each image is annotated with the complete head pose orientation in the form of pan, tilt and roll angles. We take 42304, 11995 and 11996 images for training, validation and testing, respectively. The TownCentre and CAVIAR datasets present the challenging task of coarse gaze estimation of pedestrians based on low resolution images from surveillance camera videos. For the CAVIAR dataset, we focus on the part containing occluded head instances (referred to as CAVIAR-o in the literature).

PASCAL3D+ Object Pose Dataset. The Pascal 3D+ dataset [33] consists of images from the Pascal [55] and ImageNet [56] datasets that have been labeled with both detection and continuous pose annotations for the 12 rigid object categories that appear in the Pascal VOC12 [55] train and validation sets. With nearly 3000 object instances per category, this dataset provides a rich testbed for studying general object pose estimation. In our experiments we follow the same protocol as [36, 38] for viewpoint estimation: we use ground truth detections for both training and testing, and use the Pascal validation set to evaluate and compare the quality of our predictions.

Table 1. Quantitative results on the IDIAP head pose estimation dataset [9] for the three head rotations: pan, roll and tilt. With fixed camera pose, lighting conditions and image quality, all methods show similar performance (methods are considered to perform on par when the difference in performance is less than the standard error of the mean).
Table 2. Quantitative results on the CAVIAR-o [13] and TownCentre [54] coarse gaze estimation datasets. We see a clear improvement in the quality of probabilistic predictions on both datasets when switching to mixture models that can output multiple hypotheses for the gaze direction.
Table 3. Results on PASCAL3D+ viewpoint estimation with ground truth bounding boxes. The first two evaluation metrics are defined in [38]: \(Acc_{\frac{\pi }{6}}\) measures accuracy (higher is better) and MedErr measures error (lower is better). Additionally, we report the log-likelihood \(\log \mathcal {L} \) of the predicted angles (higher is better). We see a clear improvement on all metrics when switching to the probabilistic setting compared to training with a purely discriminative loss (the fixed \(\kappa \) case).

6.2 Results and Discussion

Quantitative Results. We evaluate our methods using both discriminative and probabilistic metrics. We use the discriminative metrics that are standard for each dataset of interest, to allow comparison with previous work. For the headpose tasks we use the mean absolute angular deviation (MAAD), a widely used metric for angular regression tasks. For PASCAL3D+ we use the metrics advocated in [38]. Probabilistic predictions are measured in terms of log-likelihood [57, 58], a widely accepted scoring rule for assessing the quality of probabilistic predictions. We summarize the results in Tables 1, 2 and 3. The results on the IDIAP dataset in Table 1 show that when camera pose, lighting conditions and image quality are fixed, all methods perform similarly. In contrast, for the coarse gaze estimation tasks on CAVIAR-o and TownCentre (Table 2) we see a clear improvement in the quality of probabilistic predictions when switching to mixture models that can output multiple hypotheses for the gaze direction. Here, low resolution, poor lighting conditions and the presence of occlusions create a large diversity of head pose appearances. Finally, on the challenging PASCAL3D+ dataset we see a clear improvement on all metrics and classes when switching to the probabilistic setting compared to training with a purely discriminative loss (the fixed \(\kappa \) case). Our methods also show competitive or superior performance compared to state-of-the-art methods on the discriminative metrics advocated in [38]. The method of [36] uses large amounts of synthesized images in addition to the standard training set used by our method; adopting this data augmentation technique could further improve the performance of our method, which we consider future work.

Fig. 6.

Qualitative results of our simplified CVAE model on the PASCAL3D+ dataset. Our model correctly quantifies the uncertainty of pose predictions and is able to model ambiguous cases by predicting complex multimodal densities. The lower right images are failure cases (confusing the head and tail of the object with high confidence).

Qualitative Results. Examples of probabilistic predictions on the PASCAL3D+ dataset are shown in Fig. 6. The upper left images highlight the effect we set out to achieve: correctly quantifying the level of uncertainty of the estimated pose. For easier examples we observe sharp peaks and highly confident detections, and more spread-out densities otherwise. Other examples highlight the advantage of mixture models, which can represent complex densities with multiple peaks corresponding to more than one potential pose angle. Failure cases are highlighted in the lower right: high confidence predictions when the model confuses the head and tail of an object.

7 Conclusion

We demonstrated a new probabilistic model for object pose estimation that is robust to variations in input image quality and accurately quantifies its uncertainty. More generally, our results confirm that our approach is flexible enough to accommodate different output domains such as angular data and enables rich and efficient probabilistic deep learning models. We train all models by maximum likelihood, yet find them competitive with works from the literature that explicitly optimize for point estimates, even under point estimate loss functions. In the future, to improve predictive performance and robustness, we would like to also handle the uncertainty of model parameters [30] and to use the Fisher-von Mises distribution to jointly predict a distribution over azimuth, elevation and tilt [44].

We hope that as intelligent systems increasingly rely on perception abilities, future models in computer vision will be robust and probabilistic.