1 Introduction

Articulated human motion tracking from video sequences has been one of the most challenging computer vision problems of the past two decades. The basic idea is to recover the motion of a complete human body based on image evidence from one or more cameras. Moreover, it is assumed that the tracking is performed without any additional devices, e.g., color or electromagnetic markers. Human motion tracking systems can be applied in many areas of everyday life, see [9, 15, 18]. For example, they may be used in control devices for human–computer interaction, in surveillance systems detecting unusual behavior, in dancing or martial arts training assistants, and in support systems for medical diagnosis.

In recent years, a great deal of effort has been put into solving the human motion tracking problem. However, apart from some restricted settings, the problem remains open. Several reasons make it challenging. First, a huge variety of different images may correspond to the same pose, owing to the diversity of human clothing and appearance, changes in lighting conditions, camera noise, etc. Second, an image lacks depth information, which makes it impossible to unambiguously recover a three-dimensional pose from two-dimensional images. Moreover, one has to handle different types of occlusions, including self-occlusions and occlusions caused by the external environment. Finally, efficient exploration of the space of all possible human poses is troublesome because of the high dimensionality of the space and its non-trivial constraints.

To date, several conceptually different approaches have been proposed to address the human motion tracking problem. They can be roughly divided into two groups.

In the first group, discriminative methods are used to directly model the probability distribution over poses conditioned on the image evidence. This approach is usually composed of two parts: feature extraction followed by prediction using a multivariate regression model. To obtain informative features, simple techniques like binary silhouettes [1, 14] as well as more sophisticated descriptors like histograms of oriented gradients or the HMAX model [3] have been adopted. As the regression model, a whole spectrum of techniques has been used, e.g., ridge regression and support vector machines [1], mixtures of experts [10], gaussian processes [3], and kernel information embedding [14].

By contrast, in the second group a generative approach is used to separately model the prior distribution over poses and the likelihood of how well a given pose fits the current image. Pure generative modeling assumes that one tries to model the true pose space and uses Bayesian inference to combine this prior knowledge with the image evidence to estimate the current pose. Within this group of methods, two important branches have evolved.

In the first branch, kinematic-tree models are used to represent the body pose. Since it is straightforward to render a 3D body model using this representation, the likelihood is usually computed by measuring the difference between the given images and the body model projections. The main effort in this branch is designing an appropriate pose prior. Many strategies have been applied here, from simple limits on joint movements [20] to more advanced models trying to capture the manifolds of human poses embedded in the high-dimensional pose space, e.g., the gaussian process latent variable model (GPLVM) [25, 27], the gaussian process dynamical model [26], mixtures of factor analyzers [13], hierarchical hidden Markov models [17], and restricted Boltzmann machines [24]. Finally, we can predict the body pose by finding the maximum a posteriori (MAP) estimate, or use a fully Bayesian approach by computing the complete posterior distribution over poses. The latter approaches usually take advantage of particle filters to approximate the posterior.

In the second branch, part-based models are used for pose recovery [2, 4, 22, 23, 29]. Here, all body parts are modeled individually. More flexible priors are used to cover the many possible relative positions between parts, and the main effort is put into constructing rich likelihood models that are required to detect individual body parts. In general, inference is based on searching for the MAP estimate using, for example, dynamic programming [29] or branch-and-bound methods [22, 23]; however, there are some cases where fully Bayesian inference is used, see [5]. Finally, part-based models are mainly applied to 2D pose estimation.

In this paper, we present a novel way of incorporating information about a low-dimensional embedding of the pose space into the tracking process, which leads to a generic filtering procedure. We use a generative approach based on a kinematic-tree model and Bayesian inference. We propose a particle filter-based algorithm for the filtering problem, which we will refer to as the manifold regularized particle filter. Finally, we present a dynamics model based on the gaussian process latent variable model with back constraints.

The contribution of the paper is fourfold. First, a new class of particle filter algorithms is proposed in which low-dimensional information is incorporated into the inference process; this allows full Bayesian reasoning to be utilized. Second, we present a specific instance of the proposed approach that combines the gaussian process latent variable model with back constraints and gaussian diffusion. Third, the outlined approach is applied to articulated human motion tracking. Fourth, we show empirically that the presented sampling scheme outperforms the sampling-importance resampling and annealed particle filter procedures on the benchmark dataset HumanEva.

The paper is organized as follows. In Sect. 2 the problem of human motion tracking is outlined and the proposed filtering procedure is derived. In Sect. 3 the manifold regularized particle filter is proposed. The likelihood function is described in Sect. 4, and the model of dynamics with a low-dimensional manifold is presented in Sect. 5. Finally, the empirical study is carried out in Sect. 6 and conclusions are drawn in Sect. 7.

2 Human motion tracking

In this paper, we assume that a human body is represented by a set of articulately connected rigid parts (see Fig. 1a). Each connection between two neighboring elements characterizes a joint and can be described by up to three degrees of freedom, depending on the movability of the joint. All connected parts form a kinematic tree whose root is typically associated with the pelvis. A common representation of the state of the kth joint uses Euler angles that describe the relative rotation between neighboring parts in the kinematic tree (see Fig. 1b). However, we prefer quaternions because they can be compared using the Euclidean distance metric. Moreover, if the relative rotations between connected parts in the kinematic tree are small, i.e., in the range between 0 and \(\pi \), then we can reduce the quaternion representation to a 3-tuple instead of a 4-tuple; see [21] or [5] for details.
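To make this reduction concrete, the following is a minimal Python sketch (our own illustration, assuming unit quaternions stored as \((w, x, y, z)\); the function names are hypothetical). For a relative rotation with angle in \((0, \pi )\) we have \(w=\cos (\mathrm {angle}/2)>0\), so the vector part determines the rotation uniquely:

```python
import numpy as np

def quat_to_vec3(q):
    """Reduce a unit quaternion (w, x, y, z) to its vector part (x, y, z)."""
    w, v = q[0], np.asarray(q[1:], dtype=float)
    return v if w >= 0 else -v          # resolve the q ~ -q sign ambiguity

def vec3_to_quat(v):
    """Recover the scalar part from the stored 3-tuple."""
    v = np.asarray(v, dtype=float)
    w = np.sqrt(max(0.0, 1.0 - float(v @ v)))
    return np.concatenate(([w], v))
```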

Fig. 1

a Human body model represented by articulately connected rigid parts. b Kinematic tree representing connections between neighboring rigid parts. Red, blue, and green vertices correspond to joints with one, two, and three degrees of freedom, respectively. The yellow vertex is the root of the tree

The set of quaternions for all relative rotations in the kinematic tree, together with the global position and orientation in 3D, constitutes the minimal set of variables describing the current pose of the human body, which is denoted by \(\mathbf {x}\). It is worth mentioning that \(\mathbf {x}\) usually has around 40–50 dimensions, which is one of the fundamental reasons why human motion tracking is difficult: searching over such high-dimensional spaces is intractable.

We assume that there are several synchronized cameras providing video images of the human body from different perspectives. The cameras should be located so as to contribute as much information about the body as possible, i.e., they should register different parts of the scene. We denote the set of all available images from all cameras by \(\mathcal {I}\).

In a typical pose estimation problem, we want to estimate the human body configuration \(\mathbf {x}\) based on \(\mathcal {I}\). Thus, the key issue is to properly model the conditional distribution \(p(\mathbf {x}|\mathcal {I})\). Since this is a multivariate regression problem, we can estimate the pose by computing the expected value \(\hat{\mathbf {x}}=\mathbb {E}[\mathbf {x}|\mathcal {I}]\). In the generative approach, we follow Bayes' rule to invert the conditional probability:

$$\begin{aligned} p(\mathbf {x}|\mathcal {I})\propto p(\mathcal {I}|\mathbf {x})p(\mathbf {x}), \end{aligned}$$
(1)

and then model the prior \(p(\mathbf {x})\) and the likelihood \(p(\mathcal {I}|\mathbf {x})\) separately.

We can extend the individual pose estimation problem to tracking the whole trajectory \(\mathbf {x}_{1:T} = \{ \mathbf {x}_{1}, \ldots , \mathbf {x}_{T}\}\) in the pose space. Let \(\mathcal {I}_{1:T}=\{\mathcal {I}_{1},\ldots ,\mathcal {I}_{T}\}\) denote the corresponding sequence of available images.

Before giving the formal problem statement, notice that the high-dimensional pose space consists mostly of unrealistic human body configurations. Additionally, during specific motions (e.g., walking or running) all degrees of freedom exhibit strong correlations that depend on the current pose. These two remarks lead to the corollary that the true trajectories form low-dimensional manifolds. Therefore, we assume that any pose \(\mathbf {x}\) corresponds to a point \(\mathbf {z}\) in the coordinate system of the low-dimensional manifold.

Formally, in the generative approach we need to model the joint probability distribution \(p(\mathbf {x}_{1:T},\mathbf {z}_{1:T},\mathcal {I}_{1:T})\). This task is rather difficult unless we assume some conditional independencies between variables. Typically, it is assumed that the current pose depends only on the current low-dimensional representation and that the current observation depends only on the current pose; temporal dependencies are assumed only between low-dimensional representations [25, 26, 28]. This is a reasonable assumption if the variance of the conditional distribution \(p(\mathbf {x}_{t}|\mathbf {z}_{t})\) is low. However, we have found that for typical video frame rates (e.g., 60 Hz) the variance of \(p(\mathbf {x}_{t}|\mathbf {z}_{t})\) is usually much higher than the variance of \(p(\mathbf {x}_{t}|\mathbf {x}_{t-1})\). This leads to the corollary that human motion forms continuous trajectories that locally oscillate around the low-dimensional manifold.

Therefore, we argue that it is more important to model the temporal coherence between high-dimensional poses, and that low-dimensional representations should be used only to keep the trajectory close to the manifold. The manner in which the joint probability distribution factorizes is presented by the probabilistic graphical model in Fig. 2. Notice that the current state \(\mathbf {x}_{t}\) influences the future state \(\mathbf {x}_{t+1}\) and the future point on the manifold \(\mathbf {z}_{t+1}\), which in turn impacts \(\mathbf {x}_{t+1}\).

Fig. 2

Probabilistic graphical model presenting how the joint probability distribution \(p(\mathbf {x}_{1:T},\mathbf {z}_{1:T},\mathcal {I}_{1:T})\) factorizes
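For reference, the conditional independencies encoded in Fig. 2 correspond to the following factorization (our reconstruction from the description above; the initial factor \(p(\mathbf {x}_{1},\mathbf {z}_{1})\) is left unspecified):

$$\begin{aligned} p(\mathbf {x}_{1:T},\mathbf {z}_{1:T},\mathcal {I}_{1:T}) = p(\mathbf {x}_{1},\mathbf {z}_{1})\,p(\mathcal {I}_{1}|\mathbf {x}_{1}) \prod _{t=2}^{T} p(\mathcal {I}_{t}|\mathbf {x}_{t})\,p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t})\,p(\mathbf {z}_{t}|\mathbf {x}_{t-1}). \end{aligned}$$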

In Bayesian inference, we are interested in calculating the posterior probability distribution of \(\mathbf {x}_{t}\) given the images \(\mathcal {I}_{1:t}\) by marginalizing out all previous poses \(\mathbf {x}_{1:t-1}\) and hidden variables \(\mathbf {z}_{1:t}\) from \(p(\mathbf {x}_{1:t},\mathbf {z}_{1:t}|\mathcal {I}_{1:t})\), which yields:

$$\begin{aligned} p(\mathbf {x}_{t}|\mathcal {I}_{1:t}) =&\frac{p(\mathcal {I}_{t}|\mathbf {x}_{t})}{p(\mathcal {I}_{t}|\mathcal {I}_{1:t-1})} \iint p(\mathbf {z}_{t}|\mathbf {x}_{t-1}) \nonumber \\ {}&\times p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t}) p(\mathbf {x}_{t-1}|\mathcal {I}_{1:t-1})\mathrm {d}\mathbf {x}_{t-1}\mathrm {d}\mathbf {z}_{t}, \end{aligned}$$
(2)

where \(p(\mathcal {I}_{t}|\mathcal {I}_{1:t-1})\) is a normalization constant given by:

$$\begin{aligned} p(\mathcal {I}_{t}|\mathcal {I}_{1:t-1}) =&\iint p(\mathcal {I}_{t}|\mathbf {x}_{t})p(\mathbf {z}_{t}|\mathbf {x}_{t-1})\nonumber \\ {}&\times p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t}) p(\mathbf {x}_{t-1}|\mathcal {I}_{1:t-1})\mathrm {d}\mathbf {x}_{t-1:t}\,\mathrm {d}\mathbf {z}_{t}. \end{aligned}$$
(3)

We have thus obtained a filtering procedure that includes information about the low-dimensional manifold and is independent of the actual form of each component. Further, we show that when filtering is performed in this manner, choosing relatively simple component models \(p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t})\) and \(p(\mathbf {z}_{t}|\mathbf {x}_{t-1})\) leads to very promising results.

3 Manifold regularized particle filter

In the context of human motion tracking, the filtering procedure is usually intractable, since we are unable to compute analytically the integral in (2) and the normalization constant (3) except in the case where all distributions are gaussian. Hence, an approximation of the posterior must be applied. Typically, sampling methods like particle filters are used; in the context of human motion tracking the most popular method is the Condensation algorithm [8]. However, the main disadvantage of this technique is that it requires generating a huge number of particles in order to cover the high-dimensional state space; otherwise, it fails to approximate the true distribution. In order to cover only the highly probable areas of the pose space, an extension called the annealed particle filter (APF) has been proposed [6]. However, in this method particles tend to be trapped in one or a few dominating local maxima of the posterior distribution. Therefore, the method is not robust to cases where a substantial number of local maxima occur and thus fails to track the proper trajectory. This usually happens when the image evidence is weak, e.g., when a noisy likelihood model or a small number of cameras is used.

In this paper, we propose a different approach that modifies the Condensation algorithm by introducing a regularization in the form of the low-dimensional manifold. The filtering procedure operates in the neighborhood of the low-dimensional space where the true poses are concentrated, which guarantees that the highly probable regions are covered and that the particles are distributed around different local extrema.

In fact, the proposed particle-based filtering procedure provides a proxy for the posterior \(p(\mathbf {x}_{t}|\mathcal {I}_{1:t})\) obtained in (2), because we can approximate the distribution as follows:

$$\begin{aligned} p(\mathbf {x}_{t}|\mathcal {I}_{1:t}) \approx \sum _{n=1}^{N}\pi (\mathbf {x}_{t}^{(n)})\delta (\mathbf {x}_{t}-\mathbf {x}_{t}^{(n)}) , \end{aligned}$$
(4)

where \(\mathbf {x}_{t}^{(1)},\ldots ,\mathbf {x}_{t}^{(N)}\sim p(\mathbf {x}_{t}|\mathcal {I}_{1:t-1})\) denote samples from the current prior, \(\delta (\cdot )\) is the Dirac delta function, and \(\pi (\mathbf {x}_{t}^{(n)})\) is the normalized form of a single score calculated using the likelihood model, \(\tilde{\pi }(\mathbf {x}_{t}^{(n)}) = p(\mathcal {I}_{t}|\mathbf {x}_{t}^{(n)})\), so:

$$\begin{aligned} \pi (\mathbf {x}_{t}^{(n)})=\frac{\tilde{\pi }(\mathbf {x}_{t}^{(n)})}{\sum _{j=1}^{N} \tilde{\pi }(\mathbf {x}_{t}^{(j)})}. \end{aligned}$$
(5)

However, in most cases it is troublesome to generate a sample from the prior \(p(\mathbf {x}_{t}|\mathcal {I}_{1:t-1})\) using standard sampling techniques for directed graphical models, since generating \(\mathbf {x}_{t}\) from \(p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t})\) is usually intractable. Thus, we introduce an auxiliary distribution \(q(\mathbf {x}_{t}|\mathbf {x}_{t-1})\) from which we can sample efficiently. Then, taking advantage of the conditional independencies defined by the probabilistic graphical model in Fig. 2, we get:

$$\begin{aligned} p(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t}|\mathcal {I}_{1:t-1})&= \frac{1}{Z}\tilde{\omega }(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t}) \nonumber \\&\quad \times q(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t}|\mathcal {I}_{1:t-1}), \end{aligned}$$
(6)

where \(\tilde{\omega }\) are weight coefficients defined as follows:

$$\begin{aligned} \tilde{\omega }(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t})=\frac{p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t})}{q(\mathbf {x}_{t}|\mathbf {x}_{t-1})}, \end{aligned}$$
(7)

and \(q(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t}|\mathcal {I}_{1:t-1})\) is a joint auxiliary distribution:

$$\begin{aligned} q(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t}|\mathcal {I}_{1:t-1})&= q(\mathbf {x}_{t}|\mathbf {x}_{t-1})p(\mathbf {z}_{t}|\mathbf {x}_{t-1}) \nonumber \\&\quad \times p(\mathbf {x}_{t-1}|\mathcal {I}_{1:t-1}), \end{aligned}$$
(8)

and Z is a normalization constant.

Let us assume that we can sample from \(q(\mathbf {x}_{t}|\mathbf {x}_{t-1})\) and \(p(\mathbf {z}_{t}|\mathbf {x}_{t-1})\). Thus, once we have a sample from the previous posterior \(p(\mathbf {x}_{t-1}|\mathcal {I}_{1:t-1})\), we can easily generate a sample from the auxiliary joint distribution (8) and use it to approximate the distribution (6). Since integrating out variables in discrete approximations is trivial, we obtain the following expression for the prior:

$$\begin{aligned} p(\mathbf {x}_{t}|\mathcal {I}_{1:t-1})\approx \sum _{n=1}^{N} \omega (\mathbf {x}_{t}^{(n)},\mathbf {x}_{t-1}^{(n)},\mathbf {z}_{t}^{(n)}) \delta (\mathbf {x}_{t}-\mathbf {x}_{t}^{(n)}), \end{aligned}$$
(9)

where the coefficients \(\omega (\mathbf {x}_{t}^{(n)},\mathbf {x}_{t-1}^{(n)},\mathbf {z}_{t}^{(n)})\) are obtained by normalizing (7) in a manner analogous to (5).

Finally, by combining the prior (9) with the image likelihoods, we can approximate the posterior distribution (2) using the following formula:

$$\begin{aligned} p(\mathbf {x}_{t}|\mathcal {I}_{1:t}) \approx \sum _{n=1}^{N}\xi (\mathbf {x}^{(n)}_{t},\mathbf {x}^{(n)}_{t-1},\mathbf {z}^{(n)}_{t})\delta (\mathbf {x}_{t}-\mathbf {x}_{t}^{(n)}), \end{aligned}$$
(10)

where the weights are defined as follows:

$$\begin{aligned} \xi (\mathbf {x}^{(n)}_{t},\mathbf {x}^{(n)}_{t-1},\mathbf {z}^{(n)}_{t}) = \frac{\tilde{\pi }(\mathbf {x}_{t}^{(n)})\tilde{\omega }(\mathbf {x}^{(n)}_{t},\mathbf {x}^{(n)}_{t-1},\mathbf {z}^{(n)}_{t})}{\sum _{j=1}^{N} \tilde{\pi }(\mathbf {x}_{t}^{(j)})\tilde{\omega }(\mathbf {x}^{(j)}_{t},\mathbf {x}^{(j)}_{t-1},\mathbf {z}^{(j)}_{t})}. \end{aligned}$$
(11)

Notice that the low-dimensional manifold is introduced in such a manner that the particles are weighted both by the image evidence and by the coefficients \(\tilde{\omega }\).

Fig. 3

Schematic illustration of the manifold regularized particle filter

As a result, we obtain a new generic particle-based filtering procedure that takes advantage of the prior given by the low-dimensional manifold. We refer to the proposed approach as the manifold regularized particle filter (MRPF). A schematic representation of the MRPF is presented in Fig. 3 and the detailed procedure is given in Algorithm 1. Notice that to mitigate the particle degeneracy phenomenon, which is a typical problem in particle filter-based algorithms, we apply a resampling procedure; for more detailed considerations on the importance of resampling see, for example, [7]. To avoid confusion, we use \(\mathcal {X}_{t}\) and \(\overline{\mathcal {X}}_{t}\) to denote the samples before and after resampling, respectively.

Algorithm 1 Manifold regularized particle filter
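To illustrate the procedure, the following is a minimal Python sketch of a single MRPF step implementing Eqs. (4)–(11). The callables sample_q, sample_z, omega_tilde, and likelihood stand in for the models discussed in Sects. 4 and 5; the function names and array layout are our own assumptions, not part of the original algorithm:

```python
import numpy as np

def mrpf_step(X_prev, images, sample_q, sample_z, omega_tilde, likelihood):
    """One MRPF step: X_prev is an (N, D) array of resampled particles
    approximating p(x_{t-1} | I_{1:t-1})."""
    N = X_prev.shape[0]
    # Propagate each particle: z_t^(n) ~ p(z_t | x_{t-1}^(n)), x_t^(n) ~ q(x_t | x_{t-1}^(n)).
    Z_t = np.stack([sample_z(x) for x in X_prev])
    X_t = np.stack([sample_q(x) for x in X_prev])
    # Manifold regularization weights (Eq. 7) and image likelihood scores.
    w = np.array([omega_tilde(X_t[n], X_prev[n], Z_t[n]) for n in range(N)])
    pi = np.array([likelihood(x, images) for x in X_t])
    # Combined, normalized weights (Eq. 11).
    xi = w * pi
    xi /= xi.sum()
    # Posterior mean estimate, then systematic resampling against degeneracy.
    x_hat = (xi[:, None] * X_t).sum(axis=0)
    u = (np.arange(N) + np.random.rand()) / N
    idx = np.minimum(np.searchsorted(np.cumsum(xi), u), N - 1)  # guard round-off
    return X_t[idx], x_hat
```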

4 Likelihood function

The likelihood function \(p(\mathcal {I}_{t}|\mathbf {x}_{t})\) aims at evaluating how well a given human body configuration \(\mathbf {x}_{t}\) corresponds to the set of images \(\mathcal {I}_{t}\). We compare the human body model (see Fig. 1a) projected onto each camera view with the binary silhouette obtained from a background subtraction procedure by calculating the difference between them. This model is called the bidirectional silhouette likelihood [20] and can be defined as follows:

$$\begin{aligned} -\ln p(\mathcal {I}|\mathbf {x})&= \frac{1}{|\mathcal {I}|}\sum _{\mathrm {I}\in \mathcal {I}}\Bigg \{ \frac{1}{|\{\mathrm {S}^{\mathrm {I}}_{ij}(\mathbf {x})=1\}|}\sum _{(i,j)\in \{\mathrm {S}^{\mathrm {I}}_{ij}(\mathbf {x})=1\}} \big ( 1 - \mathrm {S}^{\mathrm {I}}_{ij} \big ) \nonumber \\&\quad + \frac{1}{|\{\mathrm {S}^{\mathrm {I}}_{ij}=1\}|}\sum _{(i,j)\in \{\mathrm {S}^{\mathrm {I}}_{ij}=1\}} \big ( 1 - \mathrm {S}^{\mathrm {I}}_{ij}(\mathbf {x}) \big ) \Bigg \} + \mathrm {const}, \end{aligned}$$
(12)

where \(\mathrm {S}^{\mathrm {I}}\) denotes the binary silhouette obtained by background subtraction from the input image \(\mathrm {I}\), and \(\mathrm {S}^{\mathrm {I}}(\mathbf {x})\) denotes the binary silhouette obtained by projecting the body model in pose \(\mathbf {x}\) onto the image \(\mathrm {I}\). The additional constant in (12) corresponds to the normalizing coefficient of the probability distribution, which is independent of the image. We use a simple body model composed of articulately connected cylindrical elements; an analogous model was used in [20]. The idea of calculating the likelihood is presented in Fig. 4.
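As an illustration, the bidirectional silhouette likelihood (12) for one set of views can be sketched in numpy as follows (a minimal sketch, assuming the silhouettes are given as \(\{0,1\}\) arrays of equal size; the fallback value for empty silhouettes is our own choice):

```python
import numpy as np

def bidirectional_silhouette_nll(S_obs, S_model):
    """Negative log-likelihood for one view (Eq. 12, up to the constant)."""
    model_on = S_model == 1
    obs_on = S_obs == 1
    # Fraction of model pixels not explained by the observed silhouette ...
    term1 = (1 - S_obs[model_on]).mean() if model_on.any() else 1.0
    # ... and fraction of observed pixels not covered by the model.
    term2 = (1 - S_model[obs_on]).mean() if obs_on.any() else 1.0
    return term1 + term2

def total_nll(silhouettes_obs, silhouettes_model):
    # Average over all camera views in the image set I.
    return np.mean([bidirectional_silhouette_nll(o, m)
                    for o, m in zip(silhouettes_obs, silhouettes_model)])
```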

Fig. 4

a An example input image \(\mathrm {I}\). b Binary silhouette \(\mathrm {S}^{\mathrm {I}}\) obtained by the background subtraction procedure. c Comparison between \(\mathrm {S}^{\mathrm {I}}\) and the binary silhouette \(\mathrm {S}^{\mathrm {I}}(\mathbf {x})\) obtained by projecting the body model onto image \(\mathrm {I}\)

5 Dynamics model using low-dimensional manifold

We propose to model the dynamics using a low-dimensional manifold and a nonlinear dependency. First, we need to learn the low-dimensional manifold. Second, we need to construct a model of the dynamics on the low-dimensional manifold, \(p(\mathbf {z}_{t}|\mathbf {x}_{t-1})\), and a model of the dynamics in the pose space given the low-dimensional manifold, \(p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t})\).

5.1 Learning the low-dimensional manifold

To learn the low-dimensional manifold, we apply the Gaussian process latent variable model (GPLVM) [11]. The GPLVM defines a nonlinear dependency between the low-dimensional manifold and the pose as follows:

$$\begin{aligned} \mathbf {x} = \mathbf {f}(\mathbf {z}) + \varvec{\varepsilon }, \end{aligned}$$
(13)

where the ith component of \(\mathbf {f}\) is a realization of a gaussian process [19], \(f_{i}\sim \mathcal {GP}(f|0,k(\mathbf {z},\mathbf {z}'))\), with k being a kernel function, and \(\varvec{\varepsilon }\sim \mathcal {N}(\varvec{\varepsilon }|0,\sigma _{z}^{2}\mathbf {I}_{D\times D})\) denotes isotropic gaussian noise with variance \(\sigma _{z}^{2}\), where \(\mathbf {I}_{D\times D}\) is the D-dimensional identity matrix. In this paper, we use the RBF kernel,

$$\begin{aligned} k(\mathbf {z},\mathbf {z}')=\beta \exp \left( -\frac{\gamma _{z}}{2}\Vert \mathbf {z}-\mathbf {z}'\Vert ^{2}\right) + \beta _{0}, \end{aligned}$$
(14)

where \(\beta \), \(\beta _{0}\), and \(\gamma _{z}\) are kernel parameters.

We are interested in finding the matrix \(\mathbf {Z}\) of low-dimensional variables corresponding to the observed poses \(\mathbf {X}\). Additionally, we want to determine the mapping between the manifold and the high-dimensional space by learning the parameters \(\beta \), \(\beta _{0}\), \(\gamma _{z}\), and \(\sigma _{z}^{2}\). Training corresponds to finding the parameters and the points on the manifold that maximize the logarithm of the likelihood function:

$$\begin{aligned} \ln p(\mathbf {X}|\mathbf {Z})&= \ln \prod _{i=1}^{D} \mathcal {N}(\mathbf {X}_{:,i}|0,\mathbf {K}+\sigma _{z}^{2}\mathbf {I}_{T\times T}) \nonumber \\&= -\frac{DT}{2}\ln (2\pi ) -\frac{D}{2}\ln |\overline{\mathbf {K}}| - \frac{1}{2}\mathrm {tr}(\mathbf {X}^{\mathrm {T}}\overline{\mathbf {K}}^{-1}\mathbf {X}), \end{aligned}$$
(15)

where \(\mathbf {X}_{:,i}\) denotes the ith column of the matrix \(\mathbf {X}\), \(|\cdot |\) and \(\mathrm {tr}(\cdot )\) denote the matrix determinant and trace, respectively, \(\overline{\mathbf {K}}=\mathbf {K}+\sigma _{z}^{2}\mathbf {I}_{T\times T}\), and \(\mathbf {K}=[k_{nm}]\) is the kernel matrix with elements \(k_{nm}=k(\mathbf {z}_{n},\mathbf {z}_{m})\).

Notice that the maximizing solutions \(\mathbf {z}_{t}\) and the parameter \(\gamma _{z}\) can be arbitrarily re-scaled; thus, there are many equivalent solutions. To avoid this issue, we introduce the regularizer \(\frac{1}{2}\Vert \mathbf {Z}\Vert ^{2}_{F}\), where \(\Vert \cdot \Vert _{F}\) is the Frobenius norm, so that the final objective function takes the form:

$$\begin{aligned} L(\mathbf {Z})=\ln p(\mathbf {X}|\mathbf {Z}) - \frac{1}{2}\Vert \mathbf {Z}\Vert ^{2}_{F}. \end{aligned}$$
(16)

The objective function can be optimized using standard gradient-based optimization algorithms, e.g., the scaled conjugate gradient method. However, the objective function is not concave and hence has multiple local maxima; therefore, it is important to initialize the numerical algorithm carefully, e.g., using principal component analysis.
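For illustration, the regularized objective (16) can be sketched in Python as follows (a minimal sketch, assuming \(\mathbf {X}\) is a \(T\times D\) array of poses and \(\mathbf {Z}\) a \(T\times d\) array of latent points):

```python
import numpy as np

def rbf_kernel(Z, beta, beta0, gamma_z):
    # Eq. (14): k(z, z') = beta * exp(-gamma_z/2 * ||z - z'||^2) + beta0
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return beta * np.exp(-0.5 * gamma_z * sq) + beta0

def gplvm_objective(Z, X, beta, beta0, gamma_z, sigma2):
    """Regularized log-likelihood L(Z) of Eqs. (15)-(16)."""
    T, D = X.shape
    K_bar = rbf_kernel(Z, beta, beta0, gamma_z) + sigma2 * np.eye(T)
    _, logdet = np.linalg.slogdet(K_bar)
    Kinv_X = np.linalg.solve(K_bar, X)          # K_bar^{-1} X without explicit inverse
    ll = (-0.5 * D * T * np.log(2 * np.pi)
          - 0.5 * D * logdet
          - 0.5 * np.trace(X.T @ Kinv_X))
    return ll - 0.5 * (Z ** 2).sum()            # minus the Frobenius regularizer
```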

Optimization algorithms need the gradient of the objective function. To calculate the gradient of (16) w.r.t. \(\overline{\mathbf {K}}\), we use the properties of matrix derivatives [16], which yields:

$$\begin{aligned} \frac{\partial L(\mathbf {Z})}{\partial \overline{\mathbf {K}}} = -\frac{D}{2}\overline{\mathbf {K}}^{-1}+\frac{1}{2}\overline{\mathbf {K}}^{-1}\mathbf {X}\mathbf {X}^{\mathrm {T}}\overline{\mathbf {K}}^{-1}. \end{aligned}$$
(17)

Next, the derivative of (16) w.r.t. \(z_{t}^{i}\) is as follows:

$$\begin{aligned} \frac{\partial L}{\partial z_{t}^{i}} = \mathrm {tr} \left( \left( \frac{\partial L(\mathbf {Z})}{\partial \overline{\mathbf {K}}} \right) ^{\mathrm {T}} \frac{\partial \overline{\mathbf {K}}}{\partial z_{t}^{i}}\right) - \frac{1}{2}\frac{\partial \Vert \mathbf {Z}\Vert ^{2}_{F}}{\partial z_{t}^{i}}, \end{aligned}$$
(18)

where \(\frac{\partial L(\mathbf {Z})}{\partial \overline{\mathbf {K}}}\) is given by (17), and the derivatives \(\frac{\partial \overline{\mathbf {K}}}{\partial z_{t}^{i}}\) follow from the RBF kernel (14):

$$\begin{aligned} \frac{\partial \overline{K}_{nm}}{\partial z_{t}^{i}} = -\gamma _{z}\big (k(\mathbf {z}_{n},\mathbf {z}_{m})-\beta _{0}\big )\big (z_{n}^{i}-z_{m}^{i}\big )\big (\delta _{nt}-\delta _{mt}\big ), \end{aligned}$$
(19)

where \(\delta _{nt}\) denotes the Kronecker delta,

and finally

$$\begin{aligned} \frac{\partial \Vert \mathbf {Z}\Vert ^{2}_{F}}{\partial z_{t}^{i}} = 2z_{t}^{i}. \end{aligned}$$
(20)

The derivatives w.r.t. \(\beta \), \(\beta _{0}\), \(\gamma _{z}\), and \(\sigma _{z}^{2}\) can be calculated analogously.

Notice that the kernel defining the covariance function takes high values for points \(\mathbf {z}_{n}\) and \(\mathbf {z}_{m}\) that are close to each other, i.e., similar. Consequently, if two points on the manifold are similar, the corresponding poses \(\mathbf {x}_{n}\) and \(\mathbf {x}_{m}\) are similar as well. However, the converse does not hold: similar poses may be mapped to distant points on the manifold. This is undesirable in the proposed filtering procedure (2), since it makes the distribution \(p(\mathbf {z}_{t}|\mathbf {x}_{t-1})\) multi-modal and thus hard to determine. This effect can be reduced by introducing back constraints, which leads to the back-constrained GPLVM (BC-GPLVM) [12].

The idea behind the BC-GPLVM is to define \(\mathbf {z}\) as a smooth mapping of \(\mathbf {x}\), \(\mathbf {z} = \mathbf {g}(\mathbf {x})\). For example, this mapping can be given as a kernel expansion that is linear in its parameters, i.e.,

$$\begin{aligned} g_{i}(\mathbf {x}) = \sum _{t=1}^{T} a_{ti} k_{x}(\mathbf {x},\mathbf {x}_{t}) + b_{i}, \end{aligned}$$
(21)

where \(g_{i}\) denotes the ith component of \(\mathbf {g}\), \(a_{ti}\) and \(b_{i}\) are parameters, and

$$\begin{aligned} k_{x}(\mathbf {x},\mathbf {x}')=\exp \left( -\frac{\gamma _{x}}{2}\Vert \mathbf {x}-\mathbf {x}'\Vert ^{2}\right) \end{aligned}$$
(22)

is the kernel function in the high-dimensional space of poses. We can incorporate the mapping into the objective function, i.e., \(z_{n}^{i} = g_{i}(\mathbf {x}_{n})\), and then optimize w.r.t. \(a_{ti}\) and \(b_{i}\) instead of \(z_{n}^{i}\). The application of the back constraints ensures that the low-dimensional points \(\mathbf {z}_{t}\) are close whenever the corresponding high-dimensional points \(\mathbf {x}_{t}\) are similar.
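A minimal sketch of the back-constraint mapping (21)–(22), assuming the coefficients \(a_{ti}\) are stored in a \(T\times d\) array A and the offsets \(b_{i}\) in a d-vector b (our layout, for illustration only):

```python
import numpy as np

def back_constraint(x, X_train, A, b, gamma_x):
    """Eq. (21): z_i = sum_t a_ti * k_x(x, x_t) + b_i, with the RBF kernel of Eq. (22)."""
    k_x = np.exp(-0.5 * gamma_x * ((X_train - x) ** 2).sum(-1))  # (T,) kernel values
    return k_x @ A + b                                           # (d,) latent point z
```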

A big advantage of gaussian processes is the tractability of the predictive distribution for a new pose \(\mathbf {x}_p\) given its low-dimensional representation \(\mathbf {z}_p\). The corresponding kernel matrix is:

$$\begin{aligned} \begin{bmatrix} \overline{\mathbf {K}}&\overline{\mathbf {k}}\\ \overline{\mathbf {k}}^{\mathrm {T}}&\overline{k}_{z}(\mathbf {z}_p,\mathbf {z}_p) \end{bmatrix}, \end{aligned}$$
(23)

where \(\overline{\mathbf {k}}\) is the vector of kernel values \(k(\mathbf {z}_p,\mathbf {z}_{t})\) between the new point and the training points, and \(\overline{k}_{z}(\mathbf {z}_p,\mathbf {z}_p)=k(\mathbf {z}_p,\mathbf {z}_p)+\sigma _{z}^{2}\). The predictive distribution is then [19]:

$$\begin{aligned} p(\mathbf {x}_p|\mathbf {z}_p,\mathbf {X},\mathbf {Z}) = \mathcal {N}(\mathbf {x}_p|\varvec{\mu }_{p},\sigma ^{2}_{p}\mathbf {I}_{D\times D}), \end{aligned}$$
(24)

where:

$$\begin{aligned} \varvec{\mu }_{p}&= \mathbf {X}^{\mathrm {T}}\overline{\mathbf {K}}^{-1}\overline{\mathbf {k}}, \end{aligned}$$
(25)
$$\begin{aligned} \sigma _{p}^{2}&=\overline{k}_{z}(\mathbf {z}_p,\mathbf {z}_p) - \overline{\mathbf {k}}^{\mathrm {T}}\overline{\mathbf {K}}^{-1}\overline{\mathbf {k}}. \end{aligned}$$
(26)
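A minimal sketch of the predictive mean and variance (25)–(26), recomputing the kernel quantities inline (note that \(\overline{k}_{z}(\mathbf {z}_p,\mathbf {z}_p)=\beta +\beta _{0}+\sigma _{z}^{2}\) for the RBF kernel (14)):

```python
import numpy as np

def gp_predict(z_p, Z, X, beta, beta0, gamma_z, sigma2):
    """Predictive mean and variance of Eqs. (24)-(26) for a new latent point z_p."""
    T = Z.shape[0]
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K_bar = beta * np.exp(-0.5 * gamma_z * sq) + beta0 + sigma2 * np.eye(T)
    k_bar = beta * np.exp(-0.5 * gamma_z * ((Z - z_p) ** 2).sum(-1)) + beta0
    alpha = np.linalg.solve(K_bar, k_bar)
    mu = X.T @ alpha                      # Eq. (25): X^T K_bar^{-1} k_bar
    k_zz = beta + beta0 + sigma2          # \bar{k}_z(z_p, z_p) includes the noise term
    var = k_zz - k_bar @ alpha            # Eq. (26)
    return mu, var
```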

5.2 Models of dynamics

The idea of the model \(p(\mathbf {z}_{t}|\mathbf {x}_{t-1})\) is to predict the new position on the manifold based on the previous pose. Therefore, we need a mapping that transforms a high-dimensional representation into a low-dimensional one. For this purpose, we apply the back constraints. By adding gaussian noise with covariance matrix \(\mathrm {diag}(\varvec{\sigma }^{2}_{x\rightarrow z})\) to the back constraints, we obtain the following model of the dynamics on the manifold:

$$\begin{aligned} p(\mathbf {z}_{t}|\mathbf {x}_{t-1})=\mathcal {N}(\mathbf {z}_{t}|\mathbf {g}(\mathbf {x}_{t-1}),\mathrm {diag}(\varvec{\sigma }^{2}_{x\rightarrow z})). \end{aligned}$$
(27)

On the other hand, the model \(p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t})\) determines the probability of the current pose based on the previous pose and the current point on the low-dimensional manifold. A reasonable assumption is that the model factorizes into two components: one concerning only the previous pose, and a second concerning only the low-dimensional manifold. This factorization follows from the fact that the two quantities belong to different spaces and are thus hard to compare quantitatively. The model of the dynamics then takes the following form:

$$\begin{aligned} p(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathbf {z}_{t}) \propto p(\mathbf {x}_{t}|\mathbf {x}_{t-1})p(\mathbf {x}_{t}|\mathbf {z}_{t}). \end{aligned}$$
(28)

The first component is expressed as a normal distribution with the diagonal covariance matrix \(\mathrm {diag}(\varvec{\sigma }_{x\rightarrow x}^{2})\):

$$\begin{aligned} p(\mathbf {x}_{t}|\mathbf {x}_{t-1}) = \mathcal {N}(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathrm {diag}(\varvec{\sigma }_{x\rightarrow x}^{2})). \end{aligned}$$
(29)

The second component is constructed using the mean of the predictive distribution (25), perturbed by gaussian noise with the diagonal covariance matrix \(\mathrm {diag}(\varvec{\sigma }_{z\rightarrow x}^{2})\), which leads to the following model:

$$\begin{aligned} p(\mathbf {x}_{t}|\mathbf {z}_{t})=\mathcal {N}(\mathbf {x}_{t}|\mathbf {X}^{\mathrm {T}}\overline{\mathbf {K}}^{-1}\overline{\mathbf {k}}, \mathrm {diag}(\varvec{\sigma }^{2}_{z \rightarrow x})). \end{aligned}$$
(30)

It is important to highlight that the training of the parameters \(\mathrm {diag}(\varvec{\sigma }^{2}_{z \rightarrow x})\) has to be performed using a separate validation set. Otherwise, using the same data as for determining \(\mathbf {Z}\) leads to underestimation of the parameters.

Finally, let us consider the application of the MRPF (see Algorithm 1) with the outlined models of dynamics. We need to propose the auxiliary distribution \(q(\mathbf {x}_{t}|\mathbf {x}_{t-1})\); in our case it is given by (29), i.e., \(q(\mathbf {x}_{t}|\mathbf {x}_{t-1}) = \mathcal {N}(\mathbf {x}_{t}|\mathbf {x}_{t-1},\mathrm {diag}(\varvec{\sigma }_{x\rightarrow x}^{2}))\). The weights \(\tilde{\omega }\) are then given by (30), i.e., \(\tilde{\omega }(\mathbf {x}_{t},\mathbf {x}_{t-1},\mathbf {z}_{t}) = \mathcal {N}(\mathbf {x}_{t}|\mathbf {X}^{\mathrm {T}}\overline{\mathbf {K}}^{-1}\overline{\mathbf {k}}, \mathrm {diag}(\varvec{\sigma }^{2}_{z \rightarrow x}))\).
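The following sketch wires these models into the MRPF step from Sect. 3, reusing back_constraint and gp_predict from the earlier sketches (our own illustration); since the weights are normalized in (11), the gaussian normalization constants are dropped in omega_tilde:

```python
import numpy as np

def make_dynamics(X, Z, A, b, kern, s2_xz, s2_xx, s2_zx):
    """Build sample_z, sample_q, omega_tilde for mrpf_step.

    X, Z : training poses and their latent points; A, b : back-constraint
    parameters (Eq. 21); kern = (beta, beta0, gamma_z, sigma2, gamma_x);
    s2_* : the diagonal noise variances of Eqs. (27), (29), (30).
    """
    beta, beta0, gamma_z, sigma2, gamma_x = kern

    def sample_z(x_prev):                  # Eq. (27): gaussian around g(x_{t-1})
        mu = back_constraint(x_prev, X, A, b, gamma_x)
        return mu + np.sqrt(s2_xz) * np.random.randn(mu.shape[0])

    def sample_q(x_prev):                  # Eq. (29), used as the auxiliary q
        return x_prev + np.sqrt(s2_xx) * np.random.randn(x_prev.shape[0])

    def omega_tilde(x_t, x_prev, z_t):     # Eq. (30): gaussian around the GP mean
        mu, _ = gp_predict(z_t, Z, X, beta, beta0, gamma_z, sigma2)
        return np.exp(-0.5 * ((x_t - mu) ** 2 / s2_zx).sum())

    return sample_z, sample_q, omega_tilde
```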

Fig. 5

Low-dimensional pose representations learned using training data for each sequence

6 Empirical study

6.1 Setup

Dataset The performance of the proposed approach is evaluated using the real-life benchmark dataset HumanEva [20]. The dataset contains multiple subjects performing a set of predefined actions. Originally, for each subject and action the dataset is divided into training, validation, and testing sequences; however, the testing sequences are not publicly available. Therefore, a different data division is often used for evaluation, e.g., see [5].

In the experiment we focused on two motion types, namely walking and jogging, performed by three different persons (S1, S2, S3), which results in six different sequences. In each sequence we used 350 and 300 frames from different training trials for the training and validation sets, respectively. The only exception was the sequence S1-Jog, which contained 200 frames in each of the training and validation sets. For testing we utilized the first 200 frames of each validation trial.

Evaluation methodology The aim of the experiment is to evaluate the proposed MRPF-based approach. The presented method was tested against two well-known approaches: ordinary sampling importance resampling (SIR) and the annealed particle filter (APF). These two methods are usually used as baselines for comparison on HumanEva, e.g., [5, 13], and their code is provided together with HumanEva. In both methods, gaussian diffusion was applied as the dynamics model.

Each motion sequence is synchronized with measurements from a MOCAP system, so it is possible to evaluate the difference between the true pose configuration and the estimated one using the following equation, where \(\mathbf {w}(\cdot )\in \mathcal {W}\) denotes one of M points on the body computed from the state variables:

$$\begin{aligned} \mathrm {err}(\hat{\mathbf {x}}_{1:T})=\frac{1}{TM}\sum _{t=1}^{T}\sum _{\mathbf {w}\in \mathcal {W}}\Vert \mathbf {w}(\mathbf {x}_{t})-\mathbf {w}(\hat{\mathbf {x}}_{t})\Vert . \end{aligned}$$
(31)

The obtained error \(\mathrm {err}(\hat{\mathbf {x}}_{1:T})\) is expressed in millimeters.
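For clarity, the error metric (31) can be sketched as follows; body_points is a hypothetical forward-kinematics function mapping a pose to an \(M\times 3\) array of point locations (the name is ours, not part of HumanEva's API):

```python
import numpy as np

def tracking_error(body_points, X_true, X_est):
    """Eq. (31): average Euclidean distance in mm over T frames and M body points."""
    per_frame = [np.linalg.norm(body_points(x) - body_points(x_hat), axis=1).mean()
                 for x, x_hat in zip(X_true, X_est)]
    return float(np.mean(per_frame))
```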

In the empirical study we used the following numbers of particles: (i) MRPF with 500 particles, (ii) SIR with 500 particles, and (iii) APF with 5 annealing layers of 100 particles each. The low-dimensional manifold had two dimensions; the learned embeddings are presented in Fig. 5. All parameters (except \(\gamma _{x} = 10^{-4}\)) were set by the optimization process. Each method was run 5 times.

Fig. 6

Tracking error rate over the test sequences

6.2 Results and discussion

The averaged results obtained in the experiment are gathered in Table 1. They show that the proposed MRPF-based approach gave the best results except on the sequence S1-Jog, for which SIR was slightly better. This is probably caused by the low quality of this sequence, which resulted in shorter training and validation sets; consequently, the manifold was possibly not fully discovered.

Table 1 The tracking errors \(\mathrm {err}(\hat{\mathbf {x}}_{1:T})\) for all methods are expressed as an average and a standard deviation (in brackets)
Fig. 7

Example frames from S3-Jog test sequence

Table 2 The tracking errors \(\mathrm {err}(\hat{\mathbf {x}}_{1:T})\) for different body parts are expressed as an average and a standard deviation (in brackets)

The worst performance was obtained by the APF. This result can be explained as follows. First, the likelihood model used in the experiment is highly noisy due to the low quality of the silhouettes obtained in the background subtraction process; noise in the likelihood model leads to displacement of the extrema and thus to incorrect tracking. Second, the number of particles may be insufficient; however, a larger number of particles would cause prohibitively long execution times.

In Fig. 6, the tracking error rates are presented. We can see that the MRPF method behaves more stably than SIR and APF, i.e., the error does not accumulate over consecutive frames. This is the effect of keeping the trajectory close to the manifold, which is especially important when the image evidence is poor and thus highly ambiguous. For a more detailed view of this problem, see Table 2, where individual tracking errors for different body parts are presented. Notice that MRPF always achieves better results for the arms, where the ambiguity is greatest, since for most of the tracking time the arms stay occluded by the torso. This effect can also be seen in Fig. 7, where some example frames from the last test sequence are shown.

7 Conclusions

In this paper, a fully Bayesian approach to articulated human motion tracking was proposed. The modification of the standard Condensation algorithm is based on introducing a low-dimensional manifold as a regularizer that incorporates prior knowledge about the specificity of human motion. The application of the low-dimensional manifold allows restricting the space of possible pose configurations; the idea is based on the GPLVM with back constraints. Finally, an experiment was carried out using the real-life benchmark dataset HumanEva. The proposed approach was compared with two particle filters, namely SIR and APF, and the results showed that it outperformed both of them.