1 Introduction

Visual motion tracking is an essential component of perception and has been an active research area in computer vision for the past two decades. 2D and 3D visual tracking algorithms have progressed rapidly thanks to the explosive growth of video data, which in turn creates high demand for accurate and fast tracking methods. Current approaches aim at faster and better methods in spite of the challenges that remain in this topic, especially robustness to large occlusions, drastic scale changes, accurate localization, multi-object tracking, and recovery from failure [25, 27]. Despite the success in addressing numerous challenges under a wide range of scenarios, a number of core problems remain unsolved. A major challenge in real scenarios is handling missing data entries, caused by ad hoc data collection, the presence of outliers, sensor failure, or partial knowledge of the relationships in a dataset. For instance, when recovering object motions and deformations from video, the tracking algorithm may lose track of features in some image frames due to lack of visibility or mismatches. Similarly, for 3D tracking, multi-camera systems (such as motion capture systems) [26, 46] are used to obtain the time-varying evolution of a scene. While these systems are now capable of recovering most of the observations, they can fail in real-world scenarios, such as those involving multiple objects that perform different activities, deform, move, and even interact with each other. In these cases, missing tracks continually appear, caused by either self-occlusions or occlusions between objects. This is especially relevant in outdoor scenarios, where current algorithms for estimating motion tracks often produce partial solutions with a large number of missing entries.

In many fields, an underlying tenet is that the data contain a certain type of structure that enables intelligent processing and representation, and that this structure can be characterized using parametric models. Under this assumption, the visual tracking completion problem can be addressed as a matrix completion problem. To this end, one can use the well-known linear subspaces, since they are easy to estimate and often effective in real-world applications. For instance, these models have been successfully used to characterize several types of visual data, such as motion [39, 49], shape [32], and texture [35]. Perhaps the most common choice is principal component analysis, which is based on the hypothesis that the data are approximately drawn from a low-rank subspace. Unfortunately, real data from complex scenarios can rarely be well described by a single low-rank subspace. For these cases, a more reasonable model is to assume that the data lie near several subspaces, i.e., the data are considered as samples approximately drawn from a union of several low-rank subspaces. The generality and importance of subspaces naturally lead to the challenging problem of subspace clustering, whose goal is to group data into clusters, with every cluster corresponding to a different subspace. Clustering and finding low-dimensional representations of data are important unsupervised learning problems in machine learning with numerous applications, including image segmentation, system identification, data visualization, and collaborative filtering, to name just a few. The problem becomes even more complex if the data are partially observed, due to either sensor failure or visual occlusions [1, 2, 4, 24].

In this paper, we propose, to the best of our knowledge, the first attempt to approximate high-dimensional data using a dual union of low-dimensional subspaces, accounting for two distinct criteria. Additionally, input data are assumed to be corrupted by partial observations and noise. We apply our approach to the specific case in which the input data encode 2D or 3D point trajectories of multiple dynamic objects with large percentages of missing entries, and we aim at hallucinating these missing tracks while simultaneously approximating the data using spatial and temporal subspaces, as well as filtering the noisy measurements.

We will formulate the problem as a matrix completion one. Input data will be arranged into a matrix with the missing entries set to zero. To encode data similarities, we introduce two affinity matrices to be learned. We will then devise an optimization scheme based on augmented Lagrangian multipliers (ALM) to simultaneously and efficiently estimate the missing entries, and the bases and dimensionality of each low-rank subspace. The proposed algorithm is unsupervised, depends neither on the initialization nor on training data, and can be solved in polynomial time. An important corollary of our approach is that applying off-the-shelf state-of-the-art spectral clustering on the estimated affinity matrices results in consistent temporal and spatial segmentations of the input data.

We evaluate the proposed algorithm on 2D and 3D incomplete motion capture and real sequences of several objects performing complex actions and interacting with each other. We will show that the accuracy of the completed tracks we obtain improves on that of state-of-the-art methods by a considerable margin, while we additionally provide a spatiotemporal clustering of the data, which in most cases has a direct physical interpretation (either the object identity or the type of motion it is performing).

2 Related work

The most standard approach to matrix completion is to assume that the underlying data lie in a single low-dimensional subspace. Early works [30, 44] enforced this constraint using expectation maximization strategies to optimize non-convex functions of the model parameters and the missing entries. Other attempts constrain the solution space using trajectory [24] or spatiotemporal models [5]. Nevertheless, all these methods require a good initialization and, most importantly, they need the rank of the subspace to be set a priori, performing poorly when the dimension of the subspace increases. Additionally, trajectory-based methods normally use a predefined basis, making them very problem specific. To address these limitations, another family of low-rank matrix completion techniques has been recently proposed [7, 9, 10, 11, 14, 15]. These methods estimate missing entries by optimizing a convex surrogate of the rank, i.e., they minimize the nuclear norm of the complete matrix. These ideas were also applied to problems where the matrix directly includes visual tracking information, imposing smooth [42, 43, 54] and sparse [47] representations. When the underlying subspace is incoherent with the standard basis and the missing track locations are spread uniformly at random, these approaches are guaranteed to recover the missing entries.

Table 1 Qualitative comparison of our approach against competing techniques

Unfortunately, matrix completion techniques based on a single low-rank subspace cannot handle the challenging and more general scenario in which input data lie in a union of low-rank subspaces (e.g., when dealing with simultaneous and incomplete tracks of multiple objects performing complex motions). Data segmentation from full annotations was proposed by assuming a union of subspaces by means of a subspace clustering based on sparse representation [18] or seeking the lowest rank one [34]. Going back to the completion problem, the objective would extend to recovering the missing entries together with the clustering of the data according to the subspaces. Mixture of factor analyzers [22], mixture of probabilistic principal component analysis [45], and incremental matrix completion algorithms with K-subspaces [6] are some early methods used to address grouping and completion of multi-subspace data. Again, the performance of these methods highly depends on the initialization and degrades for large subspace ranks. A polynomial number of data points in the ambient space dimension are required in [19] which often cannot be met in high-dimensional datasets. Ma et al. [36] proposed an algebraic approach to model data drawn from a union of subspaces based on generalized principal component analysis. Yet, due to the difficulty of estimating the polynomials from data, the method is sensitive to noise and is computationally very demanding. This strategy was extended in [40], yielding more robust solutions but only for low-dimensional input data and a reduced number of subspaces. A Lipschitz monotonic function was assumed to model the low-rank matrix in [21], even though this cannot cover the case of multiple subspaces. Another family of solutions proposed solving completion and clustering as a two-stage problem [50], by first obtaining a similarity graph for clustering and then applying low-rank matrix completion to each cluster. 
While this is an interesting direction, the solution proposed in [50] is prone to fail when subspaces intersect or when the initial grouping is incorrect. To overcome this limitation, Elhamifar [17] proposed self-expressive models for simultaneous clustering and completion of incomplete data. Along the same line, Fan and Chow [20] recently presented a sparse representation to solve the problem. However, these approaches can only cluster the data based on a single criterion. In parallel, some works have relied on neural networks to learn temporal clustering [52] and to infer missing entries [37, 55], but each solves just a single problem. In all cases, these approaches exploit a loss function as we do in this paper, but they require a large amount of training data to learn the model and demand specific hardware to complete the training step. Unfortunately, this cannot be assumed for generic scenarios, where an unknown number of unknown object typologies can deform, move, and even interact with each other, which makes simultaneously obtaining training data for track completion, spatial groups, and temporal groups very hard and expensive in practice. In contrast, our formulation can solve the problem in just a few seconds on a commodity computer, without requiring sophisticated hardware or prior knowledge about the scenario to be solved. Moreover, none of them simultaneously solves multiple clustering and completion as we propose in this paper.

2.1 Our contributions

We go beyond previous works by proposing an efficient and robust method that does not require initialization and can jointly perform two types of clustering (spatial and temporal), while recovering missing entries and filtering the rest. To the best of our knowledge, no previous approach has jointly addressed the three problems in a unified and unsupervised framework. To this end, we assume the input data to lie in a dual union of low-rank subspaces, where no a priori knowledge about the dimensionality of the subspaces or which data points belong to which subspace is required. It is worth noting that our approach does not require any training data at all. Additionally, the proposed solution can handle situations with complex motion patterns, affected by a high degree of overlap and a large percentage of missing entries, in a completely unsupervised manner.

Table 1 summarizes a qualitative comparison of our approach and the aforementioned techniques to jointly solve completion and clustering.

3 Preliminaries and problem statement

3.1 Notation

Matrices are represented with boldface uppercase letters, e.g., \(\mathbf {X}\). In particular, \(\mathbf {I}_{A}\) is used to denote the identity matrix of size \(A \times A\), and \(\mathbf {1}_{A}\) a column vector of ones of size \(A \times 1\). The entries of matrices are denoted by means of subscripts \([\cdot ]\). For instance, \(\mathbf {X}_{[:j]}\) corresponds to the jth column of the matrix \(\mathbf {X}\), \(\mathbf {X}_{[i:]}\) is the ith row of the matrix \(\mathbf {X}\), and \(\mathbf {X}_{[ij]}\) indicates its (ij)th entry. We also define two types of products: \(\mathbf {X}\otimes \mathbf {Z}\) to denote the Kronecker product, and \(\mathbf {X}\odot \mathbf {Z}\) to denote the Hadamard (or element-wise) one. The negation (element-wise complement) of a binary matrix \(\mathbf {X}\) is denoted as \(\bar{\mathbf {X}}\). We also define several norms on matrices: The \(l_{\infty }\)-norm is defined as \(\Vert \mathbf {X}\Vert _{\infty }=\max _{(i,j)}|\mathbf {X}_{[ij]}|\), and the \(l_{2,1}\)-norm as \(\Vert \mathbf {X}\Vert _{2,1}=\sum _{j}\Vert \mathbf {X}_{[:j]}\Vert _{2}\), where the \(l_{2}\)-norm of a vector is denoted by \(\Vert \mathbf {X}_{[:j]}\Vert _{2}\). The Frobenius and nuclear norms are represented as \(\Vert \mathbf {X}\Vert _{F}\) and \(\Vert \mathbf {X}\Vert _{*}=\sum _{i} \sigma _{i}(\mathbf {X})\), respectively, with \(\sigma _i(\mathbf {X})\) being the ith singular value of the matrix \(\mathbf {X}\). Finally, the Euclidean inner product between two matrices is denoted as \(\langle \mathbf {X},\mathbf {Z}\rangle =\text {tr}(\mathbf {X}^{\top }\mathbf {Z})\), where \(\text {tr}(\cdot )\) represents the trace of a matrix.
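As a quick numerical illustration of this notation, the following minimal sketch evaluates the norms and products defined above on a small matrix. It assumes NumPy, and the helper names (`l21_norm`, `nuclear_norm`, `inner_product`) are hypothetical names introduced only for this example.

```python
import numpy as np

def l21_norm(X):
    """l_{2,1}-norm: sum of the Euclidean norms of the columns of X."""
    return float(np.sum(np.linalg.norm(X, axis=0)))

def nuclear_norm(X):
    """Nuclear norm: sum of the singular values of X."""
    return float(np.sum(np.linalg.svd(X, compute_uv=False)))

def inner_product(X, Z):
    """Euclidean inner product <X, Z> = tr(X^T Z)."""
    return float(np.trace(X.T @ Z))

X = np.array([[3.0, 0.0],
              [4.0, 2.0]])

linf = float(np.max(np.abs(X)))        # l_inf-norm: largest absolute entry
l21 = l21_norm(X)                      # the column norms are 5 and 2
fro = float(np.linalg.norm(X, "fro"))  # Frobenius norm
nuc = nuclear_norm(X)
hadamard = X * X                       # Hadamard (element-wise) product
kron = np.kron(np.eye(2), X)           # Kronecker product I_2 (x) X, of size 4 x 4
```

Note that the inner product of a matrix with itself equals its squared Frobenius norm, and that the nuclear norm is never smaller than the Frobenius norm.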

3.2 Problem Formulation

Let us consider F temporal subspaces \(\{S_f\}_{f=1}^{F}\) of dimension \(\{d_f>0\}_{f=1}^{F}\) in a C-dimensional space, and G spatial subspaces \(\{S_g\}_{g=1}^{G}\) of dimension \(\{d_g>0\}_{g=1}^{G}\) in an H-dimensional space. Let \(\mathbf {Y}\in \mathbb {R}^{C\times T}\) be a matrix of T data points lying on the union of the temporal subspaces, and \(\hat{\mathbf {Y}}\in \mathbb {R}^{H\times N}\) a matrix of N data points lying on the union of the spatial subspaces. If we assume that both dimensions share a common factor D (i.e., \(C=DN\) and \(H=DT\)), the two matrices \(\mathbf {Y}\) and \(\hat{\mathbf {Y}}\) contain exactly the same number of values but in a different arrangement. Additionally, we will assume that only some entries of these matrices are observed, i.e., some locations can include null values. To denote this, we include the matrix \(\tilde{\mathbf {Y}}\), a sparse version of \(\mathbf {Y}\), in which non-observed entries are set to zero.

Our problem consists in, given an incomplete and noisy matrix \(\tilde{\mathbf {Y}}\) of data points from motion tracking, retrieving the full matrix, \(\mathbf {Y}\) or \(\hat{\mathbf {Y}}\), and clustering the data into the underlying temporal and spatial subspaces. To this end, we will encode the spatial and temporal subspaces using affinity matrices. It is worth noting that neither the bases (\(\{S_f\}_{f=1}^{F}\) and \(\{S_g\}_{g=1}^{G}\)) nor the dimensions of each subspace (\(d_f\) and \(d_g\), respectively) are known a priori, nor is the cluster to which each data point belongs. The incomplete and noisy input matrix can be provided by any tracking algorithm, by considering, for instance, optimization [25, 27] or deep learning approaches (Joo et al. [28]). We next describe our unsupervised and unified approach that can solve the problem without requiring any training data at all.

4 Spatiotemporal subspace clustering

Drawing inspiration from the ideas of [3] for reconstructing non-rigid shapes, we next generalize a spatiotemporal constraint for joint motion track matrix completion and clustering. Note that this constraint has not previously been used in the literature for completing missing entries as we present here. We first introduce the two interpretations of the tracking matrices we shall use. After that, and building on these interpretations, we will introduce the temporal and spatial constraints, extending our formulation to handle missing tracks.

4.1 Motion tracking matrix interpretations

Let us consider a dynamic set of N D-dimensional points tracked along T time instances. For the particular case of \(D=3\), i.e., a tridimensional space, we shall denote by \(\mathbf {x}_i^t=[x_i^t , y_i^t, z_i^t]^{\top }\) the spatial coordinates of the ith point at time instant t. All acquired point coordinates can be collected into the matrix \(\mathbf {Y}\in \mathbb {R}^{DN\times T}\), with no particular ordering with respect to any type of grouping, which stores the x, y, and z coordinates in block matrix form as:

$$\begin{aligned} \mathbf {Y}=\begin{bmatrix} x_1^1&{}\ldots &{}x_N^1&{}y_1^1&{}\ldots &{}y_N^1&{}z_1^1&{}\ldots &{}z_N^1\\ \vdots &{}\ddots &{}\vdots &{}\vdots &{}\ddots &{}\vdots &{}\vdots &{}\ddots &{}\vdots \\ x_1^T&{}\ldots &{}x_N^T&{}y_1^T&{}\ldots &{}y_N^T&{}z_1^T&{}\ldots &{}z_N^T\\ \end{bmatrix}^{\top } . \end{aligned}$$

We could assume that the previous motion tracking matrix admits a low-rank decomposition of rank K (\(K=1\) for rigid objects), where K represents the number of bases in a single subspace. We know from structure-from-motion theory that this matrix is of low rank (Dai et al. [16], Xiao et al. [48]), but since no information about the motion is assumed, only a low-rank constraint can be considered. However, as discussed above, the single low-rank assumption may not have sufficient expressive power to model complex motion patterns of multi-object tracks. If some clustering or grouping of the T data points were known, we could handle this situation by enforcing the low-rank assumption on every cluster. In this work, however, the number and type of clusters are not known a priori, making the problem more challenging and generic. Consequently, we need to jointly solve for completion and clustering, without assuming any information about the dimensionality of the subspaces.

Since each column of the matrix \(\mathbf {Y}\) encodes all points at a time instant, this matrix cannot be directly used to retrieve spatial similarities. To address this limitation, we consider a new \(DT\times N\) matrix \(\hat{\mathbf {Y}}\), for which each column stores the point tracks. Following the previous case of \(D=3\), this matrix can be written as:

$$\begin{aligned} \hat{\mathbf {Y}}=\begin{bmatrix} x_1^1&{}x_2^1&{}\ldots &{}x_N^1\\ y_1^1&{}y_2^1&{}\ldots &{}y_N^1\\ z_1^1&{}z_2^1&{}\ldots &{}z_N^1\\ \vdots &{}\vdots &{}\ddots &{}\vdots \\ x_1^T&{}x_2^T&{}\ldots &{}x_N^T\\ y_1^T&{}y_2^T&{}\ldots &{}y_N^T\\ z_1^T&{}z_2^T&{}\ldots &{}z_N^T\\ \end{bmatrix} \end{aligned}$$

which, like \({\mathbf {Y}}\), is also low rank, although the value of its rank may differ.

Both matrices arrange the data points differently, but they contain exactly the same information. We can map from the matrix \(\mathbf {Y}\) to \(\hat{\mathbf {Y}}\) using the relation:

$$\begin{aligned} \hat{\mathbf {Y}}=(\mathbf {I}_D \otimes \mathbf {Y}^{\top }) \mathbf {A}, \end{aligned}$$

where \(\mathbf {A}\) is a \((D^2 N)\times N\) binary matrix. The inverse mapping can be written as:

$$\begin{aligned} \mathbf {Y}=({\hat{\mathbf {Y}}}^{\top } \otimes \mathbf {I}_D ) \mathbf {B}, \end{aligned}$$

where \(\mathbf {B}\) is a \((D^2 T)\times T\) binary matrix. Both \(\mathbf {A}\) and \(\mathbf {B}\) matrices are known a priori, and they can be easily obtained by considering the data structure in data matrices \(\hat{\mathbf {Y}}\) and \(\mathbf {Y}\).
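The two arrangements and the forward mapping can be checked numerically. The sketch below is an illustration under our own assumptions rather than the original implementation: it uses NumPy, stores the tracks in a hypothetical array `P` of shape \(T\times N\times D\), and builds one possible binary matrix \(\mathbf {A}\), sized \((D^2 N)\times N\) so that the product \((\mathbf {I}_D \otimes \mathbf {Y}^{\top }) \mathbf {A}\) is dimensionally consistent. With this construction, the product stacks the rows coordinate by coordinate, so a fixed row permutation is applied to match the frame-interleaved layout displayed for \(\hat{\mathbf {Y}}\).

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D = 4, 3, 3
P = rng.standard_normal((T, N, D))   # P[t, i, d]: coordinate d of point i at time t

# Y (DN x T): one column per time instant, rows ordered x_1..x_N, y_1..y_N, z_1..z_N.
Y = P.transpose(2, 1, 0).reshape(D * N, T)

# Y_hat (DT x N): one column per point track, rows interleaved frame by frame.
Y_hat = P.transpose(0, 2, 1).reshape(D * T, N)

# Binary selection matrix A of size (D^2 N) x N: block d picks coordinate d of point j.
A = np.zeros((D * D * N, N))
for d in range(D):
    for j in range(N):
        A[d * D * N + d * N + j, j] = 1.0

# The product stacks all x rows, then all y rows, then all z rows (each over time).
R = np.kron(np.eye(D), Y.T) @ A

# Reorder rows from coordinate-major to the frame-interleaved layout of Y_hat.
R_interleaved = R.reshape(D, T, N).transpose(1, 0, 2).reshape(D * T, N)
```

A row permutation changes neither the rank nor the column-wise self-expressive relations of the matrix, so either row ordering can be used as long as it is applied consistently.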

4.2 Dual union of spatiotemporal subspaces

The arrangement of the point tracks through the matrices \(\mathbf {Y}\) or \(\hat{\mathbf {Y}}\) gives two different interpretations, and each of them can be associated with a distinct subspace clustering process. For instance, when analyzing the temporal domain using \(\mathbf {Y}\), we can define an affinity matrix to capture the temporal similarities between instances at different time steps. This relation can be written as:

$$\begin{aligned} \mathbf {Y}= \mathbf {Y}\mathbf {T}+\mathbf {E}_t , \end{aligned}$$

where \(\mathbf {T}\) is a \(T\times T\) temporal affinity matrix and \(\mathbf {E}_t\) is a \(DN\times T\) residual noise matrix. In this context, the temporal affinity matrix \(\mathbf {T}\) measures the similarities between D-dimensional poses along time. Using this relation, we enforce \(\mathbf {Y}\) to lie in a union of F temporal subspaces \(S_f\), each of dimension \(d_f\). We could say that the matrix \(\mathbf {T}\) is the lowest rank representation of the data \(\mathbf {Y}\) with respect to itself. It is worth noting that \(\mathbf {T}\) will be block-diagonal when the data samples are grouped together in \(\mathbf {Y}\) according to their subspace memberships. This block pattern is lost when the samples are randomly ordered, and entries are null wherever no affinity exists. This type of self-expressive model was previously used by [18, 34] in the context of subspace clustering.
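The block-diagonal behavior can be illustrated on synthetic data. The following sketch is a simplified, noise-free special case and not the algorithm proposed in this paper: it assumes complete data drawn from two independent subspaces, NumPy, and the standard result from the low-rank representation literature that the minimum nuclear norm solution of \(\mathbf {Y}=\mathbf {Y}\mathbf {T}\) is \(\mathbf {V}\mathbf {V}^{\top }\), with \(\mathbf {V}\) the right singular vectors of the skinny SVD of \(\mathbf {Y}\). All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
C, d, n = 12, 2, 5   # ambient dimension, subspace dimension, samples per subspace

# Two independent d-dimensional subspaces with n samples each, grouped by subspace.
B1, B2 = rng.standard_normal((C, d)), rng.standard_normal((C, d))
Y = np.hstack([B1 @ rng.standard_normal((d, n)),
               B2 @ rng.standard_normal((d, n))])

# Skinny SVD, keeping only the numerically non-zero singular values.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))
V = Vt[:r].T

T_aff = V @ V.T                              # self-expressive coefficients, Y = Y T_aff
residual = np.linalg.norm(Y - Y @ T_aff)     # close to zero for noise-free data
off_block = np.abs(T_aff[:n, n:]).max()      # affinity across subspaces, close to zero
```

In this idealized setting the cross-subspace entries vanish up to numerical precision. With noise and missing entries no such closed form is available, which is what motivates the optimization described in the following section.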

Similarly, we can analyze the spatial domain through the matrix \(\hat{\mathbf {Y}}\), by introducing an affinity matrix associated with a union of spatial subspaces in the presence of noise. In this case, we can write:

$$\begin{aligned} \hat{\mathbf {Y}}= \hat{\mathbf {Y}}\mathbf {S}+\mathbf {E}_s , \end{aligned}$$

where \(\mathbf {S}\) is an \(N\times N\) spatial affinity matrix and \(\mathbf {E}_s\) is a \(DT \times N\) residual noise matrix. In this case, we enforce \(\hat{\mathbf {Y}}\) to lie in a union of G spatial subspaces \(S_g\), each of dimension \(d_g\), measuring the similarities between D-dimensional points at the same time instant. Basically, \(\mathbf {S}\) and \(\mathbf {T}\) are made of low-rank coefficients that define the union of subspaces in their respective domains. Once these affinity matrices are learned from data, off-the-shelf spectral clustering algorithms like [13] can be applied to each of them to discover the grouping in every domain. The temporal clustering splits the data into motion primitives, and the spatial one into different object instances.
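To make the clustering step concrete, the sketch below splits the samples into two groups from an affinity matrix using the sign of the second eigenvector of the normalized graph Laplacian. This is a generic textbook variant written with NumPy for illustration only; it is not the algorithm of [13], and the function name and the toy affinity matrix are assumptions of this example.

```python
import numpy as np

def two_way_spectral_clustering(Z):
    """Split samples into two groups from a (possibly asymmetric) affinity matrix Z."""
    W = 0.5 * (np.abs(Z) + np.abs(Z).T)            # symmetric, non-negative weights
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # normalized graph Laplacian
    _, vecs = np.linalg.eigh(L_sym)                # eigenvalues in ascending order
    fiedler = D_inv_sqrt @ vecs[:, 1]              # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > 0).astype(int)

# Toy affinity: two groups (3 and 4 samples) with strong internal and weak cross links.
eps = 0.01
Z = np.full((7, 7), eps)
Z[:3, :3] = 1.0
Z[3:, 3:] = 1.0
labels = two_way_spectral_clustering(Z)
```

The weak cross links keep the graph connected, which makes the second eigenvector well defined. Cluster labels are only determined up to a permutation, and more than two groups would require several eigenvectors followed by, for instance, k-means.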

Nevertheless, the previous formulation requires full measurements on the tracking matrices \(\mathbf {Y}\) or \(\hat{\mathbf {Y}}\), which is often not the case in real applications. Previous subspace clustering algorithms assume the observation matrices to be complete [3, 18, 29, 34, 56]. As mentioned above, other approaches [17, 20] proposed algorithms to jointly estimate missing entries and build a similarity graph for clustering, when considering a single union of temporal subspaces. The algorithm we present in the following section goes beyond these approaches and allows solving the matrix completion problem when considering the data to be spanned by two different unions of subspaces. Our approach can handle high levels of missing entries and noisy measurements, and solves the problem by means of a one-stage optimization algorithm. This means our approach can produce more accurate solutions than competing techniques, while being more general.

5 Motion tracking completion and spatiotemporal subspace clustering

We next present our algorithm to simultaneously recover missing entries and estimate two similarity matrices for computing the spatial and temporal grouping. Note that neither prior information nor training data is used at all. The input to our algorithm consists of incomplete motion tracks of N D-dimensional points observed along T time instances, arranged into the matrix \(\tilde{\mathbf {Y}}\). In addition, we also introduce an observation matrix \(\mathbf {O}\in \mathbb {R}^{N\times T}\) with binary entries that indicate whether the coordinates of a point at a specific time instant are observed or not.
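The relation between the observation matrix and the zero-filled input can be sketched as follows. This is an illustrative example under stated assumptions: it uses NumPy, random data in place of real tracks, a uniformly random visibility pattern, and the coordinate-major row ordering of \(\mathbf {Y}\) (all x rows, then all y rows, then all z rows). The names `mask` and `rho` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, D = 5, 8, 3
rho = 0.4                                      # target fraction of missing observations

Y = rng.standard_normal((D * N, T))            # complete tracks, coordinate-major rows
O = (rng.random((N, T)) >= rho).astype(float)  # O[i, t] = 1 if point i is seen at time t

# All coordinates of a point share its visibility, so O is repeated D times
# along the rows to match the DN x T layout of Y.
mask = np.kron(np.ones((D, 1)), O)
Y_tilde = mask * Y                             # Hadamard product zeroes unobserved entries
```

Since visibility is defined per point and per frame, a missing observation removes all D coordinates of that point at once.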

5.1 Proposed formulation

Let us denote by \(\varvec{\varTheta }\equiv \{\mathbf {Y},\hat{\mathbf {Y}},\mathbf {T},\mathbf {S},\mathbf {E}_t,\mathbf {E}_s\}\) the set of model parameters we have to learn from the input data \(\varvec{\varGamma }\equiv \{\tilde{\mathbf {Y}},\mathbf {O}\}\). We introduce an optimization framework ruled by a cost function that accounts for the spatiotemporal clustering constraints of Eqs. (3) and (4), and enforces the similarity matrices \(\mathbf {T}\) and \(\mathbf {S}\) to be spanned by low-rank subspaces. Consequently, the combination of both constraints enforces the data to lie in a dual union of subspaces. Indeed, the single union of subspaces model can be seen as a degenerate case of our model (see Remark 1).

Since rank minimization is a non-convex NP-hard problem [41], the rank is approximated by its convex relaxation, the nuclear norm [12, 14]. Additionally, in order to deal with data corrupted by noise and outliers, we use \(l_{2,1}\)-norm regularization, as the convex relaxation of the \(l_{2,0}\)-norm [33]. The objective function can therefore be written as:

$$\begin{aligned} \begin{array}{rll} \text {subject to} &{}\quad \mathbf {Y}= \mathbf {Y}\mathbf {T}+\mathbf {E}_t \\ &{}\quad \hat{\mathbf {Y}}= \hat{\mathbf {Y}}\mathbf {S}+\mathbf {E}_s \\ &{}\quad (\mathbf {I}_D \otimes \mathbf {Y}^{\top }) \mathbf {A}= \hat{\mathbf {Y}} \end{array} \end{aligned}$$

where \(\{\phi ,\gamma ,\lambda _t,\lambda _s\}\) are predefined penalty term parameters.

Remark 1

When the data points are not connected in the spatial domain, it means that the affinity matrix \(\mathbf {S}\) becomes the identity \(\mathbf {I}_N\) (we assume the data points are clean in this domain, i.e., \(\mathbf {E}_s=\mathbf {0}\)), and hence, our formulation degenerates to a union of temporal subspaces. On the other hand, when this occurs in the temporal domain (\(\mathbf {T}=\mathbf {I}_T\) and \(\mathbf {E}_t=\mathbf {0}\)), our formulation degenerates to a union of spatial subspaces.

5.2 Efficient augmented Lagrangian multiplier optimization

The optimization problem in Eq. (5) can be efficiently solved in a unified manner via an ALM method [8, 31]. Without loss of generality, we set \(\lambda \equiv \lambda _t\equiv \lambda _s\). In order to reduce the number of parameters and the complexity of the problem while improving convergence, we choose to bring the clustering constraints into the energy function using several Lagrange multipliers with a unique penalty weight \(\beta >0\). In addition, we introduce three support matrices \(\mathbf {Y}\equiv \mathbf {M}\), \(\mathbf {T}\equiv \mathbf {J}\), and \(\mathbf {S}\equiv \mathbf {K}\), to obtain the corresponding augmented Lagrangian function that can be written as:


where \(\varvec{\varTheta }_{L}\equiv \{\mathbf {M},\mathbf {Y},\mathbf {J},\mathbf {T},\mathbf {K},\mathbf {S},\hat{\mathbf {Y}},\mathbf {E}_s,\mathbf {E}_t\}\) includes the tracking completion, spatiotemporal similarity parameters, and residual noises. The Lagrange multipliers are defined as \(\{\mathbf {L}_1,\mathbf {L}_4\}\in \mathbb {R}^{DN\times T}\), \(\{\mathbf {L}_2,\mathbf {L}_3\}\in \mathbb {R}^{DT\times N}\), \(\mathbf {L}_5\in \mathbb {R}^{T\times T}\), and \(\mathbf {L}_6\in \mathbb {R}^{N\times N}\). Recall that we do not need to know the dimensions nor the bases of the temporal and spatial subspaces a priori, since Eq. (5) automatically selects the appropriate number of data points from every spatiotemporal subspace.

We propose to solve the problem in Eq. (6) by minimizing over each variable individually and in closed form, while keeping the rest of the model parameters fixed. Algorithm 1 explains the details. The expressions for estimating \(\mathbf {Y}\), \(\mathbf {T}\), \(\mathbf {S}\), and \(\hat{\mathbf {Y}}\) (steps 4, 6, 8, and 9) are obtained by computing the derivatives of Eq. (6) with respect to \(\mathbf {Y}\), \(\mathbf {T}\), \(\mathbf {S}\), and \(\hat{\mathbf {Y}}\), respectively, and equating them to zero. The subproblems to recover \(\mathbf {M}\), \(\mathbf {J}\), \(\mathbf {K}\), \(\mathbf {E}_t\), and \(\mathbf {E}_s\) are convex and have closed-form solutions. Particularly, for steps 2, 5, and 7, we apply a singular value thresholding minimization [10] with a ‘shrinkage operator’ \(\mathcal {S}_{\frac{*}{\beta }}(x)=\max (0,x-\frac{*}{\beta })\), where \(*\in \{\phi ,\gamma \}\). In order to optimize the noise terms \(\mathbf {E}_t\) and \(\mathbf {E}_s\) (steps 10 and 11, respectively), we apply Lemma 4.1 in [51]. After each iteration, the Lagrange multipliers are updated according to standard rules, as shown in steps 12 and 13. Additionally, we also update the penalty weight \(\beta \) (step 14) to guarantee the convergence of our algorithm, following the upper bounded requirement of the alternating direction methods. Particularly, we apply a factor of 1.1 to increase \(\beta \) at every iteration.
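The closed-form building blocks mentioned above can be sketched as follows. This is an illustration with NumPy and not the original MATLAB implementation. We assume that the lemma cited for the noise terms corresponds to the usual column-wise shrinkage solution of the \(l_{2,1}\)-regularized least-squares problem; the initial value of \(\beta \), the cap `beta_max`, and the toy inputs are hypothetical choices made only for this example.

```python
import numpy as np

def shrink(x, tau):
    """Shrinkage operator max(0, x - tau), applied element-wise."""
    return np.maximum(0.0, x - tau)

def singular_value_thresholding(X, tau):
    """Minimizer of tau * ||Z||_* + 0.5 * ||Z - X||_F^2."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def l21_shrinkage(Q, tau):
    """Minimizer of tau * ||E||_{2,1} + 0.5 * ||E - Q||_F^2 (column-wise shrinkage)."""
    norms = np.linalg.norm(Q, axis=0)
    scale = shrink(norms, tau) / np.maximum(norms, 1e-12)
    return Q * scale

phi, lam = 1.0, 0.03            # values reported in the experimental section
beta, beta_max = 0.5, 1e6       # hypothetical initial penalty weight and upper bound

X = np.diag([3.0, 1.0, 0.2])
J = singular_value_thresholding(X, phi / beta)   # threshold 2: only one singular value survives

Q = np.array([[3.0, 0.0],
              [4.0, 0.01]])
E = l21_shrinkage(Q, lam / beta)                 # the small second column is set to zero

beta = min(1.1 * beta, beta_max)                 # penalty update with a factor of 1.1
```

As \(\beta \) grows, both thresholds shrink, so later iterations modify the singular values and the residual columns less aggressively.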


The theoretical convergence of our algorithm is not easy to prove, as the method is based on nine different blocks. However, we have empirically observed that, for all experiments reported in the following section, the algorithm always converged in about 190 to 220 iterations. Additionally, we observe that the optimality gap decreases monotonically with the iterations. An example of this analysis is displayed in Fig. 1, where both the constraint errors and the full error in Eq. (6) are represented for a specific case. As can be seen, after around 50 iterations all constraints are almost perfectly satisfied and the overall energy converges.

Fig. 1 Convergence analysis: energy reduction as a function of the number of iterations. Evolution of the error for the six constraints (denoted as \(C_c\), with \(c=\{1,\ldots ,6\}\)) and the full energy in Eq. (6) as a function of the number of iterations until convergence (corresponding to the Jump scenario described in “Experimental results” section). Note that two different scales are used to represent the errors of the constraints (left axis) and the full error (right axis). For visualization purposes, we plot the full energy scaled by a factor of 0.1

Fig. 2 Computation time as a function of the number of frames, points and iterations. Computation time versus number of iterations until convergence on the mocap sequences described in “Experimental results” section, for two (red dots) and four (blue dots) people. The number of images of the sequence is indicated next to each dot. In all cases, the number of iterations until convergence remains within reasonable bounds. The corresponding computation time depends on the number of frames and points

5.3 Complexity analysis

The most computationally demanding parts of Algorithm 1 are steps 2, 5 and 7, which require computing several SVD operations over matrices of size \(DN\times T\), \(T\times T\) and \(N\times N\), respectively. Hence, our problem can be solved in polynomial time with a computational complexity of at most \(\mathcal {O}(N^2 T+T^3+N^3)\) [23]. Note that this complexity could be easily reduced by orthogonalizing the columns of the matrices \(\mathbf {Y}\) and \(\hat{\mathbf {Y}}\). The computation times (in unoptimized MATLAB code) on a commodity laptop with an Intel Core i7 processor at 2.4 GHz for motion capture sequences of two and four people are displayed in Fig. 2. The median computation time in experiments with sequences between 277 and 652 frames and two people (\(N=82\) points) was 51 seconds. Processing between 214 and 432 frames with four people (\(N=164\) points) required a median time of 44 seconds. In any case, to handle larger datasets we could use the recent results of Yao et al. [53] on SVD operations for large-scale low-rank problems, which could help to reduce the reported complexity. Moreover, our formulation could be extended to work in a sequential manner, which is part of our future work.
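To give a sense of scale, the following short sketch evaluates the three terms of the bound for the longest two-person and four-person sequences reported above. It ignores constant factors and the dimension D, so it only indicates which term dominates; the function name is hypothetical.

```python
def svd_cost_terms(N, T):
    """Leading-order terms of the bound O(N^2 T + T^3 + N^3), constants ignored."""
    return {"N^2 T": N * N * T, "T^3": T ** 3, "N^3": N ** 3}

two_people = svd_cost_terms(N=82, T=652)     # longest two-person sequence
four_people = svd_cost_terms(N=164, T=432)   # longest four-person sequence

dominant_two = max(two_people, key=two_people.get)
dominant_four = max(four_people, key=four_people.get)
```

For these sizes the \(T^3\) term is the largest in both settings, which suggests that the number of frames, rather than the number of points, drives the cost of the SVD steps in sequences of this kind.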

6 Experimental results

In this section, we report the performance of our algorithm on motion tracking completion, as well as on temporal and spatial clustering, using several challenging datasets. For all cases, we denote by \(\rho \) the fraction of missing entries in the input data. In all experiments, we set \(\phi =1.0\), \(\gamma =2.0\) and \(\lambda =0.03\). It is worth pointing out that we do not need to fine-tune these parameters, as the results were stable for a wide range of values, \(\phi \in [0.1,10]\) and \(\gamma \in [0.2,20]\). Regarding the competing approaches, we will compare our algorithm, denoted as spatiotemporal track completion (ST2C), with low-rank matrix completion (LRMC) [10] and bilinear factorization matrix completion (BFMC) [9], two approaches where the rank is automatically estimated. We do not include [6, 22] as these methods require knowing the rank of every subspace a priori. Unfortunately, neither can we report the results of [17, 20], as their source code is not publicly available. Recall, however, that both approaches only consider a single union of subspaces.

To establish a quantitative evaluation, we compute three types of error: the temporal clustering error \(e_{TC}\), the spatial clustering error \(e_{SC}\), and the motion tracking completion error \(e_{MTC}\) (equivalent to a matrix completion evaluation), defined as:

$$\begin{aligned} e_{TC}&=\frac{\# \text {Misclassified frames}}{\# \text {All frames}}, \end{aligned}$$
$$\begin{aligned} e_{SC}&=\frac{\# \text {Misclassified points}}{\# \text {All points}}, \end{aligned}$$
$$\begin{aligned} e_{MTC}&=\frac{\Vert \mathbf {Y}-\mathbf {Y}_{GT}\Vert ^{2}_{F}}{\Vert \mathbf {Y}_{GT}\Vert ^{2}_{F}}. \end{aligned}$$

where \(\mathbf {Y}_{GT}\) and \(\mathbf {Y}\) denote the ground truth and the recovered matrices, respectively. For the temporal clustering error, we obtained the ground truth segmentation from noise-free and complete measurement matrices, applying [34] to compute the similarity matrices and [13] to obtain the clusters. The spatial ground truth was annotated by hand. This means that our evaluation of temporal clustering is actually an implicit comparison with the competing approach [34] run on clean measurements.
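The three error measures can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the evaluation code used in our experiments: the function names `clustering_error` and `completion_error` are hypothetical, matrices are given as nested lists, and, because cluster indices are arbitrary, we assume that predicted labels are matched to the ground truth by the relabeling that minimizes the number of mismatches (brute force over permutations, which is only practical for a handful of clusters).

```python
from itertools import permutations


def clustering_error(predicted, ground_truth):
    """Fraction of misclassified items (frames for e_TC, points for e_SC).

    Assumption: labels are matched by the best relabeling of the predicted
    clusters, found by brute force over permutations.
    """
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("label lists must be non-empty and of equal length")
    pred_ids = sorted(set(predicted))
    true_ids = sorted(set(ground_truth))
    # Pad with None (which never equals a real label) when there are more
    # predicted clusters than ground-truth clusters.
    true_ids += [None] * max(0, len(pred_ids) - len(true_ids))
    best = len(predicted)
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        wrong = sum(mapping[p] != g for p, g in zip(predicted, ground_truth))
        best = min(best, wrong)
    return best / len(predicted)


def completion_error(Y, Y_gt):
    """e_MTC: squared Frobenius norm of (Y - Y_gt) divided by that of Y_gt."""
    num = sum((a - b) ** 2 for row, row_gt in zip(Y, Y_gt)
              for a, b in zip(row, row_gt))
    den = sum(b ** 2 for row_gt in Y_gt for b in row_gt)
    return num / den


# Swapped cluster names do not count as errors; one wrong frame out of four does.
print(clustering_error([0, 0, 1, 1], [1, 1, 0, 0]))
print(clustering_error([0, 0, 1, 1], [0, 1, 1, 1]))
```

The same `clustering_error` function serves for both \(e_{TC}\) and \(e_{SC}\); only the items being labeled (frames or points) change.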

Fig. 3

Patterns of missing entries. \(\mathbf {V}\) patterns used to simulate missing entries in the Jump sequence. White and black cells denote non-visible and visible points, respectively. Top: \(\rho =0.4\) of random missing entries. Bottom: \(\rho =0.4\) of structured missing entries

Fig. 4

Motion completion errors of different algorithms as a function of the missing entries rate \(\rho \) on motion capture sequences with two people. Each algorithm is evaluated under noise-free (\(\tau =0\)) and noisy (\(\tau =\{1,2\}\)) data. For visualization purposes, the error of LRMC has been divided by a factor of 3.5 in all graphs. Top: Random missing entries. Bottom: Structured missing entries

Fig. 5

Motion completion errors of different algorithms as a function of the missing entries rate \(\rho \) on motion capture sequences with four people. Again, every algorithm is evaluated under noise-free (\(\tau =0\)) and noisy (\(\tau =\{1,2\}\)) data. For visualization purposes, the error of LRMC has been divided by a factor of 3.5 in all graphs. Top: Random missing entries. Bottom: Structured missing entries

Fig. 6

3D motion track completion on multi-body scenarios, with missing entries (\(\rho =0.7\)) and noisy measurements (\(\tau =1\)). The sequences, from top to bottom, are: Shelters, Nursery, Greet4 and Zombie4. For each one, several frames are shown from two orthogonal viewpoints (z-x and y-z). The 3D ground truth is represented by circles and squares, where the marker denotes whether a point is visible (black circles) or not (blue squares). Our motion track completion is shown as red dots. Observe that even for high levels of missing entries, our algorithm produces an accurate and clean completion. Although it is not shown in this figure, our algorithm also recovers the spatiotemporal segmentation, even when the bodies overlap substantially, as they do in the displayed scenarios. Best viewed in color

6.1 Real experiments on motion capture data

We evaluate the proposed approach on the CMU MoCap dataset. We consider several scenarios with either two or four people interacting and performing complex motions in 3D. On average, the sequences we consider are 433 frames long, and the number of points per frame is either 82 (two people) or 164 (four people). Specifically, we select eight sequences with two people: 23_16 (Jump): subjects alternating synchronized jumping jacks; 19_05 (Pull): a subject pulls the other by the elbow; 22_20 (Violence): a subject picks up high stool and threatens to strike the other; 20_06 (Soldiers): subjects follow a soldiers march; 23_19 (Stares Down): a subject stares down the other and leans with hands on high stool; 22_12 (Stumbles): a person stumbles into the other; 20_09 (Nursery): people follow a nursery rhyme; and 22_10 (Shelters): a person shelters the other from harm. A total of four sequences with four people are considered, synthetically generated by combining pairs of sequences with two people.

All sequences are corrupted in three different ways: 1) randomly removing a fraction \(\rho =\{0.1, \ldots , 0.8\}\) of entries of the measurement matrix \(\mathbf {Y}\); 2) removing a structured fraction \(\rho =\{0.1, \ldots , 0.4\}\) of entries of the measurement matrix \(\mathbf {Y}\) where we emulate temporal self-occlusions or lack of visibility, by including patterns with 50% of structured missing entries per frame; and 3) adding noise to the observed points, according to a Gaussian distribution with standard deviation \(\sigma _{noise}=\frac{\tau }{100} \psi \), where \(\tau \) controls the amount of noise, and \(\psi \) represents the maximum distance of a point to the centroid of all the points. An example of these artifacts is shown in Fig. 3, for both random and structured missing entries.
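The three corruptions can be emulated as in the following minimal sketch. It is an illustration rather than the exact protocol: the function names are hypothetical, visibility is stored as an \(N\times T\) mask of points (hiding a point removes its \(D\) coordinates in \(\mathbf {Y}\)), and the structured case is assumed to hide one contiguous block of 50% of the points over enough consecutive frames to reach the target rate \(\rho \). We also assume that \(\psi \) is computed over all the points passed to the function.

```python
import math
import random


def random_mask(n_points, n_frames, rho, rng):
    """N x T visibility mask (1 = visible) with a fraction rho hidden at random."""
    cells = [(i, t) for i in range(n_points) for t in range(n_frames)]
    hidden = set(rng.sample(cells, round(rho * len(cells))))
    return [[0 if (i, t) in hidden else 1 for t in range(n_frames)]
            for i in range(n_points)]


def structured_mask(n_points, n_frames, rho, rng, per_frame=0.5):
    """Hide a block of per_frame * N points over consecutive frames.

    Assumption: the number of affected frames is chosen so that the overall
    missing rate is about rho, which requires rho <= per_frame.
    """
    block = round(per_frame * n_points)
    n_bad = min(n_frames, round(rho * n_points * n_frames / block))
    t0 = rng.randrange(n_frames - n_bad + 1)   # first affected frame
    i0 = rng.randrange(n_points - block + 1)   # first hidden point
    mask = [[1] * n_frames for _ in range(n_points)]
    for t in range(t0, t0 + n_bad):
        for i in range(i0, i0 + block):
            mask[i][t] = 0
    return mask


def add_noise(points, tau, rng):
    """Add Gaussian noise with sigma = tau / 100 * psi to D-dimensional points,
    where psi is the largest distance from a point to the centroid."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    psi = max(math.dist(p, centroid) for p in points)
    sigma = tau / 100.0 * psi
    return [[c + rng.gauss(0.0, sigma) for c in p] for p in points]


rng = random.Random(0)
V_random = random_mask(82, 100, 0.4, rng)
V_struct = structured_mask(82, 100, 0.4, rng)
```

With 50% of the points hidden in each affected frame, the overall rate cannot exceed 0.5 under this assumption, which is consistent with the structured experiments stopping at \(\rho =0.4\).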

Figures 4 and 5 summarize the results for two and four people, respectively. Each graph depicts the results of all three methods for one specific sequence, at increasing levels of missing data, for the two types of missing entries we propose. Solid and dashed lines represent results for noise-free and noisy measurements, respectively. Our approach and BFMC [9] show similar error patterns, although ours is consistently better. BFMC [9] reaches its breaking point earlier, which shows that our approach is more robust against this type of artifact. As can be seen, our solution on noisy data can even be more accurate than those of the competing approaches on clean annotations. The performance of LRMC [10] is far below that of the other two algorithms. We hypothesize that this is due to the pseudo-block structure of the missing data, as each missing point represents three adjacent null elements in \(\mathbf {Y}\) (recall that \(D=3\) for this experiment). This is especially relevant when the missing entries are structured, as can be seen in the bottom part of Figs. 4 and 5. Note also that BFMC [9] and LRMC [10] are specifically designed for matrix completion, and do not provide any kind of affinity measure that would allow subsequent clustering. Some instances for several scenarios with random missing entries are displayed in Fig. 6. Moreover, our algorithm is faster than the competing approaches, with a speed-up of \(2.7\times \) with respect to BFMC [9].

As commented above, our approach also estimates the spatial and temporal clustering. Tables 2 and 3 summarize the mean error for each sequence over all levels of missing data for the random and structured cases, respectively, for noiseless (\(\tau =0\)) and noisy (\(\tau =\{1,2\}\)) measurements. As can be seen, our approach produces very good results for most of the sequences, especially in terms of spatial clustering, where the clustering error is almost negligible. In fact, our algorithm produces better spatial clustering solutions on corrupted data than those provided by LRR [34] with full observations (recall that this method needs full data, i.e., \(\rho =0\)), as can be observed in Table 2. For temporal clustering, our solution is implicitly compared with LRR [34], and remains consistent as the level of noise increases. For both types of artifacts, the worst results are obtained for the sequences Shelters and Nursery, since their motion does not include many deformation cycles. In any case, even for these complex motions, our algorithm provides a good trade-off between accuracy and computational cost.

Fig. 7

2D Tracking completion for the ASL dataset. Results for three frames of the sequence. For every image, visible 2D tracking data are shown as red dots. To complete the non-visible tracks, we use our algorithm (blue crosses) and the low-trajectory-rank approach of [24] (green circles). Qualitative results show that our approach provides more accurate track completion for this challenging experiment. Best viewed in color

Fig. 8

2D Tracking completion for the multi-fish sequence. Results for four frames of the sequence. For every image, visible 2D tracking data and hallucinated non-visible tracks by our algorithm are displayed as red dots and blue crosses, respectively. As it can be seen, our algorithm produces physically aware estimations on this experiment. Best viewed in color

6.2 Real experiment on ASL tracking completion

We now consider the completion of time series trajectories from a real monocular video. We use two American Sign Language (ASL) sequences [24] of 114 image frames, in which 77 feature points per sequence are tracked. To evaluate the spatial clustering ability of our algorithm, we merge the frames of the two sequences into a single video with two faces (with \(N=154\) feature points). The face tracks are corrupted by missing entries (corresponding to \(\rho =0.1445\)) due to partial occlusions produced by one or two hands (self-occlusion), or to lack of visibility caused by the rotation of the face.

Results are shown in Fig. 7. We compare against [24], a completion algorithm that estimates missing tracks by enforcing low-rank trajectory models. Note that this approach requires fine-tuning the rank of the subspace a priori, and its solutions vary considerably with the chosen value. We use the rank value provided by the authors. For the non-visible points there is no ground truth, but from a qualitative inspection our approach appears to be remarkably more accurate (see, for instance, the rightmost frame of Fig. 7). We can nevertheless measure the accuracy of the estimated positions for the visible points (red dots). For these, our method provides a solution 2.35 times more accurate than that obtained by [24], without assuming any knowledge of the rank.

6.3 Real experiment on multi-fish data

Finally, we consider a very challenging real multi-fish sequence taken from the DAVIS dataset (Perazzi et al. [38]). In particular, this is a sequence of 51 frames in which 33 points per image are tracked. The incomplete tracks are annotated by hand, giving a level of missing entries of \(\rho =0.129\) (a combination of random and structured missing tracks), mainly due to multiple partial occlusions produced by the dynamic motion of the animals. A qualitative evaluation of our algorithm is displayed in Fig. 8. As can be seen, our algorithm recovers plausible missing tracks without assuming any extra information about the observed scene, such as the number of objects, the type of deformations, or the rank of every subspace.

Table 2 Completion and spatiotemporal clustering results on the CMU dataset as a function of noisy measurements for random missing entries
Table 3 Completion and spatiotemporal clustering results on the CMU dataset as a function of noisy measurements for structured missing entries

7 Conclusion

We have proposed an algorithm for simultaneous motion track completion and clustering based on two different criteria. For this purpose, we have devised a model that jointly enforces the entries of the matrix to lie in a dual union of subspaces. This goes beyond state-of-the-art solutions, which were restricted to a single union of subspaces. Using the machinery of augmented Lagrange multipliers, we have obtained an efficient solution to the problem, and applied it to input data obtained from motion capture systems recording multiple people, and to challenging real videos. Extensive evaluation demonstrates the ability of our approach to recover missing tracks, to segment the input data into each of the objects being captured, and to automatically discover their motion primitives. Further theoretical analysis of the algorithm, including convergence proofs, will be investigated in the future. Moreover, we intend to extend our formulation to sequential estimation as the data arrive.