Dynamic Behavior Analysis via Structured Rank Minimization

Human behavior and affect is inherently a dynamic phenomenon involving temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior is the output of a linear time-invariant system when behavioral cues act as the input (e.g., continuous rather than discrete annotations of dimensional affect). Here we study the learning of such dynamical system under real-world conditions, namely in the presence of noisy behavioral cues descriptors and possibly unreliable annotations by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and, hence, evidence the robustness and effectiveness of the proposed approach.


Introduction
Analysis of human behavior concerns detection, tracking, recognition, and prediction of complex human behaviors including affect and social behaviors such as agreement and conflict escalation/resolution from audio-visual data captured in naturalistic, real-world conditions. Modeling human behavior for automatic analysis in such conditions is the prerequisite for next-generation human-centered computing and novel applications such as personalized natural interfaces (e.g., in autonomous cars), software tools for social skills enhancement including conflict management and negotiation, and assistive technologies (e.g., for independent living), to mention but a few.
Traditionally, research in behavior and affect analysis has focused on recognizing behavioral cues such as smiles, head nods, and laughter (Déniz et al. 2008;Kawato and Ohya 2000;Lockerd and Mueller 2002), pre-defined posed human actions (e.g., walking, running, and hand-clapping) (Dollár et al. 2005;Niebles et al. 2008;Georgakis et al. 2012) or discrete, basic emotional states (e.g., happiness, sadness) (Pantic and Rothkrantz 2000;Cohen et al. 2003;Littlewort et al. 2006) mainly from posed data acquired in laboratory settings. However, these models are deemed unrealistic as they are unable to capture the temporal evolution of non-basic, possibly atypical, behaviors and subtle affective states exhibited by humans in naturalistic settings. In order to accommodate such behaviors and subtle expressions, continuous-time and dimensional descriptions of human behavior and affect have been recently employed (Gunes and Pantic 2010;Gunes et al. 2011;Pantic et al. 2011;Pantic and Vinciarelli 2014;Vrigkas et al. 2015). For instance, the temporal evolution of level of interest (Nicolaou et al. 2014;Panagakis et al. 2016) and agreement (Bousmalis et al. 2011;Rakicevic etal. 2016), or the intensity of pain (Kaltwang et al. 2012(Kaltwang et al. , 2015 and conflict (Kim et al. 2012a, b;Panagakis et al. 2016) is precisely described as continuous-valued function of time. In analogy, dimensional and continuous description of human emotion consists of characterizing emotional states in terms of a number of latent dimensions over time . Two dimensions are deemed sufficient for capturing most of the affective variability: valence and arousal (V-A), signifying respectively, how positive/negative and active/inactive an emotional state is Lane and Nadel (2002).
Representative machine learning models employed for automatic, continuous behavior and emotion analysis include Hidden Markov Models (HMMs) (Cohen et al. 2003) for facial expression recognition, Dynamic Bayesian Networks (DBN) for human motion classification and tracking (Pavlović et al. 1999), Conditional Random Fields (CRFs) for prediction of visual backchannel cues (i.e., head nods) (Morency et al. 2010), Long-Short Term Memory (LSTM) Neural Networks for continuous prediction of dimensional affect , and regressionbased approaches for continuous emotion and depression recognition or pain estimation (Nicolaou et al. 2012;Valstar et al. 2013;Kaltwang et al. 2012). Despite their merits, these methods rely on large sets of training data, involve learning of a large number of parameters, they do not model dynamics of human behavior and affect in an explicit way, and, more importantly, they are fragile in the presence of gross non-Gaussian noise and incomplete data, which is abundant in real-world (visual) data.
Contributions In this work, we model and tackle the problem of dynamic behavior analysis in the presence of gross, but sparse noise and incomplete visual data under a different perspective, making the following contributions: 1. The modeling assumption here is that for smoothlyvarying dynamic behavior phenomena, such as conflict escalation and resolution, temporal evolution of human affect described in terms of valence and arousal, or motion of human crowds, among others, the observed data can be postulated to be trajectories (inputs and outputs) of a linear time-invariant (LTI) system. Recent advances in system theory (Van Overschee and De Moor 2012; Fazel et al. 2013) indicate that such dynamics can be discovered by learning a low-complexity (i.e., low-order) LTI system based on its inputs and outputs via rank minimization of a Hankel matrix constructed from the observed data. Here, continuous-time annotations characterizing the temporal evolution of relevant behavior or affect are considered as system outputs, while (visual) features describing behavioral cues are deemed system inputs. In practice, visual data are often contaminated by gross, non-Gaussian noise mainly due to pixel corruptions, partial image texture occlusions or feature extraction failure (e.g., incorrect object localization, tracking errors), and human assessments of behavior or affect may be unreliable mainly due to annotator subjectivity or adversarial annotators. The existing structured rank minimization-based methods perform sub-optimally in the presence of gross corruptions. Therefore, to robustly learn a LTI system from grossly corrupted data, we formulate a novel q -norm regularized (Hankel) structured Schatten-p norm minimization problem in Sect. 3. The Schatten p-and the sparsity promoting q -norm act either as convex surrogates, when p = q = 1, or as non-convex approximations, when p, q ∈ (0, 1), of the rank function and the 0 -(quasi) norm, respectively. 2. To tackle the proposed optimization problem, an algorithm based on the Alternating-Directions Method of Multipliers (ADMM) (Bertsekas 2014) is developed in Sect. 4. Furthermore, in the same section a scalable version the algorithm is elaborated. 3. The proposed model is the heart of a general and novel framework for dynamic behavior modeling and analysis, which is detailed in Sect. 5. A common practice in behavioral and affective computing is to train machine learning algorithms by employing large sets of training data that comprehensively cover different subjects, contexts, interaction scenarios and recording conditions. The proposed approach allows us to depart from this practice. Specifically, we demonstrate for the first time that complex human behavior and affect, manifested by a single person or group of interactants, can be learned and predicted based on a small amount of person(s)-specific observations, amounting to a duration of just a few seconds. 4. The effectiveness and the generalizability of the proposed model is corroborated by means of experiments on synthetic and real-world data in Sect. 6. In particular, the generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods on both synthetic and real-world data and, hence, evidence the robustness and effectiveness of the proposed approach. The proposed framework is graphically illustrated in Fig. 1.  Fig. 1 Illustration of the proposed dynamic behavior analysis framework, as applied on the task of conflict intensity prediction for a sequence from CONFER dataset. A portion of the sequence frames is used for LTI system learning through the proposed structured rank minimization method (training), while the remaining frames are used for prediction (test)

Background and Related Work
In this section, notation conventions and mathematical formalism related to Hankel matrix structure are first introduced. Next, in order to make the paper self-contained, we describe how learning of dynamical systems and, in particular, of a LTI system can be cast as a (Hankel) structured rank minimization problem. Related works on structured rank minimization and their applications in visual information processing are also described.

Preliminaries
Notations Matrices (vectors) are denoted by uppercase (lowercase) boldface letters, e.g., X, (x). I denotes the identity matrix of compatible dimensions. The ith element of vector x is denoted as x i , the ith column of matrix X is denoted as x i , while the entry of X at position (i, j) is denoted by x i j . For the set of real numbers, the symbol R is used. For two matrices A and B in R m×n , A • B denotes the Hadamard (entry-wise) product of A and B, while A, B denotes the inner product tr(A T B), where tr(·) is the trace of a square matrix. For a symmetric positive semi-definite matrix A, we write A 0. Regarding vector norms, x := i x 2 i denotes the Euclidean norm. The sign function is denoted by sgn(·), while | · | denotes the absolute value operator. Regarding matrix norms, the 0 -(quasi-) norm, which equals the number of non-zero entries, is denoted by · 0 . X q := i j |X i j | q 1/q is the matrix q -norm, of which the Frobenius norm X F := i j X 2 i j = tr(X T X) is a special case when q = 2. X denotes the spectral norm, which equals the largest singular value. If σ i (X) is the ith singular value of X, X S p := i σ i (X) p 1/ p is the Schatten p-norm of X, of which the nuclear norm X * := i σ i (X) is a special case when p = 1. Linear maps are denoted by scripted letters. For a linear map A : R m×n → R p , A * denotes the adjoint map of A, while σ max (A) denotes the maximum singular value of A. I denotes the identity map.

The Hankel Matrix Structure Let
and ∈ R nk×q with σ max ( ) ≤ 1 (Fazel et al. 2013 It is proved in Fazel et al. (2013) that H * m,n, j,k (B) 2 F ≤ L B 2 F , where L := min{ j, k}. This finding, combined with σ max ( ) ≤ 1, entails that the spectral norm of the adjoint map H * is less than or equal to √ L. Herein, the space of Hankel matrices is denoted by S H .

LTI System Learning via Structured Rank Minimization
Dynamical systems, such as LTI systems, are able to compactly model the temporal evolution of time-varying data. While the dynamic model can be considered as known in some applications (e.g., Brownian dynamics in motion models), it is in general unknown and, hence, should be learned from the available data. Consider a sequence of observed outputs y t ∈ R m and inputs u t ∈ R d , respectively, for t = 0, . . . , T − 1. The goal is to find from the observed data, a state-space model, corresponding to a LTI system, given by such that the system is of low-order, i.e., it is associated with a low-dimensional state vector x t ∈ R n at time t, where n is the unknown true system order. The order of the system (i.e., the dimension of the state vector) captures the memory of the system and it is a measure of its complexity. In (3), both the state and the measurement equations are linear and the parameters of the system, i.e., the matrices A, B, C, D are constant over time but their dimensions are unknown. Therefore, to determine the model, we need to find the model order n, the matrices A, B, C, D, and the initial state x 0 . To this end, the model order should be estimated first. Next, the estimation of the system order using Hankel matrices is summarized. Let us assume that the unknown state vectors has dimension r > n and let be the matrices containing in their columns the unknown state vectors, the observed outputs, and the observed inputs of the system, respectively, for t = 0, 1, . . . , T − 1. Let also H m,1,r +1,T −r (Y) and H d,1,r +1,T −r (U) be the Hankel matrices constructed from the observed system outputs and inputs, respectively, according to (1) and U ⊥ ∈ R (T −r )×q be the matrix whose columns form an orthogonal basis for the nullspace of H d,1,r +1,T −r (U). Then, the LTI in (3) can be expressed by employing the above mentioned Hankel matrices as follows.
By right-multiplying both sides of (4) with U ⊥ and by setting If the inputs are persistently exciting (i.e., XU ⊥ has full rank) and the outputs are exact, then by (6) it is clear that the system order, which is measured by the rank of G (Van Overschee and De Moor 2012), is equal to rank (H(Y)) (Van Overschee and De Moor 2012) and from it a system realization (i.e., estimation of the unknown system parameters) is easily computed by solving a series of systems of linear equations following, for example, Van Overschee and De Moor (2012).
However, real-world data are not exact and thus H(Y) is full-rank. Therefore, to find the minimum order realization of the system, we seek a matrixŶ which is as close as possible, in the least square sense, to the observed data and the rank of H(Ŷ) is minimal. Formally, we seek to solve the following Hankel structured rank minimization problem where λ > 0. Assuming thatŶ is a solution of (7), then rank(H(Ŷ)) acts as the estimated system order 1 andŶ is used next to estimate the system parametersÂ,B,Ĉ,D and the initial state vectorx 0 by solving a series of systems of linear equations (Van Overschee and De Moor 2012).

Hankel Rank Minimization Models and Applications
Problem (7) is combinatorial due to the discrete nature of the rank function and thus difficult to be solved (Fazel et al. 2001). To tackle this problem, several approximations have been proposed. In particular, by employing the nuclear norm, which is the convex surrogate of the rank function (Fazel et al. 2001), a convex approximation of (7) has been proposed in Fazel et al. (2013). By adopting the variational norm of the nuclear norm (i.e., Ŷ * = minŶ =UV U 2 F + V 2 F ), nonlinear approximations to (7) have been developed (Signoretto et al. 2013;Yu et al. 2014). Furthermore, to estimate the rank of an incomplete Hankel matrix (i.e., in the presence of missing data), the models in Markovsky (2014), Dicle et al. (2013) and Ayazoglu et al. (2012) have been proposed. Representative structured rank minimization models along with the optimization problems that they solve are listed in Table 1.
The aforementioned models have been mainly applied in the fields of system analysis and control theory for system identification and realization and in finance for time-series analysis and forecasting. More recently, learning dynamical models via Hankel rank minimization has been exploited to address computer vision problems such as activity recognition Bhattacharya et al. 2014), tracklet matching (Ding et al. 2007a(Ding et al. , 2008Dicle et al. 2013), multi-camera tracking , video inpainting (Ding et al. 2007b), causality detection (Ayazoglu et al. 2013), and anomaly detection (Surana et al. 2013). However, none of these methods has been exploited to learn behavior dynamics based on continuous annotations of behavior or affect and visual features. This will be investigated shortly in Sect. 6.
Remark Despite their merits, the aforementioned models exhibit the following limitations. By adopting the least squares error, the majority of the models in Table 1 assume Gaussian distributions with small variance (Huber 2011). Such an assumption rarely holds in real-world data that are often corrupted by sparse, non-Gaussian noise (cf. Sect. 1). This drawback is partially alleviated in SRPCA (Ayazoglu et al. 2012), where a sparsity promoting norm is incorporated into the nuclear norm minimization problem in order to account for sparse noise of large magnitude. Furthermore, the convex relaxation of the rank function with the nuclear norm in Fazel et al. (2013) and Ayazoglu et al. (2012) may introduce a relaxation gap. Therefore, due to the above reasons, the estimated rank of the Hankel matrix obtained by the models in Fazel et al. (2013) and Ayazoglu et al. (2012) may be arbitrarily away from the true one . On the other hand, since the models in Signoretto et al. (2013), Yu et al. (2014) and Markovsky (2014) rely on factorizations of the Hankel matrix, they implicitly assume that the rank of the Hankel matrix is known in advance; obviously this is not the case in practice. To alleviate the aforementioned limitations and robustly estimate the rank of the Hankel matrix in the presence of gross noise and missing data, a novel structured rank minimization model is detailed next.

Table 1
List of structured rank minimization methods (including the proposed method) and the corresponding optimization problems Structured Robust PCA (SRPCA) (Ayazoglu et al. 2012)

Related methods
Iterative Hankel Total Least Squares (IHTLS) (Dicle et al. 2013) GivenH Structured Low-Rank Approximation (SLRA) (Markovsky 2014) min For all methods, the observed data matrix, its Hankel version, and the estimated (Hankel) structured low-rank approximate are denoted by M respectively, unless otherwise stated

Problem Formulation
Let M = [m 0 m 1 . . . m T−1 ] ∈ R D×T be a matrix containing in its columns contaminated by gross but sparse noise, time varying data. The goal is to robustly learn the dynamics underlying the data, in the presence of sparse, non-Gaussian noise and missing data.
To this end, we seek to decompose M as a superposition of two matrices: M = L+E, where L ∈ R D×T and E ∈ R D×T , such that the Hankel matrix of L (i.e., H(L) ∈ R M×N ) be of minimum rank and E be sparse. The minimum rank of H(L) correspond to the minimum-order LTI system that describes the data, while by imposing E to be sparse, we account for sparse, non-Gaussian noise.
A natural estimator accounting for the low-rank of the Hankel matrix H(L) and the sparsity of E is to minimize the rank of H(L) and the number of non-zero entries of E, measured by the 0 (quasi)-norm. This is equivalent to solving the following non-convex optimization problem.
where λ is a positive parameter. Clearly, (8) is a robust version of the Hankel structured rank minimization problem (7). Problem (8) is intractable, as both rank and 0 -norm minimization are NP-hard (Vandenberghe and Boyd 1996;Natarajan 1995). In order to tackle this NP-hard problem, both convex and non-convex relaxations of the rank function and the 0 -norm are considered. To this end, we choose to approximate the rank function and the 0 -norm by the Schatten p-and the q -norm, respectively, and solve which is a convex optimization problem for p = q = 1 (i.e., the Schatten 1-norm is by definition the nuclear norm) and non-convex for 0 < p, q < 1. Convex approximations of the rank function and the 0 -(quasi)-norm by means of the nuclear norm (i.e., Schatten 1-norm) (Fazel et al. 2001) and the 1 -norm (Donoho 2006) have been widely applied in several rank and sparsity minimization problems (e.g., Candès et al. 2011). The main advantage of this approach is that the global optimum of the convex problems can be found relatively easily by using off-the-shelf optimization methods such as the ADMM. However, the convexification of rank minimization problems may suffer from the following two drawbacks. First, the recoverability of the low-rank solutions via nuclear norm minimization is only guaranteed under incoherence assumptions (e.g., Candès et al. 2011). Such assumptions regarding incoherence may not be guaranteed in practical scenarios . For example in the proposed model, the resulting global optimal solution of the convex instance of (9) (p, q ≥ 1) may be arbitrarily away from the actual solution of (8). Second, it is known that the 1 -norm is a biased estimator (e.g., Zhang 2010). Since the nuclear norm (or equivalently the Schatten-1 norm) is essentially the application of the 1 norm on the singular values, it may only find a biased solution. To alleviate the aforementioned issues of the convex instance of (9), we further consider the non-convex approximation of (8) by employing the Schattenp norm and q -norm with p, q ∈ (0, 1). Such non-convex functions have been shown to provide better estimation accuracy and variable selection consistency (Wang et al. 2014b) in related approximations of 0 -norm regularized rank minimization problems (Nie et al. 2012(Nie et al. , 2013Papamakarios et al. 2014).
To disentangle the Schatten p-and q -norm minimization sub-problems in (9) from the matrix structure and data-fitting requirements, respectively, (9) is equivalently written as To account also for (partially) missing observations in M, we introduce the matrix W ∈ R D×T which is given by where ⊂ [1, D] × [1, T ] is the set containing the indices of the observed (available) entries in M. By incorporating W inside the q -norm term in (10) as a multiplicative weight matrix for E, we arrive at the following problem.
Remark Note that the choice of the Hankel map H(·) depends on the application (see Sects. 2.2, 5). In any case, the is computed according to (1); the number of blocks along the row and column dimension j and k, respectively, are set to j = r + 1 and T −r , where T is the number of observations and r > n, with n denoting the system order.

Alternating-Direction Method-Based Algorithm
The ADMM is employed to solve (12). To this end, the augmented Lagrangian function for (12) is defined as follows.
where μ is a positive parameter and V : are the sets containing all the unknown variables and the Lagrange multipliers for the equality constraints in (12), respectively. Specifically, at each iteration of the proposed ADMM-based solver, (13) is minimized with respect to each variable in V in an alternating fashion and, subsequently, the Lagrange multipliers in Y and the parameter μ are updated. The iteration index is denoted herein by i. The notation is used to denote the solution stage in which all other variables but N are kept fixed, and similarly for the other unknown variables.
The solutions of minimization of (13) with respect to E and N are based on the operators and Lemmas that are introduced next. Minimizing (13) with respect to L does not admit a closed form solution due to the presence of the quadratic terms. Similarly to Fazel et al. (2013), to 'cancel out' these terms we add a proximal term to the respective partial augmented Lagrangian. The additive term is based on the (semi-) norm · Q 0 induced by the (semi-) inner product P T Q 0 P, with Q 0 being the positive (semi-) definite matrix given by where L := min{ j, k}. As shown in Sect. 2.1, √ L is the upper bound of the spectral norm of the Hankel adjoint map H * .
Thus, given the variables V[i], the Lagrange multipliers Y[i] and the parameter μ[i] at iteration i, the updates of the proposed solver, summarized in Algorithm 1, are as follows.

Update the Primal Variables
Update the Lagrange Multipliers Equation (15), which offers the update for E, is solved based on the generalized soft thresholding operator proposed in Nie et al. (2013) and briefly described next. Consider the following problem.
with B ∈ R m×n and α a positive parameter. Problem (20) is separable with respect to the elements of B and is thereby decomposed into m × n sub-problems of the form Let 1 2−q and c 2 = c 1 + αq|c 1 | q−1 . Equation (21) admits an analytical solution for q ∈ (0, 1] given by where ρ 1 and ρ 2 are the roots of The roots can easily be found by applying the iterative Newton-Raphson root-finding method initialized at z i j . Similarly to Papamakarios et al. (2014), we henceforth call the element-wise solver (22) generalized q-shrinkage operator and denote it by S q α {·}. Note that when q = 1 the aforementioned operator reduces to the element-wise application of the well-known shrinkage operator (Candès et al. 2011), defined by We shall denote by S q (α,W) {·} the operator for whichᾱ = αw i j , with W ∈ R m×n known, is used instead of α for the solution of each respective b i j in (22).
The solution of (16), that is, the minimization of (13) with respect to N, is based on the following Lemma.
. 7: Update the Lagrange multipliers by (18) Lemma 1 (Nie et al. 2013) The solution of the optimization problem We shall denote by D p α {·} the operator -henceforth called generalized singular value p-shrinkage operator -that solves (24).
The proposed ADMM-based solver is summarized in Algorithm 1. The latter is terminated when the following conditions are met where 1 and 2 are small positive parameters, or a maximum of 1000 iterations are reached.

Computational Complexity and Convergence
The cost of each iteration in Algorithm 1 is dominated by the calculation of the generalized singular value p-shrinkage operator in Step 5, which involves a complexity equal to that of SVD, i.e., O max{M 2 N , M N 2 } . The generalized q-shrinkage operator, utilized in Step 4, entails linear complexity O(DT ).
Regarding the convergence of Algorithm 1, there is no established convergence proof of the ADMM for problems in the form of (12). Indeed, the ADMM is only known to converge for convex separable problems with up to two blocks of variables (e.g., Bertsekas 2014;Candès et al. 2011). However, this is not the case even in the convex instance of (12) (i.e., when p = q = 1), since the optimization problem involves more than two blocks of variables. For the multiblock separable convex problems, with three or more blocks of variables, it is known that the original ADMM is not necessarily convergent (Chen et al. 2016). On the other hand, theoretical convergence analysis of the ADMM for nonconvex problems is rather limited, making either assumptions on the iterates of the algorithm (Xu et al. 2012;Magnusson et al. 2016) or dealing with special non-convex models (Li and Pong 2015;Wang et al. 2014aWang et al. , 2015, none of which is applicable for the proposed optimization problem (12). However, it is worth noting that the ADMM exhibits good numerical performance in non-convex problems such as nonnegative matrix factorization (Sun and Févotte 2014), tensor decomposition (Liavas and Sidiropoulos 2015), matrix separation (Shen et al. 2014;Papamakarios et al. 2014), matrix completion (Xu et al. 2012), motion segmentation , to mention but a few.
To the best of our knowledge, the only work which focuses on the convergence analysis of the ADMM when applied for the optimization of piecewise linear functions such as the Schatten p-norm and the q -norm (when 0 < p, q ≤ 1) is the recent preprint of Wang et al. (2016). However, since a systematic convergence analysis is out of the scope of this paper, we plan to adapt the analysis in Wang et al. (2016) in order to analyze the convergence of the proposed algorithm in the future.
Even though we cannot theoretically guarantee the convergence of the proposed solver, the experimental results on synthetic data in Sect. 6.1 show that its numerical performance is good in practice. Specifically, the empirical convergence of the proposed solver is evidenced, where both the primal residual and the primal objective are nonincreasing after the very few iterations (see Fig. 2). Similar convergence behavior characterizes also the experiments on real-world data presented in Sect. 6, where we have observed that even the non-convex variant with p = q = 0.1 of the proposed method (12) needs no more than 180 iterations to converge in most cases.

Scalable Version of the Algorithm
To improve the scalability and reduce the computational complexity of the ADMM-based Algorithm 1, we develop here a scalable version. Depending on the application, and more specifically, the number of inputs and/or outputs employed and the number of observations, the dimension of the Han- kel matrix H(L) ∈ R M×N can rise largely, which makes the calculation of SVD prohibitive. To alleviate the aforementioned computational complexity issue, we further impose that H(L) ∈ R M×N is factorized into an orthonormal matrix and a low-rank matrix as H(L) = QR, with Q ∈ R M×K , R ∈ R K ×N and K M, N . In this factorization, Q ∈ R M×K is a column-orthogonal matrix satisfying Q T Q = I and R ∈ R K ×N is a low-rank matrix representing the embedding of H(L) onto the K -dimensional subspace spanned by the columns of Q.
Due to the unitary invariance of the Schatten p-norm, the following equality holds QR S p = R S p . Thus, by incorporating the factorization H(L) = QR and adding the orthonormality constraint for Q, (12) is written as Since M K + K N M N , the number of variables has been significantly reduced. Clearly, this modification reduces the overall complexity of the method, since the SVD is now applied on M × K and K × N matrices as opposed to a M × N matrix.
The ADMM is employed to solve (26).
defined as the sets containing all the unknown variables and the Lagrange multipliers for the first two equality constraints in (26), respectively, the (partial) augmented Lagrangian function is defined as where μ is a positive parameter. Therefore, at each iteration of the ADMM-based solver for (26), we solve with respect to each variable in V in an alternating fashion and, subsequently, the Lagrange multipliers in Y and the parameter μ are updated. The proposed solver for (26) is summarized in Algorithm 2. The updates for R, L, E are similar to those employed to solve (12). The solution of (28) with respect to Q is based on the Procrustes operator, which is defined as P[L] = AB T for a matrix L with SVD L = A B T and solves the problem in the following Lemma.
Lemma 2 (Zou et al. 2006) The constrained minimization problem: has a closed-form solution given by P = P [A].

Computational Complexity and Convergence
The cost of each iteration in Algorithm 2 is dominated by the calculation of the generalized singular value p-shrinkage operator and the Procrustes operator in Step 5 and 6, respectively, which both rely on SVD, thus involving respective complex- It is worth stressing again that choosing K M, N , which implies M K + K N M N , leads to a significantly reduced overall complexity for Algorithm 2 compared to that of Algorithm 1, which is instead dominated by a SVD on a M × N matrix, hence O max{M 2 N , M N 2 } . Again, the generalized q-shrinkage operator, utilized in Step 4, entails linear complexity O(DT ).
Regarding the convergence of Algorithm 2 which solves the scalable version of the proposed model (26), there is no yet established convergence proof of the ADMM for problems in the form of (26). The discussion provided above on the convergence of Algorithm 1 applies to a large extent for Algorithm 2 as well. As a matter of fact, theoretical analysis for the convergence of Algorithm 2 becomes more challenging, compared to the case of Algorithm 1, considering that the factorization QR = H(L) and the nonlinear orthonormality constraint Q Q = I are introduced in the scalable version of the proposed model (26). It is also worth noting that problem (26) is always non-convex due to these two equality constraints, and thus the solutions yielded by the optimization problems (12) and (26) cannot be related. However, it has been shown in Liu and Yan (2012) that the ADMM converges to a local minimum for a problem similar to problem (26) with convex objective function, i.e., p, q ≥ 1. To the best of our knowledge, for the case 0 < p, q < 1, i.e., when the Schatten p-norm and the q -norm act as non-convex approximations of the rank function and the 0 -(quasi) norm, respectively, there has been no theoretical evidence for the convergence of the ADMM for the problem (26) and further investigation is needed.
Nevertheless, the ADMM has been shown to achieve good numerical performance in non-convex subspace learning problems employing a similar matrix factorization approach with one of the factors being orthonormal (Sagonas et al. 2014;Papamakarios et al. 2014). Also, experimental results on synthetic data evidence the empirical convergence of Algorithm 2, which has been found to be similar to that shown for Algorithm 1 ( p = q = 0.5) in Fig. 2. Good numerical performance is also achieved by the scalable solver in the experiments presented in Sect. 6.

Dynamic Behavior Analysis Frameworks based on Hankel Structured Rank Minimization
In this section, we develop two frameworks for dynamic behavior analysis.

Dynamic Behavior Prediction
Consider the case where continuous-time, real-valued annotations characterizing dynamic behavior or affect (e.g., conflict, valence, arousal), manifested in a video sequence of T frames, are available for a number of consecutive frames t = 0, 1, . . . , T train − 1 (training set). The goal herein is to first learn a low-order LTI system that generates the annotations as outputs Y = [y 0 , y 1 , . . . , y T train −1 ] ∈ R m×T train when visual features act as inputs U = [u 0 , u 1 , . . . , u T train −1 ] ∈ R d×T train , and next use it to predict behavior measurementsŷ t for the remaining frames of the sequence t = T train , . . . , T − 1 (test set), based on the respective features u t . To this end, the following framework is proposed. First, the proposed structured minimization problem (10) is solved, with M = Y and the Hankel map H(·) defined as in Sect. 2.2, to estimate the system order. Second, the low-rank solution H(L) is used to estimate the system matri-cesÂ,B,Ĉ,D and the initial state vectorx 0 by solving a system of linear equations, following, for example, Van Overschee and De Moor (2012). Finally, test set predictionsŷ (t = T train , . . . , T − 1) for dynamic behavior are obtained by applying the equations of the learned state-space model (3) for t = 0, 1, . . . , T train − 1, with the visual features used as inputs u t .
Applications The aforementioned framework can be used for continuous prediction of any number or type of realvalued behavioral attributes manifested in a video sequence, by employing a portion of consecutive frames (even a few seconds) to learn a LTI system as described above (see Sect. 6).

Dynamic Behavior Prediction with Partially Missing Outputs
Consider now the scenario in which the goal is to predict missing (or unreliable) and not necessarily consecutive realvalued measurements of dynamic behavior or affect, viewed as missing outputsȳ t of a low-order LTI system, directly by employing the observed visual features as inputs u t and the available annotations as outputs y t , without explicitly learn-ing the system. Herein, we approach this task as a (Hankel) structured low-rank matrix completion problem and address it by means of the following predictive framework that is based on the proposed model (12) Thus, a low-rank approximation of H(M) should be obtained to estimate the true order of the system n.
To this end, the proposed model (12) is solved, with M defined as above and W computed according to (11). Note that this process simultaneously 'completes' the missing observations of M, by forcing the approximation of H(M) to be low-rank, or in other words, the 'completed' trajectory L to follow the same linear dynamics underlying the observed trajectory M. Finally, the missing outputs are recovered from the respective entries of the low-rank approximation H(L). Notably, this framework has the advantage that the missing observations are obtained directly by solving (12), thus avoiding the computational load associated with learning a minimum order realization of the system.

Applications
The aforementioned framework can achieve prediction of missing (past or future) observations pertaining to dynamic human behavior or affect, with the latter used as outputs of a low-order LTI system. For instance, a computer vision problem that can be addressed by means of the proposed framework is the problem of tracklet matching (Ding et al. 2007a(Ding et al. , 2008Dicle et al. 2013), which consists of stitching trajectories of detections belonging to the same target. For this task, one needs to assess whether the joint trajectory of detections M = Y startȲinter Y end , where Y start and Y end are the observed trajectories and Y inter is a zero-valued matrix corresponding to the 'missing' intermediate trajectory, is the output of the same autonomous (output-only) LTI system that generated Y start and Y end . This is achieved by solving (12) forȲ inter , with M defined as above, and subsequently comparing rank(H(L)) with rank(H(Y start )) and rank(H(Y end )) (see Sect. 6.4).

Experiments
The efficiency of the proposed structured rank minimization methods is evaluated on synthetic data corrupted by sparse, non-Gaussian noise (Sect. 6.1), as well as on real-world data with applications to: (i) conflict intensity prediction (Sect. 6.2), (ii) valence-arousal prediction (Sect. 6.3), and (iii) tracklet matching (Sect. 6.4). For the case of dynamic behavior analysis experiments on real-world data, for the first two tasks, the framework described in Sect. 5.1 is employed, while for the last we utilize the framework described in Sect. 5.2.
Aside from the proposed methods, five structured minimization methods are also examined, namely HRM 2 (Fazel et al. 2013 Table 1). For all experiments presented in our paper, a grid search is employed to tune the parameter λ of the proposed methods or any other parameters of the compared methods that need tuning. Tuning is performed by following an outof-sample evaluation, that is, the last portion of the training frames is withheld for validation and the best-performing model is used for testing. Specifically, the last 2r training observations, with r defined in Sect. 3, are kept out for validation in all our experiments.

Experiment on Synthetic Data
In the experiments presented in this section, the efficiency of the proposed method (12) is evaluated on synthetic data corrupted with sparse, non-Gaussian noise. In order to generate Hankel matrices of given rank n, we follow the methodology proposed in Park et al. (1999), that is, T outputs y(t) of an autonomous stable LTI system of order n are generated by applying the following formula where z k appear in pairs of conjugate numbers so that the observations y(t) are real numbers. It follows naturally that a M × N Hankel matrix Y = H(y) = H 1,1,M,N (y) with y derived according to (31) has rank equal to n (Park et al. 1999). Subsequently, sparse, non-Gaussian noise η ∈ R 1×T is added to the original signal y, with the non-zero entries following the Bernoulli model with probability ρ = 0.2, as in Candès et al. (2011). The final corrupted signal is formed asỹ = y + η, with the corresponding noisy Hankel matrix Y = H(ỹ) being full-rank.
In what follows, the efficiency of various structured rank minimization methods in reconstructing the noiseless system outputs y(t), t = 1, 2, . . . , T , by finding a low-rank approximationŶ = H(ŷ) given the noisy Hankel matrixỸ, is experimentally assessed in various scenarios.
The reconstruction error, for both the noiseless observations y and the noise η, is measured in terms of relative reconstruction error as follows.
with s denoting the original signal andŝ denoting the estimated signal by the algorithm.

Experiment with Varying System Orders
Herein, experiments are conducted for various orders of the LTI system generating the 'clean' data, as described above. Specifically, the system order n is varied in {6, 12, 18}. For each value of n the experiment is repeated 10 times, that is, for 10 different output trajectories y ∈ R 1×T computed by randomly selecting the complex coefficients in (31). For the proposed model, Algorithm 1 is used and the following combinations are examined for the p and q values corresponding to the Schatten p-and q -norm, respectively: ( p, q) ∈ {(1, 1), (0.9, 0.9), (0.5, 0.5), (0.1, 0.1)}. The methods HRM, SVD-free, SRPCA, IHTLS, and SLRA (listed in Table 1) are also evaluated for comparison. For each method, results are reported in terms of minimum reconstruction error err(y,ŷ) computed according to (32). Performance is also evaluated in terms of reconstruction error for the noise signal err(η,η) and the Pearson Correlation Coefficient (COR) measured between the noiseless observations y and the reconstructed outputsŷ. Table 2a-c contain the results obtained by the various methods for system order n = 6, n = 12 and n = 18, respectively. Specifically, mean and standard deviation values of the reconstruction errors err(y,ŷ) and err(η,η) and the COR values computed over the 10 trials of each experiment are reported. The mean values of the estimated system order (rank ofŶ = H(ŷ)), number of iterations and execution time are also reported.
Firstly, we observe that the non-convex instances of the proposed method, i.e., when p, q < 1, consistently account for the most accurate reconstruction of both the clean signal, in terms of both reconstruction error and correlation, as well as the recovery of the sparse noise. In most cases, the performance is improved when smaller values for p and q are chosen for the proposed model. Secondly, all the compared methods (HRM, SVD-free, SRPCA, IHTLS and SLRA) achieve much lower performance in terms of all the three metrics employed. Furthermore, it is worth noting that, in the scenarios corresponding to orders n = 12 and n = 18, SRPCA recovers the noise more accurately than the HRM, SVD-free, IHTLS and SLRA. This is expected since the former is the only method amongst the compared ones that is robust to sparse, non-Gaussian noise. It is also worth mentioning that the system order pertaining to the recovered observations varies significantly amongst different methods. Amongst the different instances of the proposed method, this variation is much smaller, with the only exception being the result obtained by our method with ( p, q) = (0.1, 0.1) for the case n = 12. Regarding the number of iterations, which varies largely across methods, we observe that the non-convex instances of the proposed method require a larger amount of iterations to converge, as compared to the convex instance ( p = q = 1). However, even in the scenario of order n = 18, the best-performing instance of the proposed method ( p = q = 0.1) needs 223 iterations in average to converge. Finally, the execution times corresponding to the best-performing, non-convex instances of the proposed method in all three experiments are comparable to those accounted for by even convex compared methods, such as SRPCA.
Empirical Convergence Analysis In this experiment, the convergence of the proposed method is assessed by employing various types of initialization. To this end, we employ synthetic data corrupted with sparse, non-Gaussian noise, generated similarly to the previous experiment. We clarify here that the only variable that needs to be initialized in Algorithm 1, except for the Lagrange multipliers, is the matrix L. All other variables are calculated in the 1st iteration of the ADMM loop according to the respective updates.
The proposed solver is executed using the following three types of initialization, namely, 'original signal'  Fig. 2. Here M =ỹ denotes the given noisy data and L =ŷ the reconstruction. These plots enable us to demonstrate the convergence of the proposed solver. Note that for the last initialization scenario, the experiment is repeated 10 times. and the average convergence curve is plotted.
By inspecting both graphs, it is evident that all three initializations lead to similar convergence behavior in the sense that both the primal objective and the primal residual are non-increasing after the first few iterations. However, by initializing the algorithm using the scaled version of the original signal (L[0] = 1.1ỹ) the primal objective attains smaller val-

Conflict Intensity Prediction
In this section, we address the problem of continuous conflict intensity prediction based on the visual modality only. Conflict is usually defined as disagreement of high intensity or escalation, in which at least one of the involved interlocutors feels emotionally offended (Bousmalis et al. 2009). Hence, various challenges are posed to machine analysis of conflict in real-world competitive conversations, since simultaneous processing of the data streams from all inter-actants is required. Furthermore, when the visual modality is also considered, feature extraction has to cope with various types of visual noise, such as extreme head pose values and abrupt body movements, which renders computer vision pre-processing (e.g., tracking, alignment) rather difficult. Automated approaches to conflict analysis include just a few works, which are based on audio features only (Kim et al. 2012a, b). However, visual features can help discover facial behavioral cues that are intrinsically correlated with conflict, such as smiling, blinking, head nodding, flouncing and frowning. The only audio-visual approach to conflict detection that we are aware of is Panagakis et al. (2016), where robust, multi-modal fusion of audio-visual cues is utilized. However, all works mentioned above address conflict Fig. 3 Three sample snapshots from the CONFER dataset, corresponding to dyadic conversations of two guests in conflict or conflict escalation detection within a classification framework predicting binary (conflict/non-conflict) or discretized conflict intensity labels.
To the best of our knowledge, the presented experiments constitute the first work that (i) addresses continuous conflict intensity prediction through a dynamic modeling framework (as opposed to frame-by-frame classification or regression), and (ii) uses visual features only.
Data In view of the absence of benchmark datasets for conflict detection, video excerpts from live political debates, aired on Greek television 3 in between 2011 and 2012, are utilized. It is worth stressing that these debates, despite being moderated by the TV host, include unscripted dyadic interactions which are highly likely to lead to real conflict due to the participants acting under incompatible motives and interests. From the entire dataset, 160 audio-visual non-overlapping recordings with total duration amounting to 170 mins, have been manually extracted. These videos have been annotated by 10 experts, all of them being native Greek speakers, in terms of continuous conflict intensity. The temporal resolution of the video stream is 25 frames per second. Only the episodes involving exactly two interlocutors (97 out of 160 samples) are considered herein. For each sequence, the mean over the 10 available ratings, normalized to [0, 1], is used as ground truth for conflict intensity. Three sample snapshots from the dataset, henceforth called Conflict Escalation Resolution Database (CONFER), are depicted in Fig. 3.

Features and Experimental Protocol
For visual feature extraction, we use the Gauss-Newton Deformable Part Model in Tzimiropoulos and Pantic (2013) for facial landmark detection, which when combined with a person-specific face detector produces very accurate results (Chrysos et al. 2015), to detect 49 fiducial facial points in each frame of an input video for each of the two interactants. The points are subsequently globally registered, using a 2-D non-reflective similarity transformation with respect to 4 reference points (centers of the eyes, center of the nose and top of the nose), to remove the effects of head translation, scale and in-plane rotation. This way, yaw and pitch pose angles, which are expected to be informative in terms of conflict, are retained in the shape configuration. Finally, Principal Component Analysis (PCA) is used at each frame to reduce dimensionality for the points of each speaker to 7, based on the components collectively accounting for 98% of the total variance.
The dynamic behavior prediction framework described in Sect. 5.1 is applied separately for each sequence used in the experiments of this section. During training, the stacked feature vectors corresponding to the two interlocutors are used as inputs u t at each time frame t of the training set (t ∈ [0, T train − 1]), while the ground truth is used as output y t of a LTI system. The goal is to predict the outputŷ t (conflict intensity) for each frame of the sequence (t ∈ [0, T − 1]), based on the learned system parameters and the respective inputs (features).
We establish an experimental scenario involving complete input-output data. To this end, 43 non-overlapping segments have been extracted from the 97 available episodes, based on the following conditions: (i) they are at least 400 frames long, so that the predictive capability of the proposed framework can be evaluated on long temporal segments portraying frequent conflict intensity fluctuations and conflict escalation/resolution, and (ii) the face detection for each frame is successful and, hence, the facial landmark detection results for each frame are accurate (see Chrysos et al. 2015, for further explanation).
The resulting subset of clips has a mean and standard deviation of duration of 804 frames and 561 frames, respectively, and corresponds to 22 subjects. For each of the 43 video sequences, the first P = 60% of the frames are used for training, while the remaining frames are used for testing. This choice establishes a subjects-dependent experimental setting. It is worth mentioning that the experimental setting is challenging given that the proposed framework learns temporal behavioral patterns related to conflict escalation/resolution from a single dyadic interaction with average duration of about 19 seconds. This is in contrast to relying on a large set of training instances containing multiple interactants exhibiting conflicting behavior in various contexts. For the proposed model (12), the following combinations are examined for the p and q values corresponding to the Schatten p-and q -norm, respectively: ( p, q) ∈ {(1, 2), (1, 1), (0.9, 0.9), (0.5, 0.5), (0.1, 0.1)}. The scalable Algorithm 2 is also used for this experiment, with the dimension of the column space of Q in (26) set to K = 10. The convergence parameters 1 and 2 are set to 10 −4 and 10 −7 , respectively. For each sequence, 150 values, logarithmically spaced in the interval [10 −3 , 1] are examined for the tuning of parameter λ in Algorithms 1 and 2. Similarly, a suitable grid search is conducted to tune the parameters of the compared methods. For details on methods to which we compare, see Table 1.
For evaluation, the Pearson Correlation Coefficient (COR) is used, measured between the ground truth y t (mean over the 10 annotations) and the predicted outputŷ t on the test set frames (t ∈ [T train , T − 1]) of each sequence. Motivated by recent works on predictive analysis of human behavior (Mavadati et al. 2013;Kaltwang et al. 2016), we choose to also report the Intra-Class Correlation Coefficient (ICC), which was first proposed in Shrout and Fleiss (1979) as a metric to assess rater reliability in behavioral measurements. Specifically, the coefficient ICC(3,1) is employed herein, that corresponds to the case "Each target is rated by each of the same k judges, who are the only judges of interest" (Shrout and Fleiss 1979). For each sequence and method, the ICC(3,1) (henceforth denoted by ICC) is calculated by considering the 'method' and the 'mean annotator' as the only 'judges' of interest and the conflict intensity values for the test set frames as 'targets' in the definition above. To obtain a 'human' baseline ICC result, i.e., a measure of 'level of consistency amongst 10 humans in measuring conflict intensity', we also compute the ICC amongst the 10 annotations for the test frames of each sequence. The average value of the inter-annotator ICC, denoted by ICC h , over all 43 sequences, was found ICC h = 0.740. Finally, note that each method is separately optimized in terms of each metric.

Results and Discussion
Results in terms of mean value of COR and ICC over all 43 sequences are reported in Table 3 for all methods examined. For details on methods to which we compare, see Table 1. The values of the resulting LTI system order and execution time (time: secs per frame × 100) for the respective best-performing structured rank minimization solution are also reported, again averaged over all sequences. 4 As can be seen, the proposed methods outperform all methods that are compared to, in terms of both COR and ICC. The second-best-performing method in terms of both metrics is IHTLS, with all remaining methods yielding lower scores. Results obtained by the scalable Algorithm 2  (  The bold values indicate the best performances in terms of each evaluation metric Averaged values for the resulting system order and execution time (time: secs per frame × 100) are also shown for each (COR-optimized) structured rank minimization method. For details on methods to which we compare, see Table 1 (denoted by ours sc ) are on par with those yielded by Algorithm 1. As a matter of fact, the best overall performance in terms of both metrics is achieved by the scalable algorithm with p = q = 0.1. Furthermore, the proposed methods (12) and (26) yield superior performance when the objective function is non-convex ( p, q < 1), as compared to that obtained by convex instances of (12) and instances of (26) with convex objective function ( p, q = 1 and p = 1, q = 2). These results indicate that the dynamic model learned with the non-convex instances explain better the observed data thus providing a better estimate for the system order than that learned with the convex instances. This may be attributed to the relaxation gap entailed by replacing the rank and 0 -norm with the Schatten p-and q -norm, respectively, is tighter than that entailed by convex approximations. Also, it is interesting to observe that the choice q = 2, which corresponds to a Frobenius-norm based fitting measure, consistently results in the lowest performance amongst the values examined for the q -norm. Presumably, this is due to the susceptibility of the corresponding fitting measures to gross, sparse noise (Huber 2011).
Regarding run time efficiency, it is worth noting that the execution time accounted for by the best-performing variant of the proposed methods (ours sc with p = q = 0.1) is close to a degree of magnitude smaller than that of the best-performing out of the compared methods (IHTLS). As expected, execution time increases as p and q values move closer to zero. Moreover, the high COR and ICC scores achieved by the proposed methods are accompanied by low values for the resulting system orders (e.g., n ∈ [4, 6] for ours sc ). This property is crucial for both the generalizability and execution time efficiency of the overall predictive framework.
Notably, IHTLS, HRM and the proposed methods lead to an average ICC which is higher than the mean interannotator ICC h of 0.740. This means that these methods, which were trained using the 'mean annotator' annotations, have learned the trend of the 'mean annotator' exceptionally well and were able to reproduce the trend accurately. This clearly demonstrates the suitability of these methods for modeling the human behavior analysis task at hand (i.e., conflict intensity prediction).

Effect of the Training Set Size on Prediction Accuracy
The results reported in Table 3 correspond to using the first P = 60% of each sequence's frames for training (structured rank minimization and LTI system learning) and the remaining frames for predicting the respective conflict intensity values. To investigate how the choice of the portion of frames used for training affects the predictive capability of the structured rank minimization-based framework, we vary the training set percentage P in {30%, 40%, 50%, 60%, 70%} of the sequence length. The test set percentages vary also according to 100-P. The resulting training (test) set sizes, averaged over all 43 sequences, are 240, 322, 402, 483, 563 (559, 482, 401, 321, 241) frames, respectively. For this experiment, the proposed method with p = q = 0.1 is examined along with the same five compared methods, while performance is evaluated in terms of the COR metric only. For details on methods to which we compare, see Table 1.
A graph that shows the COR values (averaged over all sequences) obtained for each percentage P by the various methods 5 is illustrated in Fig. 4. The proposed method consistently outperforms the compared methods in all five scenarios. The second-best-performing method is SLRA and IHTLS for P in {30%, 40%, 50%} and P in {60%, 70%}, respectively. The superiority of the proposed method over the compared methods for this experiment is more evident in the cases where 30 or 40% of the frames are used for training; the discrepancy in performance achieved by the proposed method and SLRA reaches 0.117 and 0.126 in absolute COR terms, respectively. Overall, in most of the cases, a higher COR value is achieved by all methods when more data are used for training. For our method, the obtained COR values increase strictly monotonically with P, reaching COR = 0.834 at P = 70%. Fig. 4 Average correlation (COR) values plotted as a function of the training set percentage, for the conflict intensity prediction experiment on the CONFER dataset with varying training size. For details on methods to which we compare, see Table 1. Results for the proposed method (12) were obtained by using Algorithm 1 with p = q = 0.1 In Fig. 5, conflict intensity predictions, as obtained by the proposed method (( p, q) ∈ {(1, 1), (0.1, 0.1)}), HRM, IHTLS and SRPCA for a sequence of the CONFER dataset, are illustrated along with the ground truth annotations as line plots for the various training set percentages examined.
The COR values obtained are also shown in the respective sub-captions. As can be seen, the sequence in question establishes a challenging scenario, since it involves instances of both conflict escalation and resolution, either short-or long-term. One can easily notice that for all scenarios the trends of conflict intensity along the test frames are accurately predicted by the non-convex instance of the proposed method ( p = q = 0.1), while the convex model instance ( p = q = 1) yields smaller COR values in all five cases examined. The former achieves a COR value as high as 0.914 (Fig. 5f) for a total of 604 test frames when trained on just the first 30% of the sequence (260 frames). In the same scenario, IHTLS performs similarly, while other methods such as HRM and SRPCA yield COR values that lie just above or below zero, respectively. The various compared methods exhibit different patterns in performance as the amount of video frames used for training increases. For instance, IHTLS outperforms the other methods when less training data are used (30 and 40%), while SRPCA and HRM show a dramatic increase in performance at the point where 50 and 60% of the video frames are employed for training, respectively. The effectiveness of IHTLS in the scenarios involving less training data for the sequence in question is as expected. IHTLS is more likely to find a local approximation for the 'low-complexity' temporal dynamics of the first portion of the sequence that be low-rank and hence a simpler, more generalizable system than the convex, nuclear-norm based methods SRPCA and HRM, since the former searches for the desired rank iteratively starting from rank 1 (Dicle et al. 2013). Finally, as expected, the highest COR values obtained overall correspond to the highest training percentage of 70% and are similar across all methods.

Valence and Arousal Prediction
In this section, the efficiency of the proposed dynamic behavior analysis framework is validated on the problem of continuous prediction of valence and arousal based on visual features only. Motivated by advances in psychology and cognitive neuroscience (Russell 1980;Lane and Nadel 2002), focus of affective computing research has recently shifted towards continuous-time analysis of affect phenomena, represented in the dimensional space (e.g., valence, arousal, power, anticipation) rather than in terms of universal basic emotions (e.g., happiness, surprise) (Gunes and Schuller 2013;Gunes et al. 2011). Valence (how positive or negative the affect is) and Arousal (how excited or apathetic the affect is) are latent dimensions used to measure emotional experience, and are considered to encapsulate most of the affect variance (Lane and Nadel 2002).
Most of the existing automated approaches to Valence-Arousal (V-A) analysis have been limited to the use of audio cues only or have compromised to solving a two-class or fourclass classification problem, i.e., binary classification with respect to each dimension or classification into the quadrants of the 2D V-A space (Gunes and Schuller 2013). Although the relation of affective dimensions (mostly arousal) to certain acoustic features has been better documented as compared to visual cues, yet there has been evidence that also visual signals (e.g., facial expressions, head shakes, nods) are informative of the V-A dimensions (Cowie et al. 2010;Pantic and Bartlett 2007). Such findings have motivated the exploitation of visual features, such as facial expression cues and shoulder movements, in either isolation or combination with audio features, for dimensional affect analysis. Representative examples of this line of research are the works of Gunes et al. (2011), Nicolaou et al. (2012 and Kaltwang et al. (2016).
In this paper, we address continuous prediction of valence and arousal using visual features only. Motivated by evidence suggesting that valence and arousal exhibit high correlation (Pantic and Bartlett 2007), we treat them in a joint framework, that is, as outputs generated by the same LTI system.

Data
The SEMAINE database (McKeown et al. 2012), which contains audio-visual recordings of emotionally colored conversations between a human and an operator, is employed. The operator plays the role of an avatar and, depending on the choice of the latter, acts assuming one of 4 distinct personalities (happy, gloomy, angry or pragmatic). Since the goal of the operator is to elicit emotional reactions by the user, naturalistic dyadic conversations are developed, which are suitable for spontaneous affect analysis. Each video has been recorded at 50 frames per second, and has been annotated frame by frame by six raters in terms of real-valued valence and arousal ranging from −1 to 1. A subset of SEMAINE, containing 40 sequences that are at least 3000 frames (∼1 min) long from a total of 10 subjects, is used. For each sequence, the mean values of valence and arousal annotations over the six ratings are utilized as ground truth. Three sample video frames corresponding to three different users from the SEMAINE database are depicted in Fig. 6.

Features and Experimental Protocol
The Active Appearance Model-based tracker (Orozco et al. 2013), which performs simultaneous tracking of 3D head pose, lips, eyebrows, eyelids and irises in videos, is employed to extract facial features. For each frame, 113 2D characteristic facial landmarks are obtained. To ensure that only expression-related information is retained in the feature representation, we use the tracker's estimates of 3D head pose values to remove pose angles. Scale and translation effects are subsequently removed from the 226 coordinates of the pose-normalized points, according to the procedure described for the experiment in Sect. 6.2. Finally, dimensionality reduction is performed by means of PCA. Again, 98% of the total energy is retained resulting to a 12-dimensional feature vector.
For each of the 40 sequences, the framework described in Sect. 5.1 is employed for continuous valence and arousal prediction. Only the first 3000 frames are considered for each sequence. The experimental protocol is similar to that established for the conflict intensity prediction experiment. The first 2000 frames of each sequence are used for training, while the remaining 1000 frames (∼20 s) are used for V-A prediction. For this experiment, the visual feature vectors are used The bold values indicate the best performances in terms of each evaluation metric Averaged values for the resulting system order (Val. and Ar.), and execution time (time: secs per frame ×100) (Val.) are also shown for each (COR-optimized) structured rank minimization method. For details on methods to which we compare, see Table 1 as inputs and the V-A values are used as outputs. Predictive performance for both valence and arousal is assessed again by means of both COR and ICC. To facilitate the evaluation and discussion with respect to each of the affect dimensions, we choose to optimize each method separately for each dimension and performance metric. For the proposed method, only Algorithm 1 is examined in this experiment. For details on methods to which we compare, see Table 1.
The mean value over all 40 sequences of the interannotator ICCh, calculated amongst the six available ratings, was found to be ICC V h = 0.778 for valence and ICC A h = 0.893 for arousal, respectively. The higher inter-annotator reliability for arousal is expected in the case of the SEMAINE data due to the three interlinked facts: (i) the majority of SEMAINE annotated data relate to high aroused emotions, (ii) the annotators were presented with audio-visual recordings to be annotated, and (iii) the arousal is better recognized when audio modality is available Bänziger and Scherer 2010).

Results and Discussion
Valence and arousal prediction results, in terms of mean value of COR and ICC over all 40 SEMAINE sequences, are reported in Table 4 for all methods examined. For details on methods to which we compare, see Table 1. Mean values for the resulting system order and execution time (time: secs per frame ×100) are also reported. 6 As can be seen, the best performance, in terms of both metrics, is obtained by the proposed method, for both valence and arousal prediction. The second-best-performing method in terms of COR (ICC) is HRM (SLRA) for both affect dimensions. Overall, valence and arousal are predicted with similar accuracies by almost all the methods. Again, the non-convex instances of the proposed method ( p, q < 1) account for significant performance boost over convex model instances ( p, q = 1 and p = 1, q = 2), yet accompanied by an increase in model complexity and execution time. Still, in most of the cases the proposed method results in systems of lower-complexity, as compared to those accounted for by the remaining methods. Regarding execution time, the various methods achieve comparable performances, with the exception of IHTLS that is much slower for this experiment, probably due to the increased dimensions of the data Hankel matrices.
Finally, it is worth noting that the inter-annotator ICC V h for valence is exceeded by HRM, SLRA and our method, whereas no method furnishes an ICC value greater than ICC A h for arousal. This result is exactly as expected. Namely, as explained above, in the case of the utilized SEMAINE data, human annotators were presented with audio-visual (rather than visual-only) recordings when they were conducting the annotation. The presence of audio data does not affect the human performance in recognition of valence, but it does affect the recognition of arousal -arousal is better recognized when audio cues are available to humans to rely on . Hence, while automated methods like HRM and our methods are highly suitable for modeling human behavior analysis tasks at hand (i.e., valence intensity prediction), they could not learn the trends of the 'mean annotator' well enough for the case of arousal intensity prediction, because these were relying on audio data unavailable to the tested automated methods.

Tracklet Matching
In this section, the efficiency of the proposed method is evaluated on the task of tracklet matching. The goal is to identify targets in the visual stream across occlusions from a set of given detections.
Data Experiments are conducted on the recently published Similar MultiObject Tracking (SMOT) dataset (Dicle et al. 2013), which consists of 8 videos 7 showing multiple targets with identical or very similar appearance. For each video, the provided hand-labeled detections for the targets appearing in each frame are employed. Overall, the task is challenging due to the presence of multiple targets, long trajectories, object occlusions and crossings, missing data and camera motion.

Features and Experimental Protocol
We follow the tracklet matching framework proposed in Dicle et al. (2013), which is based on a Generalized Linear Assignment (GLA) Problem. Thus, given N tracklets (trajectories of system outputs) where K is an adjacency matrix, with k i j = 1 denoting that Y (i) is the predecessor of Y ( j) , and P is a similarity matrix given by and Y ( j) conflict rank(H(Y (i) ))+rank(H(Y ( j) )) minȲ j i rank(H(Y (i j) )) − 1 otherwise, with Y (i j) = [Y (i)ȳ j i Y ( j) ] being the joint tracklet of detections, padded with zeros at the entries of the trackletȲ j i of missing data. Hence, the critical point of the aforementioned algorithm is the solution of the low-rank Hankel matrix completion problem minȳ j i rank(H(Y (i j) )) in (34). This is solved according to the framework described in Sect. 5.2, in which the underlying LTI system is assumed to be autonomous and the data Hankel matrices are composed of the respective outputs (2D tracking point coordinates).
Two experimental scenarios are considered, similarly to (Dicle et al. 2013). In the first experiment, false positives are increased by injecting uniformly distributed false detections with percentage varying as [0%, 10%, . . . , 50%]. In the second scenario, false negatives are increased by removing, again uniformly, true detections with percentage varying as [0%, 6%, . . . , 30%]. For each scenario, the experiment is repeated 10 times for each noise level, and the average performance over the 60 runs is reported. The same five methods used for comparison in the previous experiments are examined. For details on methods to which we compare, see Table 1. For the proposed method, Algorithm 1 is used, with the weight matrix W in (12) formed by setting its entries corresponding to the 'missing' trackletẎ j i to zeros and all remaining entries to ones. Various values are examined for the parameters, that is, ( p, q) ∈ {(1, 2), (0.5, 2), (0.1, 2)} and λ ∈ {10 −6 , 5 · 10 −6 , 10 −5 , . . . , 10 3 }, for each video and noise level. The convergence parameters 1 and 2 in Algorithm 1 are set to 10 −7 . For all methods examined, a Frobenius-norm based fitting measure is adopted (q = 2 for the proposed method). This experimental choice was motivated by preliminary experiments, in which it was observed that the use of sparsity promoting norms for approximation error resulted in trivial solutions when a large amount of missing data was involved.
For evaluation, the MOTA measure (Bernardin and Stiefelhagen 2008) is used, which is given by where f n t , f p t , mm t and g t denote the false positives, false negatives, mismatches and ground truth detections for frame t, respectively.

Results and Discussion
Tracklet matching results in terms of the MOTA measure-averaged over all 8 videos, noise levels and experiment runs-are reported for each scenario in Table 5. For details on methods to which we compare, see Table 1. Run time performance (time: secs per frame) The bold values indicate the best performances in terms of each evaluation metric For each noise type, the results are averaged over 6 noise levels, with each of the latter examined 10 times. Average execution time (time: secs per frame) accounted for by each structured rank minimization method is also shown. For details on methods to which we compare, see Table 1  of each respective algorithm, averaged similarly, is also reported. Overall, performance varies less amongst different methods for the false positives case, as compared to the false negatives case. This can be partially ascribed to the former case corresponding to a less demanding task of tracklet matching, since it involves a smaller amount of missing data. The proposed method performs similarly to IHTLS in terms of MOTA for both experimental scenarios, with the difference in performance for all 8 videos calculated as not statistically significant according to a paired t-test at significance level α = 0.05. All remaining methods achieve lower scores. The computational efficiency of the proposed method ( p = 0.1, q = 2) is comparable to that accounted for by the best-performing amongst the compared methods, for both scenarios. Similarly to the previous experiments, the convex instance of our method ( p = 1, q = 2) corresponds to a smaller execution time than that of the non-convex instances, albeit to a poorer performance. Results in terms of MissMatch Ratio MMR = t (mm t ) have been merged accurately in this challenging scenario that involves a heavily occluded surveillance scene. It is also worth noting that trajectory 22 (shown in red box) has been accurately 'completed' for frames 127 and 140 ( Fig. 8b and 8c, resp.), despite the intense occlusion occurring at frame 127.

Conclusions
A framework for dynamic behavior analysis in real-world conditions was developed in this paper. Specifically, the presented framework essentially employs a novel structured rank minimization method to learn a low-complexity system from time-varying data, in the presence of gross sparse noise and possibly missing data. By resorting to the ADMM, an efficient algorithm for the proposed structured rank minimization model along with a scalable version have been developed. Regarding applications, focus was placed on vision-based conflict intensity prediction, valence and arousal prediction, and tracklet matching. Extensive experiments on real-world data drawn from these application domains demonstrate the robustness and the effectiveness of the proposed framework.