1 Introduction

Video activity recognition is a trending topic and yet a challenging problem in the field of computer vision. The capability of automatically recognizing human activities is a key functionality of ambient intelligence. It has a wide range of applications, from surveillance in public and restricted areas, traffic safety, and sports analysis to assisted living and healthcare and many other social aspects. Among them, assisted living and healthcare have drawn increasing attention due to population ageing and the noticeable trend of independent living.

In this paper, we mainly focus on two aspects. First, we consider human activities that are typical in the context of assisted living and healthcare, where only several essential daily activities are handled, instead of a large number of general activities. More specifically, activities of interest in our case include activities of daily living (ADL) such as eating and drinking and anomalies like falling down. The purpose of studying ADL is to learn the daily routines of individuals and to generate dedicated recommendations for healthy living. As for anomalies, the aim is to trigger alarms when an emergency occurs.

Secondly, we focus on video data and propose a video-based method for activity classification. Before that, we briefly review some existing and representative work on video activity recognition in the past few years. For example, Lin et al. modeled the entire scene as an error-free network, where each node corresponds to a patch of the scene and each edge represents the activity correlation between the corresponding patches. Based on this network, people are modeled as packages and human activities are modeled as the process of package transmission [1]. Everts et al. recognized human actions in videos based on color spatio-temporal interest points (STIPs) [2] that are multichannel reformulations of STIP detectors and descriptors [3]. Zhang and Piccardi employed structural SVM for activity classification using spatio-temporal SIFT-based VLAD (vector of locally aggregated descriptors) features [4]. Amer and Todorovic conducted activity recognition by representing activities using a sum-product network (SPN) [5]. Zhang and Parker introduced color-depth local spatio-temporal features for activity recognition based on orientation histograms in xyzt dimensions, where the histograms are built around interest points as local maxima of independent filters applied to different dimensions [6]. Recent years have also witnessed a significant advancement in various machine learning tasks using deep learning. Deep neural networks such as convolutional neural network (CNN) [7] and recurrent neural network (RNN) [8] have become common choices for image and video analysis, including the representation of video activities. For example, Baccouche et al. extended a CNN to 3D for learning spatio-temporal features and then trained an RNN to classify each sequence that contains human actions [9]. These methods mainly investigated spatio-temporal relations of human motions, where some promising results were reported.
However, interacting objects, which are an important part of many activities involving human-object interaction, received less attention. Further, activities as dynamic processes involving non-planar movement of the human body lie on a nonlinear manifold rather than in a vector space. This manifold nature was also under-explored.

In view of the issues mentioned above, we propose a novel method that jointly represents structural features for body pose and appearance features for interacting objects as a unified data point on a Riemannian manifold. By learning bag-of-words (BoW) features from this Riemannian manifold, we treat each video activity as a temporal sequence of manifold points (BoW features) on another Riemannian manifold. Then, we classify such time series with a kernel based on dynamic time warping (DTW) and geodesic distances.

More specifically, the main contributions of this paper are as follows: (a) we use a unified covariance matrix to represent both structural and appearance features in each frame. These two different types of features correspond to body pose and human-object interaction, respectively. In this way, we obtain low-level features of each video as a temporal sequence of points on the Riemannian manifold of SPD matrices; (b) we build a BoW+T model on another Riemannian manifold, i.e., the unit n-sphere. The codebook of this model is learned by clustering per-frame covariance matrices from all videos in the training set. Considering the manifold structure of covariance matrices, geodesic distances and intrinsic means are used for the clustering; (c) we extract high-level features from the BoW+T model for each video as the final feature descriptor. It can be seen as a time series of points on the unit n-sphere; (d) we formulate a positive definite kernel for activity classification using BoW+T features. This kernel is based on DTW and geodesic distances on the unit n-sphere.

The remainder of this paper is organized as follows: Section 2 briefly reviews the related work and theory. Section 3 gives an overview of the proposed method and then describes the major steps in detail. Section 4 shows experimental results on a video dataset containing activities from 8 classes. Finally, Section 5 concludes the paper.

2 Background information

In this paper, Riemannian manifolds are employed for feature representation of video activities. Therefore, in this section, we briefly review some theory and existing methods on Riemannian manifolds that are closely related to the proposed method, for the sake of mathematical and conceptual convenience in subsequent sections.

2.1 Riemannian geometry

Generally speaking, a manifold can be considered as a low-dimensional embedding in a high-dimensional space [10]. It represents the original data efficiently with lower dimensionality and still maintains key properties of the original data, such as topology and geometry. Manifolds are nonlinear structures that are not vector spaces; hence, Euclidean calculus does not apply. A Riemannian manifold is smooth and differentiable [10], where a set of metrics can be defined. In the tangent space of manifold points on a Riemannian manifold, linear operations may be performed.

2.1.1 The space of symmetric positive definite matrices

Mathematically, the space of d×d symmetric positive definite (SPD) matrices (\(Sym_{+}^{d}\)) is defined as

$$ Sym_{+}^{d} = \bigcap_{\mathbf{x} \in \mathbb{R}^{d} \setminus \{\mathbf{0}\}} \left\{\mathbf{P} \in Sym^{d}: \mathbf{x}^{T} \mathbf{P} \mathbf{x} > 0\right\}, $$
(1)

which is an open convex cone, whose strict interior is a Riemannian manifold [10]. Two different metrics are commonly used to compute statistics on \(Sym_{+}^{d}\) (Fig. 1), namely, the affine-invariant metric [11] and the log-Euclidean metric [12]. The log-Euclidean metric is used in this paper, as it has a closed-form solution and is computationally more efficient than the affine-invariant metric [12]. Hence, below we only show equations under the log-Euclidean metric.

Fig. 1 Example of \(Sym_{+}^{d}\) (d = 2) embedded in a 3D space \(\mathbb {R}^{3}\). O is the origin. P and Q are the manifold points, i.e., \(\mathbf {P}, \mathbf {Q} \in Sym_{+}^{d}\). \(\mathcal {T}_{\mathbf {P}}\) is the tangent space at P. \(\mathbf {\Delta } \in \mathcal {T}_{\mathbf {P}}\) is the tangent vector whose projected point on the manifold is Q. The geodesic ρ is the shortest curve between P and Q on the manifold

Two mapping functions, the exponential map and the logarithmic map, are usually defined to switch between the manifold and the tangent space at a given point. Under the log-Euclidean metric, the exponential map (\(\exp _{\mathbf {P}}(\cdot): \mathcal {T}_{\mathbf {P}} \mapsto Sym_{+}^{d}\)) and the logarithmic map (\(\log _{\mathbf {P}}(\cdot): Sym_{+}^{d} \mapsto \mathcal {T}_{\mathbf {P}}\)) are defined as [13]:

$$\begin{array}{*{20}l} \exp_{\mathbf{P}}(\mathbf{\Delta}) & = \exp(\log(\mathbf{P}) + \mathbf{\Delta}) = \mathbf{Q}; \end{array} $$
(2)
$$\begin{array}{*{20}l} \log_{\mathbf{P}}(\mathbf{Q}) & = \log(\mathbf{Q}) - \log(\mathbf{P}) = \mathbf{\Delta}, \end{array} $$
(3)

where \(\mathcal {T}_{\mathbf {P}}\) is the tangent space at a manifold point P, \(\mathbf {\Delta } \in \mathcal {T}_{\mathbf {P}}\) is the tangent vector whose projected point on the manifold is Q, exp(·) is the matrix exponential, and log(·) is the principal logarithm of a matrix defined as the inverse of the matrix exponential [12].

The geodesic is the shortest curve between two points on a manifold [14]. The geodesic distance, the length of the geodesic, is used to measure the distance between two manifold points. Under the log-Euclidean metric, the geodesic distance between P and Q on the manifold \(Sym_{+}^{d}\) is computed by [13]

$$ \rho(\mathbf{P},\mathbf{Q}) = \|\log_{\mathbf{P}}(\mathbf{Q})\| = \|\log(\mathbf{Q}) - \log(\mathbf{P})\|, $$
(4)

where ∥·∥ is the Frobenius norm.
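The maps in (2) and (3) and the distance in (4) can be illustrated with a minimal NumPy sketch. It assumes the inputs are well-conditioned SPD matrices, so that the matrix logarithm and exponential can be computed through a symmetric eigendecomposition. The function names (`spd_log`, `sym_exp`, `log_map`, `exp_map`, `geodesic_distance`) and the two example matrices are illustrative choices, not part of the original method description.

```python
import numpy as np

def spd_log(P):
    """Principal logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.T

def sym_exp(S):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def log_map(P, Q):
    """Eq. (3): tangent vector at P that points to Q."""
    return spd_log(Q) - spd_log(P)

def exp_map(P, Delta):
    """Eq. (2): project the tangent vector Delta at P back to the manifold."""
    return sym_exp(spd_log(P) + Delta)

def geodesic_distance(P, Q):
    """Eq. (4): Frobenius norm of the log-Euclidean tangent vector."""
    return np.linalg.norm(log_map(P, Q), ord="fro")

# Hypothetical 2x2 SPD matrices used only for illustration.
P = np.array([[2.0, 0.5], [0.5, 1.0]])
Q = np.array([[1.0, 0.2], [0.2, 3.0]])
Delta = log_map(P, Q)
Q_back = exp_map(P, Delta)  # should recover Q up to rounding
```

In this sketch, applying `exp_map` to the output of `log_map` returns the original point, which mirrors the statement that Δ is the tangent vector whose projected point on the manifold is Q.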

The Riemannian geometry of \(Sym_{+}^{d}\) can be exploited when the extracted feature descriptors are covariance matrices, e.g., region covariance [15], since the SPD cone is exactly the set of non-singular covariance matrices [16].

2.1.2 The unit n-sphere

The unit n-sphere, \(\mathcal {S}^{n}\), is an n-dimensional sphere with a unit radius, centered at the origin of (n + 1)-dimensional Euclidean space. An intuitive example would be a unit circle (n = 1) in 2-D space, or a 2-D unit sphere (n = 2) in 3-D space. Mathematically, it is defined by

$$ \mathcal{S}^{n} = \{\mathbf{p} \in \mathbb{R}^{n+1}: \|\mathbf{p}\| = 1\} $$
(5)

which can be considered as the simplest Riemannian manifold after the Euclidean space [17]. The geodesic distance between two manifold points p, q on \(\mathcal {S}^{n}\) is the great-circle distance (Fig. 2):

$$ \rho(\mathbf{p},\mathbf{q}) = \arccos(\mathbf{p}^{T} \mathbf{q}) $$
(6)
where arccos(·):[−1,1]→[0,π] is the inverse cosine function [18]. The great-circle distance between two manifold points is unique.

Fig. 2 Example of an n-sphere \(\mathcal {S}^{n}\) (n = 2) embedded in an (n + 1)-D space \(\mathbb {R}^{n~+~1}\). p and q are manifold points, i.e., \(\mathbf {p}, \mathbf {q} \in \mathcal {S}^{n}\). The geodesic ρ is the shortest curve between p and q on the manifold

The Riemannian geometry of \(\mathcal {S}^{n}\) can be utilized when the extracted feature vectors are normalized by the \(\ell_2\) norm, e.g., SIFT [19], HOG [20], LBP [21].
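As a small illustration of (6), the sketch below normalizes two vectors to unit length and computes their great-circle distance in plain Python. The helper names `l2_normalize` and `sphere_distance` are hypothetical, and the clamping of the dot product to [−1, 1] is an added numerical safeguard against rounding, not something stated in the text.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length so that it lies on the sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]

def sphere_distance(p, q):
    """Eq. (6): great-circle distance between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.acos(dot)

p = l2_normalize([1.0, 0.0, 0.0])
q = l2_normalize([0.0, 2.0, 0.0])
d = sphere_distance(p, q)  # orthogonal directions are a quarter circle apart
```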

2.2 Bag-of-words model

The bag-of-words (BoW) model was originally used in document classification, where each document is considered as a bag of words and is represented as a vector of occurrence counts of words (a histogram over the vocabulary). This model has also been applied to image classification [22], treating each image as a document (a bag of visual words). The BoW representation of an image is obtained by first clustering a set of selected local image descriptors such as SIFT (usually with k-means clustering) to generate a visual vocabulary (or, codebook), followed by extracting a histogram by assigning each descriptor to its closest visual word.

The learning and recognition based on the BoW model can be roughly divided into two categories, namely, generative and discriminative models. Generative models estimate the probability of BoW features given a class, including Naïve Bayes classifier, and hierarchical Bayesian models such as probabilistic latent semantic analysis (pLSA) and latent Dirichlet allocation (LDA). Discriminative models learn a decision rule (classifier) to assign BoW representation of images to different classes, including nearest-neighbor classifier, SVM, AdaBoost, and kernel methods such as pyramid match kernel.

Since the BoW model is an orderless representation that counts frequencies of visual words from a dictionary, efforts have been made to incorporate spatial information into the model. For example, one can compute BoW features from sub-windows of the entire image, or based on part-based models [23]. Also, spatial pyramid representation is an extension of BoW features that gives locally orderless representation at several levels of resolution [24]. Moreover, the BoW model has been extended to encode higher-order statistics of the difference between visual words and pooled local features, such as Fisher Vectors (FV) [25] or vector of locally aggregated descriptors (VLAD) [26]. In this paper, we use the very basic BoW model rather than its extensions, since we mainly focus on (i) building the model on a manifold and (ii) adding temporal information into this model. Hence, a baseline approach suffices for our purpose as a proof of concept.

2.3 Distances and kernels for time series

A time series is an ordered finite set (a sequence) of data points, typically consisting of measurements observed successively over a time interval. Mathematically, it is defined as

$$ Z = \{\mathbf{x}_{t}\}_{t=1}^{n} = \{\mathbf{x}_{1}, \mathbf{x}_{2}, \cdots, \mathbf{x}_{n}\} $$
(7)

where \(\mathbf{x}_{t}\) is the data point at time t and n is the total number of data points.

For time series classification, we first need to define a distance function \(d(Z_{i}, Z_{j})\) that measures the difference between each pair of time series \(Z_{i}\) and \(Z_{j}\). Then, a kernel function \(K(d(Z_{i}, Z_{j}))\) can be constructed, as a function of the distance function, to measure the similarity between each pair of time series \(Z_{i}\) and \(Z_{j}\).

Some commonly used distance functions include dynamic time warping (DTW) [27], edit distance with real penalty (ERP) [28], and time warp edit distance (TWED) [29]. Kernel functions based on these distance measures have often been found to perform well in practice. However, they are not strictly positive definite, since DTW, ERP, and TWED in general are not positive definite [30].

Positive definiteness is a preferable property for kernel functions. It ensures that the optimization problem is convex and the solution is unique [31]. To this end, some positive definite kernels for time series classification have been suggested, e.g., global alignment (GA) kernels [32] and recursive edit distance kernels (REDK) [30], which are shown to outperform indefinite kernels in general. In this paper, we propose to use a special type of REDK kernel in combination with a geodesic distance function, together with a proof of its positive definiteness.
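To make the notion of a time series distance concrete, the following is a minimal sketch of the classic (unregularized) DTW recursion with a user-supplied local cost. It is the textbook dynamic program, not the regularized kernel of [30], and the names `dtw_distance` and `abs_cost` as well as the toy sequences are illustrative assumptions.

```python
def dtw_distance(a, b, local_cost):
    """Classic DTW: minimum accumulated local cost over monotone alignments."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] holds the best cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_cost(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def abs_cost(x, y):
    """Toy local cost for scalar sequences."""
    return abs(x - y)

# The second sequence repeats one sample, which warping absorbs at zero cost.
d_same = dtw_distance([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0], abs_cost)
d_diff = dtw_distance([0.0, 1.0, 2.0], [0.0, 3.0, 2.0], abs_cost)
```

Because the local cost is passed in as a function, the same recursion could in principle be used with a geodesic distance between manifold points instead of the scalar difference used here.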

3 Proposed method for activity classification

This section first gives an overview of the proposed method and then describes each important step of the method in detail.

3.1 Overview of the proposed method

The proposed method can be summarized into three major steps, as depicted in Fig. 3. First of all, for each frame of a video activity, a unified covariance matrix is formed to jointly represent structural features of body pose and appearance features of interacting objects at hands. This covariance matrix can be viewed as a manifold point in the space of SPD matrices. Thus, a video activity is initially represented as a time sequence of covariance descriptors on the Riemannian manifold of SPD matrices. Then, a BoW model is learned by clustering the set of all covariance matrices from training video activities. For each video activity, time-dependent BoW features are extracted based on the learned BoW model. More specifically, a video activity is eventually characterized by a time series of BoW features. This time series can be seen as a trajectory on a unit n-sphere. Finally, a positive definite kernel is formulated based on DTW and geodesic distances on the unit n-sphere for activity classification, using these time-dependent BoW features.

Fig. 3 Illustration of major steps in the proposed method. Notations and notes: \(I_{t}\) is the t-th frame of an input video, and L is the total number of frames; the markers are key points (head, hands, waist center, midpoint of feet), and the areas with dotted edges are local patches centered at hands; C is the frame-based covariance feature (as a point on the manifold of SPD matrices \(Sym_{+}^{d}\)) extracted from local patches and key points in \(I_{t}\); the codebook for the BoW+T model is generated by clustering covariance matrices on \(Sym_{+}^{d}\); the video is encoded by the BoW+T model as a time series of manifold points on a unit n-sphere \(\mathcal {S}^{n}\) and then classified by a kernel machine based on geodesic distance on that sphere

3.2 Covariance descriptor for combining local appearance and global pose features

We adopt a part-based approach for feature extraction of a target person in each image frame, where the positions of left/right hand, head, feet, and torso axes of the person are required. The basic idea is to extract both appearance and structural features from body parts. The former may give important cues for local human-object interaction, while the latter can provide information on the global body pose and motion. These body parts can be detected by a Kinect sensor with skeleton tracking [33], or by existing toolboxes for pose estimation [34, 35]. The reason for detecting hand points is that interacting objects, as useful cues for activity recognition, are likely to appear in the vicinity of the human hands. It may be argued that hand regions are less important for activities without human-object interaction (e.g., falling down, lying down, walking, sitting down). However, they still provide useful information on arm movement relative to other body parts and serve as a discriminative feature between activities with and without human-object interaction. It is also beneficial to detect the head, feet, and torso axes, as they may provide structural information about the body pose of the person.

For each image frame of a video activity containing a certain class of activity performed by a single person, a pair of 2-D hand points \(\{\mathbf{p}_{i}\}\) is detected as \(\mathbf{p}_{i} = (x_{i}, y_{i})^{T}\), where i = 1,2 is the hand index. For either hand point, a local image patch \(R_{i}\) of size l × l centered at \(\mathbf{p}_{i}\) is obtained. For the j-th pixel in \(R_{i}\), a feature vector \(\mathbf{f}_{i,j}\) is formed by concatenating the following two component vectors:

a) Appearance feature vector [15, 16]

$${} \mathbf{f}_{i,j}^{\,a}= \left[r,g,b,|I_{x}|,|I_{y}|,|I_{xx}|,|I_{yy}|,\sqrt{I_{x}^{2}+I_{y}^{2}},\arctan\left(\frac{I_{y}}{I_{x}}\right)\right]^{T} $$
(8)

where r, g, and b are the RGB values of the pixel, \(|I_{x}|\), \(|I_{y}|\), \(|I_{xx}|\), and \(|I_{yy}|\) are the magnitudes of the first and second derivatives along the x and y directions, and \(\sqrt {I_{x}^{2}+I_{y}^{2}}\) and \(\arctan \left (\frac {I_{y}}{I_{x}}\right)\) are the gradient magnitude and orientation, respectively.

b) Structural feature vector

$$ \mathbf{f}_{i,j}^{\,s} = \left[x,y,\mathbf{d}_{1}^{T},\mathbf{d}_{2}^{T},\mathbf{d}_{3}^{T},d_{4}\right]^{T} $$
(9)

where \((x,y)^{T}\) is the pixel coordinate in \(R_{i}\), and \(\mathbf{d}_{1}\), \(\mathbf{d}_{2}\), \(\mathbf{d}_{3}\), and \(d_{4}\) are the distances from the pixel to the head point \(\mathbf{p}_{a} = (x_{a}, y_{a})^{T}\), the other hand point \(\mathbf{p}_{k}\) (\(k \neq i\), k = 1,2), the midpoint \(\mathbf{p}_{b} = (x_{b}, y_{b})^{T}\) of the two feet, and the torso axis, respectively. It is worth mentioning that (i) all these distances are normalized by the length of the torso axis L; (ii) \(\mathbf{d}_{1}\), \(\mathbf{d}_{2}\), and \(\mathbf{d}_{3}\) are 2-D vectors that contain the distances in the x and y directions; and (iii) \(d_{4}\) is a scalar, i.e., the distance from a point (the pixel) to a line (the torso axis).

Thus, the feature vector \(\mathbf{f}_{i,j}\) for the j-th pixel in the i-th local patch \(R_{i}\) related to the left (or right) hand is defined as

$$ \mathbf{f}_{i,j} = \boldsymbol{\Omega}\left[\left(\mathbf{f}_{i,j}^{\,a}\right)^{T},~\left(\mathbf{f}_{i,j}^{\,s}\right)^{T}\right]^{T} $$
(10)

where \(\mathbf {f}_{i,j}^{a}\) and \(\mathbf {f}_{i,j}^{s}\) are feature vectors in (8) and (9) encoding local appearance of the interacting object and global pose of the target person, respectively, and Ω > 0 is an empirically determined diagonal matrix that adjusts the weight of features.

The local patch \(R_{i}\) for the i-th hand is represented by an r × r covariance matrix as

$$ \mathbf{C}_{i} = \frac{1}{|R_{i}|-1} \sum\limits_{j=1}^{|R_{i}|} \tilde{\mathbf{f}}_{i,j} \tilde{\mathbf{f}}_{i,j}^{T} \quad \in Sym^{+}_{r} $$
(11)

where \(|R_{i}|\) is the total number of pixels in patch region \(R_{i}\) and \(\tilde {\mathbf {f}}_{i,j}\) is the mean-subtracted feature vector.

Finally, assuming image patches at two hands are statistically independent, a covariance matrix of d × d (d = 2r) is formed for each frame by using the local patch-based descriptors as follows:

$$ \mathbf{C} = \left[ \begin{array}{c c} \mathbf{C}_{i^{*}} & \mathbf{0} \\ \mathbf{0} & \mathbf{C}_{n} \end{array} \right] \quad \in Sym^{+}_{d} $$
(12)

where \(\mathbf {C}_{i^{*}}\) and \(\mathbf{C}_{n}\) are computed from (11), with indices \(i^{*} = \arg\min_{i \in \{1,2\}} \|\mathbf{p}_{a} - \mathbf{p}_{i}\|\) and \(n \neq i^{*}\), \(n \in \{1,2\}\); that is, the hand closer to the head always occupies the upper-left block. Since the covariance matrix \(\mathbf {C} \in Sym_{+}^{d}\), it may be viewed as a point on a Riemannian manifold [16].

In this way, for each image frame of the video activity, the local appearance information of two hand regions which may potentially contain interacting objects and the global posture information as hand positions with respect to the head, feet, and torso axes are encoded into a unified covariance matrix, disregarding whether the person is left-handed or right-handed.
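The construction in (10) to (12) can be sketched in NumPy as follows, assuming the per-pixel feature vectors of each hand patch are already available as the rows of an array. The function names, the synthetic random features, the 15 × 15 patch size, the uniform weights, and the example key-point coordinates are all hypothetical placeholders; only the formulas themselves come from the text.

```python
import numpy as np

def patch_covariance(features, weights):
    """Eqs. (10)-(11): weighted per-pixel features -> r x r covariance matrix.

    features: (num_pixels, r) array, one feature vector per pixel.
    weights:  (r,) array holding the diagonal of Omega.
    """
    F = features * weights             # apply the diagonal weighting Omega
    F_centered = F - F.mean(axis=0)    # mean-subtracted feature vectors
    return F_centered.T @ F_centered / (F.shape[0] - 1)

def order_hands(head, hands):
    """Return (i_star, n): i_star indexes the hand closest to the head."""
    dists = [np.linalg.norm(np.asarray(head) - np.asarray(h)) for h in hands]
    i_star = int(np.argmin(dists))
    return i_star, 1 - i_star

def frame_descriptor(C_near, C_far):
    """Eq. (12): block-diagonal d x d matrix with d = 2r."""
    r = C_near.shape[0]
    C = np.zeros((2 * r, 2 * r))
    C[:r, :r] = C_near
    C[r:, r:] = C_far
    return C

# Synthetic stand-in data: two 15 x 15 patches with r = 18 features per pixel.
rng = np.random.default_rng(0)
r = 18
weights = np.ones(r)
feats = [rng.normal(size=(15 * 15, r)) for _ in range(2)]
covs = [patch_covariance(f, weights) for f in feats]
i_star, n = order_hands(head=(100, 40), hands=[(90, 200), (120, 60)])
C = frame_descriptor(covs[i_star], covs[n])
```

For real image patches, some features may be constant within a patch, in which case the sample covariance becomes singular. Adding a small multiple of the identity is a common remedy, although the text does not say whether such a step is used.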

3.3 Temporal BoW model on the Riemannian manifold of SPD matrices

We employ the bag-of-words model for representing activities in videos. Since each image frame of a video activity is represented by a covariance descriptor (see Section 3.2), the most straightforward way would be to directly treat the video activity as a bag of covariance descriptors. However, temporal information, an important cue for activity recognition, would then be neglected, which may lead to inferior results. Instead, in our case, each video activity is treated as a temporal sequence (time series) of bags of covariance descriptors. Further, compared with representing the video activity as a time series of covariance matrices, the BoW model is more efficient and has been shown to be effective in many classification tasks. We refer to this temporal BoW model on the Riemannian manifold as the Riemannian BoW+T model.

The motivations for exploiting Riemannian manifolds in feature representation are threefold: first, the nonlinear nature of manifolds enables effective description of dynamic processes of human activities involving non-planar movement, which lie on a nonlinear manifold rather than in a vector space; secondly, many video features of human activities may be effectively described by low-dimensional data points on the Riemannian manifold while still maintaining important properties of human activities such as topology and geometry; thirdly, the Riemannian geometry provides a way to measure the distances between different activities on the nonlinear manifold and hence is a suitable tool for classification.

Given a set of covariance descriptors (manifold points) \(\mathcal {X}~=~\{\mathbf {X}_{i}\}_{i=1}^{M}\), \(\mathbf {X}_{i} \in Sym^{+}_{d}\), extracted and collected from a training set of video activities, we aim to learn a codebook (or, a dictionary) for our BoW model. In the simplest case, one can ignore the Riemannian geometry of SPD matrices and learn a codebook straight from the vectorized form of these matrices. That is, Euclidean geometry is applied and the arithmetic mean is used for computing the clusters. Despite its simplicity, this method often yields undesirable outcomes due to the swelling effect [12]. Hence, the underlying Riemannian geometry should be taken into account for creating the codebook without the swelling effect. One common alternative is to first project the set of manifold points to a global tangent space at a particular point on the manifold and then apply Euclidean tools for clustering. However, mapping data to a tangent space only produces a first-order approximation of the data that can be distorted, especially in regions far from the origin of the tangent space. Therefore, we propose to use the intrinsic mean for obtaining the codebook, by extending k-means clustering to the case of Riemannian manifolds with the Karcher mean (also known as the Fréchet or Riemannian mean).

In this case, we aim to partition the set of M manifold points into k (\(k \leq M\)) subsets (or, clusters) \(\mathcal {C} = \{\mathcal {C}_{1}, \mathcal {C}_{2}, \cdots, \mathcal {C}_{k}\}\) by minimizing the sum of squared geodesic distances of each manifold point in the cluster to its center. The objective is to seek:

$$ \arg\min_{\mathcal{C}} \sum_{j=1}^{k} \sum_{\mathbf{X}_{i} \in \mathcal{C}_{j}} \rho^{2}(\boldsymbol{\mu}_{j}, \mathbf{X}_{i}) $$
(13)

where ρ(·,·) is the geodesic distance defined in (4) and \(\boldsymbol {\mu }_{j} \in Sym^{+}_{d}\) is the Karcher mean of points in the j-th cluster, which is found by

$$ \arg\min_{\boldsymbol{\mu}_{j}} \sum_{\mathbf{X}_{i} \in \mathcal{C}_{j}} w_{i} \rho^{2}(\boldsymbol{\mu}_{j},\mathbf{X}_{i}) $$
(14)

where \(w_{i} \in \mathbb {R}\) is the weight for the i-th point that is inversely proportional to the distance from the point to its cluster center. The minimization problem in (14) can be solved by iteratively mapping from manifold to tangent spaces and vice versa until convergence [13]:

$$ \boldsymbol{\mu}_{j}^{m+1} = \exp_{\boldsymbol{\mu}_{j}^{m}} \left(\sum_{\mathbf{X}_{i} \in \mathcal{C}_{j}} w_{i} \log_{\boldsymbol{\mu}_{j}^{m}} (\mathbf{X}_{i}) / \sum_{\mathbf{X}_{i} \in \mathcal{C}_{j}} w_{i} \right) $$
(15)

where \(\exp_{\cdot}(\cdot)\) and \(\log_{\cdot}(\cdot)\) are the pair of exponential and logarithmic mapping functions defined in (2) and (3) under the log-Euclidean metric and m is the index of the current iteration. Although one may argue that the iterative approach in (15) is computationally expensive, it shows superior performance compared with extrinsic methods in our experiment.
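A minimal sketch of the update in (15) is given below, using the log-Euclidean maps from (2) and (3). It assumes that the weights are supplied by the caller and held fixed during the iteration, since the text does not specify how they are refreshed; the function name `karcher_mean`, the stopping tolerance, and the example matrices are likewise illustrative assumptions.

```python
import numpy as np

def spd_log(P):
    """Principal logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(P)
    return (V * np.log(w)) @ V.T

def sym_exp(S):
    """Matrix exponential of a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def karcher_mean(mats, weights=None, max_iter=50, tol=1e-10):
    """Iterate Eq. (15) with fixed weights until the tangent step vanishes."""
    weights = np.ones(len(mats)) if weights is None else np.asarray(weights, float)
    logs = [spd_log(X) for X in mats]
    mu = mats[0]  # assumption: start from the first point of the cluster
    for _ in range(max_iter):
        log_mu = spd_log(mu)
        # Weighted average of the tangent vectors log_mu(X_i) = log(X_i) - log(mu).
        step = sum(w * (L - log_mu) for w, L in zip(weights, logs)) / weights.sum()
        mu = sym_exp(log_mu + step)  # map the averaged tangent vector back, Eq. (2)
        if np.linalg.norm(step) < tol:
            break
    return mu

# Hypothetical cluster of three 2x2 SPD matrices.
mats = [
    np.array([[2.0, 0.3], [0.3, 1.0]]),
    np.array([[1.0, -0.2], [-0.2, 1.5]]),
    np.array([[3.0, 0.1], [0.1, 0.8]]),
]
mu = karcher_mean(mats)
```

With fixed weights and the maps in (2) and (3), the update reduces to the exponential of the weighted average of the matrix logarithms, so this sketch settles after the first step. Weights that depend on the current center, as described for (14), would need to be recomputed between iterations.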

Given a codebook of covariance descriptors \(\{\boldsymbol {\mu }_{j}\}_{j=1}^{k}\) that is learned from (13), each video activity as a time series of covariance matrices \(\{\mathbf {C}_{t}\}_{t=1}^{L}\), where \(\mathbf {C}_{t} \in Sym^{+}_{d}\) and L is the length of the video activity (number of frames) can be encoded by the Riemannian BoW+T model as a time series of bags of covariance descriptors as follows.

  1. Assign each covariance descriptor to its closest vocabulary word in the dictionary according to the geodesic distance in (4):

    $$ v_{t} = \arg\min_{j \in \{1,2,\cdots,k\}} \rho(\boldsymbol{\mu}_{j}, \mathbf{C}_{t}) $$
    (16)

  2. Temporally divide the video activity into N (fixed) segments \(\{\mathcal {Z}_{i}\}_{i=1}^{N}\), each of length ⌊L/N⌋. If L < N, then the segment length is chosen as 1, and the number of segments becomes L.

  3. For each segment, generate a histogram \(\mathbf{h}_{i}\) with k bins, and set the j-th bin to

    $$ c_{ij} = \sum_{t: \mathbf{C}_{t} \in \mathcal{Z}_{i}} \mathbb{I}\left[v_{t} = j\right] $$
    (17)

    where \(c_{ij}\) denotes the count of covariance descriptors that belong to the i-th segment \(\mathcal {Z}_{i}\) and are assigned to the j-th codeword (cluster), and \(\mathbb {I}[A]\) is an indicator function which equals 1 if event A is true, and 0 otherwise.

  4. Normalize each histogram \(\mathbf{h}_{i}\) by its \(\ell_2\) norm.

In this way, each video activity is represented as a time series of histograms \(\{\mathbf{h}_{1}, \mathbf{h}_{2}, \cdots, \mathbf{h}_{N}\}\) over the vocabulary learned from (13), where each histogram \(\mathbf{h}_{i}\) is a BoW feature vector based on covariance descriptors. Since all histograms are \(\ell_2\)-normalized, the Riemannian geometry of \(\mathcal {S}^{n}\) can be utilized, and video activities can also be viewed as temporal sequences of manifold points on a unit n-sphere \(\mathcal {S}^{n}\) (with n = k − 1 for k-bin histograms).
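Steps 2 to 4 of the encoding can be sketched in plain Python as below, taking the codeword indices from step 1 as input. The function name `bow_t_encode` and the toy index sequence are hypothetical, and the sketch assumes that frames left over when L is not a multiple of N are simply dropped, which the text does not specify.

```python
import math

def bow_t_encode(word_ids, k, num_segments):
    """Turn per-frame codeword indices into L2-normalized segment histograms."""
    L = len(word_ids)
    if L < num_segments:
        seg_len, n_seg = 1, L              # one frame per segment, L segments
    else:
        seg_len, n_seg = L // num_segments, num_segments
    histograms = []
    for i in range(n_seg):
        segment = word_ids[i * seg_len:(i + 1) * seg_len]
        h = [0.0] * k
        for v in segment:                  # Eq. (17): count frames per codeword
            h[v] += 1.0
        norm = math.sqrt(sum(c * c for c in h))
        histograms.append([c / norm for c in h])
    return histograms

# Toy example: 7 frames, k = 3 codewords, N = 3 segments of length 2.
# The trailing frame is dropped under the assumption stated above.
H = bow_t_encode([0, 0, 1, 2, 2, 2, 1], k=3, num_segments=3)
```

Each returned histogram has unit Euclidean norm and non-negative entries, so it is a point on the unit sphere in k-dimensional space.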

Alternatively, one can temporally divide each video activity into segments with a fixed time interval and generate a BoW feature vector from each segment in the same way. However, this may not be suitable for datasets containing video activities of significantly different lengths, especially for activities from the same class.

3.4 Time series classification with regularized DTW kernel based on geodesic distances on \(\mathcal {S}^{n}\)

For each pair of video activities, we need to measure the similarity between them. This is done by representing each video activity as a time series of manifold points on a unit n-sphere by the Riemannian BoW+T model (see Section 3.3) and comparing them with a distance measure. The rationale for using DTW-based kernels together with local kernels based on geodesic distances is to address two important aspects of the classification problem: (i) the sequential nature of our feature data points and (ii) the underlying nonlinear manifold structure of the data sequence.

As mentioned above, dynamic time warping is a common way of comparing time series. However, the DTW kernel is in general not positive definite, which may lead to inferior results. Instead, we employ a regularized version of the DTW kernel that can be positive definite if certain conditions are satisfied. For the detailed expression of this regularized DTW kernel, readers are referred to [30]. In fact, this regularized DTW kernel is a special type of REDK kernel [30], whose positive definiteness is established in a theorem in the Appendix of this paper.

Moreover, considering the underlying geometry of given time series, we propose to use a local kernel that is based on geodesic distances between manifold points on a unit n-sphere \(\mathcal {S}^{n}\) in (6). More specifically, the local kernel is defined as

$$\begin{array}{*{20}l} k(\mathbf{x},\mathbf{y}) & = \exp\left(-\gamma \rho(\mathbf{x},\mathbf{y})\right) \\ & = \exp\left(-\gamma \arccos\left(\mathbf{x}^{T} \mathbf{y}\right)\right), \end{array} $$
(18)

where γ is a stiffness parameter that weights the contribution of the local elementary costs. For detailed proof regarding the positive definiteness of the proposed kernel, please refer to the Appendix of this paper.
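The local kernel in (18) can be written directly in plain Python, as in the sketch below. The function name `local_kernel`, the example vectors, and the choice γ = 1 are illustrative assumptions, and the clamping of the inner product is an added numerical safeguard; the regularized DTW recursion of [30] that combines these local kernel values is not reproduced here.

```python
import math

def local_kernel(x, y, gamma):
    """Eq. (18): exp(-gamma * great-circle distance) for unit vectors x, y."""
    dot = sum(a * b for a, b in zip(x, y))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.exp(-gamma * math.acos(dot))

# Two hypothetical L2-normalized histograms with disjoint support.
h1 = [1.0, 0.0, 0.0]
h2 = [0.0, 1.0, 0.0]
k_same = local_kernel(h1, h1, gamma=1.0)
k_orth = local_kernel(h1, h2, gamma=1.0)
```

Since BoW histograms have non-negative entries, their inner product lies in [0, 1], so the geodesic distance between two histograms never exceeds π/2 and the kernel value stays within [exp(−γπ/2), 1].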

4 Experimental results

This section describes the experiments and shows the results on two video datasets containing activities from multiple classes using the proposed method.

4.1 Video datasets on activity classification

Dataset-A: This video dataset contains a total of 943 video activities from 8 activity classes, namely, (1) eating, (2) drinking, (3) using laptop, (4) reading, (5) falling down, (6) lying down, (7) walking, and (8) sitting down. The videos were recorded by ourselves at Chalmers University of Technology, Gothenburg, Sweden, using a Kinect sensor. There are 34 participants involved to increase the randomness in performing activities, without any pre-training. The frame rate is 30 frames per second. The frame resolution is 640 × 480. The average length of video is approximately 100∼600 frames (≈ 3 ∼ 20 s). Detailed information on this dataset is provided in Table 1. As shown in Table 1, activities from different classes take up comparable proportions. Figure 4 depicts some key frames of the videos from Dataset-A.

Fig. 4 Key frames from Dataset-A containing activities from 8 classes. Upper row from left to right: eating, drinking, using laptop, and reading. Lower row from left to right: falling down, lying down, walking, and sitting down

Table 1 Specifications on Dataset-A

Dataset-B: This video dataset [36] contains a total of 224 RGB-D videos from 7 activity classes, namely, (1) drinking, (2) eating, (3) using laptop, (4) reading cellphone, (5) making phonecall, (6) reading book, and (7) using remote. The videos are captured by a Kinect sensor. There are 16 participants involved to perform each class of activity. The frame rate is 30 frames per second. The frame resolution is 640 × 480. The average length of video is approximately 260 ∼ 530 frames (≈ 8∼17 s). Detailed information on this dataset is provided in Table 2. As shown in Table 2, activities from different classes take up exactly the same proportions. Figure 5 depicts some key frames of the videos from Dataset-B.

Fig. 5 Key frames from Dataset-B containing activities from 7 classes. Upper row from left to right: drinking, eating, using laptop, and reading cellphone. Lower row from left to right: making phonecall, reading book, and using remote

Table 2 Specifications on Dataset-B

4.2 Experimental setup

For adjusting the weight of features, the diagonal matrix in (10) is Ω = diag(1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,4). For the BoW model, the number of codewords (clusters) k = 150. For each video activity, the number of segments M = 7. These parameters are empirically determined, without much tuning or optimization.

To limit the impact of inaccurate skeleton/pose estimation, we use manually marked key points in our tests for Dataset-A. For Dataset-B, however, key points are taken from the skeletal joints that are automatically estimated by Kinect, to ensure a fair comparison with other methods.

The libSVM [37] software was modified to use the proposed kernel in our classifier, with the regularization coefficient and kernel parameters tuned by coarse-to-fine grid search and cross-validation. For both datasets, the classifiers were trained on approximately 50% of the videos from each class, and the remaining videos (approximately 50%) were used for testing.
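As an illustration of the per-class split described above, the following minimal sketch assigns about half of the videos of every class to training and the rest to testing. The text does not state how the split was drawn, so the random shuffling, the seed, and the function name `stratified_half_split` are assumptions made for this example; the toy labels simply mimic 7 equally sized classes.

```python
import random
from collections import defaultdict


def stratified_half_split(labels, seed=0):
    """Split sample indices so that about half of each class is used for training.

    Hypothetical helper: random assignment within each class is an assumption,
    since the paper only states the approximate 50/50 proportion per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        half = (len(indices) + 1) // 2  # an odd-sized class gives its extra video to training
        train.extend(indices[:half])
        test.extend(indices[half:])
    return sorted(train), sorted(test)


# Toy labels: 7 classes with 32 videos each (illustrative only).
labels = [c for c in range(7) for _ in range(32)]
train_idx, test_idx = stratified_half_split(labels)
```

Splitting within each class, rather than over the whole dataset at once, keeps the class proportions of the training and test sets close to those of the full dataset.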

4.3 Tests, evaluations, and comparisons

The proposed method is tested on both datasets, with evaluations as well as comparisons to other methods where applicable.

4.3.1 Results on Dataset-A

For Dataset-A, the confusion matrix for the proposed method on the test set is given in Table 3. The performance of the proposed method in terms of classification accuracy and false positive rate (FPR) on the test set is reported in Table 4. It can be observed from Tables 3 and 4 that the proposed method overall achieved high classification accuracy and a low false positive rate. Some confusion is found between eating/drinking/using laptop/reading on the one hand and walking/sitting down on the other, which may appear unusual. A probable explanation is that, in some walking or sitting down scenarios, the person is holding something (e.g., food, drink, book, laptop) while walking or sitting down.
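To make the relation between a confusion matrix and the reported figures concrete, the sketch below derives overall accuracy, per-class recall, and per-class FPR from a small matrix. The one-vs-rest definition of FPR, the function name `per_class_metrics`, and the 3-class toy matrix are assumptions for illustration; they are not the values of Table 3 or Table 4.

```python
def per_class_metrics(confusion):
    """Return (overall accuracy, per-class recall, per-class FPR).

    confusion[i][j] is the number of test samples of true class i that were
    predicted as class j. FPR is computed one-vs-rest (an assumption).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    recalls, fprs = [], []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # class k predicted as something else
        fp = sum(confusion[i][k] for i in range(n)) - tp   # other classes predicted as k
        tn = total - tp - fn - fp
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        fprs.append(fp / (fp + tn) if fp + tn else 0.0)
    return correct / total, recalls, fprs


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [[8, 1, 1],
       [0, 9, 1],
       [2, 0, 8]]
accuracy, recalls, fprs = per_class_metrics(toy)
```

In the toy matrix, class 0 receives two false positives out of the 20 samples that belong to other classes, which gives an FPR of 0.1 for that class.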

Table 3 Confusion matrix for the test set of Dataset-A
Table 4 Performance of the proposed method on activity classification (8 classes) using Dataset-A: classification accuracy, and false positive rate (FPR) on the test set

4.3.2 Results on Dataset-B

For Dataset-B, the performance of the proposed method and several existing methods (see Endnote 3) in terms of classification accuracy on the test set is reported in Table 5. The methods used for comparison are briefly summarized below:

  • Orderlet + Boosting/SVM [36] integrates three types of features to construct a spatio-temporal representation, including pairwise joint distances, spatial joint coordinates, and temporal variations of joint locations.

  • Actionlet Ensemble [38] defines an actionlet as a particular conjunction of features for a subset of skeleton joints, indicating a structure of the features. Based on it, one human action can be interpreted as an actionlet ensemble that is a linear combination of the actionlets.

  • DSTIP + DCSF [39] extends STIP to depth video as DSTIP and extracts depth cuboid similarity feature (DCSF) to describe the local 3-D depth cuboid around DSTIPs for activity recognition.

  • EigenJoints [40] proposes a dimension-reduced skeleton feature, by using the spatial position differences between detected joints as well as the temporal differences between corresponding joints.

  • Moving Pose [41] proposes a moving pose descriptor for capturing dynamic postures, by using the configuration, speed, and acceleration of the skeleton joints.

Table 5 Performance of different methods on activity classification (7 classes) using Dataset-B: classification accuracy on the test set

It can be observed from Table 5 that the proposed method achieved the highest classification accuracy, providing further evidence for the effectiveness of the proposed method. It is also worth noting the performance drop on Dataset-B compared with Dataset-A. This is probably because the key points used for the experiments on Dataset-B are automatically estimated by Kinect, which may be less accurate than manual marking.

4.4 Discussions

The proposed method is shown to have better performance than other methods on Dataset-B. This is probably due to the following major differences between the proposed method and the other ones: (i) instead of a joint representation of features through concatenation, we compute the covariance matrix of these features and use it as the low-level feature descriptor. The covariance descriptor encodes the variances of the defined features and their correlations with each other. Compared with feature concatenation, the covariance descriptor is a much more compact, efficient, and effective representation; (ii) in addition to the spatio-temporal information that is exploited in other methods, we also consider local appearance information that encodes human-object interactions; (iii) rather than the Euclidean metric adopted in other methods, we take into account the underlying manifold geometry of the feature data points for classification.

For Dataset-A, to limit the impact of inaccurate skeleton/pose estimation on the proposed method, we used manually marked key points in our tests. Hence, when these are replaced by automatically detected key points, some performance degradation is expected if the key points on the skeleton are less accurate. This is also a possible reason for the performance drop on Dataset-B compared with Dataset-A. Although there are many toolboxes that can be exploited for pose estimation, such as [34, 35], the study of the impact of inaccurate skeleton/pose estimation on activity classification is beyond the scope of this paper.

5 Conclusion

In this paper, we proposed a method for human activity classification in video that is dedicated to assisted living and healthcare. The method treats each video activity as a temporal sequence of BoW features on a Riemannian manifold and classifies such time series with a kernel based on dynamic time warping (DTW) and geodesic distances. Experiments were conducted on two video datasets containing 943 videos from 8 classes and 224 videos from 7 classes, respectively. The proposed method achieved high classification accuracy and low false positive rates overall, as well as for each individual class. Comparison with several existing methods provided further evidence for the effectiveness of the proposed method.

6 Endnotes

1 A consequence of the Euclidean averaging of SPD matrices: the determinant of the Euclidean mean can be strictly larger than the original determinants, which is physically unacceptable.

2 The determinant of a mean of SPD matrices remains bounded by the values of the determinants of the averaged matrices.

3 The results of all these methods were originally reported in [36].

7 Appendix

To show the positive definiteness of the regularized DTW kernel (REDK) with the local kernel of our choice in (18), we start with the following theorem.

Theorem 1

(Definiteness of REDK [30]) Let \(\mathbb {U}\) be the set of finite sequences (time series) and \(\mathbf {\Omega }\) be the empty sequence (with null length). REDK is a positive definite kernel on \(\mathbb {U} \times \mathbb {U}\) if the local kernel \(k(\mathbf {x},\mathbf {y}) = f(\Gamma (\mathbf {x} \to \mathbf {y}))\) is a positive definite kernel on \(((\mathcal {S} \times \mathcal {T}) \cup \{\mathbf {\Omega }\})^{2}\), where \(\mathcal {S}\) embeds the multidimensional space variables and \(\mathcal {T} \subset \mathbb {R}\) embeds the time stamp variable, \(\mathbf {x} \to \mathbf {y}\) is an edit operation on a pair \((\mathbf {x},\mathbf {y}) \in ((\mathcal {S} \times \mathcal {T}) \cup \{\mathbf {\Omega }\})^{2}\), and \(\Gamma (\mathbf {x} \to \mathbf {y})\) is the associated cost (or distance) function.

From the above theorem, we know that it is the positive definiteness of our local kernel in (18) that matters, which leads us to the theorem below.

Theorem 2

(Schoenberg’s Theorem [42, 43]) Let \(\mathcal {X}\) be a nonempty set and \(f: (\mathcal {X} \times \mathcal {X}) \to \mathbb {R}\) be a kernel. The kernel \(\exp (-\gamma f(\mathbf {x},\mathbf {y}))\) is positive definite for all \(\gamma > 0\) if and only if f is conditionally negative definite.

Therefore, we need to show that the pairwise geodesic distance function \(\rho (\mathbf {x},\mathbf {y})\) (that is, the inverse cosine function \(\arccos (\mathbf {x}^{T} \mathbf {y})\)) in (18), viewed as a kernel f, is conditionally negative definite. First, the definition of conditionally negative definite kernels is given as follows.

Definition 1

(Conditionally Negative Definite Kernels [44]) A kernel \(f: (\mathcal {X} \times \mathcal {X}) \to \mathbb {R}\) is called (conditionally) negative definite if it is symmetric and \(\sum _{i,j=1}^{m} c_{i} c_{j} f(\mathbf {x}_{i},\mathbf {x}_{j}) \leq 0\) for all \(m \in \mathbb {N}\), \(\{\mathbf {x}_{1},\cdots,\mathbf {x}_{m}\} \subseteq \mathcal {X}\) and \(\{c_{1},\cdots,c_{m}\} \subseteq \mathbb {R}\) with \(\sum _{i=1}^{m} c_{i} = 0\).

Then, we recall the Taylor series of the inverse cosine function

$$ \arccos(z) = \frac{\pi}{2} - \sum_{n=0}^{\infty} \left(\frac{(2n)!}{2^{2n} (n!)^{2}} \right) \frac{z^{2n+1}}{(2n+1)}. $$
(19)
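As a quick numerical sanity check of the series in (19), the following sketch compares a truncated partial sum with the library value of the inverse cosine. The function name `arccos_series` and the truncation at 200 terms are assumptions made for this example; the coefficient is updated with the ratio \((2n+1)/(2n+2)\) between consecutive terms, which also makes it visible that every coefficient is nonnegative.

```python
import math


def arccos_series(z, terms=200):
    """Partial sum of pi/2 - sum_n [(2n)! / (2^(2n) (n!)^2)] * z^(2n+1) / (2n+1)."""
    total = 0.0
    coeff = 1.0   # (2n)! / (2^(2n) (n!)^2) evaluated at n = 0
    power = z     # z^(2n+1) evaluated at n = 0
    for n in range(terms):
        total += coeff * power / (2 * n + 1)
        coeff *= (2 * n + 1) / (2 * n + 2)  # ratio of consecutive coefficients
        power *= z * z
    return math.pi / 2 - total


# Largest deviation from math.acos over a few sample points inside (-1, 1).
max_error = max(abs(arccos_series(z) - math.acos(z)) for z in (-0.9, -0.3, 0.0, 0.5))
```

Convergence slows down as |z| approaches 1, so more terms would be needed near the endpoints of the interval.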

From this series, it is clear that \(\arccos (\mathbf {x}^{T} \mathbf {y})\) is conditionally negative definite, because it is of the form “constant minus positive definite” [44]. For a detailed proof, observe that with the power series representation in (19), we have

$$\begin{array}{*{20}l} f(\mathbf{x}_{i}, \mathbf{x}_{j}) & = \arccos\left(\mathbf{x}_{i}^{T} \mathbf{x}_{j}\right) \\ & = \frac{\pi}{2} - \sum_{n=0}^{\infty} \left(\frac{(2n)!}{2^{2n} (n!)^{2}} \right) \frac{\left(\mathbf{x}_{i}^{T} \mathbf{x}_{j}\right)^{2n+1}}{(2n+1)} \\ & = \frac{\pi}{2} - h(\mathbf{x}_{i}, \mathbf{x}_{j}), \end{array} $$
(20)

where \(h(\mathbf {x}_{i},\mathbf {x}_{j})\) is a positive definite kernel. To see this, observe that the power series in (20) has nonnegative coefficients, and since \((\mathbf {x}_{i}^{T} \mathbf {x}_{j})^{2n+1}\) is a point-wise product of positive definite (linear) kernels, it is itself a positive definite kernel. Thus, we have in particular that the matrix

$${} {{\begin{aligned} \mathbf{F} & =\left[ \begin{array}{cccc} f(\mathbf{x}_{1},\mathbf{x}_{1}) & f(\mathbf{x}_{1},\mathbf{x}_{2}) & \cdots & f(\mathbf{x}_{1},\mathbf{x}_{m}) \\ f(\mathbf{x}_{2},\mathbf{x}_{1}) & f(\mathbf{x}_{2},\mathbf{x}_{2}) & \cdots & f(\mathbf{x}_{2},\mathbf{x}_{m}) \\ \vdots & \vdots & \ddots & \vdots \\ f(\mathbf{x}_{m},\mathbf{x}_{1}) & f(\mathbf{x}_{m},\mathbf{x}_{2}) & \cdots & f(\mathbf{x}_{m},\mathbf{x}_{m}) \end{array} \right] \\ & =\left[ \begin{array}{cccc} c & c & \cdots & c \\ c & c & \cdots & c \\ \vdots & \vdots & \ddots & \vdots \\ c & c & \cdots & c \end{array} \right]- \left[\begin{array}{cccc} h(\mathbf{x}_{1},\mathbf{x}_{1}) & h(\mathbf{x}_{1},\mathbf{x}_{2}) & \cdots & h(\mathbf{x}_{1},\mathbf{x}_{m}) \\ h(\mathbf{x}_{2},\mathbf{x}_{1}) & h(\mathbf{x}_{2},\mathbf{x}_{2}) & \cdots & h(\mathbf{x}_{2},\mathbf{x}_{m}) \\ \vdots & \vdots & \ddots & \vdots \\ h(\mathbf{x}_{m},\mathbf{x}_{1}) & h(\mathbf{x}_{m},\mathbf{x}_{2}) & \cdots & h(\mathbf{x}_{m},\mathbf{x}_{m}) \end{array}\right] \\ & = c\mathbf{1}\mathbf{1}^{T} - \mathbf{H}, \end{aligned}}} $$
(21)

where \(c = \frac {\pi }{2}\) is a constant and \(\mathbf {1} \in \mathbb {R}^{m}\) is a column vector of all ones. Therefore, it immediately follows that

$$ \mathbf{z}^{T} \mathbf{F} \mathbf{z} = c\left(\mathbf{z}^{T} \mathbf{1}\right)^{2} - \mathbf{z}^{T} \mathbf{H} \mathbf{z} \leq 0, $$
(22)

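because the first term in (22) is zero whenever \(\mathbf {z}^{T} \mathbf {1} = 0\) (as stipulated for conditionally negative definite kernels in Definition 1, with \(\mathbf {z} \in \mathbb {R}^{m}\) playing the role of the coefficient vector), and because \(\mathbf {z}^{T} \mathbf {H} \mathbf {z} \geq 0\) as \(\mathbf {H}\) is the kernel matrix for \(h(\mathbf {x}_{i},\mathbf {x}_{j})\). Hence, f is conditionally negative definite, so by Theorem 2 the local kernel in (18) is positive definite and, by Theorem 1, so is the proposed kernel, which is used for classifying time series of manifold points (BoW feature vectors) on a unit sphere \(\mathcal {S}^{n}\).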
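The argument above can be checked numerically on random data. The sketch below builds the matrix of pairwise values \(\arccos (\mathbf {x}_{i}^{T} \mathbf {x}_{j})\) for random unit vectors, evaluates the quadratic form for random zero-sum vectors, and also evaluates the quadratic form of the element-wise exponential \(\exp (-\gamma f)\) for unconstrained vectors. The dimension, the number of points, the seed, and γ = 1 are arbitrary choices for illustration, and a finite random test is only a sanity check, not a proof.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a point on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def geodesic(x, y):
    """Geodesic distance arccos(x^T y) on the unit sphere."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp to guard against rounding


def quadratic_form(matrix, z):
    """Evaluate z^T M z for a square matrix given as nested lists."""
    m = len(z)
    return sum(z[i] * z[j] * matrix[i][j] for i in range(m) for j in range(m))


rng = random.Random(0)
points = [random_unit_vector(5, rng) for _ in range(12)]
F = [[geodesic(p, q) for q in points] for p in points]

gamma = 1.0  # hypothetical kernel parameter
K = [[math.exp(-gamma * value) for value in row] for row in F]

largest_cnd = -math.inf   # should stay <= 0 for zero-sum vectors
smallest_pd = math.inf    # should stay >= 0 for arbitrary vectors
for _ in range(200):
    z = [rng.gauss(0.0, 1.0) for _ in points]
    smallest_pd = min(smallest_pd, quadratic_form(K, z))
    mean = sum(z) / len(z)
    z0 = [v - mean for v in z]  # enforce z^T 1 = 0
    largest_cnd = max(largest_cnd, quadratic_form(F, z0))
```

Subtracting the mean from each random vector is what enforces the zero-sum constraint of Definition 1; without it, the first term in (22) does not vanish and the quadratic form of F can be positive.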