Multi-view region-adaptive multi-temporal DMM and RGB action recognition

Al-Faris, Mahmoud; Chiverton, John P.; Yang, Yanyan; Ndzi, David

doi:10.1007/s10044-020-00886-5


Theoretical advances
Open access
Published: 21 April 2020

Volume 23, pages 1587–1602, (2020)

Pattern Analysis and Applications


Mahmoud Al-Faris¹,
John P. Chiverton ORCID: orcid.org/0000-0001-9081-4136¹,
Yanyan Yang² &
…
David Ndzi³


Abstract

Human action recognition remains an important yet challenging task. This work proposes a novel action recognition system. It uses a novel multi-view region-adaptive multi-resolution-in-time depth motion map (MV-RAMDMM) formulation combined with appearance information. Multi-stream 3D convolutional neural networks (CNNs) are trained on the different views and time resolutions of the region-adaptive depth motion maps. Multiple views are synthesised to enhance the view invariance. The region-adaptive weights, based on localised motion, accentuate and differentiate parts of actions possessing faster motion. Dedicated 3D CNN streams for multi-time resolution appearance information are also included. These help to identify and differentiate between small object interactions. A pre-trained 3D-CNN is used here with fine-tuning for each stream along with multi-class support vector machines. Average score fusion is used on the output. The developed approach is capable of recognising both human action and human–object interaction. Three public-domain data-sets, namely MSR 3D Action, Northwestern UCLA multi-view actions and MSR 3D daily activity, are used to evaluate the proposed solution. The experimental results demonstrate the robustness of this approach compared with state-of-the-art algorithms.


1 Introduction

Action recognition is a key step in many application areas. Potential areas of interest are wide. They include automated security monitoring [1]; social applications [2]; intelligent transportation [3]; smart hospitals [4]; and homes [5].

Action recognition methods can be based on a number of different sources of features such as space-time interest points [6], improved trajectories of features and Fisher vectors [7, 8]. These techniques model motion in video data, which is an important source of information that can be used to help recognise actions. Instead of points of motion, less localised sources of motion can also be considered, modelling the motion of the body as a whole, as with motion history images (MHIs) [9], or of its boundary, as with motion boundary histograms (MBHs) [7]. Depth can also be incorporated with techniques such as depth motion maps (DMMs) [10].

These sources of what might be considered handcrafted features are rich in information but are not always able to capture all the relevant aspects of motion that a classifier might need to distinguish between different actions.

The introduction of deep learning techniques such as convolutional neural networks (CNNs) [11] presented significant advantages for many machine learning applications, not least computer vision including action recognition, see, e.g., [12]. Deep learning-based features extracted using, for example, CNNs have shown great performance over many traditional handcrafted features due to, in simple terms, their capability to learn the important aspects of actions from the huge amount of variation that can potentially occur in images and video sequences. This property has also enabled deep learning-based techniques to have improved invariance to, for example, pose, lighting and surrounding clutter [13]. It can also be seen that the inherent structure of CNN-based techniques enables the preservation of the important relations in both the spatial and temporal dimensions [14]. As a part of the success of the deep learning-based methods, many variations in the architectures and approaches have been proposed.

1.1 Contributions

This work makes a number of novel contributions which are:

Region-adaptive depth motion map (RA-DMM). Variable emphasis is placed on different regions in the motion maps with the aid of spatially localised estimates of motion using optical flow.
A system that combines multi-synthesised views and multi-resolution motion information with multi-resolution appearance information (RGB) within a deep learning framework for action recognition. The appearance information is important for assisting with object interactions, whilst the multiple resolutions assist with recognising the same actions performed at differing speeds. The synthesised views improve view invariance, thus helping to further distinguish between actions.
A hierarchical approach to action recognition in terms of recognising gross poses (i.e. standing, sitting, lying) and then specialised networks for the action recognition. This has the advantage of improving action recognition for the same action but in different poses. For instance, mobile phone usage whilst standing or sitting is more easily recognised given the gross pose information.

Section 2 gives an overview of related work, some of its disadvantages and some further information regarding the contributions that this work makes. Following that, the methodology in Sect. 3 describes the proposed approach. Section 4 presents the experiments and results. Finally, Sect. 5 presents the conclusions.

2 Related work

A number of techniques process single video frames as static CNN features [15, 16]. Others [15, 17, 18] have processed short video clips where video frames were employed as multi-channel inputs to 2D CNNs. A further development is the use of 3D CNNs where Ji et al. in 2013 [12] used 3D convolutions to incorporate both the spatial and temporal information of actions in video.

An extension to the conventional single-stream CNN model was proposed for the first time by Simonyan and Zisserman in 2014 [15] for action recognition. It used a two-stream approach to learn single-frame appearance information in combination with stacked optical flow of multiple frames which yielded improved performance.

More recently, deep learning techniques have increasingly been used to utilise temporal information for action recognition tasks. A unique architecture was proposed in [19] using a long-term recurrent CNN with both RGB and optical flow inputs.

Temporal periods over which temporal information is learned and recognised can be very short, e.g. 2 frames as in [20]. Incorporating more temporal information can help improve action recognition performance, as shown by, for example, [12, 16, 21], and multi-temporal resolution, as used by [14]. These methods utilised a range of different features but the advantage of the multi-temporal resolution approach is the ability to adapt to different actions carried out at different speeds.
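As a rough illustration of the multi-temporal resolution idea, the sketch below enumerates frame windows of several lengths over one sequence, so that a fast action can be covered by a short window and a slow one by a long window. The function name, the window lengths and the step are hypothetical choices for illustration and are not taken from the cited works.

```python
def temporal_windows(num_frames, window_lengths, step):
    """Return {length: [(start, end), ...]} of half-open frame index windows."""
    windows = {}
    for length in window_lengths:
        # range() is empty when the sequence is shorter than the window.
        starts = range(0, num_frames - length + 1, step)
        windows[length] = [(s, s + length) for s in starts]
    return windows


# Hypothetical example: a 20-frame sequence at three temporal resolutions.
example = temporal_windows(num_frames=20, window_lengths=(4, 8, 16), step=4)
```

Each window could then be passed to a separate stream, one per temporal resolution.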

A deeper 3D CNN called C3D was built in [21], with motion features learned from several large public video data-sets. The features were shown to be compact and efficient as well as providing superior performance. The C3D model included eight convolution layers, five pooling layers and two fully connected layers.
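The layer counts quoted above can be checked against a simple listing of the network. The following is a minimal sketch only: the layer names and channel widths are assumptions based on the commonly cited C3D configuration and are not stated in this text.

```python
# Hypothetical layer listing for a C3D-style network. Only the layer counts
# come from the text; names and widths are assumptions for illustration.
C3D_LAYERS = [
    ("conv1a", "conv", 64), ("pool1", "pool", None),
    ("conv2a", "conv", 128), ("pool2", "pool", None),
    ("conv3a", "conv", 256), ("conv3b", "conv", 256), ("pool3", "pool", None),
    ("conv4a", "conv", 512), ("conv4b", "conv", 512), ("pool4", "pool", None),
    ("conv5a", "conv", 512), ("conv5b", "conv", 512), ("pool5", "pool", None),
    ("fc6", "fc", 4096), ("fc7", "fc", 4096),
]


def count_layers(layers):
    """Count how many layers of each kind appear in the listing."""
    counts = {}
    for _name, kind, _width in layers:
        counts[kind] = counts.get(kind, 0) + 1
    return counts


layer_counts = count_layers(C3D_LAYERS)
```

The resulting counts match the eight convolution, five pooling and two fully connected layers described in the text.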

In [22], a DMM-pyramid architecture was used to train both a traditional 2D CNN and 3D CNN to keep the partial temporal information of depth sequences for action recognition. The experiments achieved comparable results with state-of-the-art methods in terms of a number of different data-sets.

A CNN model obtained from ImageNet was used in [23]. It was used to learn from multi-view DMM features for action recognition where a video was projected onto different viewpoints within the 3D space. Different temporal scales were then used from the synthesised data to constitute a spatiotemporal pattern of an action. Finally, three fine-tuned models were employed independently on the resulting DMMs. However, a fixed number of temporal scales of DMM still limited the spatiotemporal information to action sequences carried out over a limited range of time. Actions outside that range would need more spatiotemporal information in order to be recognised. In addition, some actions included object interactions which might be very difficult to discern purely from raw depth data.

In [24], a 3D CNN structure was designed to capture spatiotemporal features for action recognition. A support vector machine (SVM) classifier was then used to classify actions based on the captured features. Experimental results showed some competitive results on the KTH action recognition data.

Similarly, a 3D CNN was proposed in [25] to automatically extract spatiotemporal features. Then, however, a recurrent neural network (RNN) was used to classify each sequence considering the learned features for each time step. The experiments on the KTH data-set demonstrated impressive performance in comparison with state-of-the-art approaches. Another use of a 3D CNN was by Taylor et al. in 2010 [20] with a Restricted Boltzmann Machine to learn spatiotemporal features.

An efficient approach was proposed by Liu et al. in 2017 [26], which used a joint-pooled 3D deep convolutional descriptor applied to skeletal feature data on action recognition data. The experimental results demonstrated promising performance. Temporal information was exploited in [27], which used a deep long/short-term memory (LSTM) method on skeleton-based data sequences, which was then combined using a fusion-based approach with appearance information and employed for action recognition.

Deep learning-based action recognition was also presented in [28] using depth sequences and skeleton joint information combined. A 3D CNN structure was used to learn the spatiotemporal features from depth sequences, and then joint-vector features were computed for each sequence. Finally, the SVM classification results of the two types of features were fused for action recognition.

The 3D positional information in depth data can be further emphasised, as was done by [29] where multiple views were derived of the depth data. The authors applied it to dynamic depth images rather than incorporating it into a DMM formulation.

The formulation of the DMM has also been considered. For instance, in [30], the authors weighted the DMM based on a function that varied the amount of influence from more recent frames. In [31], the authors extended this to multiple functions. In another approach in [32], the authors combined wearable inertial sensor data with depth camera data to weight DMMs. This latter approach is interesting; however, it requires the individual to wear and provide an additional source of data. Furthermore, the motion information is not spatially localised.

All these different sources of features are useful but most of them do not consider the way the motion might be carried out over different ranges of time. For example, the number of frames used in the optical flow stacking ranged between 7 and 15 frames, such as 7, 10 and 15 frames as used in [12, 16, 33], respectively. This can be considered important across actors and even for the same actor at different points in time. Appearance information is also not commonly used. Also, little attention is given to how different image regions might be of higher relevance for different actions. Furthermore, they do not consider the effect of higher level information (e.g. pose) on the underlying learnt feature space.

At a lower level, it can also be considered preferable to obtain motion information from multiple contiguous frames in addition to the spatial information. Therefore, more suitable approaches are needed to capture extra temporal information as well as to keep the complexity of the model as low as possible. To this end, we propose a new hierarchical pose detection and action recognition system. The pre-trained C3D model is adapted here to learn multi-resolution features from both the spatial and temporal dimensions using different contiguous frames of RGB data. Furthermore, we propose an adaptive multi-resolution depth motion map calculated across multiple views with important action information learned through the 3D CNN model to provide extra motion-based features that emphasise the significance of the moving parts of an action. In addition, multi-resolution raw appearance information (i.e. RGB) is used to exploit various spatiotemporal features of the RGB scene, which helps to capture more specific information that might otherwise be difficult to obtain from depth sequence information alone, such as object interactions and finer image details. Our adaptive action recognition system is illustrated in Fig. 1.

Our automated system is developed and evaluated based on three well-known publicly available data-sets including the Microsoft Research (MSR) Action 3D data-set [34], the Northwestern UCLA Multiview Action 3D data-set [35] and the MSR daily activity 3D data-set [36]. The experimental results demonstrate the robustness of our approach compared with state-of-the-art algorithms.
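The system outlined above combines several streams, and the abstract states that average score fusion is used on their outputs. A minimal sketch of that fusion step follows, assuming each stream yields one score per action class. The function name and the example scores are hypothetical and do not come from the experiments.

```python
def average_score_fusion(stream_scores):
    """Average per-class scores over streams and return (fused, best_class)."""
    num_streams = len(stream_scores)
    num_classes = len(stream_scores[0])
    fused = [
        sum(scores[c] for scores in stream_scores) / num_streams
        for c in range(num_classes)
    ]
    # The predicted class is the index with the highest averaged score.
    best_class = max(range(num_classes), key=lambda c: fused[c])
    return fused, best_class


# Hypothetical scores from three streams over three action classes.
scores = [
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.5, 0.2],
]
fused_scores, predicted = average_score_fusion(scores)
```

In this example the second stream alone would favour the third class, but averaging across the streams selects the second class.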

3 Methodology

Traditional depth motion maps (DMMs) are formulated on 2D planes by combining projected motion maps of an entire depth sequence. This does not consider the higher order temporal links between frames of depth sequences. A DMM can encapsulate a certain amount of the variation of a subject’s motions during the performance of an activity. Unfortunately, difficulties can arise for activities that have the same type of movements but performed over different temporal periods. Our formulation therefore includes multiple time resolutions, referred to as Multi-resolution DMM (MDMM). Moreover, some actions or parts of actions are performed with different intensities. The differences in depth information captured at points of fast motion are accentuated using a region and motion-adaptive formulation producing a region-adaptive MDMM (RAMDMM). This adaptivity helps to further differentiate between actions, particularly with differences in depth due to positioning compared with actions with fast motion. Parameters used throughout this work are listed in Table 1.
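To make the ideas in this overview concrete, the sketch below accumulates absolute differences between consecutive depth frames into a single map, optionally scaled by per-pixel weights, and repeats this over several sequence lengths. It is an illustration under stated assumptions rather than the formulation used in this work: the accumulation rule is the commonly used DMM definition, the weight maps stand in for motion estimates such as optical flow magnitude, and all function names are hypothetical.

```python
def depth_motion_map(frames, weights=None):
    """Accumulate absolute frame differences into a single 2D map.

    frames  : list of 2D lists (rows of depth values), at least two frames
    weights : optional list of 2D weight maps, one per consecutive frame pair
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    dmm = [[0.0] * cols for _ in range(rows)]
    for t in range(len(frames) - 1):
        for r in range(rows):
            for c in range(cols):
                diff = abs(frames[t + 1][r][c] - frames[t][r][c])
                # Unweighted case: every pixel contributes equally.
                w = 1.0 if weights is None else weights[t][r][c]
                dmm[r][c] += w * diff
    return dmm


def multi_resolution_dmms(frames, window_lengths):
    """One DMM per window length, each built from the first `length` frames."""
    return {
        n: depth_motion_map(frames[:n])
        for n in window_lengths
        if 2 <= n <= len(frames)
    }


# Hypothetical 2x2 depth sequence of three frames.
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 2]],
    [[1, 3], [0, 0]],
]
# Hypothetical weight maps: larger values mark regions of faster motion.
weights = [
    [[2.0, 1.0], [1.0, 1.0]],
    [[1.0, 1.0], [1.0, 0.5]],
]
plain = depth_motion_map(frames)
adaptive = depth_motion_map(frames, weights)
by_length = multi_resolution_dmms(frames, (2, 3))
```

The weighted map raises the contribution of the top-left pixel and lowers part of the bottom-right one, which is the kind of region-dependent emphasis the adaptive formulation aims for.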

Table 1 Table of parameters and notation
