Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition

Human action recognition remains an important yet challenging task. This work proposes a novel action recognition system that uses a Multiple View Region Adaptive Multi-resolution-in-time Depth Motion Map (MV-RAMDMM) formulation combined with appearance information. Multiple stream 3D Convolutional Neural Networks (CNNs) are trained on the different views and time resolutions of the region adaptive Depth Motion Maps. Multiple views are synthesised to enhance view invariance. The region adaptive weights, based on localised motion, accentuate and differentiate parts of actions possessing faster motion. Dedicated 3D CNN streams for multi-time resolution appearance information (RGB) are also included. These help to identify and differentiate between small object interactions. A pre-trained 3D CNN is fine-tuned for each stream and followed by multi-class Support Vector Machines (SVMs). Average score fusion is used on the output. The developed approach is capable of recognising both human action and human-object interaction. Three public domain datasets are used to evaluate the proposed solution: MSR 3D Action, Northwestern UCLA multi-view actions and MSR 3D daily activity. The experimental results demonstrate the robustness of this approach compared with state-of-the-art algorithms.

Framework of our hierarchical Region Adaptive Multi-time resolution Depth Motion Map (RAMDMM) and Multi-time resolution RGB action recognition system. Each pose, Cartesian projection, view and time window has a separate 3D CNN and SVM. The system is configured here to detect 2 poses, across 7 views and 3 time resolutions in the 3 Cartesian planes. The RGB information is also detected across 3 time resolutions. This results in 2 × (3 × 7 × 3 + 3) = 132 separate 3D CNNs and SVM classifiers.
Our automated system is developed and evaluated on three well-known publicly available datasets: the Microsoft Research (MSR) Action 3D dataset [25], the North Western UCLA Multiview Action 3D dataset [26] and the MSR daily activity 3D dataset [27]. The experimental results demonstrate the robustness of our approach compared with state-of-the-art algorithms.
II. METHODOLOGY
Traditional Depth Motion Maps (DMMs) are formulated on 2D planes by combining projected motion maps of an entire depth sequence. This does not consider the higher order temporal links between frames of depth sequences. A DMM can encapsulate a certain amount of the variation of a subject's motions during the performance of an activity. Unfortunately, difficulties can arise for activities that involve the same type of movements but are performed over different temporal periods. Our formulation therefore includes multiple time resolutions, referred to as Multi-resolution DMM (MDMM). Moreover, some actions or parts of actions are performed with different intensities. The differences in depth information captured at points of fast motion are accentuated using a region and motion adaptive formulation, producing a Region Adaptive MDMM (RAMDMM). This adaptivity helps to further differentiate between actions, particularly where depth differences due to positioning must be distinguished from those due to fast motion.

A. Depth Motion Maps
The basic DMM (as used in e.g. [5], [28], [29]) is formed by projecting each depth frame onto three orthogonal Cartesian planes. The motion energy from each projected view is then stacked, either over a specific interval or over the entire sequence, to generate a Depth Motion Map (DMM) $\Gamma_v$ for each projection view, where $v \in \{xy, yz, xz\}$ indicates the Cartesian projection, $m_v^t$ is the projected map of the depth information at time frame $t$ under projection view $v$, and $N$ is the number of frames in the interval. An action can then be represented by combining the three generated DMMs $\Gamma_v$, in which important information on body shape and motion is emphasised. Average score fusion is used here, as discussed in Section II-C.
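The stacking of motion energy can be illustrated with a minimal NumPy sketch. It assumes the commonly used formulation in which the absolute differences between consecutive projected maps are summed over the interval; the function name `depth_motion_map` and the toy data are hypothetical and are not taken from the original implementation.

```python
import numpy as np

def depth_motion_map(projected_maps):
    """Accumulate motion energy for one Cartesian view (xy, yz or xz).

    projected_maps: array of shape (N, H, W), one projected depth map per frame.
    Assumption: motion energy is the absolute difference of consecutive maps.
    """
    maps = np.asarray(projected_maps, dtype=np.float64)
    # Absolute difference between each pair of consecutive frames: (N-1, H, W).
    motion_energy = np.abs(np.diff(maps, axis=0))
    # Stack (sum) the motion energy over the whole interval.
    return motion_energy.sum(axis=0)

# Toy example: a single bright pixel moving one column to the right per frame.
frames = np.zeros((4, 3, 5))
for t in range(4):
    frames[t, 1, t] = 1.0
dmm_xy = depth_motion_map(frames)
```

In this toy case the map is non-zero only along the row that the pixel travels through, which is the kind of motion trace a DMM is meant to capture.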
1) Multi-resolution-in-time Depth Motion Maps: Other researchers have mostly used either a fixed number of frames or the entire action sequence to generate DMMs. However, the length of an action is not known in advance. Hence, multi-resolution-in-time depth motion maps are needed to cover different temporal intervals and rates of an action.
To produce a Multi-resolution DMM (MDMM), the depth frames from a depth sequence are combined across three different ranges, each with a different time interval. This means that various values of the temporal length $\tau$ are set to generate the MDMMs for the same action (depth sequence). Whereas traditional DMMs use a single $\tau \in \mathbb{N}^+$, here $\tau \in \{\lambda_1, \lambda_2, \lambda_3\}$, where $\lambda_i \in \mathbb{N}^+$ are different temporal windows used to properly cover an action's motion regardless of whether it carries important information over a short or long duration. Each of these three durations produces a different DMM. The values of $\tau$ are selected to cover short, intermediate and long durations, where long would typically correspond to an entire depth sequence for the various video sequences considered here.
These MDMMs are calculated for each depth sequence with $v \in \{xy, yz, xz\}$ and $\tau = \lambda_i$, where e.g. $\tau \in \{5, 10, \text{All}\}$ (as used here) are the various lengths of depth sequence used to obtain an MDMM for each single frame.
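A minimal sketch of the multi-resolution idea follows. It is an illustration under stated assumptions rather than the original implementation: each window is taken forwards from a start frame, "All" is read as the entire sequence, and motion energy is the summed absolute frame difference. The names `windowed_dmm` and `multi_resolution_dmm` are hypothetical.

```python
import numpy as np

def windowed_dmm(projected_maps, start, tau):
    """DMM over `tau` frames beginning at `start`; tau == "all" uses every frame."""
    maps = np.asarray(projected_maps, dtype=np.float64)
    window = maps if tau == "all" else maps[start:start + tau]
    if len(window) < 2:
        # Not enough frames to form a difference.
        return np.zeros(maps.shape[1:])
    return np.abs(np.diff(window, axis=0)).sum(axis=0)

def multi_resolution_dmm(projected_maps, start, taus=(5, 10, "all")):
    """One DMM per temporal window length, keyed by the window length."""
    return {tau: windowed_dmm(projected_maps, start, tau) for tau in taus}

# Toy sequence of 12 frames whose intensity grows by 1 per frame.
sequence = np.stack([np.full((2, 2), float(t)) for t in range(12)])
mdmm = multi_resolution_dmm(sequence, start=0)
```

With this toy input, the short, intermediate and long windows accumulate 4, 9 and 11 units of motion energy per pixel respectively, showing how the same action yields three different templates.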
2) Adaptive Motion Mapping: As already considered, different actions can be performed over different time periods. The MDMM is able to include motion information across a range of temporal windows. However each action can also be performed at different speeds by different people and with movement in different locations in an image. Hence, an adaptive weighting approach based on the movement is applied to continuously weight the interest regions to adapt to any sudden change in an action.
To adapt to the various changes in an action, an adaptive weighting approach based on the magnitude of the optical flow motion vectors is employed to build a region adaptive MDMM. Firstly, motion flow vectors are extracted from consecutive frames using optical flow, as explained in [30]. Then the motion magnitude for each pixel is computed and normalised between two consecutive frames.
Optical flow is computed between two consecutive frames, i.e. $I_t$ and $I_{t+1}$. The result of the optical flow function is the motion flow vector $o$ with vector elements $o_x$ and $o_y$ in the horizontal and vertical directions respectively. The motion magnitude of the flow vector of each pixel can be calculated using $g = \sqrt{o_x^2 + o_y^2}$. As the motion magnitude changes with the type, speed and shape of an action movement, it can be used to improve the DMM calculation by including it in the DMM equation as a weighting function. This gives greater weight to regions of higher interest in a DMM template and less weight to other regions. In addition, it lets the DMM template adapt to the different movements within an action. The new RAMDMM is therefore formulated by weighting the DMM terms with $g_v^{t+1}$, the motion magnitude for view $v$ at time point $t + 1$. Fig. 2 shows samples of DMM templates illustrating some differences between traditional DMMs and the region adaptive DMM method.
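The weighting can be sketched as follows. This is a hedged illustration only: the flow fields are supplied as plain arrays rather than computed by the method of [30], the magnitude is normalised by its per-frame maximum, and the weight is applied multiplicatively to each absolute frame difference. All three choices, and the function names, are assumptions rather than details confirmed by the text.

```python
import numpy as np

def flow_magnitude(o_x, o_y):
    """Per-pixel motion magnitude g = sqrt(o_x^2 + o_y^2), scaled to [0, 1]."""
    g = np.sqrt(np.asarray(o_x, dtype=np.float64) ** 2
                + np.asarray(o_y, dtype=np.float64) ** 2)
    peak = g.max()
    # Assumption: normalise by the largest magnitude in this frame pair.
    return g / peak if peak > 0 else g

def region_adaptive_dmm(projected_maps, flows):
    """Motion-weighted DMM for one view.

    projected_maps: (N, H, W) projected depth maps.
    flows: (N-1, 2, H, W) flow components (o_x, o_y) between consecutive frames.
    """
    maps = np.asarray(projected_maps, dtype=np.float64)
    diffs = np.abs(np.diff(maps, axis=0))
    ra_dmm = np.zeros(maps.shape[1:])
    for diff, (o_x, o_y) in zip(diffs, np.asarray(flows, dtype=np.float64)):
        # Fast-moving regions contribute more to the accumulated template.
        ra_dmm += flow_magnitude(o_x, o_y) * diff
    return ra_dmm

# Two frames, two pixels: both pixels change in depth, only the first one moves.
maps = np.array([[[0.0, 0.0]], [[2.0, 2.0]]])
flows = np.array([[[[3.0, 0.0]], [[4.0, 0.0]]]])
ra = region_adaptive_dmm(maps, flows)
```

In the toy example both pixels change by the same depth amount, but only the pixel with non-zero flow survives the weighting, which is the intended separation between positional depth differences and genuine fast motion.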

B. Multiple Views
The 3D nature of the depth sequences means that it is possible to calculate different view points of the same data. This can help to improve the model by making it view invariant. A virtual camera can be rotated by a specific angle in 3D space, which is equivalent to rotating the 3D points of the depth frames.
The virtual camera can be moved within the depicted space, for instance from point $p$ to $p'$. The first step is to move from $p$ to $p_b$ with rotation angle $\alpha$ around the Y-axis, then from $p_b$ to $p'$ with rotation angle $\beta$ around the X-axis. This is performed using the corresponding rotation matrices about the Y-axis and the X-axis. A right-handed coordinate system is used for the rotation, where the original camera view-point is $p$. Applying the two rotations in turn to a 3D point gives its new coordinate $p'$ and the corresponding depth value for the synthesised depth frames. Our view projection method on depth sequences is similar to [24], except that it is applied here to enable extraction of DMMs. Some results of multi-view projection are presented in Fig. 3 for different values of the rotation angle $\alpha$. It can be seen that more discriminative information can be obtained by computing the RA-DMM on the synthesised depth frames. Sequences of depth frames with different view-points can be synthesised from a series of these multi-view projections. This contributes to better data augmentation for training in addition to better overall feature extraction.
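The two-step rotation can be sketched with the standard right-handed rotation matrices about the Y- and X-axes. This is an illustrative assumption: the text does not give the exact matrices used (for example whether a translation or a centre of rotation is included), the angles are taken to be in degrees, and the function names are hypothetical.

```python
import numpy as np

def rotation_y(alpha_deg):
    """Right-handed rotation about the Y-axis by alpha (degrees assumed)."""
    a = np.deg2rad(alpha_deg)
    return np.array([[np.cos(a), 0.0, np.sin(a)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(a), 0.0, np.cos(a)]])

def rotation_x(beta_deg):
    """Right-handed rotation about the X-axis by beta (degrees assumed)."""
    b = np.deg2rad(beta_deg)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(b), -np.sin(b)],
                     [0.0, np.sin(b), np.cos(b)]])

def synthesise_view(points, alpha_deg, beta_deg=0.0):
    """Rotate (M, 3) points about Y by alpha, then about X by beta."""
    rotation = rotation_x(beta_deg) @ rotation_y(alpha_deg)
    return np.asarray(points, dtype=np.float64) @ rotation.T

# A point one unit in front of the camera, viewed after a 90 degree turn about Y.
rotated = synthesise_view([[0.0, 0.0, 1.0]], alpha_deg=90.0)
```

The third column of each rotated point can then be read as the depth value of the synthesised frame; re-projection onto the image grid and hole filling are outside the scope of this sketch.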
In terms of the DMM formulation, multiple views extend the formulation with an additional dependency term, i.e. the view angle $\alpha$, giving $\Gamma_{v,\tau,\alpha}(t)$.

C. Feature Extraction, Classification and Fusion
An effective approach for action recognition was presented in [16], which learns spatio-temporal features using a 3D convolutional neural network (C3D) that was also trained on a number of different large video datasets. The training settings were kept the same as for the original C3D model.
A 3D CNN is able to capture temporal information based on 3D convolution and pooling operations which are performed in the spatial and temporal dimensions.
The C3D network has eight convolution layers and five pooling layers, arranged sequentially. Two fully-connected layers and a softmax loss layer are used to recognise at the individual action label level. The numbers of kernels in the convolution layers are 64, 128, 256, 256, 512, 512, 512 and 512. The size of all kernels in the 3D CNN was set to 3 × 3 × 3 with stride 1 × 1 × 1. For the 3D pooling layers, the kernel sizes were set to 2 × 2 × 2 with stride 2 × 2 × 2, except for the first pooling layer, which had a kernel size of 1 × 2 × 2 and a stride of 1 × 2 × 2 in order to preserve the temporal information at the early stages. The fully connected layers have 4096 output units each.
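The effect of these kernel and stride settings on the feature-map size can be traced with a short calculation. The sketch below rests on several assumptions that are not stated in the text: a 16 × 112 × 112 input clip, convolutions padded so that they preserve size, pooling that rounds up, and the convolution layers grouped 1, 1, 2, 2, 2 between pooling layers.

```python
import math

# Kernel counts of the eight convolution layers, as listed in the text.
CONV_KERNELS = [64, 128, 256, 256, 512, 512, 512, 512]
# Assumption: number of convolution layers preceding each of the five pools.
CONVS_PER_BLOCK = [1, 1, 2, 2, 2]
# Pooling kernel (= stride) per block as (time, height, width), from the text.
POOLS = [(1, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2)]

def c3d_feature_shapes(frames=16, height=112, width=112):
    """Return (channels, time, height, width) after each pooling layer."""
    t, h, w = frames, height, width
    shapes = []
    last_conv = 0
    for n_convs, (pt, ph, pw) in zip(CONVS_PER_BLOCK, POOLS):
        last_conv += n_convs
        channels = CONV_KERNELS[last_conv - 1]
        # 3x3x3 convolutions with stride 1 are assumed padded, so only the
        # pooling changes the size; kernel == stride gives ceil(size / stride).
        t, h, w = math.ceil(t / pt), math.ceil(h / ph), math.ceil(w / pw)
        shapes.append((channels, t, h, w))
    return shapes

shapes = c3d_feature_shapes()
```

Under these assumptions the temporal dimension stays at 16 after the first pooling layer and is only halved from the second pooling layer onwards, which is the early preservation of temporal information described above.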
Conventionally, the value $v_{ij}^{xyz}$ at position $(x, y, z)$ on the $j$th feature map in the $i$th layer can be formulated as follows:
$$v_{ij}^{xyz} = \tanh\left(b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right)$$
where $\tanh(\cdot)$ is the hyperbolic tangent function, $b_{ij}$ is the bias for this feature map, $m$ indexes over the set of feature maps in the $(i-1)$th layer connected to the current feature map, and $w_{ijm}^{pqr}$ is the value at position $(p, q, r)$ of the kernel connected to the $m$th feature map in the previous layer. The kernel sizes $R_i$, $P_i$ and $Q_i$ are the temporal and spatial (height and width) dimensions, respectively.
Each value of the Cartesian projection $v$, time resolution $\tau$ and view $\alpha$ has a separate 3D CNN model that is trained on a set of actions. The 2D output of each 3D CNN is then split into temporal feature vectors, and the three orthogonal views are concatenated to form a single feature vector. The dimensionality of each resulting feature vector is then reduced using Principal Component Analysis (PCA), determined from a covariance matrix of all the feature vectors. The projected feature vectors are then fed into different multi-class Support Vector Machines (SVMs) [31] that are trained to recognise actions. The 3D CNN is trained to use a fixed number of input frames ($\lambda = 16$) for the depth information, where each input frame is a scaled and colour mapped (jet) version of the multi-view region adaptive multi-temporal resolution feature data $\Gamma_{v,\tau,\alpha}(t)$ for time $t$. The input frame size of the pre-trained C3D network is also fixed. Padding and interpolation are used here to resize frames to the required dimensions. Following the 3D CNN feature extraction, feature concatenation and dimensionality reduction, the SVM classification is performed. The classification vectors are then combined across all Cartesian planes, resolutions and views using average score fusion.
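Average score fusion itself is simple to sketch: the per-class score vectors produced by the individual classifiers are averaged and the class with the highest fused score is selected. The scores below are made-up illustrative values, not results from the paper, and the function name is hypothetical.

```python
import numpy as np

def average_score_fusion(score_vectors):
    """Average per-class score vectors from several classifiers (streams)."""
    scores = np.asarray(score_vectors, dtype=np.float64)
    # Each row is one stream, each column one action class.
    return scores.mean(axis=0)

# Hypothetical SVM scores for three streams over three action classes.
stream_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]
fused = average_score_fusion(stream_scores)
predicted_class = int(np.argmax(fused))
```

Here the second stream on its own would pick class 1, but the fused vector favours class 0, illustrating how fusion lets the streams correct one another.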

D. Multi-Resolution Spatio-Temporal RGB Information
Some types of actions and motions, especially those that involve interaction with objects, can be perceived better with appearance information than with e.g. depth, because of differences in the appearance of the object. In addition, it is somewhat difficult to capture the DMM information of these objects, especially when the object's state is fixed or its size is relatively small. Therefore, RGB data is utilised in this work as a source of appearance information within our 3D CNN network model to capture discriminative spatio-temporal information of both subjects and interacting objects. Moreover, different temporal scales are used to cover different temporal ranges in the RGB scene, the same as for RAMDMM. This helps to mitigate problems that might arise from variations in the speed at which different performers carry out an action. Three temporal scales are employed across three independently fine-tuned C3D models (in fixed mode for $\lambda = 16$ but then updated to use a variable number of inputs with $\lambda \in \{10, 25\}$), the outputs of which are fed into three independently trained multi-class Support Vector Machines (SVMs). The outputs of the SVM classifiers are then combined via average score fusion to form the multi-resolution RGB information, i.e. the mean of the vectors $c_r^{rgb}$, where $c_r^{rgb}$ is the action classification vector for the RGB image frames taken across a time window of $r$ frames. An overall average score fusion is then used to derive the final classification vector.

E. People Detection and Pose Classification
The action recognition system can be made to perform well across a wide range of actions; however, this task can be further enhanced if the person performing an action can be localised in the image space. This helps remove extraneous background clutter and distractors. The performance of the system can also be further enhanced if the pose of the person can be detected prior to action recognition. This helps to provide the classification system with a better defined delineation between different actions performed in different poses. For instance, using a telephone whilst standing or sitting could produce a range of features that are not well connected in feature space or well separated from the features of other actions.
Person detection is performed here using the Faster R-CNN [32] person detector, based on the AlexNet [33] model as a network structure but transformed into a Region Proposal Network (RPN), with the use of a ROI max pooling layer and classification layers.
A few samples from the RGB data of the utilised datasets are used to create the ground-truth training data. After training, the created Faster R-CNN network is then used for person detection on the RGB data. This can help to eliminate the noise of the background environment in the action recognition process as can be seen in section III.
Pose detection is performed here using a specially adapted pre-trained AlexNet model [6], using transfer learning to classify the pose of an occupant as one of three specified poses (sitting, standing, lying).

III. EXPERIMENTS & RESULTS
Three public datasets are used to evaluate the proposed method for action recognition: North Western UCLA Multiview Action 3D dataset [26]; Microsoft Research (MSR) Action 3D dataset [25]; and the MSR daily activity 3D dataset [27].
The overall steps and parameter values that are employed on the datasets for feature extraction and action recognition are summarised as follows:
• Project the original depth sequence into different views with α ∈ {45, 30, 15, −15, −30, −45}, which results in 6 synthesised views of the data and the original at α = 0;
• Compute Cartesian projections of the 7 views;
• Compute the magnitude of the motion vectors using the optical flow algorithm over the original and synthesised sequences;
• Compute RA-DMMs for each of the original and synthesised sequences;
• Split each action sequence into 16-frame sub-sequences for the RA-DMMs and (10, 16, 25)-frame sub-sequences for the RGB information to train the 3D CNNs;
• Compute the RA-DMM of each of the original and synthesised sequences using sub-sequence concatenation, dimensionality reduction and then average score fusion of the three map templates;
• Compute the multi-resolution RA-DMM using average score fusion for the RA-DMM windows (5, 10, all);
• Compute the multi-view RAMDMM using average score fusion of the SVM classifiers for all RAMDMMs across the different views;
• Compute the multi-resolution RGB (depth in the MSR 3D action dataset) information using average score fusion of the SVM classifiers for {10, 16, 25} frames;
• Obtain the overall proposed system with average score fusion between MV-RAMDMM and MR-RGB.
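The sub-sequence splitting step in the list above can be sketched as follows. The text does not say whether the windows overlap or how trailing frames are handled, so this sketch assumes consecutive, non-overlapping windows and drops any incomplete final window; the function name and the 50-frame example are hypothetical.

```python
def split_into_subsequences(num_frames, length):
    """Return (start, end) frame index pairs for consecutive fixed-length windows.

    Assumption: windows do not overlap and an incomplete final window is dropped.
    """
    return [(start, start + length)
            for start in range(0, num_frames - length + 1, length)]

# Window lengths named in the text: 16 for RA-DMMs and 10, 16, 25 for RGB.
RGB_WINDOW_LENGTHS = (10, 16, 25)
windows = {length: split_into_subsequences(50, length)
           for length in RGB_WINDOW_LENGTHS}
```

For a 50-frame sequence this yields five, three and two sub-sequences for the 10-, 16- and 25-frame windows respectively, each of which would be passed to the corresponding 3D CNN stream.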

A. North Western UCLA Dataset
The North Western UCLA Multiview action 3D dataset [26] was captured using three Kinect cameras that record RGB, depth and human skeleton data simultaneously. The dataset includes 10 different action categories: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throw and carry. Each action is performed by 10 actors. In addition, the dataset covers a variety of viewpoints.
We evaluate our proposed method with two different training and testing protocols for this dataset:
• Cross-subject training scenario: In this setting we use the data of 9 subjects as training data, and leave the data of the remaining subject as test data. This is useful to show the performance of the recognition system across subjects. Furthermore, this is a standard criterion for comparison with the state-of-the-art.
• Cross-view training scenario: As this dataset contains three camera views, we use the data of 2 cameras as training data, and leave the remaining camera as test data. This setting is used to demonstrate the ability of the recognition system to perform with different views and provides another standard criterion for comparison with the state-of-the-art.
These settings give the opportunity to evaluate the proposed system under variations in both subjects and views. The complete system demonstrates state-of-the-art performance, as can be seen shortly. First, the performance of the individual streams with individual inputs is examined.
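The two protocols above can both be expressed as leaving one value of a grouping key out for testing. The sketch below enumerates every possible held-out subject or camera; which particular subject or camera is held out in the reported experiments is not stated in the text, and the sample records and function name are hypothetical.

```python
def leave_one_out_splits(samples, key):
    """Yield (held_out, train, test) for each distinct value of samples[i][key]."""
    held_out_values = sorted({sample[key] for sample in samples})
    for held_out in held_out_values:
        train = [s for s in samples if s[key] != held_out]
        test = [s for s in samples if s[key] == held_out]
        yield held_out, train, test

# Hypothetical index of recordings: 10 subjects, each seen by 3 cameras.
samples = [{"subject": subject, "camera": camera}
           for subject in range(1, 11) for camera in range(1, 4)]

cross_subject = list(leave_one_out_splits(samples, "subject"))  # 9 train, 1 test
cross_view = list(leave_one_out_splits(samples, "camera"))      # 2 train, 1 test
```

Keeping the split logic keyed on a single field makes it straightforward to confirm that no subject (or camera) appears on both sides of a split.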
1) Multi-Resolution in Time Appearance Information: To start, the classification performance using multi-temporal resolution RGB data as an input to the 3D CNN model (C3D) is investigated together with the multi-class SVM classifier, based on the aforementioned evaluation scenarios. Three temporal resolutions are used for the RGB model, with windows of 10, 16 and 25 frames. The trainable layers of the 3D CNN model are adapted when a non-conformant input is used, i.e. λ ∈ {10, 25}. Average score fusion is employed between the three SVMs to produce the multi-temporal resolution result for the RGB data. Table I includes the results for the different temporal resolutions of the RGB data in addition to the average score fusion result. As can be seen in Table I, the fine-tuned C3D model achieves good performance under both the cross-subject and cross-view classification schemes, with recognition rates improving as the temporal window increases. C3D with multi-class SVM classification on RGB data alone with 25 temporal frames achieves the highest single-window recognition performance of 88.23% and 70.32% for the cross-subject and cross-view evaluation schemes respectively. This reduces to (78.12%, 61.79%) and (67.44%, 56.71%) when 16 and 10 temporal frames are used respectively. Finally, the highest overall recognition performance is achieved when average score fusion is employed, combining the outputs of the three temporal results, again for RGB scene information only.
2) Multi-Resolution in Time Region Adaptive Depth Motion Maps: The Region Adaptive DMM (RADMM) templates are calculated across the three temporal resolutions to form the multi-resolution DMM template, referred to as RAMDMM. These are used to learn discriminative features encapsulating depth, time and motion information. Results demonstrating the improvements achieved for the depth across multiple time windows are shown in Table II. A trend similar to that seen for the appearance information can be observed in these results, i.e. a greater time window increases recognition performance, which is further improved by average score fusion over all time windows.
3) Combining RAMDMM, Multiple Views and Appearance Based Multiple Sequences: The depth, time and motion information are then further combined across multiple synthesised views to produce MV-RAMDMM based action recognition. Finally, average score fusion is employed between the MR-RGB and MV-RAMDMM streams so that appearance, motion, shape and historical information all contribute to action recognition. The MV-RAMDMM streams help to significantly improve the recognition rate for both the cross-subject and cross-view settings. In addition, average score fusion between MV-RAMDMM and MR-RGB shares complementary information between the streams, improving the recognition accuracy over the individual models and reaching 97.15% and 86.20% for the cross-subject and cross-view evaluation schemes respectively. This can be compared to state-of-the-art approaches as seen in Table IV.
TABLE IV: A COMPARISON BETWEEN THE PROPOSED METHOD AND STATE-OF-THE-ART APPROACHES IN TERMS OF NORTH WESTERN UCLA DATASET.
Paper                 Cross-subject   Cross-view
Virtual view [34]     50.70           47.80
Hankelet [35]         54.20           45.20
MST-AOG [26]          81.60           73.30
Action Bank [36]      24.60           17.60
Poselet [37]          54.90           24.50
Denoised-LSTM [38]    -               79.57
tLDS [39]             92
It can be seen in Table IV that the Virtual view [34] and Hankelet [35] methods are limited in their performance, which reflects the challenges of the North Western UCLA dataset (e.g. noise, cluttered backgrounds and various view points). To mitigate these challenges, MST-AOG was proposed in [26] and achieved 81.60%. Our method achieves a significant improvement of 18% over MST-AOG, with comparable performance for the cross-view setting, which remains the more challenging of the two. A confusion matrix of the proposed method using spatial and motion information on the North Western UCLA multi-view action 3D dataset is shown in Figure 4.

B. MSR 3D Action Dataset
The Microsoft Research (MSR) Action 3D dataset [25] is an action dataset consisting of depth sequences with 20 actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw cross, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis serve, golf swing, pickup and throw. Each action is performed three times by each of ten subjects. A single point of view is used, with the subjects facing the camera while performing the actions. The dataset has been split into three groups based on complexity, AS1, AS2 and AS3, as used in many studies (see e.g. [25], [5], [29], [40]).
The action subsets are summarised in Table V. All validation schemes make use of the three subsets. Three evaluation schemes are considered in the literature (see e.g. [41]) for the MSR action 3D dataset:
• 1/3 evaluation scheme: 1/3 of the instances are used as training samples and the remainder as testing samples. The 1/3 scheme uses the first repetition of each action performed by each subject for training, and the rest for testing.
• 2/3 evaluation scheme: 2/3 of the instances are used as training samples and the remainder as testing samples. The 2/3 scheme uses two repetitions of each action performed by each subject for training, and the rest for testing.
• Cross-subject evaluation scheme: half of the subjects are used as training samples and the other half as testing samples. Any half of the subjects can be used for testing, e.g. 2, 4, 6, 8 and 10, with the rest used for training, i.e. 1, 3, 5, 7 and 9 (as used here).
Each subset has eight actions that can be used to evaluate the proposed method under the 1/3, 2/3 and cross-subject validation schemes. These help to assess the performance of the proposed method under different training conditions, such as a shortage of training samples, many training samples and variations between different subjects.
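The three schemes above can be sketched as a single split function over (subject, action, repetition) records. The record layout is hypothetical, and for the 2/3 scheme the sketch assumes that the first two repetitions are the ones used for training, which the text does not specify.

```python
def msr_split(samples, scheme):
    """Split (subject, action, repetition) tuples into train and test lists."""
    samples = list(samples)
    if scheme == "1/3":
        is_train = lambda s: s[2] == 1          # first repetition only
    elif scheme == "2/3":
        is_train = lambda s: s[2] in (1, 2)     # assumption: first two repetitions
    elif scheme == "cross-subject":
        is_train = lambda s: s[0] % 2 == 1      # subjects 1, 3, 5, 7, 9
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    train = [s for s in samples if is_train(s)]
    test = [s for s in samples if not is_train(s)]
    return train, test

# Hypothetical subset: 10 subjects x 8 actions x 3 repetitions.
subset = [(subject, action, repetition)
          for subject in range(1, 11)
          for action in range(1, 9)
          for repetition in range(1, 4)]

train_13, test_13 = msr_split(subset, "1/3")
train_23, test_23 = msr_split(subset, "2/3")
train_cs, test_cs = msr_split(subset, "cross-subject")
```

On a complete subset of this size the schemes give 80/160, 160/80 and 120/120 train/test samples respectively, which makes the difference in available training data between the schemes explicit.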
Similar to the experiments conducted above for the North Western UCLA 3D action data set, a series of progressive sets of experiments are carried out.
1) Depth Information: This data set only has depth information (no appearance information). Therefore, the depth frames are used in place of RGB based appearance information. The pre-trained C3D network is applied individually to the depth data with temporal windows of 10, 16 and 25 frames for the different MSR evaluation schemes. Average score fusion is then employed between the models to show the effect on the recognition rate. Table VI includes the results of the C3D network based on depth data alone. Again, the recognition performance improves with greater temporal windows and when the different temporal dimensions are combined by average score fusion, making the system more robust against speed variations. This demonstrates the use of shape and temporal information from the depth sequences in the recognition process.
The results in Table VII show that the recognition performance obtained by using RADMM information to form the RAMDMM, and then learning the features of the actions from the RAMDMM, is better than that obtained with either the traditional DMM or an individual length of RADMM. Moreover, the results suggest that sharing the variety of information available from the features, through average score fusion between the different models, can improve the performance of the recognition system.
3) Combining RAMDMM, Multiple Views and Depth Based Multiple Sequences: Table VIII shows the effect on recognition accuracy of the multi-view RAMDMM (MV-RAMDMM) templates and of the multi-resolution spatio-temporal information, when combined with the depth sequences investigated in Section III-B1. Figure 5 shows the confusion matrices of the recognition system using the proposed models under the cross-subject evaluation scheme for the AS1, AS2 and AS3 subsets of the MSR 3D action dataset. Further, a comparison between the proposed method and the state-of-the-art approaches for human action recognition is presented in Table IX for the MSR Action 3D dataset under the aforementioned evaluation schemes. It can be seen that our method outperforms the state-of-the-art approaches in the majority of cases and in the others achieves at least comparable performance. Even though some of them are DMM based methods, such as [46] and [5], our method achieves a recognition rate that is higher by 1-6%. This appears to indicate that MV-RAMDMM and spatio-temporal information based features can provide more powerful discrimination. Our approach utilises adaptive multiple hierarchical features that cover various periods of an action. In addition, the pre-trained recognition model uses a diverse range of layers, which improves the chances of obtaining accurate recognition performance.

C. MSR 3D Daily Activity
The Microsoft Research (MSR) daily activity 3D dataset is among the most challenging datasets because of its high level of intra-class variation and because many of the actions are based on object interaction, i.e. the subject interacts with an object while performing the action. This dataset was captured by a Kinect sensor. It consists of depth and RGB sequences and includes 16 activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, sit down. Each of the 10 subjects performs each action twice, in two different poses (standing and sitting). Different evaluation schemes have been considered in the literature for the MSR daily activity 3D dataset. Here, similar to [18], a cross-subject validation is performed with subjects 1, 3, 5, 7, 9 for training and subjects 2, 4, 6, 8, 10 for testing. Here the person detection step is used to detect and localise a person within a frame, and pose detection is used to identify the pose, whether sitting or standing.
1) Multi-Resolution in Time Appearance Information: Firstly, multiple temporal resolutions (10, 16, 25) of RGB information are investigated separately with the fine-tuned C3D models. The outputs of these models are, as before, classified using different SVMs, and the SVM outputs are combined using average score fusion. The results for this purely multi-temporal appearance based recognition sub-system are shown in Table X. It can be seen in Table X that, as before, the robustness of the proposed model improves with an increase in the number of frames, with the best result obtained by combining all temporal resolutions. This dataset is often considered to be much more complicated than others because of the two different scenarios for each single action, but the hierarchical strategy with the fine-tuned model is able to achieve comparable results based on raw RGB data. Moreover, a reasonable overall performance is also achieved that reaches 64.85% when an average recognition rate is employed.
2) Multi-Resolution in Time Region Adaptive Depth Motion Maps: As before, the RADMM templates for three different temporal windows are computed and fed into fine-tuned C3D models and multi-class SVMs, the results of which constitute the RAMDMM for action recognition. Competitive results are achieved using these improved multiple temporal resolutions, as can be seen in Table XI. For multiple views (MV-RAMDMM) the performance reaches 89.00% and 86.00% for the sitting and standing poses respectively, as presented in Table XII.
Further improvements can be seen by involving the multi-resolution spatio-temporal RGB information. Average score fusion improves the recognition of some object-interaction actions and achieves 89% and 86% for the sitting and standing poses respectively. The overall recognition rate for the dataset is calculated as the average of the recognition rates for the two poses, which reaches 87.5%. Fig. 6 shows the confusion matrix of the hierarchical recognition system for the MSR 3D daily activity dataset. A comparison between the proposed method and state-of-the-art approaches for action recognition on the MSR 3D daily activity dataset is given in Table XIII.

Method                     Accuracy %
LOP [27]                   42.50
Depth Motion Maps [5]      43.13
Local HON4D [47]           80.00
Actionlet Ensemble [27]    85.75
SNV [48]                   86.25
Range Sample [49]          95.63
DMM-CNN [18]               85.00
Ours                       87.50
In Table XIII, it can be seen that limited accuracy was previously achieved by the LOP [27] and DMM [5] based approaches. Local HON4D was designed in [47] to tackle this kind of limitation and achieved a recognition rate of 80.00%. Actionlet Ensemble in [27] and SNV in [48] achieved recognition rates of 85.75% and 86.25% respectively. These used a combination of depth and skeleton data. A recent method in [18] indicated the importance of DMM information and suggested the use of Temporal Depth Motion Maps and fine-tuned convolutional models. It achieved a relatively competitive result of 85.00%. Our method achieves comparable results, with an improvement over some methods, using our MV-RAMDMM and the spatio-temporal information of the C3D model. However, our method performed worse than the Range Sample [49] technique. This can be explained by the noisy, complex and dynamic background of this dataset, which can introduce significant noise in the RAMDMMs. Moreover, the Range Sample [49] method includes a technique that uses skeleton data to eliminate the noise from the background. The confusion matrices for the MSR daily activity 3D dataset are shown in Fig. 6.

IV. CONCLUSIONS
A novel feature representation technique for RGB-D data has been presented that enables multi-view and multi-temporal action recognition. A Multiple view and Multi-resolution Region Adaptive Depth Motion Maps (RA-DMMs) representation is proposed. The different views include the original and synthesised view-points to achieve view-invariant recognition. This work also makes more effective use of temporal motion information, integrating it into the depth sequences to build in, by design, invariance to variations in an action's speed. An adaptive weighting approach is employed to help differentiate between the most important stages of an action. Appearance information in the form of multi-temporal RGB data is used to retain the underlying appearance information that would otherwise be lost with depth data alone. This helps to provide sensitivity to interactions with small objects. Compact and discriminative spatio-temporal features are extracted using a series of fine-tuned 3D Convolutional Neural Networks (3D CNNs). In addition, a pose estimation system is employed to achieve a hierarchical recognition structure. This helps the model to recognise the same action performed in different poses. Multi-class Support Vector Machines (SVMs) are used for action classification. Finally, a late score fusion technique is employed between the different streams for the final decision.
The proposed method is robust enough to recognise human activities even with small differences in actions. This is in addition to achieving improved performance that is invariant to multiple view-points and providing excellent performance on actions that partly depend on human-object interactions. The system also remains invariant to a noisy environment and errors in the depth maps and temporal misalignments.
The proposed approach has been extensively validated on three benchmark datasets: MSR 3D actions, Northwestern UCLA multi-view actions and MSR daily activities. The experimental results have demonstrated the strong performance of the proposed method in comparison to state-of-the-art approaches.