In this section we first introduce our new dataset for social activity recognition, and then present the performance of the overall system. Finally, we analyse in more detail the behaviour of each module (segmentation, classification, and proximity-based priors) to better understand their role in the social activity recognition task.
Social Activity Dataset
We created a new dataset (“3D Continuous Social Activity Dataset”) for social activity recognition to validate the performance of our system on continuous streams of RGB-D data. The dataset is publicly available for the research community. It consists of RGB and depth images, plus skeleton data of the participants (i.e. 3D coordinates and orientations of the joints), collected indoors with a Kinect 2 sensor. The dataset includes 20 videos, containing individual and social activities performed by 11 different subjects. The approximate length of each video is 90 s, recorded at 30 fps (more than 50 K samples in total). In particular, the social activities in the videos are handshake, hug, help walking, help standing-up, fight, push, talk and draw attention. Some snapshots from the dataset are shown in Fig. 9. Unlike the previous “3D Social Activity Dataset” by [6], the social activities in this new dataset appear in uninterrupted sequences within the same video, alternating 2 or 3 social activities with individual ones such as read, phonecall, drink or sit. Furthermore, unlike the dataset introduced in [5], which focused exclusively on segmentation, the occurrence of all social activities is consistent in every video and the number of activities is higher, allowing experiments to evaluate the performance of the classifier. The activities of this dataset, therefore, are not manually selected and cropped into short video clips, as in previous cases.
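For reference, the per-frame content of the dataset can be thought of along the lines of the following sketch. This is purely illustrative: field names, types and the quaternion representation of the joint orientation are assumptions, not the released file format.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class SkeletonJoint:
    position: Tuple[float, float, float]             # 3D coordinates of the joint
    orientation: Tuple[float, float, float, float]   # joint orientation (assumed quaternion)

@dataclass
class FrameSample:
    rgb: np.ndarray                        # colour image from the Kinect 2
    depth: np.ndarray                      # registered depth image
    skeletons: List[List[SkeletonJoint]]   # one joint list per tracked subject
    timestamp: float                       # seconds, recorded at ~30 fps
```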
The dataset is used to train both the temporal segmentation and the classification modules, and to evaluate the performance of the whole recognition system.
Table 1 Statistics of the final social activity recognition
Overall System Performance
To evaluate the performance of the whole recognition system and verify the impact of the segmentation and of the proximity-based priors, we compute accuracy, precision and recall from the results of a leave-one-out cross-validation. Table 1 shows the results of our MM-DBMM classification alone and in combination with the proximity-based priors generated by either the simple multivariate Gaussian or the GMM approximation. Three further conditions are also compared: without interaction segmentation, with manual segmentation (i.e. ground truth by a human expert) and with automatic segmentation. From the results, we can observe that the segmentation greatly improves the accuracy and, in particular, the precision. Indeed, the latter is affected by the number of individual activities (about half of the total in the dataset) that are successfully excluded by the segmentation process. When using the pure MM-DBMM, the recall is highest in the absence of segmentation. This is due to the internal filtering of the DBMM, which tends to improve over longer sequences. With automatic segmentation, however, the recall is lower than in the other cases for all configurations. This drop in performance is mainly due to the imperfect segmentation, as can be seen in Table 2, and is further discussed in the next section. As expected, the results with automatic segmentation are not as good as with manual segmentation, although they are still considerably high.
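The evaluation protocol can be summarised by the following minimal sketch, assuming per-frame predicted and ground-truth labels are available for each video. The function names, the data layout and the macro averaging of precision and recall are illustrative assumptions, not part of the released code.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def leave_one_out_scores(videos, train_fn, predict_fn):
    """Leave-one-video-out cross-validation.

    videos     : list of (features, labels) pairs, one per video
    train_fn   : callable fitting a model on a list of (features, labels)
    predict_fn : callable returning per-frame labels for one video
    """
    y_true, y_pred = [], []
    for i, (feats, labels) in enumerate(videos):
        train_set = videos[:i] + videos[i + 1:]      # hold out video i
        model = train_fn(train_set)
        y_true.extend(labels)
        y_pred.extend(predict_fn(model, feats))
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
    }
```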
Table 2 Performance of the interaction segmentation only
Finally, Table 1 shows that integrating the proximity-based prior in the classification process improves the overall recognition performance. In particular, the GMM approximation leads to better accuracy, precision and recall than the simple multivariate Gaussian case.
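One plausible way to read this integration is as a per-frame product of the classifier output and the proximity prior, renormalised over the classes. The sketch below illustrates this reading only; the exact weighting used in the actual system is an assumption here.

```python
import numpy as np

def fuse_with_prior(classifier_probs, prior_probs):
    """Combine per-frame MM-DBMM class probabilities with proximity-based
    priors by element-wise product and renormalisation (illustrative rule).

    classifier_probs : array (n_frames, n_classes) from the classifier
    prior_probs      : array (n_frames, n_classes) from the proximity model
    """
    fused = classifier_probs * prior_probs
    fused /= fused.sum(axis=1, keepdims=True)   # renormalise per frame
    return fused
```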
The current implementation of the combined system, with non-optimised code, can classify RGB-D video streams at 16 fps on average. This can be further improved by executing the different modules of the MM-DBMM and the priors in parallel, since they are independent until the final merge. The component that introduces the greatest time limitation is the segmentation module, since the HMM requires the full input sequence to perform its inference. In order to reduce its impact on the processing speed, we have reduced the time interval processed by the HMM. Table 3 shows how the accuracy of the segmentation module decreases as the interval on which the HMM is applied is shortened.
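As an illustration of how the HMM interval can be limited, the following sketch decodes the interaction/no-interaction states on a sliding buffer of fixed length instead of the whole sequence. The `segmenter.predict` call stands for any HMM decoder with a Viterbi-style interface; the buffer length and the API are illustrative assumptions.

```python
import numpy as np

def windowed_segmentation(frames, segmenter, window=300):
    """Apply an HMM-based segmenter on a sliding buffer of `window` frames
    (e.g. 300 frames, roughly 10 s at 30 fps) instead of the full sequence.
    Only the label of the newest frame is kept at each step."""
    labels = []
    buffer = []
    for f in frames:                      # f: per-frame feature vector
        buffer.append(f)
        if len(buffer) > window:
            buffer.pop(0)                 # drop the oldest frame
        states = segmenter.predict(np.asarray(buffer))
        labels.append(states[-1])         # decision for the current frame
    return np.asarray(labels)
```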
Table 3 Performance of the segmentation in relation to the time interval of the HMM
Table 4 Percentage of the errors of the segmentation over the different classes
Analysis of Interaction Segmentation
To examine the performance of the segmentation model described in Sect. 5, we evaluate accuracy, precision and recall with a leave-one-out experiment on our dataset (Table 2). In addition, to measure the impact of the segmentation errors on the different social activities, in Table 4 we report the percentages of false positives and false negatives in segmenting each one of them.
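One possible frame-level accounting of these per-class errors is sketched below: missed interaction frames are counted against their ground-truth activity, while spurious interaction frames are attributed to the temporally closest social activity. This bookkeeping is an assumption for illustration and may differ from the exact procedure used for Table 4.

```python
import numpy as np

def per_class_segmentation_errors(pred, gt, activity, classes):
    """Percentage of segmentation errors per social activity class.

    pred     : bool array, segmenter output (True = interaction)
    gt       : bool array, ground truth (True = interaction)
    activity : array of ground-truth activity labels per frame
    classes  : list of social activity labels to report
    """
    interact_idx = np.flatnonzero(gt)            # assumes at least one interaction frame
    errors = {c: {"FP": 0, "FN": 0} for c in classes}
    for t in np.flatnonzero(pred != gt):
        if gt[t]:
            cls, kind = activity[t], "FN"        # missed interaction frame
        else:
            nearest = interact_idx[np.argmin(np.abs(interact_idx - t))]
            cls, kind = activity[nearest], "FP"  # spurious frame, attributed to
        if cls in errors:                        # the closest social activity
            errors[cls][kind] += 1
    total = max(int((pred != gt).sum()), 1)
    return {c: {k: 100.0 * v / total for k, v in d.items()}
            for c, d in errors.items()}
```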
What these two tables show is that, in general, our segmentation module works very well. However, the last table reveals that the segmentation errors are not equally distributed among the activity classes. The draw attention activity, in particular, generates more false negatives and false positives because it often starts before the actual interaction takes place, and is therefore harder to detect.
It should be noted, however, that even for a human expert it is difficult to detect precisely when an activity starts or ends, simply because such an exact moment in time does not really exist. These results should therefore be taken with a ‘pinch of salt’ and considered only an approximate measure of the segmentation performance. As shown in the previous section, however, the segmentation module significantly affects the final results of the social activity recognition, and it is therefore a crucial component of our system.
Analysis of Social Activity Classification
A further analysis of the social activity classification, again with a leave-one-out cross-validation, was carried out on manually segmented interactions. This allows us to evaluate the performance of our MM-DBMM independently of the other components. From the confusion matrices in Fig. 10a, we can see that the classification of social activities is in general very good. The least accurate cases are those where the activity is very short (e.g. push, draw attention), since they provide the smallest number of samples. It can also be observed that some activities in which the two subjects stand right in front of each other (e.g. handshake, push) are often confused with the talk case. As shown in the next section, this problem is mitigated by the introduction of our proximity-based priors.
Analysis of Proximity-Based Priors
To analyse the reliability of our proximity-based priors, we consider a specific activity and compute the mean and standard deviation of the prior probabilities assigned to it and to all the remaining activities, assuming perfectly segmented videos. Also in this case we perform a leave-one-out cross-validation. What we expect is that the probability of the actual activity is higher than that of all the others. Comparing the priors obtained from a simple multivariate Gaussian and from a GMM approximation (Fig. 11) for some social activities, we can see that in the multivariate case the mean probability of the actual activity is higher than in the GMM case, but the variance of the latter is much smaller, making it more reliable.
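A minimal sketch of how such class-conditional priors can be obtained with either density model is given below, using scipy and scikit-learn. The layout of the proximity features, the number of GMM components and the implicit uniform class prior are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_proximity_densities(train_feats, use_gmm=True, n_components=3):
    """Fit one density per social activity on proximity features
    (e.g. inter-person distance and relative orientation).

    train_feats : dict {activity: array of shape (n_samples, n_dims)}
    Returns a dict of per-class log-density functions.
    """
    densities = {}
    for act, X in train_feats.items():
        if use_gmm:
            gmm = GaussianMixture(n_components=n_components).fit(X)
            densities[act] = gmm.score_samples                    # log p(x | act)
        else:
            mvn = multivariate_normal(X.mean(axis=0), np.cov(X, rowvar=False))
            densities[act] = lambda x, m=mvn: m.logpdf(x)
    return densities

def proximity_prior(densities, x):
    """Normalised prior P(activity | proximity features x)."""
    x = np.atleast_2d(x)
    logp = np.array([np.ravel(d(x))[0] for d in densities.values()])
    p = np.exp(logp - logp.max())            # numerically stable normalisation
    return dict(zip(densities.keys(), p / p.sum()))
```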
The effect of these two different priors on the activity classification is shown by the confusion matrices in Fig. 10b, c. In both cases, it is clear that the proximity-based priors improve the classification of social activities. However, we can also see that the improvement is higher when GMM priors are used.
Comparative Study
To compare our classification performance with other works, we also tested our social activity classification model on the SBU Kinect Interaction dataset 2.0 [39]. The latter also includes 8 dyadic social activities (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, shaking hands), but in a cropped video scenario. More precisely, the dataset includes two different types of segmented social activity clips (clean and noisy): in the clean case the clip starts and stops tightly around the activity, while the noisy case includes the same videos but more loosely segmented, including other random movements. For these reasons we can only compare our classification model enriched with the proximity-based priors discussed in Sect. 6.
In [39], the authors evaluate the performance of their MILBoost classifier using the two parts of the dataset: the first evaluation considers the classification of each frame of the video, while the second considers the classification of the full video clip. The method proposed in [20] is evaluated on full sequences of the noisy part of the dataset.
We compare our classification approach with the above ones, reporting in Table 5 the accuracy achieved in all four scenarios of the SBU dataset. Since our approach is designed for frame-by-frame classification, to classify a full sequence we select the most frequent label assigned within that video clip. In our experiments, we observed that the most frequent label occurs at least twice as often as the second most frequent one, so this majority vote does not noticeably influence the results. The results show that our approach outperforms the others in terms of accuracy on this dataset. More detailed information about our classification performance is provided by the confusion matrices in Fig. 12, including precision and recall, which were provided only by [20].
Table 5 Accuracy on the SBU dataset
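The majority vote used above to map per-frame labels to a single clip label amounts to the following minimal sketch (the function name and the tie-breaking rule are illustrative).

```python
from collections import Counter

def clip_label(frame_labels):
    """Assign a single label to a video clip by majority vote over the
    per-frame predictions (ties broken by first occurrence)."""
    return Counter(frame_labels).most_common(1)[0][0]

# e.g. clip_label(["hug", "hug", "talk", "hug"]) -> "hug"
```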