1 Introduction

Interdisciplinary research is successfully exploring robotic technologies as personalised social companions to deliver or supplement behavioural interventions [1, 2]. Socially Assistive Robots (SARs) stand out as the most sophisticated among emerging robotic technologies. These robots incorporate audio, visual, and movement interfaces, as well as embedded computing hardware for edge AI, to simulate social behaviour such as complex dialogue with non-verbal communication, emotion recognition, and physical interaction with humans [3]. The primary objective of SARs is to establish a positive and productive interaction with humans, while also providing assistance and improving their quality of life. These robots are often employed in domains such as motivation, rehabilitation, or learning, with the goal of achieving measurable progress in these areas [4]. SARs provide a physical embodiment for intelligent agents, rather than being confined to a digital screen: they can be present in the physical world and directly interact with humans and objects in their environment [5]. They are capable of engaging with users through a rich variety of sensory modalities such as sound, sight, and touch. This allows multiple options for delivering content or interactions, which can be customised to improve their effectiveness based on individual user preferences or physical abilities [6].

Several studies showed that SARs can support the therapy and training of children with Autism Spectrum Disorder (ASD), who have difficulties in social interaction because of their condition, which has a male-to-female prevalence of 4:1 [7]. Children with ASD consider robots' behaviour more predictable than that of humans and, therefore, find it easier to accept robots as social partners [8]. SARs can prompt children with ASD in a realistic social interaction via their physical presence and simulated social abilities, including non-verbal cues like eye gaze, gestures, and posture [9]. Indeed, many clinical studies [10] demonstrated significant benefits in the treatment of children with ASD, e.g. robots can enhance training [11] and perform automated assessment [12]. ASD is a complex condition that also includes difficulties in processing novelty, which can cause anxiety and negative responses in an individual with ASD [13]. In this context, it is of fundamental importance to limit the introduction of novel technological devices to the strictly necessary ones. Indeed, children with ASD can be upset by the introduction of many novel items in their environment; therefore, simplicity is an essential pre-requisite for successfully including new technology in the therapy for the widest range of children with ASD. To monitor and acquire information from the interaction, the use of bulky external setups (e.g. computers, multiple cameras and other devices) should be avoided, as they can cause distress to the child. The best approach is to use only the robot's embedded sensors and computing abilities to record data [14]. This necessity represents a challenge for the application of SARs in real contexts like clinical therapy, because the onboard computing of commercial robotic platforms is limited to account for multiple constraints such as cost, space, heat, and power consumption. In fact, the commercial platforms that are commonly used with children with ASD do not have sufficient computing resources to concurrently control the robot and acquire data from its sensors during the clinical interaction.

In our clinical studies, we minimise the intrusion in the therapeutic setting to avoid upsetting the children. In this article, we investigate the feasibility of, and propose a proof-of-concept prototype for, automatic gesture recognition using only the data collected by the robot's embedded camera, without the use of any other device. The clinical study in which we collected the data consisted of robot-assisted imitation training (see Fig. 4) with six male children (mean chronological age = 104.3 months, range = 66-121, SD = 18.6) with ASD and Intellectual Disability (ID). Two children had a profound level of ID, two a severe level, one a moderate level, and one a mild level. The robot used in the study was the Aldebaran Robotics NAO [15], which is the most common humanoid platform employed in SAR [16]; NAO was used in 80% of the studies in which a humanoid was employed for robot-led therapy of children with ASD [17]. The clinical activities included six encounters, in which the NAO robot prompted the children in three Gross Motor Imitation (GMI) tasks. For each child, the robot's camera recorded the video of 18 procedures (6x3) in total. The robot initiated each procedure by verbally instructing the child with simple and concise language, then prompted the child to imitate its movements. Each session lasted around 6-8 minutes per child, with a 1-minute break between activities to allow the children to rest in the nearby multi-sensory area. More detailed information on the clinical experiment can be found in [18], which provides the details of the methodology and the evaluation of a robot-assisted imitation therapy for children with ASD and ID. During the therapy sessions, the children's imitation of the robot's gestures was recorded to evaluate their performance and track their progress over time. The recordings were manually analysed and labelled to identify the gestures that the children were performing in each frame. These labelled frames form the dataset used in this article.

However, while the use of the embedded camera facilitates the acceptance of the system by the children, it also creates a technological challenge: as is common for many commercial robots, the embedded camera does not provide depth measurements, and it was only able to record images at a frequency of 10 fps and a resolution of 320x240 pixels because of the limitations of the onboard computational resources (CPU and memory), which were also used to control the robot's behaviour during the therapy. This is a common issue with the small robotic platforms used for robot-assisted therapy, which usually have limited computing and sensing on board. Indeed, the native resolution and frame rate of the cameras could be higher, but they are usually restricted by the limited computing capacity of the main processor and memory resources [12]. When working with children, particularly those with ID, it can be difficult to enforce constraints that are necessary for optimising algorithm performance. As a result, it is crucial to be able to accurately estimate a child's visible movements without relying on constraints such as confining them to specific positions, or on external devices such as additional cameras or depth sensors. While such devices can improve performance, they also limit the portability of the system and complicate its integration into a standard therapeutic environment.

The unique contribution of this article can be summarised as follows:

  • Novel application of machine learning techniques for automated gesture recognition on real-world data, collected during robot-led imitation therapy sessions for children with Autism Spectrum Disorder and Intellectual Disability.

  • Identification of optimal parameters for a multi-layer LSTM architecture to maximise accuracy for the assessment of children’s success in therapy.

  • Proof-of-concept evaluation of a low-power commercial embedded system for edge-AI (NVIDIA Jetson) as a potential solution for real-time computation onboard future robotic platforms.

The rest of the article is organised as follows: Section 2 presents an overview of recent results in gesture recognition applied to human-robot interaction; Section 3 provides the details of the machine learning approaches that were evaluated in our computational experiments on the children dataset described above; Section 4 reports the results; Section 5 discusses them; finally, Section 6 gives our conclusion.

2 Review of gesture recognition in Human-Robot Interaction

There are numerous methods for classifying gestures in the literature. In general, the techniques differ in the feature extraction and classification methods they adopt.

Many works combine gesture recognition with OpenPose, manual feature selection and classical machine learning algorithms. In [19], the authors extracted the human pose using OpenPose and recognised the gestures from the resulting time series with Dynamic Time Warping (DTW) and One-Nearest-Neighbor (1NN). Other works instead use additional devices to better identify gestures. In [20], the authors obtained 3D skeletal joint coordinates by combining 2D skeleton extraction with OpenPose and the depth measured by a Microsoft Kinect 2; the 3D coordinates are then used to detect the gesture with a CNN classifier. This system was employed for real-time human-robot interaction.

Gestures can be classified as static or dynamic: a gesture is static if the user assumes a single pose, while it is dynamic when the gesture consists of a sequence of poses. For this reason, the identification of gestures is not trivial and also requires temporal segmentation. Classic gesture recognition methods are based on Hidden Markov Models (HMMs), particle filtering and the condensation algorithm, Finite State Machine (FSM) approaches, Artificial Neural Networks (ANNs), Genetic Algorithms (GAs), fuzzy sets and rough sets. Deep neural networks have become the state of the art in Computer Vision and have also been applied to gesture recognition, outperforming the previous state-of-the-art methods. For a recent review of classic and deep learning techniques see [21].

In [22], the authors used OpenPose to capture the 2D positions of a person's joints and compare imitated gestures with recorded reference gestures. Their goal is to estimate whether real-time movements correspond precisely to standard gestures, using videos of Tai Chi teaching for the comparison. From the joints, they calculated the movement trajectory of each point and defined a similarity metric as the distance between the movement trajectories of the standard and real-time videos. Important properties for describing gestures well are redundancy reduction, robustness, invariance with respect to sensor orientation, signal continuity, and dimensionality reduction. To make the system robust, they modelled the trajectories with Bézier curves, which are robust to input noise. To measure the distance between the recorded gesture and the imitated gesture, they calculated the discrete Fréchet distance. From the joint trajectories they then obtained 12 distances composing a vector, and finally computed a score by applying a weighted distance formula.

Human gesture and activity recognition are among the main topics of human-machine interaction; consequently, there are many works in the literature. In [23], the authors used the difference between subsequent frames of the Microsoft Kinect depth image to recognise eight gestures: CLAP, CALL, GREET, WAVE, NO, YES, CLASP, and REST.

In [24], a system for the simultaneous recognition of gestures from multiple users was introduced, and the results with up to six users showed an accuracy higher than \(90\%\). In [25], a Wi-Fi-based zero-effort cross-domain gesture recognition system (Widar3.0) estimates velocity profiles to characterise the kinetic features of gestures; a deep learning model then exploits spatial-temporal features for gesture recognition. The accuracy achieved is high, near \(90.0\%\), independently of the domain in real environments.

The authors of [19] highlighted the need to communicate with service robots through gestures, for example to draw the robot's attention to someone or something. To avoid using special hardware, they relied only on RGB videos, extracting the pose in the video frames with OpenPose. They present a method for gesture recognition that starts from the pose extracted with OpenPose and uses Dynamic Time Warping (DTW) in conjunction with One-Nearest-Neighbor (1NN) for time-series classification. Before passing the joint coordinates to the DTW classifier, the key points are normalised to achieve scale and translation invariance, so that they do not depend on the relative position of the person with respect to the camera. One of the main advantages of this approach is the ability to easily add new gestures. To reduce the number of signals processed by DTW, they considered the signal variance: all signals with a low variance, indicating no motion, were considered uninformative. For the classification with 1NN, they used the warping distance as the metric.

In [26], the authors propose an approach based on the temporal and spatial relationships between joints and joint pairs. To alleviate the variation of the temporal sequence, they propose a new Temporal Transformation Module (TTM). Finally, all extracted features are merged into a multi-stream architecture and then classified by a fully-connected layer. This kind of approach has been tested on datasets such as ChaLearn 2013, ChaLearn 2016 and MSRC-12, obtaining very good results.

In [27], a fully self-attentional action-transformer architecture is used for skeleton-based action recognition. The skeleton poses are extracted from 2D videos with OpenPose [28] and, similarly to BERT and Vision Transformers, the pose sequences are represented as embeddings that are fed to a Transformer encoder; the output is then passed to a linear classification head. This model outperforms more elaborate networks that mix convolutional, recurrent and attentive layers, such as MS-G3D (J+B) [29], MS-G3D (J) [29] and ST-TR [30], on the MPOSE2021 dataset.

In recent years, there has been a surge of interest in developing accurate and efficient methods for gesture recognition. A number of research papers have been published, each proposing different approaches to address this challenging problem. Some of the most promising methods include Convolutional Transformer Fusion Blocks [31], Spike representation of depth image sequences with spiking neural networks [32], and Deep Hybrid Models that combine Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks [33].

One common theme among these papers is the focus on hand gesture recognition. While accurate hand gesture recognition is undoubtedly important, other factors need to be considered as well. For example, our work on the recognition of gestures in autistic children takes a more holistic approach by considering the entire body of the child as they attempt to imitate the gestures of a humanoid robot.

In this work, we investigate the automatic recognition of gestures using only the RGB camera in the robot's forehead and the video recordings collected during the previous study. We should point out that, even though the video quality of the NAO camera is very low because of the limited computing resources, it has been demonstrated that the skeleton joints can be successfully extracted with OpenPose [28] with very good accuracy, even in the case of occlusions [34].

3 Methods

In this work, we aimed to automate, and make more objective, the assessment of the success or failure of the children's imitation of the robot's gestures in a clinical setting. To this end, we investigated the use of neural networks, in particular Long Short-Term Memory (LSTM) recurrent networks, for gesture recognition in this setting. The approach is divided into three steps: first, the human skeleton pose and the temporal features between the different poses are automatically extracted from a low-resolution video sequence taken during the clinical therapy; second, the resulting features are classified into the possible gestures; finally, the recognised gesture is compared with the one performed by the robot to assess the success or failure of the imitation.

To make it applicable in real clinical settings, our approach relies only on the built-in camera of the NAO robot to recognise the child's gestures from a sequence of 2D poses with a deep recurrent neural network made of two LSTM layers. This minimises the intrusion into the child's space, which makes the system more acceptable and suitable for real-world application than previous experimental settings, e.g. [11].

For the feature extraction we selected the OpenPose algorithm [28] (version 1.7.0), the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints, and the state-of-the-art algorithm for extracting the human pose from the image frames of the videos. OpenPose uses a bottom-up approach and has a constant runtime with respect to the number of people, unlike Alpha-Pose [35,36,37] (a top-down approach) and Mask R-CNN [38] (similar to a top-down approach). It achieves better accuracy and average confidence than PoseNet (a similar but lighter approach) and, with a MobileNet [39] backbone, OpenPose can also run on devices with low computing performance.

In the videos there are many occlusions, and the robot sometimes looks away from the child because of its own gesture movements. Moreover, the children are always on the move, and additional devices in the therapeutic environment, such as extra cameras or a Microsoft Kinect, can be problematic for them. The use of high-resolution cameras or a Microsoft Kinect could increase performance, but it would limit the portability of the system [12]. In [34], OpenPose [28] was compared with the Microsoft Kinect: the results showed that OpenPose is accurate at recognising gestures and can overcome the failures of the Kinect. We therefore assumed that the OpenPose solution on 2D video is robust enough for gesture recognition; in a preliminary analysis, we found that it is much more accurate than the Kinect when there are occlusions in the videos.

A secondary aim of our investigation was to evaluate alternative technologies that allow recognition in real time, which may prompt autonomous adjustments of the robot's behaviour to the child's performance level during the therapy. To this end, we tested the inference time of our recognition system on an NVIDIA Jetson TX2 to explore the feasibility of real-time gesture recognition when the robot has AI acceleration integrated onboard.

Fig. 1 In these examples, the children's height is always smaller than the caregivers' height

Fig. 2 Our model is based on two LSTM layers that take as input the skeleton sequence and the expected action, and give as output the action/activity performed by the user. Note that the goal is not the prediction of the action itself, but the verification of whether the child has imitated the robot's movements

We would like to point out that the focus on this real-world application and on low resources makes the use of large deep neural network architectures unfeasible, since they would require significant additional computation on the robot and drain the battery very quickly. Cloud computing could also be considered, but it is not easily applicable to this clinical application: streaming clinical therapy sessions over the network would create significant security concerns due to the sensitive and private nature of the data and, therefore, significant overheads and further delays to secure the data via encryption/decryption.

3.1 Gesture recognition approach

The proposed method is divided into three steps. In the first step, each frame is processed by OpenPose [28], a real-time pose estimator based on deep convolutional networks, which returns the human pose in a time that depends on the available computational power (see for instance Fig. 4). We considered only the child's pose, discarding the caregiver's pose, which can be distinguished by its greater height (see Fig. 1).
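Below is a minimal sketch, for illustration only, of how this first step could be implemented with the OpenPose 1.7 Python bindings; the configuration parameters, the confidence threshold and the height-based heuristic for separating the child from the caregiver are our assumptions, not the exact code used in the study.

```python
# Illustrative sketch of step 1: extract 18-joint COCO poses with OpenPose and
# keep only the child's skeleton, assumed to be the shortest detected person
# (see Fig. 1). Parameter values and thresholds are assumptions.
import numpy as np
import pyopenpose as op

params = {"model_folder": "models/", "model_pose": "COCO", "net_resolution": "-1x176"}
wrapper = op.WrapperPython()
wrapper.configure(params)
wrapper.start()

def extract_child_pose(frame):
    """Return the (18, 2) array of the child's joint coordinates, or None."""
    datum = op.Datum()
    datum.cvInputData = frame                      # BGR image from the NAO camera
    wrapper.emplaceAndPop(op.VectorDatum([datum]))
    people = datum.poseKeypoints                   # (num_people, 18, 3): x, y, confidence
    if people is None or len(people) == 0:
        return None

    def bbox_height(person):
        ys = person[person[:, 2] > 0.1, 1]         # y coordinates of confident joints
        return ys.max() - ys.min() if len(ys) else np.inf

    child = min(people, key=bbox_height)           # child assumed shorter than caregiver
    return child[:, :2]
```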

Despite the occlusions, OpenPose was able to extract the human joints even when some of them were missing from the frame. After gathering the data from each video and transforming them into human pose sequences of 18 joints, we normalised the data according to the following equations, which are applied to each joint (X, Y) assuming that the image centre is the origin (0, 0):

$$\begin{aligned} X = \left\lfloor X + 0.5 \cdot \textit{width}\right\rceil ; \quad Y = \left\lfloor Y + 0.5 \cdot \textit{height}\right\rceil \end{aligned}$$

where width and height are the image dimensions of the video. Normalisation allows gestures to be better described by making the data invariant with respect to the person’s height and positioning relative to the sensor.
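A minimal sketch of this per-joint transformation, assuming the 320x240 resolution of the NAO camera, is given below; the function name and the use of NumPy rounding are our own choices.

```python
import numpy as np

def normalise_pose(pose_xy, width=320, height=240):
    """Apply the equation above to an (18, 2) array of joint coordinates whose
    origin is the image centre: shift by half the image size and round to the
    nearest pixel."""
    shifted = pose_xy + 0.5 * np.array([width, height])
    return np.rint(shifted).astype(int)
```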

In the second step, the human poses extracted from each video frame are given as input to a deep model based on LSTMs, as in [40]. This model (see Fig. 2) automatically extracts the temporal features of the pose sequence. We used 84 and 66 units respectively for the first and the second LSTM layer in the “Already Seen” setting, and 80 and 64 units in the “Leave Child Out” and “Interleave” settings. The different models were trained for 300 epochs, using the Xavier uniform kernel initialiser and the Adam optimisation algorithm for gradient descent.

The final step consists in classifying the gestures with a fully-connected layer with a softmax activation function and 5 nodes, corresponding to the number of classes. During the experiments, 207 videos of about 1.10 minutes each, at about 10 fps, were recorded for the six children. There are four gestures: “kiss”, “clap the hands”, “greeting”, and “raise the arms”. We also added a “failure” class to label imitation failures.

Fig. 3 A diagram graphically explaining the sliding window approach. The information from the sliding window is processed recursively along the frames of the video sequence and, at the end of this recursive process, the approach returns the final activity or action performed. An activity or action is predicted for each sliding window; the final activity is the one with the highest frequency

Fig. 4 Four video frames showing the children (and their caregivers) with their skeleton joints recognised by OpenPose [28]

We then trained our model with different configurations using a sliding window approach (see Fig. 3) with steps of one and two frames. A step approach can also be found in [41] and [40], where the authors combine the results of different steps to consider different temporal scales; in contrast, we do not combine the different steps. We used sliding windows of 5, 10, 15, 20 and 25 frames. The input of the model is composed of a sequence of human skeleton joints, normalised according to the image dimensions, and the label of the gesture performed by the robot (“kiss”, “clap the hands”, “greeting”, “raise the arms”). The output is one of the four gesture labels or the label “failure” in case the child fails to imitate the robot.
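The sketch below illustrates, under our own assumptions about the data layout, how the sliding window and the majority vote of Fig. 3 could be implemented; the model is the two-input LSTM classifier sketched in the next code block.

```python
from collections import Counter
import numpy as np

def sliding_windows(pose_sequence, window=5, step=1):
    """Yield fixed-length windows of consecutive normalised poses; `window` is
    the timestep (5, 10, 15, 20 or 25) and `step` the stride (1 or 2 frames)."""
    for start in range(0, len(pose_sequence) - window + 1, step):
        yield np.asarray(pose_sequence[start:start + window])

def classify_video(pose_sequence, robot_gesture_onehot, model, window=5, step=1):
    """Predict a label for every window and return the most frequent one
    (majority vote over the windows, as in Fig. 3)."""
    votes = []
    for win in sliding_windows(pose_sequence, window, step):
        x_pose = win.reshape(1, window, -1)                        # (1, timesteps, 18 joints * 2)
        x_gest = np.asarray(robot_gesture_onehot).reshape(1, -1)   # (1, 4)
        probs = model.predict([x_pose, x_gest], verbose=0)
        votes.append(int(np.argmax(probs)))
    return Counter(votes).most_common(1)[0][0]
```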

The deep model based on LSTMs is composed of two LSTM layers that take the pose sequence as input. The features extracted from the sequence are concatenated with the label of the gesture performed by the robot (“kiss”, “clap the hands”, “greeting”, “raise the arms”), encoded with one-hot encoding.
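A minimal Keras sketch of this architecture is shown below, using the unit counts of the “Already Seen” setting (84 and 66); the input shapes, the loss function and the layer names are our assumptions, while the Xavier (Glorot) uniform initialiser and the Adam optimiser follow the description above.

```python
from tensorflow.keras import layers, Model

def build_model(timesteps=5, n_joints=18, lstm1=84, lstm2=66,
                n_robot_gestures=4, n_classes=5):
    # Sequence of normalised 2D joints: (timesteps, 18 joints x 2 coordinates).
    pose_in = layers.Input(shape=(timesteps, n_joints * 2), name="pose_sequence")
    # One-hot encoding of the gesture performed by the robot.
    robot_in = layers.Input(shape=(n_robot_gestures,), name="robot_gesture")

    x = layers.LSTM(lstm1, return_sequences=True,
                    kernel_initializer="glorot_uniform")(pose_in)
    x = layers.LSTM(lstm2, kernel_initializer="glorot_uniform")(x)

    # Concatenate the temporal features with the robot-gesture encoding,
    # then classify into the 4 gestures plus the "failure" class.
    x = layers.Concatenate()([x, robot_in])
    out = layers.Dense(n_classes, activation="softmax",
                       kernel_initializer="glorot_uniform")(x)

    model = Model(inputs=[pose_in, robot_in], outputs=out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_model()
# model.fit([pose_windows, robot_labels], child_labels, epochs=300)
```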

3.2 Settings

Three evaluation settings are proposed to assess the results of our approach:

  • Already Seen, called “have seen” in [42]: the training data is composed of the data of five children plus a randomly selected half of the sixth child’s data; the test data is the remainder of the sixth child’s data;

  • Leave Child Out: the model is trained on five children and tested on the sixth; in the literature, the same configuration is also named “new person” or “leave-one-out cross-validation” [42];

  • Interleave: similar to the “Leave Child Out” setting, but the gestures of the different children are interleaved to take into account the significantly different quality and efficacy of the gesture executions.
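The sketch below illustrates how the first two splits could be constructed; the data structure and the random half-split of the sixth child are our assumptions, and the interleaving of the third setting is only indicated as a comment since its exact implementation is not detailed here.

```python
import random

def already_seen_split(data, test_child, seed=0):
    """Train on five children plus a random half of the sixth child's samples;
    test on the remaining half. `data` maps child id -> list of samples."""
    rng = random.Random(seed)
    sixth = list(data[test_child])
    rng.shuffle(sixth)
    half = len(sixth) // 2
    train = [s for c, samples in data.items() if c != test_child for s in samples]
    return train + sixth[:half], sixth[half:]

def leave_child_out_split(data, test_child):
    """Train on five children; test on the held-out sixth child."""
    train = [s for c, samples in data.items() if c != test_child for s in samples]
    return train, list(data[test_child])

# The "Interleave" setting is obtained from the Leave Child Out split by
# additionally interleaving the gesture executions of the training children
# (e.g. alternating samples from different children) before training.
```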

3.3 Comparison with classical ML methods for gesture recognition

We compared the proposed approach with classical machine learning methods using Weka [43]. We tested these algorithms both with and without normalisation; the results show a general improvement in accuracy when the normalisation with respect to the frame resolution is not applied. The pose sequences were processed to extract the 5 most significant poses: we applied the K-means clustering algorithm to search for 5 clusters, and the 5 centroids represent the 5 most significant poses of the gesture sequence (a minimal sketch of this step is shown after the list below). The 5 poses extracted from each instance are the samples of the training dataset for the classical ML classifiers. We used the following supervised classification algorithms [44] to compare with our proposed approach:

  • Bayesian Network is a probabilistic model that represents a set of stochastic variables and their conditional dependencies using a directed acyclic graph (DAG);

  • HMM (Hidden Markov Model) is a Markov chain in which the states are not directly observable; it is widely used to recognise temporal patterns in time series;

  • Naive Bayes is a simplified Bayesian classifier that assumes the independence of the features;

  • SVM (Support Vector Machine) is a model that represents data as points in space, mapped so that each data point can be assigned to a class;

  • J48 [45] is the implementation in Weka of the C4.5 algorithm, based on decision trees;

  • Random Forest is a classifier obtained from the aggregation of multiple random decision trees;

  • Random Tree is based on random decision trees.
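A minimal sketch of the K-means pose-extraction step, under our own assumptions about the feature layout and the ordering of the centroids, is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans

def five_key_poses(pose_sequence, k=5, seed=0):
    """Cluster the frames of one gesture sequence with K-means and return the
    k centroid poses, ordered by the time at which each cluster first appears."""
    frames = np.asarray(pose_sequence).reshape(len(pose_sequence), -1)  # (T, 36)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(frames)
    first_seen = [int(np.argmax(km.labels_ == c)) for c in range(k)]
    return km.cluster_centers_[np.argsort(first_seen)]                  # (5, 36)

# One training instance for the Weka classifiers: the 5 poses flattened into a
# single feature vector, optionally concatenated with the one-hot label of the
# robot's gesture (the additional information mentioned in Section 4.1).
# instance = np.concatenate([five_key_poses(seq).ravel(), robot_gesture_onehot])
```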

Table 1 Accuracy results for the three settings with a step of 1 frame using our method

4 Results

Three different settings, two different steps and five different timesteps were tested using our deep LSTM model, obtaining the results shown in Table 1. Figure 5 shows two confusion matrices for the AlreadySeen setting with steps of 1 and 2 frames. The final average accuracy is very good, especially considering that the number of failure instances is almost equal to the sum of the successes. In general, we obtain very good recognition of the failures and successes of the children's imitation despite the movement of the NAO camera during gesture execution and despite the low resolution. We would like to emphasise the best results (see Tables 1 and 2), obtained with a timestep of 5, and in general the tendency to exceed \(90.00\%\) accuracy. The worst accuracy is obtained with “Interleave” and a timestep of 25: \(87.13\%\) with step 1 and \(87.06\%\) with step 2. The results gradually rise as the timestep decreases; indeed, the best accuracy is obtained in the “Already Seen” setting, with \(94.56\%\) and \(94.13\%\) for steps 1 and 2 respectively.

4.1 Computational performance and power consumption evaluation

Fig. 5 Two confusion matrices for the AlreadySeen setting with steps of 1 and 2 frames

Table 2 Accuracy results for the three settings with a step of 2 frames using our method

Measured over 1000 frames, OpenPose takes on average \(0.13 \pm 0.01\) s per frame, while our model takes on average \(0.03 \pm 0.00\) s on an entire sequence of 25 frames. We used the MobileNet network in the OpenPose algorithm to decrease the computational time on the Jetson TX2. We compared our method with classical machine learning algorithms (Table 3). The classifiers used in the comparison are the following: SVM (Support Vector Machine), Bayesian Network, HMM (Hidden Markov Model), J48, Random Forest, and Random Tree. The results of the SVM and the HMM algorithms are identical, while in general all the other algorithms, except the Random Forest, have statistically worse results than the SVM and HMM algorithms at a significance level of 0.05. The Random Forest algorithm performs better than the SVM and the HMM only in the “AlreadySeen” setting. In short, our deep model has statistically better results than all the tested machine learning algorithms at the significance level of 0.05. An additional piece of information available to the classifiers is the gesture of the robot that the child must imitate: we noticed that the final results improve slightly when this information is added to the 5 poses extracted from the gesture sequence.
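As a rough, indicative estimate (assuming that OpenPose is applied to every frame with step 1 and to every other frame with step 2, and taking the longest 25-frame window as an example), the per-window processing time on the Jetson TX2 is approximately

$$\begin{aligned} t_{step=1} \approx 25 \times 0.13 + 0.03 \approx 3.28\ \text{s}, \qquad t_{step=2} \approx 13 \times 0.13 + 0.03 \approx 1.72\ \text{s}, \end{aligned}$$

which illustrates why the step-2 configuration almost halves the overall computation (see also Section 5).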

Finally, we performed a power consumption analysis in order to provide an indicative evaluation for the future integration of an edge-AI board like the NVIDIA Jetson TX2 into the robot. The analysis was made by measuring the current drawn by the board and the supply voltage from the standard power brick (AC to DC power converter). First, we measured the baseline current, which was on average 240 mA with a standard deviation (st. dev.) of 20 mA; then we measured the current drawn during the inference, which was on average 491 mA with a st. dev. of 38 mA and a peak of 533 mA. The supply voltage was almost constant at 19 V, with a st. dev. of 0.02 V. These measurements show that the gesture recognition with our method consumes on average only 4.77 W on the NVIDIA Jetson TX2, with a peak consumption of 10.14 W (including the baseline consumption).
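For clarity, the reported power figures follow directly from the measured currents and supply voltage:

$$\begin{aligned} P_{avg}&= (I_{infer} - I_{base}) \times V = (0.491 - 0.240)\,\text{A} \times 19\,\text{V} \approx 4.77\,\text{W},\\ P_{peak}&= I_{peak} \times V = 0.533\,\text{A} \times 19\,\text{V} \approx 10.1\,\text{W}. \end{aligned}$$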

Table 3 Accuracy results for the three settings with ML methods

This power consumption is theoretically compatible with the battery specifications of a small robot like NAO, which has a 48.6Wh battery with a nominal voltage of 21.6V and a maximum current of 2A, with a maximum peak consumption of 43.2W.
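As an indicative estimate that ignores the power drawn by the robot's own motors and electronics, the recognition workload alone would drain the battery in roughly

$$\begin{aligned} \frac{48.6\,\text{Wh}}{4.77\,\text{W}} \approx 10\,\text{h}, \end{aligned}$$

while the peak consumption of about 10 W remains well below the 43.2 W that the battery can deliver.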

5 Discussion

The results present a solution to the challenge posed by the clinical requirement not to introduce additional devices: indeed, the only device used to acquire data was the built-in camera of the NAO robot, which operates at a low resolution (320x240) and a low frame rate (10 fps).

Our method incorporates a variety of techniques, including motion capture, computer vision, and machine learning, to accurately recognise the gestures of autistic children. By considering the entire body, we are able to capture a wider range of subtle movements that may be missed by methods that focus only on the hands.

The results show that the proposed algorithm was able to efficiently deal with the lack of depth information by extracting the 2D poses of the children with the OpenPose algorithm.

Another practical problem was the motion of the NAO while performing the gestures. The video recorded by the camera fixed on the robot's forehead was unstable and, in a few cases, the view of the child's movements was partially occluded by the robot's own movements (head, torso, arms and hands). This problem was addressed by investigating different timesteps and taking the most frequent result from a time window corresponding to the time spent by the robot performing the gesture to imitate.

Another issue we faced is that the dataset is unbalanced, since it contains many instances of children's failures: the number of successful gestures in the test set is about half of the number of failures. Although the results of the LSTM model with step 1 and timestep 5 are slightly better, in general the step-2 configuration behaves well with the various timesteps. This result is useful for reducing the computational load: the step-2 approach has a shorter inference time on an embedded AI acceleration device like the NVIDIA Jetson TX2, which combines good performance and low power consumption. In practice, it reduces the computation almost by half by applying OpenPose every two frames.

We highlight that the LSTM model significantly exceeded the results of the machine learning algorithms used for comparison. We would also like to remark that the articles mentioned in the related work report an accuracy of around 90-93% on synthetic data, while our approach achieves the same levels of accuracy on real-world data. Furthermore, by considering the entire body, we are able to provide a more comprehensive understanding of the child's behaviour and their attempts to interact with the robot.

Fig. 6 Computational performance in seconds of the OpenPose algorithm and our model on NVIDIA Jetson TX2

We also provided a proof-of-concept evaluation of the use of state-of-the-art off-the-shelf embedded systems for edge AI. We tested the performance of the NVIDIA Jetson TX2 (see Fig. 6), which is increasingly used in studies that require AI algorithms to run on low-cost, low-power platforms [46]. This proof-of-concept demonstration provides experimental information that will guide the design of future robots for robot-led therapy, which will be able to evaluate the children's behaviour in real time and therefore adapt the interaction to personalise the clinical intervention autonomously.

6 Conclusion

In this work, we studied the automation of imitation recognition during robot-assisted training of children with Autism Spectrum Disorder. We used a new dataset collected during a clinical study with children with ASD in a real, unconstrained setting; the clinical study provided low-resolution videos recorded by the robot's camera during the robot-led therapy. The aim of automating this task is to guarantee the objectivity of the evaluation and to provide data for the continuous assessment of progress during the therapy. This technological solution can overcome the limitations of manual annotation, which is a long and tedious process that requires multiple assessors to ensure impartiality, with a considerable cost for healthcare providers.

From an applied perspective, a fundamental point of our approach was to comply with the clinical requirements, i.e. to reduce the intrusion by using only the camera that is embedded in the robotic platform. Indeed, children with ASD may be upset by the introduction of many novel items in their environment; therefore, simplicity is an essential pre-requisite for the inclusion of any technology in the actual therapy. At the same time, this creates a technological challenge because the embedded camera does not provide depth measurements and was only able to acquire images at a frequency of 10 fps and a resolution of 320x240 pixels because of the limitations of the onboard computational resources (CPU and memory). Considering the lack of depth images, we opted for the OpenPose algorithm, which is more accurate than the Microsoft Kinect when there are occlusions in the videos.

The proposed method to automatically evaluate the gestures is a deep model based on LSTMs. Three settings were used to test the model: “AlreadySeen”, “Interleave” and “LeaveChildOut”. To enhance the performance of the deep model, we tested five different timesteps (5, 10, 15, 20, 25) and two steps (1 and 2). The final results show a very good accuracy: on average \(93.01\%\) with timestep 5 and step 1. We compared these results with some classic machine learning algorithms: the results of the deep model are statistically better than those of the proposed ML algorithms at the significance level of 0.05. Finally, given the low computational power of the NAO robot, we tested our model with OpenPose on an NVIDIA Jetson TX2, an embedded AI computing device, in order to evaluate the feasibility of assessing the level of imitation during the therapy in real time. In a production deployment, the deep LSTM model with step 2 would almost halve the computational time needed to predict the gesture, because the joints would be calculated with OpenPose not for each frame but only for every second frame.