1 Introduction

Aggression is defined as behaviour by a living being that is meant to cause harm, violate rights, or hurt others either psychologically or physically [1]. According to the American Psychological Association (APA), aggression can be intended to cause harm (i.e., hostile), not intended to cause harm (i.e., instrumental), or emotionally motivated (i.e., affective) [2]. Aggression among children was reported to be more frequent during the pre-school years and early childhood, in the form of physical aggression such as biting and hitting [3, 4]. The reported rates of aggression are high among children with psychiatric disorders (i.e., 45.8% to 62.3%), and aggression is one of the reasons for mental health referrals among children [5, 6].

Children with autism have higher rates of challenging behaviours and aggression compared to others with developmental disabilities, and these behaviours were reported to start in infancy [7,8,9,10]. The rates of challenging behaviours are high (e.g., 49% to 69%) and increase with the severity of autism [11,12,13,14,15]. A challenging behaviour might manifest as a meltdown, a tantrum, withdrawal, or a stereotypical behaviour that could pose a risk to the children themselves or to others around them [16, 17].

Technological advances have accelerated the integration of robots in healthcare to rehabilitate, monitor, assist in surgeries, and improve the quality of patients’ lives [18,19,20]. Social robots in autism therapy have gained a lot of attention in the past decade due to the reported positive outcomes, such as increased attention and imitation [21]. However, placing this form of technology, which is meant to elicit behaviours, in the vicinity of children with autism might trigger negative or undesirable reactions. For example, some studies reported challenging behaviours and aggression during interaction sessions with social robots [22,23,24]. Some forms of challenging behaviours, such as throwing a small robot or a toy, might cause harm, especially if the object hits another person’s head [25]. Design considerations are needed to account for such scenarios [26,27,28,29].

Social robots can be used to address the issue of aggression and undesirable behaviours among children to prevent progression and potential harm. In combination with other sensors and wearables, a social robot can identify the occurrence of these negative interactions of children with their surroundings (e.g., toys) and respond appropriately to that action [24, 30]. The robot’s responses to the child can take on different forms (e.g., gestures and sounds) and should be clear enough for the child to comprehend [31]. To date, limited work has been done to identify such interactions and the means to address them [23, 32, 33].

In this study, we evaluate the performance of five machine learning techniques in characterizing five possible interactions and an idle state. We examine the effects of adding different combinations of data and extracted features acquired from two sensors on the performance and speed of prediction. Additionally, we test the performance of the developed model with children.

The contributions are summarized as follows:

  1. Evaluating the performance of different machine learning techniques and their prediction speed.

  2. Studying the effects of different combinations of raw data and extracted features acquired from two sensors.

  3. Testing the best performing model with interaction data acquired from children.

This paper is organized as follows. Section 2 describes the background. Section 3 describes the materials and methods. Section 4 provides the results, Section 5 presents the discussion, and Section 6 concludes the paper.

Fig. 1 An overview of a system consisting of a social robot and a detection device that is meant to monitor and provide a response to a child during undesirable interactions. a The toys that were considered in this study. b A scenario where a child throws a toy. c The detection device inside the toys detects this action and sends a command to the companion social robot. d The social robot receives the commands from the detection device and responds accordingly (adapted from [31, 32])

2 Background

Reactions in social robotics are essential to establish meaningful interactions. There are a few commercially available robots that exhibit responses once manipulated or handled in a specific way. Professor Einstein (Hanson Robotics, Hong Kong) is one example of a small humanoid robot that resembles an actual human. Along with the capability of being integrated wirelessly with a mobile app, this robot can track faces and perform certain preprogrammed interactions, such as telling a joke and pointing its hand. PARO is an interactive animaloid robot that is made to resemble a seal [34]. PARO can perform limited physical interactions using light, audio, and tactile sensors once handled in a specific way. For example, it emits sounds when it is stroked.

Sensors and wearable devices are being used to acquire data of different modalities to assess the activities and conditions of the users [35,36,37]. Solutions based on motion sensors have been used in healthcare applications, such as detecting falls among the elderly using wearable devices [38, 39]. For example, a study used an accelerometer embedded in a belt to detect falls with an accuracy of 99.4% using a machine learning classifier [39]. In another healthcare application, a study considered using a wearable device to predict the occurrence of challenging behaviours among children with autism using machine learning techniques [30]. Motion sensors are also being considered in applications that require direct interactions with robots, such as robot-based games [40, 41]. For example, a study considered a tri-axial accelerometer to detect a player’s motions relative to a robot, such as dodging and running [42].

Few studies have been conducted to classify the interactions that might occur between a child and a robot [23, 43]. One study used a ball-like robot to categorize interactions (e.g., kicking and pickup) using an accelerometer and gyroscope embedded in the robot [33]. The study considered data acquired from adult participants to train a supervised machine learning model that was then tested with children’s data, achieving an accuracy of 49%. In an earlier study, we considered the magnitude of raw acceleration data over a small window size of 25 samples to characterize six possible interactions and scenarios between a child and a social robot [32]. The considered behaviours were hit, throw, drop, shake, pickup or carry, and being idle. The neural network model achieved 80% accuracy when tested with data acquired from children. In another work, we investigated the influence of the reaction time of a robot’s response on the children’s comprehension when an undesirable behaviour is performed with the robot [31]. The findings highlight the importance of providing a quick response once an unwanted interaction has been detected.

3 Materials and Methods

3.1 Participants

3.1.1 Data Collection

The data were collected from six adult participants performing different undesirable interactions with three different toys (Fig. 1a). The considered interactions were hitting, throwing, shaking, carrying, being idle, and dropping. A total of around six thousand instances were collected from the adult participants for the 6 classes. Idle was considered to cover the no interaction case while carry was considered because it might be a precursor for other interactions. The data were then annotated highlighting the interactions. Handcrafted features were extracted from the annotated data over a window size of 30 samples. More details about the collection procedures and access to the raw data can be found in earlier work [32, 44].

3.1.2 Evaluation

Data acquired from ten children were used in the evaluation of the best developed machine learning model. The total duration of their interactions was around 30 min (around 3 min per child), with an average of 176 instances per session. Children performed three scenarios with the three toys (Fig. 3a). In each scenario, the children were told an imaginative scenario to perform an interaction (Table 1), for example, “You need to pick the robot up, and shake it to wake it up.” Parental consent was secured by the school and the children were accompanied by their teachers. The procedures for this work did not include invasive or potentially hazardous methods and were in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki).

Table 1 The experimental protocol for the evaluation experiments conducted in this study

3.2 Experimental Setup

3.2.1 Detection Device

The device that is meant to detect the undesired interactions and to send commands to the social robot was based on a Raspberry Pi. Raspbian (v4.14, Debian Project) was used as the operating system. The Raspberry Pi was fitted with a Sense Hat board that contains different sensors and a display (Fig. 1c). The Sense Hat contains an IMU (LSM9DS1, STMicroelectronics) that comprises an accelerometer, gyroscope, and magnetometer. The built-in accelerometer can acquire acceleration values of up to ± 16 g at 30 Hz while the gyroscope can measure angular rates of up to ± 2000 dps. This device has the flexibility of being embedded in different robotic forms to acquire new data if required. A dedicated power bank was used to power the recognition devices inside the toys. The device was used to test the developed machine learning models that were used in the detection of undesired interactions.
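As an illustration only, the following sketch shows how a device of this kind might sample the IMU. It assumes the `sense_hat` Python library is installed on the Raspberry Pi; the helper names `make_sense_hat_readers` and `collect_samples` are hypothetical and are not taken from this work. The 30 Hz rate follows the text, and the reader functions are passed in so that the loop can be exercised without the hardware.

```python
import time


def make_sense_hat_readers():
    """Return (read_accel, read_gyro) callables backed by a Sense HAT.

    The import is done lazily so the rest of the sketch runs without hardware.
    """
    from sense_hat import SenseHat  # assumed to be installed on the device

    sense = SenseHat()
    # Both calls return a dict with 'x', 'y', and 'z' keys.
    return sense.get_accelerometer_raw, sense.get_gyroscope_raw


def collect_samples(read_accel, read_gyro, n_samples, rate_hz=30.0, sleep=time.sleep):
    """Collect n_samples rows of (ax, ay, az, gx, gy, gz) at roughly rate_hz."""
    period = 1.0 / rate_hz
    rows = []
    for _ in range(n_samples):
        a = read_accel()
        g = read_gyro()
        rows.append((a["x"], a["y"], a["z"], g["x"], g["y"], g["z"]))
        sleep(period)  # simple pacing; ignores the time spent reading the sensors
    return rows
```

On the device, `collect_samples(*make_sense_hat_readers(), n_samples=30)` would return one window of readings; the simple sleep-based pacing is a simplification and a real implementation may need tighter timing.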

3.2.2 Social Robot

Three different toys were considered representing different forms of social robots. Each toy was embedded with a recognition device. The first toy was a stuffed panda, and the second toy was a stuffed toy robot, while the third toy was an excavator toy (Fig. 1a).

3.3 Development of Machine Learning Models

3.3.1 Algorithms

Five different machine learning algorithms were considered to evaluate their efficacy in distinguishing between the 6 classes. All the machine learning algorithms were developed in the Python programming language. The considered algorithms are listed below:

Decision Tree (DT): A decision tree is a tree-like method to help in making decisions by listing all the possible outcomes. A typical DT model consists of nodes (e.g., decision nodes, internal nodes, and leaf nodes) and a hierarchy of branches that are constructed from building steps such as splitting, stopping and pruning.

Random Forest (RF): RF is a classification algorithm that consists of multiple decision trees trained on different portions of a training set.

K-Nearest Neighbor (KNN): KNN is a non-parametric algorithm that finds the k closest examples in a dataset. KNN can be used in both classification and regression.

Multilayer-Perceptron (MLP): MLP is a neural network that consists of an input layer and an output layer with one hidden layer in between. More complex configurations may include several hidden layers.

EXtreme Gradient Boosting (XGBoost): XGBoost is an ensemble machine learning model based on decision trees that makes use of the gradient boosting framework.

3.3.2 Data Format

Combinations of raw signal data and extracted time-domain features were used in testing the machine learning models (Fig. 2). The raw data contained the signals acquired from the gyroscope and accelerometer. Additionally, the magnitude of acceleration (A) was calculated. The extracted time-domain features were the maximum, minimum, mean, and standard deviation. These features were extracted from the gyroscope and accelerometer raw data over a window size of 30 samples. The extracted features and raw data were used to develop and test the machine learning algorithms. Balanced data for each class were considered in the training of the machine learning models. Unseen samples for each behaviour were used in testing. These samples were also used to calculate the speed of prediction (i.e., test time) for each algorithm.
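The construction of one input vector can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the ordering of the vector (raw samples, then the acceleration magnitude series, then the features per channel), and the use of the population standard deviation are all assumptions made for illustration. Only the window size of 30 samples and the four features come from the text.

```python
import math
import statistics

WINDOW = 30  # window size in samples, as stated in the text


def magnitude(ax, ay, az):
    """Euclidean norm of one tri-axial acceleration sample."""
    return math.sqrt(ax * ax + ay * ay + az * az)


def time_features(values):
    """Max, min, mean, and standard deviation of one channel in a window."""
    # Population standard deviation is an assumption; the text does not specify.
    return [max(values), min(values), statistics.fmean(values), statistics.pstdev(values)]


def window_vector(window):
    """Build one input vector from a window of (ax, ay, az, gx, gy, gz) rows.

    Assumed layout: raw samples flattened row by row, then the acceleration
    magnitude of each sample, then four time features for each raw channel.
    """
    if len(window) != WINDOW:
        raise ValueError("expected %d samples, got %d" % (WINDOW, len(window)))
    raw = [v for row in window for v in row]
    mags = [magnitude(r[0], r[1], r[2]) for r in window]
    channels = [list(col) for col in zip(*window)]  # six per-axis series
    feats = [f for ch in channels for f in time_features(ch)]
    return raw + mags + feats
```

Dropping `raw` or `feats` from the returned list would give the feature-only or raw-only configurations, and slicing the rows to the first or last three columns would restrict the vector to a single sensor.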

Fig. 2 The components of the input data that were considered in the training of the machine learning models. The input vector consists of the raw acceleration data, raw gyroscope data, and extracted time features for both

Table 2 The evaluation metrics results for the five tested algorithms and their corresponding training and testing times

3.3.3 Evaluation Metrics

All trained models were evaluated based on the accuracy, precision, recall, and F1-score. Additionally, the training and testing time for each algorithm were calculated. The relationships of the evaluation metrics are as follows:

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}} \tag{1}$$
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{2}$$
$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{3}$$
$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
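For a multi-class problem such as this one, Eqs. (1) to (4) can be computed per class and then averaged. The sketch below is illustrative: the function names are hypothetical, and the unweighted (macro) average is an assumption, since the averaging scheme is not stated in the text.

```python
def accuracy(y_true, y_pred):
    """Eq. (1): fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, following Eqs. (2) to (4)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1 (an assumption)."""
    labels = sorted(set(y_true))
    rows = [per_class_metrics(y_true, y_pred, c) for c in labels]
    return tuple(sum(col) / len(labels) for col in zip(*rows))
```

For example, with true labels `["hit", "hit", "throw", "throw"]` and predictions `["hit", "throw", "throw", "throw"]`, the accuracy is 0.75, and the class hit has a precision of 1.0 and a recall of 0.5.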

4 Results

4.1 Machine Learning Algorithms

The five algorithms were tested and their evaluation metrics were tabulated (Table 2). Additionally, the training and testing times were calculated. In terms of precision, XGBoost scored the best followed by RF while KNN scored the lowest. Similarly, XGBoost achieved the best results in terms of recall, F1-score, and accuracy. RF was the second best algorithm while KNN was the worst performer. However, the KNN algorithm was the fastest to train followed by DT. MLP took the longest time to train followed by RF and then XGBoost. In terms of test time, DT was the fastest to predict the testing samples followed by XGBoost while KNN was the slowest. XGBoost was selected to perform the upcoming tests with features because it achieved the best results with relatively fast training and testing times.
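Training and testing times of this kind could be measured with a small timing harness, sketched below. The helper `timed` is hypothetical, and the `NearestCentroid` class is only a tiny stand-in classifier so that the harness has something to run; it is not one of the five algorithms evaluated in this study.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


class NearestCentroid:
    """Stand-in classifier: assigns each sample to the closest class mean."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict(self, X):
        def nearest(row):
            # Squared Euclidean distance to each class centroid.
            return min(
                self.centroids,
                key=lambda c: sum((a - b) ** 2 for a, b in zip(row, self.centroids[c])),
            )
        return [nearest(row) for row in X]


# Toy data: two features per sample, two classes.
X_train = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
y_train = ["idle", "idle", "throw", "throw"]
X_test = [[0.0, 0.5], [9.0, 9.0]]

model, train_seconds = timed(NearestCentroid().fit, X_train, y_train)
predictions, test_seconds = timed(model.predict, X_test)
```

Any model exposing `fit` and `predict` methods could be substituted for the stand-in, which keeps the timing procedure identical across algorithms.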

4.2 Experiments with Features

Machine learning models based on XGBoost were trained to experiment with different combinations of features. The tested configurations included raw data only, extracted features only, and a combination of both. The experiments were performed on the separate data of the accelerometer and gyroscope sensors and on their combined data. The results for these tests along with their corresponding train and test times were calculated (Table 3).

Table 3 The results for the experiments with features considering the raw data alone, extracted features alone, or combined

For the accelerometer, the model with the combined raw and extracted features data achieved the best outcomes in terms of precision, recall, F1-score, and accuracy. Not far behind, the model based on the raw data achieved the second best outcomes. The feature-based model was the fastest in terms of training time. The test times for the three models were close.

Compared to the experiments conducted with the accelerometer, the scores for the gyroscope experiments were lower (e.g., an accuracy of 66% vs 89% for the accelerometer). Similar to the observations made for the accelerometer, the combined experiment for the gyroscope achieved the highest scores while the feature experiment achieved the fastest training time.

The experiments for the combined sensors achieved slightly better results compared to those of the accelerometer alone. The combined raw and extracted features of the two sensors achieved the best results compared to any other combination, but at an increased training time. Additionally, the test time witnessed a slight increase.

Fig. 3 Part of the experiments with children interacting with the three toys. a Children performing three scenarios with the toys, namely, shaking, hitting, and throwing. b A plot of the magnitude of acceleration over time showing the changes corresponding to the interactions. c The prediction results of the best trained machine learning model

4.3 Evaluation Experiments with Children

The best trained model (i.e., based on XGBoost) using adult data was evaluated with data acquired from children interacting with the three toys mimicking actual scenarios (Fig. 3a). The duration of the interactions was short (i.e., around 3 min) and the interactions were limited to three scenarios demonstrating shaking, hitting, and throwing behaviours. The changes in the magnitude of acceleration made it possible to identify segments corresponding to the three scenarios (Fig. 3b).

The data for the children were analyzed and the behaviours were predicted using the best trained model. The outcomes of prediction for each scenario performed by each child were averaged and plotted as bar charts (Fig. 3c). In the first scenario, the model was able to identify shaking instances correctly. However, a few instances of hit and throw were also detected. The model was also able to detect hit instances in the second scenario, but along with shake and throw instances. In the last scenario, the model identified throw instances correctly along with shake and a few hit instances.

5 Discussion

The dynamics of the physical interaction between a human and a robot can be complex. Hence, identification strategies and detection methods can be used to decipher these interactions. This study considered the integration of sensors and machine learning techniques to detect undesired interactions between children and the toys in their surroundings. Trained on data acquired from adult participants, the best model, based on XGBoost, showed promising potential in detecting undesired interactions between children and the three toys. Over the short duration of each experiment, the algorithm was able to identify the behaviours of interest. However, there were instances of incorrect predictions in each scenario. This could be attributed to the complexity of children’s interactions with the toys, which made some behaviours intertwined and their predictions overlap. During the hitting scenario, some instances were predicted as shake while others were predicted as throw. Predicting some instances as shake could be attributed to the children hitting more gently than the adult participants, which caused the toys to shake. As for the incorrect throw instances, this could be attributed to the way some of the children were holding the toy while hitting. Some children were carrying the toys; hence, the hit confused the model and was reported as a throw. Additionally, part of a full throw involves a hit as a result of an impact with a surface. These intricate dynamics imply the need for careful consideration when developing machine learning algorithms for this application, which involves aggressive interactions.

How quickly an algorithm can predict an undesired behaviour plays an important role during interactions with or in the presence of a social robot. There are many factors that affect the time required to process new sensory data and for a machine learning model to provide a prediction. The choice of machine learning algorithm is another crucial decision. Selecting an algorithm with many parameters to tune and a long training time will make it more challenging to optimize and experiment on the actual system. In our tests, XGBoost provided a relatively quick training time without compromising the performance and with less tuning effort. Additionally, the selection of an algorithm can directly affect the time needed to make a prediction (Table 2). Having a machine learning model with quick predictions is crucial in applications that require a robot to respond quickly to certain undesired interactions.

The type of data and the number of sensors affect the performance and speed of a machine learning algorithm (Table 3). Using extracted features, which form a smaller input vector than the raw data, provided a faster training time, but at a slightly reduced overall performance. The time required to compute such features from the raw data may introduce an extra time delay during the actual operation of a detection system. Using multiple sensors of different modalities might increase the overall accuracy; however, the time required to process their data will also increase. This study showed that using one sensor (i.e., accelerometer) that measures one modality is sufficient to reach a high prediction performance. While using the gyroscope sensor did not provide much noticeable improvement over the accelerometer in detecting undesired interactions, it might still be useful to incorporate for a different purpose. For example, the gyroscope can be used to detect different aspects of interactions, such as the orientation of a toy or robot, and that might be useful for certain applications that require specific interactions to be performed.

Certain design aspects are essential in robots or toys that are meant to detect undesired interactions. The internal structure should be robust enough to withstand such aggressive behaviours (e.g., hitting) while the outer structure should be optimized to mitigate any potential harm. Additionally, embedded sensors should withstand the dynamics of such interactions and not drift or lose accuracy over time due to damage, heat, or misalignment. Compensation techniques through software implementations can be used to address some of these challenges. Another design consideration is the number of detected undesired interactions needed before a robot should make a response. For example, the needed number of detected hits within a time frame before a robot may react should be determined. A frequent response to every behaviour might appear unnatural while less frequent ones might make the interactions feel dull [31]. Designers of social robots might need to make a trade-off between different parameters and traits to meet the requirements of their robotic designs and intended applications [29].
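The idea of requiring a number of detections within a time frame before the robot reacts can be sketched as a sliding-window counter. The class name `ResponseTrigger` and the example values (three hits within two seconds) are hypothetical and chosen only for illustration; suitable values would have to be determined experimentally.

```python
from collections import deque


class ResponseTrigger:
    """Fire once at least min_count target detections fall within window_s seconds."""

    def __init__(self, target, min_count, window_s):
        self.target = target
        self.min_count = min_count
        self.window_s = window_s
        self._times = deque()

    def update(self, label, timestamp):
        """Record one prediction; return True if the robot should respond now."""
        # Drop detections that have aged out of the time frame.
        while self._times and timestamp - self._times[0] > self.window_s:
            self._times.popleft()
        if label != self.target:
            return False
        self._times.append(timestamp)
        if len(self._times) >= self.min_count:
            self._times.clear()  # the next response needs fresh detections
            return True
        return False


# Illustrative values: respond after three hits within two seconds.
trigger = ResponseTrigger(target="hit", min_count=3, window_s=2.0)
stream = [("hit", 0.0), ("idle", 0.5), ("hit", 1.0), ("hit", 1.5), ("hit", 1.6)]
decisions = [trigger.update(label, t) for label, t in stream]
```

Raising `min_count` or shortening `window_s` makes the robot respond less often, which is one way to explore the trade-off between overly frequent and overly sparse responses.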

The current work has certain limitations. The data considered to develop the machine learning models were obtained from adult participants while the target end users are children. Furthermore, the current study did not evaluate the developed model with children with different degrees of autism. More data need to be acquired from children with or without autism to capture the full spectrum of interactions among children. The tested detection devices and toys were limited to off-the-shelf options that might not be suitable for such applications. More dedicated and custom-made devices that withstand aggressive behaviours are needed to perform better evaluations. The conducted tests were limited to a few children. Hence, experimental evaluations of robotic reactions with more children are needed.

6 Conclusion

The occurrence of aggression among neurotypical children and those with psychiatric disorders is high and can be concerning to their families, therapists, and caregivers. Technology, such as social robots, can be used to address such behaviours. However, social robots need to be able to detect such interactions. In this study, we demonstrated the possibility of detecting different interactions using a detection device and machine learning techniques. Detection algorithms, such as XGBoost, can accurately distinguish between different behaviours that include undesirable ones, such as throwing. Furthermore, such an algorithm can provide a quick prediction for new data, hence reducing the overall time delay. Data acquired from a single tri-axial accelerometer alone can be sufficient to provide the necessary information for the machine learning model to make an accurate enough prediction. However, integrating more sensors, such as a gyroscope, can be useful to capture different aspects of interactions. Having a social robot that responds quickly to interactions within its environment is possible using simple solutions.

The insights and findings of this work can be further explored by researchers in the field of social robotics to integrate new concepts and solutions into their designs. A social robot that can detect direct physical interactions between children and their environments can be used to address the issue of aggression and challenging behaviours among children with or without psychiatric disorders.