1 Introduction

Apraxia is a neurological dysfunction in which a person has difficulty carrying out complex gestures [1]. It can be disabling, since affected people cannot perform basic daily tasks such as grooming or getting dressed.

Alzheimer’s disease presents very varied symptoms (memory loss, inability to communicate, changes in personality and behavior, etc.). Apraxia is a manifestation of multiple neurological pathologies, including Alzheimer’s disease [2]. For this reason, one of the most frequently performed diagnostic tests in neurology consultations is to ask the patient to imitate a simple gesture made by the doctor. The doctor then evaluates the imitation by visual inspection to determine whether apraxia is present.

Greater life expectancy entails an increase in the elderly population and, therefore, a higher prevalence of Alzheimer’s disease. This justifies the need to explore research proposals that allow the diagnosis of this pathology to be automated. Previous related works, including [3,4,5,6], show that it is possible to make efficient use of modern tools in these areas. In this line, this work presents a methodological proposal based on deep learning and computer vision techniques to identify apraxias [7]. The methodology contains a set of tasks to process videos that record the gestures made by patients trying to imitate a movement. Finally, the methodology also contains a set of tasks to build support software that detects whether the patient shows symptoms related to apraxia.

The article includes five additional sections. Section 2 reviews research papers related to the detection of apraxias. Section 3 presents our methodology to automate the identification of apraxias using neural networks and computer vision techniques. Section 4 describes the experimentation carried out to validate our methodology with real patients. Section 5 discusses the results of the experimentation and the research impact of the proposal. Section 6 presents the conclusions drawn from this work.

2 Related work

Thanks to modern instruments such as smartphones, together with high connectivity and cloud tools, it has been possible to successfully develop multiple mobile applications. This has led to the development of large systems for clinical diagnostic support, making it easier for medical experts to analyse the data collected from patients using these kinds of devices. This is often referred to as m-health, although the term u-health is also used to refer to the use of ubiquitous computing for medical purposes, taking advantage of wearable or portable devices.

In this scenario, proposals for various applications have been made with great success: Tsang et al. [8] present an app to help prevent asthma attacks and aid self-management of the disease by predicting possible attacks using data collected from smartphones and machine learning techniques. Pryss et al. [9] studied how to predict the level of stress through GPS data collected by a mobile app. Ali et al. [10] harness wearable sensors to collect patient data and try to detect heart disease through deep learning models.

In addition to this, it is worth mentioning some applications that take advantage of the cameras of these devices (or cameras in general). Liang et al. [11] presented an app that takes advantage of smartphone cameras to allow users to self-examine for signs of dental or oral diseases or problems. Kousis et al. [12] show a way to classify melanoma taking advantage of mobile phone cameras, classifying the different moles captured by the patients themselves through an intuitive application. Hasan et al. [13] take advantage of the smartphone camera to record the fingertips and relate the measured colour to the level of haemoglobin in the blood.

Fig. 1 Hand landmarks extracted with the MediaPipe Hands solution. Obtained from [29]

Currently, the existence of apraxia can be determined exclusively through clinical examination, by asking the patient to perform different gestures and evaluating these gestures visually. Different modalities of apraxia can be assessed, such as the imitation of meaningless gestures (imitative apraxia), the pantomime of previously learned gestures (pantomime of object use, or communicative pantomime), and the real use of objects [14]. Several neuropsychological tests have been developed for the clinical examination of apraxia [15, 16]. However, the clinical scoring of apraxia gestures poses several problems. Firstly, the visual evaluation made by the health professional is subjective and can be inaccurate. Also, the gestures performed by the patient are usually rated dichotomously, as correctly or incorrectly performed, so intermediate levels of apraxia impairment cannot be adequately scored. This qualitative evaluation has additional disadvantages, because it makes it very difficult to assess a progression in the apraxic deficit related to the progression of the neurological disorder, or an improvement in response to treatment. Therefore, there is a need to develop a more accurate and quantitative method to evaluate apraxia.

In this scenario, artificial intelligence holds great potential to improve apraxia evaluation. Machine learning approaches have been successfully applied in medical diagnostic processes, particularly in the field of neurodegenerative diseases. A major application is the use of artificial intelligence to analyze brain MRI images for the diagnosis of Alzheimer’s disease. A systematic review on this topic reported that convolutional neural networks achieved the best results (weighted average accuracy 89%), but other approaches such as Logistic Regression or Support Vector Machines also obtained high performance [17]. Many other applications of deep learning models include the classification of electroencephalographic signals for brain-computer interfaces [18]; the staging of neuropathological changes on digitized brain tissue slides [19]; the quantification of amyloid protein deposition in positron emission tomography images [20]; the scoring of the Rey Complex Figure copy, a test to evaluate visuospatial skills [21]; and the analysis of voice recordings to detect speech abnormalities [22] or dementia [23]. However, machine learning has not been explored for apraxia evaluation before. Caselli et al. [24] carried out a kinematic study of apraxia based on an Optotrak camera system that registered apraxia gestures. This approach allowed the analysis of quantitative features of apraxia such as reaction time, intermanual symmetry or manipulation coupling. However, this study was conducted for research purposes, and the goal was not to obtain a scoring system of apraxia to evaluate patients in clinical practice. Multi-camera kinematic studies are not suitable for clinical practice because they are expensive, time-consuming and operator-dependent. Conversely, cameras associated with modern smartphones open the possibility to register patient movements in a very inexpensive and time-efficient way [25, 26].

Furthermore, Daribay et al. [27] proposed an automated solution for identifying apraxias by utilizing computer vision techniques to extract skeletal features. However, their research primarily focused on children, who do not exhibit symptoms of apraxia associated with Alzheimer’s pathology, and whose symptoms predominantly manifest as speech difficulties.

3 Methodology

This section presents the proposed system for automatically evaluating the execution of gestures by patients. The system is composed of two parts: first, the skeleton of the hand is extracted; then, the distance between the gesture performed at each moment (frame) and the target gesture is obtained, and the execution is evaluated.

3.1 Preprocessing and hand tracking using MediaPipe

First of all, each video is pre-processed in order to homogenise all the data and to reduce the number of frames to be analysed. To this end, a uniform sampling of frames per second is taken from each video. In this case, we have used a sampling rate of 5 frames per second.
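As an illustration, a minimal sketch of this sampling step is given below, assuming OpenCV is used to read the recordings; the helper name and the file name are hypothetical and the original implementation may differ.

```python
# Hedged sketch of the 5-fps uniform sampling step, assuming OpenCV (cv2)
# is used to read the recordings; names are illustrative only.
import cv2

def sample_frames(video_path: str, target_fps: int = 5):
    """Return a list of frames sampled uniformly at roughly `target_fps`."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:          # keep one frame every `step` frames
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

frames = sample_frames("patient_gesture.mp4")  # hypothetical file name
```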

MediaPipe [28] is a cross-platform Machine Learning library that provides several Computer Vision solutions, such as everyday object detection, skeleton detection and tracking, or face recognition, among others. One of its most interesting and highly accurate functionalities (around 95.7% for palm detection) is hand detection and skeleton extraction and tracking. This solution allows detecting hands in given images or videos and extracting 21 coordinates or landmarks from each one, as shown in Fig. 1.

This functionality has been exploited for the first step of the proposed system, where videos of patients performing the gestures are processed. This is intended to take advantage of the high precision of this library, simplifying the video data and moving from a highly complex and high-dimensional datum (such as a video) to a series of landmarks distributed over time, representing the movement made by the patient. These landmarks can be processed more easily, and the similarity between two gestures can be calculated.
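A minimal sketch of this step with the MediaPipe Hands Python solution is shown below; it reduces each sampled frame to up to two sets of 21 (x, y, z) landmarks. The function and variable names are illustrative, not taken from the original implementation.

```python
# Hedged sketch: extract hand landmarks per frame with MediaPipe Hands [28].
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(frames):
    """Return, per frame, a list of 21-point landmark lists (one per detected hand)."""
    sequences = []
    with mp_hands.Hands(static_image_mode=False,
                        max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        for frame in frames:
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks is None:
                sequences.append([])      # no hand detected in this frame
                continue
            sequences.append([
                [(lm.x, lm.y, lm.z) for lm in hand.landmark]
                for hand in result.multi_hand_landmarks
            ])
    return sequences
```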

3.2 Similarity distance

The solution given by MediaPipe provides a skeleton of the different hands in the frame, given by 21 points per hand. A simple way to obtain a similarity distance with respect to the target gesture would be to use a distance between graphs such as the Hamming distance. However, that distance is based on the adjacency matrix, which in this project is the same for the target gesture and the patient’s gesture. For this reason, we propose a new approach, explained below.

To evaluate the quality of the gesture we consider a new graph, \({\mathcal {G}}=(V,E,A)\). When working with single-handed gestures we keep the 21 original points as the vertices, V. This new graph is an undirected weighted graph in which all the vertices are connected to each other, giving the set of edges, E. The weight of each edge in A is given by the Euclidean distance between the two vertices it connects. The set of all the weights can be written as a matrix whose elements are the weights of the corresponding edges. This matrix has zeros on the diagonal and, for undirected graphs, is symmetric. To simplify matters we consider a pseudo-weight matrix which is upper triangular. When working with bimanual gestures, the number of vertices doubles to 42, but the idea behind this new graph \({\mathcal {G}}\) remains the same: the matrix now contains the distances between the keypoints of each hand and the distances between the keypoints of the two hands.

With these considerations in place, the new matrix is a square matrix in \({\mathbb {R}}^{g\times g}\), where \(g=21\) when the gesture being analysed is a single-handed one and \(g=42\) when it is a bimanual one. The matrix is kept square because this later allows us to compute a matrix norm. This new matrix, \(A'_{F H}\), is computed for each frame \(F\in v,\) with v the video under analysis, and for each \(H\in {\mathcal {H}}',\) with \({\mathcal {H}}'\) the set of all possible gestures except the palm one, which is not analysed in this project. It is as follows:

$$\begin{aligned} A'_{F H} = \begin{pmatrix} 0 & a_{1,2} & \cdots & \cdots & a_{1,g}\\ \vdots & 0 & a_{2,3} & \cdots & a_{2,g}\\ \vdots & \vdots & \ddots & \ddots & \vdots \\ \vdots & \vdots & & \ddots & a_{g-1,g}\\ 0 & 0 & \cdots & 0 & 0 \end{pmatrix}, \end{aligned}$$
(1)

where \(a_{i,j}\) is the weight of each edge and is computed as:

$$\begin{aligned} a_{i,j} = \frac{\Vert v_{i-1} - v_{j-1} \Vert _2}{b_{1,2}} \end{aligned}$$
(2)

with:

$$\begin{aligned} b_{1,2} = \Vert v_{0} - v_{1} \Vert _2. \end{aligned}$$
(3)

In this case, \(\Vert \cdot \Vert _2\) refers to the Euclidean norm and \(v_i\) is the ith point for the single-handed gestures. When a bimanual gesture is analysed, the points with \(i\in \{0,\ldots ,20\}\) correspond to those of the right hand and those with \(i\in \{21,\ldots ,41\}\) to those of the left hand. To identify which of the points in Fig. 1 a left-hand point corresponds to, it is enough to subtract 21 from the value of i. Normalising all the weights with respect to the first one, \(b_{1,2}\), makes the result insensitive to the distance of the hand from the camera.
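The construction of Eqs. (1)-(3) can be sketched as follows, assuming the landmarks of a frame are arranged as a NumPy array of shape (g, 3); this is an illustrative implementation, not the authors' code.

```python
# Sketch of the normalised pseudo-weight matrix A'_{FH} of Eqs. (1)-(3),
# for a (g, 3) array of landmarks (g = 21 single-handed, g = 42 bimanual).
import numpy as np

def weight_matrix(landmarks: np.ndarray) -> np.ndarray:
    # pairwise Euclidean distances between all g keypoints
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # normalise by the reference distance b_{1,2} = ||v0 - v1|| (Eq. 3)
    dist /= np.linalg.norm(landmarks[0] - landmarks[1])
    # keep only the strictly upper triangular part (pseudo-weight matrix)
    return np.triu(dist, k=1)
```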

This procedure has also been carried out for the target gestures, each of which consists of a single frame. We denote by \(A'_{\cdot H}\) the matrix corresponding to a target gesture. This is done for each \(H\in {\mathcal {H}}'.\)

Once we have computed the distances between the different keypoints in the hand landmarks, we need to know how similar the gesture is to the target one. To do so, it is useful to see how much the distances between the vertices differ between the patient’s gesture and the target gesture. For example, if the index finger is stretched in both gestures, the distances between its points will vary little, whereas if in the patient’s video the fist is closed, these distances will vary much more.

A similarity distance is implemented as follows to determine to what extent the patient’s gesture resembles the target one. This distance makes use of the following function:

$$\begin{aligned} s(A'_{F H},A'_{\cdot H}) = 1- \frac{\Vert A'_{F H} - A'_{\cdot H}\Vert _{Fro}}{\Vert A'_{\cdot H}\Vert _{Fro}} \end{aligned}$$
(4)

where the Frobenius norm allows us to compute an element-wise norm in the same way as the Euclidean norm does with vectors.

We normalize by the norm of the matrix of the target gesture so that \( s(A'_{F H},A'_{\cdot H})\in [0,1]\). Small values of \(s(A'_{F H},A'_{\cdot H})\) are unlikely to be obtained, since some distances, such as those from the wrist to the knuckles, do not depend on the state of the fingers (stretched or flexed).

This similarity distance is computed for all the different frames extracted from the original video, and the similarity distance of the video is taken as:

$$\begin{aligned} d(v,H) = \max _{F\in v}\{s(A'_{F H},A'_{\cdot H})\}. \end{aligned}$$
(5)

We also record the time at which this maximum value of d is achieved; this is the duration time.
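A sketch of Eqs. (4)-(5) is given below: the frame-wise similarity to the target matrix and the video-level score, taken as the maximum over the sampled frames, together with the time at which that maximum is reached. The variable names and the 5-fps assumption are illustrative.

```python
# Hedged sketch of Eqs. (4)-(5); `frame_matrices` holds one A'_{FH} per
# sampled frame and `a_target` is the A'_{.H} of the target gesture.
import numpy as np

def frame_similarity(a_frame: np.ndarray, a_target: np.ndarray) -> float:
    # Eq. (4): 1 - ||A'_FH - A'_.H||_Fro / ||A'_.H||_Fro
    return 1.0 - (np.linalg.norm(a_frame - a_target, ord="fro")
                  / np.linalg.norm(a_target, ord="fro"))

def video_similarity(frame_matrices, a_target, sampling_fps: float = 5.0):
    scores = [frame_similarity(a, a_target) for a in frame_matrices]
    best = int(np.argmax(scores))
    # Eq. (5), plus the time (in seconds) at which the maximum is reached
    return scores[best], best / sampling_fps
```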

3.3 Automatic gesture evaluation

The similarity distance, the execution time and the duration are fed to a three-layer neural network composed of the input layer, one hidden layer and the output layer (see Fig. 2). The input layer has the same number of neurons as the number of features fed to the model, in this case 3. The hidden layer can have as many neurons as desired; we obtained the best results with a hidden layer of 8 neurons. Both layers use the ReLU activation function.

Fig. 2 Overview of the architecture of the network

Fig. 3 Schematic of the initial model to which the patient’s original video is transferred. First the execution time of the gesture is extracted, and the original video goes through the MediaPipe Hands solution to extract the skeleton. With the skeleton, the maximum similarity distance to the target gesture is computed, together with the time it takes to reach it. These features plus the time duration are fed to the trained model, which gives the output: the score of the gesture. For privacy reasons, the gesture shown in this figure is made by an author of this paper and does not belong to the dataset

The last layer is composed of just 4 neurons, the same number of target scores a video can receive. For this layer the activation function used is the softmax function. It has been found through experimentation that these hyperparameters are the best performers.

To obtain a final classification of the input data, the neural network goes through a learning process in which the weights of each connection are updated so that the predicted results are as close as possible to the real ones. The update of the weights is done by an optimization process in which the system minimizes the cross-entropy loss function. There are different algorithms to minimize this loss function, such as gradient descent or stochastic gradient descent. In this project we used the Adam algorithm, which follows a stochastic gradient descent procedure based on adaptive estimation of first- and second-order moments. Figure 3 shows a scheme of the whole model, from the original video to the final output, including obtaining the skeleton of the hand and computing the similarity distance with the target gesture.
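The text does not specify the deep learning framework used; as a hedged illustration, a minimal Keras sketch of the described architecture (3 input features, one hidden layer of 8 ReLU units, a 4-unit softmax output, Adam optimizer and cross-entropy loss) could look as follows, assuming the target scores are encoded as integers 0-3.

```python
# Hedged Keras sketch of the gesture-scoring network described above.
import tensorflow as tf

def build_scorer() -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3,)),              # similarity, execution time, duration
        tf.keras.layers.Dense(8, activation="relu"),    # hidden layer
        tf.keras.layers.Dense(4, activation="softmax"), # scores 0-3
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # cross-entropy on integer labels
                  metrics=["accuracy"])
    return model
```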

4 Experiments and results

This section presents the study carried out using the system proposed in the previous section, as well as the results obtained from the experimentation. In addition, the dataset used to carry out the experimentation and how it was obtained is described.

4.1 Dataset

The dataset used was obtained from patients evaluated in the Cognitive Disorders Unit of the Marques de Valdecilla University Hospital in Santander, Spain, and from the Valdecilla Cohort for the Study of Memory and Brain Aging, a local project that enrolls healthy elders free of dementia. Therefore, the study sample includes both cognitively normal subjects and patients with different cognitive disorders (such as Alzheimer’s disease and frontotemporal dementia), ensuring a wide range of apraxic deficits. The study was approved by the local Ethics Committee and all participants gave their written informed consent according to the Declaration of Helsinki. For those patients who could not give a reliable informed consent due to their degree of cognitive impairment, it was obtained from their accompanying relative.

Seventy-eight subjects participated in the test: 30 patients with Alzheimer’s disease (AD), 26 patients with frontotemporal dementia (FTD) and 22 who were healthy or had other diagnoses. Subjects were between 55 and 87 years old, with 43 women and 35 men in the sample.

Patients were asked to imitate a series of gestures, unimanual or bimanual, shown in pictures. Although apraxia is usually evaluated in the clinical setting by asking the patient to imitate the manual postures adopted by the examiner, for this project we decided to ask patients to imitate the postures shown in a picture to ensure reproducibility. The gestures to be imitated are shown in Fig. 4.

Fig. 4 Unimanual and bimanual gestures used in the study

The execution of these gestures was evaluated by the neurologist in charge, taking into account both the difficulties in performing them and the time taken. Each evaluated gesture was given a score between 0 and 3 (3: executed perfectly; 2: executed correctly with minor deviations; 1: performed incorrectly but recognisable; 0: unrecognisable).

To generate the dataset, patients were recorded performing the different gestures using smartphone cameras. The model of the smartphone or its camera was not of great importance, as the patients were at a distance of about one metre (close-up) and the quality was not affected. Resolution is also unimportant for the same reason, and a high resolution could be computationally expensive. The camera was fixed during each of the recordings, although its position can vary between patients, as can the positioning of the patient with respect to it. This was done on purpose, so as not to restrict the analysis to a very closed and limited environment.

Table 1 shows the number of videos with the different target scores for each of the single-handed gestures in our dataset. In this case, the table groups together those made with the left and the right hand, since there was no difference between them when computing the similarity distance.

Table 1 Number of videos in our dataset with each of the possible target scores for all the single-handed gestures

Table 2 shows the number of videos with the different target scores for each one of the bimanual gestures in our dataset.

Table 2 Number of videos in our dataset with each of the possible target scores for all the bimanual gestures

4.2 Training process and evaluation metrics

The model proposed in this work is a simple one that can be executed on any device. In fact, the network used in the last step of this implementation was trained on an LG gram laptop equipped with an Intel Core i7 processor.

For all the single-handed gestures, we chose a batch size of 12 and a learning rate of the order of \(10^{-3}\). For all the bimanual gestures, we chose a batch size of 6 and a learning rate of the order of \(10^{-4}\). These values are sufficient for the models to converge within 100 epochs in both cases: the single-handed gestures and the bimanual ones.
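Under the same Keras assumption as the sketch in Sect. 3.3, the training configuration described above could be expressed as follows; the data placeholders are hypothetical.

```python
# Hedged sketch of the training setup: batch size 12 and lr ~1e-3 for
# single-handed gestures, batch size 6 and lr ~1e-4 for bimanual ones,
# 100 epochs in both cases. X_train/y_train are placeholders.
import tensorflow as tf

def train(model: tf.keras.Model, X_train, y_train, bimanual: bool = False):
    lr = 1e-4 if bimanual else 1e-3
    batch = 6 if bimanual else 12
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(X_train, y_train, batch_size=batch, epochs=100)
```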

Neural networks allow us to classify data, but it is important to note that any classification is done with a certain error. Analysing how big this error is, that is, analysing the model’s performance, allows us to understand how our model will behave, both during its learning process and when new data is introduced. The confusion matrix is a useful tool to do so. As stated in [30], a confusion matrix is a \(2\times 2\) matrix with the following four categories:

  • True positive (TP): correctly classified as the positive class (or class of interest).

  • False positive (FP), also known as type 1 error: incorrectly classified as the positive class.

  • False negative (FN) or type 2 error: incorrectly classified as the negative class (not the class of interest).

  • True negative (TN): correctly classified as the negative class.

These four categories show whether the predictions made by our model match the actual values or not. The target and predicted values of each category are the basis of different metrics used to evaluate the performance of classification models. It is important to note that these metrics are not to be understood as a distance function. One of these metrics is the accuracy.

$$\begin{aligned} \text {Accuracy} = \frac{\text {Number of correct predictions}}{\text {Total number of predictions}}. \end{aligned}$$
(6)

Thus, the accuracy shows the proportion of correctly classified elements. In addition to the accuracy, there are other metrics for evaluating performance, such as precision and recall, the latter also known as true positive rate (TPR). The precision shows the proportion of elements predicted as positive that are actually positive, while the recall shows the proportion of actual positive elements that have been correctly identified:

$$\begin{aligned}&\text {Precision} = \frac{TP}{{TP} + {FP}} \end{aligned}$$
(7)
$$\begin{aligned}&\text {Recall} = \frac{TP}{{TP} + {FN}}. \end{aligned}$$
(8)

It is important to note that these two metrics are related, and improving one of them usually means a reduction in the other. To balance the two metrics, we usually turn to the F1-measure. The F1-measure or F1-score is the weighted harmonic mean of the precision and recall; it takes a value of 1 in the best case (perfect precision and recall) and 0 in the worst case.

$$\begin{aligned} {F1}\text {-score} = 2\,\frac{\text {Precision}\cdot \text {Recall}}{\text {Precision} + \text {Recall}} = \frac{{TP}}{{TP} + \frac{1}{2}({FP} + {FN})}. \end{aligned}$$
(9)

These metrics apply when the classification problem is binary, and they can be extrapolated to the case of K classes. In this case, the confusion matrix is a \(K\times K\) matrix and the accuracy can still be computed as specified by Eq. (6). The per-class precision, recall and F1-score can be calculated as in the binary case, Eqs. (7)-(9) respectively. To obtain the value of these metrics for the whole model, it is enough to compute the arithmetic mean or the weighted mean, depending on the number of videos per class in the dataset.
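As an illustration of this evaluation procedure, a sketch using scikit-learn is given below; it builds the K x K confusion matrix and the weighted-average precision, recall and F1-score. The `y_true`/`y_pred` arguments are placeholders for the target and predicted scores of the test sample.

```python
# Hedged sketch of the evaluation: K x K confusion matrix, accuracy (Eq. 6)
# and weighted-average precision, recall and F1-score (Eqs. 7-9).
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)                 # K x K matrix
    acc = accuracy_score(y_true, y_pred)                  # Eq. (6)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return cm, acc, prec, rec, f1
```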

Thus, in this study we will only pay attention to the accuracy of our model and to the weighted mean of the precision, recall and F1-score.

4.3 Results

4.3.1 Single-handed gestures

It is worth mentioning that, for the single-handed gestures, the majority of the videos have a target score of 2 or 3, as seen in Table 1, so when training the models we focused on the ability of the model to correctly predict target scores of 0 or 1 rather than on achieving a higher accuracy on the training set. In the cases in which there was only one video with a certain target score, we chose the split of the dataset in which this video belongs to the training sample.

Table 3 shows the final values of the loss and the accuracy for each of the models, with accuracy values greater than 0.8 highlighted.

Table 3 Loss and accuracy for the different gestures achieved at the last step of the training process

Figure 5 shows the confusion matrices for all six gestures on the test sample, in order to see how the models respond when new data is introduced.

Fig. 5 Confusion matrices of the test sample for the six single-handed gestures, where the elements in the diagonal are correctly classified and the rest are not

Table 4 summarizes the number of total hits and total errors made by the models for each of the six single-handed gestures. These results are obtained from Fig. 5.

Table 4 Number of total hits and total errors made by the model when classifying the videos in the test sample

Looking now at the six confusion matrices shown in Fig. 5, and at the results in Table 4, it can be appreciated that all six models follow the same tendency that they showed on the training data: they easily and correctly classify the gestures with a target score of 3. Moreover, our models also correctly classify videos with other scores. This means that no overfitting has occurred when training the models and that, for new videos, the model is able to give a correct prediction.

To get a better understanding of the results given by the models on the test sample, we check the precision, recall and F1-score, which are listed in Table 5, with values greater than 0.8 highlighted. We have considered the weighted average to avoid cases where the classification of just one video could carry the same weight in the average precision as all of the videos with another target score.

Table 5 shows the precision and recall of all six models. We can see that the results obtained for both the precision and the recall can be considered good for all six gestures, with values between 79\(\%\) and 83\(\%\) for the precision and over 81\(\%\) for the recall.

The F1-score is also shown in Table 5. As in the case of the training set, some models, namely those for the gun and ok gestures, present a big difference between these two metrics. The F1-score gives a better understanding of the quality of the results given by these two models, which can be considered good.

Lastly, let us look at the percentage of error given by each of the models on the test sample. These values are obtained as one minus the accuracy, expressed as a percentage. The values for all the gestures appear in Table 6, where we have also included the percentage of accuracy under the label Hits.

Table 5 Classification metrics of the model obtained from the test sample
Table 6 Percentage of the accuracy (hits) and of one minus the accuracy (errors) in the test sample for the each one of the six single-handed gestures given by the model proposed in this project

The errors listed in Table 6 are all below 20\(\%\), which translates into an accuracy of over 80\(\%\). These errors seem logical when we consider the accuracy shown in Table 3. The reason the accuracy in this case is similar or higher is that the training sample contained fewer videos with a target score of 0 or 1, making it more difficult for the model to learn when videos should receive these scores. Nevertheless, these results can be considered good. We can also consider that the models perform a good classification without producing overfitting; if this were the case, the difference between the accuracy on the training sample and on the test sample would be much higher than the results obtained.

The initial project explained in [25, 26] only reports the percentage of hits for the gestures three and victory. Additionally, in the initial model, left and right single-handed gestures were not mixed together. For the three gesture we have the results for both hands, while for the victory gesture we only have the results for the right hand.

Since for the victory gesture we only have a percentage of error, let us compare it to that obtained by our model. In the initial model the percentage of error was 67.15\(\%\), while in ours it drops to 18.18\(\%\). For the three gesture, the percentages of error in the initial model are 69.68\(\%\) for the right hand and 56.76\(\%\) for the left hand. The results obtained by our approach present an improvement with respect to both, dropping the percentage of error to 18.75\(\%\). For both gestures, better results are obtained with our approach than with the initial model.

Assuming that the rest of the single-handed gestures would obtain results similar to the ones shown in [25, 26], we can consider that our models perform a good classification when compared to the initial model.

Furthermore, we compare our results to those obtained using other well-known methodologies, in particular support vector machines (SVM, [31]), random forests (RF, [32]), bagging [33] and gradient boosting [34]. In Table 7, we have included the total number of hits and errors achieved by each of these models for each of the single-handed gestures. We can observe that the results in Table 7 are worse than those of our model, reported in Table 4, except for the ok gesture with SVM and RF, for which the same results are obtained: 30 videos are classified appropriately and 3 are misclassified.

Table 7 For the test sample videos of single-handed gestures, number of total hits and total errors made by different classifiers: SVM, RF, bagging and gradient boosting

In Table 8, we have included the precision, recall and F1-score for the single-handed gestures, again using SVM, RF, bagging and gradient boosting. Comparing these results with those reported in Table 5, we find that our model again produces better results, except for the ok gesture with SVM and RF, which are exactly the same.

Table 8 For the test sample of single-handed gestures: weighted average of the precision, recall and F1-score metrics when making use of the SVM, RF, bagging and gradient boosting classifiers

4.3.2 Bimanual gestures

As seen in Table 2, some gestures follow the same distribution of videos as the single-handed ones (a majority of videos with a target score of 3). In these cases, we trained the model focusing on its ability to correctly predict the other target scores. Meanwhile, for the gestures that have the same number of videos with each target score, we focused on the ability of the model to correctly predict the target score of each video, i.e. its global accuracy.

Table 9 displays the final values of the loss and the accuracy for each of the models. The accuracy values greater than 0.8 have been highlighted.

Table 9 Loss and accuracy for the different gestures achieved at the last step of the training process

It is important to check whether the results presented in Table 9 are maintained on data that is new to our model. This is the reason why we analyse the results for the data in the test sample. Figure 6 shows the confusion matrices for all five gestures.

Fig. 6 Confusion matrices of the test sample for the five bimanual gestures, where the elements in the diagonal are correctly classified and the rest are not

Table 10 summarizes the number of total hits and total errors, shown in Fig. 6, made by the models for each of the five bimanual gestures.

Table 10 Number of total hits and total errors made by the model when classifying the videos in the test sample

Looking now at the five confusion matrices shown in Fig. 6, and at the results in Table 10, we can observe that all five models follow the same tendency that they showed on the training data: they easily and correctly classify the gestures that had a bigger weight in the training sample. This means that no overfitting has occurred when training the models and that the model is able to give a correct prediction for new videos. Our models also correctly classify videos with other target scores, although with a lower rate of success.

Since in our dataset not all the categories have the same number of videos, in Table 11 we consider the weighted average of the precision, recall and F1-score. This is done to get a better understanding of the results given by the models on the test sample. Values greater than 0.8 have been highlighted.

Table 11 Classification metrics of the model obtained from the test sample

Table 11 shows the precision and recall of the five models on the test sample. The results for the butterfly gesture imply that the model has no problem classifying new data; this also applies to the rings, rhombus and rock gestures. Their precisions on the test sample are about 60, 66, 70 and 80\(\%\), respectively. For the guns gesture, the precision is below 50\(\%\), although its recall is almost 70\(\%\).

The F1-score is also shown in Table 11. As in the case of the training sample, some models present a big difference between these two metrics: this is the case for the butterfly and guns gestures. For these two gestures, the best way to judge the quality of the model is to look at its accuracy and F1-score.

Table 12 Percentage of the accuracy (hits) and of one minus the accuracy (errors) in the test sample for the each of the five bimanual gestures, given by the model proposed in this project (columns two and three) and by the initial model (columns four and five)

As a last step, let us analyze the percentage of error given by each of the models on the test sample. Table 12 lists the values for all the gestures; these values are obtained as one minus the accuracy, expressed as a percentage. The errors listed in the second column of Table 12 differ from one gesture to another. The rhombus and rock gestures have an error below 20\(\%\), which translates into an accuracy (labelled as Hits in the table) of over 80\(\%\), while the error of each of the other three gestures is around 30\(\%\). These errors represent an improvement in the accuracy of our model with respect to the accuracy obtained on the training sample, shown in Table 9. The low error of the rhombus and rock gestures is due to having few videos with a target score of 0 or 1, as happened for the single-handed gestures. The error of the other three gestures is a consequence of having a similar number of videos with each score. Having said this, the obtained results can be considered good. We can also consider that the models perform a good classification without producing overfitting. In fact, when these models misclassify videos, they usually do so by classifying videos with a target score of 0 as 1, or vice versa, and videos with a target score of 2 as 3, or vice versa. This still gives an idea of whether the gesture was well or poorly executed.

Comparing columns three and five of Table 12 (we can also compare columns two and four), we can see that all the bimanual gestures considered in this project have improved their results with respect to the initial model. The gesture that shows the slightest improvement is the butterfly, with an error dropping just 0.36 points (from 27.03\(\%\) to 26.67\(\%\)). The rest of the gestures show a decrease in their percentage of error of at least 10 points, with the rhombus gesture showing the greatest improvement when predicting new target scores.

As with the single-handed gestures, we compare our results with those obtained when applying SVM, RF, bagging and gradient boosting to the bimanual gestures. The results are displayed in Tables 13 and 14. In Table 13 we can observe that the total number of hits for each method and bimanual gesture is smaller than that reported for our model in Table 10, and the total number of errors is higher.

Table 13 For the test sample videos of bimanual gestures: number of total hits and total errors made by different classifiers: SVM, RF, bagging and gradient boosting

Table 14 reports the precision, recall and F1-score for the SVM, RF, bagging and gradient boosting classifiers when applied to the bimanual gestures. The obtained results are worse than those reported in Table 11 for our model, with the following exceptions for the precision: the bagging method for the guns gesture and gradient boosting for the rock gesture.

Table 14 For the test sample of bimanual gestures: weighted average of the precision, recall and F1-score metrics when making use of the SVM, RF, bagging and gradient boosting classifiers

5 Discussion and research impact of the proposal

In our proposal we present a method that performs patient movement analysis in two phases: (I) the video is pre-processed to extract the skeleton of the patient’s hands throughout the video; (II) the extracted skeleton is compared with the skeleton of the target gesture proposed to the patient. This comparison is done by means of a similarity function that measures how close the patient is to performing the gesture, and a small neural network gives the final result. In addition, the time it takes the patient to perform the gesture is recorded. In this way, it has been possible to use computer vision techniques and deep learning for the evaluation of apraxia. Machine learning techniques had previously been used to diagnose Alzheimer’s disease and assess its symptoms [17, 24], but without considering apraxias.

Table 15 Research impact of the proposal

The main goal of the first phase is to obtain a robust system by avoiding processing the video in a single step, which would make training the network more difficult, as the complexity of analysing the video and obtaining a result would be greater. Thus, our proposal leaves the heavy processing, the analysis of the video itself (highly complex data), to a specialised network focused solely on extracting the skeleton of the hands, which is highly accurate. In addition, this step can lead to anonymisation of the data, as any self-image of the patient is removed. The second phase is based on a similarity function, obtained from a deterministic mathematical operation. This phase helps the specialised physician to better understand the process when analysing the patient: the system indicates when the patient is closest to the target gesture, which can help in making a diagnosis. Unlike previous neuropsychological tests for the clinical examination of apraxia [15, 16], our proposed similarity function is calculated automatically with software support and outputs an accurate quantitative value to assess apraxia. Part of the dataset used in our experimentation was previously used in other works [25, 26] to detect apraxia using a method based on deep learning. These previous works achieved lower accuracy and did not apply a specific pre-processing phase to extract the skeletons and similarity functions.

Both phases of the method have been put into practice in an experimentation with patients, which has shown promising results. It should be noted that, in the evaluation of some relevant gestures, more than 80% accuracy was achieved. The results used to evaluate apraxia were not dichotomous; gestures were assessed with a score between 0 and 3, so that the neurologist has a more detailed idea of the severity of apraxia. Thus, the proposal helps the neurologist analysing the patient to better understand the process: the system indicates when the patient is closest to the target gesture, which helps in making a diagnosis. The network is intended to take all patients into account, since some gestures are more complicated to perform than others, and to give an evaluation based on all the variables considered.

In summary, we can observe that clinical tests [15, 16] provide a methodology for identifying apraxias in neurology consultations through non-automatic methods (see Table 15). Bringas et al. [26] used computer vision techniques but lacked a pre-processing step that extracts the skeleton features of the hand performing the gesture, resulting in low accuracy rates in identifying apraxia (see Table 15). Daribay et al. [27] addressed a solution to automate the identification of apraxias by extracting the skeleton and using computer vision techniques, but focused on children, who do not have this disorder associated with Alzheimer’s pathology and whose symptoms manifest as speech problems (see Table 15). Thus, the results presented in this article provide a research contribution, as they automate the identification of motor apraxias in Alzheimer’s patients through automatic methods that include pre-processing to extract the skeleton and computer vision techniques (see Table 15).

6 Conclusion

A new method has been proposed for evaluating the execution of gestures performed by patients and captured on video using smartphones. The methodology is based on a mathematical function that performs an objective assessment by checking the similarity between the gesture performed at each moment and the target gesture.

After extracting the skeleton of the hands, the proposed similarity function allows for an objective evaluation of the gestures, resulting in a diagnostic tool that helps to detect apraxias. Experimentation shows that the proposed system achieves good results, even with a small and unbalanced dataset. In particular, many incorrectly performed gestures have been successfully detected, and these are particularly important to detect.

The next steps in the research will mainly focus on further experimentation with new patients, trying to increase the size of the dataset. For these new cases, we want to improve the data acquisition process with stereo cameras, which better capture depth at short distances. This will allow better detection of the hands, as they will be in the foreground, and the extraction of the skeleton will therefore be better performed, since it will be possible to better remove what belongs to the background. In this way, it will also be possible to anonymize the data by removing the part of the image corresponding to the patient’s body.

Furthermore, in view of the good results obtained by the proposed system, there are plans to develop a mobile or desktop application that can be used in a clinical environment. This application aims to be a diagnostic tool that will help the medical team to detect apraxias through the tests carried out on patients and, with this, enable early detection of Alzheimer’s disease.