1 Introduction

Apraxia is a neurological dysfunction in which a person has difficulty carrying out complex gestures [1]. It can be disabling, since affected people cannot perform basic daily tasks such as grooming or getting dressed.

Alzheimer’s disease presents very varied symptoms (memory loss, inability to communicate, changes in personality and behavior, etc.). Apraxia is a manifestation of multiple neurological pathologies, including Alzheimer’s disease [2]. For this reason, one of the most frequently performed diagnostic tests in neurology consultations is to ask the patient to imitate a simple gesture made by the doctor. The doctor then evaluates the imitation by visual inspection to determine whether apraxia is present.

Greater life expectancy entails an increase in the elderly population and, therefore, a higher prevalence of Alzheimer’s disease. This justifies the need to explore research proposals that allow the diagnosis of this pathology to be automated. Previous related works, including [3,4,5,6], show that it is possible to make efficient use of modern tools in these areas. In this line, this work presents a methodological proposal based on deep learning and computer vision techniques to identify apraxias [7]. The methodology contains a set of tasks to process videos that record the gestures made by patients trying to imitate a movement. Finally, the methodology also contains a set of tasks to build support software that detects whether the patient shows symptoms related to apraxia.

The article includes five additional sections. Section 2 reviews research papers related to the detection of apraxias. Section 3 presents our methodology to automate the identification of apraxias using neural networks and computer vision techniques. Section 4 describes the experimentation carried out to validate our methodology with real patients. Section 5 discusses the results of the experimentation and the research impact of the proposal. Section 6 presents the conclusions drawn from this work.

2 Related work

Thanks to modern instruments such as smartphones, together with high connectivity and cloud tools, it has been possible to successfully develop multiple mobile applications. This has led to the development of large systems for clinical diagnostic support, making it easier for medical experts to analyse the data collected from patients using these kinds of devices. This is often referred to as m-health, although the term u-health is also used to refer to the use of ubiquitous computing for medical purposes, taking advantage of wearable or portable devices.

In this scenario, proposals for various applications have been made with great success: Tsang et al. [8] present an app to help prevent asthma attacks and aid self-management of the disease by predicting possible attacks using data collected from smartphones and machine learning techniques. Pryss et al. [9] studied how to predict the level of stress through GPS data collected by a mobile app. Ali et al. [10] harness wearable sensors to collect patient data and try to detect heart disease through deep learning models.

In addition to this, it is worth mentioning some applications that take advantage of the cameras of these devices (or cameras in general). Liang et al. [11] presented an app that takes advantage of smartphone cameras to allow users to self-examine for signs of dental or oral diseases or problems. Kousis et al. [12] show a way to classify melanoma taking advantage of mobile phone cameras, classifying the different moles captured by the patients themselves through an intuitive application. Hasan et al. [13] take advantage of the smartphone camera to record the fingertips and relate the measured colour to the level of haemoglobin in the blood.

Fig. 1 Hand landmarks extracted with the MediaPipe Hands solution. Obtained from [29]

Currently, the existence of apraxia can be determined exclusively through clinical examination, by asking the patient to perform different gestures and evaluating these gestures visually. Different modalities of apraxia can be assessed, such as the imitation of meaningless gestures (imitative apraxia), the pantomime of previously learned gestures (pantomime of object use, or communicative pantomime), and the real use of objects [14]. Several neuropsychological tests have been developed for the clinical examination of apraxia [15, 16]. However, the clinical scoring of apraxia gestures poses several problems. Firstly, the visual evaluation made by the health professional is subjective and can be inaccurate. Also, the gestures performed by the patient are usually rated dichotomously, as correctly or incorrectly performed, so intermediate levels of apraxia impairment cannot be adequately scored. This qualitative evaluation has additional disadvantages, because it makes it very difficult to assess a progression in the apraxic deficit related to the progression of the neurological disorder, or an improvement in response to treatment. Therefore, there is a need to develop a more accurate and quantitative method to evaluate apraxia.

In this scenario, artificial intelligence holds great potential to improve apraxia evaluation. Machine learning approaches have been successfully applied in medical diagnostic processes, particularly in the field of neurodegenerative diseases. A major application is the use of artificial intelligence to analyze brain MRI images for the diagnosis of Alzheimer’s disease. A systematic review on this topic reported that convolutional neural networks achieved the best results (weighted average accuracy 89%), but other approaches such as Logistic Regression or Support Vector Machines also obtained high performance [17]. Many other applications of deep learning models include the classification of electroencephalographic signals for brain-computer interfaces [18]; the staging of neuropathological changes on digitized brain tissue slides [19]; the quantification of amyloid protein deposition in positron emission tomography images [20]; the scoring of the Rey Complex Figure copy, a test to evaluate visuospatial skills [21]; and the analysis of voice recordings to detect speech abnormalities [22] or dementia [23]. However, machine learning has not been explored for apraxia evaluation before. Caselli et al. [24] carried out a kinematic study of apraxia based on an Optotrak camera system that registered apraxia gestures. This approach allowed the analysis of quantitative features of apraxia such as reaction time, intermanual symmetry or manipulation coupling. However, this study was conducted for research purposes, and the goal was not to obtain a scoring system of apraxia to evaluate patients in clinical practice. Multi-camera kinematic studies are not suitable for clinical practice because they are expensive, time-consuming and operator-dependent. Conversely, cameras associated with modern smartphones open the possibility to register patient movements in a very inexpensive and time-efficient way [25, 26].

Furthermore, Daribay et al. [27] proposed an automated solution for identifying apraxias by utilizing computer vision techniques to extract skeletal features. However, their research primarily focused on children, who do not exhibit symptoms of apraxia associated with Alzheimer’s pathology, and whose symptoms predominantly manifest as speech difficulties.

3 Methodology

This section presents the proposed system for automatically evaluating the execution of gestures by patients. The system is composed of two parts: first, the skeleton of the hand is extracted; then, the distance between the gesture performed at each moment (frame) and the target gesture is obtained, and the execution is evaluated.

3.1 Preprocessing and hand tracking using MediaPipe

First of all, each video is pre-processed in order to homogenise all the data and to reduce the number of frames to be analysed. To this end, a uniform sampling of frames per second is taken from each video. In this case, we have used a sampling rate of 5 frames per second.
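As an illustration, a minimal sketch of this sampling step is given below, assuming OpenCV is used to read the recordings; the helper name and the file name are hypothetical and the original implementation may differ.

```python
# Hedged sketch of the 5-fps uniform sampling step, assuming OpenCV (cv2)
# is used to read the recordings; names are illustrative only.
import cv2

def sample_frames(video_path: str, target_fps: int = 5):
    """Return a list of frames sampled uniformly at roughly `target_fps`."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:          # keep one frame every `step` frames
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

frames = sample_frames("patient_gesture.mp4")  # hypothetical file name
```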

MediaPipe [28] is a cross-platform Machine Learning library that provides several Computer Vision solutions, such as everyday object detection, skeleton detection and tracking, or face recognition, among others. One of its most interesting and highly accurate functionalities (around 95.7% for palm detection) is hand detection and skeleton extraction and tracking. This solution allows detecting hands in given images or videos and extracting 21 coordinates or landmarks from each one, as shown in Fig. 1.

This functionality has been exploited for the first step of the proposed system, where videos of patients performing the gestures are processed. This is intended to take advantage of the high precision of this library, simplifying the video data and moving from a highly complex and high-dimensional datum (such as a video) to a series of landmarks distributed over time, representing the movement made by the patient. These landmarks can be processed more easily, and the similarity between two gestures can be calculated.
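A minimal sketch of this step with the MediaPipe Hands Python solution is shown below; it reduces each sampled frame to up to two sets of 21 (x, y, z) landmarks. The function and variable names are illustrative, not taken from the original implementation.

```python
# Hedged sketch: extract hand landmarks per frame with MediaPipe Hands [28].
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_landmarks(frames):
    """Return, per frame, a list of 21-point landmark lists (one per detected hand)."""
    sequences = []
    with mp_hands.Hands(static_image_mode=False,
                        max_num_hands=2,
                        min_detection_confidence=0.5) as hands:
        for frame in frames:
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks is None:
                sequences.append([])      # no hand detected in this frame
                continue
            sequences.append([
                [(lm.x, lm.y, lm.z) for lm in hand.landmark]
                for hand in result.multi_hand_landmarks
            ])
    return sequences
```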

3.2 Similarity distance

The solution given by MediaPipe provides a skeleton of the different hands in the frame, given by 21 points per hand. A simple way to obtain a similarity distance with respect to the target gesture would be to use a distance between graphs such as the Hamming distance. However, that distance is based on the adjacency matrix, which in this project is the same for the target gesture and the patient’s gesture. For this reason, we propose a new approach, explained below.

To evaluate the quality of the gesture we consider a new graph, \({\mathcal {G}}=(V,E,A)\). When working with single-handed gestures we keep the 21 original points as the vertices, V. This new graph is an undirected weighted graph in which all the vertices are connected to each other, giving the set of edges, E. The weight of each edge in A is given by the Euclidean distance between the two vertices it connects. The set of all the weights can be written as a matrix whose elements are the weights of the corresponding edges. This matrix has zeros on the diagonal and, for undirected graphs, is symmetric. To simplify matters we consider a pseudo-weight matrix which is upper triangular. When working with bimanual gestures, the number of vertices doubles to 42, but the idea behind this new graph \({\mathcal {G}}\) remains the same: the matrix now contains the distances between the keypoints of each hand and the distances between the keypoints of the two hands.

With these considerations in place, the new matrix is a square matrix in \({\mathbb {R}}^{g\times g}\), where \(g=21\) when the gesture being analysed is a single-handed one and \(g=42\) when it is a bimanual one. The matrix is kept square because this later allows us to compute a matrix norm. This new matrix, \(A'_{F H}\), is computed for each frame \(F\in v,\) with v the video under analysis, and for each \(H\in {\mathcal {H}}',\) with \({\mathcal {H}}'\) the set of all possible gestures except the palm one, which is not analysed in this project. It is as follows:

$$\begin{aligned} A'_{F H} = \begin{pmatrix} 0 & a_{1,2} & \cdots & \cdots & a_{1,g}\\ \vdots & 0 & a_{2,3} & \cdots & a_{2,g}\\ \vdots & \vdots & \ddots & \ddots & \vdots \\ \vdots & \vdots & & \ddots & a_{g-1,g}\\ 0 & 0 & \cdots & 0 & 0 \end{pmatrix}, \end{aligned}$$
(1)

where \(a_{i,j}\) is the weight of each edge and is computed as:

$$\begin{aligned} a_{i,j} = \frac{\Vert v_{i-1} - v_{j-1} \Vert _2}{b_{1,2}} \end{aligned}$$
(2)

with:

$$\begin{aligned} b_{1,2} = \Vert v_{0} - v_{1} \Vert _2. \end{aligned}$$
(3)

In this case, \(\Vert \cdot \Vert _2\) refers to the Euclidean norm and \(v_i\) is the ith point for the single-handed gestures. When a bimanual gesture is analysed, the points with \(i\in \{0,\ldots ,20\}\) correspond to those of the right hand and those with \(i\in \{21,\ldots ,41\}\) to those of the left hand. To identify which of the points in Fig. 1 a left-hand point corresponds to, it is enough to subtract 21 from the value of i. Normalising all the weights with respect to the first one, \(b_{1,2}\), makes the result insensitive to the distance of the hand from the camera.
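The construction of Eqs. (1)-(3) can be sketched as follows, assuming the landmarks of a frame are arranged as a NumPy array of shape (g, 3); this is an illustrative implementation, not the authors' code.

```python
# Sketch of the normalised pseudo-weight matrix A'_{FH} of Eqs. (1)-(3),
# for a (g, 3) array of landmarks (g = 21 single-handed, g = 42 bimanual).
import numpy as np

def weight_matrix(landmarks: np.ndarray) -> np.ndarray:
    # pairwise Euclidean distances between all g keypoints
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # normalise by the reference distance b_{1,2} = ||v0 - v1|| (Eq. 3)
    dist /= np.linalg.norm(landmarks[0] - landmarks[1])
    # keep only the strictly upper triangular part (pseudo-weight matrix)
    return np.triu(dist, k=1)
```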

This procedure has also been carried out for the target gestures, each of which consists of a single frame. We denote by \(A'_{\cdot H}\) the matrix corresponding to a target gesture. This is done for each \(H\in {\mathcal {H}}'.\)

Once we have computed the distances between the different keypoints in the hand landmarks, we need to know how similar the gesture is to the target one. To do so, it is useful to see how much the distances between the vertices differ between the patient’s gesture and the target gesture. For example, if the index finger is stretched in both gestures, the distances between its points will vary little, whereas if in the patient’s video the fist is closed, these distances will vary much more.

A similarity distance is implemented as follows to determine to what extent the patient’s gesture resembles the target one. This distance makes use of the following function:

$$\begin{aligned} s(A'_{F H},A'_{\cdot H}) = 1- \frac{\Vert A'_{F H} - A'_{\cdot H}\Vert _{Fro}}{\Vert A'_{\cdot H}\Vert _{Fro}} \end{aligned}$$
(4)

where the Frobenius norm allows us to compute an element-wise norm in the same way as the Euclidean norm does with vectors.

We normalize by the norm of the matrix of the target gesture so that \( s(A'_{F H},A'_{\cdot H})\in [0,1]\). Small values of \(s(A'_{F H},A'_{\cdot H})\) are unlikely to be obtained, since some distances, such as those from the wrist to the knuckles, do not depend on the state of the fingers (stretched or flexed).

This similarity distance is computed for all the different frames extracted from the original video, and the similarity distance of the video is taken as:

$$\begin{aligned} d(v,H) = \max _{F\in v}\{s(A'_{F H},A'_{\cdot H})\}. \end{aligned}$$
(5)

We also record the time at which this maximum value of d is achieved; this is the duration time.
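A sketch of Eqs. (4)-(5) is given below: the frame-wise similarity to the target matrix and the video-level score, taken as the maximum over the sampled frames, together with the time at which that maximum is reached. The variable names and the 5-fps assumption are illustrative.

```python
# Hedged sketch of Eqs. (4)-(5); `frame_matrices` holds one A'_{FH} per
# sampled frame and `a_target` is the A'_{.H} of the target gesture.
import numpy as np

def frame_similarity(a_frame: np.ndarray, a_target: np.ndarray) -> float:
    # Eq. (4): 1 - ||A'_FH - A'_.H||_Fro / ||A'_.H||_Fro
    return 1.0 - (np.linalg.norm(a_frame - a_target, ord="fro")
                  / np.linalg.norm(a_target, ord="fro"))

def video_similarity(frame_matrices, a_target, sampling_fps: float = 5.0):
    scores = [frame_similarity(a, a_target) for a in frame_matrices]
    best = int(np.argmax(scores))
    # Eq. (5), plus the time (in seconds) at which the maximum is reached
    return scores[best], best / sampling_fps
```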

3.3 Automatic gesture evaluation

The similarity distance, the execution time and the duration are fed to a three-layer neural network composed of the input layer, one hidden layer and the output layer (see Fig. 2). The input layer has the same number of neurons as the number of features fed to the model, in this case 3. The hidden layer can have as many neurons as desired; we obtained the best results with a hidden layer of 8 neurons. Both layers use the ReLU activation function.

Fig. 2 Overview of the architecture of the network

Fig. 3 Schematic of the initial model to which the patient’s original video is transferred. First the execution time of the gesture is extracted, and the original video goes through the MediaPipe Hands solution to extract the skeleton. With the skeleton, the maximum similarity distance to the target gesture is computed, together with the time it takes to reach it. These features plus the time duration are fed to the trained model, which gives the output: the score of the gesture. For privacy reasons, the gesture shown in this figure is made by an author of this paper and does not belong to the dataset

The last layer is composed of just 4 neurons, the same number of target scores a video can receive. For this layer the activation function used is the softmax function. It has been found through experimentation that these hyperparameters are the best performers.

To obtain a final classification of the input data, the neural network goes through a learning process in which the weights of each connection are updated so that the predicted results are as close as possible to the real ones. The update of the weights is done by an optimization process in which the system minimizes the cross-entropy loss function. There are different algorithms to minimize this loss function, such as gradient descent or stochastic gradient descent. In this project we used the Adam algorithm, which follows a stochastic gradient descent procedure based on adaptive estimation of first- and second-order moments. Figure 3 shows a scheme of the whole model, from the original video to the final output, including obtaining the skeleton of the hand and computing the similarity distance with the target gesture.
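The text does not specify the deep learning framework used; as a hedged illustration, a minimal Keras sketch of the described architecture (3 input features, one hidden layer of 8 ReLU units, a 4-unit softmax output, Adam optimizer and cross-entropy loss) could look as follows, assuming the target scores are encoded as integers 0-3.

```python
# Hedged Keras sketch of the gesture-scoring network described above.
import tensorflow as tf

def build_scorer() -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(3,)),              # similarity, execution time, duration
        tf.keras.layers.Dense(8, activation="relu"),    # hidden layer
        tf.keras.layers.Dense(4, activation="softmax"), # scores 0-3
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # cross-entropy on integer labels
                  metrics=["accuracy"])
    return model
```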

4 Experiments and results

This section presents the study carried out using the system proposed in the previous section, as well as the results obtained from the experimentation. In addition, the dataset used to carry out the experimentation and how it was obtained is described.

4.1 Dataset

The dataset used was obtained from patients evaluated in the Cognitive Disorders Unit of the Marques de Valdecilla University Hospital in Santander, Spain, and from the Valdecilla Cohort for the Study of Memory and Brain Aging, a local project that enrolls healthy elders free of dementia. Therefore, the study sample includes both cognitively normal subjects and patients with different cognitive disorders (such as Alzheimer’s disease and frontotemporal dementia), ensuring a wide range of apraxic deficits. The study was approved by the local Ethics Committee and all participants gave their written informed consent according to the Declaration of Helsinki. For those patients who could not give a reliable informed consent due to their degree of cognitive impairment, it was obtained from their accompanying relative.

Seventy-eight subjects participated in the test: 30 patients with Alzheimer’s disease (AD), 26 patients with frontotemporal dementia (FTD) and 22 who were healthy or had other diagnoses. Subjects were between 55 and 87 years old, with 43 women and 35 men in the sample.

Patients were asked to imitate a series of gestures, unimanual or bimanual, shown in pictures. Although apraxia is usually evaluated in the clinical setting by asking the patient to imitate the manual postures adopted by the examiner, for this project we decided to ask patients to imitate the postures shown in a picture to ensure reproducibility. The gestures to be imitated are shown in Fig. 4.

Fig. 4 Unimanual and bimanual gestures used in the study

The execution of these gestures was evaluated by the neurologist in charge, taking into account both the difficulties in performing them and the time taken. Each evaluated gesture was given a score between 0 and 3 (3: executed perfectly; 2: executed correctly with minor deviations; 1: performed incorrectly but recognisable; 0: unrecognisable).

To generate the dataset, patients were recorded performing the different gestures using smartphone cameras. The model of the smartphone or its camera was not of great importance, as the patients were at a distance of about one metre (close-up) and the quality was not affected. Resolution is also unimportant for the same reason, and a high resolution could be computationally expensive. The camera was fixed during each of the recordings, although its position can vary between patients, as can the positioning of the patient with respect to it. This was done on purpose, so as not to restrict the analysis to a very closed and limited environment.

Table 1 shows the number of videos with the different target scores for each of the single-handed gestures in our dataset. In this case, the table groups together those made with the left and the right hand, since there was no difference between them when computing the similarity distance.

Table 1 Number of videos in our dataset with each of the possible target scores for all the single-handed gestures

Table 2 shows the number of videos with the different target scores for each one of the bimanual gestures in our dataset.

Table 2 Number of videos in our dataset with each of the possible target scores for all the bimanual gestures

4.2 Training process and evaluation metrics

The model proposed in this work is a simple one that can be executed on any device. In fact, the network used in the last step of this implementation was trained on an LG gram laptop equipped with an Intel Core i7 processor.

For all the single-handed gestures, we chose a batch size of 12 and a learning rate of the order of \(10^{-3}\). For all the bimanual gestures, we chose a batch size of 6 and a learning rate of the order of \(10^{-4}\). These values are sufficient for the models to converge within 100 epochs in both cases: the single-handed gestures and the bimanual ones.
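Under the same Keras assumption as the sketch in Sect. 3.3, the training configuration described above could be expressed as follows; the data placeholders are hypothetical.

```python
# Hedged sketch of the training setup: batch size 12 and lr ~1e-3 for
# single-handed gestures, batch size 6 and lr ~1e-4 for bimanual ones,
# 100 epochs in both cases. X_train/y_train are placeholders.
import tensorflow as tf

def train(model: tf.keras.Model, X_train, y_train, bimanual: bool = False):
    lr = 1e-4 if bimanual else 1e-3
    batch = 6 if bimanual else 12
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(X_train, y_train, batch_size=batch, epochs=100)
```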

Neural networks allow us to classify data, but it is important to note that any classification is done with a certain error. Analysing how big this error is, that is, analysing the model’s performance, allows us to understand how our model will behave, both during its learning process and when new data is introduced. The confusion matrix is a useful tool to do so. As stated in [30], a confusion matrix is a \(2\times 2\) matrix with the following four categories:

  • True positive (TP): correctly classified as the positive class (or class of interest).

  • False positive (FP), also known as type 1 error: incorrectly classified as the positive class.

  • False negative (FN) or type 2 error: incorrectly classified as the negative class (not the class of interest).

  • True negative (TN): correctly classified as the negative class.

These four categories show whether the predictions made by our model match the actual values or not. The target and predicted values of each category are the basis of different metrics used to evaluate the performance of classification models. It is important to note that these metrics are not to be understood as a distance function. One of these metrics is the accuracy.

$$\begin{aligned} \text {Accuracy} = \frac{\text {Number of correct predictions}}{\text {Total number of predictions}}. \end{aligned}$$
(6)

Thus, the accuracy shows the proportion of correctly classified elements. In addition to the accuracy, there are other metrics for evaluating performance, such as precision and recall, the latter also known as true positive rate (TPR). The precision shows the proportion of elements predicted as positive that are actually positive, while the recall shows the proportion of actual positive elements that have been correctly identified:

$$\begin{aligned}&\text {Precision} = \frac{TP}{{TP} + {FP}} \end{aligned}$$
(7)
$$\begin{aligned}&\text {Recall} = \frac{TP}{{TP} + {FN}}. \end{aligned}$$
(8)

It is important to note that these two metrics are related, and improving one of them usually means a reduction in the other. To balance the two metrics, we usually turn to the F1-measure. The F1-measure or F1-score is the weighted harmonic mean of the precision and recall; it takes a value of 1 in the best case (perfect precision and recall) and 0 in the worst case.

$$\begin{aligned} {F1}\text {-score} = 2\,\frac{\text {Precision}\cdot \text {Recall}}{\text {Precision} + \text {Recall}} = \frac{{TP}}{{TP} + \frac{1}{2}({FP} + {FN})}. \end{aligned}$$
(9)

These metrics apply when the classification problem is binary, and they can be extrapolated to the case of K classes. In this case, the confusion matrix is a \(K\times K\) matrix and the accuracy can still be computed as specified by Eq. (6). The per-class precision, recall and F1-score can be calculated as in the binary case, Eqs. (7)-(9) respectively. To obtain the value of these metrics for the whole model, it is enough to compute the arithmetic mean or the weighted mean, depending on the number of videos per class in the dataset.
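As an illustration of this evaluation procedure, a sketch using scikit-learn is given below; it builds the K x K confusion matrix and the weighted-average precision, recall and F1-score. The `y_true`/`y_pred` arguments are placeholders for the target and predicted scores of the test sample.

```python
# Hedged sketch of the evaluation: K x K confusion matrix, accuracy (Eq. 6)
# and weighted-average precision, recall and F1-score (Eqs. 7-9).
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)                 # K x K matrix
    acc = accuracy_score(y_true, y_pred)                  # Eq. (6)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    return cm, acc, prec, rec, f1
```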

Thus, in this study we will only pay attention to the accuracy of our model and to the weighted mean of the precision, recall and F1-score.

4.3 Results

4.3.1 Single-handed gestures

It is worth mentioning that, for the single-handed gestures, the majority of the videos have a target score of 2 or 3, as seen in Table 1, so when training the models we focused on the ability of the model to correctly predict target scores of 0 or 1 rather than on achieving a higher accuracy on the training set. In the cases in which there was only one video with a certain target score, we chose the split of the dataset in which this video belongs to the training sample.

Table 3 shows the final values of the loss and the accuracy for each of the models, with accuracy values greater than 0.8 highlighted.

Table 3 Loss and accuracy for the different gestures achieved at the last step of the training process

Figure 5 shows the confusion matrices for all six gestures on the test sample, in order to see how the models respond when new data is introduced.

Fig. 5 Confusion matrices of the test sample for the six single-handed gestures, where the elements in the diagonal are correctly classified and the rest are not

Table 4 summarizes the number of total hits and total errors made by the models for each of the six single-handed gestures. These results are obtained from Fig. 5.

Table 4 Number of total hits and total errors made by the model when classifying the videos in the test sample

Looking now at the six confusion matrices shown in Fig. 5, and at the results in Table 4, it can be appreciated that all six models follow the same tendency that they showed on the training data: they easily and correctly classify the gestures with a target score of 3. Moreover, our models also correctly classify videos with other scores. This means that no overfitting has occurred when training the models and that, for new videos, the model is able to give a correct prediction.

To get a better understanding of the results given by the models on the test sample, we check the precision, recall and F1-score, which are listed in Table 5, with values greater than 0.8 highlighted. We have considered the weighted average to avoid cases where the classification of just one video could carry the same weight in the average precision as all of the videos with another target score.

Table 5 shows the precision and recall of all six models. We can see that the results obtained for both the precision and the recall can be considered good for all six gestures, with values between 79\(\%\) and 83\(\%\) for the precision and over 81\(\%\) for the recall.

The F1-score is also shown in Table 5. As in the case of the training set, some models, namely those for the gun and ok gestures, present a big difference between these two metrics. The F1-score gives a better understanding of the quality of the results given by these two models, which can be considered good.

Lastly, let us look at the percentage of error given by each of the models on the test sample. These values are obtained as one minus the accuracy, expressed as a percentage. The values for all the gestures appear in Table 6, where we have also included the percentage of accuracy under the label Hits.

Table 5 Classification metrics of the model obtained from the test sample
Table 6 Percentage of the accuracy (hits) and of one minus the accuracy (errors) in the test sample for the each one of the six single-handed gestures given by the model proposed in this project

The errors listed in Table 6 are all below 20\(\%\), which translates into an accuracy of over 80\(\%\). These errors seem logical when we consider the accuracy shown in Table 3. The reason the accuracy in this case is similar or higher is that the training sample contained fewer videos with a target score of 0 or 1, making it more difficult for the model to learn when videos should receive these scores. Nevertheless, these results can be considered good. We can also consider that the models perform a good classification without producing overfitting; if this were the case, the difference between the accuracy on the training sample and on the test sample would be much higher than the results obtained.

The initial project explained in [25, 26] only reports the percentage of hits for the gestures three and victory. Additionally, in the initial model, left and right single-handed gestures were not mixed together. For the three gesture we have the results for both hands, while for the victory gesture we only have the results for the right hand.

Since for the victory gesture we only have a percentage of error, let us compare it to that obtained by our model. In the initial model the percentage of error was 67.15\(\%\), while in ours it drops to 18.18\(\%\). For the three gesture, the percentages of error in the initial model are 69.68\(\%\) for the right hand and 56.76\(\%\) for the left hand. The results obtained by our approach present an improvement with respect to both, dropping the percentage of error to 18.75\(\%\). For both gestures, better results are obtained with our approach than with the initial model.

Assuming that the rest of the single-handed gestures would obtain results similar to the ones shown in [25, 26], we can consider that our models perform a good classification when compared to the initial model.

Furthermore, we compare our results to those obtained using other well-known methodologies, in particular support vector machines (SVM, [31]), random forests (RF, [32]), bagging [33] and gradient boosting [34]. In Table 7, we have included the total number of hits and errors achieved by each of these models for each of the single-handed gestures. We can observe that the results in Table 7 are worse than those of our model, reported in Table 4, except for the ok gesture with SVM and RF, for which the same results are obtained: 30 videos are classified appropriately and 3 are misclassified.

Table 7 For the test sample videos of single-handed gestures, number of total hits and total errors made by different classifiers: SVM, RF, bagging and gradient boosting

In Table 8, we have included the precision, recall and F1-score for the single-handed gestures, again using SVM, RF, bagging and gradient boosting. Comparing these results with those reported in Table 5, we find that our model again produces better results, except for the ok gesture with SVM and RF, which are exactly the same.

Table 8 For the test sample of single-handed gestures: weighted average of the precision, recall and F1-score metrics when making use of the SVM, RF, bagging and gradient boosting classifiers

4.3.2 Bimanual gestures

As seen in Table 2, some gestures follow the same distribution of videos as the single-handed ones (a majority of videos with a target score of 3). In these cases, we trained the model focusing on its ability to correctly predict the other target scores. Meanwhile, for the gestures that have the same number of videos with each target score, we focused on the ability of the model to correctly predict the target score of each video, i.e. its global accuracy.

Table 9 displays the final values of the loss and the accuracy for each of the models. The accuracy values greater than 0.8 have been highlighted.

Table 9 Loss and accuracy for the different gestures achieved at the last step of the training process

It is important to check whether the results presented in Table 9 are maintained on data that is new to our model. This is the reason why we analyse the results for the data in the test sample. Figure 6 shows the confusion matrices for all five gestures.

Fig. 6 Confusion matrices of the test sample for the five bimanual gestures, where the elements in the diagonal are correctly classified and the rest are not

Table 10 summarizes the number of total hits and total errors, shown in Fig. 6, made by the models for each of the five bimanual gestures.

Table 10 Number of total hits and total errors made by the model when classifying the videos in the test sample

Looking now at the five confusion matrices shown in Fig. 6, and at the results in Table 10, we can observe that all five models follow the same tendency that they showed on the training data: they easily and correctly classify the gestures that had a bigger weight in the training sample. This means that no overfitting has occurred when training the models and that the model is able to give a correct prediction for new videos. Our models also correctly classify videos with other target scores, although with a lower rate of success.

Since in our dataset not all the categories have the same number of videos, in Table 11 we consider the weighted average of the precision, recall and F1-score. This is done to get a better understanding of the results given by the models on the test sample. Values greater than 0.8 have been highlighted.

Table 11 Classification metrics of the model obtained from the test sample

Table 11 shows the precision and recall of the five models on the test sample. The results for the butterfly gesture imply that the model has no problem classifying new data; this also applies to the rings, rhombus and rock gestures. Their precisions on the test sample are about 60, 66, 70 and 80\(\%\), respectively. For the guns gesture, the precision is below 50\(\%\), although its recall is almost 70\(\%\).

The F1-score is also shown in Table 11. As in the case of the training sample, some models present a big difference between these two metrics: this is the case for the butterfly and guns gestures. For these two gestures, the best way to judge the quality of the model is to look at its accuracy and F1-score.

Table 12 Percentage of the accuracy (hits) and of one minus the accuracy (errors) in the test sample for the each of the five bimanual gestures, given by the model proposed in this project (columns two and three) and by the initial model (columns four and five)

As a last step, let us analyze the percentage of error given by each of the models on the test sample. Table 12 lists the values for all the gestures; these values are obtained as one minus the accuracy, expressed as a percentage. The errors listed in the second column of Table 12 differ from one gesture to another. The rhombus and rock gestures have an error below 20\(\%\), which translates into an accuracy (labelled as Hits in the table) of over 80\(\%\), while the error of each of the other three gestures is around 30\(\%\). These errors represent an improvement in the accuracy of our model with respect to the accuracy obtained on the training sample, shown in Table 9. The low error of the rhombus and rock gestures is due to having few videos with a target score of 0 or 1, as happened for the single-handed gestures. The error of the other three gestures is a consequence of having a similar number of videos with each score. Having said this, the obtained results can be considered good. We can also consider that the models perform a good classification without producing overfitting. In fact, when these models misclassify videos, they usually do so by classifying videos with a target score of 0 as 1, or vice versa, and videos with a target score of 2 as 3, or vice versa. This still gives an idea of whether the gesture was well or poorly executed.

Comparing columns three and five of Table 12 (we can also compare columns two and four), we can see that all the bimanual gestures considered in this project have improved their results with respect to the initial model. The gesture that shows the slightest improvement is the butterfly, with an error dropping just 0.36 points (from 27.03\(\%\) to 26.67\(\%\)). The rest of the gestures show a decrease in their percentage of error of at least 10 points, with the rhombus gesture showing the greatest improvement when predicting new target scores.

As with the single-handed gestures, we compare our results with those obtained when applying SVM, RF, bagging and gradient boosting to the bimanual gestures. The results are displayed in Tables 13 and 14. In Table 13 we can observe that the total number of hits for each method and bimanual gesture is smaller than that reported for our model in Table 10, and the total number of errors is higher.

Table 13 For the test sample videos of bimanual gestures: number of total hits and total errors made by different classifiers: SVM, RF, bagging and gradient boosting

Table 14 reports the precision, recall and F1-score for the SVM, RF, bagging and gradient boosting classifiers when applied to the bimanual gestures. The obtained results are worse than those reported in Table 11 for our model, with the following exceptions for the precision: the bagging method for the guns gesture and gradient boosting for the rock gesture.

Table 14 For the test sample of bimanual gestures: weighted average of the precision, recall and F1-score metrics when making use of the SVM, RF, bagging and gradient boosting classifiers

5 Discussion and research impact of the proposal

In our proposal we present a method that performs patient movement analysis in two phases: (I) the video is pre-processed to extract the skeleton of the patient’s hands throughout the video; (II) the extracted skeleton is compared with the skeleton of the target gesture proposed to the patient. This comparison is done by means of a similarity function that measures how close the patient is to performing the gesture, and a small neural network gives the final result. In addition, the time it takes the patient to perform the gesture is recorded. In this way, it has been possible to use computer vision techniques and deep learning for the evaluation of apraxia. Machine learning techniques had previously been used to diagnose Alzheimer’s disease and assess its symptoms [17, 24], but without considering apraxias.

Table 15 Research impact of the proposal

The main goal of the first phase is to obtain a robust system by avoiding processing the video in a single step, which would make training the network more difficult, as the complexity of analysing the video and obtaining a result would be greater. Thus, our proposal leaves the heavy processing, the analysis of the video itself (highly complex data), to a specialised network focused solely on extracting the skeleton of the hands, which is highly accurate. In addition, this step can lead to anonymisation of the data, as any self-image of the patient is removed. The second phase is based on a similarity function, obtained from a deterministic mathematical operation. This phase helps the specialised physician to better understand the process when analysing the patient: the system indicates when the patient is closest to the target gesture, which can help in making a diagnosis. Unlike previous neuropsychological tests for the clinical examination of apraxia [15, 16], our proposed similarity function is calculated automatically with software support and outputs an accurate quantitative value to assess apraxia. Part of the dataset used in our experimentation was previously used in other works [25, 26] to detect apraxia using a method based on deep learning. These previous works achieved lower accuracy and did not apply a specific pre-processing phase to extract the skeletons and similarity functions.

Both phases of the method have been put into practice in an experimentation with patients, which has shown promising results. It should be noted that, in the evaluation of some relevant gestures, more than 80% accuracy was achieved. The results used to evaluate apraxia were not dichotomous; gestures were assessed with a score between 0 and 3, so that the neurologist has a more detailed idea of the severity of apraxia. Thus, the proposal helps the neurologist analysing the patient to better understand the process: the system indicates when the patient is closest to the target gesture, which helps in making a diagnosis. The network is intended to take all patients into account, since some gestures are more complicated to perform than others, and to give an evaluation based on all the variables considered.

In summary, we can observe that clinical tests [15, 16] provide a methodology for identifying apraxias in neurology consultations through non-automatic methods (see Table 15). Bringas et al. [26] used computer vision techniques but lacked a pre-processing step that extracts the skeleton features of the hand performing the gesture, resulting in low accuracy rates in identifying apraxia (see Table 15). Daribay et al. [27] addressed a solution to automate the identification of apraxias by extracting the skeleton and using computer vision techniques, but focused on children, who do not have this disorder associated with Alzheimer’s pathology and whose symptoms manifest as speech problems (see Table 15). Thus, the results presented in this article provide a research contribution, as they automate the identification of motor apraxias in Alzheimer’s patients through automatic methods that include pre-processing to extract the skeleton and computer vision techniques (see Table 15).

6 Conclusion

A new method has been proposed for evaluating the execution of gestures performed by patients and captured on video using smartphones. The methodology is based on a mathematical function that performs an objective assessment by checking the similarity between the gesture performed at each moment and the target gesture.

After extracting the skeleton of the hands, the proposed similarity function allows for an objective evaluation of the gestures, resulting in a diagnostic tool that helps to detect apraxias. Experimentation shows that the proposed system achieves good results, even with a small and unbalanced dataset. In particular, many incorrectly performed gestures have been successfully detected, and these are particularly important to detect.

The next steps in the research will mainly focus on further experimentation with new patients, trying to increase the size of the dataset. For these new cases, we want to improve the data acquisition process with stereo cameras, which better capture depth at short distances. This will allow better detection of the hands, as they will be in the foreground, and the extraction of the skeleton will therefore be better performed, since it will be possible to better remove what belongs to the background. In this way, it will also be possible to anonymize the data by removing the part of the image corresponding to the patient’s body.

Furthermore, in view of the good results obtained by the proposed system, there are plans to develop a mobile or desktop application that can be used in a clinical environment. This application aims to be a diagnostic tool that will help the medical team to detect apraxias through the tests carried out on patients and, with this, enable early detection of Alzheimer’s disease.