On the analysis of speech and disfluencies for automatic detection of Mild Cognitive Impairment

Alzheimer’s disease is characterized by a progressive and irreversible cognitive deterioration. It is preceded by an earlier stage, the so-called Mild Cognitive Impairment or cognitive loss. This earlier stage does not seem severe enough to interfere with independent abilities of daily life, so it often does not receive an appropriate diagnosis. Its detection is therefore a crucial challenge for medical specialists. This paper presents a novel proposal for such early diagnosis based on the automatic analysis of speech and disfluencies and on Deep Learning methodologies. The proposed tools could be useful for supporting Mild Cognitive Impairment diagnosis. The Deep Learning approach includes Convolutional Neural Networks and nonlinear multifeature modeling. Additionally, an automatic hybrid methodology is used to select the most relevant features by means of the nonparametric Mann–Whitney U test and Support Vector Machine Attribute evaluation.


Introduction
Alzheimer's disease (AD) is characterized by a progressive and irreversible cognitive deterioration, which includes memory loss and impairments in emotion, language, and judgment, along with other cognitive deficits and behavioral symptoms. Its prevalence keeps increasing, mainly among the elderly, and as highlighted by the latest World Alzheimer Reports, AD is becoming epidemic: 900 million people can be regarded as the world's elderly population, most of them living in developed countries [1]. An early and accurate diagnosis of AD therefore helps patients and relatives to plan for the future and offers the best chance of treating the symptoms. At an early stage, a preceding cognitive loss, known as Mild Cognitive Impairment (MCI), appears. It does not seem severe enough to interfere with the abilities of daily life, so it usually does not receive an appropriate diagnosis, and some patients later develop AD. The detection of MCI is a challenge to be addressed by medical specialists and could help future AD patients [2]. Along with memory loss, one of the main problems of AD is the loss of social and language skills. This loss shows up as difficulty in speaking to and understanding people, which further complicates social interaction and the natural communication process. Other abilities that are crucial for communication, such as emotion and expression, are impaired as well. These communication difficulties appear in the early stages of the disease because of language problems, and they lead people with AD to social exclusion, with a serious negative impact not only on the patients but also on their families [3]. During communication, language resources that include pauses or disfluencies are used to maintain verbal fluency. In AD/MCI, verbal fluency clearly changes: fluent speech is progressively replaced by more pauses and disfluencies. Disfluencies are therefore language elements of interest that could help to diagnose MCI properly.
Both disfluencies and speech silences carry valuable information for understanding the meaning of the uttered message.
One of the main aims of this project is to develop an automatic analysis of standard assessment tests, such as Categorical Verbal Fluency (CVF), by using speech therapy techniques that allow these specific analyses to be obtained quickly and reliably [4]. In recent years, several papers in the state of the art have addressed this issue. In the present work, we focus on the integration of more robust, language-independent methodologies in order to detect AD in speech, using one of the classical CVF tasks, the so-called animal naming task. Machine Learning and Deep Learning paradigms will be used for modeling, together with several feature sets based on linear and nonlinear approaches, in order to develop a real-time and robust system. Section 2 describes the materials. Section 3 presents the methods. Section 4 includes the results and discussion, and finally, conclusions are drawn in Sect. 5.

Materials
Recent studies highlight the relevance of non-speech elements such as disfluencies in verbal communication to identify MCI and AD. In [5,6], it is suggested that more pauses and shorter recording times reflect that AD patients require a greater effort to produce speech than healthy people: AD patients speak more slowly, with longer pauses and shorter speech segments, and they spend more time trying to find the correct word, which leads to speech disfluencies or broken messages. A speech disfluency is any irregularity, break, or non-language element that occurs during a period of fluent speech, and it can start, complement, or interrupt that speech. Disfluencies include elements such as false starts, restarted or repeated phrases, extended or repeated syllables, thinking out loud, grunts or nonlexical utterances such as repaired utterances and fillers, and speakers correcting mispronunciations or their own slips of the tongue [7]. An increase in these disfluencies could be a clear sign of cognitive impairment. In AD patients, the verbal utterance sometimes reflects their internal cognitive process or inner dialogue when they think out loud: "What is that?", "How was this…", "/uhm/ I can't remember," "What was the name?". If the number of silences and disfluencies increases, it may indicate a worsening of the disease, which could lead to a deficit in effective and clear communication.
As a consequence, disfluencies play an important role in verbal communication. They directly reflect the cognitive process involved in communication and provide a distinctive characteristic for the diagnosis of these disorders when fluent speech starts to disappear or is replaced by disfluencies (Fig. 1). Although AD is mainly a cognitive disease, it may also involve biomechanical alterations of articulation and phonation. The Categorical Verbal Fluency (CVF) task, also known as animal naming (AN) or the animal fluency task, is a test used in neurodegenerative diseases that measures and quantifies the progression of cognitive impairment [8]. It is commonly used to assess language skills, executive functions, and semantic memory [9]. The sample includes 187 healthy individuals and 38 MCI patients who belong to the cohort of the Gipuzkoa-Alzheimer Project (PGA) of the CITA-Alzheimer Foundation [4,10] (Table 1). For the experiments, a balanced subset, PGA-OREKA, has been selected.

Methods
Recent state-of-the-art approaches include modeling by means of linear and nonlinear speech features [14,15]. The proposed approach integrates several types of optimum features, both linear and nonlinear, to model speech and disfluencies. Furthermore, the proposal is based on the description of speech pathologies [5,12] with regard to articulation, phonation, speech quality, human perception, and the complex dynamics of the system. In this paper, some of the most widely used speech features (linear and nonlinear) are taken into account for differentiating between pathological and healthy speech [4,5,12-16] and for discrimination through human perception. Most of them are well known in the field of pathological speech characterization; therefore, for each parameter, a reference is given where further information and a deeper description can be found. All features are calculated by means of software developed within our research group [4,5], SPSS [17], MATLAB [18], Praat [19], and WEKA [20].

Automatic segmentation of disfluencies
The speech recording has been automatically segmented into disfluencies and speech signal by a voice activity detection (VAD) algorithm [6].
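The details of the VAD algorithm of [6] are not reproduced here. As a rough illustration only, a short-time energy detector can be sketched as follows. The frame length, the energy threshold, the label names, and the function names are all assumptions of this sketch and not the settings used in the experiments.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies

def segment_by_energy(samples, frame_len=220, threshold=0.01):
    """Label each frame as 'speech' or 'non-speech' and merge equal neighbours.

    Returns a list of (label, first_sample, end_sample_exclusive) tuples.
    Hypothetical defaults: 220 samples is 10 ms at a 22 kHz sampling rate.
    """
    segments = []
    for i, energy in enumerate(frame_energies(samples, frame_len)):
        label = "speech" if energy >= threshold else "non-speech"
        start, end = i * frame_len, (i + 1) * frame_len
        if segments and segments[-1][0] == label:
            # Extend the previous segment instead of opening a new one.
            segments[-1] = (label, segments[-1][1], end)
        else:
            segments.append((label, start, end))
    return segments

# Synthetic example: 0.1 s of silence followed by 0.1 s of a 200 Hz tone.
rate = 22000
silence = [0.0] * 2200
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(2200)]
segments = segment_by_energy(silence + tone)
```

A detector of this kind only separates high-energy from low-energy stretches; telling filled pauses apart from fluent speech would need further features, which is why this should be read as a starting point rather than as the method of [6].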

Extraction of features
After the segmentation, the following features will be extracted:
• Classical features (CF)

Automatic selection of features by Kolmogorov-Smirnov and Mann-Whitney U test
In this step, the best features are automatically selected, taking into account medical criteria with regard to a common significance level.
1. In a first step, the normality of the distributions is analyzed by means of the nonparametric Kolmogorov-Smirnov test [17].
2. Because the distributions are not normal, the automatic feature selection is performed by means of the Mann-Whitney U test, with p value < 0.1 in order to obtain a larger set for the second phase of feature selection [18].
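The second step above can be illustrated with a minimal sketch of a Mann-Whitney U filter. It uses the large-sample normal approximation without tie or continuity correction, which is a simplification; a statistics package (for example, `scipy.stats.mannwhitneyu`) would normally be preferred. The feature names, the toy values, and the function names are hypothetical, and the normality check of step 1 is not covered.

```python
from statistics import NormalDist

def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_p(a, b):
    """Two-sided p value from the normal approximation of U."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u1 - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))

def select_features(control, mci, alpha=0.1):
    """Keep features whose two group distributions differ at p < alpha.

    control and mci map a feature name to the list of per-sample values.
    """
    return [name for name in control
            if mann_whitney_p(control[name], mci[name]) < alpha]

# Toy data (hypothetical): one feature differs between groups, one does not.
control = {
    "pause_rate": [0.10, 0.12, 0.11, 0.13, 0.09, 0.14, 0.08, 0.15],
    "uninformative": [1, 4, 5, 8, 9, 12, 13, 16],
}
mci = {
    "pause_rate": [0.20, 0.22, 0.21, 0.25, 0.19, 0.24, 0.18, 0.26],
    "uninformative": [2, 3, 6, 7, 10, 11, 14, 15],
}
selected = select_features(control, mci)
```

The threshold `alpha=0.1` mirrors the deliberately permissive level stated in the text, which keeps a larger candidate set for the second selection phase.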

Automatic feature selection by WEKA
Afterward, a new selection phase is carried out in WEKA [20]:
1. In a first step, the feature selection algorithm SVMAttributeEval is used. It provides a selection by analyzing how the features work together as a group.
2. Then, for the experimentation, several feature sets with different numbers of features are created in order to develop a real-time system.
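According to the WEKA documentation, SVMAttributeEval ranks attributes by the square of the weight that an SVM assigns to them. The sketch below imitates that idea with a small linear SVM trained by subgradient descent and then cuts nested feature sets from the ranking. It is not the WEKA implementation: the training loop, its hyperparameters, the toy data, and all function names are assumptions made for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on the L2-regularised hinge loss.

    X is a list of feature vectors, y a list of labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # margin violated: hinge term is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:               # only the regulariser contributes
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def rank_by_svm_weight(X, y, names):
    """Feature names ordered by decreasing squared SVM weight."""
    w, _ = train_linear_svm(X, y)
    order = sorted(range(len(names)), key=lambda j: w[j] ** 2, reverse=True)
    return [names[j] for j in order]

def nested_feature_sets(ranking, sizes=(5, 10, 25, 50)):
    """Build P5, P10, ... by taking the top k names of a ranking."""
    return {f"P{k}": ranking[:k] for k in sizes if k <= len(ranking)}

# Toy data (hypothetical): the first feature separates the classes,
# the second is constant and carries no information.
X = [[0.0, 0.0], [0.1, 0.0], [0.9, 0.0], [1.0, 0.0]]
y = [-1, -1, 1, 1]
ranking = rank_by_svm_weight(X, y, ["pause_ratio", "constant"])
feature_sets = nested_feature_sets(ranking, sizes=(1, 2))
```

Because the weights come from a single model trained on all features together, a feature is scored in the context of the others, which matches the group-wise character of the selection described above.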

Normalization of features by WEKA
Moreover, during data preprocessing, all the features will be normalized by means of WEKA algorithms.
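The text does not state which WEKA normalization is applied. Assuming the default behaviour of the WEKA Normalize filter, which rescales each numeric attribute to the range [0, 1], the operation can be sketched as below. The feature names and values are hypothetical, and mapping constant columns to 0.0 is a choice of this sketch.

```python
def min_max_normalize(columns):
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    normalized = {}
    for name, values in columns.items():
        lo, hi = min(values), max(values)
        span = hi - lo
        normalized[name] = [(v - lo) / span if span else 0.0 for v in values]
    return normalized

features = {"pause_count": [2, 4, 6, 10], "speech_rate": [3.0, 3.0, 3.0, 3.0]}
scaled = min_max_normalize(features)
```

When cross-validation is used, computing the minimum and maximum on the training folds only avoids leaking information from the held-out samples.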

Automatic classification
In order to model the system, four classifiers will be used:
1. Support Vector Machines (SVM).

Convolutional Neural Network (CNN) with L layers of N neurons, a convolution mask of c × c, and a pool mask of p × p.

We have used the WEKA software suite [16] in order to perform the experimentation.
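To make the role of the CNN parameters L, c, and p concrete, the sketch below computes how the side length of a square feature map shrinks through the layers. It assumes an unpadded convolution with stride 1 and non-overlapping pooling, neither of which is stated in the text, and the input side of 28 is an arbitrary illustrative value.

```python
def feature_map_sides(input_side, conv_side, pool_side, num_layers):
    """Side length of the square feature map after each conv + pool layer.

    Assumes an unpadded convolution with stride 1 and non-overlapping pooling.
    """
    sides = []
    side = input_side
    for _ in range(num_layers):
        side = (side - conv_side + 1) // pool_side
        if side < 1:
            raise ValueError("input too small for this many layers")
        sides.append(side)
    return sides

# Hypothetical configuration: L = 2 layers, c = 5, p = 2, on a 28 x 28 input.
sides = feature_map_sides(input_side=28, conv_side=5, pool_side=2, num_layers=2)
```

A check of this kind shows quickly whether a given combination of L, c, and p is feasible for a given input size before any model is trained.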

System evaluation
For the evaluation of the system, three criteria will be used:
1. Classification Error Rate (CER, in %).

We have used k-fold cross-validation with k = 10 for both training and validation [20,21].

Results and discussion
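As a point of reference for the evaluation protocol, CER under k-fold cross-validation can be sketched as follows, together with the false positive rate that is reported in the results. This is a simplified illustration: the folds are contiguous and neither shuffled nor stratified, and `fit` and `predict` are hypothetical callables standing in for any of the classifiers.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def classification_error_rate(y_true, y_pred):
    """CER in percent: share of samples whose predicted label is wrong."""
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return 100.0 * errors / len(y_true)

def false_positive_rate(y_true, y_pred, positive):
    """Percent of samples outside `positive` that were predicted as `positive`."""
    negatives = [p for t, p in zip(y_true, y_pred) if t != positive]
    return 100.0 * sum(p == positive for p in negatives) / len(negatives)

def cross_validated_cer(X, y, fit, predict, k=10):
    """Pool the held-out predictions of all k folds and return their CER."""
    y_pred = [None] * len(y)
    for fold in k_fold_indices(len(y), k):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            y_pred[i] = predict(model, X[i])
    return classification_error_rate(y, y_pred)
```

For example, with true labels CR, CR, MCI, MCI and predictions CR, MCI, MCI, MCI, one of four samples is wrong, so CER is 25%, and one of the two CR samples is flagged as MCI, so the false positive rate for MCI is 50%.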

Creation of feature sets
In the experiments, the materials are about 60 speech samples for the control group (CR) and 40 speech samples for the MCI group, both belonging to the aforementioned PGA-OREKA subset (Table 1).
1. The signal is segmented into speech and disfluencies by a VAD algorithm with a minimum signal level.
2. Initially, the number of features obtained by the methodology described in Sect. 3 is about 920 (473 for speech and 447 for disfluencies) for a 22 kHz sampling frequency. The proposed set includes features of all the kinds described in Sect. 3.2 for speech and disfluencies.
3. Afterward, following a normality test, an automatic feature selection is carried out using a nonparametric Mann-Whitney U test with p value < 0.1, and about 150 features are selected (Fig. 2).
4. In the second optimization step, the WEKA attribute selection algorithm SVMAttributeEval provides about 80 features.
5. Finally, several feature sets are created with the best 5, 10, 25, and 50 features, named P5, P10, P25, and P50, respectively.

Classifiers configuration
Several classifiers have been created using the criteria in Sect. 3.5. Table 2 shows the configurations used.

Experimentation
The models described in Table 2 have been evaluated by means of the criteria in Sect. 3.7. The results are stable, promising, and balanced for all of them.
1. Figure 3 shows the global CER results of the automatic classification for both the control group (CR) and the MCI group. CER (%) is evaluated for all the classifiers in Table 2. The new approach, which integrates disfluency analysis, outperforms previous works [4] for most of the classifiers. With the developed methodology, the results are in general very satisfactory for this simple task.
2. The best results are achieved with the 25-feature set, P25, and according to the rates, SVM can be considered the optimal solution. MLP2 and CNN, for configurations 1, 2, and 4, obtain promising results with less computational load than the classical MLP. In those cases, averages of 95 and 92% are achieved. As can be seen from the selected parameters, this is due to the evaluation of features related to disfluencies and to the fact that the models are tested with important data that were not taken into consideration in the previous experiments, for example, the conversations of patients with themselves.
3. As can be seen in Fig. 3, the results obtained in the AN task for MCI are very good, especially taking into account that they are modeled with 25 features.
4. There is a clear improvement with regard to previous works, mainly due to the improved automatic feature selection and the integration of disfluency information.
5. Additionally, note that the specific weight of Thick Data could be important in this case, since it introduces the most noteworthy features into the algorithm and thus enriches the information. This hybrid strategy of using both Thick Data and CNN could be a promising option for future real systems, even with small data sets.
6. The average of CER (%) and the time needed to build the models are shown in Fig. 4. SVM achieves good results in all tasks, in particular a notable CER of 5%. The fact that the data are well characterized is very helpful, and with CNN, a CER of 8% is achieved. The model-building times are also favorable for these solutions.
7. Figure 5 shows the False Positive Rates for both groups, CR and MCI. These criteria are crucial as medical criteria in real health systems. In this case, SVM, MLP2, CNN1, and CNN2 appear as the best options.
8. Finally, the Objective Function (OF), with different weights for the system parameters set by medical criteria, is shown in Eq. 1.

Conclusions
This paper presents a novel approach for the development of a real-time support system for the diagnosis of MCI. The system is based on the automatic analysis of speech and disfluencies and on Deep Learning modeling. Following the trend of Thick Data, the multifeature modeling is based on both the automatic selection of the most relevant features by medical criteria and the automatic selection of attributes over speech and disfluencies: Mann-Whitney U test, Support Vector Machine (SVM) attribute evaluation, and Deep Learning approaches. The best approaches include Deep Learning by means of Convolutional Neural Networks (CNN) and SVM. The results are promising and open a new research line for the development of real-time health systems.