Introduction

Impaired cognition is a common manifestation among individuals suffering from dementia, of which Alzheimer’s disease (AD) is the most common form. Despite the rising dementia epidemic accompanying a rapidly aging population, there remains a paucity of cognitive assessment tools that are applicable regardless of age, education, and language/culture. Thus, there is an urgent need to identify reliable, affordable, and easy-to-use strategies for detecting signs of dementia. Starting in 2005, the Framingham Heart Study (FHS) began digitally recording all spoken responses to neuropsychological tests. These digital voice recordings allow for precise capture of all responses on verbal tests. Accurate documentation of these responses enabled a novel application of the Boston Process Approach (BPA) [1], a scoring method that emphasizes how a participant makes errors in order to differentiate between participants with similar test scores. With the emergence of speech recognition and analysis tools came the realization that these recordings, originally used for quality control, were data in and of themselves, because speaking is a complex cognitive skill. Virtually all voice recognition software requires speech-to-text transcription from which linguistic measures of speech are extracted, and manual transcription is often required to reach high levels of accuracy. This manual work is not only tedious but, to ensure high accuracy, often requires training in speech-to-text transcription as well as quality control to document transcription accuracy. Such expertise is not readily available at all locations around the globe. Developing a computational tool that can simply take a voice recording as input and automatically assess the dementia status of the individual has broad implications for dementia screening tools that are scalable across diverse populations.

Promising solutions to tackle such datasets may be found in the field of deep learning, an approach to data analysis that has increasingly been used over the past few years to address an array of formerly intractable questions in medicine. Deep learning [2], a subfield of machine learning, is based on models known as neural networks, which decompose the complexities of observed datasets into hierarchical interactions among low-level input features. Once these interactions are learned, refined, and formalized by exposure to a training dataset, fully trained deep learning models leverage their “experience” of prior data to make predictions about new cases. These approaches thus offer powerful medical decision-making potential because they can rapidly identify low-level signatures of disease from large datasets and apply them at scale. This hierarchical approach makes deep learning ideally suited to derive novel insights from high volumes of audio/voice data. Our primary objective was to develop a long short-term memory network (LSTM) model and a convolutional neural network (CNN) model to predict dementia status from the FHS voice recordings, without manual feature processing of the audio content. As a secondary objective, we processed the model-derived salient features and report the distribution of time spent by individuals on various neuropsychological tests and these tests’ relative contributions to dementia assessment.

Methods

Study population

The voice data consists of digitally recorded neuropsychological examinations administered to participants from the Framingham Heart Study (FHS), which is a community-based longitudinal observational study that was initiated in 1948 and consists of several generations of participants [3]. FHS began in 1948 with the initial recruitment of the Original Cohort (Gen 1) and added the Offspring Cohort in 1971 (Gen 2), the Omni Cohort in 1994 (OmniGen 1), a Third Generation Cohort in 2002 (Gen 3), a New Offspring Spouse Cohort in 2003 (NOS), and a Second Generation Omni Cohort in 2003 (OmniGen 2). The neuropsychological examinations consist of multiple tests that assess memory, attention, executive function, language, reasoning, visuoperceptual skills, and premorbid intelligence. Further information and details can be found in Au et al. [4], including the lists of all the neuropsychological tests included in each iteration of the FHS neuropsychological battery [5].

Cognitive status determination

The cognitive status of the participants over time was diagnosed via the FHS dementia diagnostic review panel [6, 7]. The panel consists of at least one neuropsychologist and at least one neurologist. The panel reviews neuropsychological and neurological exams, medical records, and family interviews for each participant. Selection for dementia review is based on whether participants have shown evidence of cognitive decline, as has been previously described [4].

The panel creates a cognitive timeline for each participant that provides the participant’s cognitive status on a given date over time. To label each participant’s cognitive status at the time of each recording, we selected the diagnosis date closest to the recording that fell either on or before the recording date or within 180 days after it, as sketched below. If the closest diagnosis date was more than 180 days after the recording but the participant was determined to be cognitively normal on that date, we labeled that participant as cognitively normal. Dementia diagnosis was based on criteria from the Diagnostic and Statistical Manual of Mental Disorders, fourth edition (DSM-IV), and the NINCDS-ADRDA criteria for Alzheimer’s dementia [8]. The study was approved by the Institutional Review Boards of Boston University Medical Center, and all participants provided written consent.
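For clarity, the labeling rule can be expressed as a short function. The sketch below is illustrative only; the data structures (a recording date plus a list of dated panel diagnoses) are hypothetical simplifications of the FHS timeline data.

```python
from datetime import date, timedelta

def label_recording(recording_date, diagnoses):
    """Assign a cognitive status to a recording per the rule described above.

    diagnoses: list of (diagnosis_date, status) tuples from the dementia
    review timeline, where status is 'NC', 'MCI', or 'DE'.
    Returns a status string, or None if no usable diagnosis exists.
    """
    window = timedelta(days=180)
    # Diagnoses on/before the recording, or within 180 days after it.
    eligible = [d for d in diagnoses if d[0] <= recording_date + window]
    if eligible:
        # Take the diagnosis closest in time to the recording.
        return min(eligible, key=lambda d: abs(d[0] - recording_date))[1]
    # Otherwise, a diagnosis more than 180 days later is used only if it
    # indicates the participant was cognitively normal on that date.
    later = sorted(d for d in diagnoses if d[0] > recording_date + window)
    if later and later[0][1] == 'NC':
        return 'NC'
    return None

# Example (hypothetical dates): a recording on 2010-06-01 whose nearest
# diagnosis ('MCI') is dated 2010-07-15 would be labeled 'MCI'.
print(label_recording(date(2010, 6, 1), [(date(2010, 7, 15), 'MCI')]))
```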

Digital voice recordings

FHS began to digitally record the audio of neuropsychological examinations in 2005. Overall, FHS has 9786 digital voice recordings from 5449 participants. Note that not all participants underwent dementia review, and repeat recordings were available for some participants. For this study, we selected only those participants who underwent dementia review and whose cognitive status was available (normal/MCI/dementia). The details of how participants are flagged for dementia review have been previously described [9]. In total, we obtained 1264 recordings from 656 participants, with an average duration of 73.13 (± 22.96) min (range 8.43 to 210.82 min). There are 483 recordings of participants with normal cognition (NC; 291 participants), 451 recordings of participants with mild cognitive impairment (MCI; 309 participants), 934 recordings of non-demented participants (NDE; 507 participants), and 330 recordings of participants with dementia (DE; 223 participants) (Tables 1 and S1). A single participant may have several recordings with different cognitive statuses; for example, a participant could have NC at their first recording, MCI at their second, and DE at their third. Consequently, the NC, MCI, and DE groups are not mutually exclusive, and their participant counts do not necessarily sum to the overall 656. The recordings were obtained in the WAV format and downsampled to 8 kHz.

Table 1 Demographics and participant characteristics. For each participant, digital voice recordings of neuropsychological examinations were collected. A, B and C show the demographics of the participants with normal cognition, mild cognitive impairment, and dementia, respectively, at the time of the voice recordings. Here, N represents the number of unique participants. The mean age (± standard deviation) is reported at the time of the recordings. Mean MMSE scores (± standard deviation) were computed closest to the time of the voice recording. For cognitively normal participants, ApoE data was unavailable for one Generation 1 (Gen 1) participant and eight Generation (Gen) 2 participants; MMSE data was not collected on Generation (Gen) 3 participants. For MCI participants, ApoE data was unavailable for one Gen 1 participant, six Gen 2 participants, and one New Offspring Spouse Cohort (NOS) participant; MMSE data was also not collected for OmniGen2 and NOS participants and was not available for one Gen 1 participant. For demented participants, ApoE data was unavailable for six Gen 1 participants and three Gen 2 participants; MMSE data was not collected for Gen 3, OmniGen2, and NOS participants and not available for one Gen 1 participant

We observed that the FHS participants on average spent different amounts of time completing specific neuropsychological tests, and these times varied as a function of their cognitive status for most of the tests (Fig. 1). For example, almost all participants spent the most time completing the Boston Naming Test (BNT). During this test, participants who were DE spent significantly more time (611.1 ± 260.2 s) than participants who were not demented (NDE; 390.7 ± 167.4 s). This observation also held when the time spent by participants with DE was compared specifically with those with NC (405.9 ± 176.8 s) or with those who had MCI (321.2 ± 93.8 s). However, there was no statistically significant difference between the times taken on the BNT by participants with NC and MCI. A similar pattern was observed for a few other neuropsychological tests, including “Visual Reproductions Delayed Recall,” “Verbal Paired Associates Recognition,” “Copy Clock,” “Trails A,” and “Trails B” (Table 2). We also observed no statistically significant differences in the times taken by participants with NC, MCI, DE, and NDE on a few other neuropsychological tests, including “Logical Memory Immediate Recall,” “Digit Span Forward,” “Similarities,” “Verbal Fluency (FAS),” “Finger Tapping,” “Information (WAIS-R),” and “Cookie Theft” (Table 2).

Fig. 1
figure 1

Time spent on the neuropsychological tests. Boxplots showing the time spent by the FHS participants on each neuropsychological test. For each test, the boxplots were generated for participants with normal cognition (NC), those with mild cognitive impairment (MCI), and those who had dementia (DE); the non-demented group (NDE) combined the NC and MCI individuals. The number of recordings processed to generate each boxplot is also indicated. Pairwise statistical significance was computed between groups (NC vs. MCI, MCI vs. DE, NC vs. DE, and DE vs. NDE) by evaluating the differences in mean durations with a pairwise t-test. The symbol “*” indicates statistical significance at p < 0.05, “**” indicates p < 0.01, “***” indicates p < 0.001, and “n.s.” indicates p > 0.05. Logical Memory (LM) tests with a (†) symbol denote that an alternative story prompt was administered for the test. It is possible that one participant may receive a prompt under each of the LM recall conditions (one recording). Because many neuropsychological tests were administered to the participants, we chose a representation scheme that combined colors and hatches. The colored hatches represent each individual neuropsychological test and aid visualization in subsequent figures

Table 2 Time spent on the neuropsychological tests. Average time spent (± standard deviation) by the FHS participants on each neuropsychological test is shown. For each test, average values (± standard deviation) were computed on participants with normal cognition (NC), those with mild cognitive impairment (MCI), and those who had dementia (DE); the no dementia group (NDE) combined the NC and MCI individuals. All reported time values are in minutes. Logical Memory (LM) tests with a (†) symbol denote that an alternative story prompt was administered for the test. It is possible that one participant may receive a prompt under each of the LM recall conditions (one recording)

Data processing

To preserve as much useful information as possible, Mel-frequency cepstral coefficients (MFCCs) were extracted during the data processing stage. MFCCs are the coefficients that collectively make up the Mel-frequency cepstrum, which serves as an important acoustic feature in many speech processing applications, particularly in medicine [10,11,12,13,14,15]. The nonlinear Mel scale approximates the frequency response of the human auditory system, which makes MFCCs well suited for speech-related tasks. Each FHS audio recording was first split into short frames with a window size of 25 ms (i.e., 200 sample points at the 8000-Hz sampling rate) and a stride length of 10 ms. For each frame, the periodogram estimate of the power spectrum was calculated by a 256-point discrete Fourier transform. Then, a filterbank of 26 triangular filters evenly distributed with respect to the Mel scale was applied to each frame. We then applied a discrete cosine transform to the logarithm of all filterbank energies. Note that the 26 filters yield 26 coefficients, but in practice only the 2nd–13th coefficients are considered useful; we loosely followed this convention, replacing the first coefficient with the total frame energy, which may contain helpful information about the entire frame.
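This pipeline corresponds to standard MFCC tooling. A minimal sketch using the python_speech_features package is shown below; the parameter values follow the description above, while the package choice and the file name are our own assumptions.

```python
from scipy.io import wavfile
from python_speech_features import mfcc

# Load a recording that has already been downsampled to 8 kHz
# ("recording_8khz.wav" is a hypothetical file name).
rate, signal = wavfile.read("recording_8khz.wav")

# 25-ms windows (200 samples at 8 kHz), 10-ms stride, 256-point FFT,
# 26 triangular Mel filters, 13 retained coefficients. appendEnergy=True
# replaces the first coefficient with the log of the total frame energy,
# mirroring the convention described above.
features = mfcc(signal,
                samplerate=rate,
                winlen=0.025,
                winstep=0.01,
                numcep=13,
                nfilt=26,
                nfft=256,
                appendEnergy=True)

# features has shape (num_frames, 13): one MFCC vector per 10-ms step.
print(features.shape)
```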

Hierarchical long short-term memory network model

Recurrent neural networks (RNNs) have long been used for capturing complex patterns in sequential data. Because the FHS recordings averaged more than 1 h in duration and corresponded to hundreds of thousands of MFCC vectors, a single RNN may not be able to memorize and identify patterns across such long sequences. We therefore developed a two-level hierarchical long short-term memory (LSTM) network model [16] to associate the voice recordings with dementia status (Fig. 2A). Our previous work has shown how the unique architecture of the LSTM model can tackle long sequences of data [17]. Moreover, as a popular variant in the RNN family, the LSTM is better suited than other RNN frameworks to overcoming issues such as vanishing gradients and long-term dependencies.

Fig. 2
figure 2

Schematics of the deep learning frameworks. A The hierarchical long short-term memory (LSTM) network model that encodes an entire audio file into a single vector to predict the dementia status of the individual. All LSTM cells within the same row share the parameters. Note that the hidden layer dimension is user-defined (e.g., 64 in our approach). B Convolutional neural network that uses the entire audio file as the input to predict the dementia status of the individual. Each convolutional block reduces the input length by a common factor (e.g., 2) while the very top layer aggregates all remaining vectors into one by averaging them

A 1-h recording may yield hundreds of thousands of temporally ordered MFCC vectors, while the memory capacity of the LSTM model is empirically limited to only a few thousand vectors. We first grouped every 2000 consecutive MFCC vectors into non-overlapping segments. For each segment, the low-level LSTM took 10 MFCC vectors at a time and moved on to the next 10 until the end of the segment. We then collected the last hidden state as the low-level feature vector for the segment. After processing all segments one by one, the collection of low-level feature vectors formed another sequence, which was then fed into the high-level LSTM to generate a high-level feature vector summarizing the entire recording. This hierarchical design ensured that the two-level LSTM architecture was not overwhelmed by sequences beyond its practical limit. For the last step of the LSTM scheme, a multilayer perceptron (MLP) network was used to estimate the probability of dementia based on the summary vector. The low-level and high-level LSTMs shared the same hyperparameters: the hidden state dimension was 64 and the initial states were set to zero. The MLP comprised a 64-dimensional hidden layer with a nonlinear activation function and an output layer. The output layer was followed by a softmax function to project the results onto the probability space.
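A minimal PyTorch sketch of this two-level design is shown below. The dimensions (2000-frame segments, 10 MFCC vectors per LSTM step, 64-dimensional hidden states) follow the description above; other details, such as the ReLU activation in the MLP and the class and variable names, are our own assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalLSTM(nn.Module):
    """Sketch of the two-level LSTM over the MFCC sequence of one recording."""

    def __init__(self, n_mfcc=13, hidden=64, n_classes=2,
                 seg_len=2000, step=10):
        super().__init__()
        self.seg_len, self.step = seg_len, step
        # Low-level LSTM: consumes 10 concatenated MFCC vectors per step
        # within each 2000-frame segment.
        self.low = nn.LSTM(n_mfcc * step, hidden, batch_first=True)
        # High-level LSTM: consumes one summary vector per segment.
        self.high = nn.LSTM(hidden, hidden, batch_first=True)
        # MLP head: one 64-dimensional hidden layer plus an output layer.
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))

    def forward(self, mfcc):                       # mfcc: (n_frames, n_mfcc)
        n_seg = mfcc.shape[0] // self.seg_len
        mfcc = mfcc[: n_seg * self.seg_len]
        # (n_seg, 200, 130): each row holds 10 consecutive MFCC vectors.
        segs = mfcc.reshape(n_seg, self.seg_len // self.step, -1)
        _, (h_low, _) = self.low(segs)             # (1, n_seg, hidden)
        seg_feats = h_low.squeeze(0).unsqueeze(0)  # (1, n_seg, hidden)
        _, (h_high, _) = self.high(seg_feats)      # (1, 1, hidden)
        logits = self.head(h_high.squeeze(0))      # (1, n_classes)
        return torch.softmax(logits, dim=-1)       # class probabilities

# Example: a 1-h recording yields roughly 360,000 MFCC frames.
model = HierarchicalLSTM()
probs = model(torch.randn(360_000, 13))
```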

Convolutional neural network model

For comparison with the LSTM model, we designed a one-dimensional convolutional neural network (CNN) model for dementia classification (Fig. 2B). The stem of the CNN model consisted of 7 stacked convolutional blocks. Within each block, there were 2 convolutional layers, 1 max-pooling layer, and 1 nonlinearity. All convolutional layers had a filter size of 3, a stride of 1, and a padding of 1. For the pooling layers in the first 6 blocks, we used max pooling with a filter size of 4 and a stride of 4. The last layer used global average pooling to handle audio recordings of variable length: global average pooling transforms the information into a fixed-length feature vector, making classification straightforward. The CNN stem was then followed by a linear classifier comprising 1 convolutional layer and 1 softmax layer. Note that the filter size of this convolutional layer was set to the exact size of the output feature vector.
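The sketch below gives a PyTorch rendering consistent with this description. The block structure, kernel/stride/padding sizes, and pooling factors follow the text; the 64-channel width, the ReLU nonlinearity, the treatment of the 13 MFCC coefficients as input channels, and the interpretation of the classifier as a per-time-step convolution applied before global average pooling (consistent with the saliency description below) are our own assumptions.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, pool):
    """Two 3-wide convolutions, a nonlinearity, and optional 4x max pooling."""
    layers = [nn.Conv1d(c_in, c_out, kernel_size=3, stride=1, padding=1),
              nn.Conv1d(c_out, c_out, kernel_size=3, stride=1, padding=1),
              nn.ReLU()]
    if pool:
        layers.append(nn.MaxPool1d(kernel_size=4, stride=4))
    return nn.Sequential(*layers)

class AudioCNN(nn.Module):
    """Sketch of the 1-D CNN over the MFCC sequence of an entire recording."""

    def __init__(self, n_mfcc=13, width=64, n_classes=2):
        super().__init__()
        blocks = [conv_block(n_mfcc, width, pool=True)]
        blocks += [conv_block(width, width, pool=True) for _ in range(5)]
        blocks += [conv_block(width, width, pool=False)]      # 7th block
        self.stem = nn.Sequential(*blocks)
        # Linear classifier applied at every remaining time step.
        self.classifier = nn.Conv1d(width, n_classes, kernel_size=1)

    def forward(self, mfcc):              # mfcc: (batch, n_mfcc, n_frames)
        feats = self.stem(mfcc)           # (batch, width, T')
        scores = self.classifier(feats)   # per-step DE[-]/DE[+] scores
        pooled = scores.mean(dim=-1)      # global average pooling over time
        return torch.softmax(pooled, dim=-1), scores

# Example: one recording with ~360,000 MFCC frames (about 1 h of audio).
model = AudioCNN()
probs, score_map = model(torch.randn(1, 13, 360_000))
```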

Saliency maps

We derived the saliency maps from the intermediate result immediately before the global average pooling step of the CNN model. The intermediate result was composed of two vectors, signifying DE[+] and DE[−] prediction, respectively. For simplicity, we only used the DE[+] vector. Because the recording-level prediction was determined by the average of the saliency map, which preserves temporal structure, we could observe finer aspects of the prediction by examining the values assigned to each short period. For our CNN model settings, each value corresponded to roughly 2 and a half minutes of an original recording. Note that the length of these periods is implied by the CNN model parameters; altering the stride size, the kernel size, the number of stacked CNN blocks, etc. may result in different lengths. To align the saliency map with the original recording for further analysis, we extended its size to the total number of seconds via nearest neighbor interpolation.
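Continuing the CNN sketch above, the DE[+] channel of the per-step score map can be expanded to one value per second of audio. The function below is a sketch; the channel ordering (index 1 as DE[+]) and the use of a simple index-mapping form of nearest-neighbor interpolation are assumptions.

```python
import numpy as np

def per_second_saliency(score_map, total_seconds):
    """Expand the DE[+] score map to one saliency value per second of audio.

    score_map: (1, 2, T') tensor returned by the AudioCNN sketch above,
               where channel 1 is assumed to be the DE[+] channel.
    """
    de_pos = score_map[0, 1].detach().cpu().numpy()         # (T',)
    # Map each second of the recording onto its nearest score-map entry.
    idx = np.floor(np.linspace(0, len(de_pos), total_seconds,
                               endpoint=False)).astype(int)
    return de_pos[idx]                                       # (total_seconds,)
```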

Salient administered fractions

The CNN model tasked with classifying between participants with normal cognition and dementia was tested on 81 participants who had transcripts for 123 of their 183 total recordings. The 60 recordings without transcripts were excluded from the saliency analysis but were included in the test dataset. Of the recordings with transcripts, there were 44 recordings of participants with normal cognition and 79 recordings of participants with dementia. The transcribed recordings were divided into subgroups based on the time spent by the participant on each neuropsychological test. As a result, for each second of a given recording, the DE[+] saliency value and the current neuropsychological test are known. To calculate the salient administered fraction (SAF) for a given neuropsychological test, we counted the number of seconds of that test that also had a DE[+] saliency value greater than zero and divided it by the total number of seconds of that test. For example, if the Boston Naming Test (BNT) had 90 s with DE[+] saliency values greater than zero and was administered for 100 s, the SAF[+] would be 0.90. SAF[−] was produced in the same way, except for periods of time where the DE[+] saliency value was less than or equal to zero. The SAF[+] was calculated for every administered neuropsychological test in each recording of a demented participant whom the model classified as demented (true positive). The SAF[−] was calculated in the same way, except for recordings of cognitively normal participants whom the model classified as cognitively normal (true negative).
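A short sketch of the SAF computation for a single recording is shown below, assuming the per-second saliency values from the previous sketch and a parallel, hypothetical list naming the test administered during each second (derived from the transcripts).

```python
import numpy as np

def salient_administered_fractions(saliency_per_sec, test_per_sec):
    """Compute SAF[+] and SAF[-] per neuropsychological test for one recording.

    saliency_per_sec: numpy array of DE[+] saliency values, one per second.
    test_per_sec: parallel list naming the test administered in each second
                  (e.g., 'BNT'), derived from the manual transcripts.
    """
    saf_pos, saf_neg = {}, {}
    for test in set(test_per_sec):
        mask = np.array([t == test for t in test_per_sec])
        total = mask.sum()                        # seconds spent on this test
        saf_pos[test] = (saliency_per_sec[mask] > 0).sum() / total
        saf_neg[test] = (saliency_per_sec[mask] <= 0).sum() / total
    return saf_pos, saf_neg

# Example from the text: 90 DE[+] seconds out of 100 s of BNT -> SAF[+] = 0.90.
```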

Data organization, model performance, and statistical analysis

The dataset for this study includes digital voice recordings from September 2005 to March 2020 from the subset of FHS participants who came to dementia review. The models were implemented using PyTorch 1.4 and constructed on a workstation with a GeForce RTX 2080 Ti graphics processing unit. The Adam optimizer with learning rate = 1e−4 and betas = (0.99, 0.999) was used to train both the LSTM and the CNN models. A portion of the participants, along with their recordings, was kept aside for independent model testing (Figure S1). Note that manual transcriptions were available for these cases, which allowed us to perform the saliency analysis. Using the remaining data, the models were trained using 5-fold cross-validation. We split the data at the participant level, so all of a given participant’s recordings were assigned to the same fold. We acknowledge that this does not split the data into exactly even folds, because participants may have different numbers of recordings; however, each participant generally had a similar distribution of recordings, which mitigated this effect.
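Participant-level splitting of this kind corresponds to grouped cross-validation; a minimal sketch using scikit-learn's GroupKFold with toy, hypothetical data is shown below.

```python
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: one entry per recording.
recordings = [f"rec_{i}.wav" for i in range(10)]
labels = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]            # 1 = dementia
participant_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]   # repeat recordings share an ID

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(
        gkf.split(recordings, labels, groups=participant_ids)):
    # All recordings of a given participant land on one side of the split,
    # so no participant appears in both the training and validation folds.
    train_files = [recordings[i] for i in train_idx]
    val_files = [recordings[i] for i in val_idx]
    print(fold, len(train_files), len(val_files))
    # ... train the LSTM/CNN here with Adam(lr=1e-4, betas=(0.99, 0.999)) ...
```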

We also tested model performance on audio segments of shorter lengths. From the test data, we randomly extracted 5-min, 10-min, and 15-min segments from the participants’ recordings and grouped them based on audio length. Both the LSTM and CNN models trained on the full audio recordings were used to predict on these short audio segments. Note that only one segment (5, 10, or 15 min) was extracted, with replacement, per recording in each round of testing. This process was repeated five times and the results were reported.
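A sketch of the short-segment extraction is given below, assuming the 10-ms MFCC stride (100 frames per second) established earlier; the function name and random number handling are our own.

```python
import numpy as np

FRAMES_PER_SECOND = 100   # one MFCC frame every 10 ms

def random_segment(mfcc, minutes, rng=None):
    """Randomly extract one contiguous window of MFCC frames from a recording."""
    rng = rng or np.random.default_rng()
    n_frames = int(minutes * 60 * FRAMES_PER_SECOND)
    if mfcc.shape[0] <= n_frames:
        return mfcc                    # recording is shorter than the window
    start = rng.integers(0, mfcc.shape[0] - n_frames)
    return mfcc[start:start + n_frames]

# Example: a 10-min segment drawn from a ~73-min recording (438,000 frames).
segment = random_segment(np.random.randn(438_000, 13), minutes=10)
```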

We evaluated the differences in mean durations across the three cognitive statuses using pairwise t-tests. The performance of the machine learning models is presented as the mean and standard deviation over the model runs. We generated receiver operating characteristic (ROC) and precision-recall (PR) curves based on the cross-validated model predictions and computed the area under the curve (AUC) for each. Additionally, we computed model accuracy, balanced accuracy, sensitivity, specificity, F1 score, weighted F1 score, and Matthews correlation coefficient on the test data.
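All of the reported metrics are available in scikit-learn; the sketch below uses a handful of hypothetical predictions for illustration.

```python
import numpy as np
from sklearn import metrics

# Hypothetical model outputs for a few test recordings.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # 1 = dementia
y_prob = np.array([0.2, 0.4, 0.9, 0.7, 0.3, 0.1, 0.8, 0.6])    # P(dementia)
y_pred = (y_prob >= 0.5).astype(int)

roc_auc = metrics.roc_auc_score(y_true, y_prob)             # area under ROC
pr_auc = metrics.average_precision_score(y_true, y_prob)    # PR AUC (average precision)
accuracy = metrics.accuracy_score(y_true, y_pred)
balanced_acc = metrics.balanced_accuracy_score(y_true, y_pred)
f1 = metrics.f1_score(y_true, y_pred)
weighted_f1 = metrics.f1_score(y_true, y_pred, average="weighted")
mcc = metrics.matthews_corrcoef(y_true, y_pred)
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
sensitivity, specificity = tp / (tp + fn), tn / (tn + fp)
```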

Results

Both the LSTM and the CNN models that were trained and validated on the FHS voice recordings demonstrated consistent performance across the different data splits used for 5-fold cross-validation (Fig. 3). For the classification of demented versus normal participants, the LSTM model took 20 min to fully train (8 epochs, batch size of 4) and 14 s to predict on a test case; the CNN model took 106 min to fully train (32 epochs, batch size of 4) and 13 s to predict on a test case. For the classification of demented versus non-demented participants, the LSTM model took 187 min to fully train (32 epochs, batch size of 4) and 20 s to predict on a test case; the CNN model took 427 min to fully train (64 epochs, batch size of 4) and 20 s to predict on a test case. In general, we observed that the CNN model tested on the full recordings performed better on most metrics for the classification problem distinguishing participants with DE from those with NC (Table 3, Figure S); the only exception was that the LSTM model’s mean specificity was higher than that of the CNN model. For the classification problem distinguishing participants with DE from those who were NDE, both models performed evenly (Table 3, Figure S3): the LSTM model’s sensitivity was higher, whereas the CNN model’s specificity was higher. Interestingly, in some cases, model performance on the shorter segments exceeded that on the full audio recordings. For the classification problem distinguishing participants with DE from those with NC, we found improved LSTM model performance on both the 10- and 15-min recordings based on a few metrics. For the same classification problem, the CNN model’s performance on the full audio recordings was mostly the highest, with the 5-min segments’ mean sensitivity being the exception. For the classification problem distinguishing participants with DE from those who were NDE, the LSTM model’s performance on the full recordings was the highest on all metrics. For the same classification problem, the CNN model’s performance on the full audio recordings was mostly the highest, with the 5-min segments’ mean sensitivity and the 10-min mean F1 score being the exceptions.

Fig. 3
figure 3

Receiver operating characteristic (ROC) and precision-recall (PR) curves of the deep learning models. The long short-term memory (LSTM) network and the convolutional neural network (CNN) models were constructed to classify participants with normal cognition and dementia as well as participants who are non-demented and the ones with dementia, respectively. On each model, a 5-fold cross-validation was performed and the model predictions (mean ± standard deviation) were generated on the test data (see Figure S1), followed by the creation of the ROC and PR curves. Plots A and B denote the ROC and PR curves for the LSTM and the CNN models for the classification of normal versus demented cases. Plots C and D denote the ROC and PR curves for the LSTM and CNN models for the classification of non-demented versus demented cases

Table 3 Performance of the deep learning models. The long short-term memory (LSTM) network and the convolutional neural network (CNN) models were constructed to classify participants with normal cognition and dementia as well as participants who are non-demented and the ones with dementia, respectively. On each model, a 5-fold cross-validation was performed and the model predictions (mean ± standard deviation) were generated on the test data (see Figure S1). A and B report the performances of the LSTM and the CNN models for the classification of participants with normal cognition versus those with dementia. C and D report the performances of the LSTM and the CNN models for the classification of participants who are non-demented versus those who have dementia

We computed the average (± standard deviation) of SAF[+] and SAF[−] derived from the CNN model (Table 4). The positive SAFs were calculated for all true positive recordings and the negative SAFs were calculated for all true negative recordings. For example, the SAF[+] for the “Verbal paired associates recognition” test was 0.88 ± 0.26. This indicates that on average 88% of the duration of the “Verbal paired associates recognition” tests administered in true positive recordings also intersected with segments of time that the model found to be DE[+] salient. Also, the SAF[−] for the “Verbal paired associates recognition” test was 0.32 ± 0.42. This indicates that on average 32% of the duration of the “Verbal paired associates recognition” tests administered in true negative recordings also intersected with segments of time that the model found to not be DE[+] salient. On the other hand, the SAF[+] for the “Command clock” test was 0.39 ± 0.45, indicating that only about 39% of the duration of the “Command clock” tests administered in true positive recordings also intersected with segments of time that the model found to be DE[+] salient. Also, the SAF[−] for the “Command clock” test was 0.76 ± 0.37. The rest of the average positive SAFs and average negative SAFs were also reported for the remaining neuropsychological tests as well as the number of true positive or true negative recordings that contained an administration of the given test (Table 4). We also computed average SAF[+] and SAF[−] for additional neuropsychological tests (Table S2), which were set aside due to the low number of samples. A schematic of the DE[+] saliency vectors used to generate SAF[+] and SAF[−] can be seen in Fig. 4. Each value in the saliency vector represented approximately 2 min and 30 s of a recording. Since the saliency vector covers the entire recording, each second of every neuropsychological test within a recording can be assigned a saliency vector value, which was then used to calculate SAF[+] and SAF[−].

Table 4 Salient administered fractions derived from the CNN model. The average salient administered fraction (SAF) and standard deviation for true positive (SAF[+]) and true negative (SAF[−]) cases are listed in descending order based on the SAF[+] value. SAF[+] is calculated by summing up the time spent in a given neuropsychological test that intersects with a segment of time that is DE[+] salient and dividing by the total time spent in a given neuropsychological test. SAF[−] is calculated by summing up the time spent in a given neuropsychological test that intersects with a segment of time that is not DE[+] salient and dividing by the total time spent in a given neuropsychological test. The number of samples for SAF[+] and SAF[−] indicate the number of true positive and true negative recordings that contain each neuropsychological test
Fig. 4
figure 4

Saliency maps highlighted by the CNN model. A This key is a representation that maps the colored hatches to the neuropsychological tests. B Saliency map representing a recording (62 min in duration) of a participant with normal cognition (NC) that was classified as NC by the convolutional neural network (CNN) model. C Saliency map representing a recording (94 min in duration) of a participant with dementia (DE) who was classified with dementia by the CNN model. For both B and C, the colormap on the left half corresponds to a neuropsychological test. The color on the right half represents the DE[+] value, ranging from dark blue (low DE[+]) to dark red (high DE[+]). Each DE[+] rectangle represents roughly 2 min and 30 s

Discussion

Cognitive performance is affected by numerous factors that are independent of actual underlying cognitive status. Physical impairments (vision, hearing), mood (depression, anxiety), low education, cultural bias, and test-specific characteristics are just a few of the many variables that can lead to variable performance. Furthermore, neuropsychological tests that claim to test the same specific cognitive domains (e.g., memory, executive function, language) do not do so in a uniform manner. For example, when testing verbal memory, a paragraph recall test taps into a different set of underlying cognitive processes than a list learning task. Also, the presumption that a test measures only a single cognitive domain is simplistic, since every neuropsychological test involves multiple cognitive domains. Thus, across many factors, there is significant heterogeneity in cognitive performance that makes it difficult to differentiate between those who will become demented and those who will not, especially at the preclinical stage. In particular, traditional paper-and-pencil neuropsychological tests may not be sufficiently sensitive to detect when subtle change begins. While cognitive complaints serve as a surrogate preclinical measure of decline, there is an inherent bias in self-report. Given this complex landscape, digital voice recordings of neuropsychological tests provide a data source that is relatively independent of those limitations. To our knowledge, our study is the first to demonstrate that a continuous stream of such data is amenable to automated analysis for the evaluation of cognitive status.

Digital health technologies in general, and voice in particular, are increasingly being evaluated as potential screening tools for depression [18,19,20,21] and various neurodegenerative diseases such as Parkinson’s disease [22,23,24,25]. Recently, potential opportunities for developing digital biomarkers for AD based on mobile devices and wearables were outlined [26, 27]. Our study is unique in focusing on two deep learning methods that rely on a hands-free approach for processing voice recordings to predict dementia status. The advantage of our approach is threefold. First, there is limited need to extensively process the voice recordings before sending them as inputs to the neural networks. This is a major advantage because it minimizes the burden of generating accurate transcriptions and/or handcrafted features, which generally take time to develop and rely on experts who are not readily available. This aspect places us in a unique position compared to previously published work that depended on derived measures [28, 29]. Second, our approach can process audio recordings of variable lengths, which means that one does not have to format or select a specific window of the audio recording for analysis. This important strength underscores the generalizability of our work, because one can process voice recordings containing various combinations of neuropsychological tests that are not bounded within a time frame. Finally, our approach allows for the identification of audio segments that are highly correlated with the outcome of interest. This provides a “window” into the machine learning black box; we can go back to the recordings, identify the speech patterns or segments of the neuropsychological tests that point to a high probability of disease risk, and understand their contextual significance.

The CNN architecture allowed us to generate the saliency vectors by utilizing the parameters of the final classification layer. Simply put, a temporal saliency vector for each specific case could be obtained by calculating the weighted sum of the output feature vectors from the last convolutional block in the CNN, indicating how each portion of the recording contributed to either a positive or a negative prediction. We then aligned the saliency vectors with the recording timeline to further analyze the speech signatures and determine whether any snippets of the neuropsychological testing were frequently correlated with the output class label. From the transcriptions that exist for a portion of the dataset, we were able to identify which neuropsychological tests were occurring at any given time in a recording and then calculate the positive SAFs. This suggests that, during the tests found in these salient segments, the participants’ spoken responses may carry a signal related to their cognitive status. For example, the SAF[+] is high for the “Verbal paired associates recognition” test, which could mean that the participants’ audio signals during this test highly influenced the model’s prediction. This result could also imply that most participants diagnosed with dementia may have had explicit episodic memory deficits. The exact connections between the voice signals in the segments identified by the saliency maps and their clinical relevance are worth exploring in the future.

All the FHS voice recordings contained audio from two distinct speakers: the interviewee (participant) and the interviewer (examiner). We did not attempt to discern speaker-specific audio content, as our models processed the entire audio recording at once. This choice was intentional because we wanted to first evaluate whether deep learning can predict the dementia status of the participant without detailed feature engineering on the audio recordings. Future work could focus on processing these signals and recognizing speaker-related differences and interactions, periods of silence, and other nuances to make the audio recordings more amenable to deep learning. Additional studies could also integrate the audio data with other routinely collected information that requires no additional processing (e.g., demographics) to augment model performance. An important point to note is that such studies need to be conducted with the goal of creating scalable solutions across the globe, particularly for regions where technical or advanced clinical expertise is not readily available. This means that users at the point of care may not be in a position to manually process voice recordings or other data to derive the features needed by computer models. Since our deep learning models do not require data preprocessing or handcrafted features, our approach can serve as a potential screening tool without the need for expert-level input. Our current findings serve as a first step towards such a solution with broader outreach.

Study limitations

The models were developed using the FHS voice recordings, which come from a single population cohort in the New England area of the United States. Despite demonstrating consistent model performance using a rigorous 5-fold cross-validation approach, our models still need to be validated on data from external cohorts to confirm their generalizability. Currently, we do not have access to any other cohort with voice recordings of neuropsychological exams. Therefore, our study findings need to be interpreted in light of this limitation and with the expectation of further evaluation in the future. Due to the lack of available data in some cases, and because not all participants took all types of neuropsychological tests, we were able to generate the distribution of time spent for only a portion of the tests. It must also be noted that the number of NC, MCI, DE, and NDE participants varied for each neuropsychological test. Additionally, there may be outside factors affecting the amount of time it took a participant to complete a neuropsychological test that are not represented in the boxplots. For example, a normal participant could finish the BNT quickly and perform well, whereas administration of the BNT to a participant with dementia could be stopped abruptly due to an inability to complete the test or for other reasons. Therefore, the amount of time spent administering the BNT in those two cases could be similar, but for different reasons. Nevertheless, statistical tests were performed to quantify the pairwise differences on all the available neuropsychological exams, which allowed us to report which differences were statistically significant and which were not. Also, while it is possible that the interviewer’s behavior can influence the interviewee’s responses, we note that all the interviewers are professionally trained to administer the neuropsychological tests uniformly. Finally, we must note that some participants who were included as NC at baseline assessment showed subtle changes in cognition sufficient to warrant a dementia review, and this may have affected model performance.

Conclusions

Our proposed deep learning approaches (LSTM and CNN), which process voice recordings in an automated fashion, allowed us to classify the dementia status of FHS participants. Such approaches, which rely minimally on neuropsychological expertise, audio transcription, or manual feature engineering, can pave the way towards real-time screening tools in dementia care, especially in resource-limited settings.