Search results and study characteristics
The search yielded a total of 4716 articles combined from PubMed (1954–2020), Web of Science (1934–2020), and PsycINFO/EBSCO (1934–2020). After removing duplicates, 2941 articles remained, for which title and abstract were screened. Additionally, references of included articles were screened, and three additional articles that met the inclusion criteria were identified (Chapman et al. 2006; Graville and Rau 1991; MacDonald et al. 2001), for a total of 2944 articles screened for eligibility. Of these, 2895 articles were excluded as they did not pertain to the topic or did not meet inclusion criteria. Full-text screening was conducted, and inclusion and exclusion criteria were applied, for the remaining 49 articles. Of these, 41 articles were excluded, with good inter-rater agreement (κ = 0.81). The reasons for exclusion are highlighted in Fig. 1. The most common reason for exclusion was ‘Outcome not relevant’, with most studies being excluded as they investigated spontaneous or picture-elicited discourse production or verbatim recall of text. Finally, a total of eight articles, which aimed to measure discourse comprehension at a macrolevel in adults with Alzheimer’s disease or MCI, were included in the review.
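The screening counts reported above can be checked with simple arithmetic. The following is a minimal sketch that uses only the numbers given in the text; the variable names are illustrative and are not taken from the review.

```python
# Counts as reported in the search results paragraph.
records_identified = 4716
records_after_dedup = 2941
added_from_reference_lists = 3
excluded_at_title_abstract = 2895
excluded_at_full_text = 41

# Derived quantities at each step of the screening flow.
duplicates_removed = records_identified - records_after_dedup
screened = records_after_dedup + added_from_reference_lists
full_text_assessed = screened - excluded_at_title_abstract
included = full_text_assessed - excluded_at_full_text

print(duplicates_removed, screened, full_text_assessed, included)
```

The derived values for screened, full-text assessed, and included articles match the 2944, 49, and 8 stated in the text; the number of duplicates removed is implied by the text rather than stated in it.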
An overview of the study characteristics is presented in Table 1. All the studies were cross-sectional and compared AD and/or MCI groups to cognitively healthy older adults. Seven of the eight studies were conducted with native English speakers, six of them in the USA and one in Canada. One study was conducted in Brazil, with a native Brazilian Portuguese-speaking population. The studies were published between 1998 and 2019. One study included two groups of healthy older adults, classified as ‘young-older adults’ (65–80 years) and ‘old-older adults’ (> 80 years) (Chapman et al. 2006), and one study (Welland et al. 2002) included two AD groups: early stage (EDAT) and moderate stage (MDAT). The total sample sizes ranged from 20 to 84 participants, with mean ages ranging from 65 to 86. All studies controlled for age, and all but one (Chapman et al. 2002) controlled for education; the groups were either matched on these variables or the variables were entered as covariates during analysis. In addition, six studies controlled for sex (Chapman et al. 1998, 2002, 2006; Creamer and Schmitter-Edgecombe 2010; Drummond et al. 2019; Schmitter-Edgecombe and Creamer 2010), one study controlled for depression (Chapman et al. 2006), and one study controlled for IQ (Welland et al. 2002). All studies determined the cognitive status of the healthy control group using one or more of the following measures: MMSE, self-report, Clinical Dementia Rating (CDR), and Global Deterioration Scale (GDS).
Only one study (Drummond et al. 2019) used a test from a standardized battery (MAC battery) (Fonseca et al. 2008), and one (Welland et al. 2002) used a modified form of the Discourse Comprehension Test (DCT) battery (Brookshire and Nicholas 1993) to measure discourse comprehension. In other studies, an experimental task was used to measure discourse comprehension, wherein participants were presented with a series of short texts, usually narrative stories. This was generally followed by a variety of tasks designed to test participants’ comprehension of the texts. This involved giving a short summary of the story, stating the lesson or intended main idea of the story, answering true/false questions about the story, a think-aloud paradigm while reading, or reading out loud the last word in the story, which was either congruent or incongruent with previous text. With one exception (Welland et al. 2002), the studies did not report independently on hearing and visual/reading abilities of participants. However, they generally included practice trials before the start of the study to ensure participants understood the task, and were able to perform it successfully. Almost all of the included studies looked at performance of participants on one or more neuropsychological tests (for example, subtests of Boston Diagnostic Aphasia Examination) to ensure that participants were able to follow instructions, in order to be able to perform the task. The outcome measures varied across studies, with some studies measuring the proportion of inferential and non-inferential clauses produced (Creamer and Schmitter-Edgecombe 2010; Schmitter-Edgecombe and Creamer 2010), one study measuring naming latencies for congruent and incongruent pronouns (Almor et al. 2001), and others measuring gist-level retelling in the form of summary, lesson, main ideas (Chapman et al. 1998, 2002, 2006; Welland et al. 2002). 
Due to this heterogeneity in tasks and reported outcome measures, a meta-analysis was not performed.
One study (Drummond et al. 2019) used the Diagnostic and Statistical Manual of Mental Disorders: Fifth Edition (DSM-5) criteria for Major Neurocognitive Disorder due to Alzheimer’s Disease (Sachdev et al. 2014) for diagnosis of AD. All other studies used the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) criteria (McKhann et al. 1984). In all studies, a diagnosis of ‘probable AD’ was applied, wherein individuals are diagnosed based on clinical and neuropsychological evidence without histopathologic confirmation. As these were cross-sectional studies, they could not follow up to confirm AD via autopsy. Additionally, all but one of the studies were conducted prior to 2011, when the NINCDS-ADRDA criteria were first revised as the National Institute on Aging-Alzheimer’s Association (NIA-AA) criteria to include biomarker evidence in the diagnosis of AD (McKhann et al. 2011). The DSM-5 criteria, which were used in the study by Drummond et al. (2019), do not yet include biomarker evidence in the diagnosis of Major Neurocognitive Disorder due to AD. The major difference between the NINCDS-ADRDA and the DSM-5 criteria is that the presence of memory impairment is not required for diagnosis in DSM-5; rather, impairment in any two cognitive domains is acceptable. This shows a general trend of moving away from memory impairment, as is also seen in the NIA-AA 2011 criteria, which were a revision of the NINCDS-ADRDA criteria. For determining the stage of AD (mild, moderate, severe), studies used either the MMSE or the CDR scale (Folstein et al. 1975; Hughes et al. 1982). These two scales have been shown to have good agreement for the stages of AD that have been investigated in the included studies (Perneczky et al. 2006). Overall, although two different sets of criteria were used for the diagnosis of AD, they were comparable enough that a qualitative synthesis of studies was possible.
For a diagnosis of MCI, one study (Chapman et al. 2002) used the criteria by Petersen et al. (1999); another study (Schmitter-Edgecombe and Creamer 2010) applied the criteria by Petersen et al. (2001). The studies also ruled out other possible causes of cognitive impairment (such as stroke or other neurological or psychological causes) via a series of tests. As with the diagnostic criteria for AD, the criteria for MCI have evolved to shift focus away from memory complaints, towards a more holistic approach that includes all cognitive domains. While the Petersen et al. (1999) criteria required a subjective memory complaint, the revised criteria from 2001 onwards allowed for complaints in any cognitive domain. Instead of requiring a memory complaint, the Petersen et al. (2001) criteria focused on classifying MCI into several subtypes (e.g. amnestic MCI, multi-domain MCI), depending on the cognitive domain(s) in which deficits were observed. Accordingly, studies included in the review that were conducted after the Petersen et al. (2001) criteria were established specifically included participants with a diagnosis of amnestic MCI (aMCI). Finally, one study (Drummond et al. 2019) applied the Winblad et al. (2004) criteria, which were a revision of the Petersen et al. (2001) criteria. This revision acknowledges that there may be multiple aetiologies for each subtype of MCI, and modifies the stipulation concerning normal daily functioning in previous criteria to allow for subtle impairment in complex functions. Although different, evolving diagnostic criteria were used in the included studies, the criteria are not different enough to affect a qualitative synthesis of these studies.
Measures of discourse comprehension
Due to a lack of standardized tests for measuring discourse comprehension, there was considerable variability in the method used to evaluate comprehension, and consequently in the type of outcome measures used. Most measures used some form of language production to measure comprehension. This implies a general problem which poses a dilemma for comprehension studies in other contexts as well (e.g. language acquisition, pedagogy). We know from studies on language production that patients with AD have deficits in accessing lexical units, though deficits at the morphological and syntactical level are less pronounced. These deficits could affect the validity of the measures for language comprehension.
Relevant outcome measures used in each study were identified. Several of the identified outcome measures were used in multiple studies, and these were grouped together. The names of the outcome measures were derived from the outcomes used in the included studies. However, the terms for certain measures were used interchangeably in the different studies. Therefore, to summarize the results from different studies, the measures were categorized according to the definitions or descriptions of the measure presented in the studies, rather than the terms used. Accordingly, the measures were grouped into the six variables described below. The results for each measure are summarized in Table 2.
Naming latency was used as an outcome in only one of the studies (Almor et al. 2001). In this study, participants were presented with a short text in an auditory format, in which two entities (antecedents) were introduced in the first sentence. The final sentence referred back to these entities: it mentioned one of the entities and was left incomplete before the other entity was mentioned. Finally, the target pronoun was presented visually; it was either congruent or incongruent with the incomplete sentence, based on the singularity or plurality of the antecedent and the pronoun. Participants were to read the pronoun aloud, and their response time was measured. Ideally, when the pronoun is incongruent with the antecedent, response time should be longer than when it is congruent, as it would be more difficult to integrate an incongruent word into the passage; a longer response time therefore indicates adequate processing of cohesive devices. This effect would, however, only be seen if individuals are able to integrate different information units within a macrostructure, indicating the ability to establish coherence relations. Slower reaction times for incongruent trials were seen in healthy older adults as well as in the group with AD. However, the size of the effect was much smaller in the AD group than in the healthy older adults, meaning that the difference in reaction times between congruent and incongruent trials was much larger in the controls than in the AD population, as was expected. This shows that AD patients were less sensitive to incongruent pronouns, indicating a problem in integrating and connecting the presented information.
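The congruency effect described above is a difference score: mean naming latency on incongruent trials minus mean naming latency on congruent trials. The following is a minimal sketch of that computation. The latencies are invented for illustration only and are not data from Almor et al. (2001); the function name `congruency_effect` is likewise hypothetical.

```python
from statistics import mean


def congruency_effect(congruent_ms, incongruent_ms):
    """Mean incongruent latency minus mean congruent latency, in ms."""
    return mean(incongruent_ms) - mean(congruent_ms)


# Hypothetical naming latencies (ms), invented for illustration only.
control_congruent = [600, 620, 610]
control_incongruent = [700, 720, 710]
ad_congruent = [800, 820, 810]
ad_incongruent = [830, 850, 840]

control_effect = congruency_effect(control_congruent, control_incongruent)
ad_effect = congruency_effect(ad_congruent, ad_incongruent)

# Both groups are slower on incongruent trials, but the slowdown is
# smaller in the hypothetical AD group, mirroring the reported pattern.
print(control_effect, ad_effect)
```

With these made-up values, both effects are positive and the control effect is the larger of the two, which is the qualitative pattern the study reported.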
In four studies (Chapman et al. 2006, 1998, 2002; Drummond et al. 2019), participants were presented with a short story. Following this, participants were asked to retell the story or give a summary in their own words, which involved focusing on important units of information that are required for an overall understanding of the story, and omitting unnecessary details. Participants’ performance was scored according to the number of main informational and/or thematic units produced. This measure illustrates the extent to which language production was taken as a measure of comprehension. The linguistic output was not analysed with respect to relevant features of language production (time course, lexical choice, or number of words per sentence), but only at the level of meaning in relation to the stimulus text. AD groups produced fewer synthesized meaningful units of information than cognitively healthy adults, including the old-older adults, in all four studies. In both studies with an MCI population (Chapman et al. 2002; Drummond et al. 2019), the MCI group performed significantly worse than the healthy older adults. Between the AD and MCI groups, the AD group scored significantly lower than the MCI group in one study (Drummond et al. 2019); however, the performance of the two groups was comparable in another study (Chapman et al. 2002). Additionally, there was a small but significant difference in the performance of old-older adults compared to young-older adults. This was the only measure for which such a difference was observed.
Another probe following the presentation of a short story, employed in four studies (Chapman et al. 2006, 1998, 2002; Drummond et al. 2019), was the lesson or message probe, wherein participants were to formulate a lesson or a title that could be inferred from the story. AD and MCI patients scored significantly lower than healthy adults, focusing on unimportant details from the story rather than an overall lesson. Additionally, the AD group performed significantly worse than old-older adults. When performances of MCI and AD groups were compared, the results were mixed, wherein one study (Drummond et al. 2019) reported no significant difference in their performance, whereas another study (Chapman et al. 2002) reported that the AD group scored significantly lower than the MCI group. This measure required maximum inferential processing, as participants need to be able to synthesize a large amount of information, condense it, and make interpretations about what message it carries.
The main idea probe, also administered following a short story in three of the studies (Chapman et al. 2006, 1998, 2002), measured the ability of participants to summarize the story in one sentence, i.e. to state the primary concept of the story, which required substantial condensation of information and abstraction into one generalized idea. Both AD and MCI groups performed significantly worse than the control group. Furthermore, a significant difference was observed between the performance of AD and MCI groups, with the AD group scoring lower than the MCI group. AD and MCI patients were generally prone to giving unimportant information or details rather than summarizing statements, although individuals’ responses varied to some extent. Additionally, as was also observed for previous measures, the AD group’s performance was significantly worse than that of the old-older adults.
Two studies used a think-aloud procedure (Creamer and Schmitter-Edgecombe 2010; Schmitter-Edgecombe and Creamer 2010), wherein participants were given a short narrative text to read, and were asked to vocalize their thoughts about the story simultaneously while reading the narrative text. Every utterance of participants was classified either as an ‘inferential clause’ or a ‘non-inferential clause’, by two assessors, one of whom was blinded to the diagnostic status. The classification system used by Trabasso and Magliano (1996) was employed, wherein, statements that were either explanations, predictions, or formed associations, were categorized as ‘inferential’, and other statements (e.g. repetitions or paraphrases) were classified as ‘non-inferential’. Although, overall, all groups uttered more inferential clauses compared to non-inferential, both AD and MCI groups uttered significantly fewer inferential clauses compared to cognitively healthy adults.
One included study (Welland et al. 2002) used Yes/No questions as the only outcome to measure comprehension following story narration. The format used in this study was adapted from the standardized discourse comprehension test developed by Brookshire and Nicholas (1993). The questions were categorized based on the level of detail—main idea and details, and the type of information—implied or stated. Both patient groups—EDAT and MDAT—performed significantly worse on all types of questions, compared to the healthy group, but the performance of the two patient groups did not differ from one another on any measure. All groups generally performed better on ‘main idea’ questions compared to ‘details’, and on ‘stated’ information compared to ‘implied’. Three other studies (Creamer and Schmitter-Edgecombe 2010; Drummond et al. 2019; Schmitter-Edgecombe and Creamer 2010) included comprehension questions following the other retelling and ‘think-aloud’ tasks, to test for comprehension of the narrative passage. In two studies, half of the True/False questions were based on information that needed to be inferred from the text and half of the questions were based on facts that were explicitly stated in the text. AD and MCI groups answered fewer questions correctly, overall, compared to controls, in all studies. However, when performance on inferential questions was examined specifically, in the two studies that made this distinction, AD and MCI groups did not differ significantly from controls. Therefore, in these studies, this measure was relatively less informative, as the nature of the questions (True/False) poses two problems. First, there is a 50% chance of answering the question correctly, irrespective of how well one may or may not have understood the narrative. This can be observed in the AD group’s performance, which was in fact at chance level. 
Second, there may be possible ceiling effects in the healthy adults group’s performance, as can be observed in the high means across all the studies. It is also possible that performance on this task was made easier by reliance on recognition memory, rather than recall. Therefore, this method may not be optimal in terms of appropriateness and complexity in investigating the current question.
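The chance-level concern raised above for True/False questions can be made concrete with a binomial calculation. The following is a minimal sketch, assuming independent guesses with a 50% success rate per question. The number of questions (10) is a hypothetical value chosen for illustration, since the text does not state how many questions each study used, and `p_at_least` is an illustrative helper name.

```python
from math import comb


def p_at_least(k, n, p=0.5):
    """Probability of at least k correct answers out of n by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical test length; not taken from any of the included studies.
n_questions = 10

for k in (5, 8, 10):
    prob = p_at_least(k, n_questions)
    print(f"P(at least {k}/{n_questions} correct by guessing) = {prob:.4f}")
```

Under these assumptions, a participant who guesses on every question still answers at least half correctly more often than not, which is why scores near 50% cannot be distinguished from no comprehension at all.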
Overall, a deficit in discourse comprehension in individuals with AD and MCI was consistently observed across all studies, pointing to a robust effect. These results show that, with the exception of one measure, discourse comprehension measures are able to reliably distinguish early stage AD and MCI patients from cognitively healthy older adults.
Association between discourse comprehension measures and cognitive measures
In addition to examining the discourse comprehension differences between AD, MCI, and cognitively healthy older adults, the review also aimed to examine whether performance on the discourse comprehension task correlated with performance on commonly used neuropsychological tests. The purpose of this was twofold: the first was to examine which cognitive processes, if any, are able to predict performance on a discourse comprehension task, giving an indication of the underlying mechanisms involved; the second was to determine whether discourse comprehension tasks are able to tap into processes beyond what traditionally used neuropsychological tests measure. Studies used tests such as RAVLT, WAIS-III, listening span, D-KEFS, and MMSE to measure verbal memory, working memory, and executive functions. However, not all of these measures were used consistently across the included studies. Therefore, it was somewhat challenging to draw robust conclusions about their association with discourse comprehension. For measures that were employed in multiple studies, the results were mostly mixed. When the association between MMSE scores and performance on the experimental task was examined, one study (Chapman et al. 2002) found a significant correlation (r = 0.65), whereas another study (Almor et al. 2001) found only a marginally significant correlation between the two measures, which disappeared when working memory was accounted for. In another study (Welland et al. 2002), MMSE scores did not significantly predict discourse comprehension when episodic memory or working memory was added to the regression model. Similarly, working memory measures were significantly associated (r = 0.64, r = -0.83) with discourse comprehension in two studies (Almor et al. 2001; Welland et al. 2002), but two other studies (Creamer and Schmitter-Edgecombe 2010; Schmitter-Edgecombe and Creamer 2010) found no association. 
It is important to note that different studies used different tests to measure working memory (e.g. listening span, WAIS-III, digit span). These varying results may be due to heterogeneity in the different experimental tasks and tests used in different studies. However, both studies that included a verbal memory measure (RAVLT) found a significant, albeit moderate (r = 0.50 to r = 0.64) correlation with discourse comprehension measures. Only one study (Welland et al. 2002) reported a positive association with episodic memory (r = 0.91). Additionally, one study (Creamer and Schmitter-Edgecombe 2010) found significant correlations with TMT-A (r = 0.58) and D-KEFS (r = 0.62), measuring attention and executive functions, respectively. The study also looked at several other tests of attention and executive functions, as well as tests of language, but none of these showed association with macrostructural measures of discourse comprehension. The moderate correlation with verbal memory, and the moderate or non-significant correlations with other measures indicate that discourse comprehension tasks tap into additional processes that are not assessed by neuropsychological tests used routinely in the clinical diagnosis of AD. This warrants investigation of discourse comprehension tasks as a possibly more comprehensive assessment tool.
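One way to see how an association between two measures can disappear once a third is accounted for, as described above for MMSE and working memory, is the first-order partial correlation. The following is a minimal sketch of the standard formula. It is not the analysis run in any of the included studies, and the input correlations are hypothetical values chosen for illustration.

```python
from math import sqrt


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical correlations, invented for illustration only:
# x = MMSE, y = discourse comprehension, z = working memory.
r_mmse_discourse = 0.40
r_mmse_wm = 0.60
r_discourse_wm = 0.65

adjusted = partial_corr(r_mmse_discourse, r_mmse_wm, r_discourse_wm)
print(f"raw r = {r_mmse_discourse:.2f}, partial r = {adjusted:.3f}")
```

With these made-up inputs, a moderate raw correlation shrinks to nearly zero once the shared variance with working memory is removed, which is the kind of pattern the text describes.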