Introduction

Frontotemporal dementia (FTD) is a clinically and pathologically heterogeneous type of early-onset dementia, typically characterized by atrophy of the frontal and/or temporal lobes [1]. The clinical profile of FTD shows behavioural and language disturbances, with cognitive deficits in executive function and relative sparing of memory and visuospatial abilities [2]. Up to 40% of FTD cases have an autosomal dominant pattern of inheritance. Mutations in the progranulin (GRN), microtubule-associated protein tau (MAPT) and chromosome 9 open reading frame 72 (C9orf72) genes are the most common causes [3]. Early diagnosis—albeit difficult due to the heterogeneous symptoms and overlap with other forms of dementia and psychiatric disorders—is essential for proper patient management and planning, non-pharmacological treatment, and patient stratification in upcoming disease-modifying clinical trials [4].

Research in the genetic FTD field has been increasingly moving towards the presymptomatic and early prodromal stages, as the critical time-window for treatment most likely lies prior to overt symptom onset. With promising avenues opening for clinical trials, identifying robust biomarkers is of utmost importance [5]. Previous neuropsychological studies showed that subtle cognitive deficits and decline are present in the presymptomatic stage, and gene-specific cognitive profiles can be detected [6,7,8]. These findings suggest that neuropsychological assessment in the presymptomatic and early prodromal stages can provide sensitive cognitive markers for FTD.

The semantic fluency test is one of the most widely used tests in neuropsychological assessments. In this brief, easy-to-apply test, people have to generate items from a particular semantic category (e.g., animals, foods) in 1 min [9]. The semantic fluency test presents high sensitivity and specificity for dementia diagnosis [9], with impaired performance found in both symptomatic [10] and presymptomatic FTD [7, 8]. Although the total number of items generated is commonly used to quantify test performance, qualitative, psycholinguistic information embedded in the output can also be investigated, including clusters (number of multiword strings), switches (number of transitions between clusters), age of acquisition (AoA; the age at which a word is learned), and lexical frequency (LF; how often a word occurs in daily language) [12, 13]. Previous research demonstrated the prognostic value of qualitative fluency measures in cognitively healthy subjects at risk for Alzheimer’s dementia (AD) and in conversion from prodromal to overt AD [14, 15]. This approach has been underexplored in presymptomatic and/or prodromal genetic FTD, even though this psycholinguistic information may be able to detect the subtle development of FTD’s characteristic language symptoms at an early stage.

The aim of this study was therefore to investigate longitudinal changes in five qualitative aspects of semantic fluency (i.e., number of clusters and switches, cluster size, AoA, and LF) in mutation carriers that developed FTD (phenoconverters), presymptomatic mutation carriers, and non-carriers from autosomal dominant GRN- and MAPT-FTD families. We were specifically interested in the inflection point (i.e., when in the disease trajectory) at which the qualitative measures start to deviate from normal. Additionally, we explored the co-correlation between the qualitative measures, and their associations with cognitive decline and grey matter (GM) volume loss, and the prognostic value of decline in qualitative measures in predicting symptomatic onset.

Methods

Participants

We included longitudinal data of 118 participants from the FTD Risk Cohort of the Erasmus MC University Medical Center (Rotterdam, the Netherlands). This is an ongoing study in which first-degree family members of FTD patients due to a pathogenic mutation are followed on a 1- or 2-year basis [16]. Participants were recruited between February 2010 and October 2019. DNA genotyping at study entry assigned participants to the mutation carrier (n = 63; MAPT n = 20, GRN n = 43) or non-carrier group (controls; n = 55). Upon study entry, all mutation carriers were presymptomatic according to clinical diagnostic criteria for FTD [2, 17, 18], and had global CDR®-plus-NACC-FTLD [19] scores of 0. Ten mutation carriers (MAPT n = 6, GRN n = 4) developed symptoms during follow-up (phenoconverters). Diagnoses were made in multidisciplinary consensus meetings, using information from the standardized clinical assessment (see below). Phenoconverters met the following criteria: [i] progressive deterioration of behaviour, language and/or motor functioning; [ii] functional decline, evidenced by multiple study visits with global CDR®-plus-NACC-FTLD [19] ≥ 0.5 without reversing back to 0; and [iii] cognitive decline, evidenced by ≥ 1.5 SD below age-, sex- and education-specific means in ≥ 1 domain on neuropsychological assessment. Eight phenoconverters had clinical features of bvFTD (MAPT n = 6, GRN n = 2), and two had non-fluent variant PPA (GRN n = 2). The presymptomatic mutation carriers that did not develop FTD symptoms are referred to as non-converters (n = 53).

Clinical assessment

Every 1–2 years, all participants underwent a standardized clinical assessment, consisting of a structured interview with the participant and a knowledgeable informant (incorporating the CDR®-plus-NACC-FTLD [19]), medical history taking, neurological examination, neuropsychological assessment, and brain MRI. The neuropsychological assessment consisted of cognitive screening tests (Mini-Mental State Examination, MMSE, and Frontal Assessment Battery, FAB), and tests within the major cognitive domains. Neuropsychiatric symptoms were assessed with the Beck’s Depression Inventory (BDI) and the brief questionnaire form of the Neuropsychiatric Inventory (NPI-Q) (see [7, 8] for full test battery).

Qualitative fluency measures

The semantic fluency task was part of the neuropsychological assessment at each study visit. In this task, participants were asked to verbally generate as many different animals as possible in 60 s [20]. The total score is the number of animals produced minus errors. Additionally, we calculated five qualitative fluency measures from the output: LF, AoA, number of clusters, cluster size and number of switches (Box 1, Appendix 1 for scoring guidelines).
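To make the five measures concrete, the following is a minimal Python sketch of how they could be derived from a single response list. It is not the scoring procedure of Appendix 1: the subcategory labels, the LF and AoA lookup values, and the function name `qualitative_measures` are all hypothetical. It assumes that a cluster is a run of two or more consecutive words from the same subcategory, and that switches are counted as transitions between such clusters (the number of clusters minus one).

```python
from itertools import groupby
from statistics import mean


def qualitative_measures(words, subcategory, lf, aoa):
    """Illustrative qualitative fluency measures for one response list.

    words       -- animals in the order they were produced
    subcategory -- hypothetical mapping word -> semantic subcategory
    lf, aoa     -- hypothetical lookups for lexical frequency and age of acquisition
    """
    # Group consecutive words that share a subcategory label into runs.
    runs = [list(group) for _, group in groupby(words, key=lambda w: subcategory[w])]
    # Assumption: only multiword runs (two or more words) count as clusters.
    clusters = [run for run in runs if len(run) >= 2]
    n_clusters = len(clusters)
    return {
        "n_clusters": n_clusters,
        "mean_cluster_size": mean(len(c) for c in clusters) if clusters else 0.0,
        # Assumption: switches are transitions between consecutive clusters.
        "n_switches": max(n_clusters - 1, 0),
        "mean_lf": mean(lf[w] for w in words),
        "mean_aoa": mean(aoa[w] for w in words),
    }


# Made-up example data, for illustration only.
SUBCATEGORY = {"dog": "pets", "cat": "pets", "shark": "sea", "cow": "farm", "pig": "farm"}
LF = {"dog": 5.0, "cat": 5.0, "shark": 3.0, "cow": 4.0, "pig": 4.0}
AOA = {"dog": 3.0, "cat": 3.0, "shark": 6.0, "cow": 4.0, "pig": 4.0}

example = qualitative_measures(["dog", "cat", "shark", "cow", "pig"], SUBCATEGORY, LF, AOA)
```

In this example the single word "shark" does not form a cluster, so the response contains two clusters of size two and one switch. This also illustrates how a multiword definition of clusters can penalize short outputs.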

Study design

Phenoconverters, non-converters and controls were compared at five time points: baseline, and follow-up after 2, 4, 5 and 6 years, for the purpose of this study restructured as (Fig. 1):

  1. Four years before phenoconversion: data were available for six phenoconverters. The other four developed symptoms between baseline and the first follow-up visit, therefore no data 4 years prior to phenoconversion were available. The data were compared to the baseline data of non-converters and controls.

  2. Two years before phenoconversion: data were available for all ten phenoconverters. The data were compared to the 2-year follow-up data of non-converters and controls.

  3. Phenoconversion: data were available for all ten phenoconverters. The data were compared to the 4-year follow-up data of non-converters and controls.

  4. One year after phenoconversion: data were available for four phenoconverters. The other six were clinically too impaired to undergo neuropsychological testing or passed away (n = 4), or converted recently, therefore no data 1 year after phenoconversion were available yet (n = 2). The data were compared to the 5-year follow-up data of non-converters and controls.

  5. Two years after phenoconversion: data were available for four phenoconverters. The other six were clinically too impaired to undergo neuropsychological testing or passed away (n = 5), or converted recently, therefore no data 2 years after phenoconversion were available yet (n = 1). The data were compared to the 6-year follow-up data of non-converters and controls.
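The restructuring described in the five time points above can be summarized as a small lookup table. The Python sketch below is illustrative only; the names `RESTRUCTURED_TIME_POINTS`, `restructured_time_point` and `comparison_visit` are hypothetical, and visits are assumed to be expressed in whole years since baseline.

```python
# Restructured time point -> (years relative to phenoconversion,
#                             comparison visit of non-converters and controls,
#                             in years since baseline)
RESTRUCTURED_TIME_POINTS = {
    1: (-4, 0),  # 4 years before phenoconversion vs. baseline
    2: (-2, 2),  # 2 years before phenoconversion vs. 2-year follow-up
    3: (0, 4),   # phenoconversion vs. 4-year follow-up
    4: (1, 5),   # 1 year after phenoconversion vs. 5-year follow-up
    5: (2, 6),   # 2 years after phenoconversion vs. 6-year follow-up
}


def restructured_time_point(visit_year, conversion_year):
    """Return the restructured time point of a phenoconverter visit, or None."""
    offset = visit_year - conversion_year
    for time_point, (years_from_conversion, _) in RESTRUCTURED_TIME_POINTS.items():
        if offset == years_from_conversion:
            return time_point
    return None  # the visit does not fall on one of the five time points


def comparison_visit(time_point):
    """Return the follow-up year of non-converters and controls used for comparison."""
    return RESTRUCTURED_TIME_POINTS[time_point][1]
```

For instance, a phenoconverter who converted at the 4-year visit contributes the 2-year visit to time point 2, which is compared with the 2-year follow-up data of non-converters and controls.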

Fig. 1

Subject sample and study design. The total sample (n = 118) was divided into mutation carriers (n = 63) and non-carriers (healthy controls; n = 55); the mutation carrier group was split into phenoconverters (n = 10) and non-converters (n = 53). The original data were restructured for phenoconverters, so that there were five time points: 4 years before phenoconversion, 2 years before phenoconversion, phenoconversion, 1 year post-phenoconversion and 2 years post-phenoconversion. Four years before phenoconversion, data were available for only 6 phenoconverters, as the other four phenoconverters developed symptoms between baseline and the first follow-up visit, and therefore no data 4 years prior to phenoconversion were available. One and 2 years after phenoconversion, data were available for only 4 phenoconverters; the other 6 were either lost to follow-up, as they were clinically too impaired to undergo neuropsychological testing or passed away (n = 4), or converted recently so that follow-up data post-phenoconversion were not available at this time point (n = 2). The data of phenoconverters were compared to the baseline and the 2-, 4-, 5- and 6-year follow-up data, respectively, of non-converters and healthy controls.

MRI acquisition and (pre)processing

On each study visit, we performed volumetric T1-weighted MRI scanning on a Philips 3 T Achieva MRI scanner (Philips Medical Systems, Best, the Netherlands) using either an 8- or 32-channel SENSE head coil and the following scan parameters: inversion/repetition time = 933/2200 ms, flip angle = 8°, voxel size = 1.1 × 1.1 × 1.1 mm, matrix size = 256 × 256 × 208, total scan time = 4.43 min. All scans underwent visual quality control. The DICOM images were subsequently corrected for gradient nonlinearity distortions and converted to NIfTI format. They were then pre-processed in the Voxel-Based Morphometry (VBM) pipeline in Statistical Parametric Mapping 12 (SPM12; Functional Imaging Laboratory, University College London, London, UK; www.fil.ion.ucl.ac.uk/spm) implemented in Matlab R2018a (Mathworks, USA). First, the T1-weighted images were normalized to a template space and segmented into GM, white matter (WM) and cerebrospinal fluid (CSF), after which they were rigidly aligned. We calculated total intracranial volume (TIV) by adding the GM, WM and CSF volumes. Second, the segmentations were spatially normalized to a DARTEL template by applying the flow fields of all the individual scans. Images were smoothed using a 6 mm full width at half maximum (FWHM) isotropic Gaussian kernel. After every preprocessing step, images were visually inspected.

Statistical analysis

We performed statistical analyses using SPSS Statistics 25.0 (IBM Corp., Armonk, NY). Alpha was set at 0.05 across all comparisons (two-tailed). We compared continuous demographic data between groups using one-way ANOVAs for normally distributed data (with Bonferroni post-hoc tests), or Kruskal–Wallis tests for non-normally distributed data (with Mann–Whitney U post-hoc tests). We analysed between-group differences in sex and gene with Pearson Χ2 tests. Interrater reliability was explored using intraclass correlation analysis. We assessed the co-correlation of the five qualitative measures at each time point by means of principal component analysis using Varimax rotation. Only factors accounting for 3% or more of variance and Eigenvalues > 1 were retained. Factor loadings were only considered meaningful when r > 0.450, and any item that did not load sufficiently onto a factor was removed [23]. For ease of interpretation, we first converted raw fluency and relevant neuropsychological measures (Boston Naming Test (BNT), Semantic Association Test (SAT), Trailmaking test B (TMT-B), Stroop colour-word test card III) into z-scores by subtracting the mean of controls from each individual’s raw score at that time point, divided by the SDs of controls at that time point. We then used multilevel linear regression modeling to investigate longitudinal decline in total and qualitative fluency measures. We performed two separate analyses to assess longitudinal change in qualitative fluency measures per [i] clinical status (phenoconverters, non-converters, controls) and [ii] gene (MAPT, GRN). We entered [i] or [ii], time, and first-order interactions, with age, sex, education and total number of words generated as covariates. Assumptions were checked (non-linearity, dependence of errors, outliers, heteroscedasticity). Significant interactions between covariates and dependent variables were included in the model. 
We based the covariance structure (Toeplitz Heterogeneous) on the lowest Akaike Information Criterion (AIC), as a lower AIC indicates a better model fit. We included the random intercept as this model presented with a lower AIC. For converters, we calculated deltas of the standardized values for the [i] qualitative fluency measures and [ii] relevant neuropsychological tests between restructured time-point 1 in the six phenoconverters that had data 4 years before phenoconversion, or time-point 2 in the four phenoconverters that only had data 2 years before phenoconversion, and time-point 3 (phenoconversion). We then explored the association between the delta fluency measures and delta neuropsychological tests (corrected for age, sex and education) using partial correlations. Change over time maps were generated by subtracting the GM maps calculated at time-point 3 (phenoconversion) from the maps calculated at time-point 1 or 2 in SPM12. We then explored the relationship between the delta qualitative fluency measures and delta GM maps by means of multiple regression models. Age, sex, TIV, and head coil were entered as covariates. We set the statistical threshold at p < 0.05, adjusted for multiple comparisons with familywise error (FWE) correction. Lastly, to investigate classification abilities of these delta z-scores, we performed binary logistic regression analyses. Assumptions were checked (non-linearity, dependence of errors, outliers, multicollinearity). The models were selected with a forward stepwise method according to the likelihood ratio test and applying the standard p-values for variable inclusion (0.05) and exclusion (0.10). Goodness of fit was evaluated with the HL Χ2 test, with Nagelkerke R2 as measure of effect size. The analyses were adjusted for age, sex, education, and total number of words generated. All models were corrected for multiple comparisons (Bonferroni).
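The standardization and delta steps described above can be illustrated with a short Python sketch. It is not the SPSS procedure used in the study: the function names and the example scores are hypothetical, the sample standard deviation is assumed (the text does not state which SD was used), and the delta is assumed to be the later value minus the earlier one, so that negative values indicate decline.

```python
from statistics import mean, stdev


def z_scores(raw_scores, control_scores):
    """Standardize raw scores against the controls at the same time point."""
    control_mean = mean(control_scores)
    control_sd = stdev(control_scores)  # assumption: sample SD of the controls
    return [(score - control_mean) / control_sd for score in raw_scores]


def delta(z_early, z_at_conversion):
    """Change in z-score between an early time point (1 or 2) and time point 3.

    Assumption: later minus earlier, so a negative delta means decline.
    """
    return z_at_conversion - z_early


# Made-up scores, for illustration only.
controls_t1 = [10, 12, 14]  # control scores at the first time point
controls_t3 = [11, 13, 15]  # control scores at the third time point
z_early = z_scores([12], controls_t1)[0]       # one participant at time point 1
z_conversion = z_scores([9], controls_t3)[0]   # same participant at time point 3
change = delta(z_early, z_conversion)
```

Because each score is standardized against the controls at the same time point, the delta reflects change relative to the control group rather than change in the raw score alone.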

Results

Demographics and clinical data

Demographic and clinical data are shown in Table 1. There were no differences between phenoconverters, non-converters, and controls in age [F(2,117) = 0.212, p = 0.809], sex [Χ2(2) = 1.568, p = 0.457], gene [Χ2(2) = 4.830, p = 0.089] or education level [F(2,117) = 1.290, p = 0.279]. There were no differences between MAPT and GRN phenoconverters in age [F(2,9) = 2.966, p = 0.123], sex [Χ2(2) = 0.524, p = 0.262] or education level [F(2,9) = 0.022, p = 0.945]. We found no differences between phenoconverters, non-converters, and controls regarding baseline MMSE [F(2,117) = 0.229, p = 0.796], FAB [F(2,40) = 2.504, p = 0.095], BDI [F(2,116) = 0.607, p = 0.547] or NPI-Q [F(2,73) = 0.031, p = 0.969]. There were no differences between MAPT and GRN phenoconverters in these measures (all p > 0.05). There were no differences in neuropsychological test scores between groups at study entry (all p > 0.05).

Table 1 Demographic and clinical data per subgroup

Interrater reliability

Two independent raters (LCJ, SAAML), blinded to participant’s clinical and genetic status, scored the number of clusters, cluster sizes, and number of switches (see Methods—Qualitative fluency measures, and Appendix 1). The interrater reliability of all three measures was considered ‘good’ (i.e., intraclass correlation coefficients 0.75–0.90) [25]: number of clusters 0.81 [95% CI 0.76–0.85], cluster sizes 0.85 [95% CI 0.81–0.87], and number of switches 0.85 [95% CI 0.81–0.88].

Co-correlation between the qualitative measures

Appendix 2 shows the results of the principal component analysis on the five qualitative measures per time-point. Based on our criteria and visual inspection of the scree plot, we could extract two components. The first component explained between 39.5 and 49.7% of the total variance, and both LF and AoA had high loadings (r > 0.900). The second factor explained between 29.0 and 43.3% of the total variance, and both the number of clusters and the number of switches, and at some time-points also the cluster size, had high loadings (r > 0.59).

Longitudinal qualitative fluency trajectories

The inflection points and longitudinal trajectories of the fluency measures are shown in Table 2. The individual trajectories in phenoconverters are displayed in Fig. 2. Phenoconverters had declining total scores from at least 4 years pre-phenoconversion (p < 0.001), while there was no decline in non-converters or controls (p > 0.05). As can be noted from Fig. 2, the qualitative measures demonstrated more noise in their fluctuation than the total score. Phenoconverters had higher LF from 4 years pre-phenoconversion in comparison to controls (p = 0.016). Also the number of clusters (p < 0.001) and switches (p = 0.004) started to decline at this point. AoA declined from phenoconversion onwards (p = 0.050). No longitudinal change was found for cluster size (p > 0.05). There was no change in qualitative fluency measures in non-converters or controls (p > 0.05). Different inflection points and longitudinal trajectories were found for GRN and MAPT phenoconverters. At least 4 years pre-phenoconversion, GRN phenoconverters started producing fewer words in comparison to controls (p = 0.005). Moreover, they started producing fewer but larger clusters (both p < 0.001), and used fewer switches (p = 0.004). GRN phenoconverters had larger cluster sizes than MAPT phenoconverters (p < 0.001). LF and AoA did not change in GRN phenoconverters compared to controls (p > 0.05). Starting at least 4 years pre-phenoconversion, MAPT phenoconverters produced fewer words than controls (p < 0.001). Moreover, LF started to increase (p = 0.007), while AoA (p = 0.034), the number of clusters (p = 0.009), and cluster size (p = 0.010) declined. The number of switches did not change (p > 0.05).

Table 2 Longitudinal trajectories of the total score and qualitative fluency measures in phenoconverters and non-converters
Fig. 2

Individual verbal fluency trajectories in the ten FTD phenoconverters. Lines represent the individual longitudinal changes (years to phenoconversion, X-axis) in semantic fluency total scores, lexical frequency, age of acquisition, cluster count and size, and switches (z-scores, Y-axis) in the 10 phenoconverters. The 6 MAPT phenoconverters are displayed in blue, the 3 GRN phenoconverters in green. Only the total score was available for phenoconverter 10 (GRN 4). Abbreviations: MAPT microtubule-associated protein tau, GRN progranulin

Associations with cognitive decline

Partial correlation coefficients between the fluency measures and relevant neuropsychological tests are shown in Table 3. In phenoconverters, decline on the SAT verbal correlated with an increase in LF (p = 0.031) and a decline in AoA (p = 0.037). Worse performance on TMT-B (p = 0.031) and Stroop-III (p = 0.026) correlated with an increase in cluster size, while worse performance on Stroop-III correlated with a decline in the number of switches (p = 0.031). In GRN phenoconverters, decline on the SAT verbal correlated with a decline in AoA (p = 0.020). Worse performance on Stroop-III correlated with a decline in the number of clusters (p = 0.022), and an increase in cluster size (p = 0.014), while decline on TMT-B correlated with a decline in the number of switches (p = 0.024). In MAPT phenoconverters, decline on the SAT verbal correlated with an increase in LF (p = 0.015) and a decline in AoA (p = 0.014).

Table 3 Partial correlations between decline in fluency measures and relevant cognitive tests

Associations with GM volume loss

The relationships between the qualitative fluency measures and GM volume loss are displayed in Fig. 3 and Appendix 3. In the total group of phenoconverters, worse performance on all five qualitative fluency measures was associated with GM volume loss in large overlapping areas spanning the frontal (e.g., middle frontal gyrus) and temporal lobes (e.g., middle and superior temporal gyrus), and the anterior insular and cingulate cortices (pFWE-corrected < 0.05). In GRN phenoconverters, an increase in LF and a decline in AoA were associated with GM volume loss of the cerebellum and predominantly temporal areas (e.g., medial temporal lobe, inferior temporal gyrus). A decline in the number of clusters, cluster size, and the number of switches correlated with GM volume loss of the cerebellum, insular cortex, and putamen (pFWE-corrected < 0.05). In MAPT phenoconverters, an increase in LF and a decline in AoA were associated with GM volume loss of the cerebellum and predominantly temporal areas (e.g., anterior temporal pole, inferior temporal gyrus). A decline in the number of clusters, cluster size, and the number of switches correlated with GM volume loss of the cerebellum and predominantly frontal areas (e.g., middle and superior frontal gyrus, frontal pole) (pFWE-corrected < 0.05).

Fig. 3

Grey matter atrophy patterns associated with lower qualitative fluency performance. VBM analyses demonstrated grey matter volume loss to be associated with lower performance in lexical frequency (red), age of acquisition (green), clusters (blue), cluster size (yellow), and switches (copper) in the total group of phenoconverters (top), GRN phenoconverters (middle), and MAPT phenoconverters (bottom). We set the statistical threshold at p < 0.05 (FWE-corrected). Abbreviations: L left, GRN progranulin, MAPT microtubule-associated protein tau

Classification abilities of qualitative fluency measures

Decline in the total score differentiated between phenoconverters and non-converters [X2(1) = 7.669, p = 0.006] and controls [X2(1) = 7.643, p = 0.006]. This was primarily driven by MAPT phenoconverters, as it differentiated well between this group and non-converters [X2(1) = 5.919, p = 0.015] and controls [X2(1) = 5.902, p = 0.015], but not between GRN phenoconverters, non-converters, and controls (p > 0.05). A decline in switches was predictive of phenoconversion in GRN [X2(1) = 5.069, p = 0.024], correctly classifying 90.3% of cases; a decline in cluster size was predictive of phenoconversion in MAPT [X2(1) = 3.894, p = 0.048], correctly classifying 89.3% of cases.

Discussion

This study examined longitudinal changes in qualitative aspects of the semantic fluency task in a large cohort of FTD phenoconverters, presymptomatic mutation carriers, and non-carriers from GRN- and MAPT-FTD families. Phenoconverters showed a decline in the total score from at least 4 years pre-phenoconversion, with individually-varying inflection points and longitudinal trajectories in qualitative fluency measures in GRN and MAPT. At least 4 years pre-phenoconversion, GRN phenoconverters started producing fewer but larger clusters, and switched less between clusters, which was correlated with executive dysfunction. A decline in switching was predictive of phenoconversion. At least 4 years pre-phenoconversion, MAPT phenoconverters demonstrated an increase in LF and a decline in AoA, which was correlated with semantic deficits. A decline in cluster size was predictive of phenoconversion. Increase in LF and decline in AoA were associated with GM volume loss of predominantly temporal areas, while decline in the number of clusters, cluster size, and switches correlated with GM volume loss of predominantly frontal areas.

The semantic fluency total score is strongly intertwined with the qualitative aspects of the task. For accurate and timely word retrieval, both AoA and LF [26], and clustering and switching components [27] are required. Consistent with the results of our principal component analysis, in which we found an AoA-LF component and a clusters-switches component, previous studies found strong correlations between AoA and LF, and between the number of clusters and the number of switches. With respect to the relation between AoA and LF, correlations have been found to be high in natural languages, as early-acquired words tend to occur more frequently than late-acquired words [28]. The number of switches and the number of clusters are correlated because the number of switches follows directly from the number of clusters (it corresponds to the number of clusters minus one) [27].

Irrespective of the underlying FTD mutation, we showed a decline in the total score in mutation carriers from at least 4 years prior to phenoconversion. Decline in semantic fluency was found to be an early cognitive marker in other neurodegenerative diseases. For instance, in preclinical AD and Huntington’s disease, decline in a measure of semantic memory was found as early as 12 years before the onset of dementia [29, 30]. In another study, it was amongst the most statistically sensitive cognitive measures of symptomatic conversion [31]. Studies into semantic fluency decline in presymptomatic FTD have shown somewhat contrasting results. One study demonstrated decline in semantic fluency in MAPT mutation carriers from 6 years before estimated symptom onset [7], whereas another only found decline at estimated symptom onset [6]. It should be noted that both studies used estimated years to symptom onset as a proxy for actual onset, which can be less reliable in familial FTD [6]. Using a research design similar to that of our current study, a previous study found decline of semantic fluency from 4 years before symptom onset in MAPT converters, and decline of semantic fluency was found to be the best predictor for having an MAPT mutation [8]. Although semantic fluency did not decline in GRN converters, decline on phonemic fluency was found to be predictive of a GRN mutation, confirming the value of fluency tasks in presymptomatic FTD as they can distinguish the underlying genotype [8].

Starting at least 4 years pre-phenoconversion, GRN mutation carriers produced fewer but larger clusters, and had fewer switches, than MAPT phenoconverters and controls. A likely explanation for the decline in their total score is that GRN phenoconverters deteriorate in cognitive flexibility [32], and thereby lose the ability to switch between semantic clusters in order to generate more words. Indeed, in our study the decline in the number of clusters and switches, and the increase in cluster size, correlated with decline in executive function. Executive dysfunction is known to be a distinctive cognitive feature in GRN mutations, demonstrating deficits in symptomatic mutation carriers [33], extending to the presymptomatic stage [7, 8]. When converting to the symptomatic stage, GRN mutation carriers also show the most decline in executive function [34].

Starting at least 4 years pre-phenoconversion, MAPT mutation carriers produced words with a higher LF and a lower AoA, and had fewer clusters and smaller clusters than GRN phenoconverters and controls. These findings point towards deterioration of the semantic system as an explanation as to why qualitative fluency measures change in MAPT. First, semantic decline is most likely to affect words with a lower LF and a higher AoA first, as the categorical organization of the system retrieves the ‘typical’ exemplars faster and more accurately, and they are better represented and more interconnected to other concepts than those that enter the semantic system later in life [35]. The reliance of both processes on semantic processing is further supported by their correlation with the verbal SAT which assesses verbal semantic deficits [36]. Clustering relies on lexical retrieval, vocabulary size and lexical access, and thus is mainly supported by the integrity of the semantic system [13].

We demonstrated that—irrespective of the underlying mutation—decline in the number of clusters, cluster size, and switches correlated with GM volume loss of predominantly frontal areas, while worse LF and AoA performance was associated with GM volume loss of predominantly temporal areas. These neuroanatomical correlates are in line with the predominant frontal involvement in GRN mutation carriers [6, 34], and link the degradation of the fronto-insula network to less cognitive flexibility—and as a consequence early clustering-switching impairment—as the most likely underpinning of declining fluency performance in conversion to GRN-associated FTD. The finding that LF and AoA rely on temporal lobe functioning could explain why these qualitative features are changing early in MAPT mutation carriers, as temporal volume loss is considered the neuroimaging hallmark of MAPT [37, 38], being present up to several decades before symptom onset [39]. Although PPA is not a frequent clinical phenotype, semantic impairments are well-described in MAPT-related FTD [40]. Our cohort includes three P301L and three G272V phenoconverters, which is too small to investigate differences between the two tau mutations. Nevertheless, with larger sample sizes it would be interesting to explore if there is clinical heterogeneity across the MAPT mutations [41], as the P301L mutation often presents with a language phenotype with semantic deficits [42]. Impairments in semantic fluency are found to be common as the result of cerebellar pathology, as this subcortical region plays a crucial role in motor performance and executive processes necessary for organizing and monitoring word output [43].

The key strength of our study is our longitudinal design, spanning up to 6 years of follow-up in a large single-centre sample of participants from GRN and MAPT FTD-families. This design allowed the investigation of mutation carriers as they were converting to the symptomatic stage, which provides more accurate information about the underlying disease process than previous studies that used estimated years to onset as a proxy [40]. We chose multilevel linear modelling to handle potential missing data and unbalanced time-points that were the result of our ongoing prospective study. The small sample of phenoconverters, in combination with the large fluctuations in the data of the qualitative measures, is the largest drawback of the study, and has hampered our statistical power and the interpretation of results. As the multilevel model assumes a linear relationship between genetic status and fluency performance, it is possible that we have missed non-linear effects. Ideally, the semantic fluency test should not have been used in determining phenoconversion; however, in our multidisciplinary approach we used all available clinical information (e.g., MR brain imaging, anamnestic and heteroanamnestic information, questionnaires), so that symptom onset did not depend solely on the neuropsychological assessment. Although theoretically a cluster can consist of a single word, one cannot measure interword intervals for fewer than two words, so that a single-word “cluster” cannot be corroborated by the measurement of interword intervals. Following Ledoux et al. [11], we therefore defined clusters not as single words, but only as multiword strings (i.e., two or more consecutive words) whose relationship is defined by one of the scoring rules, but we realize this could have penalized patients with a low total output.
Lastly, the analyses on the presymptomatic mutation carriers were performed using the original baseline and follow-up visits, regardless of years from potential phenoconversion; they might therefore have lost some sensitivity to detect decline. Future directions include replication of our findings in larger multicentre cohorts, including C9orf72 mutation carriers. Moreover, using qualitative measures in discriminative event-based models could help us understand the dynamics of disease progression and how other biomarkers (e.g., NfL) fit into this [44]. Finally, future studies could look into the effect of using time-bins next to the usual 60-s output, as most people start with readily available animals and produce less familiar exemplars as the task develops (which affects the qualitative measures); this approach was found to be particularly sensitive to mutation status in presymptomatic APOE-ε4 carriers at risk of developing AD [14].

Conclusion

Our pilot study shows that qualitative aspects of semantic fluency change in presymptomatic FTD, and shows different profiles and inflection points depending on the mutation involved. This could provide important insight into the mechanisms as to why the “traditional” total score is declining. Its brief and easy-to-apply nature makes the total score of the semantic fluency test a likely candidate cognitive biomarker for upcoming clinical trials for FTD, but more research with a larger sample of phenoconverters is needed to replicate our findings, and to explore the additional value of qualitative measures in identifying and tracking mutation carriers as they convert to the symptomatic stage.