Background

Parkinson’s disease (PD) is a movement disorder characterized by progressive degeneration of the dopaminergic system, affecting both the cell bodies in the substantia nigra and their projections to the striatum. Within the dopaminergic system, the dopamine transporter regulates synaptic dopamine levels through dopamine reuptake, and its density reflects presynaptic function.

[18F]FE-PE2I, developed in 2009 [1, 2], has, through several validity studies, proven to be a valuable positron emission tomography (PET) radioligand for dopamine transporter (DAT) imaging [1,2,3,4,5,6,7,8]. A clinical PET study with [18F]FE-PE2I in twenty patients with early-stage PD showed an in vivo striato-nigral gradient of DAT loss [9], in agreement with post-mortem studies in patients and animal model studies [10, 11]. After initial clinical validation [6], additional studies showed that simplified quantification of [18F]FE-PE2I-PET can be achieved with a shortened imaging protocol, making clinical implementation realistic [7, 12]. The test-retest reliability of [18F]FE-PE2I has been previously studied in twelve young, healthy men [13], showing low variability and good reliability. However, in order to evaluate the suitability of [18F]FE-PE2I as a disease progression marker, it is critical to assess reliability also in patient samples. Due to the degeneration of striatal projections, patients with PD have lower DAT availability, which could lead to lower measurement reliability compared with healthy controls.

In the majority of PET studies of the striatal dopaminergic system, the regions of interest are the striatum, divided into caudate, putamen, and ventral striatum (nucleus accumbens), and the substantia nigra (midbrain). This anatomical subdivision of the striatum, although useful, might not represent the functional organization of the striatum. Instead, subdivision based on the connectivity between the basal ganglia and the neocortex can be used, where the striatum is divided into a limbic, associative, and sensorimotor striatum (respectively ventral striatum, caudate and ventrolateral putamen, and dorsolateral putamen) [14]. The functional regions have shown to be useful for molecular imaging studies examining correlations with behavioral and clinical outcome measures [15,16,17].

The assessment of the test-retest reliability of DAT-PET in patients with PD is relevant for several reasons. First, knowledge of the natural variability is essential for interpreting the results of longitudinal follow-up studies; second, the measured variability can be used to estimate the minimum effect size on DAT needed for disease-modifying treatment trials; and third, it enables power calculations for future clinical studies investigating longitudinal treatment efficacy.

The primary objective of this study was, therefore, to assess the test-retest reliability of [18F]FE-PE2I measurements in the main striatal areas and substantia nigra in patients with PD. The hypothesis was that the reliability in the striatum would be similar to that observed in healthy subjects, and that the reliability in the substantia nigra would be lower than in the striatum, given the lower DAT density in the substantia nigra. The secondary objective was to evaluate the test-retest reliability of three connectivity-based functional subdivisions of the striatum in view of future PET analyses.

Materials and methods

Study population

Eleven patients with PD, Hoehn and Yahr (H&Y) stage < 3, were recruited via advertisement on the Swedish Parkinson Foundation website and via two specialist outpatient clinics in Stockholm (Academic Specialist Centre, Karolinska University Hospital). None of the subjects had clinically relevant somatic comorbidities, cognitive decline, history of psychiatric disease, illicit drug abuse, or alcoholism, as assessed by a structured interview. Physical examination, electrocardiography, and routine blood tests were normal. One patient had to be excluded from the PET analysis because the cerebellum was partly out of the PET axial field of view. Demographic details are shown in Table 1.

Table 1 Demographic and clinical characteristics of the patients

Data collection

Activity monitor and disease severity assessment

An activity monitor (Actigraph GT3X+) was worn on the hip for 5–7 days before each PET measurement. The average number of steps and the magnitude of movement per day were recorded as a supportive measure of clinical motor stability [18, 19]. Only days with at least 540 min of wear time were included in the calculation [20]. As a measure of disease severity, the Movement Disorder Society Unified Parkinson’s Disease Rating Scale, part 3, motor function (MDS-UPDRS-III) was administered, including H&Y staging. All MDS-UPDRS-III assessments were performed at the same time of day by the same physician (VSK) in the practically defined “OFF” state (see below). Symptom duration was defined as the time from reported onset of first motor symptoms.
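The wear-time rule described above can be sketched as a simple filter. This is a minimal illustration, assuming each day is available as a (wear minutes, step count) pair; that data layout, the function name, and the example values are hypothetical and not taken from the study.

```python
from statistics import mean

MIN_WEAR_MINUTES = 540  # wear-time threshold stated in the text


def mean_daily_steps(days):
    """Average step count over days with sufficient wear time.

    `days` is a list of (wear_minutes, steps) tuples (hypothetical layout).
    Returns None when no day meets the wear-time threshold.
    """
    valid_steps = [steps for wear, steps in days if wear >= MIN_WEAR_MINUTES]
    return mean(valid_steps) if valid_steps else None


# Illustrative values only, not study data: the 500-min day is dropped.
example_week = [(600, 8000), (500, 3000), (540, 6000), (720, 7000)]
print(mean_daily_steps(example_week))
```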

MRI acquisition

Using a 3 Tesla MRI system (General Electric, Discovery MR750), T2-weighted images were acquired to exclude clinically significant pathology, and 3D T1-weighted images were acquired for co-registration with PET and delineation of the regions of interest (ROI). The latter sequence had 176 slices of 1 mm thickness, field of view 256 × 256 mm, resolution 1 × 1 × 1 mm, inversion time 450 ms, echo time 3.18 ms, and repetition time 8.16 ms.

PET acquisition

[18F]FE-PE2I was prepared as previously described [21]. Two 93-min [18F]FE-PE2I PET measurements were acquired in each subject within an interval of 7–28 days (see PET injection characteristics, Supplementary Table S1). PET measurements were done at the same time of day, around 1:30 pm. Patients were asked to be in the practically defined “OFF” state, meaning a withdrawal of levodopa medication for at least 12 h and of other dopaminergic medication for at least 24 h. In addition, abstinence from caffeine for 3 h before PET, from nicotine on the day of PET, from alcohol for 48 h before PET, and from cardiovascular training for 96 h before PET was requested. An individually made plaster helmet was used for head fixation in the PET camera. PET measurements were acquired with a high-resolution research tomograph (HRRT, Siemens Molecular Imaging) after an intravenous bolus injection of [18F]FE-PE2I. Details can be found in Supplementary Table S1. A 6-min transmission scan with a Caesium-137 source was obtained for attenuation correction. For technical reasons, the transmission scan for one patient could not be acquired on the day of the first PET measurement, so the transmission scan acquired before the second PET measurement was used for attenuation correction of the first PET measurement.

List mode PET data were reconstructed into 37 frames (8 × 10, 5 × 20, 4 × 30, 4 × 60, 4 × 180, 12 × 360 s) using 3D ordinary Poisson ordered-subset expectation maximization (OP-OSEM) with 10 iterations and 16 subsets, including modeling of the point spread function (PSF) [22]. Frame-to-frame realignment was performed as previously described [23], the only difference being that the first 2 min, instead of the first minute, were used as the reference frame for PET realignment.
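As a consistency check on the framing scheme, the schedule can be expanded into individual frame durations. The sketch below only restates the numbers given in the text; the function and variable names are illustrative assumptions.

```python
# (number of frames, frame duration in seconds), as listed in the text
FRAME_SCHEDULE = [(8, 10), (5, 20), (4, 30), (4, 60), (4, 180), (12, 360)]


def expand_frames(schedule):
    """Return the list of individual frame durations in seconds."""
    durations = []
    for count, seconds in schedule:
        durations.extend([seconds] * count)
    return durations


def frame_midpoints_min(durations):
    """Return the mid-time of each frame in minutes from scan start."""
    midpoints, start = [], 0
    for duration in durations:
        midpoints.append((start + duration / 2) / 60)
        start += duration
    return midpoints


durations = expand_frames(FRAME_SCHEDULE)
print(len(durations), "frames,", sum(durations) / 60, "min in total")
```

The expanded schedule yields 37 frames covering 5580 s, which matches the 93-min acquisition stated above.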

PET motion correction

Head motion was evaluated by patient observation during data acquisition as well as during image analysis by reviewing the realignment plots and brain time activity curves (TACs). Translation of more than 3 mm on the realignment plots led to additional motion correction using an in-house developed automatic procedure. A description of the method is given in Supplementary Text 1.

Image analysis

Using SPM12, the T1-weighted 3D MRI sequence was first realigned to the AC-PC plane (anterior commissure-posterior commissure), after which the PET was realigned and co-registered to the realigned MRI. The following regions of interest were then delineated automatically on the T1-weighted images with FreeSurfer version 6.0.0 (http://surfer.nmr.mgh.harvard.edu/): whole striatum (STR), caudate (CAU), putamen (PUT), ventral striatum (VS), and cerebellum. For the substantia nigra (SN), a functional molecular template created within the research group [9] was used. As an exploratory outcome, three functionally subdivided striatal areas [14] were added to the analysis (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases/striatumconn). For all regions, regional non-displaceable binding potentials (BPND) of [18F]FE-PE2I were generated using wavelet-aided parametric imaging (WAPI) [24] with t* = 27 min and the cerebellum as reference region.

As described above, for one subject, only one transmission scan could be acquired, which had to be used for attenuation correction of both PETs. This technical issue introduced a bias in BPND due to misaligned attenuation correction in the reference region. See Supplementary Text 2 for analysis and explanation. It was, therefore, decided to exclude the subject from the main analysis and to report the results including the outlier as supplementary material. We believe that the test-retest metrics calculated without the outlier are more representative of the study sample.

Statistical analysis

For statistical analysis, R version 3.4.3 was used with the package relfeas (https://github.com/mathesong/relfeas). [18F]FE-PE2I measurement reproducibility was determined by calculating repeatability (absolute intrasubject variability, AbsVar; and the minimum detectable difference, MDD) and reliability (intraclass correlation coefficient, ICC), as recommended by Weir, Baumgartner, and Matheson [25,26,27]. Absolute variability was calculated as: |test − retest|/(mean of test and retest) × 100. For ICC, the two-way random effects, absolute agreement, single rater/measurement form was used, corresponding to:

$$ \mathrm{ICC}=\frac{\mathrm{MS}_{\mathrm{S}}-\mathrm{MS}_{\mathrm{E}}}{\mathrm{MS}_{\mathrm{S}}+\left(k-1\right)\mathrm{MS}_{\mathrm{E}}+k\left(\mathrm{MS}_{\mathrm{T}}-\mathrm{MS}_{\mathrm{E}}\right)/n} $$

with MS_S, subjects mean square; MS_E, error mean square; k, number of trials; MS_T, trials mean square; and n, number of subjects.
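The two metrics defined above can be sketched in plain Python. This is not the relfeas implementation; it is a minimal illustration that assumes AbsVar is averaged over subjects and that the mean squares come from a standard two-way ANOVA decomposition without replication. Function names and example values are hypothetical.

```python
def abs_variability(test, retest):
    """Mean absolute percentage difference between paired measurements."""
    per_subject = [
        abs(t - r) / ((t + r) / 2) * 100 for t, r in zip(test, retest)
    ]
    return sum(per_subject) / len(per_subject)


def icc_2_1(data):
    """Two-way random effects, absolute agreement, single measurement ICC.

    `data` is a list of n rows (subjects), each with k values (trials).
    """
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares for subjects, trials, and residual error
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_trials = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_subjects - ss_trials

    ms_s = ss_subjects / (n - 1)
    ms_t = ss_trials / (k - 1)
    ms_e = ss_error / ((n - 1) * (k - 1))
    return (ms_s - ms_e) / (ms_s + (k - 1) * ms_e + k * (ms_t - ms_e) / n)


# Illustrative BPND-like values only, not study data.
test_scan = [2.1, 3.4, 2.8, 1.9, 3.0]
retest_scan = [2.2, 3.3, 2.9, 2.0, 2.8]
print(abs_variability(test_scan, retest_scan))
print(icc_2_1([list(pair) for pair in zip(test_scan, retest_scan)]))
```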

The ICC represents the proportion of the variability not attributable to measurement error. As such, an ICC of 1 indicates perfect measurement reliability with all observed variability being due to true (biological) differences and none to measurement variability (error), while an ICC of 0.5 indicates that the variability consists of true differences and measurement error in equal measure. Different interpretations of the ICC exist; as proposed by Portney and Watkins [28] and suggested by Matheson [27], we regard an ICC < 0.5 as low, 0.5–0.75 as moderate, 0.75–0.9 as good, and > 0.9 as excellent. An ICC > 0.9 is recommended as the lowest acceptable standard for measurements from which diagnostic decisions are made and an ICC > 0.7 for research purposes, with 0.95 and 0.8 considered adequate, respectively [29].
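The interpretation bands above can be written as a small lookup. The text does not say which band the boundary values 0.75 and 0.9 belong to, so the assignment below (0.75 counts as good, 0.9 counts as good) is an assumption, as is the function name.

```python
def classify_icc(icc):
    """Map an ICC value to the qualitative bands used in the text.

    Boundary handling at exactly 0.5, 0.75, and 0.9 is an assumption.
    """
    if icc < 0.5:
        return "low"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


for value in (0.3, 0.74, 0.89, 0.97):
    print(value, classify_icc(value))
```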

The agreement between measurements in each region was assessed with Bland-Altman plots. Power plots were generated with the jamovi software (https://www.jamovi.org/). The results of study variables are expressed as mean ± standard deviation unless otherwise stated.
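The quantities drawn in a Bland-Altman plot can be computed as follows. This is a minimal sketch, assuming limits of agreement at bias ± 2 SD of the paired differences and differences taken as retest minus test; the sign convention, function name, and example values are assumptions.

```python
from statistics import mean, stdev


def bland_altman(test, retest, n_sd=2.0):
    """Return the points and reference lines of a Bland-Altman plot.

    Differences are retest minus test (assumed sign convention);
    the limits of agreement are bias +/- n_sd sample standard deviations.
    """
    diffs = [r - t for t, r in zip(test, retest)]
    means = [(t + r) / 2 for t, r in zip(test, retest)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "means": means,  # x-axis: mean of the two measurements
        "diffs": diffs,  # y-axis: difference between measurements
        "bias": bias,
        "lower": bias - n_sd * sd,
        "upper": bias + n_sd * sd,
    }


# Illustrative values only, not study data.
result = bland_altman([2.1, 3.4, 2.8, 1.9], [2.2, 3.3, 2.9, 2.0])
print(result["bias"], result["lower"], result["upper"])
```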

Results

All subjects completed the study according to the protocol, with the exception of the subject with only one transmission acquisition. The MDS-UPDRS-III and Actigraph outcomes support the subjects’ clinical stability during the study period (Table 1), with the exception of two patients who had a week of influenza and a week of holiday, respectively, which explains their lower Actigraph outcomes. One subject showed a large difference in test/retest MDS-UPDRS-III (18 vs. 28), based on 1-point increases spread over the different domains, with the left side being mildly symptomatic in the second assessment versus not symptomatic in the first assessment. The difference could not be explained by patient self-report and is probably due to natural symptom fluctuations, intra-observer variability, or both.

Representative test-retest BPND images of [18F]FE-PE2I are shown in Fig. 1. Table 2 shows the individual test-retest BPND values for the four main regions of interest, with the highest BPND in the CAU and VS and lowest in PUT and SN (Fig. 2). The Bland-Altman plots (Fig. 3) showed good agreement between the test-retest measurements. Supplementary Figure S1 shows the Bland-Altman plots including the outlier described earlier. Test-retest calculations are reported in Table 3. The striatal regions displayed low variability (AbsVar 5.3–7.5%) and high ICC (0.89–0.97). SN showed relatively higher absolute variability (11%), with a moderate ICC (0.74).

Fig. 1 Representative test and retest parametric BPND images of [18F]FE-PE2I

Table 2 Individual binding potential (BPND) values of [18F]FE-PE2I in striatal regions and substantia nigra

Fig. 2 Individual BPND values between scan 1 and scan 2. Striatal regions and substantia nigra, with and without the exclusion of the outlier, are displayed

Fig. 3 Bland-Altman plots of main regions of interest. The yellow lines correspond to the upper and lower 2SD line; red line: bias

Table 3 Test-retest metrics of [18F]FE-PE2I PET measurements (n = 9)

Bland-Altman plots of the exploratory regions of interest are presented in Supplementary Figure S2, and the repeatability and reliability results in Table S3. Low absolute variability and good ICC (> 0.75) were observed in the functional striatal subdivisions.

For transparency, test-retest metrics calculated including the outlier are reported in Supplementary Table S2.

Additionally, the reliability was assessed for the outcomes of the less vs. more affected hemisphere (Supplementary Table S3). This analysis showed similar test-retest consistency for both the more and less affected hemispheres. The less affected SN exhibited numerically better reliability and repeatability, although the difference was not significant.

Discussion

This study was designed to assess the test-retest reproducibility of [18F]FE-PE2I PET measurements of DAT in patients with PD (H&Y stage < 3). The results showed good repeatability and reliability of the measurements, supporting the use of [18F]FE-PE2I as a DAT biomarker in PD.

The [18F]FE-PE2I test-retest study in twelve young, healthy controls [13] observed comparable repeatability (CAU 4.8%, PUT 5.6%, SN 9.7%) and reliability (CAU 0.83, PUT 0.88, SN 0.71). This shows that the lower DAT availability as a consequence of PD, at least in H&Y stages < 3, does not substantially influence the consistency of its measurements. A relatively higher ICC in patient cohorts than in healthy control cohorts is to be expected because inter-individual differences are inherently larger in patient cohorts.

Test-retest reproducibility of other DAT-PET and DAT-SPECT radioligands

Hirvonen et al. [30] reported, in their same-day test-retest study in five healthy subjects, an absolute variability of 11C-PE2I measurements in manually delineated ROIs of 7.2 ± 4.4% for the ventral striatum and 6.5 ± 5.2% for the midbrain, with ICCs of 0.81 and 0.83, respectively. The higher ICC in midbrain compared to Suzuki et al. is probably due to higher absolute BPND values in the midbrain observed for 11C-PE2I and measured with an HRRT. Nurmi et al. [31] found the intraclass correlations of 18F-CFT uptake values in eight healthy controls to be 0.91, 0.94, and 0.86 for caudate, anterior putamen, and posterior putamen, respectively (manual ROIs). In the seven de novo PD patients, the intraclass correlations were 0.97, 0.95, and 0.96, respectively, which is comparable to our results. Scans were performed 2.5–3 months apart, with the second scan in PD patients following 3 months of levodopa treatment, showing that even after initiation of levodopa treatment, the DAT measurements have high reproducibility in PD patients in this time range.

Test-retest results of the clinical DAT-SPECT radioligands 123I-beta-CIT and 123I-FP-CIT showed measurement variability in PD patients of 7.4–16.8% in STR and 12.2% in striatal subdivisions [32,33,34], with corresponding ICCs of 0.59–1.00 [33, 34]. Results vary because of different ROI definition methods and outcome measures. The striatal test-retest variability of 123I-PE2I in seven healthy subjects [35] was 5.2 ± 4.5% (STR), 9.4 ± 7.0% (CAU), and 10.3 ± 5.1% (PUT), with corresponding ICCs of 0.92, 0.83, and 0.84. Our test-retest results are thus comparable to those of the clinically implemented DAT-SPECT radioligands. Given the advantages of PET compared to SPECT, the results support [18F]FE-PE2I as a strong candidate for clinical applications as well.

DAT measurement in smaller regions

The higher resolution of PET compared with SPECT permits a better assessment of low-binding regions, such as the SN. Test-retest repeatability and reliability in this region were inferior to test-retest metrics in the striatum. The smaller size of the SN and the smaller numerical value of BPND in this region are likely reasons for its greater variability. Despite this relative limitation, a previous study on DAT availability in the nigrostriatal system of early PD patients as quantified with [18F]FE-PE2I [9] showed changes of DAT availability in the SN compared to healthy controls that were still larger (30%) than the reliably detectable difference (~ 18%, based on effect size of n = 9), confirming that the assessment of the nigrostriatal system with [18F]FE-PE2I PET provides a comprehensive assessment of the PD pathophysiology in vivo.

The low variability and high reliability observed in the connectivity-based functional subdivisions of the striatum furthermore indicate that [18F]FE-PE2I PET is a reliable research tool. This is relevant for studies using the functional rather than anatomical striatal subdivisions in assessing correlations of DAT availability to specific clinical variables of cognitive function or behavior.

Power calculation

Using the variability in test-retest differences within each region, we estimated the size of within-individual changes that could be detected with a power of 80%, using a significance threshold of 0.0125 (Bonferroni correction for multiple comparisons across the four primary regions of interest, 0.05/4). These estimates suggest that a study with a sample size of 9 patients is sufficiently sensitive to detect within-individual differences greater than 5 to 11%, depending on the region, within the striatum and greater than 28% for the SN (Supplementary Table S2). The relationship between sample size and effect size was examined as a function of statistical power and is presented in Fig. 4. Yearly DAT decline in PD patients has been estimated to be between 5 and 13% in striatal regions [31, 36, 37], meaning that [18F]FE-PE2I PET is well suited for measuring biological DAT differences in striatal regions in a longitudinal follow-up study with a typical sample size for PET studies.
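The logic of this power calculation can be sketched with a normal approximation to a two-sided paired test. This is not the calculation performed with jamovi: an exact t-based calculation would give somewhat larger detectable changes at n = 9, so the sketch is only meant to show how the Bonferroni-corrected threshold, the sample size, and the variability of test-retest differences combine. Function names and the example SD are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_alpha(alpha, n_comparisons):
    """Significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def detectable_effect_size(n, alpha, power=0.8):
    """Smallest standardized paired difference detectable at given power.

    Normal approximation to a two-sided paired test; the effect size is
    the mean within-individual change divided by the SD of the changes.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)


def detectable_percent_change(n, alpha, sd_of_differences_pct, power=0.8):
    """Convert the standardized effect size to a percentage change."""
    return detectable_effect_size(n, alpha, power) * sd_of_differences_pct


alpha = bonferroni_alpha(0.05, 4)  # four primary regions of interest
# The SD of 6% below is an illustrative value, not a study result.
print(alpha, detectable_percent_change(n=9, alpha=alpha, sd_of_differences_pct=6.0))
```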

Fig. 4 Relationship between sample size and effect size for caudate, putamen, ventral striatum, and substantia nigra. The arrow indicates the sample size needed to detect a statistically significant difference with 0.8 power, based on the effect size corresponding to a 10% change

Conclusion

[18F]FE-PE2I measurements of DAT have good reliability in patients with PD (H&Y 1–2.5), even in small anatomical areas with lower DAT density, such as the substantia nigra. The test-retest metrics were equal or superior to those of other DAT radioligands. Thus, this study further supports the suitability of [18F]FE-PE2I as an imaging marker for longitudinal follow-up studies in PD.