Introduction

Despite the early description of Alzheimer’s disease (AD) in 1906 [1], the first set of globally accepted criteria for the clinical diagnosis of AD was only established in 1984 by the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer’s Disease and Related Disorders Association (NINCDS-ADRDA) [2]. Prior to these criteria, a suspected clinical diagnosis of AD could only be confirmed with certainty by the post mortem observation of ‘senile plaques’ and neurofibrillary tangles in the brain, much as Alois Alzheimer himself described them when he visualized them with low-power microscopy using Bielschowsky silver stain [1]. While diagnosis in living patients represents a significant advance, when applied by expert clinicians the 1984 clinical criteria are reported to have only approximately 80% positive predictive value (PPV) and 60% negative predictive value (NPV) for a pathology-confirmed diagnosis [3]. In general, by clinical criteria, sensitivity (proportion of true positive results) increases with more permissive clinical criteria, and specificity (proportion of true negative results) increases with more restrictive clinical criteria. However, the opposite is true for neuropathologic criteria.

Appropriately, a proposal to develop and validate new biomarkers for AD was included in the 2011 National Institute on Aging-Alzheimer’s Association (NIA-AA) updated diagnostic recommendations [4–7]. Underpinning the 2011 recommendations is the recognition that the progression of AD biomarkers over time likely follows an ordered temporal sequence, and that the Aβ biomarkers (e.g., low brain Aβ clearance, i.e., low cerebrospinal fluid [CSF] Aβ42; and tracer retention on amyloid positron emission tomography [PET] imaging) are representative of upstream events [5].

While normally CSF flows out of the brain ventricles, idiopathic normal pressure hydrocephalus (iNPH) may be caused by CSF ‘backflow’ into the brain from the ventricles [8]. Studies of various analytes have shown that a concentration gradient exists between CSF taken from the ventricles (higher concentration) and CSF taken from the lumbar spine (lower concentration) [9–12]. Aβ oligomers self-assemble into the larger Aβ species [13], which are deposited as plaque in the brain; elevated soluble Aβ oligomers in CSF have been associated with AD, although the data are inconsistent and the culprit toxic oligomer still needs to be identified [14]. With a hypothetical leap, it may not be so surprising that AD-like Aβ pathology frequently occurs concomitantly with iNPH.

Clinically, patients with iNPH present with a classic triad of symptoms: cognitive impairment, gait abnormalities, and urinary incontinence. Untreated, iNPH is progressive [15]. The treatment for iNPH is surgical placement of a ventriculoperitoneal shunt. A small cortical biopsy from the site of shunt placement may be obtained during the procedure. Though iNPH itself may cause dementia, an identifiable neurodegenerative process often underlies the cognitive decline [16]. Concomitant AD pathology is not uncommon [17–19], up to 68% in one series [20], and may be associated with a poorer response to ventriculoperitoneal shunting [17, 18]. Biopsies from iNPH patients represent a unique opportunity to study correlations between PET amyloid imaging and pathology in living subjects.

The PET radioligand Pittsburgh compound B (PiB), a neutral analog of thioflavin T (a stain that detects fibrillar amyloid), has been extensively studied both in living subjects and in autopsy tissue. PiB was designed to cross the blood-brain barrier, bind with high selectivity and nanomolar affinity to fibrillar Aβ, and clear rapidly from the brain [21]. Cortical [11C]PiB uptake, as measured by PET in living subjects, correlates with fibrillar Aβ load measured subsequently by immunohistochemical stains for Aβ40 and Aβ42 post mortem [22–24]. Unfortunately, the short half-life of [11C] (about 20 minutes) requires an on-site cyclotron for its production, limiting its use to academic medical and specialized imaging centers. Consequently, several fluorinated amyloid imaging agents have emerged to bridge this gap (F18 having a half-life of about 110 minutes): [18F]flutemetamol [25] and [18F]florbetapir [26] (Aβ diagnostics both approved by the Food and Drug Administration) as well as [18F]florbetaben, making amyloid PET radiopharmaceuticals available for widespread community use.

[11C]PiB, [18F]florbetapir, and [18F]florbetaben have similar fibrillar Aβ binding site affinities, and can be used in a comparable manner to assess brain amyloid density [27]. In this analysis, we used [18F]flutemetamol, which differs from PiB in structure by the addition of a single F18 atom [28].

Four clinical studies (GE067-008, -009, -010, and -011, referred to here as Studies A, B, C, and D and published separately as [20, 29–31], respectively) were undertaken in iNPH patients requiring surgical shunt procedures or intracranial pressure (ICP) monitoring, to determine how well cerebral fibrillar Aβ uptake of [18F]flutemetamol as quantified by PET imaging corresponded with immunohistochemical (IHC) and histochemical (HC) estimates of amyloid burden in biopsy samples taken during these procedures. Two of the studies (one in Europe and one in the United States [US]) were retrospective studies in which the biopsy was followed by the PET scan. Two of the studies (one in Europe and one in the US) were prospective studies, in which the order of procedures was reversed.

[18F]Flutemetamol uptake was measured quantitatively in specific brain regions including the cortical area of (prospective studies) or surrounding (retrospective studies) the biopsy site (ipsilateral site) and the site in the contralateral hemisphere that corresponded to the biopsy site (contralateral site). A composite neocortical measure of [18F]flutemetamol uptake was also calculated by averaging the uptake from frontal cortex, anterior cingulate gyrus, posterior cingulate gyrus/precuneus, lateral-temporal cortex, and parietal cortex. Aβ plaque frequency was determined and scored in biopsy specimens stained with Bielschowsky silver stain and thioflavin S. The percentage of grey-matter area occupied by plaque was also assessed following IHC for the monoclonal antibody 4G8. Finally, based on all available stains/slides, an overall pathology assessment was rendered of the Aβ load in biopsy tissue grey matter. These 4 pathology endpoints served as the standard of truth (SoT) in comparisons with the [18F]flutemetamol PET data.

In a pooled analysis of these 4 studies as previously reported, for specific parameters, [18F]flutemetamol uptake in ipsilateral and contralateral sites as well as the composite cerebral cortical measure of [18F]flutemetamol uptake were significantly correlated with Aβ plaque burden [32]. This confirmed similar findings in an autopsy study using [18F]florbetapir [33].

The standard uptake value ratio (SUVR) is a quantitative measure of tracer uptake in a brain, normalized for the mean uptake in a reference region. Pooling data from Studies A, B, C, and D, using biopsy pathology as the SoT and cerebellar grey matter as the quantitative PET reference region in one set of data and pons in another, here we set out to find (1) which pair(s) of PET SUVR and pathology SoT endpoints matched best, (2) whether quantitative measures of [18F]flutemetamol PET were better for predicting the pathology outcome than majority [18F]flutemetamol PET visual-based image read, and (3) whether there was a better match between PET image findings in retrospective vs. prospective studies.

Materials and methods

Patients

Patients were eligible for inclusion if they had known or suspected iNPH, were older than 50 years of age, and were in general health appropriate for study procedures. Patients were excluded if they were pregnant or lactating, had known or suspected hypersensitivity/allergy to the [18F]flutemetamol formulation, or had a contraindication to PET or magnetic resonance imaging (MRI). For the retrospective studies (Table 1), sufficient biopsy sample had to be available for detection and quantification of Aβ pathology. All 4 studies were conducted according to the principles of the Declaration of Helsinki and approved by local human ethics boards. Informed consent was obtained from all patients and/or their designated representatives prior to study entry according to local regulations. The number of patients and additional study characteristics are summarized by study in Table 1.

Table 1 Number of patients and other study characteristics, by study

Procedures

[18F]Flutemetamol PET image acquisition, processing, quantitative measures, and methodology precedents

The injected activity, PET imaging equipment used, and details of the reconstruction of the 30-minute summed PET image are shown by study in Table 2.

Table 2 Summary of methods by study: MRI and [18F]flutemetamol PET image acquisition

[18F]Flutemetamol was manufactured according to Good Manufacturing Practice at Cardinal Health, Beltsville, MD for the US sites and Turku PET Center, Turku for the Finland sites, and transported to the sites. At the PET imaging site prior to [18F]flutemetamol injection, quality control tests were performed including radioactivity content and chemical purity by high-performance liquid chromatography. The radiopharmacist ensured that the correct activity was present in the injection syringe and that the product was used within the validity period. [18F]Flutemetamol (target dose, 185 MBq) was injected over approximately 40 seconds by study staff.

The PET scan was initiated approximately 90 minutes after injection with [18F]flutemetamol and lasted 30 minutes (six 5-minute frames). All PET cameras in these studies were qualified with a structured phantom prior to scanning patients in the studies. Each site performed cycles of phantom reconstruction using different filter parameters until the spatial resolution was approximately 6.5 mm (in order to prove approximately equal partial volume effects) [32]. The dynamic PET data were summed over the entire scan to create a 30-minute summation image.

Within 35 days of PET imaging, patients underwent MRI to rule out confounding conditions (e.g., vascular, structural) and facilitate volume of interest (VOI) analysis of [18F]flutemetamol retention. The patient’s MRI was co-registered with the [18F]flutemetamol PET image. The biopsy site on the [18F]flutemetamol image was located from either (1) the post-biopsy MRI scan (for retrospective studies) or (2) an MRI scan co-registered with the post-biopsy computed tomography (CT) scan (for prospective studies).

While the patient’s MRI was used to define the placement of the VOI in the PET image for the biopsy site, workstation functionality enabled location of an equivalent region on the contralateral side. VOIs were manually drawn (1) on the tissue including the biopsy site (prospective studies) or on the tissue surrounding the hollow excised biopsy site (retrospective studies) and (2) on the corresponding contralateral region. For the retrospective studies, measures were taken to minimize the influence (partial volume effect) the prior surgical procedure would have had on measured tracer retention, i.e., slightly larger VOIs were placed around the biopsy site (and matching contralateral site). Mean VOIs are shown in Table 3. A complete discussion of partial volume effect is beyond the scope of this paper, but a review of the topic can be found in [34]. Briefly, radiotracer is retained in white matter in both normal subjects and patients with AD. Tracer signal from white matter may spill over any potential cortical signal in narrow widths of cortical grey matter. In our studies, reference VOIs were placed on 2 sites (cerebellar cortex and pons) to learn whether one reference region (REF) might be preferable to the other.

Table 3 Mean volumes of interest for retrospective and prospective studies [32]

From studying [11C]PiB, which is structurally similar to [18F]flutemetamol, we know that (1) regional brain uptake of tracer is proportional to the regional level of brain amyloid, as determined by IHC and HC [22], and (2) symmetrically placed VOIs in the left and right frontal cortices result in similar SUVRs [35].

SUVRs were calculated for the VOIs as follows: SUVVOI/SUVREF, with SUV being the integrated activity over a given time period per unit of injected dose and body weight. The reference regions, cerebellar cortex and pons, were not expected to have any fibrillar Aβ plaque burden. Previous work has shown that the SUVR range of [18F]flutemetamol referenced to cerebellum in normal subjects is between 1.1 and 1.5 [36]. It should be noted, however, that in addition to mature (dense, cored) neuritic plaques, and to a much lesser degree diffuse Aβ plaques, [18F]flutemetamol is retained by vascular amyloid deposits [22]. The results from the original studies showed that 3/43 patients examined for vascular amyloid in the pooled dataset were positively identified by the pathologist (1 with an overall pathology diagnosis of normal).
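The ratio described above can be sketched in a few lines. This is a minimal illustration, not the studies' processing pipeline: the function name `suvr` and all numeric inputs are hypothetical, chosen only to show that the same VOI yields a different ratio depending on which reference region is used.

```python
def suvr(suv_voi: float, suv_ref: float) -> float:
    """Ratio of mean uptake in a target VOI to mean uptake in a reference region."""
    if suv_ref <= 0:
        raise ValueError("reference-region SUV must be positive")
    return suv_voi / suv_ref


# Hypothetical SUVs for one subject (illustrative values, not study data).
voi_suv = 1.80          # cortical VOI
cerebellum_suv = 1.20   # cerebellar cortex reference
pons_suv = 2.40         # pons reference

suvr_cerebellum = suvr(voi_suv, cerebellum_suv)  # about 1.5
suvr_pons = suvr(voi_suv, pons_suv)              # about 0.75
print(f"cerebellum-referenced SUVR: {suvr_cerebellum:.2f}")
print(f"pons-referenced SUVR: {suvr_pons:.2f}")
```

Because the two reference regions give SUVRs on different scales, any cut-off has to be interpreted together with its reference region.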

A mean composite cortical VOI was calculated as the mean of several anatomic regions typically associated with significant Aβ plaque burden in AD (frontal cortex, anterior cingulate gyrus, posterior cingulate gyrus/precuneus, lateral-temporal cortex, and parietal cortex). Neocortex is associated with Thal Phase 1 of Aβ deposition in AD, and cingulate is associated with Thal Phase 2 (out of 5 phases) [37]. Thal Phases 1 and 2 correspond to the designation “A1” according to the “ABC” system set forth in the NIA-AA guidelines for the neuropathologic assessment of AD [38]. Similar composite measurements have been previously described for [11C]PiB [39], [18F]flutemetamol [36], and [18F]florbetaben [40]. Since the five brain regions vary considerably in size, and the mean was not corrected for VOI size, the SUVR of the composite VOI reflects the level of uptake as if all 5 regions were of equal importance and not the overall uptake level in the composite VOI. Composite cortical SUVR has resulted in a value (i.e., overall estimate) of the fibrillar Aβ burden in the brain as a whole similar to that for the biopsy VOI [41] and is similar to a global PiB retention summary from 6 cortical regions of interest as justified in [42].
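The consequence of not correcting the mean for VOI size can be made concrete with a small sketch. The region keys, SUVRs, and volumes below are hypothetical; `volume_weighted_suvr` is shown only as the contrasting calculation that the composite measure does not use.

```python
from statistics import fmean

REGIONS = (
    "frontal",
    "anterior_cingulate",
    "posterior_cingulate_precuneus",
    "lateral_temporal",
    "parietal",
)


def composite_suvr(regional_suvr: dict) -> float:
    """Unweighted mean of the five regional SUVRs (every region counts equally)."""
    return fmean(regional_suvr[r] for r in REGIONS)


def volume_weighted_suvr(regional_suvr: dict, volume_ml: dict) -> float:
    """Contrast case: mean weighted by VOI volume (overall uptake in the merged VOI)."""
    total = sum(volume_ml[r] for r in REGIONS)
    return sum(regional_suvr[r] * volume_ml[r] for r in REGIONS) / total


# Hypothetical values for illustration only.
example_suvr = {
    "frontal": 1.6,
    "anterior_cingulate": 1.8,
    "posterior_cingulate_precuneus": 2.0,
    "lateral_temporal": 1.4,
    "parietal": 1.7,
}
example_volume_ml = {
    "frontal": 40.0,
    "anterior_cingulate": 5.0,
    "posterior_cingulate_precuneus": 10.0,
    "lateral_temporal": 30.0,
    "parietal": 15.0,
}

print(round(composite_suvr(example_suvr), 3))
print(round(volume_weighted_suvr(example_suvr, example_volume_ml), 3))
```

With these inputs the unweighted composite is higher than the volume-weighted value, because the small cingulate regions with high uptake carry the same weight as the large frontal and temporal regions.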

[18F]Flutemetamol PET blinded image evaluation

Anonymized [18F]flutemetamol PET data were transferred to the GE Healthcare Image Review Center in Oslo, Norway, and reviewed according to the studies’ image review charters and in accordance with United States Food and Drug Administration Guidance to Industry [http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM268555.pdf]. Images were loaded onto Xeleris workstations (GE Medical Systems, Milwaukee, WI) and reviewed using its Volumetrix application.

Visual interpretation of images by readers blinded to subject clinical and pathological information except for iNPH status (blinded image examination [BIE]) was performed by 3 trained, experienced raters (nuclear medicine physicians) in each study, in an individually randomized order of image presentation. Inter-rater agreement has previously been shown to be strong [32]. All readers assessed the grey matter tracer patterns in the same protocol-specified regions, in the same protocol-specified order.

Biopsy

Biopsy tissue was taken with biopsy forceps or a 14-gauge biopsy needle. The biopsy sample in Study A was approximately 5 mm3 and in the other 3 studies was approximately 14 mm3. In Study A, the biopsy was from the right prefrontal cortex; in Studies C and D, right prefrontal cortex, mid-pupillary line in front of the coronal suture was specified. In accordance with neurosurgical procedures approved by the institutional review board for the Johns Hopkins site, parietal cortex was biopsied in Study B.

Immunohistochemistry (4G8) and histochemistry (thioflavin S and Bielschowsky silver) methodology and measures

The biopsy tissue was fixed in 10% neutral buffered formalin and embedded in paraffin. If the sample size allowed, up to 50 serial sections were cut at a 5-μm thickness (6 μm in Study A) and numbered sequentially. Wherever possible, 3 sections separated by 100 μm (e.g., Slides 3, 23, and 43) from each biopsy were used for each of 3 staining techniques: 4G8 IHC, Bielschowsky silver, and thioflavin S. 4G8 IHC was standardized using the methodology of Study A as a model, and performed for Studies B, C, and D at Covance Laboratories Ltd (Harrogate, North Yorkshire, UK) as previously detailed [32]. Formic acid pre-treatment was used. Five measures (Aβ percentage area or plaque score, described below) were taken from each slide, and a mean was determined for each specimen.

Automated histometric measurement of the percentage area of Aβ in grey matter in 4G8 stained sections was performed. Except for 4G8 sections from only 7 subjects for whom Aβ percentage area was measured at the University of Pennsylvania using a similar technique (Study A [20]), 4G8 sections were imaged using whole slide scanning (Aperio XT) with a pre-developed and validated macro (MATLAB) used to threshold intensity, size, and morphometry after color deconvolution to remove the hematoxylin staining channel. The 4G8 antibody dilution for the 7 samples from Study A was higher (1:500) than for the remaining 40 available 4G8 samples (1:100). Three of these 7 samples were completely 4G8 negative (0.00% 4G8); 4G8 percentage area was 0.07%, 1.52%, 2.05%, and 2.75%, respectively, in the remaining 4 samples. In Studies B, C, and D a total of 5 out of 40 samples were completely 4G8 negative, and 4G8 percentage area in the remaining 35 samples ranged from 0.01% to 13.41%.

To dichotomise the 4G8 data (normal/abnormal) we used a threshold of 2.5% (receiver-operator characteristics analysis from multiple cortical samples taken in an autopsy study cohort of 68 subjects, data not shown). In 11/47 4G8 samples, the cut-off definition did not match the overall pathology judgment (8 samples with 4G8 positivity ≤2.5% had an overall pathology judgment of abnormal, and 3 samples with 4G8 positivity >2.5% had an overall pathology judgment of normal).

In Bielschowsky silver and thioflavin S stained sections, plaques were counted and scored using the following modified Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) 4-point scale [43]: 0 = no plaques, 1 = sparse plaques (1 to 5), 2 = moderate plaques (6 to 19), and 3 = frequent plaques (20+). For Study A, Bielschowsky silver and thioflavin S plaque counts were not assessed using the same scale as in the other studies, and were therefore not included in the pooled data set. For one subject in Study C, the thioflavin S result (average of 3 sections with plaque scores of 0, 1, and 2, respectively) was changed from abnormal to normal after the results of the study were known. This was justified because, according to the published methods for Study C, only a plaque score of 2 (moderate) or 3 (frequent) for thioflavin S was considered abnormal [30], and justified also according to the standardized analysis definitions for this study. All other outcomes for this subject were abnormal.

All staining methods were pre-validated and optimized prior to study samples being stained. It is acknowledged that Bielschowsky silver stain is not readily scaled up and was performed in small batches alongside control sections. The neuropathologists assessing the sections were at liberty to request re-stains, and comments on the quality of the sections actually assessed were collected. Good, Satisfactory, and Poor meant that sections were assessable with varying degrees of ease, while the option to record the sections as “Unassessable” was also available.

The physician evaluating the 4G8 slides for Study A was blinded to case identity. The independent neuropathologist at the contract research organization evaluating all other slides was blinded to clinical and imaging data; 4G8 slides for histometric analysis were scanned at a separate location, and the same slides were then shipped and used in the neuropathologic assessment.

Statistical analysis

All 24 possible combinations (pairs) of the following 4 pathology SoTs and 6 image endpoints were studied:

Pathology data (SoT)

  (a) Plaque as a percentage of area IHC-stained with 4G8 antibody
  (b) Bielschowsky silver plaque score (normal/abnormal)
  (c) Thioflavin S plaque score (normal/abnormal)
  (d) Overall pathology (normal/abnormal)

Normal and abnormal are defined below.

Imaging data (SUVR type)

Cerebellum as reference

  (i) SUVR for the ipsilateral VOI
  (ii) SUVR for the matching contralateral VOI
  (iii) SUVR for the composite VOI

Pons as reference

  (iv) SUVR for the ipsilateral VOI
  (v) SUVR for the matching contralateral VOI
  (vi) SUVR for the composite VOI

Additionally, the results of the BIE were an endpoint.

For 1 patient included in the analysis, only SUVR readings and a BIE reading were available (no pathology IHC or HC was available for this patient).

An explanation of the sequence and logic of statistical analysis processes is presented at the end of this statistical methodology section, after all the terms have been defined.

Objective 1: The first objective of the statistical analysis was to determine which pair of SUVR/pathology endpoints matched best.

Analysis: For each of the 4 pathology SoTs, the result was dichotomized as normal or abnormal. For 4G8 IHC (a), if the area was ≤2.5% it was classified as normal; otherwise, it was abnormal. For Bielschowsky silver stain (b) and thioflavin S (c), the normal/abnormal plaque score threshold was the midpoint between sparse and moderate (i.e., ≤1.5 was defined as normal). In addition, an overall pathology SoT (d) consisted of the pathologist’s overall impression (normal or abnormal) based on all slides prepared. The overall pathology assessment was a judgment call made by the independent expert neuropathologist (blinded to clinical information) to allow for discrepant results from the 3 staining methods (if any) resulting in the final classification of each case as normal or abnormal.
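The dichotomization rules for SoTs (a) through (c) can be written down directly. The sketch below assumes that the plaque score compared against 1.5 is the mean across the scored sections of a specimen; the function and constant names are hypothetical.

```python
from statistics import fmean

FOUR_G8_CUTOFF_PCT = 2.5    # percentage area; at or below is normal
PLAQUE_SCORE_CUTOFF = 1.5   # midpoint between sparse (1) and moderate (2)


def classify_4g8(percent_area: float) -> str:
    """Dichotomize 4G8 percentage area of grey matter."""
    return "normal" if percent_area <= FOUR_G8_CUTOFF_PCT else "abnormal"


def classify_plaque_score(section_scores) -> str:
    """Dichotomize a Bielschowsky silver or thioflavin S specimen from its section scores (0 to 3)."""
    return "normal" if fmean(section_scores) <= PLAQUE_SCORE_CUTOFF else "abnormal"


print(classify_4g8(2.05))                # normal
print(classify_4g8(2.75))                # abnormal
print(classify_plaque_score([0, 1, 2]))  # normal (mean score 1.0)
print(classify_plaque_score([1, 2, 2]))  # abnormal (mean score about 1.67)
```

The overall pathology SoT (d) is a neuropathologist's judgment and is therefore not reducible to a rule of this kind.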

Based on each pathology SoT, receiver-operator characteristic (ROC) analyses (defined below) were performed, in which the discriminative ability of each SUVR type (i through vi) was tested against each pathology SoT (a through d). For each of these 24 ROC analyses, the area under the curve (AUC) value was determined. In addition, for each ROC curve, the optimal threshold for each SUVR score was determined by the Youden index [44] using MedCalc software (Ostend, Belgium). After obtaining the optimal threshold for each ROC analysis, the sensitivity, specificity, PPV, NPV, and accuracy of each SUVR method for prediction of each SoT were calculated.

Result: The best matched pair (SUVR type and SoT) would be the one which has the highest AUC value in the ROC analysis, the largest sum of sensitivity and specificity [45], or the largest Youden index.

Objective 2: The second objective of the statistical analysis was to determine whether SUVR was a better measure for predicting SoT outcome than the majority BIE result (i.e., at least 2 of 3 readers).
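The majority rule for the visual reads is simple to express. In this sketch, `majority_read` is a hypothetical helper and each read is encoded as `True` for abnormal.

```python
def majority_read(reads) -> bool:
    """Return True (abnormal) when more than half of the readers called the scan abnormal."""
    reads = list(reads)
    if not reads:
        raise ValueError("at least one read is required")
    return sum(reads) > len(reads) / 2


print(majority_read([True, False, True]))   # True: 2 of 3 readers abnormal
print(majority_read([False, False, True]))  # False: only 1 of 3 readers abnormal
```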

Analysis: Based on each pathology SoT, we compared the sensitivity and specificity of majority BIE read to the sensitivity and specificity of SUVR with the McNemar test (used to compare paired proportions).
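A paired comparison of this kind depends only on the discordant cases, that is, patients whom one reading method classifies correctly and the other does not. The sketch below uses the exact (binomial) form of the McNemar test; the text does not state whether the exact or the chi-square form was applied, so this choice is an assumption, and the function names and counts are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count cases where only method A is correct (b) and where only method B is correct (c)."""
    pairs = list(zip(correct_a, correct_b))
    b = sum(1 for a_ok, b_ok in pairs if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in pairs if b_ok and not a_ok)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical example: among SoT-abnormal patients, BIE alone is correct in 1 case
# and SUVR alone is correct in 5 cases.
print(mcnemar_exact(1, 5))  # 0.21875
```

For sensitivity the comparison is restricted to SoT-abnormal patients, and for specificity to SoT-normal patients.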

Result: The results were expressed as point estimates of the sensitivity and specificity values for both BIE and SUVR, their 95% confidence intervals (CIs), and p-values for the comparisons.
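The captions of Tables 7 and 9 describe the CIs as exact. Assuming this refers to the Clopper-Pearson interval for a binomial proportion (an assumption; the method is not named in the text), such an interval can be sketched with the standard library alone by solving the two tail equations with bisection. All function names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_increasing(f, lo: float = 0.0, hi: float = 1.0, iters: int = 100) -> float:
    """Bisection for a function that increases from negative to positive on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def clopper_pearson(x: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for x successes out of n."""
    if n <= 0 or not 0 <= x <= n:
        raise ValueError("need n > 0 and 0 <= x <= n")
    # Lower limit solves P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


# Hypothetical example: 18 of 20 SoT-abnormal patients detected.
low, high = clopper_pearson(18, 20)
print(f"sensitivity 0.90, 95% CI {low:.3f} to {high:.3f}")
```

With small numbers of normal or abnormal cases these intervals are wide, which matters when comparing sensitivities or specificities that differ by only a few percentage points.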

Objective 3: The third objective of the statistical analysis was to determine whether there was a better match between PET imaging findings and pathology findings in retrospective or prospective studies.

Analysis: For each pathology SoT, we determined whether the pairing of PET SUVR type and SoT with the largest sum of sensitivity and specificity was for prospective or retrospective studies. We also determined, for each pathology SoT and PET SUVR type, which study cohort (prospective or retrospective) had the larger sum of sensitivity and specificity.

Result: The results were expressed as point estimates of the sum of the sensitivity and specificity values and their 95% CIs.

Definitions and explanation of terms

The scheme for classifying biopsy results is shown in Table 4.

Table 4 Scheme for classifying biopsy results

The definitions for diagnostic efficacy are shown in Table 5 where TP, FN, FP, TN are numbers of patients.
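Table 5 is not reproduced in this text, so the following sketch assumes the conventional definitions of the five measures in terms of TP, FN, FP, and TN. The function name and the example counts are hypothetical.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Conventional diagnostic efficacy measures from a 2 x 2 table of patient counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),   # abnormal SoT correctly called abnormal
        "specificity": _ratio(tn, tn + fp),   # normal SoT correctly called normal
        "ppv": _ratio(tp, tp + fp),           # abnormal calls that are truly abnormal
        "npv": _ratio(tn, tn + fn),           # normal calls that are truly normal
        "accuracy": _ratio(tp + tn, tp + fn + fp + tn),
    }


# Hypothetical counts for illustration only.
for name, value in diagnostic_metrics(tp=18, fn=2, fp=3, tn=27).items():
    print(f"{name}: {value:.4f}")
```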

Table 5 Definitions for diagnostic efficacy of [18F]flutemetamol PET

A ROC curve is the graph where the y axis represents sensitivity and the x axis represents 1 minus specificity. The ROC AUC is a measure of test performance with a ROC AUC of 1 indicating a perfect test.
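The ROC AUC has an equivalent rank-based reading: it is the probability that a randomly chosen abnormal case has a higher score than a randomly chosen normal case, with ties counted as one half. The sketch below computes it that way; it is an illustration with hypothetical data, not the MedCalc procedure used in the analysis.

```python
def roc_auc(scores, abnormal) -> float:
    """ROC AUC as the fraction of (abnormal, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, a in zip(scores, abnormal) if a]
    neg = [s for s, a in zip(scores, abnormal) if not a]
    if not pos or not neg:
        raise ValueError("need at least one abnormal and one normal case")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical SUVRs: the two abnormal cases both exceed every normal case.
print(roc_auc([1.2, 1.3, 1.4, 1.9, 2.1], [False, False, False, True, True]))  # 1.0

# One normal case (1.6) now outranks one abnormal case (1.5): 5 of 6 pairs are correct.
print(roc_auc([1.2, 1.6, 1.4, 1.5, 2.1], [False, False, False, True, True]))
```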

The sum of sensitivity and specificity indicates whether a diagnostic test will result in a revision of the pre-test probability of disease [45]. The highest possible sum of sensitivity and specificity is 2.

The Youden index is the sum of sensitivity and specificity minus 1; a perfect test would have a Youden index of 1.

SUVR thresholds

Using all of the available SUVR values for each combination of the 6 SUVR types, an estimate was made of the SUVR value that would maximize the value of the Youden index. These maximal Youden values and their optimal SUVR values are shown in Table 6. These SUVR values that maximized the Youden index values were then used to calculate ROC AUCs (also shown in Table 6).

Table 6 ROC AUC, Youden index, and Optimal SUVR thresholds by SUVR type and pathology SoT for all 4 studies combined

Sequence and logic of statistical analysis process

Here is the sequence and logic for the statistical analysis process. There are 24 combinations of SUVR type (6) and pathology SoT (4). For each of these 24 combinations there are approximately 49 sets of SUVR values (quantitative—continuous scale) and pathology SoT value (always dichotomized—normal/abnormal as described above). For each of 24 combinations, an ROC curve is generated using the “population” of data (SUVR and SoT values). For any 1 SUVR value, a point on the ROC is determined, since both sensitivity and specificity are determined by use of that SUVR value as a cut-off (SUVR values below that cut-off are termed normal, and SUVR values above that cut-off are termed abnormal; each SoT value is already labeled as normal or abnormal by the pathology cut-off values).

Considering all values on the ROC curve, 1 value on the ROC curve will be a maximum value for the Youden index, which is simply (sensitivity + specificity - 1). If the Youden index is maximized at a point on the ROC curve, then the sum of sensitivity and specificity will also be maximal at the same point on the ROC curve, because the two differ only by the constant 1. We recorded at what SUVR cut-off value the maximal Youden index (and thus the maximal sum of sensitivity and specificity) was found. These are the so-called “optimal SUVR thresholds” (Table 6 in Results). All subsequent diagnostic efficacy parameters were computed using this optimal SUVR threshold. Thus, in Table 7, the “sum of sensitivity and specificity values” are actually the “maximal sum of sensitivity and specificity values” for each combination of SUVR type and pathology SoT. The sensitivity and specificity values given for each combination are those at the SUVR threshold that maximized the sum of sensitivity and specificity for that combination.
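The threshold search described in this section can be sketched as a scan over the observed SUVR values. Two details are assumptions made for the sketch: a value strictly greater than the cut-off is called abnormal (the text does not say how a value equal to the cut-off is handled), and when several cut-offs tie for the maximal Youden index the lowest is kept. The function name and the data are hypothetical.

```python
def youden_optimal_threshold(suvr_values, abnormal):
    """Return (cut_off, youden_index, sensitivity, specificity) at the maximal Youden index."""
    n_pos = sum(1 for a in abnormal if a)
    n_neg = len(abnormal) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one abnormal and one normal case")
    best = None
    for cut in sorted(set(suvr_values)):
        # Assumed rule: SUVR > cut is abnormal, SUVR <= cut is normal.
        tp = sum(1 for s, a in zip(suvr_values, abnormal) if a and s > cut)
        tn = sum(1 for s, a in zip(suvr_values, abnormal) if not a and s <= cut)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        youden = sensitivity + specificity - 1
        if best is None or youden > best[1]:
            best = (cut, youden, sensitivity, specificity)
    return best


# Hypothetical SUVRs with dichotomized SoT labels (True = abnormal).
values = [1.2, 1.3, 1.4, 1.9, 2.1]
labels = [False, False, False, True, True]
cut, youden, sens, spec = youden_optimal_threshold(values, labels)
print(cut, youden, sens, spec)  # 1.4 1.0 1.0 1.0
```

Because the cut-off is chosen on the same data on which sensitivity and specificity are then reported, the resulting values describe the best achievable fit in this sample rather than performance at a threshold fixed in advance.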

Table 7 Sensitivity and specificity and their exact 95% CIs for each SUVR type/pathology SoT combination for all 4 studies combined

The number and percentage of TP, FN, TN, and FP values (refer to Table 4) given in Table 8, and the values computed from them (accuracy, PPV, and NPV), are all determined at the optimal SUVR threshold, that is, the SUVR threshold that maximized the Youden index/sum of sensitivity and specificity for that combination of SUVR type and pathology SoT.

Table 8 Accuracy of [18F]flutemetamol quantitative diagnosis and positive and negative predictive values by Aβ pathology SoT in the 4 studies combined

It is not surprising that the larger ROC AUCs generally occur with larger Youden indices, since a large Youden index means a large sum of sensitivity and specificity, and the maximal ROC AUC is observed when both sensitivity and specificity are 1.0. The Sigma Plot instructions for ROC curve analysis state “An important measure of the accuracy of the clinical test is the area under the ROC curve. If this area is equal to 1.0 then the ROC curve consists of two straight lines, one vertical from 0,0 to 0,1 and the next horizontal from 0,1 to 1,1. This test is 100% accurate because both the sensitivity and specificity are 1.0 so there are no false positives and no false negatives.”

Computing diagnostic efficacy values for BIE against a pathology SoT is essentially the same as computing them for dichotomized SUVR values. In both cases, the imaging judgment has been dichotomized into normal or abnormal, either by applying a cut-off value to the SUVR or by an immediate visual judgment for BIE.

Results

Unless stated otherwise, results shown are for Studies A, B, C, and D combined. Representative photomicrographs of Bielschowsky silver stain and 4G8 stain are displayed in Figure 1 alongside examples of abnormal and normal PET images.

Figure 1

Examples of abnormal and normal [18F]flutemetamol positron emission tomography (PET) and corresponding magnetic resonance (MR) or computed tomography (CT) imaging and histopathology. Panel a) [18F]Flutemetamol PET imaging correlates with histopathology (Study D). Amyloid plaques were determined in biopsy samples by 4G8 immunohistochemistry (IHC). Neuritic plaques were identified in serial sections using a modified Bielschowsky silver stain. Panel b) [18F]Flutemetamol PET images were obtained either retrospectively after biopsy (Studies A and C) or prospectively before biopsy (Studies B and D). Small cortical biopsies were taken during shunt placement and histopathology was correlated to standard uptake value ratio (SUVR) measures in volumes of interest (VOIs) either ipsilateral or contralateral to the site of biopsy.

SUVR type/pathology SoT pairs with highest ROC AUC and largest Youden index

Including all data from the 4 pooled studies combined, for 3 of the 4 pathology SoTs, the SUVR type with the largest ROC AUC was the composite SUVR referenced to the cerebellum (composite-cerebellum) (overall pathology [AUC = 1.0000], Bielschowsky silver [AUC = 0.9815], and thioflavin S [AUC = 0.9462]) (Table 6). For Bielschowsky silver, the ROC AUC was as large for contralateral-cerebellum as for composite-cerebellum. For the fourth pathology SoT, the SUVR type with the largest ROC AUC completing the pair was ipsilateral-cerebellum (4G8 [AUC = 0.8544]).

Considering all SUVR type/pathology SoT pairs, the composite-cerebellum/overall pathology pair had the largest ROC AUC (1.000). ROC AUCs for composite-cerebellum/Bielschowsky silver and contralateral-pons/Bielschowsky silver were nearly as large (both 0.9815) (Figure 2).

Figure 2

Receiver-operator curves by pathology standard of truth for each SUVR type: a) 4G8, b) Bielschowsky silver stain, c) Thioflavin S, and d) Overall Pathology. The composite-cerebellum/overall pathology pair had the largest ROC AUC (1.000). ROC AUCs for composite-cerebellum/Bielschowsky silver and contralateral-pons/Bielschowsky silver were nearly as large (both 0.9815).

The combination of SUVR type/pathology SoT with the largest Youden index (1.0000) was composite-cerebellum/overall pathology (Table 6). For the other 3 pathology SoTs, the SUVR types with the largest Youden index in descending order were as follows: contralateral-pons/Bielschowsky silver (0.9286), composite-cerebellum/thioflavin S (0.8438), and contralateral-cerebellum/4G8 (0.6757).

The largest Youden index for each pathology SoT was almost always found in the SUVR type/SoT pair with the largest ROC AUC. The exception was 4G8, for which the largest Youden index was found for the pair with an ROC AUC of 0.8529 rather than for the pair with the largest ROC AUC of 0.8544 (Table 6).
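For readers less familiar with these summary measures, the following minimal Python sketch shows how an ROC AUC and a Youden index can be obtained from paired SUVR values and a dichotomous pathology SoT. It is illustrative only and is not the software used for the analyses reported here; the function names, the toy SUVR values, and the convention that a scan is called positive when its SUVR is at or above the cut-off are all assumptions.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float, float]]:
    """Return (cutoff, sensitivity, specificity) for every observed score.

    Assumption: a scan is positive when its score is >= cutoff.
    labels: 1 = SoT abnormal, 0 = SoT normal.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        points.append((cutoff, tp / n_pos, tn / n_neg))
    return points


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the probability that an abnormal case outranks a normal one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_youden(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (J, cutoff) maximizing J = sensitivity + specificity - 1."""
    return max((se + sp - 1.0, cutoff) for cutoff, se, sp in roc_points(scores, labels))


# Hypothetical SUVR values (not study data): four SoT-normal, four SoT-abnormal.
suvr = [1.05, 1.10, 1.15, 1.32, 1.25, 1.30, 1.45, 1.60]
truth = [0, 0, 0, 0, 1, 1, 1, 1]
print(roc_auc(suvr, truth))      # 0.875
print(best_youden(suvr, truth))  # (0.75, 1.25)
```

In this toy example, a cut-off of 1.25 gives the largest Youden index because it retains all abnormal cases while misclassifying only one normal case.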

All of the SUVR cut-off criteria using cerebellum as the reference region were greater than 1. None of the SUVR cut-off criteria using pons as the reference region were greater than 1.

SUVR type/pathology SoT pairs with largest sum of sensitivity and specificity (4 studies combined) (Table 7)

The SUVR type with the highest sum of sensitivity and specificity for each SoT in order of descending value was composite-cerebellum/overall pathology (2.000), contralateral-pons/Bielschowsky silver (1.9286), composite-cerebellum/thioflavin S (1.8438), and contralateral-cerebellum/4G8 (1.6757).
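The agreement between this ranking and the Youden index ranking reported above follows from the definition of the Youden index J. As a brief worked relation, where Se and Sp are shorthand introduced here for sensitivity and specificity at the selected cut-off:

```latex
J = \mathrm{Se} + \mathrm{Sp} - 1
\quad\Longleftrightarrow\quad
\mathrm{Se} + \mathrm{Sp} = J + 1
% Example, contralateral-pons/Bielschowsky silver:
% J = 0.9286 \;\Rightarrow\; \mathrm{Se} + \mathrm{Sp} = 1.9286
```

Each sum reported here is therefore the corresponding largest Youden index plus 1 (for example, 0.6757 and 1.6757 for contralateral-cerebellum/4G8).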

For 4G8, sensitivity was good, greater than 0.8889 for all SUVR types (and similar to that for the other pathology SoTs), but specificity was poorer (range, 0.5000 to 0.8649).

Sensitivity, specificity, accuracy, PPV, and NPV values are illustrated in bar graphs for each pathology SoT in Figure 3.

Figure 3

Diagnostic efficacy by SUVR type (a – c using cerebellum as reference region, d – f using pons as reference region) for each pathology standard of truth (within each group from left to right: 4G8 [blue], Bielschowsky Silver [rust], Thioflavin S [green], and Overall Pathology [purple]). Horizontal axis: Groups of bars from left to right represent Sensitivity, Specificity, Accuracy, PPV, and NPV. Vertical axis: Percentage (maximum 100%). Error bars represent 95% confidence intervals.

Image reading method/pathology SoT pairs with largest sum of sensitivity and specificity (quantitative SUVR vs. BIE) (Table 9)

Table 9 Sensitivity, specificity, and exact 95% CIs for each image reading method (quantitative vs. majority BIE) by pathology SoT for all 4 studies combined

When using Bielschowsky silver as the SoT, the sum of sensitivity and specificity was greater for BIE than for the quantitative SUVR for all of the SUVR types (6/6) (range of differences across SUVR types, 0.0357 [3.57%] to 0.1429 [14.29%]).

When using 4G8 or thioflavin S as the SoT, the sum of sensitivity and specificity was greater for quantitative SUVR than for the BIE for the majority of the SUVR types (6/6 for 4G8, 5/6 for thioflavin S) (Table 9). The only statistically significant differences were that specificity for BIE majority read was higher than specificity for the quantitative SUVR for the following: for 4G8, ipsilateral-cerebellum and all SUVR types-pons; and for thioflavin S, ipsilateral-pons.

Using overall pathology as the SoT, the sum of sensitivity and specificity for BIE and quantitative SUVR were tied (BIE was higher for 2 of 6 SUVR types, quantitative SUVR was higher for 2 of 6 SUVR types, and BIE and quantitative SUVR were tied for 2 of 6 SUVR types).
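The confidence intervals in Tables 9 and 10 are described as exact. The specific method is not stated in this section; the sketch below assumes the commonly used Clopper-Pearson interval and computes it with the Python standard library only, by bisection on the binomial distribution function. The function names and example counts are hypothetical and are not taken from the study data.

```python
from math import comb
from typing import Callable, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f: Callable[[float], float], target: float, iters: int = 100) -> float:
    """Find p in [0, 1] with f(p) = target, for f decreasing in p, by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid  # f is still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Two-sided exact (Clopper-Pearson) interval for k successes in n trials."""
    # Lower limit: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve_decreasing(lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    # Upper limit: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else _solve_decreasing(lambda p: binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper


# Hypothetical example: 10 of 10 SoT-abnormal cases read as positive.
low, high = clopper_pearson(10, 10)
print(round(low, 4), high)  # 0.6915 1.0
```

The example illustrates why an observed sensitivity of 100% in a small sample is still compatible with a considerably lower true value.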

SUVR type/pathology SoT pairs with largest sum of sensitivity and specificity (retrospective vs. prospective studies) (Table 10)

Table 10 Sensitivity, specificity, and exact 95% CI for each SUVR type/pathology SoT combination in retrospective vs. prospective studies for all 4 studies combined

For the pathology SoTs Bielschowsky silver, thioflavin S, and 4G8, in the majority of SUVR type/pathology SoT combinations (5/6, 4/6, and 4/6, respectively), the sum of sensitivity and specificity was larger for prospective rather than retrospective studies (Table 10).

For the overall pathology SoT, the sum of sensitivity and specificity was tied in 1 of 6 SUVR type/pathology SoT pairs, larger in retrospective studies in 3 of 6 pairs, and larger in prospective studies in 2 of 6 pairs.

For retrospective studies, the SUVR types with the highest sum of sensitivity and specificity by pathology SoT in descending order were as follows: for overall pathology, ipsilateral-cerebellum, composite-cerebellum, and composite-pons (2.00, all 3 pairs); for Bielschowsky silver, composite-cerebellum (1.9375); for thioflavin S, composite-cerebellum (1.8947); and for 4G8, contralateral-cerebellum (1.75).

For prospective studies, the SUVR types with the highest sum of sensitivity and specificity by pathology SoT in descending order were as follows: for overall pathology, contralateral-cerebellum and composite-cerebellum (both 2.00); for Bielschowsky silver, contralateral-pons (2.00); for thioflavin S, contralateral-pons (1.9231); and for 4G8, ipsilateral-cerebellum and contralateral-pons (both 1.6471).

Accuracy of quantitative SUVR diagnosis, positive and negative predictive values

In addition to sensitivity and specificity, accuracy, PPV, and NPV were calculated (Table 8). The numbers and proportions of FN and FP results were lowest in the overall pathology SoT. For each of the 3 stains (IHC and HC), the proportions of patients with FP were substantially greater than the proportions of patients with FN.

SUVR types that had a 100% PPV were only found in the overall pathology SoT (ipsilateral-cerebellum, composite-cerebellum, and composite-pons). All SUVR types had a 100% NPV for the Bielschowsky silver SoT. Composite-cerebellum had 100% accuracy for 3 of the 4 SoTs (all but 4G8). The 4G8 pathology SoT had the lowest PPV and accuracy.
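As a reminder of how these measures relate to the underlying counts, the following Python sketch derives sensitivity, specificity, accuracy, PPV, and NPV from a 2 x 2 table of true/false positives and negatives. The class name and the counts are hypothetical and are not taken from Table 8.

```python
from dataclasses import dataclass
from math import nan


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of PET result vs. pathology SoT (hypothetical container)."""
    tp: int  # PET positive, SoT abnormal
    fp: int  # PET positive, SoT normal
    fn: int  # PET negative, SoT abnormal
    tn: int  # PET negative, SoT normal

    @staticmethod
    def _ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a margin of the table is empty.
        return num / den if den else nan

    def sensitivity(self) -> float:
        return self._ratio(self.tp, self.tp + self.fn)

    def specificity(self) -> float:
        return self._ratio(self.tn, self.tn + self.fp)

    def accuracy(self) -> float:
        return self._ratio(self.tp + self.tn, self.tp + self.fp + self.fn + self.tn)

    def ppv(self) -> float:
        return self._ratio(self.tp, self.tp + self.fp)

    def npv(self) -> float:
        return self._ratio(self.tn, self.tn + self.fn)


# Hypothetical counts: no FN, two FP.
table = TwoByTwo(tp=16, fp=2, fn=0, tn=14)
print(table.sensitivity(), table.specificity(), table.accuracy())  # 1.0 0.875 0.9375
print(table.npv())  # 1.0
```

With no FN results, sensitivity and NPV are both 100%, whereas any FP results lower specificity and PPV; this is the pattern described above for the individual stains.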

Discussion

Given that the SUVR of the composite VOI does not reflect the overall uptake level in the composite VOI (i.e., the mean value is not corrected for VOI size and all regions are treated as though they were of equal importance), it is perhaps unexpected that the composite measure for SUVR referenced to the cerebellum for all 4 studies combined was a better match with biopsy pathology findings for 2 pathology SoTs (overall pathology and Bielschowsky silver) than was the SUVR (any combination) from the biopsy region itself or its contralateral homolog. However, when the retrospective and prospective studies were analyzed separately, composite-cerebellum was generally the best match for the SoTs in the retrospective studies, while contralateral-pons was generally the best match for the SoTs in the prospective studies. This could have been due to the general study limitation that there was a clear-cut difference between the retrospective and prospective studies in the size of the VOI assessed. We acknowledge the general limitation that our data were not analyzed without the partial volume correction.

The largest ROC AUC for the combination of SUVR type and pathology SoT (4 studies combined) was for composite-cerebellum SUVR/overall pathology SoT, the most inclusive measure for imaging and the most comprehensive measure for pathology, respectively. ROC AUCs for the composite-cerebellum/Bielschowsky silver and contralateral-pons/Bielschowsky silver pairs were nearly as large.

Based on the sum of sensitivity and specificity, BIE was a better tool for predicting the Bielschowsky pathology findings than was quantitation with SUVR (in 6 of the 6 SUVR/SoT pairs) but very similar to both quantitation with SUVR and BIE using the overall pathology judgment (i.e., high sensitivity and high specificity); the converse was true for thioflavin S (less sensitive and similarly specific) and 4G8 (substantially less sensitive and usually more specific). Overall, based on the sum of sensitivity and specificity, BIE and quantitation with SUVR appeared to be tied.

No particular advantage to using one SUVR reference region over the other (i.e., cerebellum or pons) was apparent. However, for all SUVR types, SUVR cut-off criteria using pons as the reference region were unexpectedly found to be less than 1. From the observed optimal cut-off values of approximately 0.4 (pons as reference region) and approximately 1.2 (cerebellum as reference region) in this set of iNPH patients, one would predict that on average, much more Aβ was deposited in the pons than in the cerebellum. In AD, by contrast, enough Aβ for detection with tracer does not appear in the cerebellum or pons until Thal Phase 5 [37]. The significance of this finding will need to be clarified in future research.

We should note that a post-hoc Spearman correlation analysis was performed between cognitive status (Mini-Mental State Examination [MMSE] scores, for which individual subject values were previously published [32]), and measures for all of the pathologic markers of amyloid, including Bielschowsky, 4G8, thioflavin S, and overall pathology judgment. No statistically significant correlation was found, consistent with previous results that found no significant correlation between an imaging marker for amyloid load ([11C]PIB binding potential with PET) and MMSE [46].
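For illustration, a Spearman correlation coefficient of the kind used in this post-hoc analysis can be computed as the Pearson correlation of rank-transformed values, with tied observations given their average rank. The sketch below uses only the Python standard library, does not compute a p-value, and uses invented MMSE scores and pathology values; it is not the code used for the analysis reported here.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented values for illustration only.
mmse = [29, 27, 27, 24, 21, 18]
area_pct_4g8 = [0.4, 3.1, 0.0, 5.2, 1.8, 2.6]
rho = spearman(mmse, area_pct_4g8)
print(average_ranks(mmse))  # [6.0, 4.5, 4.5, 3.0, 2.0, 1.0]
```

Because the coefficient depends only on ranks, it is insensitive to the skewed distribution typical of percentage-area measurements.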

While neocortical pathological Aβ changes in affected iNPH patients are similar to those in AD, and it is tempting to apply findings from one condition to the other, it is important to recognize that our understanding of the molecular biology of amyloid is incomplete [47]. We biopsied only cortical samples for use as our SoT. While a reduction in brain tissue is characteristic of both AD and iNPH, the reduction is due to atrophy in AD, whereas the sulci are less enlarged relative to the degree of ventricular dilatation in iNPH. However, the fact that the overall pathology SoT fared so well in the results may indicate that neocortical areas where amyloid was present in our iNPH patients might have been mimicking where Aβ pathological changes are seen in AD, i.e., widespread vs. focal. Others confirm the validity of diagnosis from a single cerebral biopsy sample [48].

To put the 4G8 antibody data into context, variable processing of the membrane-bound amyloid precursor protein (APP) produces Aβ and other APP fragments including p3 [47]. Aβ is a 39-43 amino-acid peptide; the 4G8 antibody detects amino acids 17-23 [49] or 17-24 [37]. The morphology of deposits in AD brain detected with 4G8 has been described with photomicrographs in exquisite detail as fleecy (fine fibrillar [49]), lake-like, and subpial band-like amyloid; diffuse and cored plaques; and core-only (burnt-out [50]) plaques, as well as white matter plaques; some of these types are further sub-divided into neuritic and non-neuritic [51]. From neocortex (Thal Phase 1), Aβ as detected using silver stain and 4G8 appears progressively over time in specific connected brain structures in an ordered anterograde sequence [51, 37].

Ikonomovic et al. demonstrated co-localization in an AD brain of 6-CN-PiB (thioflavin T derivative) and thioflavin S to cored plaques and core-only plaques in temporal cortex. Thioflavin S was not as selective and also stained numerous neurofibrillary tangles in this region, whereas 6-CN-PiB detected only an occasional isolated tangle [22].

Silver staining and IHC for Aβ provide complementary information. Silver, which stains both plaques and tangles (which appear in neuritic plaques), renders stable and reproducible results [52] and is recommended for quantitation of neuritic plaques [38, 43]. However, ‘argyrophilia’ is not a homogeneous phenomenon with respect to amyloid [51], and other subtle lesion-dependent variations between silver methods have been described [52].

While the newest NIA-AA criteria for the neuropathologic assessment of AD [38] also recommend IHC for ‘Aβ score’ (not plaque count) and refer to Thal [37], the NIA-AA criteria stop short of specific Aβ IHC reagent recommendations. Technical uncertainties with respect to Aβ IHC standardization may preclude its use as a standard for neuropathologic diagnosis, at least when deposits are quantified as with histological diagnostic criteria for AD [52]. For the neuropathologic changes of AD, the new criteria recognize the qualitative importance of the location and morphology of Aβ deposits as separate and different from quantification by neuritic plaque counts as in the CERAD scheme [38] or, by extension, as in the percentage area positivity we measured with 4G8 in this study of iNPH cortical biopsies.

Spillantini et al. reported finding “a substantial increase in the number of stained structures and the intensity of staining” with 4G8 after pre-treatment with formic acid [53], although data relative to no pre-treatment were not shown. Shankar et al. believe that formic acid solubilizes insoluble Aβ plaque cores isolated from human AD brains and releases their constituent dimers and monomers [54].

Plaques, composed largely of fibrillar Aβ, are dynamic structures and likely act as local reservoirs of smaller diffusible Aβ oligomers thought to be in equilibrium with plaques [55]. Disruption of the electrophysiological and microanatomical correlates of memory formation (i.e., inhibition of long-term potentiation, facilitation of long-term depression, reduced dendritic spine density, synapse loss) is associated with the smaller, soluble Aβ species [55], which are suspected of eventually [56] tipping the gain-of-toxic-function [47] past steady-state in favor of Aβ conglomeration and towards subsequent downstream events. Importantly, this paradigm provides that plaques may be present in cognitively (i.e., phenotypically) normal individuals. Clinically identifying patients on this cusp (cognitively normal/abnormal) and monitoring their disease progression (i.e., locating amyloid plaques in living patients) is of tremendous relevance and urgent importance to the testing of drugs that exploit the molecular biology of Aβ for the treatment of AD.

In this study, we found that while the numbers of FN and FP results were low (lowest for the overall pathology SoT), for each of the 3 stains separately, the proportions of patients with FP were substantially greater than the proportions of patients with FN. Interestingly, in our experience based on a large autopsy study of [18F]flutemetamol (68 brains, 43 Aβ positive, 25 Aβ negative) [report in process], cases with sparse to moderate diffuse and neuritic cortical plaques (Bielschowsky silver stain) may lead to FN or FP PET readings (similar to those discussed in a recent review [57]). This ‘cusp’ phenomenon may reflect the (currently unsettled) rest of the story of the molecular biology of Aβ, what triggers and perpetuates its neuronal anterograde trek [51, 37] and under what conditions (including the inflammatory component of the pathology outside the scope of this report), and what determines the point at which the recognizable phenotype becomes apparent (i.e., the presumed point of irreversible functional damage).

The 4G8 antibody has been used before in association with amyloid PET ligand uptake [33, 27]. The 4G8 component of our analysis can be interpreted both in an isolated manner and in context. We questioned the acceptability of pooling the 4G8 data from Study A (obtained using an antibody dilution of 1:500) with those from Studies B, C, and D (obtained using an antibody dilution of 1:100). We noted that Clark et al. used 4G8 at a dilution of 1:2000 in their [18F]florbetapir PET autopsy study in subjects with and without AD or other age-related pathologies [58, 33] and, we believe, reported correlations similar to those we previously reported for our pooled data (using Pearson’s r) [32], although their use of Bonferroni ρ was unclear. Thal [49, 37] and Braak [59] used a dilution of 1:5000. In a systematic inter-laboratory study, Alafuzoff et al. recommended that in order to achieve reproducible results with 4G8 IHC, a dichotomized assessment (they suggested present/absent) rather than quantification should be applied [60]. Our raw data showed that in Study A 4/7 biopsy samples had 4G8 present; in Studies B, C, and D, respectively, 6/9, 13/15, and 16/16 biopsy samples had 4G8 present. Given the small sample sizes, it is impossible to know how different ‘4G8 present/absent’ in our studies truly was, and therefore whether or not pooling of data on this basis is indicated.

Our data showed that the 4G8 SoT did not perform as well as the other SoTs in terms of the sum of sensitivity and specificity; it showed good sensitivity and poor specificity (indicating too many FP). Tissue preparation may have been the reason for this, in that the threshold for abnormal may not have been optimized. If the threshold for abnormal had been set lower than the 2.5% we used, sensitivity would have decreased and specificity would have increased. The fact that the 4G8 dichotomous assessment did not match the overall pathology SoT in many cases (i.e., 4G8 resulted in the most overall pathology misclassifications) supports the interpretation that the selected threshold was not optimal.
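The direction of this effect can be shown with a small Python sketch in which the PET reading for each case is held fixed and only the threshold that dichotomizes the 4G8 percentage area into normal/abnormal is varied. All values are invented for illustration, the function name is hypothetical, and classifying a biopsy as abnormal when its percentage area is at or above the threshold is an assumption.

```python
from typing import List, Tuple


def sens_spec(cases: List[Tuple[bool, float]], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity of PET against a thresholded pathology SoT.

    cases: (pet_positive, pathology_area_percent) per biopsy.
    Assumption: SoT is abnormal when area percent >= threshold, and both
    SoT classes are non-empty at the chosen threshold.
    """
    tp = fp = fn = tn = 0
    for pet_positive, area_pct in cases:
        sot_abnormal = area_pct >= threshold
        if pet_positive and sot_abnormal:
            tp += 1
        elif pet_positive and not sot_abnormal:
            fp += 1
        elif not pet_positive and sot_abnormal:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented cases: (PET positive?, 4G8 percentage area).
cases = [
    (True, 5.0), (True, 3.0), (True, 1.5), (True, 1.2),
    (False, 1.8), (False, 0.3), (False, 0.1),
]
print(sens_spec(cases, 2.5))  # (1.0, 0.6)
print(sens_spec(cases, 1.0))  # (0.8, 1.0)
```

Lowering the threshold reclassifies the PET-positive cases with intermediate 4G8 values from FP to TP (raising specificity) and the PET-negative case with an intermediate value from TN to FN (lowering sensitivity).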

In a separate analysis (Study C) of iNPH patients included in this pooled analysis and who also underwent [11C]PiB PET imaging, ipsilateral, contralateral, and composite SUVRs for both [18F]flutemetamol and [11C]PiB correlated significantly with Aβ biopsy specimen levels evaluated by 4G8, thioflavin S, and Bielschowsky silver stain [30]. Our findings using the pooled [18F]flutemetamol data in iNPH patients are also consistent with findings for [18F]florbetapir where a correlation was shown between PET brain labeling and grey matter plaque density not only by 4G8 as alluded to above, but also as measured by silver stain at autopsy in subjects with and without AD or other age-related pathologies [58, 33]. To our knowledge, no biopsy or autopsy data for [18F]florbetaben have been published yet.

Whereas our study was limited by the logistical constraints associated with the collection of comparatively primitive neuropathologic data from clinical trials at multiple sites in the setting of rapidly changing (increasingly refined) pathologic diagnostic criteria, we propose that an ideal experiment might improve upon a similar basic study design to that of Ikonomovic et al. [22] and include the following elements: multiple patients with AD and brain [18F]flutemetamol imaging near life’s end, followed by post mortem examination of adjacent sections treated with cyano-flutemetamol vs. e.g., IHC for selected Aβ epitopes and tau epitopes (plus routine conventional staining), with the Aβ and neurofibrillary tangle pathology results described according to current nomenclature for neuropathologic changes of AD [38]. Other desirable experimental elements include a selected battery of sections of brains associated with all phases of AD severity, sections thick enough to allow confocal microscopy through entire cells, collection of information with respect to individual genetic risk association factors, thorough history and timeline of medical conditions with any inflammatory component, duration of cognitive impairment, and results of recent neuropsychiatric assessments. The importance of careful archiving of tissue samples and associated patient data to test for future findings cannot be overstated. Paradoxically, owing to the relative rarity of appropriate post mortem handling and disposition for these purposes, the human brain may be as valuable an asset in death as it is in life.

The performance characteristics and diagnostic efficacy of the Aβ ligands when used alone, and more recently when used with other imaging modalities (e.g., structural MRI and fluorodeoxyglucose PET) [e.g., 42, 61] or in conjunction with the assessment of the well-known AD risk variant of apolipoprotein E (ε4 allele) [e.g., 62], have been described in over a decade of literature. ApoE type has not yet added clinically useful diagnostic information in conjunction with imaging; however, algorithms which include multiple biomarkers have been clearly shown to increase the power of studies and reduce the number of patients required to demonstrate the statistical significance of findings [61] and are encouraged by the Food and Drug Administration [63] and the Alzheimer’s Disease Neuroimaging Initiative (http://www.nia.nih.gov/research/dn/alzheimers-disease-neuroimaging-initiative-adni), which freely shares data. While we did not pre-specify the collection of any CSF biomarker data in our 4 clinical trials (and ApoE genotype was published for only 18 subjects in our pooled dataset [31]), the relationship of certain CSF biomarkers to brain biopsy findings was recently described for a large series of patients (53 iNPH, 26 AD, and 23 other) at Kuopio University in Finland; quantified biopsy Aβ load showed a negative correlation with both ventricular and lumbar CSF Aβ42 while levels of Aβ38 and Aβ40 showed no correlation [64].
In the near future we hope to see clinical imaging with Aβ ligands combined with assessments of other recently identified AD risk genes (up to and including IHC for their protein products at autopsy), such as specific variants of clusterin (CLU, Apo J), complement component receptor 1 (CR1), and phosphatidylinositol binding clathrin assembly protein (PICALM) [61], until a more powerful algorithm is achieved that will almost certainly inform the identity of one or more useful drug targets (drugs) for the slowing, prevention, and/or arrest of this ultimately devastating pathology.

In summary, both quantitative assessment and BIE of [18F]flutemetamol images in this series of iNPH patients showed good agreement with cortical biopsy histopathology. The primary diagnostic effectiveness (as measured by ROC AUC, Youden index, and sum of sensitivity and specificity) for [18F]flutemetamol PET was best when the composite SUVR measure using cerebellum as the reference region was paired with the overall pathology SoT.