Interobserver reproducibility of tumor uptake quantification with 89Zr-immuno-PET: a multicenter analysis

Purpose In-vivo quantification of tumor uptake of 89-zirconium (89Zr)-labelled monoclonal antibodies (mAbs) with PET provides a potential tool in strategies to optimize tumor targeting and therapeutic efficacy. A specific challenge for 89Zr-immuno-PET is low tumor contrast. This is expected to result in interobserver variation in tumor delineation. Therefore, the aim of this study was to determine interobserver reproducibility of tumor uptake measures by tumor delineation on 89Zr-immuno-PET scans. Methods Data were obtained from previously published clinical studies performed with 89Zr-rituximab, 89Zr-cetuximab and 89Zr-trastuzumab. Tumor lesions on 89Zr-immuno-PET were identified as focal uptake exceeding local background by a nuclear medicine physician. Three observers independently manually delineated volumes of interest (VOI). Maximum, peak and mean standardized uptake values (SUVmax, SUVpeak and SUVmean) were used to quantify tumor uptake. Interobserver variability was expressed as the coefficient of variation (CoV). The performance of semi-automatic VOI delineation using 50% of background-corrected ACpeak was described. Results In total, 103 VOI were delineated (3–6 days post injection (D3-D6)). Tumor uptake (median, interquartile range) was 9.2 (5.2–12.6), 6.9 (4.0–9.6) and 5.5 (3.3–7.8) for SUVmax, SUVpeak and SUVmean. Interobserver variability was 0% (0–12), 0% (0–2) and 7% (5–14), respectively (n = 103). The success rate of the semi-automatic method was 45%. Inclusion of background was the main reason for failure of semi-automatic VOI. Conclusions This study shows that interobserver reproducibility of tumor uptake quantification on 89Zr-immuno-PET was excellent for SUVmax and SUVpeak using a standardized manual procedure for tumor segmentation. Semi-automatic delineation was not robust due to limited tumor contrast. 
Electronic supplementary material The online version of this article (10.1007/s00259-019-04377-6) contains supplementary material, which is available to authorized users.


Introduction
Therapy with monoclonal antibodies (mAbs) has greatly improved the outcome of cancer patients [1]. However, treatment failure due to the biology of the disease is a substantial problem. In addition to disease-related factors, therapy-related factors have been found to be responsible [2]. Available information mainly concerns pharmacokinetics in blood, whereas tumor targeting is crucial for mAb efficacy. Therefore, in-vivo quantification of antibody uptake in tumors is of interest in strategies to improve the efficacy of antibody treatment (e.g. using optimized pharmacokinetic models in early drug development to improve dosing schedules). PET imaging with zirconium-89 (89Zr)-labelled mAbs provides a non-invasive tool to visualize and quantify mAb tumor uptake [3], provided that the biodistribution of the radiolabelled mAb represents that of the total mAb dose (radiolabelled and unlabelled). The number of clinical studies on 89Zr-labelled mAbs, also referred to as 89Zr-immuno-PET, has increased in recent years [4]. Sources of measurement error (including factors such as interobserver reproducibility of tumor uptake quantification and noise-induced variability) should be known in order to identify true biological differences. A standardized method of data acquisition and tumor uptake quantification forms the basis for obtaining experimental data that will allow such an understanding.
For quantification of tumor uptake, a volume of interest (VOI) is delineated. Subsequently, a tumor uptake measure is selected to characterize tumor uptake. Maximum (max) or peak standardized uptake values (SUVmax and SUVpeak, respectively) provide information on a limited part of the tumor. Mean standardized uptake values (SUVmean) and total lesion uptake (TLU) serve to capture the entire lesion. In clinical studies, tumor uptake is quantified at a single (late) timepoint or at multiple timepoints. Additionally, quantification of tumor uptake at an early timepoint (D0) can be considered, for example, to estimate the blood volume fraction of the tumor.
For imaging of mAbs, 89Zr is considered a suitable radioactive isotope due to its long half-life (t1/2 = 78.4 h), which matches the slow kinetics of large-sized proteins. Consequences of imaging with 89Zr are low positron abundance and relatively high radiation exposure, resulting in lower injected doses compared to 18F. Therefore, lower signal-to-noise ratios due to lower count rates may result in interobserver variability of tumor uptake quantification in 89Zr-immuno-PET. Other specific challenges for 89Zr-immuno-PET tumor delineation and quantification are relatively low, sometimes heterogeneous, tumor uptake (Fig. 1) and low (or even negative) contrast depending on tumor localization and background activity [5]. Therefore, the aim of this study was to determine interobserver reproducibility of tumor uptake values by manual delineation on 89Zr-immuno-PET.

Data inclusion
For this retrospective study, 89Zr-immuno-PET scans with corresponding 18F-FDG-PET scans were collected. Data were selected from previously published clinical studies with therapeutic mAbs: 89Zr-rituximab in patients with B cell lymphoma ([6]; Dutch Trial Register NTR 3392), 89Zr-cetuximab in patients with colorectal cancer ([5]; NCT01691391) and 89Zr-trastuzumab in patients with breast cancer ([7]; NCT01691391). These studies had been approved by the ethics committees (Medisch Ethische Toetsingscommissie VUmc and Medisch Ethische Toetsingscommissie UMC Groningen) and all subjects signed an informed consent. Data acquisition and visual assessment of tumor uptake were done locally: from the first two studies performed at the VUmc all subjects with visible tumor uptake were included, from the last study performed at the UMCG seven subjects were selected randomly. Scan data at 1 h (D0), 72 h (D3) and 144 h (D6) post injection (p.i.) for 89Zr-labelled rituximab and cetuximab and at 96 h (D4) p.i. for 89Zr-trastuzumab were included. See Table 1 for patient characteristics and 89Zr-immuno-PET scan details. 89Zr-rituximab and 89Zr-cetuximab PET scans were performed on a Philips Gemini TF-64 or Ingenuity TF-128 PET-CT scanner (Philips Healthcare, The Netherlands). A Siemens Biograph mCT64 PET-CT scanner (Siemens Healthcare, The Netherlands) was used for the 89Zr-trastuzumab-PET scans.

VOI delineation
All immuno-PET scans were acquired and reconstructed to conform to recommendations for multicenter harmonization of 89Zr-immuno-PET [8]. Visual assessment of immuno-PET scans was performed by an experienced nuclear medicine physician (OSH for 89Zr-rituximab and 89Zr-cetuximab, AHB for 89Zr-trastuzumab). Tumor uptake was defined as focal uptake exceeding local background. For visually positive tumor lesions, a screenshot indicating tumor localization on immuno-PET was obtained for tumor uptake quantification. Quantitative assessment of tumor uptake for all lesions was independently performed by three observers [1 data analyst (SP), 2 physician-researchers (FB, YJ)]. Tumor delineation for all VOI was performed using the ACCURATE software tool (developed in IDL version 8.4 (Harris Geospatial Solutions, Bloomfield, USA)) [9].
The observers recorded the analysis time per tumor lesion and VOI delineation method.
Fig. 1 Challenges for 89Zr-immuno-PET tumor delineation and quantification. Example of 18F-FDG-PET (a) for a patient with a non-Hodgkin lymphoma showing intense tumor uptake (black arrow) and excellent contrast, while 89Zr-immuno-PET (b) with 89Zr-labelled-rituximab shows limited contrast for this tumor. Red arrows indicate uptake in blood vessels. Example of tumor delineation by two observers (observer 1 = blue line, observer 2 = black line) for 18F-FDG-PET (c) and 89Zr-immuno-PET (d). This example illustrates that excellent interobserver reproducibility (SUVmax = 10 for both observers) can be expected for 18F-FDG-PET, despite variability in tumor delineation. The limited tumor contrast for 89Zr-immuno-PET may result in substantial interobserver variability, even for SUVmax (a value of 2 and 3 for observers 1 and 2, respectively)
Manual tumor delineation on immuno-PET. The observers manually delineated tumor VOI on the immuno-PET scans (attenuation corrected image), using the low dose CT for anatomical reference (Fig. 2a). Adjustment of the following settings was allowed: zoom, contrast and orientation (coronal/axial/sagittal). Use of a threshold (upper or lower limit) or fixed size VOI was not allowed. For 89Zr-rituximab and 89Zr-cetuximab, tumors were manually delineated on both the D3 and D6 scans, starting with the latest time point. On D0, no tumor uptake was visible, therefore the VOI delineated on D6 were imported to the D0 scan. Observers could manually adjust localization of the VOI to optimize matching of the anatomical position of the tumor lesion on the D0 scan. For all VOI, max, peak and mean activity concentrations (AC in Bq/mL) were derived and converted to standardized uptake values (SUV), by correcting for body weight and injected dose (ID). In addition, delineated volume (mL) and TLU (defined as ACmean * volume, in %ID) were obtained.
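The conversion from activity concentration to SUV and TLU can be sketched as follows. This is a minimal illustration, not the ACCURATE implementation: the function names are hypothetical, a tissue density of 1 g/mL is assumed, and the activity concentration and injected dose are assumed to be decay-corrected to the same time point before they are passed in.

```python
def suv(ac_bq_per_ml, injected_dose_bq, body_weight_kg):
    """Body-weight-normalized SUV (dimensionless, assuming 1 mL of tissue weighs 1 g)."""
    if injected_dose_bq <= 0 or body_weight_kg <= 0:
        raise ValueError("injected dose and body weight must be positive")
    body_weight_g = body_weight_kg * 1000.0
    return ac_bq_per_ml * body_weight_g / injected_dose_bq


def total_lesion_uptake_pct_id(ac_mean_bq_per_ml, volume_ml, injected_dose_bq):
    """TLU = ACmean * volume, expressed as a percentage of the injected dose."""
    if injected_dose_bq <= 0:
        raise ValueError("injected dose must be positive")
    return 100.0 * ac_mean_bq_per_ml * volume_ml / injected_dose_bq


# Illustrative numbers only (not taken from the study data):
# 37 MBq injected, 70 kg patient, ACmean of 3000 Bq/mL in a 10 mL VOI.
example_suv_mean = suv(3000.0, 37e6, 70.0)
example_tlu = total_lesion_uptake_pct_id(3000.0, 10.0, 37e6)
```

The same `suv` function applies to ACmax, ACpeak and ACmean; only the input activity concentration differs.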
Manual tumor delineation on immuno-PET after viewing the 18F-FDG-PET. In order to support delineation of the tumor, the observers had access to the corresponding 18F-FDG-PET and could adapt the original manually delineated VOI if necessary (for example, by creating a smaller or larger VOI, or changing the position of the VOI) (Fig. 2b). This procedure was performed on scans with visible tumor uptake (D3, D4, D6). The number of VOIs that were adapted after viewing the 18F-FDG-PET was obtained.
Fig. 2c ... 89Zr-rituximab-PET on D6 (left panel), the mask delineated on the 89Zr-rituximab-PET shown in orange (middle panel) and the semi-automatic VOI (50% of ACpeak, mask restricted) on the 89Zr-rituximab-PET shown in green (right panel). This semi-automatic VOI was accepted by the observer, as it contains tumor and no other structures or background
Semi-automatic VOI delineation. Finally, we investigated the feasibility of a mask-restricted semi-automatic VOI delineation method. Each observer, for every tumor lesion, manually delineated a mask, which is a VOI including the tumor, excluding non-tumor structures (e.g. nearby blood vessels) on the immuno-PET scan. Subsequently, the semi-automatic VOI was generated including all voxels with a value ≥50% of background-corrected ACpeak within the mask (Fig. 2c). The semi-automatic isocontour was defined as 0.5 * (peak value + average background value). The background region was determined with a region growing algorithm of the tumor border, expanding three voxels away from the border of the tumor in all three dimensions [10]. The observers rated the semi-automatic VOI and accepted the VOI if it contained the tumor and no other structures or background. The number of tumor lesions for which the semi-automatic VOI was accepted by all observers was obtained.
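The two descriptions of the threshold are equivalent: 50% of the background-corrected peak, added back onto the background, gives background + 0.5 * (peak - background) = 0.5 * (peak + background). The sketch below illustrates the mask-restricted selection under simplifying assumptions: the image is a hypothetical dictionary from voxel index to activity concentration, the mask is a set of voxel indices, and the peak and background values are taken as given (the ACpeak calculation and the region growing step for the background are not reproduced here).

```python
def isocontour_threshold(peak, background, fraction=0.5):
    """Threshold at `fraction` of the background-corrected peak value."""
    return background + fraction * (peak - background)


def semi_automatic_voi(image, mask, peak, background, fraction=0.5):
    """Return the voxels inside the mask whose value reaches the threshold.

    image: dict mapping voxel index -> activity concentration (hypothetical layout)
    mask:  set of voxel indices delineated manually around the tumor
    """
    threshold = isocontour_threshold(peak, background, fraction)
    return {voxel for voxel in mask if image[voxel] >= threshold}


# Illustrative 1D row of voxels; voxel 6 mimics a nearby blood vessel
# that the observer kept outside the mask.
example_image = {0: 2.0, 1: 3.0, 2: 6.0, 3: 10.0, 4: 7.0, 5: 2.5, 6: 9.0}
example_mask = {1, 2, 3, 4, 5}
example_voi = semi_automatic_voi(example_image, example_mask, peak=10.0, background=2.0)
```

Restricting the selection to the mask is what keeps the high-uptake voxel 6 out of the resulting VOI, even though its value exceeds the threshold.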

Eligibility criteria for VOI delineation
Quantification of lesions with low tumor uptake and/or high background uptake (e.g. lesions with low contrast and/or nearby presence of blood vessels or elevated healthy tissue uptake) is difficult, due to the intrinsically low signal-to-noise ratios in 89Zr-immuno-PET. To ensure that quantification is only reported when delineation is feasible, a method to determine eligibility for VOI delineation was explored. Criteria were selected based on the potential for incorporation in a standardized workflow for tumor identification by a nuclear medicine physician, followed by tumor delineation by a data analyst.
When measurement variability for SUVmax was >0, VOI were assessed for apparent insufficient tumor contrast for manual tumor delineation.
Based on this assessment VOI were deemed ineligible for quantification, according to the following criteria:

1. A different structure was delineated by at least one observer.
2. The voxel with maximum intensity was located at the border of the VOI of at least one observer.
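The second criterion lends itself to an automated check. The sketch below is one possible reading, not the procedure used in the study: a VOI is a set of (x, y, z) voxel indices, a border voxel is assumed to be any VOI voxel with at least one of its six face neighbours outside the VOI, and the first criterion is passed in as a flag because it relies on visual verification. All names are hypothetical.

```python
# 6-connected neighbourhood: an assumption about what counts as the VOI border.
NEIGHBOUR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def is_border_voxel(voxel, voi):
    """True if at least one face neighbour of the voxel lies outside the VOI."""
    x, y, z = voxel
    return any((x + dx, y + dy, z + dz) not in voi for dx, dy, dz in NEIGHBOUR_OFFSETS)


def max_voxel_on_border(image, voi):
    """True if the hottest voxel in the VOI is a border voxel (ties are not handled)."""
    hottest = max(voi, key=lambda voxel: image[voxel])
    return is_border_voxel(hottest, voi)


def eligible_for_quantification(image, vois, same_structure):
    """Apply both criteria to the VOIs drawn by all observers for one lesion."""
    if not same_structure:  # criterion 1: judged visually, supplied by the caller
        return False
    return not any(max_voxel_on_border(image, voi) for voi in vois)  # criterion 2


# Illustrative 3 x 3 x 3 VOI with the maximum in the centre or on a face.
cube = {(x, y, z) for x in range(3) for y in range(3) for z in range(3)}
central_max = {voxel: 1.0 for voxel in cube}
central_max[(1, 1, 1)] = 5.0
edge_max = dict(central_max)
edge_max[(0, 1, 1)] = 9.0
```

A maximum on the border suggests that the true hot spot may lie outside the delineated volume, which is why such VOI are treated as unreliable.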
Interobserver variability and reliability were analyzed for the entire group of VOI, as well as for the subset of VOI eligible for quantification.

Interobserver reproducibility
Interobserver reproducibility for manual tumor delineation on immuno-PET was assessed by an agreement parameter (standard error of measurement, SEM) as well as a reliability parameter (intraclass correlation coefficient, ICC) [11]. As we expected the interobserver variability between lesions within a single patient to be equal to or higher than that between patients, we performed a VOI-based analysis.
Interobserver variability. The agreement parameter reflects the measurement error due to interobserver variability [11]. For every tumor lesion, three values (value1, value2 and value3) were obtained from observers 1, 2 and 3, respectively. Absolute interobserver variability was calculated as:
$$\mathrm{SEM} = \mathrm{SD}(\mathrm{value}_1, \mathrm{value}_2, \mathrm{value}_3)$$
where SD is the standard deviation. SEM was calculated for each individual tumor lesion and has the same unit as the uptake measure (SUVmax, SUVpeak and SUVmean, dimensionless; volume in mL; TLU in %ID).
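Relative interobserver variability was calculated as:
$$\mathrm{CoV}\,(\%) = \frac{\mathrm{SD}(\mathrm{value}_1, \mathrm{value}_2, \mathrm{value}_3)}{\mathrm{mean}(\mathrm{value}_1, \mathrm{value}_2, \mathrm{value}_3)} \times 100$$
where CoV (%) is the coefficient of variation. When all observers measure the exact same tumor uptake, SEM and CoV equal 0.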
Correlation of absolute and relative variability with tumor uptake was assessed. For a group of n VOI, the interobserver variability is given as the median (interquartile range).
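As a minimal sketch of these per-lesion calculations, assuming that SEM is the standard deviation of the three observer values and that CoV is this standard deviation relative to their mean: the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify it, and the numbers are illustrative rather than study data.

```python
from statistics import mean, median, stdev


def sem(values):
    """Absolute interobserver variability: SD of the observers' values for one lesion."""
    return stdev(values)  # sample SD (n - 1 denominator); an assumption


def cov_percent(values):
    """Relative interobserver variability: SD as a percentage of the mean."""
    average = mean(values)
    if average == 0:
        raise ValueError("CoV is undefined when the mean uptake is zero")
    return 100.0 * stdev(values) / average


# One row per lesion, one value per observer (illustrative SUV values).
lesions = [
    [9.2, 9.2, 9.2],  # identical readings: SEM and CoV are 0
    [4.0, 5.0, 6.0],
    [3.0, 3.0, 3.6],
]
covs = [cov_percent(values) for values in lesions]
median_cov = median(covs)  # group-level summary across lesions
```

Because SUVmax depends on a single voxel, all observers obtain the same value whenever their VOI contain that voxel, which is how a CoV of exactly 0 arises.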
Reliability. A reliability parameter was used to assess whether differences in tumor uptake between lesions can be distinguished, despite measurement error due to interobserver variability. A two-way random model with absolute agreement (single measure) was used to obtain the ICC and 95% confidence interval. This means that the three observers in our study were considered as a random sample of all possible observers, and the systematic differences between the observers were included in the measurement error as we were interested in absolute agreement between the observers.
Reliability, expressed as ICC, was calculated as:
$$\mathrm{ICC} = \frac{\sigma^2_{\mathrm{lesion}}}{\sigma^2_{\mathrm{lesion}} + \sigma^2_{\mathrm{obs}} + \sigma^2_{\mathrm{error}}}$$
where σ²obs is the systematic part, and σ²error is the random part of the measurement error, while σ²lesion is the true variance between tumor lesions. ICC calculations were performed in SPSS, version 22.
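For readers without SPSS, the single-measure, absolute-agreement ICC of a two-way random model can be estimated from the mean squares of a two-way ANOVA. The sketch below uses this standard estimator (often written ICC(2,1) or ICC(A,1)); it is an illustration rather than the SPSS routine, it does not compute the confidence interval, and the function name and example ratings are hypothetical.

```python
def icc_absolute_agreement_single(ratings):
    """ICC for a two-way random model, absolute agreement, single measure.

    ratings: list of rows, one row per lesion, one column per observer.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 lesions and 2 observers, no missing values")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_lesion = k * sum((m - grand) ** 2 for m in row_means)
    ss_observer = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_lesion - ss_observer

    ms_lesion = ss_lesion / (n - 1)
    ms_observer = ss_observer / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Systematic observer differences enter the denominator (absolute agreement).
    # The denominator is zero only if every rating is identical.
    return (ms_lesion - ms_error) / (
        ms_lesion + (k - 1) * ms_error + k * (ms_observer - ms_error) / n
    )


# Illustrative ratings: 6 subjects scored by 4 raters (not study data).
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
example_icc = icc_absolute_agreement_single(example_ratings)
```

The example shows why absolute agreement is a strict criterion: the raters rank the subjects similarly, but their systematic offsets pull the ICC down to about 0.29.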

Statistical analysis
For comparison of interobserver variability between two groups, the Wilcoxon matched-pairs signed rank test was used for paired data (e.g. SUVmean on D3 and D6 for the same tumor lesions). For comparison of median CoV between multiple groups, a non-parametric one-way ANOVA was performed, using the Friedman test with Dunn's multiple comparisons correction to compare median CoV for paired data (SUVmean, SUVmax and SUVpeak for the same tumor lesions). For all statistical tests, a p value <0.05 was considered statistically significant. Statistical tests were performed in GraphPad Prism, version 6.02.

VOI delineation
In total, 103 VOI were manually delineated by each observer. The number of VOI was not evenly distributed over the patients (Table 1). The range in interobserver variability (SEM for SUVpeak) for all VOI combined was 0 to 2.3 (median 0.4, n = 103). The range in interobserver variability between VOI within a single patient was 0 to 2.3 (median 0.6, n = 22) for patient 2 (89Zr-rituximab at D6). Interobserver variability (SEM) at D6 for the remaining five 89Zr-rituximab patients ranged from 0.1 to 1.4 (median 0.3, n = 8).
Thus, as interobserver variability was higher within a single patient than between patients, a VOI-based analysis was performed.
Manual delineation on 89Zr-immuno-PET required a median time of 2 min (range 1-5 min). Viewing of the 18F-FDG-PET and adaptation of the original VOI required an additional time of 1 min (range 1-30 min). The semi-automatic procedure required 1 min (range 1-5 min).
All observers reported difficulty distinguishing the borders of some tumor lesions on immuno-PET, especially if the tumor was in proximity to other structures with high uptake, e.g. a blood vessel. Viewing the corresponding 18F-FDG-PET did not resolve this issue, as the localization and borders of the tumor lesions on immuno-PET were still not fully clear when viewing both the immuno-PET and the 18F-FDG-PET. After viewing the corresponding 18F-FDG-PET, 25% of the VOI were adapted by at least one observer (Table 2).
Semi-automatically generated VOI were accepted by all three observers in 45% of all VOI (Table 2). Inclusion of background was the main reason for failure of semi-automatic VOI.

Eligibility criteria for VOI delineation
Measurement variability for SUVmax was >0 in 25% (26/103) of the manually delineated VOI.
Application of eligibility criteria resulted in exclusion of 19 VOI, as tumor contrast was apparently insufficient for correct VOI delineation.
Interobserver variability did not change after viewing the corresponding 18F-FDG-PET (p = 0.62, n = 25 VOI adapted by at least 1 observer).
Reliability ICC data are presented in Table 4. For eligible VOI, ICC values for SUV max , SUV peak and SUV mean were ≥ 0.90 for  Table 2).

Discussion
Interobserver reproducibility for tumor uptake measures was investigated, as knowledge of measurement error is required for future clinical application of 89Zr-immuno-PET. Interobserver reproducibility was excellent for SUVmax and SUVpeak (variability of 0%) and very reasonable for SUVmean (variability of 7%), especially considering the lower signal-to-noise ratios for 89Zr-immuno-PET compared to 18F-FDG-PET. For example, interobserver variability of 14% for SUVmean has been reported for manual tumor delineation of pulmonary lesions on 18F-FDG-PET [12]. For 89Zr-immuno-PET, this is the first study to report interobserver reproducibility of tumor uptake measures. Several factors should be considered to determine to what extent these results are generalizable. Interobserver reproducibility was determined for three different 89Zr-labelled mAbs (rituximab, cetuximab and trastuzumab), at different time points (D3, D4, D6) and different injected doses (74 MBq for 89Zr-rituximab vs 37 MBq for 89Zr-trastuzumab and 89Zr-cetuximab). This study was not designed to assess how these factors individually impact interobserver variability. Instead, the results obtained reflect a broad range of uptake characteristics, which can be used as a general estimate of the measurement error due to interobserver variability in VOI delineation. Future, larger studies can focus on factors that influence tumor contrast (e.g. tumor localization, differences in uptake characteristics between mAbs). Although ICC values are reported, reliability is dependent on the range in tumor uptake and therefore not directly generalizable to other studies. In addition, tumor uptake and interobserver variability are influenced by the disproportionately high number of lesions in patient 2. Therefore, ICC values for this lesion-based analysis cannot be applied to determine whether we can reliably detect differences between patients.
Improved tumor contrast, in combination with a broad range in tumor uptake, is expected to result in improved interobserver reproducibility for all tumor uptake measures.
Another aspect to consider is that all observers used the same quantification software and a standardized operating procedure (no use of thresholds or fixed size VOI). Use of different software platforms without a standardized procedure may result in lower interobserver reproducibility. In addition, generalizability could be hampered if the three observers had read the images in a systematically different way. In this study, there was no indication of such a systematic difference between the three observers.
These results suggest that interobserver agreement for SUVmean is sufficient to consider this uptake measure to quantify tumor uptake in a larger tumor area (as opposed to only the maximum voxel or a very small sample of the tumor as defined by SUVpeak). However, manual tumor delineation is a laborious task. As the concept of total lesion mAb uptake is of interest, the feasibility of semi-automatic VOI delineation was explored. For 18F-FDG-PET, with perfect interobserver agreement for SUVmax [13] and higher tumor contrast, semi-automatic procedures (e.g. with a threshold of 0.6 of SUVmax) are used to obtain SUVmean, total lesion glycolysis (TLG) and total metabolic tumor volume (TMTV) [14,15]. For our datasets, the area included by the semi-automatic VOI was often too large, indicating low tumor to local background ratios, resulting in inclusion of background voxels in the semi-automatic VOI. For mAbs showing higher tumor contrast, as well as imaging with higher count statistics (due to, for example, higher injected doses or the availability of scanners with improved detection sensitivity or time of flight resolution), semi-automatic delineation may be feasible. Reduction of noise (e.g. by introduction of total body PET scanners) is the first step towards further improvement of tumor delineation procedures. Future studies into accuracy of tumor delineation should include 'supervised' delineation methods (semi-automatic procedures with a manual check) in which the optimal threshold is experimentally determined. If the success rate can thus be increased, this may lead to further development towards a robust automatic method, which is desired for clinical application.
Table 4 footnote: Data presented as ICC (95% confidence interval). NA, ICC not available (2 eligible VOI).
As semi-automatic delineation was not feasible in our datasets, we explored eligibility criteria to improve standardization for manual tumor delineation, especially in case of limited tumor contrast.
In our study, 81% of the VOI (84 out of 103) were considered suitable for quantification. Based on these results, we recommend a two-step procedure to exclude lesions with insufficient tumor contrast for manual delineation: (1) verification of VOI delineation by a nuclear medicine physician to identify delineation of an incorrect structure due to limited tumor contrast, (2) exclusion of VOI with the voxel with the highest uptake located at the border of the VOI, indicating low tumor uptake and/or high background uptake.
These measures support optimal scan interpretation and standardization, which is an essential step towards potential clinical implementation of 89Zr-immuno-PET.
For this study, we performed a multicenter interobserver analysis for data that were originally obtained in single center studies. With this experience, the next step towards standardization of quantification for 89Zr-immuno-PET studies can be taken in the context of a multicenter study [e.g. the IMPACT trials (NCT02228954, NCT02117466 and NCT01957332)].
Reliable delineation of tumor uptake on 89Zr-immuno-PET allows future use as a non-invasive clinical tool to determine mAb concentrations in the tumor. Knowledge of in-vivo drug delivery of mAb-based therapy (including antibody-drug conjugates, bispecific mAbs and immune checkpoint inhibitors) is crucial to understand and predict efficacy of treatment.

Conclusion
This study shows that interobserver reproducibility of tumor uptake quantification on 89Zr-immuno-PET was excellent for SUVmax and SUVpeak using a standardized manual procedure for tumor segmentation. Semi-automatic delineation was not robust due to limited tumor contrast.