Optimal method for metabolic tumour volume assessment of cervical cancers with inter-observer agreement on [18F]-fluoro-deoxy-glucose positron emission tomography with computed tomography

Purpose Cervical cancer metabolic tumour volume (MTV) derived from [18F]-FDG PET/CT has a role in prognostication and therapy planning. There is no standard method of outlining MTV on [18F]-FDG PET/CT. The aim of this study was to assess the optimal method to outline primary cervical tumours on [18F]-FDG PET/CT using MRI-derived tumour volumes as the reference standard. Methods 81 consecutive cervical cancer patients with pre-treatment staging MRI and [18F]-FDG PET/CT imaging were included. MRI volumes were compared with different PET segmentation methods. Method 1 measured MTVs at different SUVmax thresholds ranging from 20 to 60% (MTV20-MTV60) with bladder masking and manual adjustment when required. Method 2 created an isocontour around the tumour prior to different SUVmax thresholds being applied. Method 3 used an automated gradient method. Inter-observer agreement of MTV, following manual adjustment when required, was recorded. Results For method 1, the MTV25 and MTV30 were closest to the MRI volumes for both readers (mean percentage change from MRI volume of 2.9% and 13.4% for MTV25 and − 13.1% and − 2.0% for MTV30 for readers 1 and 2). 70% of lesions required manual adjustment at MTV25 compared with 45% at MTV30. There was excellent inter-observer agreement between MTV30 to MTV60 (ICC ranged from 0.898–0.976 with narrow 95% confidence intervals (CIs)) and moderate agreement at lower thresholds (ICC estimates of 0.534 and 0.617, respectively for the MTV20 and MTV25 with wide 95% CIs). Bladder masking was performed in 86% of cases overall. For method 2, excellent correlation was demonstrated at MTV25 and MTV30 (mean % change from MRI volume of −3.9% and − 8.6% for MTV25 and − 16.9% and 19% for MTV30 for readers 1 and 2, respectively). This method also demonstrated excellent ICC across all thresholds with no manual adjustment. 
Method 3 demonstrated excellent ICC of 0.96 (95% CI 0.94–0.97) but had a mean percentage difference from the MRI volume of − 19.1 and − 18.2% for readers 1 and 2, respectively. 21% required manual adjustment for both readers. Conclusion MTV30 provides the optimal correlation with MRI volume taking into consideration the excellent inter-reader agreement and less requirement for manual adjustment. Supplementary Information The online version contains supplementary material available at 10.1007/s00259-020-05136-8.


Introduction
Cervical cancer, the fourth most common gynaecological malignancy worldwide, is a major cause of mortality in women [1,2]. Primary tumour volume-derived parameters from [18F]-fluoro-deoxy-glucose positron emission tomography with computed tomography ([18F]-FDG PET/CT) such as metabolic tumour volume (MTV) and total glycolytic volume (TGV) have been reported to be prognostic in cervical cancer patients [3][4][5]. Combined nomograms of pre-treatment MTV, cervical tumour maximum standard uptake value (SUV max) and lymph node status on [18F]-FDG PET/CT have been suggested to predict overall survival in locally advanced cervical cancer patients undergoing chemo-radiation therapy [6].
Even though morphological MRI has assumed prominence as the imaging modality for defining the gross tumour volume (GTV) in cervical cancer adaptive brachytherapy, as enshrined in the Gynaecological European Group of Curietherapie-European Society for Therapeutic Radiology and Oncology (GYN GEC-ESTRO) Working Group recommendations, PET retains a role in staging (N, M) and prognosis and is integrated into the radiotherapy workflow [7][8][9].
However, the optimal method of outlining tumour volume on [18F]-FDG PET/CT in cervical cancer has not been established, and this is required in order to standardise its use for establishing prognosis using volumetric-based parameters (Table 1).
The EANM Guidelines for Tumour Imaging suggest that a VOI at 41% of the tumour SUV max corresponds best with the tumour dimensions, provided the tumour has a high metabolic activity to background ratio and homogeneous uptake [18]. Otherwise, a VOI of 50% of the SUV max was recommended. These guidelines were based on three papers: a phantom thorax study [19], a test re-test study in 11 lung cancer patients and in the follow-up of 16 breast cancer patients [20], and repeatability measurements in 11 lung cancer patients scanned 7 days apart [21]. All of the studies cited by the guidelines were performed by the same single institution, and those that were performed on humans utilised tumours with high tumour to background ratios (lung and breast). It is unclear, but unlikely, that this automatically extends to other tumour types, particularly tumours close to organs with high physiologic activity such as cervical tumours adjacent to the bladder.
An early study by Miller and Grigsby involving 13 patients with cervical cancer who had [18F]-FDG PET within 2 weeks of separately acquired CT established a threshold of 40% SUV max (MTV 40 ) based on the visual inspection of tumour volume on CT scans [10].
Most subsequent studies (Table 1) on cervical tumours used this MTV 40 threshold [6,11,15,22,23,24], although a few studies have used a fixed threshold of SUV > 2.5 [3,5,25]. Recent studies have suggested that the 30% SUV max (MTV 30) or 35% SUV max (MTV 35) thresholds correlate best with tumour volume on MRI [13,16]. In tumour volume assessment of other malignancies, a fixed SUV max threshold has demonstrated significant limitations, including underestimating MTV in lesions with high SUV max and overestimating in lesions close to regions with high background activity [26]. In addition, partial volume averaging affects small tumours. Recently, automated gradient methods have been proposed but, to date, they have not been assessed in cervical tumours [27].
Ideally, the tumour segmentation technique should be fully automated. However, in practice, intense bladder/ureteric and bowel activity adjacent to the cervical tumour can interfere with accurate tumour outlining. Therefore, manual adjustment of automated volumes, to exclude activity in adjacent physiologic structures, may be required. This has been mentioned but not accurately documented in previous studies [16]. In addition, it is not clear which software approach to exclude bladder is best and options include bladder masking or applying a constraining volume around the tumour prior to applying thresholds.
With the increasing trend towards volumetric assessment of PET data with radiomics feature analysis, it is important that the methods are standardised using a technique that is both accurate and reproducible. The reproducibility of cervical tumour volume assessment on [18F]-FDG PET/CT at various thresholds has never previously been documented.
The aims of this study are as follows:
(1) To evaluate the optimal metabolic tumour volume (MTV) at different percentage SUV max thresholds (method 1 using bladder masking when required; method 2 using an ellipsoid isocontour around the tumour before applying thresholds) and an automated gradient method (method 3) to outline primary cervical tumours using MRI-derived tumour volumes as the reference standard.
(2) To document any requirement for manual adjustment.
(3) To assess inter-observer agreement in MTV measurement.

Study design
Institutional ethical approval for retrospective analysis was obtained, and informed consent was waived. Consecutive patients between January 2009 and December 2016 who had staging [18F]-FDG PET/CT imaging for biopsy-proven cervical cancer at our tertiary referral specialist gynaecological oncology centre and MRI pelvis were included in the study. Inclusion criteria were (i) histologically confirmed cervix cancer, (ii) absence of previous treatment for cervical cancer (including previously excised by cone biopsy), (iii) availability of a recent comparative MRI pelvis within 10 weeks and (iv) presence of a measurable cervical tumour on both MRI and PET/CT. Exclusion criteria were as follows: patients in whom the cervical tumour was not measurable (less than 5 ml in volume [28,29]).

PET/CT protocol
The PET centre is NCRI (National Cancer Research Institute) accredited by the UK PET Core Lab and all scans were performed on the same scanner (Siemens Biograph 64). Following a 4-6-h fast with acceptable glucose level (< 11 mmol/l), patients were administered [18F]-FDG (370-410 MBq) intravenously. Post 60-min uptake period, a low-dose CT (5-mm thickness with 3-mm spacing, 120 kVp, 50 mAs, 0.8 spiral pitch) was performed followed by an emission study (mid-thighs to skull base, 5-6 overlapping bed positions, 3-4 min/bed position). Images were reconstructed using ordered subset expectation maximisation (4 iterations, 8 subsets, Gaussian filter of 5 mm FWHM). The PET images were attenuation-corrected using the CT data.

Segmentation
Method 1: Semi-automated adaptive threshold contour generation ± bladder masking
Bladder masking
For each MTV threshold, an initial assessment was made whether bladder masking was required. The criterion for bladder masking was as follows: if the bladder was outlined instead of tumour or if part of the bladder was included in the MTV on > 5 slices. When required, a single experienced observer (SG) performed bladder masking using an automated technique (Hermes Medical Solutions, Sweden).
Image analysis
Images were analysed independently by two experienced observers (SG and TB, 3 and 15 years' PET/CT experience respectively). MTV was auto-contoured at percentage SUV max thresholds of 20%, 25%, 30%, 35%, 40%, 50% and 60% (MTV 20, 25, 30, 35, 40, 50 and 60) (Hermes Medical Solutions, Sweden). Percentage SUV max thresholds were applied at intervals of 5% from 20 to 40% based on the findings of the study by Upasani et al. [13], which suggested that in their cohort the ideal threshold was between MTV 30 and MTV 35, and on earlier research which suggested a MTV 40 threshold [10]. In addition, an absolute threshold of SUV > 2.5 (SUV 2.5), as used by other studies, was assessed [3,25]. If adjacent physiological structures, such as bladder, ureters and bowel, were included in the automated volume, manual adjustments were made. The degree of manual adjustment was documented as one of the following: no manual adjustment; minor (five slices or fewer); major (six to twenty slices); or too difficult (21 slices or more to adjust). In the few cases deemed 'too difficult to employ manual adjustment', the MTV was not measured and this was noted.
The MTV for the different thresholds was recorded for each observer. The inter-observer agreement for MTV was assessed from the final volume for each observer, i.e. included completely automated cases and those with manual adjustment if required.
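The percentage-threshold step described above can be illustrated with a short sketch. This is a minimal Python/NumPy illustration and not the vendor software used in the study: the function name `mtv_at_threshold`, the array layout, the `exclude` mask standing in for bladder masking, the use of "greater than or equal to" at the cut-off and the voxel volume are all assumptions made for the example. It also omits the connected-region selection that a clinical tool would normally apply.

```python
import numpy as np

def mtv_at_threshold(suv, fraction, voxel_volume_ml, exclude=None):
    """Volume (ml) of voxels at or above `fraction` of SUVmax.

    suv: 3D array of SUV values around the tumour (hypothetical input).
    exclude: optional boolean array of voxels to ignore, e.g. a bladder mask.
    """
    suv = np.asarray(suv, dtype=float)
    if exclude is None:
        valid = np.ones(suv.shape, dtype=bool)
    else:
        valid = ~np.asarray(exclude, dtype=bool)
    suv_max = suv[valid].max()  # SUVmax is taken after masking
    inside = valid & (suv >= fraction * suv_max)
    return inside.sum() * voxel_volume_ml

# Toy volume: a two-voxel "tumour" and one hot "bladder" voxel.
suv = np.zeros((3, 3, 3))
suv[1, 1, 1] = 10.0
suv[1, 1, 2] = 4.0
suv[0, 0, 0] = 20.0
bladder = np.zeros((3, 3, 3), dtype=bool)
bladder[0, 0, 0] = True

mtv30_masked = mtv_at_threshold(suv, 0.30, 0.5, exclude=bladder)
mtv60_unmasked = mtv_at_threshold(suv, 0.60, 0.5)
```

In the toy example the unmasked 60% contour captures only the hot bladder voxel, which mirrors the situation that the bladder masking criterion is intended to catch.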
Method 2: Semi-automated adaptive threshold contour generation with isocontour method
Percentage SUV max outlining was also performed using an ellipsoid isocontour method (Siemens Syngo.Via, Siemens Healthineers AG, Germany) independently by two experienced observers (MA and TB, 5 and 15 years' PET/CT experience respectively). An ellipsoid contour was drawn around the tumour avoiding adjacent physiologic structures and negating the need for bladder masking. This then constrained the volume from which the different thresholds were obtained. The isocontours were drawn separately by the two observers. No manual adjustment of the final tumour thresholds was possible with this software.
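The idea of constraining the search volume before thresholding can be sketched as follows. This is an illustrative Python/NumPy example only and does not reproduce the commercial software: the axis-aligned ellipsoid, the voxel-unit centre and radii, and the function names `ellipsoid_mask` and `constrained_mtv` are assumptions for the sketch.

```python
import numpy as np

def ellipsoid_mask(shape, centre, radii):
    """Boolean mask of voxels inside an axis-aligned ellipsoid (voxel units)."""
    grids = np.ogrid[tuple(slice(0, n) for n in shape)]
    dist = sum(((g - c) / r) ** 2 for g, c, r in zip(grids, centre, radii))
    return dist <= 1.0

def constrained_mtv(suv, centre, radii, fraction, voxel_volume_ml):
    """Threshold at `fraction` of the SUVmax found inside the ellipsoid only."""
    suv = np.asarray(suv, dtype=float)
    inside = ellipsoid_mask(suv.shape, centre, radii)
    suv_max = suv[inside].max()  # activity outside the ellipsoid is ignored
    selected = inside & (suv >= fraction * suv_max)
    return selected.sum() * voxel_volume_ml

# Toy volume: tumour voxels near the centre, a hot voxel in one corner.
suv = np.zeros((5, 5, 5))
suv[2, 2, 2] = 10.0
suv[2, 2, 3] = 5.0
suv[0, 0, 0] = 50.0

mask = ellipsoid_mask(suv.shape, (2, 2, 2), (1.5, 1.5, 1.5))
mtv30 = constrained_mtv(suv, (2, 2, 2), (1.5, 1.5, 1.5), 0.30, 1.0)
```

Because SUV max is read from inside the ellipsoid, the hot corner voxel neither sets the threshold nor enters the volume, which is why no separate bladder masking step is needed in this approach.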

Method 3: Automated gradient method
Using the Automated Gradient-based method (MIM Software Inc., Cleveland OH, USA), which places the contour boundary at the location where the signal gradient is the greatest [30], two observers (TB and MA) produced segmentations. The technique required the observer to select the tumour with two perpendicular cross hairs which then generated the segmentation. Any adjacent structures which were outlined but were not related to the tumour were manually removed. The same manual adjustment scoring system used for method 1 was again utilised. Bladder masking was also utilised if required.
For each method, the observers received training in ten random cases with the application specialist.
MRI pelvis protocol and image analysis
Staging MRI pelvis was performed at the local network hospital or the tertiary centre. As such, there was variation in MRI scanner and exact protocol. However, all scans were performed at 1.5 T and as a minimum included 2D small field of view T2-weighted (T2W) sequences in sagittal and axial oblique (perpendicular to the long axis of the cervix) planes. The maximum slice thickness was 5 mm across the network as per GYN GEC-ESTRO (at the tertiary centre sagittal T2 MRI parameters: turbo-spin echo, TR 2275 ms, TE 90 ms, voxel size 0.57/0.57/5.0 mm, thickness/interval 5 mm/0 mm). Additional sequences included T1-weighted and diffusion-weighted images.
A single radiologist (NB, 12 years' experience pelvic MR imaging) outlined the cervical tumours on MRI on the sagittal T2W sequence [31]. The reader manually contoured around the outer edge of the cervical tumour on each image correlating with other sequences as necessary. The total tumour volume was automatically calculated as the sum of each of the cross-sectional volume measurements (cross-sectional area multiplied by section thickness) [16,32]. Where the primary tumour contained regions of necrosis centrally, these were included in the volume measurement. Care was taken to avoid the inclusion of adjacent normal tissue in the region of interest (ROI).
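The slice-summation volume described above (cross-sectional area multiplied by section thickness, summed over slices) amounts to the following arithmetic. The sketch is in Python; the function name `volume_from_slices` and the example areas are hypothetical, and contiguous slices with no gap are assumed.

```python
def volume_from_slices(areas_cm2, slice_thickness_cm):
    """Sum of (contoured area x slice thickness) over all contoured slices."""
    return sum(area * slice_thickness_cm for area in areas_cm2)

# Three contoured sagittal slices at 5 mm (0.5 cm) thickness.
example_volume_cm3 = volume_from_slices([2.0, 4.0, 3.0], 0.5)
```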

Statistical analysis
Statistical analysis was performed using SPSS (v22, IBM, New York, US). The MRI and PET/CT volumes at different thresholds were correlated using scatterplots and the Pearson correlation test. Correlation is the extent to which 2 or more variables are associated with each other, and the strength of the relationship is assigned an r value. The correlation and percentage difference of each MTV x% on PET/CT with the MRI volume were calculated. Correlation r values were scored as follows: small 0.1 to 0.29, medium 0.3 to 0.49 and large 0.5 to 1.0 [33]. A p value < 0.05 was considered to indicate statistical significance.
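As an illustration of these two summary measures, the following Python/NumPy sketch computes a mean percentage difference from the MRI volume and a Pearson r with the strength bands quoted above. It is not the SPSS procedure used in the study. Two points are assumptions: the percentage difference is taken per patient and then averaged (the text does not state whether this or a difference of means was used), and the band boundaries are applied as cut-offs at 0.1, 0.3 and 0.5. All function names are hypothetical.

```python
import numpy as np

def mean_percentage_difference(mtv, mri):
    """Mean of per-patient (MTV - MRI) / MRI, in percent (assumed definition)."""
    mtv = np.asarray(mtv, dtype=float)
    mri = np.asarray(mri, dtype=float)
    return float(np.mean((mtv - mri) / mri * 100.0))

def pearson_r(x, y):
    """Pearson correlation coefficient between two sets of volumes."""
    return float(np.corrcoef(x, y)[0, 1])

def correlation_strength(r):
    """Map |r| onto the small/medium/large bands; None below 0.1."""
    a = abs(r)
    if a >= 0.5:
        return "large"
    if a >= 0.3:
        return "medium"
    if a >= 0.1:
        return "small"
    return None

example_diff = mean_percentage_difference([90.0, 110.0], [100.0, 100.0])
example_r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

A negative mean percentage difference corresponds to PET underestimating the MRI volume, and a positive value to overestimation.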
Inter-observer variability in the volume measurements at each threshold, following manual correction if required, was assessed using the intraclass correlation coefficient (ICC) estimates along with their 95% confidence intervals using a two-way random, absolute agreement, single measures model. Values less than 0.5, between 0.5 and 0.75, between 0.75 and 0.9, and greater than 0.90 are indicative of poor, moderate, good and excellent reliability, respectively [34].
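For readers unfamiliar with this ICC form, the point estimate for a two-way random, absolute agreement, single measures model can be written from the ANOVA mean squares as (MSR - MSE) / (MSR + (k - 1) MSE + k (MSC - MSE) / n), where MSR, MSC and MSE are the between-subject, between-rater and residual mean squares, n is the number of subjects and k the number of raters. The Python/NumPy sketch below implements only this point estimate, without confidence intervals. The function name and the small rating table (a textbook-style example, not data from this study) are assumptions for illustration.

```python
import numpy as np

def icc_absolute_single(ratings):
    """ICC point estimate: two-way random, absolute agreement, single measures.

    ratings: array of shape (n_subjects, k_raters) with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

# Six subjects rated by four raters (illustrative values only).
table = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
         [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
example_icc = icc_absolute_single(table)
```

Because this form penalises systematic offsets between raters as well as random disagreement, a reader who consistently draws larger volumes lowers the ICC even if the two sets of volumes are perfectly correlated.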
A paired t test was used to compare the independent readings of reader 1 and reader 2. To compare each reader to the MRI volumes, an independent t test was used. A 2-tailed paired t test with 95% CI was also used to compare each MTV threshold for the separate readers with the MRI volume. Bland-Altman plots were used to visually assess the distribution of differences (spread of points along the y-axis) and to compare the distribution of estimates obtained for segmentations (spread and separation of points along the x-axis). In addition, correlation between low, intermediate and high SUV max for select MTV thresholds and the presence or absence of necrosis was made with the MRI volume.
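The quantities underlying a Bland-Altman plot can be sketched as below in Python/NumPy. The bias and 95% limits of agreement (bias ± 1.96 SD of the differences) follow the standard construction. The slope of the differences regressed on the means is included as one common way of looking for proportional bias; whether the authors assessed proportional bias in this way is an assumption, and the function name and example volumes are hypothetical.

```python
import numpy as np

def bland_altman(a, b):
    """Bias, 95% limits of agreement and slope of difference versus mean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a - b                 # plotted on the y-axis
    mean = (a + b) / 2.0         # plotted on the x-axis
    bias = diff.mean()
    sd = diff.std(ddof=1)
    slope = np.polyfit(mean, diff, 1)[0]  # non-zero slope suggests proportional bias
    return {
        "bias": float(bias),
        "lower": float(bias - 1.96 * sd),
        "upper": float(bias + 1.96 * sd),
        "slope": float(slope),
    }

# Differences that grow with volume, i.e. a proportional bias.
example = bland_altman([10.0, 20.0, 30.0], [9.0, 17.0, 25.0])
```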

Study population
Between January 2009 and December 2016, 118 patients with cervical cancer underwent staging PET/CT imaging at our institution. Eleven patients were excluded as they had been treated previously, 14 for lack of corresponding staging MRI pelvis and 12 for inability to perform volume measurements at the different MTV thresholds due to their very small volume (less than 5 ml [28,29]). 81 patients were therefore included in this study. The time interval between the staging MR and PET/CT was a mean of 16.2 days (range 0-64 days).
38% of patients had FIGO [35] stage IIB disease (Table 2). The mean primary tumour volume was 85.4 cm3 on MRI (range 6.7-413). The mean SUV max of the primary tumour was 15.1 (± 6.9 SD). The mean and range of SUV max according to histological subtype are presented in Table 2.

Bladder masking
For method 1, bladder masking prior to auto-contouring was performed on 86% of patients (Fig. 1 and Table 3). This was dependent on the MTV threshold used, with a greater requirement at lower thresholds. At MTV 60, 61 (75.3%) of PET scans required masking, which increased to 77 (95.1%) at MTV 20. For method 2, no bladder masking was performed as this method constrains an elliptical volume avoiding bladder (where possible). For method 3, bladder masking was performed in 4% (Table 3).

Correlation between tumour volumes on MRI and PET
Tumour volumes as measured on MRI and at different MTV thresholds are summarised in Table 5. There was a large positive correlation between MRI volume and MTV  and no correlation with MRI volume and SUV 2.5 .
There was no significant difference between the MRI volume and MTV 25 for both readers. There was a significant difference between MRI volume and the other MTV values (Table 5).
For method 1, MTV 25 and MTV 30 were closest to the MRI volumes for both readers (mean percentage change from MRI volume of 2.9% and 13.4% for MTV 25 and − 13.1% and − 2.0% for MTV 30 for readers 1 and 2 respectively) (Table 5, Fig. 1). For method 2 (ellipse isocontour method), MTV 25 and MTV 30 were also closest to the MRI volumes for both readers (mean percentage change from MRI volume of − 3.9% and − 8.6% for MTV 25 and − 16.9% and − 19% for MTV 30 for readers 1 and 2 respectively) (Table 5 and Fig. 2). In addition, the best correlation was at the MTV 25 and MTV 30 for both readers, although there was also excellent correlation with an r > 0.75 between MTV 20 to MTV 40 and the absolute SUV 2.5.
The higher thresholds MTV 40 to MTV 60 significantly underestimated tumour volumes for both percentage threshold methods. For example, the MTV 40 had a good positive correlation with the MRI volume, but the volumes were underestimated by a mean of 39.1% for reader 1 and 31.1% for reader 2 for method 1 and 38.2% and 38.5% respectively for method 2.
Using a fixed SUV 2.5 led to a marked overestimation of tumour size (mean 69.7% overestimation), as adjacent background structures were included in the volume using method 1, whilst method 2 also showed a significant overestimation but less marked as the elliptic isocontour method permits a restraining volume.
For method 3, automated gradient method, there was a good correlation between the PET segmentation and the MRI volume (reader 1: 0.81 and reader 2: 0.79). This method, however, significantly underestimated the tumour volume

Inter-observer agreement
The inter-observer agreement (Table 6) was good to excellent for method 1 for thresholds MTV 30 to MTV 60 (ICC estimates ranging from 0.898 to 0.976 with narrow 95% confidence intervals). Inter-observer agreement was moderate at the lower thresholds (ICC estimates of 0.534 and 0.617 respectively for the MTV 20 and MTV 25 with wide 95% confidence intervals).
For method 2, there was excellent inter-observer agreement across all thresholds from MTV 25 to MTV 60 (ICC estimates ranging from 0.935-0.973 with narrow 95% confidence intervals). Although readers could vary in the selection of the isocontour boundaries, no manual adjustments were possible with this method.
When Bland-Altman plots were performed (Supplementary Fig. 3), only the MTV 25 for reader 1 on method 1 demonstrated no proportional bias between the MRI and PET segmentations.

Optimal threshold
In summary, there was no significant difference between the mean MRI volume and MTV 25

Discussion
Variations in FDG uptake for different histological subtypes have been previously reported with squamous cell carcinoma (SCC) being the histological type with the highest metabolic intensity and neuroendocrine tumours often presenting a heterogeneous uptake including a well-differentiated neuroendocrine part with no/low uptake [36,37]. Whilst SCC showed the highest uptake, overall, we did not find a difference between SCC, adenocarcinoma, adenosquamous carcinoma and neuroendocrine histological subtypes ( Table 2). It is possible that this resulted from the vast majority (80%) in our cohort being of the SCC subtype.
The optimal method of outlining cervical tumour volume on PET/CT remains contentious with various segmentation methods and thresholds described in the literature (Table 1). For pelvic malignancies, inclusion of adjacent high activity in physiologic structures (bladder, ureters and bowel) is particularly problematic requiring manual adjustment of the automated volume that has been mentioned but not fully documented by previous studies.
This study assessed three different segmentation methods to outline the cervical tumours: using percentage SUV max thresholds with bladder masking when required (method 1), percentage SUV max thresholds using isocontour method around the tumour prior to different SUV max thresholds being applied (ellipsoid isocontour method, method 2), and an automated gradient method (method 3). This is the first study to assess inter-observer agreement of segmentation methods in cervical tumours and accurately document when any bladder masking and manual adjustment was required.
Our study has shown for method 1, MTV 25 was closest to MRI volume for reader 1 and MTV 30 closest to MRI volume for reader 2. For method 2, MTV 25 had the closest correlation with MRI for both readers. Method 3 demonstrated a consistent technique that highly correlated between observers but significantly underestimated the MRI volume.
The Bland-Altman plots ( Supplementary Fig. 3) demonstrated no significant difference only for reader 1 for method 1 at MTV 25 . All the other plots demonstrated proportional bias. The reason for this is that at extreme values, there was divergence between the MTV and the MRI values. This may be due to underlying extremes of SUV max and/or the presence of necrosis (Supplementary Figs. 1 and 2).
The MTV 30 threshold had excellent reproducibility between readers with narrow confidence intervals whilst MTV 25 had moderate reproducibility with wider confidence intervals using method 1 but narrower confidence intervals on method 2 which permitted a constraining volume. Although the MTV 25 was the only threshold to show no significant difference to MRI volume using paired t test for both readers using both pieces of software, this was at a trade-off of more requirement for manual adjustment using method 1 and thus reduced inter-observer agreement. Therefore, we propose that MTV 30 offers the best combination of accuracy and interobserver agreement along with less impact of the presence of necrosis and the extremes of SUV max .
Method 2 (ellipsoid isocontour method) had excellent correlation with MRI and excellent inter-observer agreement. However, it was not always possible to encompass the entire tumour without including bladder using the ellipsoid isocontour method. This method had a much higher correlation of above 0.75 for a number of different thresholds and overall the PET volumes were better correlated with the MRI volumes. This was due to manual adjustment not being feasible. Although we aimed to avoid manual adjustment in large tumours surrounded by bladder it was sometimes not possible to entirely exclude the bladder and only have tumour within the elliptic isocontour (Fig. 2). In future, if the constraining contour was not limited to a rigid ellipse, this method could be optimised further. The fact that no manual adjustment was performed on the VOIs was an added advantage because with method 1, even at the best MTV threshold, 44% required manual adjustment. Method 3 (automated gradient) was very simple to implement but required increasing adjustment for those that created segmentations which encompassed surrounding structures (Fig. 3). There was excellent inter-observer agreement but there was gross underestimation of the tumour compared with the MRI reference standard for the gradient method.
The gradient edge detection method identifies tumour based on a change in count levels at the tumour border. The gradient method evaluated in this paper calculates spatial derivatives along tumour radii then defines the tumour edge based on derivative levels and continuity of the tumour edge [30]. Compared to thresholding approaches, the gradient-based method better deals with the inherent shortcomings of PET images, such as low SNR and resolution. In phantom and surgical lung cancer studies, gradient-based methods have been proposed to best assess tumour volume compared to threshold methods [30,38]. To the best of our knowledge, this is the first paper to compare threshold methods with a gradient method in cervical cancer. However, despite good correlations with the MRI volume, the gradient method consistently underestimated cervical tumour volume. In lung cancers, the change in count level at the tumour border relative to background lung is more distinct, whereas in cervical cancers the change in count level at the tumour border may be smaller, which could lead to underestimation. In addition, cervical tumours tend to have irregular rather than spherical shapes and it is possible this may lead to underestimation of the tumour. Currently, for this method, the MTV is generated by plotting two perpendicular lines; however, in the future, this method will be optimised to take into account irregularly shaped lesions.
Traditionally, MTV 40 has been used in the calculation of the MTV of cervical tumours based on a study by Miller and Grigsby [10]. This study, in only 13 subjects, suggested that MTV 40 was the optimal threshold, using separately acquired CT images as a visual correlate. However, MRI, and not CT, is considered the gold standard for measuring cervical cancer tumour volume as cervical tumours are poorly demonstrated on CT [39]. In general, for individual tumours as the threshold lowers the measured metabolic tumour volume increases. In our study, use of the MTV 40 led to a significant underestimation of tumour volume for both percentage SUV max methods. However, at thresholds below MTV 30 , there was a higher likelihood of overestimating the tumour volume using PET.
As the MTV threshold is based on the SUV max , it was a concern that lesions with low uptake will have an overestimation of their metabolic volume and therefore a poorer correlation with MRI volume. Concordant with studies in lung cancer [40], we also demonstrated overestimation of the MTV in lesions with a low SUV max most marked at MTV 25 (Supp. Fig. 1).
Recent cervical cancer studies have independently explored the optimal MTV thresholds [13,16,17]. Upasani et al. in a study of 74 patients with stage IIB or IIIB squamous cell cervical cancer concluded that MTV 30 and MTV 35 were most optimal using tri-diameter ellipsoid based measurements of T2W MRI as the reference standard [13]. However, not all tumours are simple ellipsoid shape and this method may incorrectly estimate tumour volume in irregularly shaped tumours which may explain why they recommended a higher threshold compared to our study if MRI volume was potentially underestimated. Lai et al. evaluated 29 primary cervical cancer cases and as in our study, reported MTV 30 to correlate best with MRI volume, which was measured by the same method as our study [16]. Manual adjustment was mentioned but not documented and inter-observer agreement was not assessed.
Cegła et al. assessed 30 cervical cancer patients and concluded that the MTV 35 was the closest to the MRI reference standard; however, they did not detail the method of MRI volume measurement [17]. In that study, only three thresholds were evaluated and this limited the scope of outcomes. Using PET/MRI, Sun et al. [14] found that for their 35 subjects, there was no difference at the 35% or 40% threshold MTVs, T2W images and diffusion-weighted MR images. However, their numbers were small, and no mention was made of whether the tumour segmentations on PET encompassed the entire tumour, i.e. whether there were photopenic regions due to cavitation, etc. In our study, 35 tumours had necrosis and 46 did not, and all areas were centrally located (Supp. Fig. 2). DWI is not established for accurate volume measurement, with limited reports in the literature, and since it assesses tumour cellularity, it generates different measurements compared to T2 volume. The DWI volumes in their study were generally lower than the T2W MRI volumes whilst other studies have reported the DWI volumes to be generally higher than T2-weighted volumes [41].
Fig. 3 Method 3, the automated gradient method. Segmentation of the primary tumour: a axial, b sagittal, c coronal. The different colours show repeated attempts of segmentation from the same reader as an example. The bladder (red arrow) is far away from the primary tumour
Other studies have used a fixed absolute SUV 2.5 [3,25]. Although fixed thresholds can be useful in regions with very low background activity such as the lung, in the pelvis, a fixed threshold may include surrounding background structures and lead to overestimation of the tumour volume. In our study, the fixed SUV 2.5 led to 69.7% overestimation of tumour volume when compared to the MRI volume and required the most manual adjustment (Table 3) using method 1 due to the inability to use a constraining volume with this method. The situation was markedly improved, however, using method 2, where the isocontour permits a restrained volume (percentage overestimation of the tumour volume 27.6 for reader 1 and 22 for reader 2). Our findings are consistent with Zhang et al. who reported SUV 2.5 overestimated cervical tumour volume (based on T2-weighted MRI) in the majority of cases and concluded it was unsuitable for thresholding of cervical tumours [15].
Bladder masking overcame one of the reasons previously cited for not using lower SUV max thresholds for tumour volume estimation [10] (Table 2). For method 1, overall 86% had bladder masking and the requirement was greater at lower thresholds (93% required bladder masking at MTV 25 , 89% at MTV 30 and 84% at MTV 40 ). Other studies have mentioned the use of this technique but have not mentioned the frequency of its usage [12]. This is the first study to accurately document the requirement for bladder masking and manual adjustment. Bladder masking was not available for method 2 and for method 3, only 4% required bladder masking. In our study, one observer performed the bladder masking for method 1 but as the masking was automated, this was unlikely to impact on the inter-observer variation.
All methods have their strengths and weaknesses. Ideally, the method of MTV delineation should be accurate, easy to use and reproducible. It should therefore be as automated as is feasible, although this will depend on locally available software. In addition, readers should be aware that absolute MTV measurement can vary with the software method available.
High-resolution T2-weighted sequences are recognised as the gold standard for tumour outlining by GYN GEC-ESTRO working group guidelines for cervical cancer brachytherapy tumour outlining [31]. The MRI-based tumour volume technique used in our study (multiplying the sum of the tumour areas by the slice thickness) is considered the standard MRI volume technique, closely correlating with the gross specimen [32]. In our study the MRI volumes were generated by a single experienced observer; however, using the same method, Dimopoulos et al. [42] demonstrated acceptable inter-observer variability from two independent observers. In addition, manual segmentation of the primary tumour using individual slices is more accurate than using three orthogonal measurements of the tumour to compute the volume of an ellipsoid, as most cervical cancers are not ellipsoid [43]. Using volumetric-based MRI measurement, the MTV 25 correlated closest with the MRI volume for reader 1 and MTV 30 for reader 2. As mentioned earlier, studies using 3 orthogonal measurements suggested MTV 30 and MTV 35 correlated best with MRI volumes [13]. Lai et al. used a similar method to this study but averaged the sagittal T2W volumes obtained by two readers and found that MTV 30 was the closest to the MRI volume [16].
Although radiotherapy planning is based on MRI volume, due to the excellent depiction of patient anatomy and dose constraints to normal structures, there is a role for PET in patients unable to have an MRI and there may be a role of PET alongside MRI for auto-contouring of tumours for radiotherapy planning. In addition, the volumetric data derived from the MTV can be further assessed in radiomics studies in order to predict prognosis and evaluate the future success of adjuvant therapy.
Partial volume effect (PVE) may also influence the PET volume calculation, particularly for small tumours. Whether PVE leads to over- or underestimation of MTV depends on the target-to-background ratio (TBR). The size of more avid tumours with a higher TBR may be overestimated, and that of tumours with a lower TBR may be underestimated [44]. In our study, like other groups [11,14], we excluded small tumours (< 5 cm³) because of the PVE. MR volume is less susceptible to PVE owing to its higher spatial resolution.
A limitation of our study was that in some cases, mainly at MTV20 for method 1, the automated volume included a large amount of normal structures or physiologic activity (sometimes even extending along the ureters to the kidneys and including the heart) and was deemed 'too difficult' to manually adjust; thus, the MTV was not documented. This could lead to bias; however, it involved very few cases (for method 1: 2 at MTV30, 6 at MTV25, 7 at MTV20; method 3: 1 for each reader) (Fig. 1 and Table 4). We would propose that, in clinical practice, if the MTV30 is too difficult to manually correct, MTV35 should be selected instead.
Although there were two observers for each method, the second observer was different for method 1 and the level of clinical experience of the observers was different (15 years versus 3-5 years). However, regardless of the difference in the level of clinical PET/CT experience, since MTV is not routinely performed clinically, all observers received the same software training prior to the study. In addition, there was consistently good to excellent inter-observer agreement across all methods, suggesting that years of clinical experience did not affect the output.
The time taken for segmentation has only been briefly discussed in the literature [45]. Although the time taken for outlining using methods 1 and 2 was not accurately recorded, the former took considerably longer, approximately 15 min per scan, compared with 5 min per scan for the latter. The time taken per scan for method 3 varied greatly, from 5 min for quick scans that required no adjustment up to 20 min for the more demanding scans.
Another limitation of our study was that we used a correlation method to compare the PET and MRI volumes. Agreement in volume does not demonstrate that the tumour volumes obtained from the two modalities match or overlap spatially. One way to overcome this is to use the Dice similarity coefficient [46], which measures the degree of overlap [27]. However, because bladder filling changes the position of the tumour, it may not be possible to use this method to truly compare segmentations from different modalities. Using the Dice coefficient on the same modality is a more accurate approach, and creating masks for all the PET images would be a useful area of work.
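The distinction between volume agreement and spatial overlap can be made concrete with a minimal sketch of the Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|). The function name and the toy voxel coordinates are hypothetical, and the sketch assumes both segmentations have already been resampled onto a common voxel grid.

```python
def dice_coefficient(seg_a, seg_b):
    """Dice similarity coefficient of two segmentations.

    Each segmentation is an iterable of voxel coordinates on a shared grid
    (an assumption of this sketch). Returns a value between 0 and 1.
    """
    a, b = set(seg_a), set(seg_b)
    if not a and not b:
        return 1.0  # two empty segmentations are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


# Two hypothetical segmentations with the same volume (4 voxels each),
# where the second is shifted by two voxels along x.
mri = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
pet = {(2, 0, 0), (3, 0, 0), (4, 0, 0), (5, 0, 0)}

print(len(mri) == len(pet))        # True: the volumes agree exactly
print(dice_coefficient(mri, pet))  # 0.5: only half of the voxels overlap
```

Identical volumes here coexist with only partial overlap, which a volume-based comparison alone cannot detect.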
In PET/MRI, even when the PET and MRI images are obtained contemporaneously, there may still be some difference in the appearance of the tumour between the two modalities due to variable bladder filling in the time interval between acquisitions. The few studies [14,47,48] that have used PET/MRI for volume assessment reported excellent co-registration between the two modalities, with the caveat that no mention of bladder filling was made. Simultaneous acquisition may improve the degree of overlap between the two modalities.
All the FDG PET/CT analysis was performed with the same reconstructions on retrospective data from the same scanner. Two other studies [13,16] from other centres using different PET manufacturers (GE Discovery VCT) and reconstruction parameters also demonstrated the same optimal threshold. The effect of resolution recovery on the MTV has not been explored but as this method of reconstruction becomes more common, this may impact on the optimal segmentations.
A recent radiomics study recognised that MTVs connecting with the bladder are a major problem for most segmentation methods and utilised MTV 50% to avoid the bladder, at the cost of under-sampling the tumour volume [49]. A systematic review and meta-analysis reported that MTV and TLG were significant prognostic factors in patients with cervical cancer [5] in spite of different methods of outlining. Future work should assess whether the MTV threshold or method, applied within the same patient group, has a different impact on outcome prediction and radiomics.
The widespread adoption of MTV will rely on ease of use and reproducibility between observers. Future software development may permit selection of a constraining volume (as in method 2) and, in addition, slight adjustment of the constraining volume in cases where the tumour and bladder cannot be entirely separated by the isocontour method.

Conclusions
In conclusion, for tumours > 5 cm³, MTV30 provides the optimal correlation with MRI volume, taking into consideration the excellent inter-reader agreement, the lower requirement for manual adjustment, and the smaller impact of necrosis and of the extremes of SUVmax. Depending on the local software method for MTV outlining, masking of bladder activity or the use of a constraining volume prior to autocontouring enables volume measurement at lower SUV thresholds without inadvertent inclusion of bladder activity.
Funding Open access funding provided by Imperial College London.

Compliance with ethical standards
Ethical approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the principles of the 1964 Declaration of Helsinki and its later amendments or comparable ethical standards. Institutional ethical approval for retrospective analysis of human data was obtained and informed consent was waived, as is typical for retrospective analysis of anonymised imaging data. This article does not describe any studies performed on animals.