Variations in FDG uptake for different histological subtypes have been previously reported with squamous cell carcinoma (SCC) being the histological type with the highest metabolic intensity and neuroendocrine tumours often presenting a heterogeneous uptake including a well-differentiated neuroendocrine part with no/low uptake [36, 37]. Whilst SCC showed the highest uptake, overall, we did not find a difference between SCC, adenocarcinoma, adenosquamous carcinoma and neuroendocrine histological subtypes (Table 2). It is possible that this resulted from the vast majority (80%) in our cohort being of the SCC subtype.
The optimal method of outlining cervical tumour volume on PET/CT remains contentious with various segmentation methods and thresholds described in the literature (Table 1). For pelvic malignancies, inclusion of adjacent high activity in physiologic structures (bladder, ureters and bowel) is particularly problematic requiring manual adjustment of the automated volume that has been mentioned but not fully documented by previous studies.
This study assessed three different segmentation methods to outline the cervical tumours: using percentage SUVmax thresholds with bladder masking when required (method 1), percentage SUVmax thresholds using isocontour method around the tumour prior to different SUVmax thresholds being applied (ellipsoid isocontour method, method 2), and an automated gradient method (method 3). This is the first study to assess inter-observer agreement of segmentation methods in cervical tumours and accurately document when any bladder masking and manual adjustment was required.
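As a minimal sketch of the percentage SUVmax thresholding that underlies methods 1 and 2, the following shows how an MTV at a given threshold could be computed from a set of voxels, with an optional mask to exclude bladder voxels. The voxel layout (a dictionary keyed by index), the function name and the toy values are assumptions made for illustration; this is not the vendor software used in the study.

```python
def mtv_percent_threshold(suv, voxel_volume_cm3, percent, exclude=None):
    """Volume (cm3) of voxels with SUV >= percent% of SUVmax.

    suv: dict mapping a voxel index (x, y, z) to its SUV (hypothetical layout).
    exclude: optional set of voxel indices to mask out first, e.g. bladder.
    """
    exclude = exclude or set()
    # Mask first, so that SUVmax is taken from the remaining voxels only.
    kept = [value for index, value in suv.items() if index not in exclude]
    cutoff = max(kept) * percent / 100.0
    return sum(1 for value in kept if value >= cutoff) * voxel_volume_cm3


# Toy data (invented): four tumour voxels and one hot bladder voxel.
toy_suv = {
    (0, 0, 0): 10.0,
    (1, 0, 0): 6.0,
    (2, 0, 0): 3.5,
    (3, 0, 0): 2.0,
    (4, 0, 0): 12.0,  # bladder
}
bladder = {(4, 0, 0)}

for percent in (40, 30, 20):
    volume = mtv_percent_threshold(toy_suv, 0.5, percent, exclude=bladder)
    print(f"MTV{percent}: {volume} cm3")
```

In this toy example the volume grows as the threshold is lowered, and omitting the mask lets the bladder voxel both set SUVmax and enter the volume.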
Our study has shown that, for method 1, MTV25 was closest to MRI volume for reader 1 and MTV30 was closest to MRI volume for reader 2. For method 2, MTV25 had the closest correlation with MRI for both readers. Method 3 demonstrated a consistent technique that highly correlated between observers but significantly underestimated the MRI volume.
The Bland-Altman plots (Supplementary Fig. 3) demonstrated no significant difference only for reader 1 for method 1 at MTV25. All the other plots demonstrated proportional bias. The reason for this is that at extreme values, there was divergence between the MTV and the MRI values. This may be due to underlying extremes of SUVmax and/or the presence of necrosis (Supplementary Figs. 1 and 2).
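For readers less familiar with Bland-Altman analysis, the quantities behind such plots can be sketched as follows: the mean difference (bias), the 95% limits of agreement, and the slope of the differences regressed on the pairwise means, where a non-zero slope indicates proportional bias. The function name and the paired volumes are invented for illustration and are not data from this study.

```python
from statistics import mean, stdev


def bland_altman(pet_volumes, mri_volumes):
    """Return (bias, (lower, upper) limits of agreement, proportional-bias slope)."""
    diffs = [p - m for p, m in zip(pet_volumes, mri_volumes)]
    means = [(p + m) / 2 for p, m in zip(pet_volumes, mri_volumes)]
    bias = mean(diffs)
    sd = stdev(diffs)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    # Least-squares slope of the differences on the pairwise means.
    mean_of_means = mean(means)
    sxx = sum((x - mean_of_means) ** 2 for x in means)
    sxy = sum((x - mean_of_means) * (d - bias) for x, d in zip(means, diffs))
    return bias, limits, sxy / sxx


# Invented volumes (cm3) in which PET diverges from MRI as tumours get larger.
pet = [10, 20, 30, 40]
mri = [10, 22, 36, 50]
bias, limits, slope = bland_altman(pet, mri)
print(bias, limits, slope)
```

A negative slope in this toy example reflects differences that widen with increasing volume, which is the pattern described as divergence at extreme values.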
The MTV30 threshold had excellent reproducibility between readers with narrow confidence intervals whilst MTV25 had moderate reproducibility with wider confidence intervals using method 1 but narrower confidence intervals on method 2 which permitted a constraining volume. Although the MTV25 was the only threshold to show no significant difference to MRI volume using paired t test for both readers using both pieces of software, this was at a trade-off of more requirement for manual adjustment using method 1 and thus reduced inter-observer agreement. Therefore, we propose that MTV30 offers the best combination of accuracy and inter-observer agreement along with less impact of the presence of necrosis and the extremes of SUVmax.
Method 2 (ellipsoid isocontour method) had excellent correlation with MRI and excellent inter-observer agreement. However, it was not always possible to encompass the entire tumour without including bladder using the ellipsoid isocontour method. This method had a much higher correlation of above 0.75 for a number of different thresholds and overall the PET volumes were better correlated with the MRI volumes. This was due to manual adjustment not being feasible. Although we aimed to avoid manual adjustment in large tumours surrounded by bladder it was sometimes not possible to entirely exclude the bladder and only have tumour within the elliptic isocontour (Fig. 2). In future, if the constraining contour was not limited to a rigid ellipse, this method could be optimised further. The fact that no manual adjustment was performed on the VOIs was an added advantage because with method 1, even at the best MTV threshold, 44% required manual adjustment.
Method 3 (automated gradient) was very simple to implement but required increasing adjustment for those that created segmentations which encompassed surrounding structures (Fig. 3). There was excellent inter-observer agreement but there was gross underestimation of the tumour compared with the MRI reference standard for the gradient method.
The gradient edge detection method identifies tumour based on a change in count levels at the tumour border. The gradient method evaluated in this paper calculates spatial derivatives along tumour radii then defines the tumour edge based on derivative levels and continuity of the tumour edge. Compared to thresholding approaches, the gradient-based method better deals with the inherent shortcomings of PET images, such as low SNR and resolution. In phantom and surgical lung cancer studies, gradient-based methods have been proposed to best assess tumour volume compared to threshold methods [30, 38]. To the best of our knowledge, this is the first paper to compare threshold methods with a gradient method in cervical cancer. However, despite good correlations with the MRI volume, the gradient method consistently underestimated cervical tumour volume. In lung cancers, the change in count level at the tumour border is more distinct compared to background lung, whilst in cervical cancers the change in count level at the tumour border may be smaller, which could lead to underestimation. In addition, cervical tumours tend to have irregular rather than spherical shapes and it is possible this may lead to underestimation of the tumour. Currently, for this method, the MTV is generated by plotting two perpendicular lines; however, in the future, this method will be optimised to take into account irregularly shaped lesions.
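The principle of locating an edge from spatial derivatives can be illustrated with a deliberately simplified one-dimensional sketch: counts are sampled along a single radius from the tumour centre outwards, and the edge is placed at the steepest drop. This is an assumption-laden illustration only; the commercial gradient algorithm evaluated here also uses edge continuity across radii, and both profiles below are invented.

```python
def steepest_drop(profile):
    """Return (index, drop) for the largest fall between adjacent samples.

    profile: counts sampled along one radius, centre first (hypothetical input).
    The edge is taken to lie between samples index and index + 1.
    """
    diffs = [profile[i + 1] - profile[i] for i in range(len(profile) - 1)]
    index = min(range(len(diffs)), key=diffs.__getitem__)
    return index, -diffs[index]


# Invented profiles: a sharp border (as against background lung) and a
# shallow border (as may occur against pelvic soft tissue).
sharp_profile = [10.0, 10.0, 9.5, 2.0, 0.5, 0.5]
shallow_profile = [10.0, 9.0, 7.0, 6.0, 5.0, 4.5]

print(steepest_drop(sharp_profile))    # edge between samples 2 and 3
print(steepest_drop(shallow_profile))  # edge between samples 1 and 2
```

In the shallow profile the largest drop is much smaller and occurs closer to the centre, which shows how a less distinct border gives the derivative less to work with.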
Traditionally, MTV40 has been used in the calculation of the MTV of cervical tumours based on a study by Miller and Grigsby. This study, in only 13 subjects, suggested that MTV40 was the optimal threshold, using separately acquired CT images as a visual correlate. However, MRI, and not CT, is considered the gold standard for measuring cervical cancer tumour volume as cervical tumours are poorly demonstrated on CT. In general, for individual tumours, as the threshold lowers the measured metabolic tumour volume increases. In our study, use of the MTV40 led to a significant underestimation of tumour volume for both percentage SUVmax methods. However, at thresholds below MTV30, there was a higher likelihood of overestimating the tumour volume using PET.
As the MTV threshold is based on the SUVmax, it was a concern that lesions with low uptake would have an overestimation of their metabolic volume and therefore a poorer correlation with MRI volume. Concordant with studies in lung cancer, we also demonstrated overestimation of the MTV in lesions with a low SUVmax, most marked at MTV25 (Supp. Fig. 1).
Recent cervical cancer studies have independently explored the optimal MTV thresholds [13, 16, 17]. Upasani et al., in a study of 74 patients with stage IIB or IIIB squamous cell cervical cancer, concluded that MTV30 and MTV35 were optimal using tri-diameter ellipsoid based measurements of T2W MRI as the reference standard. However, not all tumours are a simple ellipsoid shape and this method may incorrectly estimate tumour volume in irregularly shaped tumours, which may explain why they recommended a higher threshold compared to our study if MRI volume was potentially underestimated. Lai et al. evaluated 29 primary cervical cancer cases and, as in our study, reported MTV30 to correlate best with MRI volume, which was measured by the same method as our study. Manual adjustment was mentioned but not documented and inter-observer agreement was not assessed.
Cegła et al. assessed 30 cervical cancer patients and concluded that the MTV35 was the closest to the MRI reference standard; however, they did not detail the method of MRI volume measurement. In this study, only three thresholds were evaluated and this limited the scope of outcomes. Using PET/MRI, Sun et al. found that for their 35 subjects, there was no difference between the 35% or 40% threshold MTVs, T2W images and diffusion-weighted MR images. However, their numbers were small, and no mention was made of whether the tumour segmentations on PET encompassed the entire tumour, i.e. whether there were photopenic regions due to cavitation, etc. In our study, 35 tumours had necrosis and 46 did not, and all areas were centrally located (Supp. Fig. 2). DWI is not established for accurate volume measurement, with limited reports in the literature, and since it assesses tumour cellularity, it generates different measurements compared to T2 volume. The DWI volumes in their study were generally lower than the T2W MRI volumes whilst other studies have reported the DWI volumes to be generally higher than T2-weighted volumes.
Other studies have used a fixed absolute SUV2.5 [3, 25]. Although fixed thresholds can be useful in regions with very low background activity such as the lung, in the pelvis, a fixed threshold may include surrounding background structures and lead to overestimation of the tumour volume. In our study, the fixed SUV2.5 led to 69.7% overestimation of tumour volume when compared to the MRI volume and required the most manual adjustment (Table 3) using method 1 due to the inability to use a constraining volume with this method. The situation was markedly improved, however, using method 2, where the isocontour permits a restrained volume (percentage overestimation of the tumour volume 27.6% for reader 1 and 22% for reader 2). Our findings are consistent with Zhang et al., who reported that SUV2.5 overestimated cervical tumour volume (based on T2-weighted MRI) in the majority of cases and concluded it was unsuitable for thresholding of cervical tumours.
Bladder masking overcame one of the reasons previously cited for not using lower SUVmax thresholds for tumour volume estimation (Table 2). For method 1, overall 86% had bladder masking and the requirement was greater at lower thresholds (93% required bladder masking at MTV25, 89% at MTV30 and 84% at MTV40). Other studies have mentioned the use of this technique but have not mentioned the frequency of its usage. This is the first study to accurately document the requirement for bladder masking and manual adjustment. Bladder masking was not available for method 2 and for method 3, only 4% required bladder masking. In our study, one observer performed the bladder masking for method 1 but as the masking was automated, this was unlikely to impact on the inter-observer variation.
All methods have their strengths and weaknesses. Ideally, the method of MTV delineation should be accurate, easy to use and reproducible. It should therefore be as automated as is feasible, although this will depend on locally available software. In addition, readers should be aware that absolute MTV measurement can vary with the software method available.
High-resolution T2-weighted sequences are recognised as the gold standard for tumour outlining by GYN GEC-ESTRO working group guidelines for cervical cancer brachytherapy tumour outlining. The MRI based tumour volume technique used in our study (multiplying the sum of the tumour areas by the slice thickness) is considered the standard MRI volume technique closely correlating with gross specimen. In our study the MRI volumes were generated by a single experienced observer; however, using the same method, Dimopoulos et al. demonstrated acceptable inter-observer variability from two independent observers. In addition, manual segmentation of the primary tumour using individual slices is more accurate than using three orthogonal measurements of the tumour to compute the volume of an ellipsoid as most cervical cancers are not ellipsoid. Using volumetric based MRI measurement, the MTV25 correlated closest with the MRI volume for reader 1 and MTV30 for reader 2. As mentioned earlier, studies using 3 orthogonal measurements suggested MTV30 and MTV35 correlated best with MRI volumes. Lau used a similar method to this study but averaged the sagittal T2W volumes obtained by two readers and found that MTV30 was the closest to the MRI volume.
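The two MRI volume calculations contrasted above can be written out as a short sketch: slice summation (sum of the outlined tumour areas multiplied by the slice thickness) and the tri-diameter ellipsoid approximation, V = (π/6) × d1 × d2 × d3. The function names and the example measurements are illustrative assumptions, not values from this study.

```python
import math


def volume_from_slices(areas_cm2, slice_thickness_cm):
    """Slice summation: sum of outlined tumour areas times slice thickness."""
    return sum(areas_cm2) * slice_thickness_cm


def ellipsoid_volume(d1_cm, d2_cm, d3_cm):
    """Tri-diameter ellipsoid approximation from three orthogonal diameters."""
    return math.pi / 6.0 * d1_cm * d2_cm * d3_cm


# Invented example: tumour areas outlined on four consecutive 0.5 cm slices.
areas = [2.0, 4.5, 5.0, 3.5]
print(volume_from_slices(areas, 0.5))    # 7.5 cm3
print(ellipsoid_volume(3.0, 2.5, 2.0))   # ellipsoid estimate from diameters
```

The ellipsoid estimate depends only on three diameters, so two tumours with the same diameters but different shapes receive the same volume, whereas slice summation follows the outlined contour on each slice.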
Although radiotherapy planning is based on MRI volume, due to the excellent depiction of patient anatomy and dose constraints to normal structures, there is a role for PET in patients unable to have an MRI and there may be a role of PET alongside MRI for auto-contouring of tumours for radiotherapy planning. In addition, the volumetric data derived from the MTV can be further assessed in radiomics studies in order to predict prognosis and evaluate the future success of adjuvant therapy.
Partial volume effect (PVE) may also influence the PET volume calculation, particularly for small tumours. Whether PVE leads to over or underestimation of MTV depends on target to background ratios (TBR). The size of more avid tumours with higher TBR may be overestimated and that of tumours with lower TBR may be underestimated. In our study, we, like other groups [11, 14], excluded small tumours < 5 cm3 due to the PVE. MR volume is less susceptible to PVE due to the higher spatial resolution.
A limitation of our study was that in some cases, mainly at MTV20 for method 1, the automated volume included a large amount of normal structures or physiologic activity (sometimes even extending along the ureters to the kidneys and including the heart) and was deemed ‘too difficult’ to manually adjust; thus, the MTV was not documented. This could lead to bias; however, it involved very few cases (for method 1: 2 at MTV30, 6 at MTV25, 7 at MTV20; method 3: 1 for each reader) (Fig. 1 and Table 4). We would propose that, in clinical practice, if the MTV30 is too difficult to manually correct, MTV35 should be selected instead.
Although there were two observers for each method, the second observer was different for method 1 and the level of clinical experience of the observers was different (15 years versus 3–5 years). However, regardless of the difference in the level of clinical PET/CT experience, since MTV is not routinely performed clinically, all observers received the same software training prior to the study. In addition, there was consistently good-excellent inter-observer agreement across all methods suggesting the years of clinical experience did not seem to impact the output.
The time taken for the segmentation has only been briefly discussed in the literature. Although the time taken for outlining using methods 1 and 2 was not accurately recorded, the former took a lot longer, approximately 15 min per scan, compared with 5 min per scan for the latter. The time taken for each scan for method 3 varied greatly from 5 min for the quick scans that required no adjustment to up to 20 min for the more demanding scans.
Another limitation of our study was that we used a correlation method to compare the PET and MRI volumes. Correlation of the volumes does not demonstrate that the tumour volumes obtained from the two modalities match or overlap. A method to overcome this is to use the DICE method or similarity coefficient, which measures the degree of overlap. However, due to the effect of bladder filling changing the position of the tumour, it may not be possible to use this method to truly compare the segmentations from different modalities. Using DICE on the same modality is definitely a more accurate method and creating masks for all the PET images would be a useful area of work.
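The Dice similarity coefficient for two segmentations A and B is 2|A ∩ B| / (|A| + |B|). A minimal sketch, assuming each segmentation is available as a set of voxel indices on a common grid (which, as noted above, may not hold across modalities), is shown below; returning 1.0 when both masks are empty is a convention chosen here, not something taken from the study.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity of two segmentations given as sets of voxel indices."""
    if not mask_a and not mask_b:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    overlap = len(mask_a & mask_b)
    return 2.0 * overlap / (len(mask_a) + len(mask_b))


# Invented masks with identical volume (three voxels each) but shifted position.
pet_mask = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
mri_mask = {(1, 0, 0), (2, 0, 0), (3, 0, 0)}

print(dice_coefficient(pet_mask, mri_mask))  # about 0.67 despite equal volumes
print(dice_coefficient(pet_mask, pet_mask))  # 1.0
```

The toy masks have equal volumes and would agree perfectly on a volume comparison, yet overlap only partially, which is the distinction that volume correlation alone cannot capture.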
In PET/MRI, when the PET and MRI images are obtained contemporaneously, there may still be some difference in the appearance of the tumour between the two modalities due to variable bladder filling in the time interval between acquisitions. The few studies [14, 47, 48] that have used PET/MRI for volume have stated that there was excellent co-registration between the two modalities, with the caveat that no mention of bladder filling was made. Perhaps simultaneous acquisition improves the degree of overlap between the two modalities.
All the FDG PET/CT analysis was performed with the same reconstructions on retrospective data from the same scanner. Two other studies [13, 16] from other centres using different PET manufacturers (GE Discovery VCT) and reconstruction parameters also demonstrated the same optimal threshold. The effect of resolution recovery on the MTV has not been explored but as this method of reconstruction becomes more common, this may impact on the optimal segmentations.
A recent radiomics study recognised that MTVs connecting with the bladder are a major problem for most segmentation methods and utilised MTV 50% to avoid bladder at the trade-off of under-sampling tumour volume. A systematic review and meta-analysis reported that MTV and TLG were significant prognostic factors in patients with cervical cancer in spite of different methods of outlining. Future work should assess if the MTV threshold/method within the same patient group has a different impact on predicting outcome/radiomics.
The widespread adoption of MTV will rely on the ease of use and reproducibility between observers. Future software development may permit selection of constraining volume (as in method 2) but in addition, the ability to slightly adjust the constraining volume for such cases where the tumour and bladder cannot be entirely separated by the isocontour method.