Manual segmentation versus semi-automated segmentation for quantifying vestibular schwannoma volume on MRI

Purpose Management of vestibular schwannoma (VS) is based on tumour size as observed on T1 MRI scans with contrast agent injection. The current clinical practice is to measure the diameter of the tumour in its largest dimension. It has been shown that volumetric measurement is more accurate and more reliable as a measure of VS size. The reference approach to achieve such volumetry is to manually segment the tumour, which is a time-intensive task. We suggest that semi-automated segmentation may be a clinically applicable solution to this problem and that it could replace linear measurements as the clinical standard.
Methods Using high-quality software available for academic purposes, we ran a comparative study of manual versus semi-automated segmentation of VS on MRI with 5 clinicians and scientists. We gathered both quantitative and qualitative data to compare the two approaches, including segmentation time, segmentation effort and segmentation accuracy.
Results We found that the selected semi-automated segmentation approach is significantly faster (167 s vs 479 s, p < 0.001), less temporally and physically demanding and has approximately equal performance when compared with manual segmentation, with some improvements in accuracy. There were some limitations, including algorithmic unpredictability and error, which produced more frustration and increased mental effort in comparison with manual segmentation.
Conclusion We suggest that semi-automated segmentation could be applied clinically for volumetric measurement of VS on MRI. In future, the generic software could be refined for use specifically for VS segmentation, thereby improving accuracy.
Electronic supplementary material The online version of this article (10.1007/s11548-020-02222-y) contains supplementary material, which is available to authorized users.


Introduction
Vestibular schwannoma (VS) is a benign tumour of the vestibulocochlear nerve arising within the cerebellopontine angle, deep inside the cranium. It accounts for approximately 6-8% of all intracranial neoplasms and has a prevalence of around 0.02% of the population [23]. Patients may present with a variety of symptoms including hearing loss, balance problems, vertigo, dizziness and headache among others [31]. Diagnosis is usually made on a Magnetic Resonance Imaging (MRI) scan with intravenous contrast demonstrating a homogeneously-enhancing lesion within the internal acoustic canal that may also extend into the intracranial cavity [30]. Grading of tumours is performed according to radiographic characteristics indicating tumour extent and size and is used to guide treatment [19]. Patients with small or asymptomatic tumours are usually managed conservatively with serial surveillance scans. Small or medium-sized tumours deemed suitable for treatment can be treated effectively and safely with stereotactic radiosurgery (SRS) [24] but larger tumours are usually managed with surgery.
Measuring the size of a VS on MRI is important in guiding treatment or monitoring growth patterns. There are several methods for measuring tumour size but the most common technique is to measure diameter at the tumour's widest point [16,33,43]. However, this approach is prone to measurement inaccuracies. Volumetric measurement is a solution to this problem [20]. Volumetric analysis offers a more accurate representation of the tumour [21] and could significantly aid the management of these patients. Segmentation (contouring) is already used in the planning of Gamma Knife SRS treatment. Segmentation also provides a means of performing volumetric measurement of the tumour. Compared with two-dimensional measurements, it may allow more accurate active surveillance of VS. Volumetric measurement has been used to predict recurrence in patients with residual tumours following surgical intervention [37], to measure change in tumour size following SRS treatment [44] and to predict hearing preservation following SRS treatment [11]. There are three main methods of volumetric analysis: manual segmentation, semi-automated segmentation and automated segmentation. Manual segmentation involves comprehensively labelling the 3D structure in each 2D slice. It is a time-intensive task with relatively low inter- and intra-individual reliability and has not been widely employed in clinical practice.
Automated segmentation has been applied successfully to MR imaging for a wide range of brain tumours [46]. Automated segmentation may be accurate in the assessment of tumour progression and in overall survival prediction in glioma [1,28] as well as for the clinical assessment of biomarkers in glioma [4]. For VS imaging, automated segmentation has been applied with positive results [34,40] and there is growing interest in the field [10]. An automated segmentation tool could also improve clinical workflow and operational efficiency during the planning of stereotactic radiosurgery (SRS) by using the tool as an initialisation step in the process. However, automated approaches are, for the most part, not fully validated and are confined to academic use. Furthermore, some tumours display heterogeneous enhancement including the 4% of VS tumours that may be cystic, which can lead to inaccurate segmentation when automated methods are applied [27].
Semi-automated segmentation has been shown to be a more reliable option for the analysis of VS on MRI scans [26]. However, there has been no previous analysis of cognitive load or user experience of VS segmentation. When using semi-automated methods, segmentation time and repeatability may be improved when compared with manual segmentation [2,6,39,41]. Compared with fully automatic segmentation, results may be more accurate [1] and are more acceptable to clinicians due to increased transparency in the segmentation process [12]. Currently proposed methods require user input for one or more of the following steps: segmentation parameters, feedback or evaluation, including refinement and validation of the segmentation. There is little material in the literature regarding user experience of interactive segmentation in brain imaging, despite the intention to pursue clinical translation in the field [18,35]. A number of software packages are academically available for medical image segmentation, spanning a variety of different methods. For manual segmentation, ITK-SNAP [45] is a widely used open-source software library with manual, semi-automated and automated segmentation offerings. 3D Slicer has the standard offerings of image viewing and analysis tools, along with a variety of downloadable packages for semi-automated and automated segmentation [8]. MRIcron is a package of image viewing and manual segmentation tools. For semi-automated segmentation, ImFusion Labels (ImFusion, Munich, Germany) is a recent commercial-grade package with academic licensing options.
We present the findings of a proof-of-concept study using combined quantitative and qualitative analysis, comparing manual segmentation with semi-automated segmentation of VS on MRI. We hypothesise that semi-automated segmentation is faster than manual segmentation with comparable performance. In this study we also compare the user experience of two software suites, including that of clinicians and senior researchers.

Materials and Method
We selected four tumours from our database for the study (See Table 1). All four patients had previously undergone Gamma Knife SRS treatment [3]. The images were representative of a variety of tumour sizes and shapes encountered in clinical practice. We selected two small and two moderate-sized tumours. The ground truth measurements were made prior to the study by the treating skull base neurosurgeon and stereotactic radiosurgery physicist using Gamma Knife planning software (Leksell GammaPlan, Elekta, Sweden). The images used in this study were all contrast-enhanced T1-weighted scans with 0.4 mm × 0.4 mm in-plane resolution, in-plane matrix of 512 × 512 and 1.5 mm slice thickness. All cases included an extracanalicular (intracranial) component and none of the tumours had a cystic component. Patients with multiple tumours were excluded.
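As a minimal sketch of how a segmentation on this grid translates into a volume, the voxel dimensions quoted above (0.4 mm × 0.4 mm in-plane, 1.5 mm slice thickness) give a voxel volume of about 0.24 mm³, and the tumour volume is the labelled voxel count multiplied by that value. The function names and the example voxel count below are hypothetical and are not taken from the study.

```python
# Voxel dimensions as reported in the text (millimetres).
IN_PLANE_MM = 0.4
SLICE_THICKNESS_MM = 1.5


def voxel_volume_mm3(dx=IN_PLANE_MM, dy=IN_PLANE_MM, dz=SLICE_THICKNESS_MM):
    """Volume of a single voxel in cubic millimetres."""
    return dx * dy * dz


def segmentation_volume_cm3(n_voxels, voxel_mm3=None):
    """Volume of a segmentation in cubic centimetres, given its voxel count."""
    if voxel_mm3 is None:
        voxel_mm3 = voxel_volume_mm3()
    return n_voxels * voxel_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3


# Hypothetical example: a label map containing 5000 tumour voxels.
example_volume = segmentation_volume_cm3(5000)
print(f"voxel volume: {voxel_volume_mm3():.2f} mm^3")
print(f"tumour volume: {example_volume:.2f} cm^3")
```

Because the slice thickness is several times the in-plane spacing, a one-slice disagreement at the top or bottom of the tumour changes the volume much more than a one-voxel disagreement within a slice.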
We selected ITK-SNAP for manual segmentation since this offered the most intuitive user interface. In our group it was also the most widely used library for manual segmentation. We selected ImFusion Labels for semi-automated segmentation since this was a recent software with a good selection of machine learning tools and a high-quality user interface. It was made available to our group through an academic license.
Five observers, including two medical students, two biomedical engineers and one neurosurgeon, performed manual and semi-automated segmentation on each of the four scans. The participants had varying levels of experience with segmentation. Three participants were inexperienced segmenters (with no or limited previous experience) and two were experts with multiple years' experience of medical image segmentation. Three had previous experience using ITK-SNAP, one of whom had limited experience of using ImFusion Labels.

Study Design
A training period was included for each study participant at the start of the study and for each software library, using a training data set which was not part of the study. This training period was standardised to 10 minutes for each participant and included an initial demonstration from the study lead followed by a trial run for each participant. During the training period, participants were free to ask questions relating to the segmentation. The trial runs were not included in the results or the analysis. Participants were advised on the optimal tools to use in each software library. This training period was adapted based on the needs and previous experience of the participant, such that no demonstration was given for those participants well-versed in the use of the software library.
In ITK-SNAP, participants used the polygon drawing tool to outline tumour boundaries in each slice and fill in the tumour volume (See Fig. 1). The paintbrush tool was used to make small alterations as needed. A time limit of ten minutes per segmentation was provided in order to standardise the process according to arbitrary mock-clinical parameters.
In ImFusion Labels, participants used the 'Interactive Segmentation' module (See Fig. 2). They were advised to first draw background labels which included structures of a variety of intensities (e.g. bone, dura, healthy brain). After the first iteration of the segmentation, participants were advised to only undertake two alterations in the segmentation. This was determined to produce optimum results while creating an incentive to complete the task in a time-pressured manner.
A document containing participant instructions is included as Online Resource 1. A video depicting segmentation in ITK-SNAP is included as Online Resource 2. A video depicting segmentation in ImFusion Labels is provided as Online Resource 3.

Qualitative Data Collection
The NASA Task Load Index (TLX) [14] questionnaire was administered at the end of the study to quantify user effort for each method of segmentation. The TLX scores different aspects of a task on a graded scale from 1 to 21, including effort, frustration and performance. It can be found as Table 2 in the appendix. The TLX was used as a relative comparator of the libraries, rather than as an absolute scale. For data analysis we processed the raw TLX data. This may be a more reliable use of the TLX compared with using part two to calculate an overall weight-adjusted score [5].
We performed short post-segmentation interviews to explore the participants' experiences of the different toolboxes. The questions were based around themes, which included 'segmentation experience', 'toolbox' and 'study design'. Table 3 in the appendix details the questions asked of each participant. Participants were asked about each software library separately. Data was collected in shorthand form by the study lead during the interview and then expanded following the interview.

Fig. 2 Segmentation in ImFusion Labels using background labels (blue) and foreground labels (red) to demarcate tumour and non-tumour tissue

Quantitative Data Collection and Analysis
The time taken to perform the segmentation was measured from the time of launching the software to the time of closing the software following the segmentation. A paired t-test was performed on this data to calculate the p-value as well as the confidence intervals. We quantified segmentation accuracy by comparing the segmentations in each software with the ground truth data in order to establish a comparative analysis. We calculated the Dice Coefficient (Dice) since this is a standard comparative measure of radiological data [28,29]. We also calculated relative volume error (RVE) and average symmetric surface distance (ASSD) for each segmentation. We performed subgroup analysis on both the time and accuracy data. We took the two more experienced segmenters and compared results from these individuals against the three less experienced segmenters.
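The three accuracy measures named above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: segmentations are represented as sets of (x, y, z) voxel indices on a shared grid, RVE is taken in its unsigned form, surface voxels are those with at least one face-neighbour outside the segmentation, and the default spacing reuses the voxel dimensions reported for the scans. All function names are hypothetical, and both segmentations are assumed to be non-empty.

```python
from itertools import product
from math import sqrt

# Six face-neighbour offsets used to find boundary voxels.
_NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(seg, ref):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two sets of voxel indices."""
    if not seg and not ref:
        return 1.0
    return 2.0 * len(seg & ref) / (len(seg) + len(ref))


def relative_volume_error(seg, ref):
    """|V_seg - V_ref| / V_ref; voxel counts stand in for volumes (assumed unsigned)."""
    return abs(len(seg) - len(ref)) / len(ref)


def surface(voxels):
    """Voxels with at least one face-neighbour outside the set."""
    return {
        (x, y, z)
        for (x, y, z) in voxels
        if any((x + dx, y + dy, z + dz) not in voxels for dx, dy, dz in _NEIGHBOURS)
    }


def assd(seg, ref, spacing=(0.4, 0.4, 1.5)):
    """Average symmetric surface distance in mm (brute force, small inputs only)."""
    seg_surf, ref_surf = surface(seg), surface(ref)

    def nearest(p, points):
        # Euclidean distance from p to the closest point, scaled by voxel spacing.
        return min(
            sqrt(sum(((p[i] - q[i]) * spacing[i]) ** 2 for i in range(3)))
            for q in points
        )

    total = sum(nearest(p, ref_surf) for p in seg_surf)
    total += sum(nearest(q, seg_surf) for q in ref_surf)
    return total / (len(seg_surf) + len(ref_surf))


def cube(lo, hi):
    """Toy segmentation: a solid cube of voxels with indices in [lo, hi)."""
    return set(product(range(lo, hi), repeat=3))


# Toy example: two 4 x 4 x 4 cubes offset by one voxel along every axis.
reference = cube(0, 4)
candidate = cube(1, 5)
print("Dice:", dice(candidate, reference))
print("RVE:", relative_volume_error(candidate, reference))
print("ASSD (mm):", assd(candidate, reference))
```

The toy example illustrates why the measures are reported together: the two cubes have identical volumes, so RVE is zero, while Dice and ASSD both register the spatial mismatch.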

Results
Segmentation time was significantly faster in ImFusion Labels. In terms of TLX data, ITK-SNAP was more time demanding and physically demanding whereas ImFusion was more mentally demanding and frustrating. The performance, in terms of accuracy, and overall effort of the libraries was comparable. Qualitatively, participants preferred the control that ITK-SNAP offered; however, some did not like the time demand. ImFusion was a good tool for rapidly estimating tumour volume, but there were frustrating errors produced in complex tumour segmentation.

[Figure caption, beginning missing] ... ITK-SNAP as compared to ImFusion Labels. The "ITK-SNAP Correlated" plot takes into account only the ITK-SNAP data corresponding to the ImFusion Labels data that we still had access to after the data loss had occurred.

Time
Between the two libraries, segmentation in ImFusion Labels was significantly faster than in ITK-SNAP. The mean segmentation time (ST) in ITK-SNAP was 479 s (95% CI 439-519) while the mean ST in ImFusion Labels was 168 s (95% CI 168-249), with a p value of < 0.001 (See Fig. 3a). There was no observed difference in segmentation time between the less experienced individuals and the more experienced individuals.
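The paired comparison of segmentation times can be illustrated with a short sketch. The timings below are invented for illustration and are not the study data, the pairing of one manual and one semi-automated time per observation is an assumption about how the test was set up, and the helper name is hypothetical. The sketch computes only the mean paired difference and the t statistic; obtaining a p value or confidence interval additionally requires the Student's t distribution with n - 1 degrees of freedom, which is outside the Python standard library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(x, y):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_bar = mean(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD of the differences
    return d_bar, d_bar / standard_error, n - 1


# Hypothetical paired timings in seconds (manual vs semi-automated).
manual_s = [500, 460, 510, 470, 455]
semi_automated_s = [170, 150, 200, 160, 180]

mean_diff, t_stat, dof = paired_t(manual_s, semi_automated_s)
print(f"mean difference: {mean_diff} s, t = {t_stat:.2f}, df = {dof}")
```

A paired design is appropriate here because each observer segmented the same scans with both tools, so differences between scans and between observers cancel within each pair.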

Accuracy
The user-generated segmentation dataset was compromised during the study, resulting in half of the ImFusion data being unavailable for analysis of segmentation accuracy. On the remaining data, we observed comparable accuracy between the two libraries, with a Dice score range of 0.848-0.964 for ImFusion compared with a range of 0.867-0.943 for ITK-SNAP. Compared with segmentations in ITK-SNAP, segmentations in ImFusion Labels were more similar to the ground truth data in terms of Dice (0.913 vs 0.902, p=0.301), RVE (0.0723 vs 0.124, p=0.245) and ASSD (0.381 vs 0.419, p=0.349) as illustrated in Fig. 3b. In our subgroup analysis the two cohorts achieved similar levels of accuracy for manual segmentation in ITK-SNAP. The experienced cohort achieved more accurate Dice scores (0.901 vs 0.899, p=0.533) and RVE scores (0.155 vs 0.104, p=0.312), while the inexperienced cohort achieved more accurate ASSD scores (0.417 vs 0.420, p=0.936) when compared with ground truth data. However, none of these differences were statistically significant.

NASA TLX score
The TLX scores showed a trend towards ITK-SNAP being the more physically and temporally demanding approach (+6 and +3.4-point scores on average respectively), while ImFusion tended to be more mentally demanding and worse in terms of perceived performance (-7.8 and -2.4 points on average respectively). All participants graded ImFusion as being more frustrating, with a +7.4-point greater score on average. All participants also graded ImFusion as being more mentally demanding, with a +7.8 greater score on average. ITK-SNAP was graded as being more physically demanding by all but one participant. Less experienced raters tended to score the segmentation performance of ImFusion higher than more experienced raters. Overall effort was slightly greater (+2.4 points on average) in ImFusion.

Interview Data
ITK-SNAP was the preferred choice for highly accurate segmentation, whilst one participant recommended ImFusion as a 'rough volumetric estimate'. All participants cited the improved performance of the ImFusion algorithm with 'simple' tumours, i.e. those which were highly contrast-enhancing and homogeneous, with well-defined boundaries and no or minimal adjacent high contrast structures, such as blood vessels or dura. However, for complex tumours the algorithm often made small, but frustrating, errors in segmentation: "[the algorithm] threw up errors which required a complete restart". Occasionally non-tumour areas were included, and tumour areas were not included. There was generally no way to fix this using the tool. One participant complained that in these more challenging cases, the algorithm was "a one-trick pony... if you make alterations to the initial segmentation you may worsen it." Participants commented on the 'unpredictability' of the algorithm and the lack of transparency as being a significant problem in solving these issues. In ITK-SNAP the majority of participants cited the need to compromise between thoroughness and timing of segmentation. One stated "I am a perfectionist... if we were not timed, [the segmentation] would take me much longer." In terms of study design, participants found the instructions clear and found it "helpful to have someone here to explain and provide feedback [during the training period]". A full breakdown of the qualitative data taken from interviews is provided in the appendix (See Table 5).

Discussion
In this paper we sought to compare manual segmentation to semi-automated segmentation on several variables, both quantitative and qualitative, for segmentation of VS. It is widely published that semi-automated segmentation may reduce the time taken to perform segmentation [9,25,32]. We showed that semi-automated segmentation is significantly faster and has comparable performance when compared with manual segmentation for volumetric analysis of VS. This would suggest good viability for this approach in clinical practice, where time constraints may restrict which methods are used. However, this study does have some limitations. In terms of performance, both semi-automated and manual segmentation were highly accurate when compared with ground truth data and there was no statistically significant difference between the two methods. In terms of clinical applicability, any differences between the two may also be clinically insignificant, thereby making semi-automated segmentation a desirable option. The involvement of inexperienced segmenters may reduce the validity of the conclusions we can draw. However, we observed a high degree of similarity in accuracy data for the experienced segmenters when compared with the inexperienced segmenters, suggesting that there was no compromise on data quality due to the inclusion of less-experienced participants.
In interview, some participants suggested that the segmentation in ImFusion produced significant errors in complex tumours. The Dice scores, however, indicated a high degree of accuracy in these segmentations. One explanation for this inconsistency in perception versus result may be attributable to a finer margin for error applied to the analysis of segmentations in ImFusion. Participants spent, on average, 479s on each segmentation in ITK-SNAP, compared with 168s in ImFusion. This time discrepancy may have led to a higher acceptance threshold for the segmentation in ImFusion, and small mistakes may have been picked up more readily.
In terms of effort measures, the NASA TLX was a useful tool. However, one limitation is that the system was used as a relative measure of effort between the different software libraries used for the study. Therefore, the absolute values offered by participants may not be an accurate measure of absolute effort and would therefore provide unreliable data for inter-rater comparison. We compared the inter-rater scores by subtracting the ImFusion scores from the ITK data for each participant. We would therefore suggest the use of the full TLX as opposed to the Raw TLX to overcome these issues.
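The per-participant subtraction described above can be sketched as follows. The scores are invented for illustration and are not the study data, and the dictionary keys and function names are hypothetical. Each difference is the ITK-SNAP score minus the ImFusion score for the same participant, so a positive value means ITK-SNAP was rated higher on that dimension.

```python
from statistics import mean

# The six raw TLX dimensions, one score (1 to 21) per dimension.
DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def tlx_differences(itk_scores, imfusion_scores):
    """Per-participant ITK-SNAP minus ImFusion score for every dimension."""
    return [
        {d: itk[d] - imf[d] for d in DIMENSIONS}
        for itk, imf in zip(itk_scores, imfusion_scores)
    ]


def mean_difference(differences):
    """Average the per-participant differences for every dimension."""
    return {d: mean(row[d] for row in differences) for d in DIMENSIONS}


# Hypothetical raw scores for two participants.
itk_scores = [
    {"mental": 5, "physical": 12, "temporal": 10, "performance": 6, "effort": 9, "frustration": 4},
    {"mental": 7, "physical": 14, "temporal": 11, "performance": 5, "effort": 10, "frustration": 6},
]
imfusion_scores = [
    {"mental": 13, "physical": 6, "temporal": 7, "performance": 8, "effort": 11, "frustration": 12},
    {"mental": 13, "physical": 8, "temporal": 8, "performance": 9, "effort": 12, "frustration": 12},
]

summary = mean_difference(tlx_differences(itk_scores, imfusion_scores))
print(summary)
```

Working with within-participant differences removes each rater's personal baseline, which is why the raw scores can serve as a relative comparator even when their absolute values are not comparable between raters.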
We chose to state the segmentation goal as what would be clinically, or personally, acceptable to the participants. In this way, we felt that participants would apply the same requirements to both libraries. In some cases, the opposite was true. A very thorough approach was employed by some participants in ITK-SNAP, but in ImFusion Labels they used a crude approach. This difference in perceived goals may have introduced bias in the time and effort of segmentation. This challenge could be avoided in future by clearly stating the goals of the segmentation, whether targeting accuracy or speed.
One constraint on semi-automated segmentation lies in usability of the tools. In this study, a common point of feedback was that the algorithm was inconsistent and unpredictable in its segmentation. Some users found this tedious and had to restart when the algorithm produced errors. In the literature, a commonly cited limitation in clinical application is algorithmic transparency [17]. Users did not understand what the algorithm did and why. ImFusion Labels is a generic library and has wide applicability in medical imaging. A solution to this issue may be to refine an algorithm specifically for VS segmentation.
There is very little qualitative data in the literature on the use of segmentation tools. Qualitative data are particularly important given the current interest in clinical translation of AI tools, which must be robust, easy to use and accurate [17]. As far as we can see, this is the first paper to use a mixed quantitative and qualitative format to compare semi-automated segmentation with manual segmentation in medical imaging. The small sample size of this study, in terms of participants and scans segmented, may limit the validity of the conclusions we can draw. One further challenge was in data representation for qualitative analysis, since none of the research team had previous experience of handling interview data. It may be useful to recruit this expertise in future studies.
In terms of applicability to the current clinical workflow, semi-automated segmentation may assist in monitoring VS growth, especially in those patients with small tumours being managed conservatively with serial imaging [13,15,33]. It has been established that volumetric measurement is superior to single-dimension diametric measurements for quantifying growth [26,38]. Manual segmentation is not feasible in routine clinical practice due to the time-demanding nature of the task. We showed that semi-automated segmentation is less time-demanding, less physically-demanding and of comparable performance.
In the future, it is hoped that further algorithmic developments could support the practice of radiology among other specialities [36]. Deep learning is a sub-type of artificial intelligence that utilises multiple layers of analysis to process an image. A variety of applications of deep learning are postulated [7,22,42], and one study has shown this to be a useful approach in automated VS segmentation [34] in terms of both time and accuracy. Despite the accuracy of automated approaches, interactive corrections may continue to play a role even with deep learning due to the lack of adaptability of automated methods to the specific imaging sequences and protocols used clinically [39]. The next steps are to further analyse this methodology and work towards clinical translation.
The findings of this study may also be applied more widely to semi-automated segmentation of other neuroimaging data. Some participants felt that manual segmentation could not be matched in terms of performance if plenty of time was spent. The participants did not have specific expertise in the diagnosis or management of VS, aside from the neurosurgeon. We would expect that similar results, in terms of qualitative findings, may be present in other applications; for instance tumour segmentation for glioma. We would recommend that semi-automated segmentation is used as a supportive measure to other standard approaches in neuroimaging segmentation.

Conclusion
Gains are being made in the machine learning and medical imaging fields. Machine learning applications are now performing comparably with their manual counterparts. However, a finding of this study was that even the state-of-the-art machine learning tools may not yet be fully ready for clinical roll out in segmentation of vestibular schwannoma. Users found the tools to be fast and accurate, but at times unpredictable and frustrating to use. There were limitations in the study, including the small sample size in terms of participants, particularly those with experience in segmentation, and in the number of scans segmented. This makes conclusions difficult to draw. The strengths of this study lie in the joint use of both qualitative and quantitative methods, which were employed to address the clinical applicability of algorithms. Unpredictability of algorithm behaviour and lack of transparency with algorithmic methods are cited as being key issues. To remedy this, developers should focus on involving groups with a variety of backgrounds and expertise in the development process, to ensure clinical and research applicability.

Table 2 NASA Task Load Index (TLX) questions
How mentally demanding was the task?
How physically demanding was the task?
How hurried or rushed was the pace of the task?
How successful were you in accomplishing what you were asked to do?
How hard did you have to work to accomplish your level of performance?
How insecure, discouraged, irritated, stressed, and annoyed were you?

Table 3 Interview questions for qualitative comparison of the two software libraries
Was the segmentation in each software to your satisfaction?
Overall, how did you find each software?
What would you add or remove from each software to improve them?
How did you find the study?