1 Introduction

Vestibular schwannoma (VS), also known as acoustic neuroma, is a benign intracranial tumor. As such a tumor grows, it may compress the brainstem and cerebellum and cause hearing impairment, tinnitus, dizziness, syncope, trigeminal neuropathy, and facial palsy. Various treatments can be applied for VS; of these, Gamma Knife radiosurgery (GKRS) is usually applied to small- and medium-sized (<2.5-cm) VS tumors. In addition, magnetic resonance (MR) images acquired with different parameters, which provide high contrast for particular tissues, play a pivotal role in diagnosis and treatment planning. By visually inspecting these parametric MR images, physicians first detect the tumor lesions, delineate the tumor contour, and subsequently determine the VS treatment [1]. MR images of different parameters can be used to inspect the two main components of VS lesions: uniform solid tumor parts, which are enhanced on T1-weighted (T1W) gadolinium contrast-enhanced (T1W+C) images, and cystic parts, which are enhanced on T2-weighted (T2W) images. During GKRS planning, tumor localization is the first step; it relies on experienced neurosurgeons and neuroradiologists repeatedly reviewing different parametric MR images and is therefore time-consuming and subjective. A reliable automatic VS detection method can increase the efficiency of GKRS planning.

Numerous deep learning methods have been successfully applied to medical images; convolutional neural networks (CNNs) have been especially effective. Such networks learn feature maps that perform comparably with predefined features used for image recognition [2, 3]. CNN-based semantic segmentation algorithms, such as fully convolutional networks (FCNs) [4] and U-Nets [5], form a standard approach for detection in the clinical context and substantially assist with clinical challenges such as radiotherapeutic planning [6,7,8,9].

These studies showed that pixel-wise predictions help localize brain tumors precisely, which may improve tumor volume measurement and the determination of treatment response. However, object detection networks usually estimate whether objects exist in a region; thus, they may provide features that are not relevant to semantic segmentation but are highly relevant to the relationship between the region and the background [9]. Object detection techniques have been widely applied to other medical images. Li et al. [10] demonstrated automatic detection of thyroid papillary cancer in ultrasound images using Faster R-CNN. For breast cancer detection, Mohammed et al. [11] detected breast masses in digital mammograms using a YOLO-based computer-aided diagnosis system. For the chest, George et al. [12] used YOLO for real-time detection of lung nodules in low-dose CT scans. For the skin, Ünver et al. [13] combined YOLO with the GrabCut algorithm to segment skin lesion areas in dermoscopic images. These studies demonstrate the strength of CNN object detection algorithms in the medical image domain.

In addition, studies have demonstrated that using multiparametric MR images improves detection of brain tumors with multiple subregions [14,15,16]. For VS tumor detection, Wang et al. [17] and Shapey et al. [18] demonstrated the feasibility of applying CNNs to multiparametric MR images for the segmentation of VS lesions. Lee et al. [19] improved VS lesion detection by treating multiparametric MR images from the same patient as different channels of the same image. These three related studies of VS detection used U-Net-based semantic segmentation algorithms to detect VS tumor lesions pixel-wise. In contrast, the proposed method detects the VS tumor as a whole within a region, providing an alternative tool for assisting clinicians in localizing the tumor. The aim of the present study was to implement a region-wise method for VS tumor detection. We applied a CNN detection subnetwork, namely YOLO-v2, to automatically detect VS lesions from triple-channel images composed of T1W+C, T2W, and T1W images, and we discuss its applicability under clinical circumstances. As a well-recognized object detection method, YOLO-v2 provides a short training duration and acceptable detection performance, allowing us to efficiently test the influence of different downsampling strategies. We added a YOLO-v2 detection subnetwork after a residual network, which is a stack of several forms of residual convolution blocks. Such a deep architecture allowed us to assess the influence of feature maps of different sizes, ranging from 13 × 13 to 32 × 32 pixels, obtained from the residual convolution blocks. The YOLO-v2 block and residual networks are displayed in Fig. 2b. These feature-extraction layers produced feature maps downsampled by factors of 16 and 32. In addition, before the training phase, we cropped the images to better focus on the brain. The detection capability of each CNN architecture was evaluated with the average precision and F1 score.
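To make this overall design concrete, the following is a minimal sketch (written in PyTorch, not the MATLAB implementation used in this study) of a residual feature extractor with an overall stride of 16 followed by a YOLO-v2-style detection head. The channel widths, block counts, six anchors, and single tumor class are illustrative assumptions rather than the exact configuration of our network.

```python
# Illustrative sketch only: a residual backbone with overall stride 16
# followed by a YOLO-v2-style prediction head (assumed widths/anchors).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut

def downsample(in_ch, out_ch):
    # Strided convolution halves the spatial resolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class YoloV2StyleDetector(nn.Module):
    def __init__(self, num_anchors=6, num_classes=1):
        super().__init__()
        # Four stride-2 stages -> overall downsampling factor of 16.
        self.backbone = nn.Sequential(
            downsample(3, 32), ResidualBlock(32),
            downsample(32, 64), ResidualBlock(64),
            downsample(64, 128), ResidualBlock(128),
            downsample(128, 256), ResidualBlock(256),
        )
        # YOLO-v2-style head: per feature-map cell and per anchor,
        # 4 box offsets + 1 objectness score + class scores.
        self.head = nn.Conv2d(256, num_anchors * (5 + num_classes), 1)

    def forward(self, x):
        return self.head(self.backbone(x))

# A 416 x 416 triple-channel input yields a 26 x 26 prediction grid at
# stride 16 (it would be 13 x 13 at stride 32).
model = YoloV2StyleDetector()
out = model(torch.zeros(1, 3, 416, 416))
print(out.shape)  # torch.Size([1, 36, 26, 26]) with 6 anchors and 1 class
```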

2 Materials and Methods

2.1 Data Acquisition and Preprocessing

During GKRS planning, physicians locate the tumor regions to determine the target area and dose delivery. Therefore, the data of each VS patient typically contain not only the multiparametric MR images but also individual annotations of the tumor region, manually marked by experienced neurosurgeons and neuroradiologists. We used these annotations as the ground truth to develop and evaluate the detection models.

To extract adequate tumor features, MR images with different parameters were required. In VS treatment, physicians focus not only on the solid part of a tumor but also on the cystic part, and these two parts are enhanced on T1W+C and T2W images, respectively. Thus, we mainly used T1W+C and T2W images to detect the target region. T1W images were also used because adding T1W to the training phase yielded a slight improvement over using only T1W+C and T2W in a previous study [19]. In this study, we retrospectively collected the three-dimensional MR images and tumor regions of 516 pre-GKRS VS patients from Taipei Veterans General Hospital, Taiwan. The study was approved by the Institutional Review Board of Taipei Veterans General Hospital (IRB-TPEVGH No.: 2018-11-0089AC). The MR images were completely anonymized. All MR images were acquired on a GE scanner with a magnetic field strength of 1.5 Tesla and included T1W, T2W, and T1W+C axial two-dimensional spin-echo images (Fig. 1a). The matrix size of the MR images was 512 × 512 × 20 with a voxel size of 0.5 × 0.5 × 3 mm³ (in the x–y–z directions; Fig. 1b). For preprocessing, ANTs N4 bias field correction was applied to all images [20]. Because the individual tumor bounding boxes were manually annotated on the T1W+C images by experienced neuroradiologists, the T1W and T2W images were co-registered to the corresponding individual T1W+C image using a rigid-body six-degrees-of-freedom (6DOF) registration procedure in SPM12 (Statistical Parametric Mapping) [21]. These three parametric images were used concurrently as triple-channel input images for training the models (Fig. 1c). A multichannel image (e.g., a red–green–blue image) contains more information than a single-channel image (e.g., grayscale) because features can be either within-channel or cross-channel features. We treated the 512 × 512 × 20 data as 20 slices of 512 × 512 images because we used two-dimensional object detectors to detect the VS tumors and because the data provide more information in the x–y plane (axial view). We randomly separated the preprocessed image data into a training set of 412 subjects (80% of all subjects) and an evaluation set of 104 subjects (20% of all subjects); the data of each patient comprised 20 triple-channel axial images with a 512 × 512 matrix. During the training phase, only the slices containing tumors were used; a total of 2342 two-dimensional slices marked with tumors were selected as training inputs. During the evaluation phase, a total of 2080 images with or without tumor markers were used to evaluate the models; of these, 609 images were marked with tumors, including 618 distinct tumor regions.
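As an illustration of this preprocessing step, the following minimal sketch (Python with nibabel and NumPy) stacks the T1W+C, T1W, and T2W volumes into per-slice triple-channel images. It assumes the three volumes are already bias-corrected and co-registered to the T1W+C grid; the file names and the min–max normalization are hypothetical choices, not the exact pipeline used in this study.

```python
# Sketch: build 20 axial triple-channel images from three co-registered
# parametric volumes (hypothetical file names; min-max normalization assumed).
import nibabel as nib
import numpy as np

def load_volume(path):
    vol = nib.load(path).get_fdata().astype(np.float32)
    # Per-volume min-max normalization to [0, 1] (one reasonable choice).
    vol -= vol.min()
    if vol.max() > 0:
        vol /= vol.max()
    return vol  # expected shape (512, 512, 20)

t1c = load_volume("subject001_T1W_C.nii.gz")   # solid part enhances here
t1w = load_volume("subject001_T1W.nii.gz")     # additional anatomical context
t2w = load_volume("subject001_T2W.nii.gz")     # cystic part enhances here

# Treat the 512 x 512 x 20 volume as 20 axial slices; stack the three
# parametric images as channels, analogous to an RGB image.
slices = [
    np.stack([t1c[:, :, z], t1w[:, :, z], t2w[:, :, z]], axis=-1)
    for z in range(t1c.shape[2])
]
print(slices[0].shape)  # (512, 512, 3): one triple-channel training image
```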

Fig. 1

(a) Imaging of VS on different parametric MR images. The solid part of the tumor appears with high intensity on T1W+C (left). The cystic part appears with high intensity on T2W (right). The T1W image (middle) is used for additional image information. (b) The MR image data have a matrix size of 512 × 512 × 20. (c) The three parametric MR images are used as triple-channel images

2.2 Experimental Procedure

In deep-learning research on object detection, two strategies are commonly used: two-stage and one-stage detection. A two-stage detector uses a region proposal network that separates objects from the background and then sends the region proposals to subsequent networks for classification and bounding-box regression. A one-stage detector, such as the YOLO family of algorithms, treats object detection as a regression problem, using a single network to predict class probabilities and bounding-box locations simultaneously. Previous studies demonstrated that a one-stage detector produces predictions directly from the corresponding anchors in the feature map, yielding faster detection [22,23,24,25].

In the present study, we applied two CNN-based architectures for VS detection, both consisting of YOLO-v2 with a deep residual network as the backbone, where the backbone was pretrained as a feature extractor [23, 24]. For such deep models, the input images are downsampled into feature maps through multiple extraction layers. A previous study demonstrated superior detection performance by downsampling the input images by a factor of 32 [24]. Accordingly, we applied one architecture that downsampled the input images by a factor of 32 and another by a factor of 16, as depicted in Fig. 2a. In addition, these architectures used sets of predefined anchor boxes to improve the effectiveness of detection. The anchor boxes are the initial predictive guesses for the bounding boxes; their sizes were determined from the object sizes in the training data set. During detection, the anchor boxes are tiled across the images or feature maps so that the networks predict probabilities and refine the corresponding anchor boxes instead of predicting bounding boxes directly. However, too many anchor boxes may increase the computational cost and lead to overfitting. Therefore, the number of anchor boxes is an influential hyperparameter for detectors. Rather than hand-picking the sizes, we used a k-means algorithm that clusters the ground truth bounding boxes with the intersection-over-union (IoU) distance to determine the anchor box sizes [24]. The IoU and IoU distance (DIoU) are defined as:

$${\text{IoU}}(A,B) = \frac{\left| A \cap B \right|}{\left| A \cup B \right|}, \qquad {\text{D}}_{\text{IoU}}(A,B) = 1 - {\text{IoU}}(A,B)$$
(1)
Fig. 2

Flow of the present study. (a) Data preparation and preprocessing: we cropped the original 512 × 512 images to 416 × 416 or 448 × 448 and constructed triple-channel images from the T1W+C, T1W, and T2W MR images. Data with these three image sizes were trained and tested separately. (b) Training phase of the detection models: the YOLO-v2 detectors were trained with the preprocessed images. The residual network was a stack of convolution blocks as displayed. N was set to 2 or 3, which downsampled the input by a factor of 16 or 32, respectively. (c) Evaluation phase of the detection models using average precision and F1 score

where A represents the ground truth bounding box and B represents the predicted bounding box. We ran the k-means algorithm with various values of k, where k represents the number of anchor boxes; we examined the tradeoff between the number of anchor boxes and the mean IoU and then determined an appropriate k.
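The following minimal sketch (Python/NumPy rather than the MATLAB used in this study) illustrates this clustering: boxes are represented by their widths and heights, compared with the 1 − IoU distance of Eq. (1) as if aligned at a common corner, and Lloyd-style k-means updates are applied. The example box sizes are hypothetical.

```python
# Sketch of anchor-box estimation by k-means over (width, height) pairs
# using the 1 - IoU distance of Eq. (1); example box sizes are hypothetical.
import numpy as np

def iou_wh(boxes, centroids):
    # boxes: (N, 2) and centroids: (k, 2) arrays of (width, height) pairs.
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union                      # (N, k) matrix of IoU values

def kmeans_anchors(boxes, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(n_iter):
        # Assign each ground-truth box to the closest centroid (1 - IoU distance).
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new_centroids = np.array([
            boxes[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
    mean_iou = iou_wh(boxes, centroids)[np.arange(len(boxes)), assign].mean()
    return centroids, mean_iou

# Hypothetical tumor bounding-box sizes (width, height) in pixels:
boxes = np.array([[18, 20], [25, 27], [33, 30], [40, 44], [22, 19], [55, 50]], float)
for k in (1, 2, 3):
    anchors, mean_iou = kmeans_anchors(boxes, k)
    print(k, anchors.round(1), round(float(mean_iou), 3))
```

Plotting the mean IoU against k in this way yields the tradeoff curve used to select the number of anchor boxes (Sect. 3.1).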

2.3 Cropping Images

Each network architecture has its own suitable or restricted input size, so the detection result is influenced by the size of the input images. Hence, before the training phase, we cropped the triple-channel images as training data in addition to using the 512 × 512 raw images. We inferred that cropping allows the model to focus on the brain without the surrounding background, thereby improving detection accuracy and speeding up the training process. Because the models downsampled the input images by a factor of up to 32, we cropped the images around the brain to 416 × 416 or 448 × 448 pixels, both multiples of 32. We created the brain masks as follows. First, we computed a global image threshold using Otsu's method, which determines a threshold by minimizing the intraclass intensity variance [26], and converted the original grayscale images to binary images. We then filled the holes in the binary images, where holes are sets of black pixels that cannot be reached from the edges of the image. Next, we performed an area opening operation on the binary images. This yielded whole-brain masks from the original 512 × 512-pixel MR images. Finally, we cropped the original images to 416 × 416 or 448 × 448 pixels, centered on the central points of the brain masks.
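A minimal sketch of these masking and cropping steps is given below, using scikit-image and SciPy in Python (the study itself was implemented in MATLAB). The minimum object area used for the area opening and the placeholder input are assumptions.

```python
# Sketch of the brain-masking and cropping steps described above
# (Otsu threshold -> fill holes -> area opening -> centered crop).
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects
from scipy.ndimage import binary_fill_holes, center_of_mass

def crop_around_brain(image, crop_size=416, min_area=500):
    # 1. Global Otsu threshold converts the grayscale slice to a binary image.
    mask = image > threshold_otsu(image)
    # 2. Fill holes (black pixels unreachable from the image border).
    mask = binary_fill_holes(mask)
    # 3. Area opening: discard small connected components (noise, artifacts).
    mask = remove_small_objects(mask, min_size=min_area)
    # 4. Crop a crop_size x crop_size window centered on the mask centroid,
    #    clamped so that the window stays inside the original image.
    cy, cx = center_of_mass(mask)
    half = crop_size // 2
    y0 = int(np.clip(round(cy) - half, 0, image.shape[0] - crop_size))
    x0 = int(np.clip(round(cx) - half, 0, image.shape[1] - crop_size))
    return image[y0:y0 + crop_size, x0:x0 + crop_size], (y0, x0)

# Usage on a 512 x 512 slice (placeholder data standing in for a T1W+C channel):
slice_512 = np.random.rand(512, 512).astype(np.float32)
cropped, offset = crop_around_brain(slice_512, crop_size=416)
print(cropped.shape, offset)  # (416, 416) and the crop origin for box remapping
```

The returned crop origin can be used to shift the annotated bounding boxes into the cropped coordinate frame before training.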

2.4 Evaluation of Each Tumor Detection Model

The main goal of the present study was to explore the tumor detection performance of different pipelines and models rather than the regression error between bounding boxes. Three outcome types for single-class object detection, determined from the ground truth and the predictions, were used in the present study: true positive (TP), false positive (FP), and false negative (FN). An FP indicated a predicted bounding box without any tumor, and a TP or FN was determined by whether the IoU between the predicted and ground truth bounding boxes exceeded a predefined IoU threshold, set to 0.5. With the TP, FP, and FN counts, we calculated the recall, precision, and F1 scores. Precision represents how accurate the predictions are; recall represents how completely the model finds all positive items. However, evaluating a model using only recall or precision may produce bias. Therefore, the F1 score, the harmonic mean of precision and recall, was also estimated for better comprehension [27].

$${\text{Recall}} = \frac{{TP}}{{TP + FN}}$$
(2)
$${\text{Precision}} = \frac{{TP}}{{TP + FP}}$$
(3)
$${\text{F1 Score}} = 2 \cdot \frac{{\text{Precision}} \cdot {\text{Recall}}}{{\text{Precision}} + {\text{Recall}}}$$
(4)
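The following minimal sketch (Python) shows how TP, FP, and FN can be counted at an IoU threshold of 0.5 and how Eqs. (2)–(4) follow from those counts. The greedy one-to-one matching rule and the toy boxes are assumptions, since the exact matching procedure is not detailed here.

```python
# Sketch of the evaluation logic behind Eqs. (2)-(4); boxes are [x1, y1, x2, y2].
import numpy as np

def box_iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def evaluate(preds, gts, iou_thr=0.5):
    # preds: predicted boxes, gts: ground-truth boxes (one image); greedy matching.
    matched = set()
    tp = 0
    for p in preds:
        ious = [box_iou(p, g) if i not in matched else 0.0 for i, g in enumerate(gts)]
        if ious and max(ious) >= iou_thr:
            tp += 1
            matched.add(int(np.argmax(ious)))
    fp = len(preds) - tp           # predicted boxes without a matching tumor
    fn = len(gts) - tp             # tumors the model failed to find
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f1

print(evaluate([[10, 10, 40, 40]], [[12, 12, 42, 42]]))  # IoU > 0.5 -> one TP
```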

In addition to the location, the confidence score of each predicted bounding box was another crucial value for measuring detection performance. During the evaluation phase, the confidence score of each bounding box represents the model's confidence in the detection; it is one of the regression outputs of the network. By adjusting the lower limit of the confidence score threshold, a different number of bounding boxes is retained. Reducing the threshold retains more bounding boxes, resulting in higher recall but lower precision. The appropriate confidence score threshold for each model was selected by searching for the threshold that yielded the highest F1 score. In addition, the precision–recall (P-R) curve was drawn using the precision and recall values calculated under different confidence score thresholds. The average precision (AP) was then estimated by calculating the area under the P-R curve [28]. The performance of the models was evaluated with the APs and F1 scores [29, 30].
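A minimal sketch of this AP computation is shown below (Python/NumPy): pooled predictions are sorted by confidence score, precision and recall are accumulated as the threshold is lowered, and the area under the resulting P-R curve is taken as the AP. The trapezoidal integration and the toy inputs are assumptions; published AP conventions (e.g., 11-point interpolation) differ in detail.

```python
# Sketch: AP as the area under the P-R curve traced by sweeping the
# confidence score threshold over all pooled predictions.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    # scores: confidence of each prediction; is_tp: 1 if it matched a tumor
    # at IoU >= 0.5, else 0; num_gt: total ground-truth tumors in the set.
    order = np.argsort(scores)[::-1]            # descending confidence
    tp = np.cumsum(np.asarray(is_tp, float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, float)[order])
    recall = tp / num_gt                        # Eq. (2) swept over thresholds
    precision = tp / (tp + fp)                  # Eq. (3) swept over thresholds
    # Area under the P-R curve (prepend the recall = 0 endpoint).
    return float(np.trapz(np.r_[precision[0], precision], np.r_[0.0, recall]))

# Toy example: 5 pooled predictions, 4 ground-truth tumors in the evaluation set.
scores = [0.95, 0.90, 0.70, 0.60, 0.30]
is_tp  = [1,    1,    0,    1,    0   ]
print(round(average_precision(scores, is_tp, num_gt=4), 3))
```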

The code was implemented in MATLAB [31] and executed on an NVIDIA GeForce RTX 2080 Ti GPU (NVIDIA, Santa Clara, CA), an Intel Core i7-8700 CPU, and 32 GB of RAM.

3 Results

In this section, we first demonstrate the relationship between the mean IoU and the number of anchor boxes, which provides a reference for selecting a suitable number of anchor boxes. We then list the recall, precision, F1 score, and AP of each model with different input sizes and downscaling factors to evaluate the detection capabilities under different confidence score thresholds.

3.1 Number of Anchor Boxes

Because using too many anchor boxes results in time-consuming and inaccurate detection, we did not choose the number of anchor boxes, referred to as k, by the highest mean IoU but by the tradeoff between the number of anchor boxes and the mean IoU. We implemented k-means with various values of k and plotted the tradeoff between the number of anchor boxes and the mean IoU (Fig. 3). As illustrated in Fig. 3a, we assumed that using six or seven anchor boxes, whose mean IoU values were 0.7332 and 0.7965, respectively, would result in better detection, because using 10 or more anchor boxes yields only a slightly higher mean IoU. Moreover, the mean IoU and the size variation of the anchor boxes were similar for these two numbers of anchors. To generate highly effective performance, we selected six predefined anchor boxes for training our VS detectors.

Fig. 3

k-means clustering was applied to the bounding boxes in our training data set to obtain prior anchor boxes. (a) The change in mean IoU with the number of anchor boxes k. A favorable tradeoff between the mean IoU and the number of anchor boxes occurs at k = 6 or 7. (b) The sizes of the relative anchor boxes at k = 6 and 7. The size variation of the anchor boxes is similar under the two conditions

3.2 AP of VS Detection with Different Input Sizes and Different Downscaling Factors

The AP values presented in Table 1 demonstrate the detection capability of each combination of downsampling factor and input size. Separating the brain region from the background in the MR images may enhance tumor detection because less extraneous area requires examination; accordingly, the AP of the original input size, 512 × 512, was the worst among the models with the same downsampling strategy (Tables 1 and 2). In addition, a previous study downsampled the input images by a factor of 32 and obtained desirable results on object detection benchmark data sets [24]. However, in our VS detection with triple-parametric MR images, the models that downsampled the input images by a factor of 16 achieved better performance than those using a factor of 32, regardless of how the input images were resized. The P-R curves and APs of each model with different input sizes and downscaling factors are shown in Fig. 4.

Table 1 Detection capabilities of each architecture trained with images produced by different resizing procedures
Table 2 Training time and evaluation time of each model
Fig. 4

P-R curve of each model. The blue curves represent the performance of VS detection with images cropped to 416 × 416; the red and yellow curves represent images cropped to 448 × 448 and the original 512 × 512 size, respectively. The black asterisk (*) denotes the precision and recall when the confidence score threshold was set to 0.5

3.3 F1 Scores of VS Detection Under Various Threshold of Confidence Score

Next, we examined the detection capabilities under different confidence score thresholds. When the confidence score threshold was set to 0.5, the model with 416 × 416-pixel input and downsampling by 16 attained the best F1 score, 0.7953, among the models with the same downsampling strategy. The model with 448 × 448-pixel input and downsampling by 16 attained a lower F1 score of 0.7730 but a higher recall of 0.7880, which means that more tumors were detected although the tumor prediction proposals were less trustworthy. Moreover, the results presented in Table 1 and Fig. 5 showed that when an appropriate confidence score threshold was selected, such as 0.56, the model with 448 × 448-pixel input and downsampling by 16 generated the highest F1 score, namely 0.8171. This indicates that the confidence score threshold measurably influences the performance of the detectors.

Fig. 5

F1 score of each detector trained with different input sizes and downscaling factors. The models that downsampled the input images by 16 outperformed the models that downsampled the input images by 32. The 416 × 416-pixel images with downsampling by 16 performed best (F1 score 0.7953) when the confidence score threshold was set to 0.5. However, by adjusting the confidence threshold to a slightly higher value, each model attained better performance, and the 416 × 416-pixel images with downsampling by 16 performed best (highest F1 score 0.8171)

4 Discussion

To our knowledge, this is the first study to demonstrate VS tumor detection with models combining different input-image sizes and downsampling strategies on triple-parametric MR images. In the context of medical image detection, the detected areas primarily provide physicians with a preliminary examination that allows them to locate the VS quickly and establish the follow-up treatment process. The present study showed that YOLO-v2 with a residual network backbone successfully detected VS tumors. Among the models, the 416 × 416-pixel input architecture that downsampled the input by 16 attained the best AP, namely 0.779, when the confidence threshold was set to 0.5. The results suggest that removing the unrelated background increases the detection capability and that excessive downsampling is unsuitable for this detection task. It can be inferred that deeper models do not always produce better performance.

Moreover, the P-R curves and F1 scores under different confidence score thresholds demonstrated the influence of the threshold selection on model performance. The model with the confidence score threshold set at 0.5 did not obtain the best performance in the present study. However, the confidence score threshold is not a learnable parameter that can be computed through the training process. We can only select a relatively suitable value through additional validation, which would be necessary if a practical detection model for clinical use were required.

Furthermore, we compared the YOLO-v2 and Faster R-CNN algorithms, both with the 16-times downsampling feature extraction network, and found that the Faster R-CNN requires much more memory. More specifically, the region proposal network in the Faster R-CNN outputs regional images of various sizes, and this process consumes a large amount of memory. Moreover, in the training phase, YOLO-v2 could process 16 images per batch, whereas the Faster R-CNN could process only 1 image per batch. In addition, the Faster R-CNN took more than eight hours to train for 5 epochs, whereas YOLO-v2 took less than 20 minutes to train for 10 epochs; this was partly because the limited GPU RAM of our equipment could not accommodate large images when processing the Faster R-CNN. Therefore, we adopted YOLO-v2, since it provided a shorter training duration and superior performance (AP = 0.6654 with 416 × 416-pixel input images and AP = 0.7077 with 448 × 448-pixel input images) in this study.

In addition, our processing was practical because most clinical MR images of VS have lower through-plane resolution than in-plane resolution (anisotropic voxel size). We therefore used two-dimensional axial (x–y plane) information instead of three-dimensional data for each subject. The through-plane dimension (z direction) of the data was only 20 slices in the present study; such a thin dimension would vanish during feature extraction through multiple convolution layers. Therefore, with the current detection method, two-dimensional information conveys more meaningful characteristics for detection. However, three-dimensional data provide more spatial information, which may aid individual VS detection. This speculation warrants future investigation to produce additional evidence.

5 Conclusion

The contributions of the present study include two aspects. First, we demonstrated that a YOLO-v2 with a residual network backbone successfully detected VS tumors from resized triple-parametric MR images. Second, we implemented variously resized images and architectures that downsampled the input by different factors, attaining an F1 score of 0.7953 with 416 × 416-pixel input images and downsampling by 16 when the thresholds of both the confidence score and the IoU were set to 0.5. In addition, although the two-dimensional detectors provided accurate VS detection, a three-dimensional detector should be developed to evaluate whether the effectiveness of individual VS detection can be further enhanced.