1 Introduction

Cancer is the second leading cause of death in Indonesia, after cardiovascular disease. According to the Global Burden of Cancer Study (GLOBOCAN) conducted under the World Health Organization (WHO), Indonesia recorded 396,914 cancer cases and 234,502 cancer-related deaths in 2020. In the screening process for early detection of cancerous lesions, radionuclide imaging is a crucial method for examining patients who may have metastatic prostate, breast, or lung cancer [1]. This screening method relies heavily on the visual interpretation of bone scan images and therefore lacks quantitative diagnosis. Consequently, a quantitative index called the bone scan index (BSI) has been proposed as a valuable marker for tracking disease progression and treatment efficacy [2,3,4,5,6].

Because BSI is calculated for each predetermined region of the skeleton, its value depends heavily on the segmentation results. Based on BSI values, we can determine whether metastatic lesions are present at specific bone sites in cancer patients [2,3,4,5]. It should be noted that segmenting the skeleton from bone scan images is the initial and indispensable step in every stage of a bone scan metastasis detection system [7,8,9,10]. For this purpose, the skeleton is divided into several predetermined areas: the skull, cervical vertebrae, thoracic vertebrae, lumbar vertebrae, sacrum, pelvis, ribs, scapula, humerus, femur, sternum, and clavicle [6, 11, 12].

Since the advent of deep learning in computer vision in 2012 [13], the adoption of deep learning methods in medical imaging has increased rapidly, particularly in medical image segmentation [14,15,16]. Besides being one of the most important research areas in computer vision, medical image segmentation has significantly aided the development of multiple effective medical treatments. Despite remarkable advances over the past few years, deep learning-based medical image segmentation still faces a myriad of challenges. One of the main challenges is achieving high segmentation accuracy. This is a crucial issue because segmentation results in medical images are usually used for diagnosing a disease or detecting potentially harmful tissue. In addition to low image resolution, obtaining a large number of medical images can be difficult. These challenges are crucial factors that must be considered when designing a medical image segmentation system that meets actual medical needs [15].

Several successful semantic segmentation methods for medical images have been proposed [17,18,19,20,21,22]. Ronneberger et al. introduced U-Net [17], a network architecture that follows the encoder-decoder principle. Using a contracting path (to capture context) and a symmetric expanding path, their architecture achieves precise localization. To the previously known concept of down-sampling and up-sampling layers, they added skip connections between opposing convolution and deconvolution layers. Consequently, features from the contracting and expanding paths are combined, yielding more robust features for semantic segmentation [14].
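
To make the mechanism concrete, the following is a minimal PyTorch sketch of a U-Net-style skip connection; the channel counts and names are illustrative, not U-Net's exact configuration.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """Minimal sketch of a U-Net-style skip connection: decoder features
    are up-sampled and concatenated with the matching encoder features
    before a further convolution refines the merged result."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_ch // 2 + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, skip):
        x = self.up(x)                   # up-sample decoder features
        x = torch.cat([x, skip], dim=1)  # skip connection: merge encoder features
        return torch.relu(self.conv(x))

# Shape check: decoder features at half resolution meet a 32-channel skip.
x, skip = torch.randn(1, 64, 16, 16), torch.randn(1, 32, 32, 32)
print(SkipBlock(64, 32, 32)(x, skip).shape)  # torch.Size([1, 32, 32, 32])
```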

The emergence of U-Net [17] served as the impetus for a variety of detection and segmentation methods for medical images aimed at resolving various problems. Baheti et al. proposed Eff-UNet [23], an architecture for semantic segmentation using the same encoder-decoder approach as U-Net. They chose EfficientNetB5 and EfficientNetB7 for feature extraction in the encoder because their ImageNet performance is superior to that of other CNN architectures; EfficientNet replaces the stack of convolution, activation, and pooling layers present in the original U-Net. In the decoder, they employ an architecture similar to the original U-Net. Unlike the symmetric encoder and decoder of U-Net, the proposed architecture is asymmetrical, with a contracting path deeper than the expansion path.

Furthermore, Sun et al. introduced Circle-UNet [19], an improvement over U-Net that adds circular patterns representing residual layers. They demonstrated that residual layers can effectively capture recurrent patterns, facilitating the network's comprehension of certain aspects of the objects. The proposed architecture consists of 101 layers and is classified as very deep, based on the premise that residual layers contribute effectively only when the network is very deep. In addition, Guo et al. created SA-UNet [18] for retinal blood vessel segmentation by incorporating spatial attention into the U-Net architecture to emphasize key features. To prevent overfitting, they replaced U-Net's native convolution blocks in the encoder and decoder with structured dropout convolutional blocks.

Zhou et al. proposed HAUNet-3D [22], which introduces attention mechanisms at multiple scales and organizes them hierarchically from the bottom up to address the shortcomings of U-Net-based lung nodule segmentation methods. Their method captures features of any size and directs the blending of features from adjacent attention outputs while preserving the benefits of 3D U-Net. Their approach demonstrated superior robustness for heterogeneous lung nodule types and outstanding segmentation performance for small lung nodules.

In addition, recent advancements in transformer-based models have sparked interest in medical image segmentation, particularly around the U-Net model. The Axial Fusion Transformer U-Net (AFTer-UNet) is a medical segmentation framework proposed by Yan et al. [21]. They adhere to the encoder and decoder structure of the original U-Net to extract image features and achieve pixel-level segmentation, respectively. In the encoder, they proposed a transformer-based fusion mechanism that combines contextual information from adjacent slices along the axial axis, obtaining a representation with high-level semantic information. These features are then fed to the decoder's segmentation component.

The advancement of the various semantic segmentation methods mentioned above cannot be separated from progress in ConvNet architectures [24,25,26,27,28]. Howard et al. proposed MobileNets [28], which utilized the depth-wise separable convolutions first introduced in Xception [29]. A depth-wise separable convolution consists of two operations: a depth-wise convolution and a point-wise convolution. In depth-wise convolution, the number of output channels equals the number of input channels, and each channel is filtered independently, so features cannot be combined across channels; the point-wise convolution, a convolution with a 1 \(\times\) 1 kernel, then combines the features created by the depth-wise convolution. Zhang et al. proposed ShuffleNet [25], which combines group convolutions and channel shuffles. A group convolution consists of multiple convolutions, each taking a portion of the input channels, while a channel shuffle permutes the group convolution's output channels so that information flows across groups. In addition, Sandler et al. created MobileNetV2 [27], which builds on the ideas of MobileNetV1 [28] and uses depth-wise separable convolution as an efficient building block. Moreover, Freeman et al. proposed EffNet [24], a slim model-optimized version of MobileNet and ShuffleNet, to address issues with existing models. Their work was motivated by the increasing use of convolutional neural networks in consumer products, so the model is designed to run efficiently on mobile hardware.
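
The factorization can be sketched in PyTorch as follows (channel counts are illustrative):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Sketch of a depth-wise separable convolution: a depth-wise 3x3
    convolution (groups=in_ch, so each input channel is filtered on its
    own and the channel count is unchanged) followed by a point-wise
    1x1 convolution that combines the per-channel features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

layer = DepthwiseSeparableConv(32, 64)
print(layer(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```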

Tan and Le [26] observed that balancing network depth, width, and resolution can enhance performance. Based on this observation, they proposed a compound scaling method that uniformly scales depth, width, and resolution using a set of constants and a compound coefficient. Their architecture, dubbed EfficientNet, demonstrated that scaling all three dimensions simultaneously produces superior results.
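
The compound scaling rule can be sketched as follows; the constants are those reported in [26], and the example coefficient is arbitrary:

```python
# Sketch of EfficientNet's compound scaling rule [26]: for a compound
# coefficient phi, depth, width, and resolution are scaled as
# alpha**phi, beta**phi, and gamma**phi, where the constants satisfy
# alpha * beta**2 * gamma**2 ~= 2 (the paper reports alpha=1.2,
# beta=1.1, gamma=1.15 found by grid search).
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    depth_mult = alpha ** phi   # multiplier on the number of layers
    width_mult = beta ** phi    # multiplier on the number of channels
    res_mult = gamma ** phi     # multiplier on the input resolution
    return depth_mult, width_mult, res_mult

# Example: phi = 3 scales depth ~1.73x, width ~1.33x, resolution ~1.52x.
print(compound_scale(3))
```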

Meanwhile, Shimizu et al. demonstrated a successful implementation [6] of BtrflyNet to segment the skeleton from bone scan images, a critical step in detecting bone cancer metastases. BtrflyNet, proposed by Sekuboyina et al., is a butterfly-shaped network that efficiently combines information across reformations by fusing two views [30]; their training strategy was predicated on EBGAN-based adversarial training. EfficientNet's recent success has also prompted us to capitalize on its strengths: its mobile inverted bottleneck convolution (MBConv) has proven to improve performance on numerous computer vision tasks. In light of these achievements, we designed a segmentation system that takes advantage of both the butterfly-type network (BtrflyNet) and EfficientNet.

This study presents semantic segmentation of whole-body bone scan images. We employed a deep learning-based approach to achieve high skeleton segmentation accuracy in both the anterior and posterior parts. We propose a modified butterfly-type network augmented with EfficientNet to improve segmentation accuracy, combining the advantage of the butterfly-type network, which processes anterior and posterior images simultaneously, with the advantage of EfficientNet, which helps identify the relevant features of each segment. We conducted experiments on 37 bone scan images and demonstrated the effectiveness of the proposed system by comparing it with the original BtrflyNet approach [6]. By using the high-level features extracted by EfficientNet in conjunction with the features extracted by the BtrflyNet architecture, we demonstrated that our proposed architecture, Efficient-BtrflyNet, produces excellent segmentation results.

The remainder of this article is structured as follows: the proposed system is described in Sect. 2; the experimental results and analysis are presented in Sect. 3; and the conclusion is given in Sect. 4.

2 Efficient-BtrflyNet for Skeleton Segmentation

This section describes our proposed architecture and performance evaluation.

2.1 Proposed Architecture

Our work employs pairs of anterior and posterior whole-body bone scan images (Fig. 1) along with their respective segmentation masks. The images were in RGB format with a size of 512 \(\times\) 128 pixels. The segmentation process produces 13 layers corresponding to 12 bones and one background [6]. Figure 2 depicts the guidance for annotating the 12 bone classes in both anterior and posterior samples. The 12 bone segments are the skull (green), the cervical vertebrae (blue), the thoracic vertebrae (light blue), the ribs (purple), the sternum (pink), the clavicle (light brown), the scapula (brown), the humerus (dark blue), the lumbar vertebrae (yellow), the sacrum (orange), the pelvis (dark green), and the femur (gold).

Fig. 1 Sample pair of bone scan images: a anterior and b posterior

Fig. 2 Guidance for ground truth annotations in 12 segments: a anterior and b posterior bone scan images

In this study, we propose Efficient-BtrflyNet, which leverages the efficacy of EfficientNet [26] within a butterfly-shaped network architecture to produce fine segmentation of bone scan images. Following the human experts' reading practice, we hypothesize that the anterior and posterior portions of the whole-body bone scan image both contain valuable information for producing fine segmentation. The use of a butterfly-shaped network is inspired by BtrflyNet [30], which successfully encoded two views of a spine reformation within a single architecture. Our proposed architecture follows the encoder-decoder principle and consists of convolution and pooling layers with no fully connected layers. It is illustrated in Fig. 3.

Fig. 3 Efficient-BtrflyNet for skeleton segmentation

In our proposed architecture, two arms (blue and purple boxes) serve as the encoder and decoder, while one body (green boxes) connects them. The upper and lower portions of the left arm serve as encoders for the anterior and posterior bone scan images, respectively. Each of the three sub-modules in each portion contains a single convolutional layer. To capture more fine-grained patterns, an EfficientNet block in the form of a mobile inverted bottleneck convolution (MBConv) [27] is added to each sub-module.
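
As an illustration, a simplified MBConv block might be sketched in PyTorch as follows; this captures the general pattern rather than EfficientNet's exact block:

```python
import torch
import torch.nn as nn

class MBConvSketch(nn.Module):
    """Simplified sketch of a mobile inverted bottleneck convolution
    (MBConv) [27]: a 1x1 expansion, a depth-wise 3x3 convolution, and a
    1x1 projection, with a residual connection when shapes match. The
    real EfficientNet block additionally uses squeeze-and-excitation,
    Swish activation, and drop-connect."""
    def __init__(self, in_ch, out_ch, expand=6):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),                       # expand
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depth-wise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False),                      # project
            nn.BatchNorm2d(out_ch))
        self.use_residual = in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```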

In the decoder section, which also consists of three sub-modules, the resolution is incrementally increased through up-sampling, and a prediction is made for each pixel. Each sub-module receives data from the left arm, merged with the up-sampling result from the right arm; consequently, the network captures patterns at multiple scales. The decoder likewise has two parts, the top and the bottom, which serve as decoders for the anterior and posterior bone scan images, respectively.

In our proposed architecture, the left arm is divided into upper and lower portions, with three sub-modules each. The three sub-modules at the bottom perform convolution and pooling, mirroring the three sub-modules at the top. In the first sub-module of the upper left arm, each input image undergoes a 3 \(\times\) 3 convolution with same padding, batch normalization, and ReLU activation. The resulting 32-channel output is later combined with the right arm's up-convolution results. The input image is also provided to Block 1 of EfficientNet in the second sub-module of the upper left arm, yielding 32 output channels; a 3 \(\times\) 3 convolution then generates 48 output channels, which are likewise combined with the right arm's up-convolution result. In the third sub-module, the 32-channel feature map is passed to Block 2 of EfficientNet, producing a 48-channel output, to which a 3 \(\times\) 3 convolution is applied to generate 80 channels that are later combined with the up-convolution results from the right arm. The 48-channel feature map is also provided to Block 3 of EfficientNet, which generates 80 channels; this feature map is combined with the feature map from the lower portion of the left arm.

The outputs of the left arm (top and bottom) are combined to produce a 160-channel feature map for the body part (green boxes). A 3 \(\times\) 3 convolution and 2 \(\times\) 2 max pooling are then applied; the same convolution and max pooling are applied a second time, followed by a further convolution, yielding 1024 channels. Each convolution doubles the number of channels relative to its input. Next, an up-convolution is applied, followed by a join with the left side and a 3 \(\times\) 3 convolution with 512 output channels. Another up-convolution and left-side join follow, and a 3 \(\times\) 3 convolution is applied in the right arm.

The final step applies a 1 \(\times\) 1 convolution with same padding to map each output to 13 channels representing the 13 segmented regions. This process is visible in the upper and lower portions of the right arm. The code for our implementation of the proposed Efficient-BtrflyNet is publicly available in our GitHub repository.
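
For orientation, the following is a deliberately minimal sketch of the butterfly topology only; it omits the EfficientNet blocks, skip connections, and pooling described above, uses placeholder channel counts, and is not our released implementation:

```python
import torch
import torch.nn as nn

class ButterflySketch(nn.Module):
    """Deliberately minimal sketch of the butterfly topology: two encoder
    arms, a shared body on the concatenated features, and two decoder
    arms ending in a 1x1 convolution onto 13 class maps. Sub-module
    internals and channel counts are placeholders."""
    def __init__(self, n_classes=13):
        super().__init__()
        self.enc_ant = nn.Sequential(nn.Conv2d(3, 80, 3, padding=1), nn.ReLU())
        self.enc_post = nn.Sequential(nn.Conv2d(3, 80, 3, padding=1), nn.ReLU())
        self.body = nn.Sequential(nn.Conv2d(160, 160, 3, padding=1), nn.ReLU())
        self.dec_ant = nn.Conv2d(160, n_classes, 1)   # 1x1 conv -> 13 maps
        self.dec_post = nn.Conv2d(160, n_classes, 1)

    def forward(self, ant, post):
        fused = self.body(torch.cat([self.enc_ant(ant),
                                     self.enc_post(post)], dim=1))
        return self.dec_ant(fused), self.dec_post(fused)

# Shape check with the paper's 512 x 128 inputs:
ant = post = torch.randn(1, 3, 512, 128)
out_a, out_p = ButterflySketch()(ant, post)
print(out_a.shape, out_p.shape)  # each: torch.Size([1, 13, 512, 128])
```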

2.2 Performance Evaluation

The loss function minimized when training the skeleton segmentation is the cross-entropy in Equation (1):

$$\begin{aligned} L = -\sum _{i=1}^{n} t_{i} \log (p_{i}), \end{aligned}$$
(1)

where n is the number of segments, \(t_i\) is the truth label, and \(p_i\) is the softmax probability for the \(i^{th}\) segment. The overall loss L is then calculated as the sum of the losses from the anterior and posterior parts.
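
In a PyTorch implementation, this loss can be sketched as follows (tensor names are illustrative):

```python
import torch.nn.functional as F

# Sketch of the training objective: per-pixel cross-entropy (Eq. 1),
# computed separately on the anterior and posterior outputs and summed.
# out_a/out_p are 13-channel logit maps of shape (N, 13, H, W) and
# gt_a/gt_p integer class masks of shape (N, H, W).
def total_loss(out_a, out_p, gt_a, gt_p):
    # F.cross_entropy applies log-softmax internally, matching Eq. (1).
    return F.cross_entropy(out_a, gt_a) + F.cross_entropy(out_p, gt_p)
```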

To assess the segmentation system's performance, we employ the Dice score in Equation (2), which measures the similarity, or overlap, between two segments:

$$\begin{aligned} dice(A,B) = \frac{2 \mid A \cap B \mid }{\mid A \mid + \mid B \mid }. \end{aligned}$$
(2)

where A is the predicted segment and B is the ground truth segment. To evaluate the system fairly, this calculation excludes the background segment. The final Dice score is the average of the Dice scores of the individual segments.
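
This evaluation can be sketched as follows; the handling of segments absent from both maps is our assumption:

```python
import torch

def mean_dice(pred, target, n_classes=13, background=0):
    """Average Dice (Eq. 2) over the 12 bone segments. `pred` and
    `target` are integer label maps of the same shape; the background
    class is excluded, as in our evaluation."""
    scores = []
    for c in range(n_classes):
        if c == background:
            continue  # background is excluded from the evaluation
        a, b = pred == c, target == c
        denom = a.sum() + b.sum()
        if denom == 0:
            continue  # segment absent from both maps (our assumption)
        scores.append(2.0 * (a & b).sum() / denom)
    return torch.stack(scores).mean()
```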

3 Experimental Results and Discussion

This section describes the data, the experimental setup, and an analysis of the experimental results.

3.1 Dataset Description

This experiment utilized data from two groups, Indonesian and non-Indonesian, chosen because bone geometry may differ between the two populations. The images were collected from 18 Indonesian and 19 non-Indonesian respondents. Each scan consists of an anterior and a posterior image, making 36 Indonesian and 38 non-Indonesian images in total. The Indonesian bone scan images were obtained in DICOM format from the Faculty of Nuclear Medicine, University of Padjadjaran [12, 31]. The non-Indonesian images were obtained through a Google Images crawl. Figure 4 depicts examples of unprocessed data for both groups.

Fig. 4 Sample pairs of bone scan images (anterior and posterior): (1) Indonesian and (2) non-Indonesian

For the training process, we first annotated each segment with a different color, as depicted in Fig. 2. We also normalized the original RGB colors to speed up training. Table 1 shows the normalized RGB value for each segment, while Fig. 5 shows an example of the normalization result for an RGB mask image.

Table 1 The RGB value of each segment after normalization
Fig. 5 An example of normalization results from an RGB mask image: a RGB mask image, b normalized RGB mask image
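
A hypothetical sketch of this normalization step, snapping each annotated pixel to the nearest canonical segment color (the palette shown is a placeholder; Table 1 holds the actual values):

```python
import numpy as np

# Hypothetical palette: the actual normalized RGB values are in Table 1;
# the rows below are illustrative placeholders only.
PALETTE = np.array([[0, 0, 0],      # background
                    [0, 255, 0],    # skull (green)
                    [0, 0, 255]])   # cervical vertebrae (blue), ...

def normalize_mask(rgb_mask):
    """Map an H x W x 3 RGB mask to H x W class indices by snapping each
    pixel to the nearest canonical segment color."""
    diff = rgb_mask[..., None, :].astype(float) - PALETTE[None, None]
    return np.linalg.norm(diff, axis=-1).argmin(axis=-1)
```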

3.2 Parameters of Skeleton Segmentation Training

A total of 29 images were used for training and eight for testing. During image preprocessing, we applied horizontal flipping. Training was conducted over 300, 500, and 800 epochs with adaptive moment estimation (Adam) [32] as the optimizer and cross-entropy as the loss function. The Adam parameters were set to \(\alpha\)={0.001, 0.0001, 0.00001}, \(\beta _1\)=0.9, \(\beta _2\)=0.999, \(\varepsilon\)=10\(^{-8}\), and weight_decay=0. EfficientNetB7 was the EfficientNet architecture employed in this experiment. We also investigated a pre-trained model based on adversarial training (advprop) [33] when applying EfficientNet.
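
The corresponding optimizer setup can be sketched in PyTorch as follows (the model shown is a placeholder):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 13, 1)  # placeholder for the segmentation network
# Adam with the hyper-parameters listed above; lr is swept over
# {0.001, 0.0001, 0.00001} in the experiments.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
```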

Table 2 shows the training results for learning rates \(\alpha\)={0.001, 0.0001, 0.00001} in the BtrflyNet model. The decrease in train_loss and the increase in train_dice are significant when the learning rate is reduced from 0.001 to 0.0001. Meanwhile, training with \(\alpha\)=0.00001 requires more epochs and a longer training time to converge; train_loss no longer changes significantly, but train_dice still increases considerably. Therefore, this model is taken as the best BtrflyNet model.

Likewise, Table 3 shows that the decrease in train_loss and the increase in train_dice for the Efficient-BtrflyNet model are significant when the learning rate is reduced from 0.001 to 0.0001. For Efficient-BtrflyNet, training with \(\alpha\)=0.00001 similarly requires more epochs and a longer training time to converge; train_loss no longer changes significantly, but train_dice still increases considerably. Therefore, this model is taken as the best Efficient-BtrflyNet model.

Table 2 Experiment to observe learning rate (LR) values on BtrflyNet
Table 3 Experiment to observe learning rate (LR) values on Efficient-BtrflyNet

We also applied a pre-trained model based on adversarial training (advprop) [33] to Efficient-BtrflyNet. Experiments were carried out at \(\alpha\)={0.001, 0.0001, 0.00001}, with the advprop parameter set to TRUE if the model uses advprop and FALSE otherwise. The results in Table 4 show that the pre-trained model based on adversarial training (advprop=TRUE) produces better loss and Dice values than training without it (advprop=FALSE). Based on these results, the Efficient-BtrflyNet model used hereafter is the one using advprop.
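
Assuming the widely used `efficientnet_pytorch` package (lukemelas/EfficientNet-PyTorch), loading the advprop weights can be sketched as follows:

```python
# Sketch assuming the `efficientnet_pytorch` package; advprop=True
# selects the AdvProp [33] pre-trained weights.
from efficientnet_pytorch import EfficientNet

backbone = EfficientNet.from_pretrained('efficientnet-b7', advprop=True)
# Note: advprop checkpoints expect inputs normalized as x -> 2x - 1
# rather than with the standard ImageNet mean/std.
```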

Table 4 Experiment regarding LR and adversarial training on Efficient-BtrflyNet

3.3 Results

We compare the testing performance of our proposed architecture to that of the BtrflyNet architecture developed by Shimizu et al. [6] for bone scan image segmentation. We present the experimental results in Tables 5, 6, and 7. Table 5 compares the average Dice scores of the anterior and posterior parts, while Tables 6 and 7 show the Dice scores for each segment in the anterior and posterior parts, respectively.

Table 5 Comparison of averaged Dice scores on anterior and posterior parts
Table 6 The Dice score of each segment: anterior part
Table 7 The Dice score of each segment: posterior part
Fig. 6 Segmentation results of the first four test images (anterior and posterior parts). Top: predictions using BtrflyNet; middle: predictions using Efficient-BtrflyNet; bottom: ground truth annotations

Fig. 7 Segmentation results of the next four test images (anterior and posterior parts). Top: predictions using BtrflyNet; middle: predictions using Efficient-BtrflyNet; bottom: ground truth annotations

Fig. 8 Samples of predicted cervical vertebrae segments in the anterior (a and c) and posterior parts (b and d). Top: predictions using BtrflyNet; middle: predictions using Efficient-BtrflyNet; bottom: ground truth annotations

Fig. 9 Examples of poor prediction results for the skull and femur in the anterior section: a original bone scan image, b segment predicted by BtrflyNet, and c segment predicted by Efficient-BtrflyNet

Fig. 10 Examples of segmentation results for the anterior ribs. Top: predicted segment using BtrflyNet; middle: predicted segment using Efficient-BtrflyNet; bottom: ground truth annotations

Fig. 11 Examples of segmentation results for the posterior ribs. Top: predicted segment using BtrflyNet; middle: predicted segment using Efficient-BtrflyNet; bottom: ground truth annotations

Fig. 12 Examples of segmentation results for the pelvis. Top: predicted segment using BtrflyNet; middle: predicted segment using Efficient-BtrflyNet; bottom: ground truth annotations. a, c: anterior part; b, d: posterior part

Fig. 13 A segmentation defect in the pelvis caused by the presence of a metastatic hotspot: a segmentation result using BtrflyNet, b segmentation result using Efficient-BtrflyNet, c original bone scan image

Fig. 14 An example of a scapula segmentation result in the posterior part; the region of the scapula that overlaps the ribs is well segmented: a segmented using BtrflyNet, b segmented using Efficient-BtrflyNet, c original bone scan image

In general, our proposed architecture, Efficient-BtrflyNet, generates segments with higher Dice scores than BtrflyNet in both the anterior and posterior parts (Tables 6 and 7). However, BtrflyNet produces better Dice scores in a few segments, namely the clavicle and scapula in the anterior part and the lumbar vertebrae and sacrum in the posterior part. Nonetheless, the Dice score gaps between BtrflyNet and the proposed Efficient-BtrflyNet in these segments are generally less than 0.01.

Furthermore, we visually present the segmentation results on the eight test images in Figs. 6 and 7. The top and middle rows of the figures show the segmentation outputs using BtrflyNet and Efficient-BtrflyNet, respectively, while the bottom row shows the corresponding ground truth annotations.

The cervical vertebrae segment had the lowest Dice value in the anterior part (see Table 6) for both the BtrflyNet (Dice=0.4316) and Efficient-BtrflyNet (Dice=0.5337) models. As depicted in Fig. 8, the cervical vertebrae segment, located between the skull and thoracic vertebrae segments, has a very small surface area, making it difficult for the model to capture precisely. Moreover, the predicted skull segment occasionally merges with the cervical vertebrae segment, reducing the area detected as cervical vertebrae. In contrast, the cervical vertebrae segment performs well in the posterior part owing to its clearer appearance and larger area there.

There are almost no segmentation problems, anterior or posterior, in the skull segmentation results. Figure 9 depicts the image with the worst result, where the anterior skull segment is poorly segmented. Figure 9c demonstrates that the Efficient-BtrflyNet model leaves a smaller unsegmented area than the BtrflyNet model (Fig. 9b). Examining the original image, we can see that the skull has a gray color that is difficult to distinguish from the background, making it challenging to obtain a good segment in this area.

A similar issue occurs in one of the femur segment results, shown in the blue-circled region of Fig. 9. The gray color of the femur in the original bone scan image is indeed difficult to identify, so the model cannot accurately predict the femur segment.

On the other hand, both BtrflyNet and Efficient-BtrflyNet produced excellent segmentation results for the anterior ribs segment. As shown in Fig. 10, Efficient-BtrflyNet solves the segmentation problems better in some instances: with BtrflyNet, a number of pixels are not identified as part of the ribs segment, whereas this issue is absent in the Efficient-BtrflyNet results.

The posterior ribs segment, however, has a segmentation flaw: multiple pixels that are not part of the ribs are predicted as ribs. Figure 11 shows a region separate from the ribs that nonetheless receives the same purple label as the ribs themselves.

The anterior and posterior segmentation of the pelvis is another noteworthy result; Fig. 12 contains several examples. With the BtrflyNet model (upper portion of Fig. 12), there are still unsegmented areas; however, this condition is greatly improved by Efficient-BtrflyNet (middle portion of Fig. 12). In the pelvis segmentation, the presence of metastatic hotspots in the bone scan image causes a segmentation error: as depicted in Fig. 13, the predicted pelvis segment is widened because the metastatic hotspot (the red-circled portion) is assumed to be part of the pelvis.

In contrast to the work of Shimizu et al. [6], which combines the scapula segment with the ribs segment in the posterior part, we treated the scapula segment as distinct from the ribs segment in this study. A portion of the posterior scapula overlaps the ribs segment, making it potentially difficult to identify as part of the scapula. Therefore, we labeled the scapula in the posterior part with two different colors: pink for the scapula adjacent to the ribs and dark brown for the scapula adjacent to the clavicle (see Fig. 2 for clarity). Both of these scapula segments received favorable Dice scores (see Table 7). Figure 14 depicts an example of scapula segmentation results in the posterior section.

4 Conclusion

Developing a semantic segmentation architecture for whole-body bone scan images is a challenging task. We have proposed Efficient-BtrflyNet, a novel method for segmenting whole-body bone scan images, and successfully demonstrated its superiority. Our approach segmented several regions that BtrflyNet could not and improved the results in several other segments, as indicated by the increase in Dice score.

Based on the experimental results, and given that medical images typically have low resolution, significant future work will involve advanced image pre-processing to enhance image quality and reduce segmentation errors. In addition, post-processing is needed to eliminate inaccurate segments without diminishing the quality of segmentation results in parts that are already good. Increasing the size of the training dataset is also crucial for improving the performance of deep learning-based segmentation models.