Trigger finger is a common hand condition that causes pain, popping, locking and loss of movement of the affected finger [1]. Stretching of fingers, in which the flexor tendon glides through the tendon sheath and the pulley catches the passing tendon close against the bones of the finger [2]. Trigger finger is generally caused by a size mismatch between the flexor tendon and the pulley [3]; e.g., there is swelling in the A1 pulley of the flexor tendon sheath which causes the pulley to be stuck [4]. This situation causes painful triggering or locking of fingers.

In clinical settings, ultrasound imaging is widely used to diagnose finger tissue conditions. Sato et al. [5] compared the thickness of the pulley and flexor tendon in healthy fingers and fingers associated with the trigger finger condition. Yang et al. [6] proposed a method for identifying the position and thickness of the A1 pulley. Kim et al. [7] found that the thickness of the A2 pulley and flexor tendon under the A2 pulley are related to the severity of trigger finger. Figure 1a shows an ultrasound transverse image of the finger located at the position of A1 pulley. As shown in Fig. 1b, the elliptical shape of the tendon appears above the volar plate of the finger bone and is surrounded by the synovial sheath, which is indicated by the dotted line.

Fig. 1
figure 1

Ultrasound image of the tendon and synovial sheath of a finger. a Original image acquired around A1 pulley. b Tendon (solid line), synovial sheath (dotted line) area, and surrounding tissues

In addition to diagnosis, ultrasound can also be used as part of the surgical treatment of the trigger finger condition. Ultrasound-guided minimally invasive surgery has the advantages of fast recovery and minimal tissue damage. However, a highly experienced surgeon is required to perform the operation; therefore, surgical training and planning systems are very desirable [8,9,10,11]. The anatomical target and nearby organs or tissues are present in a 3D virtual environment, of which the user can access. Typically, to achieve more realistic and accurate simulations or planning, the target models are built from images and contours of the organs or tissues must be manually outlined to construct models. In this study, to extract tissues contours, we propose a CNN-based method for automatic segmentation of the tendon and synovial sheath.

Numerous studies used automatic algorithms of image processing techniques to segment soft tissue. Gupta et al. [12] proposed an automatic segmentation method of supraspinatus tendon based on the curvelet transform. Hamameh and Gustavsson [13] proposed a method for segmentation of the human left ventricle based on a combination of active shape model (ASM) and active contour mode (ACM). Cunningham et al. [14] proposed a method for segmentation of cervical muscles and the spine, in which ASM was used to refine the segmentation. Martins et al. [15] proposed a method for segmentation of the extensor tendon in dorsal longitudinal view, which relied on an active contours’ framework. The ASM and ACM are the most widely used methods for soft tissue segmentation. In our previous work [16], we proposed an adaptive texture-based ASM (ATASM) for segmentation of the tendon and synovium sheath. The ATASM applies a shape model together with texture feature generation and then uses a genetic algorithm-based energy optimization. In our previous work, the acquired ultrasound images were separated into two groups, clear and fuzzy, based on image quality. The average dice similarity coefficient (DSC) of ATASM is 0.905 and 0.874 for tendon and sheath, respectively, which is superior to the ASM, ACM, and texture-based ASM. Although the ATASM had been demonstrated to be a powerful method for finger tissue segmentation, it struggled with complicated computations, such as locating the shape model and image grouping.

Recently, deep convolutional neural networks (CNNs) have demonstrated enormous potential in the field of medical image analysis. Unlike traditional machine learning methods, deep neural networks do not require handcrafted features, such as texture features for training, and can be trained end-to-end for object segmentation. Thus, the CNN is a suitable choice for separating the regions of the tendon and synovial sheath. In biomedical image segmentation, recent success in precise image segmentation was achieved using the encode–decode structure, such as the fully convolutional network (FCN) [17], SegNet [18], and U-Net [19]. However, the larger number of feature maps always significantly increases the scale of the network, including network depth and the size of each hidden layer. To efficiently control the number of feature maps in different channels of parallel computation, the DenseNet [20] was proposed. A newly developed CNN, called the fully convolutional DenseNet (FC-DenseNet) [21], incorporates the fully convolutional network (FCN) with the DenseNet, which has been demonstrated to be a powerful method for object segmentation. In our preliminary work [22], we attempted to apply FC-DenseNet for the segmentation of tendons from ultrasound images; however, the average dice similarity coefficient (DSC) was only 0.88 for 380 test images.

The dilated convolution [23] applied multiscale information in segmentation tasks without losing resolution. After applying this technique, the receptive field of the convolution can be exponentially expanded. Javaid et al. [24] compared the standard U-Net and dilated U-Net for multi-organ segmentation of chest CT images and found that dilated U-Net yielded the best results. Perone et al. [25] devised a model for spinal cord gray matter (GM) segmentation using deep dilated convolutions from an MRI dataset and reported that—in a GM segmentation challenge—the state-of-the-art results were more favorable in 8 out of 10 metrics. For combining this technique with the DenseNet, Zhou et al. [26] proposed an adaptive DenseNet that uses the mechanism of dilated convolution to classify and locate thoracic disease based on X-ray images of the chest. Yang et al. [27] proposed a network called DenseASPP (densely connected Atrous Spatial Pyramid Pooling) for semantic segmentation of street scenes where dilated convolution layers were applied. Although these structures achieved outstanding segmentation performance, use of dilated convolutional DenseNet for soft tissue segmentation is still very rare, to the best of our knowledge.

Deeply supervised nets [28] minimize classification error, while the learning process of hidden layers is direct and transparent. Deeply supervised nets introduce a common objective to the individual hidden layers instead of the overall loss function, and then use the layer-wise training strategy to enhance the classification capability of the feature maps. Mo et al. [29] proposed a deep-supervised FCN for segmentation of the vessel from retinal images. Chung et al. [30] proposed a dense block applied FCN with deep supervision for segmentation of the liver from CT images. Lei et al. [31] proposed a method for ultrasound prostate segmentation based on 3D V-Net with a deep supervision mechanism; this method demonstrated high accuracy, with a DSC of 0.92. These previous studies demonstrated the effectiveness of deeply supervised training for segmentation CNNs.

The dilated convolution and deep supervision are CNN techniques that can further improve the details of segmentation. However, a CNN structure that combines both of these is rarely seen. In this article, we attempt to integrate the concepts of deeply supervised net and dilated convolution into the FC-DenseNet, resulting in a new CNN termed “deeply supervised dilated FC-DenseNet (D2FC-DN)”. This work proposes a hybrid CNN technique method for finger tissue segmentation, in which the FC-DenseNet is the basic structure and dilated convolutional is combined to allow multiscale feature map information without losing resolution. Then, deeply supervised learning is applied to increase the transparency of the hidden layers to ensure that each hidden layer output is trained under supervision. In the experiments conducted in this work, this new CNN was used to segment the finger tendon and synovial sheath from transverse ultrasound images.

This paper is organized as follows. The data material of this study and experimental results are described in “Results” and “Discussion” sections that demonstrate Chuang’s dataset [16] with independent images that were used to compare the results of this study with conventional methods, and a dataset with eight groups of images used for model building. Conclusions of this study are presented subsequently. The proposed D2FC-DN is presented in “Methods” section, and the applied FC-DenseNet, dilated convolution, and deeply supervised methods are also described.


Data materials

Two datasets of ultrasound images of fingers at the A1 pulley, captured in the transverse view, were used in this study. The first set was the Chuang’s dataset [16], which was used to make comparisons with the proposed method. The second set consisted of eight groups of images captured for building a model of finger tissue for a surgical training system, termed the modeling-building (MB) dataset. Chuang’s dataset consists of ultrasound images, of which 74 are finger tendon images and 57 are synovial sheath images. All of the images have segmentation ground truth. The images in this dataset were classified into clear and fuzzy groups, according to the difference in average intensity of two bounding boxes on the bottom boundary of the tendon in the image. For the tendon images, there are 38 images in the clear group and 36 images in the fuzzy group; for the synovial sheath images, there are 30 and 27 images in the clear and fuzzy groups, respectively. The physical spacing of the images was 0.075 × 0.075 mm2 for each pixel. The data were acquired using the t3000 ultrasound system (Terason, Burlington, MA, USA) with a 13-MHz probe.

The MB dataset consists of eight repeated acquisitions of ultrasound images of a subject on different days, which were acquired at National Cheng Kung University Hospital (IRB number: B-ER-101-012). A total of 1035 images were acquired, with approximately 90–200 images in each group. In this dataset, the ultrasound images were captured by the Siemens ACUSON S2000 Ultrasound System, using an 18–5.5 MHz linear 18L6 HD transducer. 2D transverse ultrasound images of the right-hand’s middle finger at the A1 pulley were captured. The tendon and synovial sheath regions in the images were annotated by Dr. T. H. Yang as the ground truths. The pixel spacing of the images is 0.07 × 0.07 mm2 for each pixel. Figure 2 shows example images of the datasets used in this study, which have a resolution of 384 × 192 pixels. Clear and fuzzy images from Chuang’s dataset are shown in Fig. 2a, b and images from the MB dataset in Fig. 2c, d. We found four images in Chuang’s dataset that contain two finger tissues; thus, we kept the tissue of interest and masked the other one in black.

Fig. 2
figure 2

Dataset image samples. a, b Chuang’s dataset; c, d MB dataset


We first compared the proposed network with ATASM, U-Net, FC-DenseNet, and dilated FC-DenseNet (DFC-DN), using Chuang’s dataset. These methods were evaluated by fourfold cross-validations. The network was pre-trained with images from the MB dataset. Then, for finger tissue modeling, we also evaluated the accuracy of our segmentation method. There were eight groups of continuous images in the MB dataset. During evaluation, one group was left for testing and other groups for training, in which each group was tested in turns. In practice, each training group only generated one sequence of training images, more precisely, the first image of each training group was selected as the start image and the other training images were subsequently extracted from the start image at intervals of four images. The network was pre-trained with images from Chuang’s dataset.

In this study, because the number of images was limited, the number of training images was increased using data augmentation in which images were flipped, translated, and scaled. The parameters of the augmentation are shown in Table 1, in which the augmentation type, random interval of parameter and unit are shown. In this study, each image in the training group was augmented by vertical flipping, four times of random translation and four times of random rotation. After this augmentation, the amount of training images was increased to ten times. In addition, to further increase the variety of the data, the augmentation of rotation, shrinking, translation, noise adding, and gamma transformation were randomly applied to the data in every training epoch. In this study, the networks were built with similar architecture to facilitate comparisons. The number of parameters of the U-Net was 5.24 million, while FC-DenseNet, DFC-DN, and D2FC-DN were approximately 4.97 million. The running time of U-Net, FC-DenseNet, DFC-DN, and D2FC-DN were 0.02565, 0.03281, 0.03164, and 0.03151 s per image, respectively. It is worth noting that the architecture of DFC-DN and D2FC-DN were nearly identical, except for the deeply supervised part that had approximately 3000 additional parameters. The input of the networks was the ultrasound image in size of 384 × 192 × 1 and the output was the segmentation result with 384 × 192 × 1. The training strategy and parameter settings of these networks were the same.

Table 1 Data augmentation

We used the DSC, mean of absolute distance (MAD), Hausdorff distance (HD) and Yasnoff [32] to evaluate the segmentation results of the methods. These measures show the similarity of region and contour between the predicted results and the ground truth. The definitions of these are shown below.

$${\text{DSC}}\left( {X,Y} \right) = \frac{{2\left| {X \cap Y} \right|}}{\left| X \right| + \left| Y \right|} ,$$
$${\text{MAD}}\left( {A,B} \right) = \frac{1}{2}\left[ {\frac{1}{\left| A \right|}\mathop \sum \limits_{a \in A} \mathop {\inf }\limits_{b \in B} d\left( {a,b} \right) + \frac{1}{\left| B \right|}\mathop \sum \limits_{b \in B} \mathop {\inf }\limits_{a \in A} d\left( {a,b} \right)} \right] ,$$
$${\text{HD}}\left( {A,B} \right) = { \hbox{max} }\left\{ \mathop {\sup }\limits_{a \in A} \mathop {\inf }\limits_{b \in B} d\left( {a,b} \right),\mathop {\sup }\limits_{b \in B} \mathop {\inf }\limits_{a \in A} d\left( {a,b} \right)\right\} ,$$
$${\text{Yasnoff}}\left( {A,B} \right) = \frac{100}{\text{W}}\sqrt {\mathop \sum \limits_{a \in A} \left( {\mathop {\hbox{min} }\limits_{b \in B} d\left( {a,b} \right)} \right)^{2} } ,$$

where \(\left| \cdot \right|\) denotes the pixel number of the region \(X\) (predicted result), \(Y\) (ground truth) or \(X \cap Y\) (overlapped region of the predicted result and ground truth); \(A\) and \(B\) are the contours of \(X\) and \(Y\), respectively; \(d\left( {a,b} \right)\) is the distance between \(a\) and \(b\), where \(a\) is a pixel on \(A\), and \(b\) is a pixel on \(B\); W is the number of pixels in the image. To evaluate the smoothness of the output contour, a convex hull Hausdorff distance (CHD) is defined, as described below

$${\text{CHD}}\left( {A,C} \right) = { \hbox{max} }\left\{ \mathop {\sup }\limits_{a \in A} \mathop {\inf }\limits_{c \in C} d\left( {a,c} \right),\mathop {\sup }\limits_{c \in C} \mathop {\inf }\limits_{a \in A} d\left( {a,c} \right)\right\} ,$$

where \(A\) is the contour of the predicted region, \(C\) is the contour of the convex hull of the predicted region; \(a\) is a pixel on \(A\), and \(c\) is a pixel on \(C\). So, if the contour of the predicted result is smooth, CHD will be low, otherwise it is high. Figure 3 shows examples of result contours with their convex hull contours. It can be observed that the CHD increases as the smoothness of the result contours decreases. Note that the artifacts are filtered when calculating HD and CHD, because they are very sensitive to noise.

Fig. 3
figure 3

Predicted result contours (cyan) and the convex hull outputs (red). a Smooth contour (CHD = 1.41). b Contour with a bud (CHD = 7.62). c Contour with a groove (CHD = 18.68)

Segmentation performance evaluation

Chuang et al. [16] proposed a conventional ultrasound image segmentation method called ATASM, which showed state-of-art results on finger tendon and synovial sheath. In this study, we used the same dataset to evaluate segmentation performance. To separate the Chuang’s dataset, we used the same criterion as Chuang’s study. The difference between the average intensities of two 15 × 15 windows, which are above and below the bottom tendon boundary, was calculated. If the average intensity of the window above is higher than the one below by 30, the image belongs to the clear group; otherwise, it belongs to the fuzzy group. The clarification of clear and fuzzy images, which depends on the boundary of the tendon, can be referred in the literature of Chuang’s dataset.

The segmentation evaluation of the tendon is shown in Table 2, where the clear group is shown on the left side and the fuzzy group is on the right. Our proposed method obtains the best DSC and outperforms ATASM, with 2% higher on the clear group. DFC-DN and D2FC-DN both have the highest DSC in the fuzzy group that is approximately 2% more than ATASM’s. For the contour similarity measurements, the proposed method achieves the best MAD, HD and Yasnoff for both the clear and fuzzy groups. In addition, D2FC-DN has the best CHD in both groups, which also indicates it can give smooth output contours. It is worth noting that ATASM can achieve better or on par results with U-Net and FC-DenseNet on DSC and better MAD than these two methods.

Table 2 Tendon segmentation results of Chuang’s dataset (mean ± standard deviation)

Figure 4 shows examples of the visualization evaluation of the tendon segmentation on Chuang’s dataset. Samples 1 and 2 are in the clear group, and the other two are in the fuzzy group. The results are magnified for better visualization. The source and result images of the methods are shown from top to bottom where the predicted result contours are shown in cyan color and the ground truth are in red. The results of U-Net are over-segmented and many artifacts can also be seen. The results of FC-DenseNet are significantly better; however, sample 4 is largely over-segmented. The results of DFC-DN and D2FC-DN are quite similar, which indicates that their network structures are very similar except that there are some additional parameters for the deeply supervised training. Some little artifacts on samples 2 and 4 of DFC-DN outputs can be observed; in addition, the contours are more fitting on D2FC-DN.

Fig. 4
figure 4

Segmentation results of tendon on Chuang’s dataset

Table 3 shows the synovial sheath segmentation results of the methods on Chuang’s dataset. The proposed method achieved the best DSC that is 3% higher than ATASM in both the clear and fuzzy groups. In general, the D2FC-DN outperformed the other CNN methods in MAD, HD, Yasnoff, and CHD. Comparison of the segmentation results of tendon and synovial sheath showed that the DSC difference of ATASM is large, with a 4% difference in the clear group and 2% in the fuzzy group. This indicates that the segmentation criteria are not alike between these tissues. However, the differences are smaller in CNN methods; for example, only a 1% difference in the fuzzy group for D2FC-DN. In other words, the CNN method gives more stable results with different segmentation conditions.

Table 3 Synovial sheath segmentation results of Chuang’s dataset (mean ± standard deviation)

The synovial sheath visualization results are shown in Fig. 5; the arrangement of the results are the same as Fig. 4. The results of U-Net are largely over-segmented on sample 3 and irregular contours were observed on samples 1 and 3; however, the result of sample 2 is rather fitting. The result contours of FC-DenseNet are not smooth enough on the samples, over and less segmented on samples 3 and 4, respectively. Compared to FC-DenseNet, the result contours of DFC-DN are better; however, the results are over-segmented on the samples. The result contours of D2FC-DN are smooth and better fitting on the samples, although there is still little not a complete match on sample 1.

Fig. 5
figure 5

Segmentation results of synovial sheath on Chuang’s dataset

3D models of MB dataset

The MB dataset, which consists of eight groups of ultrasound images, was used to build 3D finger tissue models for surgical training. We first evaluated segmentation performance between the methods, then we used the segmentation results to build the 3D models for surgical training, and finally we show the segmentation accuracy of each group.

Table 4 shows the segmentation results of the methods, where the results of tendon are shown on the left side and synovial sheath on the right. The proposed method displayed the best performance on tendon. An extremely high DSC is achieved by our method of approximately 95% for synovial sheath, where DSC, MAD, and HD are on par between D2FC-DN and DFC-DN. However, CHD is significantly better on D2FC-DN, which shows the proposed method has the ability to yield smoother contours.

Table 4 Tendon and sheath segmentation results of MB dataset (mean ± standard deviation)

Figure 6 shows the visualization results of the CNN methods; the first two samples are tendon and the other two are synovial sheath. The results are magnified for better visualization. The result contours of the methods are shown from top to bottom in cyan color, and the ground truth is in red. The U-Net results are less segmented for samples 2 and 4, and the contours of samples 1 and 3 are not smooth enough. The results of FC-DenseNet and DFC-DN are similar, but a large false segment on sample 2 resulting from the FC-DenseNet method and small artifacts on samples 1 and 4 from DFC-DN exist. It can be observed that the D2FC-DN method gives finely matching and smooth results between these methods, although the last sample is not smooth enough but it still fits.

Fig. 6
figure 6

Segmentation results of tendon and synovial sheath images

D2FC-DN had the best segmentation performance; thus, we used this method to segment the tendon and synovial sheath of the MB dataset images. The segmentation accuracy of the eight groups is shown in Table 5. The tendon results are shown on the left side and the synovial sheath results on the right side. For the tendons, the DSC of group 1 and 8 is remarkably high and MAD is very low, which shows the similarity between the results and that the ground truth is very high. The DSC of groups 3, 4, and 6 is also high (0.92) and MAD is approximately 2.5. The DSC of the remaining three groups are approximately 0.9 and MAD is approximately 3.3. For the synovial sheaths, the DSC of all groups is amazingly high and MAD is extremely low, which may have been because of the higher consistency of the synovial sheath contour as compared to tendon on the ultrasound images. It is worth noting the variance of the HD results; however, the CHD is very small, which shows that the result contours are smooth. The increase of CHD of group 8 synovial sheath in which a few images have the problem is shown in Fig. 3c; in other words, it shows that CHD is a powerful measurement for the smoothness of a contour.

Table 5 Tendon and sheath segmentation results of each group in MB dataset (mean ± standard deviation)

The 3D models built from the segmentation results using our proposed method are shown in Fig. 7. Groups 1 to 8 are shown, from top to bottom and left to right. Different appearances can be seen in these soft tissue models. In addition, the length of them are related to the number of acquired images, in which groups 4 and 6 are longer than the others, and groups 2 and 8 are short. Figure 8 shows the synovial sheath models; the appearances of these are thicker than the tendons. The models were built from the segmentation results using Marching cubes algorithm [33] with smoothing. To provide valid models for virtual surgical training, the artifacts outside the target tissues were removed.

Fig. 7
figure 7

3D tendon model of MB dataset

Fig. 8
figure 8

3D synovial sheath model of MB dataset


Advantages of D2FC-DN

For the evaluation of tendon and synovial sheath segmentations, the proposed D2FC-DN got the best performance when comparing to the other CNN methods. D2FC-DN outperforms the other deep models because it combines several effective CNN techniques to achieve their advantages and improve the segmentation results. It basically follows the encode–decode structure of the U-Net with skip connections in the network which can retrieve the lost information on the encoder side after downsampling. The dense block which is a major component of the FC-DenseNet is applied in the proposed network to achieve multiple feature map layers for convolution, which also improves the efficiency of the feature propagation. Dilated filtering is also applied to increase the receptive fields of convolution without reducing the feature resolution or increasing the filter size. Eventually, we apply the feedback supervision on each level of network output when training, so that the discriminative ability of the entire network can be improved. Due to these advantages, the proposed method not only acquires high segmentation accuracy, but also achieves the resulting contours more precisely.

Performance of DFC-DN and D2FC-DN

The main network architectures of DFC-DN and D2FC-DN are similar, except the deeply supervised connections on the upsampling side for D2FC-DN. Similar performance was also observed between DFC-DN and D2FC-DN. We further explored their differences in terms of CHD, which represents the smoothness of the result contours. We also calculated the CHD of the finger tissue contours drawn by Dr. T. H. Yang and we found that they are extremely low and contain very few outliners. Figure 9 shows the CHD boxplot results of Chuang’s dataset from DFC-DN, D2FC-DN, and manual evaluation (by Dr. T. H. Yang), and the resulting distributions are discussed. For the boxplot of data, the median and the interquartile range (the 75th and the 25th percentiles) are indicated by the central mark, top and bottom edges of the box, respectively. The whiskers extend to show the extreme observations of the data, and the outliers are indicated individually by a ‘+’ symbol. It can be observed that the CHD median and interquartile range of D2FC-DN are better than DFC-DN on the fuzzy group for both tendon and synovial sheath, and they are closer to the manual ones. The CHD median and interquartile range of D2FC-DN are on par with DFC-DN on the clear groups. Figure 10 shows the CHD statistical performance of the MB dataset. It can be observed that the manual results are extremely low, which means that the result contours are smooth. The CHD median and interquartile range of D2FC-DN are mostly better than DFC-DN and closer to the manual ones. These results show that our deeply supervised strategy gives more stable, smoother contour results, in which the manual ground truth tends to have smooth contours. The visualization of the smoothness comparison between these methods can be referenced in Figs. 4, 5 and 6; for example, the contours of D2FC-DN are more smooth than DFC-DN in Fig. 5 sample 1 and Fig. 6 sample 2. Note that the artifacts are filtered for evaluation because CHD is very sensitive to noise.

Fig. 9
figure 9

CHD results of Chuang’s dataset. a Tendon, b synovial sheath

Fig. 10
figure 10

CHD results of MB dataset. a Tendon, b synovial sheath. *Method (group#)

Comparison to manual segmentation results

We compared the proposed method to the manual segmentation results from Chuang’s study [16]. In the study, two users were trained to outline the tendon from 16 ultrasound images. Both of them were blinded to the ground truth and the automatic segmentation results. The average DSCs of users 1 and 2 are 0.88 ± 0.07, 0.93 ± 0.05, respectively, and the average DSC of the proposed D2FC-DN is 0.91 ± 0.04. It can be observed that user 2′s and our segmentation results are closer to the ground truth with high DSC values. User 1 has a relatively lower DSC value because two cases of the bottom areas of tendon were misjudged. We believe our proposed segmentation method can give accurate enough and reliable results when comparing with the regular and more experienced users.

For the training and testing performance of the proposed method, the training time for the Chuang’s dataset is approximately 2 h and approximately 3.5 h for the MB dataset. The segmentation result prediction time is approximately 0.03 s per image. The system was run on a PC with Intel i7 CPU, 24G RAM memory, and an NVIDIA GeForce RTX 2080 Ti display card equipped. The proposed network is implemented in Python with Tensorflow framework.


This article proposed a deeply supervised dilated FC-DenseNet (D2FC-DN) for finger tendon and synovial sheath segmentation from ultrasound images. In the D2FC-DN, dilated dense block and deeply supervised training were adopted to successfully improve the contour smoothness. The proposed method achieves better results than not only the state-of-art conventional image processing method, but also certain representative CNN methods. The proposed method is fully end-to-end trained and tested; thus, it overcomes the preprocessing requirement of the conventional methods. The segmentation results were remarkably high on DSC values and had the smallest MAD of two datasets. The proposed method was applied to segment multiple groups of tendon and synovial sheath ultrasound images. The results had very high DSC and low MAD and were used to build 3D models for virtual surgical training. In the future, we will obtain more data for evaluation and migrate to the handling of soft tissue images for the surgical training system and clinical evaluation.



The U-Net [19] is a popular and basic segmentation network for medical images, which shows remarkable performance. It is a U-shape structure CNN with encoder and decoder parts, in which skip connections exist between them so that the information lost through encoding can be retrieved during decoding. Figure 11 shows an example of a U-Net with skip connections of the feature map from the downsampling side (encoder) to the upsampling side (decoder). In the network, feature map convolutions at each level are processed twice, the corresponding feature map on the downsampling side concatenates the upsampled feature map, and the result becomes an input of the convolution in the upsampling side. Downsampling and upsampling are performed by max pooling and up-convolution, respectively. According to this network, the final output map becomes the segmentation result of the input image.

Fig. 11
figure 11

The architecture of U-Net

Fully Convolution DenseNet

The DenseNet [20] was proposed to improve classification accuracy by efficiently reusing feature maps. The FC-DenseNet [21] is the extension of DenseNet to the segmentation task. FC-DenseNet not only retains the efficiency of feature reuse, but also achieves excellent performance in semantic segmentation. Figure 12 shows an example of an FC-DenseNet, which follows the U-shape structure of U-Net with the convolution replaced by the dense block of DenseNet. Furthermore, the input feature map concatenates with the output of dense block in the structure. Downsampling and upsampling are performed by the transition down- and up-layers, respectively. For a transition down-layer, a sequence of processes is applied, consisting of batch normalization, ReLU, 1 × 1 filter size convolution, and 2 × 2 max pooling with stride 2. For a transition up-layer, the corresponding sequence of processes is batch normalization, ReLU, and 3 × 3 transposed convolution with stride 2.

Fig. 12
figure 12

The architecture of FC-DenseNet

For the design of DenseNet and FC-DenseNet, the dense block is the most important component and efficiently uses multiple feature maps. The diagram of a four-layer dense block is shown in Fig. 13, in which the output feature map of each layer is k channel, where k is called the growth rate. Each layer includes a sequence of processes, consisting of batch normalization, ReLU, and n × n convolution. Concatenation of the input and output of each layer becomes the input to the next layer, with the exception of the output of the final layer, which concatenates with all the outputs of the previous layers in the block. Thus, in Fig. 13, if k = 4, the output channel number of the block is 16. This design can prevent an increase in the number of channels when the network is deep and allows the efficient use of multiple features.

Fig. 13
figure 13

A dense block with four layers

Dilated FC-DenseNet

Max-pooling is applied in conventional CNNs; otherwise, the filter size must be enlarged to increase the receptive field of the feature maps. However, max-pooling results in information loss, while filter size enlargement increases the computation cost. To overcome these disadvantages, dilated convolution [23] for dense prediction was proposed, where the receptive field was increased in an efficient manner. The discrete convolution of a discrete function F and a kernel \(k\) is shown in Eq. (6),

$$\left( {F*k} \right)\left( p \right) = \mathop \sum \limits_{s + t = p} F\left( s \right)k\left( t \right),$$

where \(p\) is the position variable and \(*\) is the discrete convolution operator, in which \(F :{\mathbb{Z}}^{2} \to {\mathbb{R}}\) and \(k :\varOmega_{r} \to {\mathbb{R}}\) where \(\varOmega_{r} = \left[ { - r,r} \right]^{2} \cap {\mathbb{Z}}^{2}\). For instance, \(\varOmega_{r}\) = {(0,0), (0,1), (1,0), (1,1), (0,−1), (−1,0), (−1,−1)}, when \(r = 1\). The kernel \(k\) is a discrete filter with size (2 \(r\) + 1)2. The discrete dilated convolution of the function F and the kernel \(k\) are shown in Eq. (7),

$$\left( {F*_{l} k} \right)\left( p \right) = \mathop \sum \limits_{s + lt = p} F\left( s \right)k\left( t \right),$$

where \(*_{l}\) is the dilated convolution operator with dilation factor \(l\). The definitions of the parameters \(p,s\), and \(t\) are the same as in Eq. (6). Figure 14 shows multiple kernels with different dilated factors, and the dilated filter in the blue color is overlapped on a 7 × 7 receptive field of F. Only the overlapped position contributes to convolution and the receptive field of convolution increases when the factor increases. When the dilation factor = 1, it is equivalent to the conventional discrete convolution.

Fig. 14
figure 14

Dilated convolution with different dilation factors

To increase the receptive field without losing information and increasing computation cost, recent studies [23,24,25,26,27] applied dilated convolution layers on the networks for classification and segmentation and obtained outstanding improvements. In this study, dilated convolution was applied on the dense block to increase the receptive field, which achieved better segmentation results. Figure 15 shows the original and the dilated dense block layers. For the original dense block layer, the process sequence was batch normalization, ReLU, and (2r + 1)2 filter size convolution, where the receptive field was fixed for all layers. For the dilated dense block layer, the process sequence was the same, but dilated convolution was applied. In this study, we set r = 1 to apply 3 × 3 convolutions with different dilation factors. The dilation factor increased according to the layer number N; i.e., the dilated factor for the first layer was 1, the dilated factor for the second layer was 2, and so forth. In the dilated dense block, the convolution receptive field was concurrently enlarged on deeper layers. It should be noted that dilated convolution has the same number of parameters as the fixed style, indicating that no extra effort is required when using it.

Fig. 15
figure 15

Original and dilated dense block layers. a Original, b dilated

Deeply supervised dilated FC-DenseNet (D2FC-DN)

The layer-wise segmentation CNN consists of deep downsampling of feature maps, which may cause the gradient to vanish at low resolution levels, which affects training and network performance. To overcome these problems, deeply supervised methods [28,29,30,31] could be applied, which increase the discriminative capability and prevent the gradient from vanishing for a deep segmentation network. Because of these advantages, we applied deeply supervised learning to the segmentation network. To achieve deep supervision, pixel-based auxiliary classifiers are applied at certain intermediate layers of the network when training.

For the segmentation of finger tendon and synovial sheath, although the U-Net and FC-DenseNet showed effective segmentation performance, however, the stability of them is not enough because of over-segmentations or irregularly segmented contours. The dilated convolution has the ability to increase the receptive fields of feature abstraction so that it can overcome the disadvantages of over-segmentations; however, the resulting contour is still not smooth enough. After applying the deeply supervised strategy, we found that the smoothness of the segmentation contour was significantly improved. Consequently, the proposed method also outperformed the other networks based on the evaluation of CHD measure which was most widely used to evaluate the smoothness of contour.

Figure 16 shows the proposed deeply supervised dilated FC-DenseNet (D2FC-DN) for the segmentation of tendons and synovial sheaths from ultrasound images. In this study, the network structures for the segmentation of tendon and synovial sheath were identical but were trained individually. The input was the ultrasound image, while the output was the segmentation result. The network has an encoder and decoder structure with skip connections of three levels, where the encoder and decoder are implemented by downsampling and upsampling of feature maps, respectively, and the skip connection is the corresponding concatenation of the feature maps from the downsampling side to the upsampling side. The dilated dense block is the dense block with dilated convolution layers as shown in Fig. 15. Downsampling and upsampling correspond to the transition down- and up-layers, respectively, and were the same as those of the previously mentioned FC-DenseNet. The last feature map at each intermediate level on the upsampling side was convoluted by a 1 × 1 filter to become a probability map, and then was upsampled by bilinear interpolation to the input image resolution. After thresholding on these upsampled probability maps, the binary supervision results were obtained and compared with the ground truth, which provided deeply supervised training. In this study, the binary threshold values of probability maps were all set to 0.5 to achieve the intermediate and final segmentation results.

Fig. 16
figure 16

Proposed segmentation network

The detailed architecture of the D2FC-DN is shown in Table 6, which corresponds to Fig. 16, in which [.] denotes a dilated dense block layer. Similar to [34], the number of layers, \(n_{l}\), and the growth rate, k, for each level were set in an increasing manner. In this study, we set \(n_{l}\) = 4 at the highest resolution level, which was increased by 1 for each downsampling. The growth rate of the network began at 8 and doubled after each downsampling; the detail settings are shown in Table 6. It is worth noting that sufficient improvement was not found when using dense convolution on the upsampling side; hence, it is done by transition up-layer to reduce the parameter size of the network.

Table 6 Network architecture

Training of D2FC-DN

During training, the network was initialized by Gaussian distribution with adaptive standard deviation [35]; the standard deviation is shown below

$${\text{std}}\,{\text{dev}} = \sqrt {\frac{2}{{\left( {2r + 1} \right)^{2} \times c_{\text{in}} }}} ,$$

where \(\left( {2r + 1} \right)^{2}\) is the filter size and \(c_{\text{in}}\) the channel number of the feature map. The network was optimized by the RMSProp optimizer, the size of the mini-batch was four, and the initial learning rate was \(5 \times 10^{ - 4}\). Similar to [36], a polynomial decay was applied as shown below

$$l_{n} = l_{0} \times \left( {1 - \frac{n}{{N_{e} }}} \right)^{\alpha } ,$$

where \(l_{n}\) is the learning rate at the \(n\)th epoch, \(l_{0}\) the initial learning rate, and \(N_{e}\) the stopping epoch. We set \(\alpha = 1.5\), \(N_{e}\) = 50 for Chuang’s dataset and \(N_{e}\) = 20 for MB dataset in this study. The total loss applied to optimize the network is shown below

$$L_{\text{total}} = {\text{DSC}}_{\text{loss}} \left( {{\text{output}}_{\text{main}} } \right) + \sum w_{i} {\text{DSC}}_{\text{loss}} \left( {{\text{Output}}_{{{\text{supervised}}_{\text{i}} }} } \right),$$

where \({\text{output}}_{\text{main}}\) is the final output of the network, \({\text{Output}}_{{{\text{supervised}}_{\text{i}} }}\) is the supervised outputs at different levels of the network, and \(w_{i}\) is the weighting for supervised levels. As explained in [37], the weighting is low at the low-resolution supervising level. We observed that the best performance was achieved when \(w_{i} = \left\{ {\frac{1}{16},\frac{1}{8},\frac{1}{4} } \right\}, i = \left\{ {1,2,3} \right\}\) in terms of the evaluation measurements; thus, we adopted this setting for this study. The definition of \({\text{DSC}}_{\text{loss}}\) [38] is shown below

$${\text{DSC}}_{\text{loss}} = \frac{{2\mathop \sum \nolimits_{i}^{N} p_{i} q_{i} }}{{\mathop \sum \nolimits_{i}^{N} p_{i}^{2} + \mathop \sum \nolimits_{i}^{N} q_{i}^{2} }} ,$$

where p, q are the corresponding pixel class label of the predicted result and the ground truth, respectively. N is the total number of pixels of the image. The class label of object was set to 1 and background was set to 0. In this study, two identical networks for segmenting the tendon and synovial sheath were trained individually, using the same training strategies.