Lightweight deep learning methods for panoramic dental X-ray image segmentation

Dental X-ray image segmentation helps clinicians examine tooth conditions and identify dental diseases. X-ray imaging systems may need fast, lightweight segmentation algorithms that run without cloud computing. This paper investigates lightweight deep learning methods for dental X-ray image segmentation with a view to deployment on edge devices, such as dental X-ray imaging systems. A novel lightweight neural network scheme using knowledge distillation is proposed. The proposed method and a number of existing lightweight deep learning methods were trained on a panoramic dental X-ray image data set, and were evaluated and compared using several accuracy metrics. The trained model of the proposed method requires only 0.33 million parameters (∼7.5 megabytes), while it achieved the best performance in terms of IoU (0.804) and Dice (0.89) among the lightweight methods compared. This work shows that the proposed method for dental X-ray image segmentation requires little memory while achieving comparable performance. The method could be deployed on edge devices and could potentially help clinicians ease their daily workflow and improve the quality of their analysis.


Introduction
Among many medical imaging protocols, panoramic X-ray imaging [1] is a valuable tool for diagnosing dental diseases, as it requires a relatively low dose and low cost. By viewing panoramic dental X-ray images, dentists can examine the condition of the whole mouth and judge whether there are any dental diseases, such as caries/cavities, gum diseases, cracked or broken teeth and oral cancer. Dentists mainly analyze the X-ray images based on their experience and visual perception [2]. To obtain a dental panoramic image, the X-ray tube needs to rotate around the subject. Consequently, it can be challenging for dentists to examine the images effectively, for reasons such as varying levels of noise generated by the machine, low contrast of edges and overlapping anatomic structures. These image quality issues make it hard for dentists to identify diseases. Therefore, automatic analysis of dental X-ray images would help practicing dentists ease their daily workflow and improve the quality of their analysis.
To assist practicing clinicians in analyzing dental X-ray images, automatic methods need to be developed for tasks that include identifying and classifying tooth diseases, identifying anatomical landmarks and segmenting tooth structures [2,3]. Among these tasks, segmenting tooth structures is perhaps the most elementary for automatically analyzing dental X-ray images. For example, before tooth diseases can be identified, the teeth have to be isolated using segmentation techniques. Image segmentation is a common technique for analyzing images. Its goal is to partition a digital image into regions, which makes it easier to represent the image in terms of the distinct objects it contains. For example, in dental X-ray image analysis, the goal of segmentation is to isolate each tooth from other objects in the image such as jaws, gums and other details of the face. Further analysis, e.g., identifying diseases, can then be carried out on an isolated tooth. In this paper, we focus on segmenting panoramic dental X-ray images.
In the field of image segmentation in general, there are various kinds of approaches to segmenting images. For example, bounding box method is an approach to segment an object from an image. Typically, each object is represented by an axis-aligned bounding box that tightly encompasses the object. This approach could be represented as a classification problem and the task is to classify the image content in the bounding box to a specific object or background [4][5][6]. Objects as points [7] is an approach to simplify the bounding box method. The idea of objects as points is to represent objects by a single point at the objects' bounding box center, and other features such as object size, dimension, 3D extent, orientation and pose are represented as a regression function of the image features at the center point. Different to bounding box methods, image segmentation could also be presented as a classification problem. The pixels in the image are classified as a predefined object, and so, the task is to assign a class to each pixel of the image. Many deep learning methods treat image segmentation as a classification problem. Experimental results show deep learning methods achieve good performance. Typically, U-Net [8] is a popular deep learning method for image segmentation.
For medical image segmentation, U-Net is one of the most popular methods and has been extensively modified. For example, inspired by DenseNet [9] and based on U-Net, UNet++ proposed an encoder-decoder network in which the encoder and decoder sub-networks are connected through a series of nested, dense skip pathways [10]. Such skip pathways can reduce the semantic gap between the feature maps of the encoder and decoder sub-networks. R2Unet [11] integrates a recurrent neural network (RNN) and residual blocks with the U-Net architecture. 3D U2-Net [12] can deal with multiple kinds of images using 3D convolution kernels, which means it is not necessary to retrain a new model for different images. RAUNet [13] proposes a new architecture based on UNet which includes attention and residual blocks. VNet [14] is a variant designed for 3D data sets. However, all these models have high computational costs, large numbers of parameters and large memory requirements. Deploying them therefore requires cloud computing, which is not applicable in many scenarios: the algorithms can only be used with an Internet connection, and their memory requirements make it infeasible to deploy them on X-ray devices. In this paper, to tackle these problems, we investigate a few existing lightweight algorithms for dental image segmentation. In addition, we propose a new lightweight algorithm employing knowledge consistency for our segmentation task. In the next section, we review related work on dental image segmentation and on the knowledge distillation methods relevant to our method.

Dental image segmentation
Deep learning techniques have been shown experimentally to perform better than traditional methods for image segmentation tasks, including dental image segmentation. Various traditional methods have been applied to semantic segmentation of dental X-ray images. These algorithms include region growing, splitting and merging, global thresholding, fuzzy methods, level set methods, etc. For example, the work of [3] provided a comparative study of these algorithms applied to dental X-ray images. Some experiments have shown that deep learning methods perform much better than traditional methods, which was seen as an early demonstration of the advantages of deep learning approaches for panoramic X-ray image segmentation [3][15][16][17][18][19][20]. The work of [21] systematically reviewed recent deep learning-based segmentation methods for stomatological images. For example, mask R-CNN (MRCNN) was used for instance segmentation of dental X-ray images [22]; U-Net [8] is another popular method for image segmentation and was applied to semantic segmentation of dental X-ray images as well [16,23]. This research has shown that data augmentation techniques, including horizontal flipping, and ensemble techniques improved the performance of U-Net [23]. However, the performance of MRCNN and U-Net is affected by issues in the images such as low contrast at the tooth boundary and tooth root. This issue could be mitigated by using attention techniques [16].

Lightweight methods
Generally, the deep learning methods mentioned above perform well for dental image segmentation. However, they require large memory for storing the trained model parameters and many floating point operations (FLOPs), which results in long running times. Lightweight models are therefore needed if these methods are to be deployed on edge devices.
In semantic image segmentation, some lightweight methods have been proposed for the purpose of edge device application. For example, the efficient neural network (ENet) [24] was proposed for real-time semantic segmentation. ESPNetv2 [25] is a typical lightweight convolutional neural network which has shown superior performance on semantic segmentation and requires fewer FLOPs.
Other schemes such as compression methods [26] and knowledge distillation methods [27] could be applied to dental image segmentation to obtain a lightweight model. Knowledge distillation is a scheme that transfers knowledge from a cumbersome teacher model to a lightweight student model, teaching the student to mimic the teacher. Originally, the main strategy of knowledge distillation was to use a soft probability output to make a student model perform similarly to the teacher model. A more advanced approach is to use the intermediate feature maps to train a student network. For example, adapting intermediate feature maps helps to improve the performance of image segmentation [28]. Recently, some novel knowledge distillation methods have been proposed for semantic segmentation tasks. As an example, an efficient and lightweight method using knowledge distillation has been proposed for the segmentation of medical images such as CT images [29]. Other knowledge distillation methods such as structured knowledge distillation [30] and transformer-based knowledge distillation [31] could also be applied to semantic segmentation. Transformer-based knowledge distillation, which learns compact student transformers by distilling both the feature maps and the patch embeddings of large teacher transformers [31], would be an interesting approach for dental X-ray segmentation.
Little work has been done on dental image segmentation using knowledge distillation, although it has been used for chest X-ray images [32] and 3D optical microscope images [33]. In this paper, we investigate knowledge distillation methods for semantic dental X-ray image segmentation, which produce lightweight models for deployment on edge devices. In the following section, we describe how knowledge distillation can be used for dental image segmentation, and propose a knowledge distillation approach for this task. Specifically, unlike other distillation methods, we propose to use knowledge consistency neural networks for distilling knowledge from teacher to student models. Such a knowledge consistency neural network could potentially extract consistent features from the feature maps learnt by the teacher. In the experiments, we demonstrate that our method outperforms a number of other knowledge distillation methods for dental X-ray image segmentation.

Knowledge consistency neural networks for knowledge distillation
In this section, we propose a knowledge network which aims to extract consistent knowledge learnt by a teacher network. For simplicity, we call it the knowledge consistency neural network (KCNet). A simple attention scheme, such as the sum of absolute values of the learnt feature maps, has proved useful for extracting spatial activation features in computer vision tasks, including image segmentation [34]. These extracted activation features can be used to transfer knowledge from a huge teacher network into a lightweight student network. This attention strategy performed well in our semantic segmentation tasks, as reported in the experimental results section.
Various attention schemes could be used for transferring knowledge from a large teacher network to a student, for example the sum of the absolute values of the feature maps, or the sum of the absolute values raised to the power of p. Using this approach, the student network can be designed as a shallower lightweight network and still have spatial attention feature maps similar to those of the teacher network. The idea of this attention scheme is simply to match the sum of the absolute values of the feature maps between teacher and student networks. An advantage of this approach is that it does not introduce extra model parameters, compared with Fitnets [35], which requires learning extra regression parameters.
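The attention scheme described above can be illustrated with a minimal sketch in plain Python. It collapses a feature map of shape C × H × W into an H × W spatial map by summing the absolute values (raised to the power p) over channels, and then compares teacher and student maps. The function names, the L2 normalisation step and the squared-distance comparison are illustrative assumptions and are not taken from the paper.

```python
def attention_map(feature_map, p=1):
    """Collapse a C x H x W feature map (nested lists) into an H x W
    spatial attention map by summing |activation|**p over channels."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [sum(abs(feature_map[c][i][j]) ** p for c in range(channels))
         for j in range(width)]
        for i in range(height)
    ]


def l2_normalize(att):
    """Flatten an attention map and scale it to unit L2 norm.
    ASSUMPTION: this normalisation is not stated in the text."""
    flat = [v for row in att for v in row]
    norm = sum(v * v for v in flat) ** 0.5
    return [v / norm for v in flat] if norm > 0 else flat


def attention_distance(teacher_fm, student_fm, p=1):
    """Squared L2 distance between normalised teacher and student maps.
    ASSUMPTION: the choice of distance is illustrative."""
    t = l2_normalize(attention_map(teacher_fm, p))
    s = l2_normalize(attention_map(student_fm, p))
    return sum((a - b) ** 2 for a, b in zip(t, s))


# Teacher with 3 channels and student with 1 channel, both 2 x 2 spatially.
teacher = [[[1.0, -2.0], [0.5, 0.0]],
           [[-1.0, 1.0], [0.5, 3.0]],
           [[0.0, 1.0], [1.0, -1.0]]]
student = [[[2.0, -3.0], [1.0, 1.0]]]

print(attention_map(teacher, p=1))  # [[2.0, 4.0], [2.0, 4.0]]
print(attention_distance(teacher, student))
```

Because the channel dimension is summed out, the teacher and student maps can be compared even when the two networks have different numbers of channels, which is why no extra regression parameters are needed.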
In this paper, we investigate a nonlinear relationship between the spatial attention feature maps of teacher and student networks. This nonlinear relationship should be able to extract the consistent features from the feature maps. Note that the attention network in [34] could be viewed as a linear relationship. The literature has shown that even different learning tasks for image recognition can have consistent knowledge. For example, for an image classification task, we can train an ensemble of neural network models, each of which may perform very well and reach similar performance. It is believed that the feature maps learnt by these different models share consistent knowledge [36]. This means that the feature maps could be mapped by a nonlinear function to an intrinsic (or consistent) feature map which is shared across all the models. (In the following, we use intrinsic and consistent knowledge interchangeably.) One possible approach to learning this intrinsic knowledge is to use the knowledge consistency proposed in [36], which learns a nonlinear function to map the features from one pretrained model to another. Our target is instead to train a student neural network whose intermediate feature maps are consistent with those of the teacher network. One difference between our approach and the attention and Fitnets methods is that our method does not learn a similarity between teacher and student networks, but instead learns knowledge that is consistent between the networks, which is believed to be the intrinsic knowledge learnt by both models.
We now define our model for learning intrinsic/consistent knowledge. The model architecture is shown in Fig. 1. Denote an intermediate feature map from a teacher network and a student network by $x_T$ and $x_S$, respectively. A function $f_{W_{kc}}$ is applied to the teacher feature map to match the student feature map by minimizing a knowledge consistency loss, where $W_S$ and $W_T$ are the parameters in student and teacher networks, and $f_{W_{kc}}$ is a neural network with parameters $W_{kc}$ similar to the knowledge consistency function defined in [36]. Typically, we use the architecture described in Fig. 2 for $f_{W_{kc}}$. The hope is that the knowledge consistency function [36] is able to extract the consistent knowledge from teacher feature maps.
For the purpose of prediction, we used a knowledge distillation loss function (equivalent to the Kullback-Leibler divergence) to match the predicted masks between teacher and student networks. Assuming there are $N$ samples, the loss accumulates over the samples the term
$$P(W_T)\,\bigl(\log P(W_T) - \log P(W_S)\bigr),$$
where $P$ denotes a softmax output, i.e.,
$$P(W) = \frac{\exp(z_i(W)/s)}{\sum_j \exp(z_j(W)/s)}$$
with a temperature parameter $s$, and $z_i$ denotes the output of a network. To compute the distance of the label outputs of teacher and student networks, the cross-entropy loss is used, where $y^{*}$ represents the truth label and $y$ represents the softmax probability output. Our total loss takes into account all these loss functions, weighted by the parameters $a$, $b$ and $c$, where $J$ indicates the total number of intermediate layers in the student network to be penalized using knowledge consistency. We found that the parameters $a = 10000$, $b = 0.05$ and $c = 1 - b$ in the loss function worked well in our model.
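The three loss terms described above can be sketched in plain Python for a single sample. This is a hedged illustration rather than the authors' implementation: the squared-error form of the knowledge consistency term, the helper names, and in particular the assignment of the weights a, b and c to individual terms are assumptions, since the extracted text does not preserve the corresponding equations.

```python
import math


def softmax(logits, s=1.0):
    """Softmax with temperature s (numerically stabilised)."""
    scaled = [z / s for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, s=4.0):
    """Kullback-Leibler term P_T * (log P_T - log P_S) for one sample."""
    p_t = softmax(teacher_logits, s)
    p_s = softmax(student_logits, s)
    return sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_t, p_s))


def ce_loss(true_onehot, logits):
    """Standard cross-entropy between a one-hot truth label and the
    softmax probability output."""
    y = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(true_onehot, y))


def kc_loss(mapped_teacher_features, student_features):
    """ASSUMPTION: squared-error mismatch between the teacher features
    after the knowledge consistency function and the student features."""
    return sum((t - u) ** 2
               for t, u in zip(mapped_teacher_features, student_features))


def total_loss(kc_terms, kd, ce, a=10000.0, b=0.05):
    """Weighted combination with c = 1 - b.
    ASSUMPTION: a weights the sum of the J knowledge consistency terms,
    b the distillation term and c the cross-entropy term."""
    c = 1.0 - b
    return a * sum(kc_terms) + b * kd + c * ce


teacher_logits = [2.0, -1.0]
student_logits = [0.5, 0.0]
kc = [kc_loss([0.10, 0.20], [0.10, 0.25])]  # J = 1 intermediate layer
loss = total_loss(kc,
                  kd_loss(teacher_logits, student_logits),
                  ce_loss([1, 0], student_logits))
print(loss)
```

The distillation term vanishes when the student reproduces the teacher's softened distribution exactly, so it only penalises disagreement between the two networks.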

The data
The data set used to evaluate our algorithms contains 1500 panoramic dental X-ray images [3,37]. There are 10 categories in this data set but, following the suggestions of [37], categories 5 and 6 were not used in our experiments, as they include images with implants and deciduous teeth. In total, 1321 panoramic dental X-ray images were used in our experiments: 432 images for training, 111 images for validation and 778 images for testing. We preprocessed the images only by normalization and resizing. Typically, the images were resized from 1991 × 1127 to 256 × 256.
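The two preprocessing steps mentioned above, normalization and resizing, can be sketched as follows. The text does not state which normalization or interpolation was used, so the min-max scaling to [0, 1] and the nearest-neighbour resize below are assumptions chosen for simplicity, and the function names are hypothetical.

```python
def normalize(image):
    """Min-max scale pixel values to [0, 1].
    ASSUMPTION: the paper does not specify the normalization used."""
    lo = min(min(row) for row in image)
    hi = max(max(row) for row in image)
    scale = (hi - lo) or 1  # avoid division by zero for constant images
    return [[(v - lo) / scale for v in row] for row in image]


def resize_nearest(image, out_h=256, out_w=256):
    """Nearest-neighbour resize of a 2D image given as nested lists.
    ASSUMPTION: the interpolation method is illustrative only."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


# Small synthetic grey-level image standing in for a panoramic X-ray.
raw = [[(r * 20 + c) % 256 for c in range(20)] for r in range(12)]
prepared = resize_nearest(normalize(raw), out_h=256, out_w=256)
print(len(prepared), len(prepared[0]))  # 256 256
```

A real pipeline would typically use an image library with bilinear or area interpolation; the sketch only shows the order of operations and the target shape.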

Evaluation metrics
In our experiments, for comparison purposes, five error metrics were used to evaluate these algorithms, which are intersection over union (IoU), Hausdorff distance (HD), Dice coefficient (DC), volumetric overlap error (VOE) and relative volume difference (RVD). They are described as follows.
IoU is a metric to evaluate the accuracy of detecting objects in an image. It measures the overlap between the predicted box and the box containing the truth objects:
$$\mathrm{IoU} = \frac{A_{PB} \cap A_{TB}}{A_{PB} \cup A_{TB}},$$
where $A_{PB}$ represents the area of the predicted box and $A_{TB}$ represents the area of the truth box containing the targeted objects.
Hausdorff distance can be used to compute the distance between two sets in a metric space. It can be interpreted as the maximum, over the points of one set, of the shortest distance to the other set:
$$d_H(X, Y) = \max\Bigl\{ \sup_{x \in X} d(x, Y),\ \sup_{y \in Y} d(y, X) \Bigr\},$$
where sup denotes supremum and inf denotes infimum, $d(x, Y) = \inf_{y \in Y} d(x, y)$, and $d(y, X)$ is defined analogously. The Dice coefficient is a set similarity metric defined by
$$\mathrm{DC} = \frac{2\,|O_{pred} \cap O_{truth}|}{|O_{pred}| + |O_{truth}|},$$
where $O_{pred}$ denotes the segmentation output image of the model and $O_{truth}$ denotes the ground truth segmentation image. The range of DC is [0,1], where '1' indicates that the prediction is identical to ground truth. In DC, the numerator is the intersection of the prediction and the truth, and the denominator is the sum of the sizes of the prediction and the truth. The intersection is counted twice in the numerator (Figure 4.2), because it is contained in both terms of the denominator. Some experiments showed that when the intersection between prediction and ground truth was not multiplied by 2, the Dice value fluctuated and was hard to use as a stable metric. Similar to DC, the VOE denotes the error rate; its range is [0,1], where '0' indicates that there is no mistake.
RVD is a metric to show the difference in volume between ground truth and prediction. Its value belongs to [0,1], and '0' means the model produces a segmentation with the same volume as the ground truth. The RVD is defined as
$$\mathrm{RVD} = \frac{|O_{pred}| - |O_{truth}|}{|O_{truth}|}.$$
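As a concrete illustration of the metrics above, the following sketch computes IoU, Dice, RVD and the Hausdorff distance for two small binary masks. It treats each mask as a set of foreground pixel coordinates and uses the Euclidean distance between pixel positions; both choices, and the function names, are assumptions made for this example. VOE is omitted because its definition is not reproduced in the text. The Hausdorff helper assumes both masks contain at least one foreground pixel.

```python
import math


def pixels(mask):
    """Set of (row, col) coordinates of the foreground pixels."""
    return {(i, j) for i, row in enumerate(mask)
            for j, v in enumerate(row) if v}


def iou(pred, truth):
    p, t = pixels(pred), pixels(truth)
    union = p | t
    return len(p & t) / len(union) if union else 1.0


def dice(pred, truth):
    p, t = pixels(pred), pixels(truth)
    denom = len(p) + len(t)
    return 2 * len(p & t) / denom if denom else 1.0


def rvd(pred, truth):
    """Relative volume difference; the truth mask must be non-empty."""
    p, t = pixels(pred), pixels(truth)
    return (len(p) - len(t)) / len(t)


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between the two foreground sets."""
    p, t = pixels(pred), pixels(truth)

    def directed(a, b):
        # Largest distance from a point of a to its nearest point in b.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return max(directed(p, t), directed(t, p))


truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

print(iou(pred, truth))        # 0.6
print(dice(pred, truth))       # 0.75
print(rvd(pred, truth))        # 0.0
print(hausdorff(pred, truth))  # 1.0
```

The example also shows why RVD alone is not sufficient: the two masks have the same number of foreground pixels, so RVD is zero even though the masks differ.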

Settings for experiments
The following hyper-parameters were used for training all the models. We trained the models for up to 200 epochs. To prevent over-fitting, early stopping was used with a tolerance of 20 epochs. We observed that most models converged in around 50 epochs. The learning rate was set to 1e-3, and the batch size was set to 4. For the distillation algorithms [27], the temperature $s$ in the class probability
$$P(W) = \frac{\exp(z_i(W)/s)}{\sum_j \exp(z_j(W)/s)}$$
is used as a parameter to produce a soft probability distribution over classes. We set $s = 4$ in our experiments. Note that a higher value of $s$ gives a softer probability distribution over classes.
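The early stopping rule described above can be sketched as a small function over a sequence of per-epoch validation losses. Monitoring the validation loss is an assumption (the text does not say which quantity was monitored), and the function name and synthetic loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, max_epochs=200, patience=20):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs have passed
    without an improvement in validation loss, or after `max_epochs`.
    ASSUMPTION: the monitored quantity is the validation loss.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch


# Synthetic curve: the loss improves for 50 epochs and then plateaus.
curve = [1.0 / e for e in range(1, 51)] + [0.05] * 150
print(early_stopping_epoch(curve))  # (70, 50)
```

With a tolerance of 20 epochs, a model that converges at around epoch 50 stops at around epoch 70, well before the limit of 200 epochs.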

Results
In the experiments, deep learning models were applied to the same data set to compare their performance. These models include UNet [8], UNet++ [10], SegNet [38] and RAUNet [39]. Besides, lightweight models including ENet [24] and ESPNet-v2 [25] were applied to the data as well.
In our model, we used UNet++ as our teacher network and ESPNet-v2 as our student network. Their architectures are shown in Fig. 3 and Table 6, respectively. We experimentally compared our method to these models. In our model employing knowledge consistency, the number of layers, i.e., the number of blocks in the knowledge consistency network, is a hyper-parameter to tune. We varied the number of blocks from one to three and chose the number with the best performance. The results are given in Table 1, which shows that the model achieved the best performance with two blocks. Table 2 compares our algorithm to nine other methods and shows the sizes of the trained models and their FLOPs. Our model has the same model size and FLOPs as ESPNet-v2, and requires the smallest number of parameters. In terms of FLOPs, our model is comparable to ENet, which has the smallest FLOPs. All these methods are compared in terms of IoU, Dice, HD, VOE and RVD. Across IoU, Dice and HD, the UNet++ method performed the best among all the methods. Our method achieved the best performance in terms of VOE and RVD. Our model also outperforms ESPNet-v2 across all the evaluation metrics, which indicates that the knowledge consistency network may have served to feed the knowledge learned by the teacher network into the student network. We also compared our method to other knowledge distillation models, namely the knowledge distillation (KD) method proposed in [27], Fitnet [35] and the attention method [34]. The results are shown in Table 3. In terms of IoU and Dice, both the attention method and ours performed the best. The KD method was the best in terms of HD, and Fitnet performed best in VOE and RVD. Figure 5 shows some example segmentation results of the lightweight methods. Figure 6 shows some examples of improvements of our model over the student model.
The Dice and FLOPs are plotted in Fig. 4, showing the differences between these lightweight methods and other methods. We can see that ENet, ESPNet-v2 and our method need significantly fewer FLOPs than other methods, while they achieved comparable performance in terms of Dice (Figs. 5, 6). The impact of the harmonic factors a, b and c in the loss function on the performance was investigated. The Dice results obtained with various values of these harmonic factors are shown in Table 4. It indicates that the model would perform poorly if they were not carefully chosen, e.g., a = 1e4, b = 5 and c = -4. However, it is not possible to scan all possible values of the harmonic factors. Alternatively, it would be interesting to optimize these hyper-parameters instead.
To assess the sensitivity of our proposed algorithm to noise, the algorithm was applied to images with various levels of noise. Two kinds of noise were considered: impulse noise and Gaussian noise. For the impulse noise, Table 7 shows the Dice performance at five different noise rates. The Dice performance remained good when the noise rate was less than 0.3, but dropped dramatically to 0.457 when the noise rate was set to 0.4, i.e., when 40% of the data points were corrupted by noise. Similarly, our algorithm was tested on Gaussian noise. Table 8 shows the performance in terms of Dice when various levels of Gaussian noise were added to an image. As expected, when either the mean value or the variance was increased, the Dice value decreased. Figure 7 shows examples of the input image without noise and the images with the two different types of noise, as well as their segmented results. In addition, the algorithm was tested on flipped and cropped images. Figures 8 and 9 show that our algorithm worked well on such flipped and cropped images.
The EESP and Strided EESP blocks can be found in [25].
The convolution layer has filter size 3 × 3 and a stride of 2.
Fig. 6 An example of the segmented images using the teacher network UNet++, the student network ESPNet-v2 and our KCNet. The yellow circles indicate improvements over the student network
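The noise sensitivity test described in the results can be reproduced in outline with the following sketch, which corrupts an image with impulse noise at a given rate or with additive Gaussian noise of a given mean and variance. It assumes pixel values in [0, 1], replaces corrupted pixels with 0 or 1 at random (salt-and-pepper style), and clips the Gaussian result back to [0, 1]; these details and the function names are assumptions, as the text does not specify them.

```python
import random


def add_impulse_noise(image, rate, rng):
    """Replace each pixel, with probability `rate`, by 0.0 or 1.0.
    ASSUMPTION: salt-and-pepper style corruption on a [0, 1] image."""
    noisy = []
    for row in image:
        new_row = []
        for v in row:
            if rng.random() < rate:
                new_row.append(rng.choice((0.0, 1.0)))
            else:
                new_row.append(v)
        noisy.append(new_row)
    return noisy


def add_gaussian_noise(image, mean, variance, rng):
    """Add Gaussian noise with the given mean and variance, then clip
    to [0, 1]. ASSUMPTION: clipping is an illustrative choice."""
    std = variance ** 0.5
    return [[min(1.0, max(0.0, v + rng.gauss(mean, std))) for v in row]
            for row in image]


rng = random.Random(0)                       # fixed seed for repeatability
clean = [[0.5] * 100 for _ in range(100)]    # flat grey test image

impulse = add_impulse_noise(clean, rate=0.4, rng=rng)
corrupted = sum(v != 0.5 for row in impulse for v in row) / 10000
print(corrupted)  # close to the requested rate of 0.4

gaussian = add_gaussian_noise(clean, mean=0.0, variance=0.01, rng=rng)
```

Each noisy image would then be passed through the trained segmentation model and scored against the ground truth mask, for example with the Dice coefficient.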

Discussions
A few data sets of dental X-ray images are available in the literature and are reviewed in [3], but most of them are not accessible. The data set used in this paper has been studied for segmentation in the literature using various machine learning algorithms. For example, a number of traditional methods including region-based, threshold-based, cluster-based, boundary-based and watershed-based methods were applied to this data set for segmentation [3]. Besides, a deep learning method, Mask R-CNN, was also applied to the data and achieved 0.9208 in terms of classification accuracy. The results clearly indicate that the deep learning algorithm greatly outperforms the traditional methods. Some other deep learning algorithms were also tested on this data set. For example, the work of [43] proposed a multi-scale location perception network for dental X-ray image segmentation and achieved 0.9301 in terms of Dice; the work of [16] proposed a two-stage attention network scheme for segmentation which achieved 0.9272 in terms of Dice, and in the same paper, the authors also reported 0.8933 in terms of Dice using UNet. In our work, Table 2 shows that the Dice was 0.905 using UNet, which is similar to the results reported in other work. The variability of the Dice results for UNet may come from the data splitting, which can happen with any machine learning algorithm. Other work using UNet on a different dental X-ray image data set achieved 0.94 in terms of Dice. The results may not be directly comparable because they come from different data sets, but they at least indicate that the performance of UNet for dental X-ray image segmentation was stable across different data and research groups.
In addition, we also applied other algorithms to this data set. Our results show that UNet++ had the best performance on a number of evaluation metrics, which motivated us to use it as our teacher network. Reviewing the literature, we can see that Dice is a popular evaluation metric for dental X-ray image segmentation. We also applied other metrics for evaluating these algorithms. Besides, all the algorithms applied to the data in this paper were evaluated in terms of the number of model parameters and FLOPs, which are rarely reported in other work on dental X-ray image segmentation. As we have stated in the results section, these measures of model size and computational efficiency are also important for evaluating a model, specifically for the purpose of deployment. For example, although UNet++ achieved the best performance, its model size and FLOPs are high, which would prevent deployment on edge devices. Precisely, UNet++ requires 9.2 million parameters, which is approximately 210 megabytes of memory storage. Of course, an option for deploying these large models would be cloud computing. However, an obvious obstacle to using cloud computing is that the device, such as a panoramic dental X-ray machine, has to be connected to the Internet, which is not possible in many scenarios. Even if it is connected to the Internet, its operation would be disrupted if the network went down. A second obstacle is that the machine must have large memory for storing these trained models, which might not be a technical problem but would incur considerable financial cost. As we have stated, our purpose is to devise efficient models that reduce both the financial and computational costs while maintaining high accuracy comparable to that of large models. The results have shown that the size of our lightweight models (∼7.5 megabytes) and their computational cost are greatly reduced.
This indicates that these algorithms could be deployed on edge devices; for example, these models could be directly deployed on panoramic dental X-ray machines, and dental practitioners could immediately use the results on the machines. Nevertheless, we will further investigate how to improve the methods developed in this paper and develop other methods to achieve better performance. For instance, ensemble methods based on UNet have performed better than single methods [23]. A straightforward approach would be an ensemble method which combines the lightweight methods investigated in this paper. Such an ensemble could be expected to perform better while maintaining low memory and computational costs. In addition, other advanced methods such as [30,31,44,45] could be used to improve the performance of dental X-ray segmentation.
Furthermore, the proposed method could be applied to instance segmentation. For example, it would be interesting to investigate the performance of those lightweight algorithms for caries detection [46][47][48], which is more important and realistic than semantic segmentation. The proposed method could also be applied to segmenting mandible in panoramic X-ray [49]. This would be plausible for eventually deploying these models on edge devices for clinical use.
The proposed lightweight method could potentially be applied to other interesting directions including human face detection in risk situation [50], CT brain tumor detection [51] and IoT-based framework for disease prediction in pearl millet [52]. It would be interesting to compare our method to these methods in these directions.

Conclusions
In this paper, we have proposed a method for panoramic dental X-ray image segmentation. Compared to various other methods, our algorithm achieved the best performance in some metrics and comparable performance in others. Our method is lightweight and could be deployed on edge devices. It was also compared to three other lightweight methods and achieved the best performance in two out of five evaluation metrics. In our future research, we will interpret our model and investigate what has been learnt by the knowledge consistency networks.
Data availability The data sets analyzed during the current study were published in [3] and are available in the following repository: https://github.com/IvisionLab/dental-image.

Declarations
Competing interests The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.