Background

Because of abnormal jaw development in size, shape, or the positional relationship between the maxilla and mandible, patients with dentomaxillofacial deformities suffer from malocclusion, facial deformity, and related dysfunctions. Combined orthognathic and orthodontic treatment can rehabilitate occlusal function and harmonize the facial profile. Given the complexity and diversity of dentomaxillofacial deformities, accurate diagnosis and precise surgical planning are indispensable.

Cephalometric analysis is commonly used in diagnosis and surgical planning. Cephalometric analysis of CT images provides more information than a lateral cephalogram (X-ray image) because CT images can be reconstructed into a three-dimensional (3D) skull model [1]. However, the clinical application of 3D cephalometric analysis is limited because it is time-consuming and labour-intensive.

Deep learning (DL) offers new solutions to these challenges and has already demonstrated great potential in two-dimensional (2D) cephalometric analysis based on X-ray images [2,3,4,5]. In recent years, research on automatic 3D cephalometric analysis based on CT or cone-beam CT (CBCT) has attracted growing attention. Given the high graphics memory footprint of processing 3D images and the clinical need for high detection accuracy, dividing the original images into multiple sub-regions is a promising strategy. G. Dot et al. (2022) preserved the input image resolution by defining 5 regions of interest (ROIs) around coarsely predicted landmark locations; however, their model involved few tooth landmarks and no facial soft tissue landmarks, which are indispensable for cephalometric analysis [6]. Lang et al. (2022) proposed a three-stage coarse-to-fine framework and reduced the prediction error to 1.38 ± 0.95 mm, but at the cost of increased running time [7]. Employing lightweight networks such as 3D U-Net [8] or V-Net [9] is another way to reduce the graphics memory footprint. Liu et al. (2021) employed a 3D U-Net for landmark detection, but even the 3D U-Net could not process the original images: they had to downsample the CT images to 96 × 96 × 96 for training and refine the detections in a second stage [10].

Driven by clinical demands and the limitations of previous studies, this study aimed to develop a two-stage landmark detection model that accomplishes automatic and accurate 3D cephalometric analysis under a low graphics memory footprint, especially for patients with dentomaxillofacial deformities. To ensure the clinical practicability of the method, we included 77 landmarks on bone, teeth, and facial soft tissue in the detection task. To minimize the graphics memory footprint while optimizing prediction accuracy, a recently proposed lightweight network named 3D UX-Net [11] was employed as the backbone and a new region division pattern was designed.

Materials and methods

Data preparation

In this study, 80 sets of CT data from patients with dentomaxillofacial deformities were selected from Shanghai Ninth People’s Hospital (Shanghai, China). The research was approved by the Research Ethics Committee of the hospital (IRB No. SH9H-2022-TK12-1). The inclusion criteria were: (1) a diagnosis of dentomaxillofacial deformity requiring combined orthognathic-orthodontic treatment; (2) CT scanned before treatment. The exclusion criteria were: (1) congenital dentofacial deformities; (2) a history of orthognathic treatment. Each CT had a pixel size of 0.45 mm × 0.45 mm, a slice interval of 1 mm, and a resolution of 512 × 512 × 231. To reduce the graphics memory footprint in the computational process, the CT images were resampled to a voxel size of 1 mm × 1 mm × 1 mm, giving each CT a resolution of 229 × 229 × 231.
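For illustration, this resampling step can be sketched with SimpleITK, assuming the scans are stored as NIfTI volumes (the file name is hypothetical):

```python
import SimpleITK as sitk

def resample_to_isotropic(image: sitk.Image, spacing=(1.0, 1.0, 1.0)) -> sitk.Image:
    """Resample a CT volume to the target voxel spacing with linear interpolation."""
    orig_spacing = image.GetSpacing()
    orig_size = image.GetSize()
    new_size = [int(round(sz * sp / ns))
                for sz, sp, ns in zip(orig_size, orig_spacing, spacing)]
    return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                         image.GetOrigin(), spacing, image.GetDirection(),
                         0, image.GetPixelID())

ct = sitk.ReadImage("case_001.nii.gz")  # hypothetical file name
ct_iso = resample_to_isotropic(ct)      # 0.45 mm in-plane spacing -> 1 mm isotropic
```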

Based on clinical requirements and previous research on 3D cephalometric analysis [12,13,14], 77 landmarks, including 13 facial soft tissue landmarks, 28 skeletal landmarks and 36 dental landmarks, were included in the detection task. The names, definitions and locations of the selected CMF landmarks are shown in detail in Fig. 1 and Supplementary Table 1. All 77 landmarks were manually digitized on each CT image by 2 junior CMF surgeons and revised by a senior CMF surgeon using Mimics software (Materialise, Belgium). The final labelling results were then exported in XML format as the ground truth for model training.

Fig. 1

Names and locations of the 77 CMF landmarks. a. 13 facial soft tissue landmarks. b-f. 28 skeletal landmarks. g. 36 dental landmarks

Model architecture

A two-stage deep learning model was proposed (Fig. 2). In Stage 1, a region division neural network divides the original CT images into 9 sub-regions according to the region division pattern we designed. In Stage 2, landmark detection neural networks predict the location of each landmark within the sub-regions obtained in Stage 1.

Stage 1: region division neural network

A new region division pattern was designed based on the features of craniomaxillofacial structures. Compared with the ROI detection pattern in [6], it divides the skull into adjacent sub-regions so as to accommodate a larger number of scattered landmarks. To create the annotations for region division, a set of representative pre-annotated landmarks was used to partition the image, as shown in Supplementary Fig. 1. Using a classic segmentation network, V-Net, the skull was divided into 9 anatomical regions, as shown in Fig. 2 (Stage 1). Instead of directly compressing the image to reduce the graphics memory footprint, the region division pattern preserves the original image resolution; more image information is therefore available to the landmark detection network, contributing to more accurate detection results.
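The cropping step between the two stages is not spelled out in the text; the following is a minimal sketch, assuming the V-Net output has been converted into one boolean mask per region (the function name and margin are illustrative):

```python
import numpy as np

def crop_region(volume: np.ndarray, region_mask: np.ndarray, margin: int = 4):
    """Crop the bounding box of one predicted region from the CT volume.

    volume: (D, H, W) CT array; region_mask: boolean mask of the same shape
    for one of the 9 anatomical regions; margin: safety border in voxels.
    """
    idx = np.argwhere(region_mask)
    lo = np.maximum(idx.min(axis=0) - margin, 0)
    hi = np.minimum(idx.max(axis=0) + margin + 1, volume.shape)
    crop = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    return crop, lo  # keep the offset to map landmarks back to full-volume coordinates
```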

Stage 2: landmark detection neural network

The landmark detection network (Stage 2) employs 3D UX-Net as the backbone. The architecture of 3D UX-Net consists of a large-kernel projection layer, an encoder, and a decoder, with skip connections to avoid information loss and vanishing gradients. The input data are divided into small patches by the large-kernel projection layer, and patch-wise features are extracted as the input of the encoder. The encoder contains four 3D UX-Net blocks and four downsampling blocks. Large (7 × 7 × 7) and small (1 × 1 × 1) convolutional kernels in the 3D UX-Net blocks enlarge the global receptive field and supplement additional contextual information. Sixteen-fold image compression is implemented by the four downsampling blocks to reduce the computational cost while retaining sufficient semantic features. The decoder recovers the image resolution through res-blocks and long skip connections. After applying a Softmax function, landmark heatmaps are obtained, and the voxel with the highest probability in each heatmap is taken as the final predicted landmark. The architecture and components of the model, as well as the input and output dimensions of each layer, are illustrated in Supplementary Fig. 2; since these dimensions vary between regions, the frontal region (FR) is used as an example.
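The final peak-picking step can be written compactly in PyTorch; this sketch assumes one softmax heatmap channel per landmark:

```python
import torch

def heatmaps_to_coords(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (L, D, H, W) per-landmark probability volumes after Softmax.
    Returns (L, 3) integer voxel coordinates (z, y, x) of each channel's peak."""
    n_landmarks = heatmaps.shape[0]
    d, h, w = heatmaps.shape[1:]
    flat_idx = heatmaps.view(n_landmarks, -1).argmax(dim=1)
    z = flat_idx // (h * w)
    y = (flat_idx % (h * w)) // w
    x = flat_idx % w
    return torch.stack([z, y, x], dim=1)
```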

Fig. 2

Overview of the proposed two-stage model for detecting 77 landmarks from CT images. The first stage divides the original CT images into 9 regions using V-Net. The second stage detects landmarks using 3D UX-Net.

Implementation details

The training and validation sets included 58 and 22 samples, respectively. Abnormal data (reasons for the abnormal outcomes are analysed in the Discussion section) were eliminated using the median absolute deviation (MAD) algorithm. The model was validated on the validation set every 10 epochs, and the mean error over the validation set was taken as the result. The input and output sizes of the different regions are listed in Supplementary Table 2. The Stage 1 network was trained with the Adam optimizer, a learning rate of 0.003, Dice focal loss, a batch size of 2, and 300 epochs. The Stage 2 networks were trained with the Adam optimizer, a learning rate of 0.0001, focal loss, a batch size of 2, and 500 epochs.
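As a sketch of this configuration (not the authors' code), the losses and optimizers can be set up with MONAI; 3D UX-Net is not bundled with MONAI, so its import is left as a placeholder for the reference implementation [11], and the MAD threshold k is an assumption not stated in the paper:

```python
import numpy as np
import torch
from monai.losses import DiceFocalLoss, FocalLoss
from monai.networks.nets import VNet
# from ux_net import UXNET  # placeholder: 3D UX-Net reference implementation [11]

# Stage 1: region division (background + 9 anatomical regions), 300 epochs, batch size 2
stage1 = VNet(spatial_dims=3, in_channels=1, out_channels=10)
opt1 = torch.optim.Adam(stage1.parameters(), lr=3e-3)
loss1 = DiceFocalLoss(to_onehot_y=True, softmax=True)

# Stage 2: per-region landmark heatmaps, 500 epochs, batch size 2
# stage2 = UXNET(in_chans=1, out_chans=n_landmarks_in_region)
# opt2 = torch.optim.Adam(stage2.parameters(), lr=1e-4)
loss2 = FocalLoss()

def mad_filter(errors: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Drop abnormal validation errors using the median absolute deviation.
    k is an assumed threshold; 1.4826 rescales MAD to match the standard
    deviation under normality."""
    med = np.median(errors)
    mad = np.median(np.abs(errors - med))
    return errors[np.abs(errors - med) <= k * 1.4826 * mad]
```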

Performance evaluation

The evaluation metric we chose for prediction performance was the prediction error, defined as the Euclidean distance between the coordinates of the predicted landmarks and the ground truth and calculated using Eq. 1:

$$l_e = \frac{1}{n}\sum\limits_{i = 1}^{n} \sqrt{(x_{pre}^i - x_{gt}^i)^2 + (y_{pre}^i - y_{gt}^i)^2 + (z_{pre}^i - z_{gt}^i)^2}$$
(1)

where le represents the prediction error, (xpre, ypre, zpre) the predicted coordinates, (xgt, ygt, zgt) the ground truth, and n the number of validation samples.
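Eq. 1 amounts to a mean Euclidean distance, e.g. in NumPy:

```python
import numpy as np

def prediction_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (n, 3) landmark coordinates in mm over n validation samples.
    Returns l_e, the mean Euclidean distance of Eq. 1."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())
```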

To compare the landmark detection performance of 3D UX-Net and V-Net, training was also carried out with the Stage 2 backbone replaced by V-Net. Furthermore, an experiment was conducted to evaluate the impact of region division on the final predictions. To evaluate the effectiveness of the proposed model in clinical practice, two CMF surgeons were asked to manually digitize all 77 landmarks on 10 CT images from the training set, and one of them repeated the work on the same 10 CT images one week later. Inter-observer and intra-observer variations were calculated from these results.
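The statistical comparison can be reproduced with SciPy; a minimal sketch, with hypothetical per-landmark error values standing in for the measured ones:

```python
import numpy as np
from scipy import stats

# Hypothetical per-landmark mean errors (mm) for one region, for illustration only
model_err = np.array([1.8, 1.6, 2.1, 1.9, 1.7])
observer_var = np.array([1.2, 1.3, 1.1, 1.4, 1.3])

# Unpaired t-test between model error and inter-observer variation
t_stat, p_value = stats.ttest_ind(model_err, observer_var)
print(f"p = {p_value:.3f}")  # p > 0.05: no statistically significant difference
```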

Results

Prediction performance in four different settings

Setting 1: V-Net without region division

Without region division and at a resolution of 96 × 96 × 96, the model with V-Net as the backbone in Stage 2 had a mean error of 2.40 ± 1.08 mm (Table 1), and 22.08% of the 77 landmarks fell within 2 mm, 59.74% within 2.5 mm, 87.01% within 3 mm, and 96.10% within 4 mm (Table 2).

Setting 2: 3D UX-Net without region division

Without region division and at a resolution of 96 × 96 × 96, the model with 3D UX-Net as the backbone in Stage 2 had a mean error of 2.34 ± 1.01 mm (Table 1), and 35.06% of the 77 landmarks fell within 2 mm, 61.04% within 2.5 mm, 85.71% within 3 mm, and 96.10% within 4 mm (Table 2).

Setting 3: V-Net with region division

With region division and at a resolution of 229 × 229 × 231, the model with V-Net as the backbone in Stage 2 had a mean error of 1.90 ± 0.93 mm (Table 1), and 61.04% of the 77 landmarks fell within 2 mm, 89.61% within 2.5 mm, 96.10% within 3 mm, and 98.70% within 4 mm (Table 2). This setting had a graphics memory footprint similar to Setting 1.

Setting 4: 3D UX-Net with region division

With region division and at a resolution of 229 × 229 × 231, the model with 3D UX-Net as the backbone in Stage 2 had a mean error of 1.81 ± 0.89 mm (Table 1), and 76.62% of the 77 landmarks fell within 2 mm, 90.91% within 2.5 mm, 93.51% within 3 mm, and 98.70% within 4 mm (Table 2). This setting had a graphics memory footprint similar to Setting 2 and demonstrated the best landmark detection performance.

The training loss curves (Fig. 3), validation loss curves (Fig. 4) and validation error curves (Fig. 5) are shown below.

Table 1 Prediction errors in 4 different settings (unit: mm)
Table 2 Prediction accuracy within a given margin
Fig. 3

Training loss curve in 4 different settings

Fig. 4

Validation loss curve in 4 different settings

Fig. 5

Validation error curve in 4 different settings

Comparison with manually digitized landmarks

The error of landmark detection using 3D UX-Net with region division was compared with the inter- and intra-observer variation. The inter-observer and intra-observer variations were 1.27 ± 0.70 mm and 1.01 ± 0.74 mm, respectively. The intraclass correlation coefficient of the landmarks digitized by the two observers was greater than 0.99. Unpaired t-tests showed no statistically significant difference between the prediction error of the model and the inter-observer variation (p > 0.05), except for the teeth region (TR) (Table 3). The differences in each of the nine regions are shown as scatter plots in Fig. 6. In TR, the model error and the inter-observer variation were 1.73 ± 0.84 mm and 0.71 ± 0.53 mm, respectively, a statistically significant difference (p < 0.05).

Table 3 Comparison between prediction error of model and inter-observer variation (unit: mm)
Fig. 6

Differences in prediction error among the model, the inter-observer variation and the intra-observer variation. (The height of the points in the scatter plots represents the average error of landmark detection, and the dashed line represents the average error of all landmarks in the region)

Precision analysis for cephalometric indicators

To better explore the clinical applicability of the model, a precision analysis was conducted for 6 indicators (3 angles and 3 distances) commonly used in cephalometric analysis: SNA, SNB, ANB, and the three sides of the mGoR-uN-mGoL triangle. The values of these indicators were calculated from the coordinates of the related landmarks, obtained either from the model prediction or from the manual annotation (i.e. ground truth). The 22 samples in the validation set were used in this analysis. Paired t-tests showed no statistically significant difference between the predictions of the model and the ground truth for any of the six cephalometric indicators (Table 4).
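For instance, the angle indicators follow directly from three landmark coordinates (SNA is the angle at nasion between sella and A point, and ANB is conventionally SNA minus SNB); the coordinates below are hypothetical values purely for illustration:

```python
import numpy as np

def angle_deg(p1: np.ndarray, vertex: np.ndarray, p2: np.ndarray) -> float:
    """Angle (degrees) at `vertex` spanned by points p1 and p2."""
    v1, v2 = p1 - vertex, p2 - vertex
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical landmark coordinates (mm), e.g. from model prediction or ground truth
S = np.array([110.0, 95.0, 140.0])   # sella
N = np.array([110.0, 150.0, 145.0])  # nasion
A = np.array([110.0, 155.0, 110.0])  # A point
B = np.array([110.0, 152.0, 85.0])   # B point

sna = angle_deg(S, N, A)  # SNA: angle S-N-A at nasion
snb = angle_deg(S, N, B)  # SNB: angle S-N-B at nasion
anb = sna - snb           # ANB
# Distance indicators are plain norms, e.g. np.linalg.norm(mGoR - uN)
```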

Table 4 Error of indicators used in cephalometric analysis

Discussion

With region division and 3D UX-Net, the performance of the proposed model in landmark digitization was equivalent to that of experienced CMF surgeons, except in TR. In TR, landmarks were manually digitized on highly precise optical dental models registered to the CT images, whereas such optical dental models could not be processed by our model. Moreover, landmarks on the teeth are easier to recognize than those in other anatomical regions, which reduces the inter-observer variation. These factors led to the statistically significant difference between the model error and the inter-observer variation in the dental region. In addition, six important indicators of 3D cephalometric analysis were obtained and further demonstrated the clinical feasibility of the proposed model (Table 4).

Although desirable results were achieved for most samples, extremely abnormal errors still occurred in some cases, for the following reasons:

  • Inconsistency in patients’ eye status: abnormal detection of landmarks in the periocular soft tissue (sICaL, sICaR, sOCaL, sOCaR) could be caused by differences between the patient’s open and closed eyes (Fig. 7a).

  • Inconsistency between the head position in the CT data and the natural head position [15]: the head position in some CT images leans too far forward, which can interfere with the detection process of the model, especially after region division (Fig. 7b).

  • Severity of the deformity: severely abnormal structures, for example an impacted tooth, impeded the accuracy of landmark detection (Fig. 7c).

  • Craniomaxillofacial information missing from the CT data: in some cases, the inferior part of the chin was not captured, causing detection failures for landmarks such as sMe and mMe (Fig. 7d).

Fig. 7

Possible reasons for extremely abnormal results

Graphics memory footprint refers to the amount of graphics card random-access memory (RAM) that software uses when running. An excessive graphics memory footprint results in high training and deployment costs, hindering the development of a model and its wide application. The lightweight medical image processing network V-Net is often used as the backbone for landmark detection to reduce the graphics memory footprint. The recently proposed 3D UX-Net, which combines features of the Swin Transformer [16] with convolutional networks, has become increasingly popular for its lightweight architecture and efficient image processing. This inspired us to replace V-Net with 3D UX-Net as the backbone of the landmark detection network to increase detection accuracy. This study showed that 3D UX-Net outperformed V-Net in 3D landmark detection.

Due to the high resolution of CT images and limited graphics memory, even lightweight networks such as V-Net or 3D UX-Net are unable to process entire CT images directly. Images are therefore often compressed to lower resolutions such as 96 × 96 × 96 to reduce the graphics memory footprint, but image compression inevitably causes detection inaccuracy. To avoid significant resolution loss, a new region division pattern was designed to divide the entire CT image into 9 regions, allowing landmark detection to be performed within each region. In this way, resolution loss was limited while higher detection accuracy was achieved with a lower graphics memory footprint.

Time consumption is one of the most important factors hindering the implementation of 3D cephalometric analysis in clinical practice. Manual digitization often takes 15 to 25 min for experienced surgeons, and even longer for beginners. Our two-stage deep learning model effectively solves this problem under a lower graphics memory footprint, taking only 83 s to complete the landmark digitization task (on an NVIDIA RTX 2080Ti, 11 GB).

Compared with recent methodologies [6, 7], we proposed a new region division pattern adapted to a larger number of scattered landmarks, covering landmarks on all three tissues in CMF CT: 13 facial soft tissue landmarks, 28 skeletal landmarks and 36 dental landmarks. At the same time, this study has some limitations. Insufficient consideration was given to methods for reducing the occurrence of abnormal results or ensuring the safe use of the model in clinical practice. The abnormal outcomes have also inspired us to further optimize the whole process of automatic landmark detection, including aligning the head position [17,18,19], standardizing the CT imaging process, addressing abnormal data and evaluating the applicability of the model to different patients. Furthermore, in the field of landmark detection, incorporating dependencies between landmarks into model training has proved effective [7, 20]. In this study, training was carried out on 9 separate regions of the CT images, which inevitably cuts off some of the potential dependencies between landmarks. Restoring the balance between global and local constraints while maintaining the region division pattern will be investigated in future work.

In summary, the model demonstrated excellent performance in detecting craniomaxillofacial landmarks in CT images with a low graphics memory footprint and a short running time, and it satisfies the clinical accuracy requirements for 3D cephalometric indicators. With further refinement and validation on larger samples, the proposed method could be applied in clinical practice and contribute to the diagnosis and treatment planning of dentomaxillofacial deformities.