Introduction

Cephalometric analysis is a fundamental diagnostic procedure that utilizes radiological landmarks to measure various linear, angular, and proportional parameters on lateral and posteroanterior (PA) cephalograms1,2, which offer valuable information for evaluating craniofacial structures, such as growth assessment, orthodontic treatment planning, orthognathic surgery planning, and treatment outcome assessment3,4,5. Nevertheless, manual diagnostic procedure is a demanding and time-consuming task. Moreover, despite the essential role of landmark identification in cephalometric analysis, intra- and inter-observer variability and a lack of reliability continue to pose a challenge6,7. Since the accuracy of landmark identification determines the quality of diagnosis, inaccurate identification of cephalometric landmarks can result in misguided planning of orthodontic therapy and orthognathic surgery.

Therefore, there is a growing need to develop fully automated and reliable cephalometric landmark identification methods that use artificial intelligence (AI) algorithms. Recent advances in AI, particularly in deep learning, have garnered considerable attention for its use in diagnostic imaging, disease classification, and monitoring8,9. In orthodontics, deep learning technologies have been utilized for automated cephalometric landmark identification, among other applications3,10,11.

Since the 1990s, various studies have proposed fully automated cephalometric landmark identification systems that utilize machine-learning techniques12,13. However, their limited accuracy has hindered their success. Recently, deep learning algorithms, such as convolutional neural networks (CNNs), have been used increasingly to detect landmarks on lateral cephalograms automatically3,14,15. These studies have reported that deep learning algorithms exhibit high accuracy in detecting landmarks at shorter time, achieving precision levels within the range of 2.0 mm. Additionally, they have achieved successful detection rates (SDR), surpassing 70% and 90% for the respective thresholds of 2 and 3 mm4,16,19,20,21. However, most studies reporting the development and evaluation of automated cephalometric identification algorithms primarily focused on utilizing lateral cephalograms as their main target.

Although lateral cephalometric analysis is valuable in assessing anteroposterior and vertical issues, it has limitations in evaluating skeletal asymmetry and dentofacial structures in the transverse plane. The use of posteroanterior cephalogram holds significant value in the assessment of transverse skeletal and dentoalveolar relationships, enabling the quantification of bilateral structural issues22,23,24. It provides vital diagnostic information that is essential for evaluating patients who present with functional, dentoalveolar, and/or facial asymmetries.

Nevertheless, since the impact of such analyses greatly depends on their accuracy, there are limitations associated with the use of the PA cephalogram owing to errors in landmark identification. There are two primary categories of cephalometric errors: "projection errors," which result from the geometric aspects of the radiographic setup, and "identification errors," which occur owing to the uncertainty in locating specific anatomical landmarks25,26. The reliability and reproducibility of identified landmarks are influenced by various factors, including image density, image sharpness, anatomical complexity, and superimposition of anatomical structures27,28,29. Consequently, the identification errors on PA cephalograms were higher than those of lateral cephalometric analysis owing to their nature, which involves more superimposition of anatomical structures and variations in head positioning30. Furthermore, the experience and predisposition of the examiner play a significant role, asrdeeper understanding of anatomy and familiarity with radiographic images can help reduce intra- and inter-examiner errors22,31.

Considering the limitations and potential errors associated with manual landmark identification in PA cephalograms, there is a growing need for automatic identification methods. The development of automatic PA cephalogram identification systems can greatly enhance the accuracy and reliability of the analysis, minimizing the risk of errors caused by human factors. Moreover, the implementation of automatic PA cephalogram identification would not only expedite the analysis, but also contribute to enhanced diagnostic capabilities and more effective treatment planning for patients presenting with functional, dentoalveolar, and/or facial asymmetries. Therefore, this study aimed to develop a fully automatic PA cephalometric landmark identification model using a deep learning algorithm and compare its accuracy and reliability with the assessments made by expert human examiners.

Results

Table 1 presents the mean radial error (MRE) and intra-class correlation coefficient (ICC) values for each of the 19 landmarks detected by the human examiners that were defined as the gold standard points. The average MRE for all landmarks was 1.68 ± 1.85 mm. The ICC values for all landmarks were above 0.7, except for the y-coordinate of the upper dental midline (U1M), indicating high reliability between the two examiners. The lower ICC value for the y-coordinate of the U1M may be attributed to the detection tendency of the two human examiners, which is relatively less relevant to transverse evaluation.

Table 1 Mean radial error (MRE) and inter-examiner reliability of two human examiners.

Examples of the 19 landmarks were visually represented on a PA cephalometric image to facilitate comparison between the landmarks detected by the human examiners (gold standard) and those by the automatic identification algorithm (Fig. 1).

Figure 1
figure 1

Examples of superimposed gold standard (blue dot) and auto-identified (red dot) landmarks on PA cephalometric image (af). The best result was obtained with an average MRE of 1.06 mm (a), and the worst result with an average MRE of 4.02 mm (b).

Table 2 presents the MRE and standard deviation (SD) for the 19 landmarks identified by the human examiners (the gold standard) and the automatic identification algorithm. The landmarks with the lowest MRE were SphR (1.12 mm), SphL (1.10 mm), and MstR (1.10 mm), whereas those with the highest MRE were ConR (3.47 mm) and ConL (3.16 mm). The average MRE was 1.87 ± 1.53 mm.

Table 2 Mean radial error (MRE) between gold standard and auto-identification using the AI algorithm.

Table 3 shows the SDR of the landmarks within the error ranges of < 1.0, < 2.0, and < 4.0 mm. The average SDR was 34.7%, 67.5%, and 91.5% within the ranges of 1.0, 2.0, and 4.0 mm, respectively. The automatic identification algorithm showed a high accuracy (> 80%) within a range of 2.0 mm for SphR and SphL (89.0%), MstR and L6MCL (87.8%), and U6MCL and L6MCR (81.7%). In contrast, ConR (25.6%) and ConL (32.9%) had the lowest SDR values.

Table 3 Success detection rate (SDR) of auto-identification using the AI algorithm.

The horizontal and vertical error patterns between the two human examiners and the automatic identification of each landmark are illustrated in Fig. 2.

Figure 2
figure 2

Scatter plots with 95% confidence ellipses for the landmark detection errors of the human examiners and automatic identification.

Discussion

This study proposed a fully automated deep learning model for identifying PA cephalometric landmarks and compared its accuracy with that of two expert human examiners.

Numerous studies have introduced algorithms for the automated identification of lateral cephalograms as well as methods for evaluating their accuracy3,10,11. Conversely. A recent systematic review and meta-analysis reported AI agreement rates of 79% and 90% for the thresholds of 2 and 3 mm, respectively, with a mean divergence of 2.05 compared to manual landmarking19. Another study showed that most studies did not exceed a 2-mm prediction error threshold in mean and that the mean proportion of landmarks detected within this 2-mm threshold was 80%20.

However, only a limited number of studies have used deep learning algorithms to automate identification in PA cephalograms18. Although lateral cephalometric analysis is a basic tool for diagnosis in orthodontics, PA cephalometric analysis is essential for evaluating skeletal asymmetry and providing dentofacial structural information in the transverse plane22,23,24. Owing to its anatomical orientation, PA cephalometry inevitably causes greater overlap and superimposition of the skeletal structures and dentition than lateral cephalometry. These overlapped images reduce the accuracy and reliability of landmark identification, resulting in identification that highly depends on the examiner's experience and subjectivity22,31,32,33.

The present study aimed to develop a fully automatic PA cephalometric landmark identification system using a two-step landmark detection framework. In the first step, ResNet 18 was used to detect the region of interest by roughly extracting 19 landmark positions. Random augmentation and loss functions were used to improve performance. In the second step, ResNet 50 architecture was used to extract fine points from the cropped image. Then, CLAHE was applied to the cropped images for clear visualization of the bone, soft tissue, and the background regions.

The inference time and accuracy according to number of ResNet layers during the coarse training step are listed on Table 4. Deeper networks can extract more complex and abstract features. However, such complex feature extraction comes at the cost of increased computational burden, resulting in longer inference times. Similarly, as shown in the Table 4, as the layer depth increased, the computing time also increased. Although ResNet34 demonstrated approximately a 19% improvement in average accuracy for landmark detection compared to ResNet 18, its inference time was approximately 40% slower. Therefore, we opted to use ResNet18 for the coarse training stage.

Table 4 Inference time and accuracy according to the number of ResNet layers.

To evaluate the performance of the proposed model, manual identification of the 19 landmarks was performed independently by two human examiners on 1032 images used for model training and validation. Furthermore, the landmarks were identified manually by two expert human examiners and automatically by the developed AI algorithm on 82 test set images. The MRE and SDR between gold standard and auto detected point were calculated from the results of the test sets.

Although every cephalometric landmark has its own definition, its identification is subjective and dependent on the experience and judgment of the examiner22,31. Even highly educated experts may hold different opinions regarding the ideal location, indicating the absence of an 'absolute gold standard' reference point. To address this limitation, we employed a method where the arithmetical mean point for each landmark was calculated by averaging the X and Y coordinates detected by two human expert examiners. The inter-examiner reliability was previously established using an assessment of the ICC. Although this mean point may not precisely align with the specific definition of a particular landmark, it can be regarded the closest approximation to the gold standard point.

The SphR, SphL, and MstR had the lowest MRE, whereas the condyle points had the highest MRE. The average MRE value was 1.87 mm, which falls within the clinically acceptable range of previous studies that considered an MRE of up to 2.0 mm as acceptable3,10,11,34. In terms of the SDR, auto-identification demonstrated a high accuracy at the sphenoid points and the mastoid process across all error ranges. Conversely, the lowest SDR was observed at the condyle points.

The tendency of these results could be related to the degree of overlap of the anatomical structures. Auto-identification showed low MRE and high SDR values for the sphenoid points and mastoid processes, respectively, which have relatively little anatomical overlap. Conversely, auto-identification exhibited high MRE and low SDR values for the condyle points, which overlap with the maxilla and zygomatic arch, respectively, consistent with the inter-examiner evaluation. These results align with those of prior studies on auto-identification on PA cephalograms, which reported relatively significant errors in identifying the landmarks located within the frontozygomatic suture, zygomatic process, and condyles owing to an overlap of the anatomical structures17,22,35.

The precision of auto-identification of the landmarks located on the teeth, which was anticipated to have low accuracy owing to the overlap of adjacent teeth and the presence of orthodontic brackets and prosthesis, produced favorable results, with an MRE within 2.0 mm. This finding suggest that auto-identification can be utilized for precise evaluation of asymmetry using the dentition.

The scatterplots of the landmark detection errors showed a characteristic distribution within the co-ordinate system, as shown in Fig. 2. For instance, Menton (Me) distinctively oriented horizontally, whereas Crista galli(Cg) oriented more in a vertical direction, which reflects the tendency of detection errors of each landmarks and corresponding to the differences observed between the two examiners.

In conclusion, this study presented a novel approach for the fully automatic identification of PA cephalometric landmarks using a deep learning algorithm. The accuracy and reliability of the proposed model were evaluated by comparison with those of expert human examiners. Our results showed that the accuracy and reliability of the constructed AI model are comparable to those of human experts. These findings suggest that with advances in AI, automatic PA cephalometric landmark identification can significantly improve the efficiency and accuracy of cephalometric analysis while reducing the time and effort required.

Methods

This study was approved by the Ethics Review Board of Yonsei University Dental Hospital Institutional Review Board (approval number 2–2020-0005) and passed the exemption review of informed consent on the use of patients’ cephalometric data. The requirement for written or verbal informed consent was waived owing to the non-interventional retrospective study design, and all cephalometric images were anonymized to ensure confidentiality. This study was performed in accordance with the Declaration of Helsinki.

The inclusion criteria were as follows: (1) patients aged between 18 and 39 years, with permanent dentition and complete facial growth, and (2) those who underwent orthodontic therapy or orthognathic surgery between 2015 and 2021. The exclusion criteria were as follows: (1) partial or total edentulism, and (2) a history of dentofacial trauma, craniofacial syndromes, or systemic diseases. Thus, 1114 PA cephalometric images taken before treatment from the participants who met the inclusion criteria and were included in this study.

The PA cephalograms used in this study were acquired using a Rayscan machine (Ray Co. Ltd., Hwaseong, Korea) and collected from the picture archiving and communication system of the Yonsei University Dental Hospital as JPEG files. The images had a resolution of 1930 × 2238 pixels and a pixel spacing of 0.13 mm. Each pixel was represented by a single grayscale channel with values ranging from 0–255.

The 1114 PA cephalometric images included in the study were randomly divided into three sets: 803 images for training purposes, 229 for validation, and 82 for testing. The training and validation sets were used exclusively during the model training phase, whereas the test set was used solely to evaluate the reliability of the human examiners and the accuracy of the auto-identification model.

A total of 19 clinically important PA cephalometric landmarks used in routine dentofacial diagnosis were selected. Table 5 and Fig. 3 describe their definitions and positions. Two expert human examiners, an oral and maxillofacial surgeon with over 10 years of clinical experience in dentofacial deformity and an orthodontic specialist with 5 years of orthodontic training, independently and manually identified the landmarks on the 1032 images used for model training and validation to obtain the ground truth.

Table 5 Definition of landmarks.
Figure 3
figure 3

Landmarks on the PA cephalometric radiograph.

During model training, the large size of the original image facilitates the creation of more feature maps for learning; however, it is also associated with the disadvantages of GPU memory allocation limitation and long computing time. Therefore, in the first step, the image was resized to 964 × 1119 pixels, which was ¼th of the original size. It is important to retain the features of the widest possible area to extract the approximate coordinates of the 19 landmarks. Thus, the x- and y-coordinates of the 19 landmarks were extracted by locating the center of mass of each labeling point, which enabled the construction of a coordinate landmark detection model.

The landmark detection framework operates through a two-step process, as shown in Fig. 4. The 19 landmark positions were coarsely extracted, and the image was cropped to a certain size based on these rough positions. Subsequently, the fine points were extracted from the cropped images. The adoption of this two-step framework facilitated efficient learning and high accuracy.

Figure 4
figure 4

Schematic design of the two-step landmark detection framework. (Conv: convolution, FC: fully connected, ROI: region of interest, CLAHE: Contrast Limited Adaptive Histogram Equalization).

ResNet 18 was used as the initial step to preserve the original features and expedite the learning process while minimizing the computational complexity. ResNet 18 is a model that can solve the gradient vanishing problem as the layer deepens through residual learning using skip connection, and it is widely used in facial landmark detection tasks. ResNet 18 consists of 17 convolution layers and a fully connected layer at the end. However, the first convolution layer was limited to a 7 × 7 kernel and max pooling to minimize the input size, whereas all subsequent layers were implemented using a 3 × 3 kernel convolution layer. The final fully connected layer comprised 38 output features, thereby enabling the derivation of the x- and y-coordinates of the 19 landmarks. Residual shortcut connections were introduced between the two convolution layers to optimize the learning process, as shown in Fig. 4, where the solid line represents the input and output having the same dimension, and the dotted line represents an increase in the dimension with zero padding and a stride of 2.

An augmentation strategy randomly selected from a list of methods, including rotation, scale, flip, and contrast, was applied to account for patients with tilted heads and asymmetric X-rays, as shown in Fig. 5. Wing loss was utilized as the loss function in the first step, which helped reduce an excessive focus on outliers to find the approximate landmark positions36,37. Wing loss is more resistant to the impact of outliers than the mean squared error (MSE) loss function.

Figure 5
figure 5

Description of the augmentation policy (CW: clockwise, CCW: counterclockwise).

In the second step, the original image was cropped to a size of 400 × 400 pixels and centered around the 19 landmark positions obtained in the first step. Subsequently, contrast-limited adaptive histogram equalization (CLAHE) was applied to the cropped images. CLAHE is a histogram-flattening method that enhances the contrast of the radiographs, thereby enabling clear visualization of the bone, soft tissue, and background regions38. The application of CLAHE is known to enhance image quality and has gained widespread usage in deep learning model studies that utilize medical images39,40. ResNet 50 architecture was used in this step. It is similar to ResNet 18 but with deeper networks, and it comprises 49 convolution layers and a fully connected layer at the end. The final fully connected layer was designed with two output features to derive the x- and y-coordinates of one landmark. To optimize learning, residual shortcut connections were applied to three convolution layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1. The 1 × 1 convolution layers were responsible for dimensionality reduction and restoration, whereas the 3 × 3 layer functioned as a bottleneck with smaller input/output dimensions. MSE loss was utilized as the loss function for the second step.

The two-step models were initialized with a learning rate of 0.01 during model training, which was then decayed by a factor of 0.5 every 30 epochs. An Adam optimizer with a batch size of 64 was used for 400 epochs. All procedures were conducted using the PyTorch framework running on an NVIDIA Quadro RTX8000 GPU.

Two expert examiners specializing in oral and maxillofacial surgery and orthodontics manually identified 19 cephalometric landmarks on 82 images that constituted the test set. The MRE and SD were calculated to evaluate the inter-examiner reliability, and the ICC was computed to assess the degree of reliability between the two human experts. The mean values of the x- and y-coordinates determined by the two examiners were used as the gold standard for subsequent analysis.

Automatic detection of 82 test set images was completed using the constructed AI algorithm, and the MRE and SDR with error ranges of < 1.0, < 2.0, and < 4.0 mm for all landmarks were calculated to evaluate the performance of the proposed model. All calculations were performed in Microsoft Excel using the following formulae:

$${\text{Radial}}\,{\text{error}}\,({\text{R}}) = \sqrt {\Delta x^{2} + \Delta y^{2} } \quad ({\text{mm}})$$
$${\text{MRE}} = \frac{{\mathop \sum \nolimits_{i = 1}^{N} R_{i} }}{N} \quad ({\text{mm}})$$
$${\text{SDR }} = \frac{{{\text{Number}}\,{\text{of}}\,{\text{accurate}}\,{\text{identifications}}}}{{{\text{Number}}\,{\text{of}}\,{\text{total}}\,{\text{identifications}}}} \times100\%$$