Background

Lateral cephalometry has consistently played an important role in orthodontic diagnosis and treatment protocol design for decades. It can not only clarify the dentofacial morphology and the anatomic basis for malocclusion but also evaluate the growth patterns in the craniofacial complex. Thus, it is significant for orthodontists to locate cephalometric landmarks precisely [1, 2].

Currently, cephalometric analysis is commonly carried out by manual tracing with computer-assisted analysis software [3,4,5]. Precise definitions of landmarks, calibration of operators and replication of tracings are all important in cephalometric analysis. However, repeat work of landmark localization may spare a great deal of time but do little to improve the accuracy of analysis [6]. The precision, reliability and time demands of cephalometric analysis depend to a great extent on the experience of orthodontists, who must spend a great deal of time training and accumulating experience before they can perform the analysis reliably.

In recent years, a large number of artificial intelligence (AI) studies have been conducted to explore automated landmark locations [7,8,9,10,11,12,13,14]. Compared with traditional cephalometric analysis, AI landmark identification shows superiority in repeatability and efficiency [11, 15]. Although there were several studies comparing the difference of cephalometric analysis between human and AI [16,17,18,19,20], the various automated cephalometric software available to practitioners still need to be evaluated with the development of AI application in orthodontics. Planmeca Romexis software is widely used for cephalometric analysis, and 2-dimensional automatic tracing is feasible and practical with this program. Rather than spending valuable time on training, supporting and monitoring the work of others, clinicians only need a few seconds to obtain the results of cephalometric analysis from it. Therefore, it is important to evaluate the accuracy of this program as a potentially time-saving tool. Our study was designed to compare the automatic localization program with one of the most frequently used computer-assisted cephalometric analysis programs, Dolphin software. Moreover, as far as we know that there have been only a few studies involving both cephalometric landmark and measurement analysis, the accuracy of landmarks and measurements in terms of bone, teeth and soft tissue were evaluated respectively in this study to explore more comprehensively the potential application of AI cephalometric analysis in clinical work.

Methods

The study was approved by the Ethical Committee Department of the Affiliated Hospital of Stomatology, Nanjing Medical University (PJ2022-030-01).

Data collection

In this observational study, reconstructed lateral cephalograms (RLCs) processed from cone-beam computed tomography (CBCT) scans were assessed. The retrospective CBCT images obtained from the patient database at the Affiliated Hospital of Stomatology, Nanjing Medical University were collected for the purposes of clinical diagnosis and analysis including airway volume analysis, temporomandibular joint or alveolar bone assessment and so on. A statistical power of 80% at a significance level (alpha) of 0.05 using a two-sided paired t test was assumed by the G*power software (version 3.1, Heinrich-Heine-University Dusseldorf, Germany). The calculation showed that 35 images were required with an effect size of 0.49 according to the previous study [17]. For this study, we searched the patient database and selected 85 patients who underwent CBCT from 2015 to 2020.

The sample inclusion criteria were as follows: (1) all CBCT images were taken in a supine position using the same machine and fully included all relevant anatomical structures, and (2) no central incisors were missing or defective. Additionally, the factors of age, gender and skeletal classification were considered during sample inclusion to increase the representativeness of the sample and therefore the applicability of the conclusion. The exclusion criteria were patients who had (1) impacted teeth in the anterior region, (2) prosthetic restoration of the central incisors, (3) previous orthodontic treatment or orthognathic surgery, or (4) cleft lip and palate syndromes.

All CBCT images were obtained with the same CBCT machine (NewTom 5G; Quantitative Radiology, Verona, Italy) with a standard acquisition protocol ( 18 × 16 cm field of view, 110 kV, 1–20 mA pulsed mode, and 0.3-mm voxel size).

RLC acquisition and processing

CBCT images were reconstructed into RLCs by using Dolphin software (version 11.9, Chatsworth, California, USA). The orthogonal projection type was chosen to create non-distorted images. All RLCs were generated with optimal opacity of hard tissue visualization and adjusted with the slider to sharpen the images. Then, the RLCs were entered into the computer-assisted manual analysis program (Dolphin software) and the automatic cephalometric analysis software (Planmeca Romexis software, version 6.2, Helsinki, Finland).

Cephalometric analysis

Totally 19 landmarks and 23 measurements (Fig. 1) were chosen on the basis of Steiner, Downs, Tweed cephalometric analysis and some other measurements commonly used in clinical practice. All RLCs were analysed twice by a well-trained orthodontic clinician. The procedure was supervised by two experienced orthodontists (BY and LL, with 17 and 10 years of experience) to ensure the accuracy of manual analysis. No more than 20 images were traced in a single day, and repeat analyses were separated by a two-week interval. The time required for manual tracing was also recorded. Then, the landmark coordinate information and cephalometric measurement values were obtained. To verify the inter-operator precision, another operator (also an orthodontic clinician having undergone the cephalometric training) performed the same process, and the cephalometric measurement values were compared with those of the first operator. The operators identified the landmarks on the same computer.

Fig. 1
figure 1

Cephalometric landmarks and measurements in terms of bone, teeth and soft tissue. The definitions of the landmarks and measurements are enumerated in Additional file 1 and Additional file 2

Statistical analysis

Analyses were conducted using IBM SPSS, version 23 (IBM Corp., Armonk, NY). Statistical significance was defined as a two-sided P value < 0.05. The Shapiro–Wilk test was used to test the normality of the distributions.

The error of landmark positions was identified by superimposing ruler points using the x and y coordinates. The radial error R was calculated between manually detected and AI-detected landmark coordinates:

$$R=\sqrt{{\varDelta x}^{2}+{\varDelta y}^{2}}$$

The mean radial error (MRE), standard deviation (SD) and successful detection rate (SDR) were calculated as follows (in the following equations, N denotes the sample size, and z refers to the precision ranges of 1, 2, 2.5, 3, and 4 mm.)

$$MRE=\frac{\sum _{i=1}^{N}{R}_{i}}{N}$$
$$SD=\sqrt{\sum _{i=1}^{N}{{(R}_{i}-MRE)}^{2}/(N-1)}$$
$$SDR = \frac{\begin{array}{l}{\rm{the}}\,{\rm{number}}\,{\rm{of}}\,{\rm{successfully}}\,{\rm{detected }}\\{\rm{landmarks}}\,{\rm{or}}\,{\rm{measurements}}\,{\rm{within}}\,\\{\rm{the}}\,{\rm{range}}\,{\rm{of}}\,{\rm{z}}\end{array}}{N}{\rm{*100\% }}$$

The MRE; SD; and SDR at 1, 2, 2.5, 3, and 4 mm for 19 landmarks were analysed to evaluate the accuracy of AI-driven localization performance. Moreover, the scatter plots with 95% confidence ellipses were depicted to visualize and evaluate the error pattern in 2-dimensional directions.

The intraclass correlation coefficient (ICC) was analysed to confirm the consistency of the results among multiple measurements. Paired t tests; proportion within 95% limits of agreement; and Bland‒Altman plots were used to assess the consistency of manual and AI cephalometric measurements.

Fig. 2
figure 2

The scatter plots with 95% confidence ellipses show the difference in the coordinates of each landmark between the manual and AI methods. a Skeletal landmarks, b dental landmarks, c soft tissue landmarks. The origin points (0,0) are located by the AI program for different landmarks. The Red and blue points represent the difference between two repetitions of manual digitization and AI detection, respectively

Results

In this study, we analysed 85 CBCT images, taking into account baseline variables including gender, age, and skeletal classification. The distributions of specific characteristics within the sample are described in Table 1.

Table 1 Characteristics of the sample

The ICC values showed mean intra-operator reproducibility of 0.97 (0.91–0.99) and 0.96 (0.90–0.99). The ICC of the mean measurement values between the two operators was 0.94 (0.86–0.99) on average. Both the intra-operator and inter-operator comparisons [see Additional file 3] indicated excellent reliability [21]. Given the high consistency between operators, the mean values of operator 1’s analysis were used in this study. The average time required for cephalometry was 157 s for manual tracing and 2 s for AI analysis.

Table 2 The bias between manual and automatic detection and the SDR of automatic detection

Accuracy in detecting landmarks

The differences in landmark tracing between manual and AI detection and the SDR of automatic detection are shown in Table 2. The detection accuracy of landmarks varied over a large range. The average variation of 19 landmarks was 2.07 ± 1.35 mm. In a comparison of results among the three landmark categories, soft tissue landmarks had the highest consistency (1.54 ± 0.85 mm), while dental landmarks had the lowest consistency (2.37 ± 1.55 mm). Among the 19 landmarks, the Prn point (1.01 ± 0.48 mm) had the highest consistency and the best performance in terms of SDR. Meanwhile, the Pg (1.31 ± 0.57 mm) and U1 tip (1.06 ± 0.49 mm) points almost had the lowest MRE and highest SDR among the skeletal and dental landmarks. The scatter plots (Fig. 2) suggest the distribution of landmark position differences.

Fig. 3
figure 3

Bland‒Altman plots of cephalometric measurements. a Skeletal measurements, b dental measurements, c soft tissue measurements. For each plot, the x-axis presents the average of manual and AI results, while the y-axis presents the difference between the paired measurements. The blue line represents the mean bias, and the red dashed lines represent the upper and lower 95% limits of agreement. The scale values on the x- and y-axis vary according to the mean and bias of each measurement comparison

Accuracy in acquiring measurements

Except for 4 cephalometric measurements, U1-NA (°), U1-AP (°), U1-AP (°) and UL-EP (mm), the other difference values of 19 measurements between manual and AI detection were statistically significant (P < 0.05) according to a paired t test (Table 3). Considering the clinically acceptable limit of 2 mm or 2°, 8 measurements [20], L1-FH (°) (-4.05 ± 3.44), L1-NB (°) (2.78 ± 3.10), U1-L1 (°) (-2.72 ± 4.67), MP-FH (°) (2.53 ± 1.93), SNA (°) (2.50 ± 2.94), SNB (°) (2.21 ± 2.76), L1-AP (°) (2.15 ± 3.33), and U1-SN (°) (2.14 ± 4.06), were discrepant from manual tracing. The Bland‒Altman plots (Fig. 3) represent the consistency of the results between the manual and AI methods. Measurement bias was within the 95% limits of agreement in more than 90% of instances (Table 3).

Table 3 Comparison of the cephalometric measurements between manual and AI analyses

Discussion

Cephalometric analysis is essential in every stage of orthodontic treatment. However, manual analysis has low consistency between repetitions and takes a great deal of time. With the rapid development of technology, computer vision allows machines to understand and work with images for cephalometric landmark detection after being trained with a given dataset. In contrast to conventional manual methods, AI can assess images within seconds, helping users save time and increasing the efficiency of repetitive work [11, 15]. According to our study, humans take an average of nearly 80 times as long as the AI program to complete the task. Using an automatic program saves a great deal of time and frees up orthodontists for other tasks. Additionally, the accuracy of cephalometric landmark digitization depends largely on the operator. The results of tracings by different operators or even repeated tracings by the same operator are likely to be significantly different [22]. In contrast to computer-assisted manual tracings, the results obtained from the same automatic program are fairly stable, showing its superiority in terms of repeatability [15]. Nevertheless, whether the accuracy of manual and AI cephalometric analyses is acceptable is always an important concern. To date, growing numbers of studies have focused on the accuracy of the automated cephalometric software, which is more accessible to the practicing orthodontist than in-house algorithms [16,17,18,19,20]. Therefore, our study is practical and meaningful for both the users of automatic programs and the engineers who seek to improve the algorithms.

The results regarding accuracy in landmark detection are consistent with published studies [9, 10]. The most accurately defined points, such as Prn, U1 tip and Pg, are on the edges of anatomical structures with sharp margins on CBCT images. Automatic programs are capable of identifying them with a high degree of precision similar to that of humans. The MREs of the U1 root (4.14 ± 1.33 mm), Go (3.08 ± 1.67 mm), L1 root (2.85 ± 0.93 mm), N (2.63 ± 2.31 mm), S (2.61 ± 1.88 mm), Me (2.41 ± 0.65 mm), D (2.27 ± 0.86 mm) and P points (2.24 ± 1.14 mm) exceeded the average value (2.07 ± 1.35 mm). The great deviation of the U1 root and L1 root points may be due to the indistinct outlines of the incisor roots [10, 23]. The recognition of incisor roots is obstructed by other dental roots when patients have anterior tooth crowding or overbite malocclusion, which is also a challenge for orthodontists in clinical work. The mixed density of incisor roots on imaging makes them difficult to identify by manual or automatic analysis. The reason for the significant errors in the Go, Me and P points is probably the asymmetrical structure of the lower mandible margin and external auditory canal, which varies over a wide range in the population. Midpoints of bilateral structures are chosen when the left and right sides cannot overlap, while asymmetrical structures increase the complexity of the region. Localizing these points relies heavily on the personal judgement of the operator. Obstruction by adjacent anatomical structures may also affect the localization of the Go point, which increases the difficulty of detection. As for the S, N and D points, the relevant structures are sometimes vague and quite ambiguous in RLCs, which makes them difficult for automatic programs to locate. The distance in landmark position between manual localization and automatic localization is interpreted as follows: <2 mm is correct, and < 4 mm is acceptable [24]. In this study, the average SDR as defined by MRE within 1 mm, 2 mm, 2.5 mm, 3 and 4 mm were 18.82%, 58.58%, 71.70%, 82.04% and 91.39%, respectively. Thus, more than half of the points were localized correctly, and more than 90% of the points were localized acceptably well.

The scatter plots (Fig. 2) indicate the distribution of landmark localization error by distance and direction. The 95% confidence ellipse shows a two-dimensional expansion of the Bland-Altman plot to observe the correlation between the x- and y-axis errors in the shape of the ellipse [15]. Points such as Or, UL and LL were distributed along the x-axis, which implied mainly horizontal bias in automatic detection. Similarly, points such as N, B, Gn and Pg’ illustrated mainly vertical bias. The anatomical positions of cephalometric landmarks are subject to bias in both the x and y coordinates. Points with considerable horizontal bias or vertical bias are usually ambiguous to locate in the corresponding direction to a certain extent. The results of AI detection are consistent with those of manual detection.

For cephalometric measurements, the rate of consistency within the 95% limits of agreement was between 91.76% and 98.82% (Table 3), which indicated high consistency in these measurements. Also, the Bland-Altman plots (Fig. 3) visualize the consistency between paired measurements. The results essentially corroborate those of a previous study [17]. The range between the limits of agreement is wider for dental measurements than for skeletal or soft tissue measurements. This conclusion is consistent with our conclusion regarding landmark identification. There were a total of 8 measurements that were both statistically significant (P<0.05) and had mean differences above 2 mm or 2°. Dental measurements have more discrepancy in comparison with manual tracing. Soft tissue measurements are basically equivalent in the two methods. Since the RLCs selected in this study were all from people who had not yet undergone orthodontic treatment, the precision of dental measurements might increase if the automatic program were applied to post-treatment images.

Generally, the consistency of AI cephalometric measurements is almost at an acceptable level, considering that it is nearly 80 times faster than manual cephalometric analysis. The main results that were of interest to us proved fairly good precision and consistency with manual analysis. Only a few measurements showed significant bias. Thus, human supervision is still necessary for automatic landmark identification. The solutions for increasing the accuracy and efficiency of automatic cephalometric analysis may be to improve the quality of the image like sharpening and algorithm approving and to implement an outlier detection and feedback system [23]. Orthodontists would need to spend only a short time checking the results of automatic localization, which could save them a great deal of time during clinical work.

Despite its strengths, this study also has several limitations. We used initial diagnosis RLCs to assess the automatic program. Diverse and complex intraoral conditions, such as overlapping teeth, overjet and overbite, are complex and make it difficult for AI to recognize relevant landmarks and obtain measurements. More detailed classification of image samples would increase the comprehensiveness of the conclusions. In addition, we selected fewer soft tissue landmarks than skeletal or dental landmarks, and the same was true of soft tissue measurements versus skeletal and dental measurements. The results would be more reliable if more soft tissue indicators were included.

Conclusion

Cephalometric measurements obtained from automatic analysis software are almost reliable enough to be acceptable in clinical work. Nevertheless, automatic cephalometry is not capable of completely replacing manual tracing. There still exist differences in localization, especially for some dental landmarks. Additional manual supervision and adjustment of automatic programs will help increase their accuracy, consistency and efficiency.