Abstract
Purpose
The aim of this investigation was to create an automated cephalometric X‑ray analysis using a specialized artificial intelligence (AI) algorithm. We compared the accuracy of this analysis to the current gold standard (analyses performed by human experts) to evaluate the precision and clinical applicability of such an approach in daily orthodontic routine.
Methods
For training of the network, 12 experienced examiners identified 18 landmarks on a total of 1792 cephalometric X‑rays. To evaluate the quality of the AI's predictions, both the AI and each examiner analyzed 12 commonly used orthodontic parameters on the basis of 50 cephalometric X‑rays that were not part of the AI's training data. The median values of the 12 examiners for each parameter were defined as the humans’ gold standard and compared to the AI’s predictions.
Results
There were almost no statistically significant differences between the humans’ gold standard and the AI’s predictions. Where differences between the two analyses existed, they did not appear to be clinically relevant.
Conclusions
We created an AI algorithm able to analyze unknown cephalometric X‑rays at almost the same quality level as experienced human examiners (the current gold standard). This study is one of the first to successfully implement AI in dentistry, in particular orthodontics, while satisfying medical requirements.
Zusammenfassung
Aim
The aim of the present investigation was to develop a fully automated lateral cephalometric analysis based on a specialized artificial neural network. The accuracy of this analysis was compared with the evaluations of human experts, the current gold standard, in order to assess the suitability of such a system for use in everyday clinical practice.
Patients and methods
For training of the network, 18 orthodontic landmarks were marked by 12 experienced examiners on each of a total of 1792 lateral cephalometric radiographs. To assess the quality of the trained network's evaluations, 12 common cephalometric measurements were performed on 50 further lateral cephalometric radiographs, which were not part of the training data, both by the artificial intelligence (AI) and by each of the examiners. For each parameter, the median value of the examiners was defined as the gold standard and compared with the AI's evaluations.
Results
Almost no statistically significant differences could be found between the evaluations of the AI and the humans' gold standard. Where differences between the two evaluation methods existed, they could be regarded as clinically irrelevant.
Conclusions
Within the scope of the present investigation, it was possible to train an artificial neural network to evaluate unknown lateral cephalometric radiographs at nearly the same quality level as an experienced clinician. This study is one of the first successful approaches to integrating artificial neural networks into dentistry, in particular orthodontics.
Introduction
Artificial intelligence (AI) has become prevalent in many areas of daily life. AI-based algorithms are embedded in everyday technology and are widely used, for example, in internet search engines, e‑mail spam filtering or online assistants with speech and even image recognition on social media platforms.
AI itself is a general term describing computers that mimic human intelligence. More precisely, “machine learning”, a subset of AI, is characterized by mathematical and statistical techniques that enable machines to improve their abilities through self-adapting algorithms. Based on sample data (= training data), machine learning algorithms generate mathematical models that generalize specific patterns to predict decisions without being explicitly programmed for this particular task [35]. The architecture of machine learning algorithms is inspired by the biological neural networks of the human brain [12, 13, 20, 30]. A variety of artificial neurons are connected to each other to form a net, which is organized in layers. Between the first (input) and the last (output) layer there is a certain number (at least one) of so-called hidden layers, which are responsible for the decision making of the AI. The required number of hidden layers (= depth) depends on the complexity of the task the AI is to be trained for. The subset of machine learning algorithms with multiple hidden layers is therefore called “deep learning” [20]. Such multilayered networks exist in different configurations. One type is the “convolutional neural network” (CNN), which has established itself especially in the analysis of image content [35].
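As a deliberately simplified numpy sketch (for illustration only, unrelated to the network used in this study), the snippet below shows how artificial neurons combine weighted inputs and pass them through an activation function, and how stacking such layers yields the layered structure described above:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: negative inputs become 0, positive values pass unchanged
    return np.maximum(0.0, x)

def layer(inputs, weights, biases):
    # Each artificial neuron forms a weighted sum of all its inputs plus a bias
    # and passes the result through the activation function
    return relu(inputs @ weights + biases)

rng = np.random.default_rng(0)
x = rng.random(4)                                                  # 4 input neurons
hidden = layer(x, rng.standard_normal((4, 8)), np.zeros(8))        # one hidden layer, 8 neurons
output = layer(hidden, rng.standard_normal((8, 2)), np.zeros(2))   # 2 output neurons
print(output)
```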
Thanks to recent advances in computing, such AI algorithms can now be used for abstract and complex tasks, which makes them promising tools in health care. AI algorithms are well suited to assist clinicians in analyzing medical imaging and diagnosing certain diseases, and thus to support therapeutic decisions. For example, enormous success has been achieved in the diagnosis of different types of cancer [32] as well as in the early detection of Alzheimer’s disease [8]. Many nationally and internationally active organizations such as the Fraunhofer Institute for Intelligent Analysis and Information Systems, the Max Planck Institute for Innovation and Competition, the German Research Foundation (DFG) or the German Federal Association for AI (KI-Bundesverband) have recognized the potential of AI for medical applications. Moreover, the European Commission as well as the Federal Ministry of Education and Research of the Federal Government of Germany (BMBF) support AI start-ups and fund various research projects with the aim of making use of AI in health care applications.
Despite advances and successful integration of AI in human medicine [9, 16, 35], application of AI has remained an exception in dentistry until now. The first promising attempts were made in automated caries detection on intraoral X‑rays [21]. Likewise, the analysis of cephalometric X‑rays appears to be a suitable diagnostic application of CNN algorithms. Introduced by Broadbent in 1931, lateral cephalometric X‑ray analysis of sagittal and vertical skeletal configurations [4] is still one of the major diagnostic procedures in orthodontics today and is routinely performed in orthodontic treatment planning. Cephalometric X‑ray analysis is based on the identification of radiological landmarks, which are subsequently used to measure various angles, distances and ratios for the interpretation of craniofacial structures [1]. While software is nowadays commonly used for cephalometric measurements, tracing of the landmarks remains a manual task that must be performed by an orthodontic expert [26]. The quality of this analysis depends mainly on the expert’s experience and even on his or her daily form. Moreover, a lack of interrater reliability can be observed [33]. As inaccurate identification of cephalometric landmarks may lead to incorrect decision-making in orthodontic therapy, a fully automated and reliable identification of the cephalometric landmarks is desired, especially for the purpose of guaranteed quality management [1]. This is where AI algorithms present new opportunities to support orthodontic experts in their daily routine. To the knowledge of the authors, there have been only two investigations in which CNNs were used for automated cephalometric X‑ray analyses [1, 26]. Although these preliminary trials were encouraging, they had major methodological limitations. Clear statements on the practical usability of such algorithms are still lacking.
Therefore, the aims of this investigation were (1) to create an automated cephalometric X‑ray analysis using a customized CNN and (2) to compare the accuracy of this analysis to the current gold standard (analyses performed by human experts) in order to evaluate the feasibility of such a system in daily orthodontic routine.
Patients and methods
The present investigation was carried out in compliance with the Declaration of Helsinki.
Data basis for the AI
Cephalometric X‑rays used for this investigation were obtained from a private orthodontic dental practice. All applicable data protection laws were respected. All images were recorded on a Sirona Orthophos XG (Dentsply Sirona, Bensheim, Germany). The resulting .tif images were fully anonymized and fed into a custom web-based assessment platform (CellmatiQ GmbH, Hamburg, Germany) into which individual users could log in.
At the University Hospital of Würzburg, Department of Orthodontics, 12 examiners (6 orthodontic specialists, 6 dentists in the second half of their post-graduate orthodontic education) identified and marked a total of 18 radiographic landmarks on each cephalometric X‑ray (Table 1). In a further step, these landmarks were used for measurement of angles, distances and a ratio commonly used for orthodontic treatment planning. Table 2 gives an overview of the 12 parameters we investigated.
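To illustrate this second step, the short sketch below shows how an angular parameter can be computed from two-dimensional landmark coordinates; the coordinates are hypothetical, and SNA (the angle at nasion between sella and A‑point) serves only as an example of such a measurement:

```python
import numpy as np

def angle_at(vertex, p1, p2):
    # Angle (in degrees) at `vertex` between the rays vertex->p1 and vertex->p2
    v1 = np.asarray(p1, float) - np.asarray(vertex, float)
    v2 = np.asarray(p2, float) - np.asarray(vertex, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical landmark coordinates (x, y) in image pixels
sella, nasion, a_point = (100.0, 80.0), (180.0, 85.0), (175.0, 150.0)
sna = angle_at(nasion, sella, a_point)   # SNA is measured at nasion
print(round(sna, 1))
```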
Analyses of the cephalometric X‑rays were performed in four different campaign-steps:
Quality calibration #1:
All examiners analyzed the same 20 randomly selected cephalometric X‑rays.
Content generation #1:
All examiners analyzed 30 different randomly selected cephalometric X‑rays.
Quality calibration #2:
All examiners analyzed the 20 X-rays from quality calibration #1 again in random order, plus an additional 30 cephalometric X‑rays that were the same for all examiners.
Content generation #2:
All examiners analyzed further randomly selected cephalometric X‑rays on a voluntary basis (there was no required minimum or maximum number).
The 20 cephalometric X‑ray images that had been analyzed twice by each examiner (quality calibration #1 and #2) were used to verify intrarater and interrater reliability to ensure a high quality level for the training data for the AI.
The intention of quality calibration #2 was to define a gold standard analysis for statistical comparisons with predictions of the AI. We set the median value of the 12 examiners as the “humans’ gold standard” for each parameter. As three out of the 50 images exhibited severe artefacts, they were excluded from statistical analysis. Of course, all analyses from both quality calibration campaigns were excluded from training of the AI.
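As a simple illustration of how such a per-parameter gold standard can be derived, the following sketch (with hypothetical array shapes and simulated values) takes the median across the 12 examiners for every remaining X‑ray and parameter:

```python
import numpy as np

# Hypothetical array of measurements[examiner, xray, parameter]:
# 12 examiners, 47 X-rays (after excluding 3 of the 50), 12 parameters
rng = np.random.default_rng(1)
measurements = rng.normal(loc=82.0, scale=3.0, size=(12, 47, 12))

# Humans' gold standard: the median over the examiner axis for each X-ray and parameter
gold_standard = np.median(measurements, axis=0)
print(gold_standard.shape)   # (47, 12)
```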
The analyses from both content generation campaigns served as the basis for training the AI. Altogether, the 12 examiners analyzed a total of 1792 different cephalometric images as training data.
Technical aspects
All technical procedures were conducted in collaboration with CellmatiQ GmbH.
Hardware specifications
Calculations relevant to AI were performed on a custom desktop personal computer with the specifications listed below:
Mainboard: ASUS X99-E-10G WS
CPU: Intel i7-6850K (six core) LGA2011-v3
GPU: 2 × ASUS GeForce GTX 1080 Founder’s Edition
RAM: 64 GB DDR4 2133 MHz (4 × 16 GB)
SSD: Samsung EVO (M.2) 1TB
OS: Ubuntu 16.04
Preparation of the input data
To enlarge the total number of training samples analyzed by human examiners, different augmentation procedures (e.g. rotation, tilting, parallel shifting, mirroring, noise adding as well as changes in brightness and contrast) were performed, increasing the original number of 1792 cephalometric X‑rays. Finally, all images were prepared for AI training. For this purpose, the cephalometric X‑ray images were converted to an AI-compatible format, resized to a consistent resolution of 256 × 256 pixels and reduced to a color depth of 8 bit (gray scale).
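The exact augmentation pipeline is not reported in detail; the following Pillow/numpy sketch only illustrates the kind of preparation described (8‑bit grayscale, 256 × 256 pixels, rotation, mirroring, brightness changes, noise). The file name is hypothetical, and in practice the annotated landmark coordinates would have to be transformed in exactly the same way:

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def prepare(path):
    # Convert to 8-bit grayscale and resize to the network's 256 x 256 input resolution
    return Image.open(path).convert("L").resize((256, 256))

def augment(img, rng):
    # A few of the augmentations mentioned in the text: rotation, mirroring,
    # brightness changes and additive noise (the annotated landmark coordinates
    # would have to be transformed identically - omitted here)
    out = img.rotate(rng.uniform(-5, 5))
    if rng.random() < 0.5:
        out = ImageOps.mirror(out)
    out = ImageEnhance.Brightness(out).enhance(rng.uniform(0.9, 1.1))
    arr = np.asarray(out, dtype=np.float32) + rng.normal(0.0, 2.0, (256, 256))
    return np.clip(arr, 0, 255).astype(np.uint8)

rng = np.random.default_rng(42)
image = prepare("ceph_0001.tif")                      # hypothetical file name
augmented = [augment(image, rng) for _ in range(5)]   # five augmented variants
```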
Design of the AI
We used a customized open-source CNN deep learning algorithm (Keras and Google TensorFlow) geared towards analyzing visual imagery. As previously described, CNNs consist of an input layer, multiple hidden layers and an output layer. For our investigation, the complete set of all cephalometric X‑rays after augmentation served as input data. The numeric gray-scale values of each pixel were the individual input neurons of the input layer. The output layer was defined as a pair of X- and Y‑coordinates for each cephalometric landmark.
Between input and output layers, there are typically different types of hidden layers in CNNs (Fig. 1). Central building blocks are convolutional layers, in which a set of learnable and adaptable filters (= convolutional kernels) with a small receptive field is moved over each pixel (= neuron), resulting in a mathematical convolution of the previous layers. The values within the filters are adjusted through the learning process [35]. Each kernel generates a new layer, resulting in a stack of neuronal layers. After each convolutional layer, an activation function amplifies the signal of the previous layers. For this purpose, we used “rectified linear units” (= ReLU), which set all negative inputs to “0” and pass all positive values without any transformation. Each convolutional layer with activation function is followed by a pooling layer (another type of hidden layer). The most common type of pooling layer is the “max pooling layer”. Here, neurons of the previous layer are separated into squares of 2 × 2 neurons and only the neuron with the strongest activation is kept for further processing. This protocol leads to significant subsampling in which 75% of the activations are discarded, with the intention of reducing the required computing power as well as suppressing the interpretation of image noise. To summarize, convolutional layers increase the total number of layers, whereas pooling layers reduce the size of each subsequent layer.
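A small numpy sketch of the two operations just described; note how the 2 × 2 max pooling step keeps only one of every four activations, i.e. discards 75% of them:

```python
import numpy as np

def relu(x):
    # Negative activations are set to 0, positive ones pass unchanged
    return np.maximum(0.0, x)

def max_pool_2x2(layer):
    # Split the layer into 2x2 squares and keep only the strongest activation of each
    h, w = layer.shape
    blocks = layer.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

activation = relu(np.array([[ 1., -2.,  3.,  0.],
                            [-1.,  5., -3.,  2.],
                            [ 4., -6.,  0.,  7.],
                            [ 2.,  1., -8., -4.]]))
print(max_pool_2x2(activation))   # 2x2 output: 75% of the activations are discarded
```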
a Illustration of the convolutional neural network (CNN) design used to analyze cephalometric X‑rays. b Schematic illustration of convolution and pooling processes in a CNN. In this example, the numbers of the input layer represent the gray-scale values of the cephalometric X‑rays. a) Zero padding: to perform the convolution function for peripheral pixels of an input image, a border of zero values one pixel wide is added around the input image. b) Convolution: a set of learnable and adaptable filters (= convolutional kernels) with a small receptive field is moved over each pixel (= neuron) of the previous layers, resulting in a mathematical convolution of the previous layers. Only one kernel is used in this example for illustration purposes. c) ReLU (rectified linear unit): activation function that amplifies the signal from the previous layers by setting all negative values to “0”. d) Max pooling: neurons of the previous layer are separated into squares of 2 × 2 neurons and only the one neuron out of four with the strongest activation is kept for further processing
Finally, after a sequence of several convolutional and max pooling layers, a fully connected layer finishes the artificial network. The neurons of the fully connected layer have connections to all activations of the previous layer. The number of neurons of this last layer depends on the desired output—in our CNN, every neuron of the fully connected layer codes for an “X”- or “Y”-coordinate of a cephalometric landmark (Fig. 1a).
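The exact layer configuration of the network used in this study is not reported in the text; the following Keras sketch therefore only illustrates the type of architecture described, i.e. stacked convolution/ReLU/max-pooling blocks on a 256 × 256 grayscale input, finished by a fully connected layer whose 36 neurons code the X- and Y‑coordinates of the 18 landmarks (the number and size of the layers are assumptions):

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_LANDMARKS = 18   # 18 landmarks -> 36 output neurons (one x and one y coordinate each)

model = keras.Sequential([
    keras.Input(shape=(256, 256, 1)),                         # 8-bit grayscale input image
    layers.Conv2D(16, 3, padding="same", activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D(2),                                    # 2 x 2 max pooling
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(2 * NUM_LANDMARKS),                           # fully connected output layer
])
model.summary()
```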
Training procedure of the AI
The entire training sample was divided into “training images” (96.6%) and “validation images” (3.4%; Fig. 2). Training images pass through the CNN algorithm, resulting in an AI output of coordinates of cephalometric landmarks. These outputs are compared to the coordinates of the cephalometric landmarks set by the human examiners. The mean absolute error between the human input and the AI output is then calculated (henceforth referred to as “error-calculation”). The CNN then readjusts the convolutional kernels of the hidden layers and tries to improve the new outputs. These “training epochs” are repeated several hundred times so that the AI outputs approximate the human inputs as closely as possible.
Validation images also pass through the CNN algorithm: likewise, the mean absolute error is calculated in each epoch, but in contrast to the training images, there is no further adjustment of the convolutional kernels of the CNN by error-calculation based feedback from the validation images. The purpose of this procedure is to compare the error-calculation of training images versus validation images: after a certain number of epochs, the error-calculation of the validation images will not improve any further or may even worsen, whereas the error-calculation of the training images will probably continue to improve. This may, for example, be due to “overfitting”, which describes a mere memorization of the training images by the CNN. An overfitted CNN will not achieve satisfying results for new, unseen images as it has lost its ability to generalize. Therefore, the training status of the CNN with the lowest error-calculation for the validation images is set as the final AI algorithm. At this point, the ability of the trained AI to precisely analyze completely new or unknown cephalometric X‑rays has to be clinically verified. Further improvement of the AI algorithm is only to be expected by adding new training samples and repeating the training procedure. Fig. 3 shows two examples of cephalometric X‑rays including the landmarks set by the AI as well as by the humans’ gold standard. These two examples were not part of the training data for the AI and therefore completely new to it.
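A minimal training sketch along these lines (mean absolute error between predicted and annotated coordinates, several hundred epochs, keeping the weights with the lowest validation error); the model, arrays and hyperparameters shown are placeholders, not the study's actual settings:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Minimal stand-in model (see the architecture sketch above) and placeholder arrays;
# in the study these would be the augmented training images and the human annotations
model = keras.Sequential([
    keras.Input(shape=(256, 256, 1)),
    layers.Conv2D(16, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(36),
])
x_train = np.zeros((100, 256, 256, 1), dtype="float32")   # "training images" (96.6%)
y_train = np.zeros((100, 36), dtype="float32")
x_val = np.zeros((4, 256, 256, 1), dtype="float32")       # "validation images" (3.4%)
y_val = np.zeros((4, 36), dtype="float32")

# "Error-calculation": mean absolute error between AI output and human input
model.compile(optimizer="adam", loss="mean_absolute_error")

# Keep the training state with the lowest validation error rather than the last epoch,
# so that an overfitted network is not chosen as the final algorithm
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_cephalometric_cnn.h5", monitor="val_loss", save_best_only=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=300,                     # "several hundred" training epochs
          batch_size=16,
          callbacks=[checkpoint])
```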
Analysis of two cephalometric X‑rays previously unknown to the artificial intelligence (AI). AI (yellow) and the humans’ gold standard (red). a Cephalometric X‑ray of a patient with horizontal growth pattern. b Cephalometric X‑ray of a patient with vertical growth pattern and severe double contours at the mandible
Statistical analysis
A professional biometrician of the Centre for Clinical Studies at the University Medical Centre of Regensburg supported statistical analysis of this investigation. All analyses were performed using SPSS Statistics Version 25.0 for Windows® (IBM, Ehningen, Germany).
The quality of the supervised training data was verified by analyzing two types of reliability using intraclass correlation coefficients (ICC) on the basis of the 20 cephalometric X‑rays that had been analyzed twice by each examiner: interrater reliability was analyzed for each parameter and intrarater reliability was verified for each examiner and each parameter.
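The text does not state which ICC form was used; as one common choice, the following sketch computes ICC(2,1) (two-way random effects, absolute agreement, single rater, following Shrout and Fleiss) for simulated ratings of 20 X‑rays by 12 examiners:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` has shape (n_subjects, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)      # per X-ray
    col_means = ratings.mean(axis=0)      # per examiner

    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)      # between-subjects mean square
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)      # between-raters mean square
    residual = ratings - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(residual ** 2) / ((n - 1) * (k - 1))         # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: 20 X-rays rated by 12 examiners for one parameter
rng = np.random.default_rng(7)
true_values = rng.normal(82.0, 3.0, size=(20, 1))
ratings = true_values + rng.normal(0.0, 0.5, size=(20, 12))   # small rater noise
print(round(icc_2_1(ratings), 3))
```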
We performed different statistical analyses to compare the predictions of the AI to the humans’ gold standard. First, t‑tests for paired samples were performed to rule out consistent bias (= the AI’s average prediction is systematically higher or lower than the humans’ gold standard, independent of a given parameter’s value). In a second step, we correlated the predictions of the AI and the humans’ gold standard using Pearson product–moment correlation. Furthermore, Bland–Altman plots were made for all investigated parameters to illustrate the differences between the predictions of the AI and the humans’ gold standard versus the average of the two measurements. In these plots, we visualized the mean differences between the two analyses as well as the 95% limits of agreement (mean difference ± 1.96 × standard deviation of the differences). Finally, to determine whether there was any proportional bias (= the AI’s predictions deviate in one direction for small values of a parameter and in the opposite direction for large values), we performed simple linear regression analyses for each parameter with the difference between the AI’s predictions and the humans’ gold standard as the dependent variable (criterion) and the mean of both analyses as the independent variable (predictor). The resulting regression lines were finally added to the corresponding Bland–Altman plots.
In a further step, we assessed whether the AI predictions were on the same level as those of the 12 human examiners. For every parameter, the absolute values of the differences between the AI predictions and the humans’ gold standard were compared to the mean absolute values of the differences between the human examiners and the humans’ gold standard using t‑tests for paired samples. In this way, we evaluated the clinical relevance of the differences between the AI and the humans’ gold standard.
The level of significance was set to 5% for all statistical analyses.
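For one parameter and simulated values, the following scipy sketch runs through the comparisons described above: a paired t‑test for consistent bias, the Pearson correlation, the Bland–Altman mean difference with 95% limits of agreement, and a simple linear regression of the differences on the pairwise means to check for proportional bias:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
gold = rng.normal(82.0, 3.5, size=47)        # humans' gold standard for one parameter
ai = gold + rng.normal(0.1, 0.6, size=47)    # hypothetical AI predictions

# Consistent bias: paired t-test between AI predictions and the gold standard
t, p_bias = stats.ttest_rel(ai, gold)

# Agreement: Pearson product-moment correlation
r, p_corr = stats.pearsonr(ai, gold)

# Bland-Altman: mean difference and 95% limits of agreement
diff = ai - gold
mean_pair = (ai + gold) / 2
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

# Proportional bias: regress the differences on the pairwise means
slope, intercept, r_reg, p_prop, se = stats.linregress(mean_pair, diff)

print(f"consistent bias p={p_bias:.3f}, r={r:.3f}, "
      f"LoA=({loa[0]:.2f}, {loa[1]:.2f}), proportional bias p={p_prop:.3f}")
```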
Results
Reliability of the training data
Interrater reliability was very high across all parameters analyzed in this study (all ICC > 0.900 with p < 0.001). Likewise, we found very high intrarater reliability for each examiner and each parameter (all ICC > 0.800 with p < 0.001).
Comparison of AI predictions to humans’ gold standard
Comparisons of the AI predictions to the humans’ gold standard are depicted in Table 3. There was a very high correlation between the predictions of the AI and the humans’ gold standard (all Pearson product–moment correlation coefficients r > 0.864 with p < 0.001). Absolute mean differences between the two analyses were less than 0.37° for angular parameters, less than 0.20 mm for all metric parameters and less than 0.25% for the proportional parameter facial height. The p‑values demonstrated no statistically significant differences between the predictions of the AI and the humans’ gold standard (all p‑values > 0.05), with the exception of SN-MeGo (p = 0.043). Only this parameter showed a consistent bias of 0.31°.
Bland–Altman plots of all 12 parameters are depicted in Fig. 4: Green lines illustrate the mean differences between the predictions of the AI and the humans’ gold standard. As these lines were extremely close to the zero line, there was no clinically relevant consistent bias. Red lines illustrate the 95% limits of agreement: these two lines enclose a very narrow area, indicating very high accuracy of the AI’s predictions (very low intraindividual bias). The parameters of incisor inclination exhibited a moderately wider area between the 95% limits of agreement.
a–c Bland–Altman plots of all orthodontic parameters. The differences between the predictions of the AI and the humans’ gold standard (Y‑axis) are plotted against the averages of the two measurements (X‑axis). The green lines illustrate the mean difference of both measurements, the red lines illustrate the 95% limits of agreement. The dashed blue lines (for some parameters almost perfectly congruent with the mean difference) represent the linear regression line with the difference as the dependent variable (criterion) and the mean as the independent variable (predictor). a Bland–Altman plots of the four skeletal sagittal parameters
Simple linear regression analyses for each parameter, with the difference between the predictions of the AI and the humans’ gold standard as the dependent variable (criterion) and the mean of both analyses as the independent variable (predictor), showed no statistically significant results (all p > 0.05), with the exception of L1-MeGo (p = 0.045). Therefore, only this parameter exhibited proportional bias. The resulting linear regression lines (dashed blue lines) were subsequently added to the corresponding Bland–Altman plots. For the majority of parameters, these regression lines were almost perfectly congruent with the green lines of the mean differences.
To evaluate the clinical relevance of the differences between the AI’s predictions and the humans’ gold standard, we compared the absolute values of the differences between the AI and the humans’ gold standard to the mean absolute values of the differences between the human examiners and the humans’ gold standard for each parameter (Table 4). With the exception of U1-SN (p = 0.035), no statistically significant differences were found.
Discussion
AI algorithms such as CNNs have been successfully trained for a wide range of different applications [1] such as speech recognition, natural language processing [20], image classification [6, 19, 25] and image segmentation [28, 34]. The performance of some of these algorithms even surpasses human capabilities [1]. The application of CNNs has recently also gained wide attention for a plethora of medical purposes, especially for the analysis of radiological images [9, 16, 35]. However, in contrast to many other applications, successful implementation of CNNs for medical purposes is still challenging and frequently restricted, as datasets for their training are limited [1].
There are different types of sample data for training machine learning algorithms. So-called “supervised learning” uses training examples provided by experts; this training data contains both inputs and the desired outputs [29]. On this basis, the AI calculates and optimizes mathematical objective models that try to predict outputs for new and unknown inputs [24]. This training method is particularly suitable for learning complex decision-making that is typically within the scope of human capabilities. In contrast, the sample data for “unsupervised learning” consists only of inputs without associated outputs; here, the AI itself tries to find any kind of structure or pattern in the sample data that differs from unstructured noise [15]. Unsupervised learning is typically used for cluster analyses or categorization tasks. Usually, supervised training data is required for automated medical analysis.
In fact, a major strength of the present investigation, which aimed to create an automated cephalometric X‑ray analysis on the basis of a custom CNN, was the quality of the supervised training data. We only used high-quality cephalometric X‑rays that had been generated on an approved X‑ray unit. These X‑rays were obtained from a private orthodontic dental office in accordance with all applicable data protection laws and were not gathered from the public domain as was done elsewhere [26]. The X‑rays were not preselected, so that a wide variety of different skeletal and dental anomalies was included, which is a key requirement for reliable learning of the AI [26]. Moreover, only experienced clinicians performed the tracing of cephalometric landmarks. Very high intrarater and interrater reliability of these examiners’ analyses was statistically verified. The acquired high-quality training sample was many times larger than those used in previous investigations [1, 26]. It is well known that the accuracy of deep learning predictions is highly dependent on the amount of training data [26]. Therefore, both the quality and the quantity of our training data are unique features at this stage when trying to set up an automated analysis of cephalometric X‑rays using a CNN algorithm.
During recent decades, various other techniques for automated detection of cephalometric landmarks have been studied, such as image and edge enhancement or detection techniques [7, 11, 22, 27, 31], template and gray-scale matching operators [5] or various machine learning techniques [10, 18, 23]. Despite the variety of techniques, it is still doubtful whether these approaches are able to detect cephalometric landmarks within a clinically acceptable range [1]. First attempts to use CNNs for this task showed greater success rates for the detection of cephalometric landmarks compared to top-ranking benchmarks of other techniques [1]. However, evidence of the practical usability of such trained AI algorithms in orthodontic routine is still lacking, as previous studies were restricted to landmark detection without assessing the accuracy of cephalometric parameters [1] or provided insufficient statistical evaluation [26]. Therefore, the aim of the present investigation was to precisely evaluate the potential of a trained CNN algorithm to analyze new and unknown cephalometric X‑rays at the level of orthodontic parameters rather than at the landmark level.
To prove the performance of the AI, we compared the analyses of our CNN algorithm to analyses performed by human raters, which are deemed to be the gold standard. Although there are clear definitions of cephalometric landmarks, human tracing is susceptible to errors [14, 17, 33]. To achieve a constant level of quality for our gold standard, a set of 50 cephalometric X‑rays was analyzed by 12 experienced examiners. The resulting median value for each parameter was defined as the humans’ gold standard against which our AI algorithm had to compete. In this way, outliers among the human analyses were ruled out. This particular set of 50 cephalometric X‑rays was not part of the previously performed training of the AI and was therefore completely new to it.
The comparison between the predictions of the AI and the humans’ gold standard showed that there was no consistent bias between the two analyses, with the exception of mandibular inclination. As the consistent bias for this parameter was only 0.31°, the clinical relevance of this discrepancy is questionable. Moreover, with the exception of the inclination of the lower incisor, we found no proportional bias. Additionally, we created Bland–Altman plots and evaluated the 95% limits of agreement for every parameter [2, 3]. For the majority of parameters examined in this investigation, these 95% limits of agreement bordered a very narrow area, indicating very high accuracy and minor intraindividual bias of the predictions of the AI. According to Bland and Altman, the decision on how small the limits of agreement should be to conclude that two different analyses agree sufficiently is a clinical and not a statistical decision [3]. The permissible clinical error size for each parameter is difficult to evaluate [26]. As the literature does not provide concrete data on this, we compared the absolute values of the differences between the AI predictions and the humans’ gold standard to the mean absolute values of the differences between the 12 human examiners and the humans’ gold standard. In fact, this can be considered an “unfair” competition for the AI, as exactly these human raters also set the humans’ gold standard. Nevertheless, for 11 out of 12 parameters, there was no statistically significant difference between the absolute differences of the AI from the humans’ gold standard and the mean absolute differences of the human examiners from the humans’ gold standard. Only the inclination of the upper incisors showed a statistically higher absolute difference between the AI’s predictions and the humans’ gold standard (2.18°) compared to the mean absolute difference between the human examiners and the humans’ gold standard (1.50°). The clinical relevance of this discrepancy is once more doubtful. Altogether, the accuracy of the AI predictions is comparable to the measurements of the 12 human examiners.
To the authors’ knowledge, no other study has evaluated the clinical precision of an automated cephalometric X‑ray analysis to a comparable extent.
Conclusion
In this work, we presented an automated cephalometric analysis based on a customized CNN deep learning algorithm and evaluated its precision. For supervised training, we used a huge set of high-quality training data that was provided by experienced orthodontic clinicians and that is, at this stage, unique in the literature for this purpose. Within the framework of this investigation, the precision of the AI algorithm’s predictions for 12 commonly used orthodontic parameters was evaluated from a clinical point of view. We analyzed skeletal sagittal and vertical as well as dental parameters, including assessment of the position and inclination of the jaws and of the incisors, skeletal class and growth pattern. As every orthodontic expert prefers different cephalometric parameters, the presented analysis can be expanded either by new geometrical calculations using the already existing landmarks or by retraining the AI algorithm with new ones. Due to the high quality and large quantity of the training data, we were able to generate an AI algorithm capable of analyzing new cephalometric X‑rays with a precision comparable to that of experienced human examiners, which is deemed to be the current gold standard. The trained AI algorithm analyzes a cephalometric X‑ray in a fraction of a second, even when used on a standard personal computer.
In summary, this study is one of the first investigations to successfully integrate AI into dentistry, in particular orthodontics.
References
Arik SO, Ibragimov B, Xing L (2017) Fully automated quantitative cephalometry using convolutional neural networks. J Med Imaging 4:14501
Bland JM, Altman DG (1999) Measuring agreement in method comparison studies. Stat Methods Med Res 8:135–160
Bland JM, Altman DG (2003) Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol 22:85–93
Broadbent B (1931) A new X‑ray technique and its application to orthodontia. Angle Orthod 1:45–66
Cardillo J, Sid-Ahmed MA (1994) An image processing system for locating craniofacial landmarks. IEEE Trans Med Imaging 13:275–289
Ciresan DC, Meier U, Masci J, Gambardella LM, Schmidhuber J (2011) Flexible, high performance convolutional neural networks for image classification. Paper presented at the Proceedings of the Twenty-Second international joint conference on Artificial Intelligence, Barcelona. vol 2
Desvignes M, Romaniuk B, Clouard R, Demoment R, Revenu M, Deshayes MJ (2000) First steps toward automatic location of landmarks on X‑ray images. In: Proceedings 15th International Conference on Pattern Recognition ICPR-2000, 3–7 Sept. 2000, pp 275–278
Ding Y, Sohn JH, Kawczynski MG, Trivedi H, Harnish R, Jenkins NW, Lituiev D, Copeland TP, Aboian MS, Aparici CM, Behr SC, Flavell RR, Huang S‑Y, Zalocusky KA, Nardo L, Seo Y, Hawkins RA, Pampaloni MH, Hadley D, Franc BL (2019) A deep learning model to predict a diagnosis of alzheimer disease by using 18F-FDG PET of the brain. Radiology 290:456–464
Dreyer KJ, Geis JR (2017) When machines think: radiology’s next frontier. Radiology 285:713–718
El-Feghi I, Sid-Ahmed MA, Ahmadi M (2004) Automatic localization of craniofacial landmarks for assisted cephalometry. Pattern Recognit 37:609–621
Forsyth DB, Davis DN (1996) Assessment of an automated cephalometric analysis system. Eur J Orthod 18:471–478
Fu Jie H, LeCun Y (2006) Large-scale learning with SVM and convolutional for generic object categorization. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR′06) 17–22 June 2006, pp 284–291
Fukushima K (1975) Cognitron: a self-organizing multilayered neural network. Biol Cybern 20:121–136
Gonçalves FA, Schiavon L, Pereira Neto JS, Nouer DF (2006) Comparison of cephalometric measurements from three radiological clinics. Braz Oral Res 20:162–166
Hinton G, Sejnowski TJ (1999) Unsupervised learning: foundations of neural computation. MIT Press, Cambridge, MA
Kahn CE Jr. (2017) From images to actions: opportunities for artificial intelligence in radiology. Radiology 285:719–720
Kamoen A, Dermaut L, Verbeeck R (2001) The clinical significance of error measurement in the interpretation of treatment results. Eur J Orthod 23:569–578
Kaur A, Singh C (2013) Automatic cephalometric landmark detection using Zernike moments and template matching vol 9
Krizhevsky A, Sutskever I, Hinton G (2012) ImageNet classification with deep convolutional neural networks. Paper presented at the Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe. vol 1
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
Lee JH, Kim DH, Jeong SN, Choi SH (2018) Detection and diagnosis of dental caries using a deep learning-based convolutional neural network algorithm. J Dent 77:106–111
Levy-Mandel AD, Venetsanopoulos AN, Tsotsos JK (1986) Knowledge-based landmarking of cephalograms. Comput Biomed Res 19:282–309
Liu JK, Chen YT, Cheng KS (2000) Accuracy of computerized automatic identification of cephalometric landmarks. Am J Orthod Dentofacial Orthop 118:535–540
Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. MIT Press, Cambridge, MA
Nebauer C (1998) Evaluation of convolutional neural networks for visual recognition. IEEE Trans Neural Netw 9:685–696
Nishimoto S, Sotsuka Y, Kawai K, Ishise H, Kakibuchi M (2019) Personal computer-based cephalometric landmark detection with deep learning, using cephalograms on the internet. J Craniofac Surg 30:91–95
Parthasarathy S, Nugent ST, Gregson PG, Fay DF (1989) Automatic landmarking of cephalograms. Comput Biomed Res 22:248–269
Ronneberger O, Fischer P, Brox T (2015) U‑net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention MICCAI 2015. Springer, Cham, pp 234–241
Russell S, Norvig P (2010) Artificial intelligence: a modern approach, 3rd edn. Prentice Hall, Upper Saddle River, NJ
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15:1929–1958
Tong W, Nugent ST, Jensen GM, Fay DF (1989) An algorithm for locating landmarks on dental X‑rays. IEEE, 552–554, https://doi.org/10.1109/IEMBS.1989.95869
Vial A, Stirling D, Field M, Ros M, Ritz C, Carolan M, Holloway L, Miller AA (2018) The role of deep learning and radiomic feature extraction in cancer-specific predictive modelling: a review. Transl Cancer Res 7:803–816
Wang CW, Huang CT, Hsieh MC, Li CH, Chang SW, Li WC, Vandaele R, Maree R, Jodogne S, Geurts P, Chen C, Zheng G, Chu C, Mirzaalian H, Hamarneh G, Vrtovec T, Ibragimov B (2015) Evaluation and comparison of anatomical landmark detection methods for cephalometric X‑Ray images: a grand challenge. IEEE Trans Med Imaging 34:1890–1900
Yang X, Wu N, Cheng G, Zhou Z, Yu DS, Beitler JJ, Curran WJ, Liu T (2014) Automated segmentation of the parotid gland based on atlas registration and machine learning: a longitudinal MRI study in head-and-neck radiation therapy. Int J Radiat Oncol Biol Phys 90:1225–1233
Yasaka K, Akai H, Kunimatsu A, Kiryu S, Abe O (2018) Deep learning with convolutional neural network in radiology. Jpn J Radiol 36:257–272
Ethics declarations
Conflict of interest
F. Kunz, A. Stellzig-Eisenhauer, F. Zeman and J. Boldt declare that they have no competing interests.
Additional information
This paper received the Arnold-Biber Research Award of the German Orthodontic Society for the year 2019.







