Introduction

Visual examination is the method of choice for screening, monitoring, detecting, and diagnosing dental pathologies, and the corresponding diagnostic indices and methodological procedures are well described in the literature [e.g., 1, 2]. However, the transfer of this knowledge from scientists to dental practitioners can be incomplete, which is especially true for detecting and diagnosing individuals or teeth with molar-incisor hypomineralization (MIH). Families have reported diagnostic uncertainty among dental professionals, which potentially results in conflicting positions, diverging recommendations, and additional dental consultations [3, 4]. It might therefore be beneficial to develop diagnostic methods that verify suspected dental hard tissue findings independently of the investigating dentist. The documented MIH prevalence rates underline this need: the mean global MIH prevalence was recently estimated at 13.1% by Schwendicke et al. [5], and in Germany, 28.7% of all 12-year-olds were found to have hypomineralizations [6, 7]. Both numbers indicate that a relevant proportion of adolescents is affected by this developmental disorder, making the diagnosis and management of MIH a frequent challenge in daily dental practice.

Establishing such independent diagnostic methods is becoming feasible with the availability of smart image analysis methods. Artificial intelligence (AI) currently offers the potential for the automated detection and evaluation of diagnostic information in medicine and dentistry [6,7,8,9]. The digitalization of medical and dental workflows is an emerging topic, and interest in this area has recently increased in dental research as well. Several workgroups have begun to analyze all available types of dental radiographs [10,11,12,13,14] using deep learning with convolutional neural networks (CNNs) for the detection of caries [15], apical pathologies [16], or periodontitis [17]. In contrast, only a few projects using AI-based algorithms for the automated identification of pathologies on intraoral clinical photographs have been reported [18,19,20,21,22,23,24,25,26]. Considering recently published reports and the latest software developments, to the best of our knowledge, no application for the automated detection of MIH on intraoral photographs has been developed or evaluated thus far. Therefore, this diagnostic study aimed to train a CNN for MIH detection (test method) and to compare its final version against expert evaluation (reference standard). The aim was to reach a diagnostic accuracy of at least 90% for the test method.

Materials and methods

Study design

This diagnostic study used anonymized intraoral clinical photographs (Fig. 1) from clinical situations in which photographs were captured for educational purposes, as well as from previously conducted clinical trials. The Ethics Committee of the Medical Faculty of the Ludwig-Maximilians University of Munich reviewed and approved the study concept (project number 020–798). This investigation was reported in accordance with the recommendations of the Standard for Reporting of Diagnostic Accuracy Studies (STARD) steering committee [27] and recently published recommendations for the reporting of AI studies in dentistry [28]. The methodological pipeline outlined below was applied and described in previously published reports [19, 20].

Fig. 1

Overview of the chosen diagnostic categories based on the criteria provided by the European Academy of Paediatric Dentistry [3] and frequent intervention modalities

Intraoral photographs

Dental photographs were consistently taken with professional single-lens reflex cameras equipped with a 105-mm macro lens and a macro flash after tooth cleaning and drying [19, 20]. All images were stored (JPEG format, RGB colors, aspect ratio of 1:1) and screened for this study project. To ensure high data quality, duplicate or inadequate photographs, such as out-of-focus, under- or overexposed images and photographs with saliva contamination, were excluded. Clinical photographs showing additional carious cavities or any other developmental disorder, e.g., amelogenesis imperfecta, dentinogenesis imperfecta, or hypoplasia, were also omitted. Caries-related restorations were excluded as well to rule out potential evaluation bias. Finally, 3241 anonymized, high-quality clinical photographs of anterior and posterior permanent teeth with MIH (test group) and without any pathology/restoration (control group) were included in the study.

Classification of teeth with MIH (reference standard)

Each photograph was classified with the goal of detecting and categorizing teeth with MIH according to the diagnostic classification system of the European Academy of Paediatric Dentistry [3] and possible dental interventions, such as restorations or fissure sealants. In detail, the well-established MIH categories of demarcated opacities and enamel breakdown are prevalent and can appear clinically in combination, either without any dental restoration, with an MIH-related, so-called atypical, restoration, or with a sealant (Fig. 1). Each image was precategorized by three qualified dentists (JS, PE, and AS) according to the given cross classification; afterwards, all images were independently counterchecked by an experienced examiner (JK, > 20 years of clinical practice and scientific experience). In the case of divergent findings, the intraoral photograph was re-evaluated and discussed until consensus was reached. Each diagnostic decision—one per image—served as the reference standard for cyclic training and repeated evaluation of the deep learning-based CNN.

All annotators were trained and calibrated before the study during a 2-day theoretical and practical workshop guided by the principal investigator (JK). Afterwards, 140 photographs were evaluated by all participating dentists (JS, PE, and AS) to determine the intra- and interexaminer reproducibility of the MIH classifications. Kappa values were computed for all coder pairs using Excel (Excel 2016, Microsoft, Redmond, WA, USA) and SPSS (SPSS Statistics 27, 2020, IBM Corp., Armonk, NY, USA). Intra-/interexaminer reproducibility was calculated as 0.964/0.840–0.712 (JS), 0.982/0.747–0.727 (PE), 1.000/0.774–0.693 (AS), and 0.836/0.749–0.693 (JK), respectively. The documented kappa values indicated substantial to perfect agreement [29].
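For illustration, the same agreement statistic can also be computed in Python. The following minimal sketch uses scikit-learn's cohen_kappa_score, whereas the study itself relied on Excel and SPSS; the label arrays shown are hypothetical.

```python
# Minimal sketch of an inter-examiner agreement check via Cohen's kappa.
# The study computed its kappa values in Excel and SPSS; scikit-learn
# offers the same statistic. The label arrays are hypothetical category
# codes for a handful of calibration images.
from sklearn.metrics import cohen_kappa_score

annotator_a = [0, 1, 1, 2, 0, 3, 3, 1]  # hypothetical codes, examiner 1
annotator_b = [0, 1, 2, 2, 0, 3, 1, 1]  # hypothetical codes, examiner 2

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```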

Training of the deep learning-based CNN (test method)

The pipeline used to develop the AI-based algorithm is described in the following. Before training, the whole set of images (N = 3241) was divided into a training sample (N = 2596) and a test sample (N = 649); the CNN had no knowledge of the latter during training, as it served as an independent test set only. The distribution of all images in relation to the diagnostic classification is shown in Table 1; a minimal sketch of this hold-out split is given after the table.

Table 1 Description of the image set in relation to the diagnostic classification
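As a hedged illustration of the hold-out split described above (not the authors' exact script), the following sketch uses torch.utils.data.random_split; the directory layout, the use of ImageFolder, and the fixed seed are assumptions.

```python
# Minimal sketch of the hold-out split (2596 training images, remainder
# held out for testing). The folder layout and fixed seed are assumptions;
# only the training sample size follows the text above.
import torch
from torchvision import datasets, transforms

dataset = datasets.ImageFolder(
    "mih_photographs/",              # hypothetical path, one subfolder per category
    transform=transforms.ToTensor(),
)

train_len = 2596                     # training sample size from the text
test_len = len(dataset) - train_len  # independent test sample

generator = torch.Generator().manual_seed(42)  # reproducible split (assumed)
train_set, test_set = torch.utils.data.random_split(
    dataset, [train_len, test_len], generator=generator
)
```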

To increase variability within the images, the underlying training set was augmented. For this purpose, randomly selected images (batch size = 16) were multiplied by a factor of ~ 5, altered by different transformations (random center and margin cropping of up to 30% each; random deletion removing up to 30%; random affine transformation of up to 180°; random perspective transformation up to a distortion of 0.5; and random changes in brightness, contrast, and saturation of up to 10%) and resized (300 × 300 pixels). In addition, to compensate for under- and overexposure, all images were normalized [19, 20]. Torchvision (version 0.9.1, https://pytorch.org) in conjunction with the PyTorch library (version 1.8.1, https://pytorch.org) was used. ResNeXt-101 32×8d [30] was selected as the basis for the continuous adaptation of the CNN for MIH detection and categorization. The CNN was trained using backpropagation to determine the gradient for learning; backpropagation was repeated iteratively over images and labels with the abovementioned batch size and parameters. Overfitting was counteracted by two measures: a low learning rate (0.0001) and dropout (at a rate of 0.5) on the final linear layers as a regularization technique. CNN training was repeated over 15 epochs with cross-entropy loss as the error function and the Adam optimizer (betas 0.9 and 0.999, epsilon 1e-8). Training was accelerated by starting from an open-source network with pretrained weights (ResNeXt-101 32×8d pretrained on ImageNet, Stanford Vision and Learning Laboratory, Stanford University, Palo Alto, CA, USA); representations of basic image structures learned there could thus be reused rather than learned from scratch. Training was performed on a university-based computer with the following specifications: RTX A6000 48 GB GPU (Nvidia, Santa Clara, CA, USA), Core i9-10850K 10 × 3.60 GHz CPU (Intel Corp., Santa Clara, CA, USA), and 64 GB RAM [19, 20]. A sketch of this setup is given below.
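The following is a hedged sketch of how this augmentation and transfer learning setup might look in torchvision/PyTorch. The transform parameters, dropout rate, optimizer settings, and nine-category output follow the text; the composition order, the ImageNet normalization statistics, and the omission of the center/margin cropping steps (for brevity) are assumptions.

```python
# Hedged sketch of the augmentation pipeline and transfer learning setup
# described above (PyTorch 1.8 / torchvision 0.9). Transform parameters,
# dropout rate, and optimizer settings follow the text; the composition
# order, the ImageNet normalization statistics, and the omission of the
# center/margin cropping steps are assumptions made for brevity.
import torch
import torch.nn as nn
from torchvision import models, transforms

train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.RandomAffine(degrees=180),                # rotation of up to 180 deg
    transforms.RandomPerspective(distortion_scale=0.5),  # perspective distortion
    transforms.Resize((300, 300)),                       # target input size
    transforms.ToTensor(),
    transforms.RandomErasing(scale=(0.02, 0.3)),         # random deletion up to 30%
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

# ResNeXt-101 32x8d pretrained on ImageNet; the classification head is
# replaced by dropout (rate 0.5) and a linear layer for the 9 MIH categories.
model = models.resnext101_32x8d(pretrained=True)
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.fc.in_features, 9),
)

criterion = nn.CrossEntropyLoss()                        # cross-entropy error function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.999), eps=1e-8)

def train_step(images, labels):
    """One backpropagation step per mini-batch (batch size 16),
    repeated over 15 epochs in the study."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                                      # gradient via backpropagation
    optimizer.step()
    return loss.item()
```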

Statistical analysis

The data were analyzed using Python (http://www.python.org, version 3.8). The overall diagnostic accuracy (ACC = (TNs + TPs)/(TNs + TPs + FNs + FPs)) was determined by counting the true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs). The sensitivity (SE), specificity (SP), positive and negative predictive values (PPVs and NPVs, respectively), and the area under the receiver operating characteristic (ROC) curve (AUC) were computed for the chosen MIH categorization [31]. Saliency maps were plotted to illustrate the image areas that the CNN used to make individual decisions; they were calculated by backpropagating the CNN prediction and visualizing the gradient with respect to the resized input images [19, 20, 32].
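As a hedged illustration of these computations (not the authors' exact script), per-category metrics and a gradient-based saliency map might be derived as follows; the one-vs-rest treatment of each MIH category, the scikit-learn AUC call, and all variable names are assumptions.

```python
# Hedged sketch of the evaluation metrics and gradient-based saliency maps
# (not the authors' exact script). The one-vs-rest treatment of each MIH
# category, the scikit-learn AUC call, and all names are assumptions.
import torch
from sklearn.metrics import confusion_matrix, roc_auc_score

def diagnostic_metrics(y_true, y_pred, y_score):
    """Per-category metrics; y_true/y_pred are 0/1 arrays for one
    category (one-vs-rest), y_score is the predicted probability."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "ACC": (tn + tp) / (tn + tp + fn + fp),
        "SE":  tp / (tp + fn),            # sensitivity
        "SP":  tn / (tn + fp),            # specificity
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "AUC": roc_auc_score(y_true, y_score),
    }

def saliency_map(model, image):
    """Gradient of the predicted class score w.r.t. the resized input image."""
    model.eval()
    image = image.unsqueeze(0).requires_grad_(True)  # add batch dimension
    score = model(image).max()                       # score of the predicted class
    score.backward()                                 # backpropagate the prediction
    return image.grad.abs().squeeze(0).max(dim=0)[0] # per-pixel gradient magnitude
```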

Results

After training, the deep learning-based CNN was able to correctly detect MIH and the associated interventions in eight out of nine MIH categories with a diagnostic accuracy higher than 90% (Table 2). The overall diagnostic accuracy was 95.2%, and the SE and SP amounted to 78.6% and 97.3%, respectively. In detail, the accuracy values of these eight categories ranged from 91.5% (enamel breakdown/no intervention) to 99.1% (enamel breakdown/sealant). The lowest diagnostic accuracy, 88.4%, was found for demarcated opacities with no intervention; this was the only category of the nine in which the target accuracy of 90% was not reached (Table 2).

Table 2 Overview of the diagnostic performance of the developed convolutional neural network (CNN): the independent test set (n = 649 images) was evaluated by the AI-based algorithm for the detection of MIH-related enamel disturbances and related interventions. The overall diagnostic accuracy (ACC), sensitivity (SE), specificity (SP), negative predictive value (NPV), positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC) were computed

When considering the diagnostic parameters SE and SP in detail (Table 2), it is important to note that the SP values were consistently high, ranging from 92.9% (demarcated opacity/no intervention) to 100.0% (enamel breakdown/sealant), in comparison to the SE values, which ranged from 40.0% (enamel breakdown/sealant) to 96.2% (no MIH/sealant). The AUC values varied from 0.873 (enamel breakdown/sealant) to 0.994 (no MIH/sealant). Given the overall high AUC values, no ROC curves were plotted.

The confusion matrix (Fig. 2) illustrates the case distribution in the test set. It shows that the majority of diagnostic predictions by the AI-based algorithm (test method) agreed with the expert decision in the test set. However, a notable number of cases were not categorized correctly, especially when multiple characteristics were present on one photograph. In addition to the explorative data analysis, exemplary saliency maps (Fig. 3) are shown to illustrate the areas on each intraoral photograph that the CNN used for decision-making.

Fig. 2

The confusion matrix shows the case distribution between the convolutional neural network (CNN, test method) and the expert diagnosis for MIH assessment in the independent test set (n = 649 images)

Fig. 3

Example clinical images and the corresponding test results generated by the AI-based algorithm. The illustration also includes saliency maps that depict the image areas (in blue) that the CNN used during the decision-making process

Discussion

The present diagnostic study demonstrated that an AI-based algorithm is able to detect MIH on intraoral photographs with a moderately high diagnostic accuracy (Table 2). Given that an accuracy > 90% was achieved in eight out of nine categories, the initially formulated hypothesis was accepted. Considering the documented accuracy and AUC values (Table 2), it can further be concluded that the overall diagnostic performance appears satisfactory, but the partially low SE values alongside the high SP values indicate that the reported data need to be interpreted with caution. The SP played a far more important role in this image sample, probably because of the higher number of teeth without MIH. The diagnostic accuracy is therefore driven mainly by the SP rather than the SE, and it could be argued that the AI-based algorithm is better at scoring sound teeth than teeth with MIH. In this context, the complex clinical appearance of teeth with MIH, especially molars, needs to be highlighted. Multiple findings can be present in one tooth with MIH, and this complexity is further accentuated on intraoral photographs, which nowadays offer good resolution and can be evaluated thoroughly by the study team. Here, several demarcated opacities exhibited more or less extended enamel breakdowns that can be difficult to assess and, even for experienced clinicians, to allocate to one of the given categories. While the experts strictly allocated teeth with small enamel breakdowns to this category, the saliency maps (Fig. 3) suggest that the developed AI-based algorithm had some difficulty in making such strict decisions as well. The same might be true for brighter demarcated opacities or small atypical restorations, for which the experts can provide precise assessments. To address and overcome this issue, appropriate pixelwise annotation must be recognized as a forward-looking methodological approach. However, this also requires a well-trained and well-calibrated annotator team as well as consistent quality control to ensure correct diagnostic decisions. In the present study, the reproducibility was found to be in a good to excellent range. Additionally, the independent check of each diagnosis by an experienced dentist as well as consensus discussions and decisions completed the quality management.

Furthermore, the concept of transfer learning must be discussed. In contrast to earlier studies by our group that included only one diagnostic domain, e.g., caries [19] or sealant detection [20], the clinical complexity of teeth with MIH required the consideration of two domains with three diagnostic scores each, ultimately resulting in nine categories (Fig. 1). Notably, an imbalance of clinical cases is closely linked to this cross-tabulated case categorization: a few categories are underrepresented owing to their rarity in clinical practice. The clinical variability of MIH characteristics as well as the low frequency of some categories therefore probably impeded the training of the AI-based algorithm and may have lowered its overall diagnostic performance in comparison to the previously mentioned studies, which used only a few diagnostic categories. To overcome this issue, the previously mentioned measures of enlarging the image data set and performing pixelwise annotation must be reiterated. Nevertheless, when considering the overall diagnostic performance of this first AI-based algorithm for MIH categorization, the documented results (Table 2 and Fig. 2) should be interpreted as encouraging, although consistent future research is required.

Since no comparative studies or other AI-based methods for MIH diagnostics are available thus far, this aspect cannot be discussed specifically with respect to the current literature. However, it is feasible to consider results from other recently published diagnostic studies that used clinical photographs for the detection and categorization of dental findings. One workgroup [21] published data for plaque detection on primary teeth, where an accuracy of 86.0% was reached. Noncavitated and cavitated caries lesions were detected with accuracies of 92.5% and 93.3%, respectively [19]. In other recently published diagnostic studies, white spot lesions were registered automatically with 81–84% accuracy [26], and caries lesions were classified and located with a mean AUC of 85.6% [33]. When the available diagnostic performance data for various dental findings on different types of X-ray images [10,11,12,13, 15, 16, 18,19,20, 34, 35] are also considered, the diagnostic accuracies documented in this trial are of the same order of magnitude as those of several other dental reports.

When summarizing the methodological strengths of this study project, it can be concluded that it was technically feasible to develop CNNs with substantial precision using the described pipeline for software development. It can therefore be predicted that AI-based diagnostics will gain increasing attention in dentistry in the near future, although further developments are needed before such tools can be used in a clinical setting [35, 36]. Moreover, it is crucial to assess the need for numerically extensive, high-quality image material to further improve the performance of the developed CNN for MIH categorization. Simultaneously, less frequent diagnostic categories should be included in adequate numbers. Independent of this, AI-based algorithms should also be developed for rare developmental disorders, e.g., dentinogenesis or amelogenesis imperfecta. The chosen methodology primarily presents a straightforward approach to handling dental diagnoses and is typically linked with diagnostic accuracy values of approximately 90% (Table 2, Fig. 2). To increase diagnostic performance toward 100%, the methodological requirements of consistently improving the data set and of detailed image annotation by pixelwise labelling have been expressed. Another aim might be to perform CNN training on high-performance computers to reach a higher degree of neuronal connectivity. However, all these requirements will necessitate more time as well as additional personnel and computing resources.

Conclusion

In the present study, it was possible to automatically categorize clinical photographs of teeth with MIH using a trained deep learning-based CNN with an overall diagnostic accuracy of 95.2%. The higher NPV and SP values in comparison to the PPV and SE indicate that the CNN performed better on healthy teeth than on teeth with MIH. Future improvements are necessary to increase the diagnostic performance.