Comparison between stone and digital cast measurements in mixed dentition

Purpose To assess the validity, reliability, reproducibility, and objectivity of measurements on stone casts of patients with mixed dentitions compared to measurements on three-dimensional (3D) digital models derived from surface scans of the stone casts. Methods Pairs of stone casts of 30 young patients in their mixed dentition stage were included and processed into 3D digital models using an intraoral scanner (iTero Element 2; Align Technology, San Jose, CA, USA). Then an experienced and an inexperienced examiner independently performed measurements of five defined parameters, each in triplicate, both on the digital models with analysis software (OnyxCeph3™; Image Instruments, Chemnitz, Germany) and on the original casts with a vernier calliper. Paired t-tests were used for validity and interexaminer objectivity, Pearson correlation coefficients for intermethod reliability, and intraclass correlation coefficients (ICCs) for reproducibility testing. Results Significant (p < 0.05) intermethod differences were identified for four parameters, but only the differences for overbite and intermolar distance exceeded the threshold of clinical relevance (≥ 0.5 mm). Intermethod reliability was high and method error invariably lower for the digital measurements and for the experienced examiner. Both examiners achieved ICCs > 0.907 with both methods. Interexaminer variation involved significant differences for all parameters but one (intermolar distance) on the stone casts and for three parameters on the digital models. Conclusion Measurements performed on digital models of mixed dentitions can yield clinically acceptable outcomes with OnyxCeph3™ software. Both the digital and the analogue measurements were highly reproducible and reliable. Objectivity of the measurements could not be confirmed, as operator experience did make a difference.


Introduction
Orthodontics is going digital. In recent years, digitization has become an indispensable part of orthodontists' workflows. Paperless patient information systems, including digital photographs and radiographs, are today standard [33].
In addition, it is essential for comprehensive orthodontic diagnostics and treatment planning to analyse study casts that represent the patient's dentition. Mixed dentitions require high accuracy in evaluating arch length discrepancies (ALD) for successful orthodontic treatment, and space analysis in these situations is traditionally performed by contrasting the existing mesiodistal width of the supporting area with a nominal value [27].
The gold standard for diagnostic measurements is to use a calliper on a stone cast [3]. Study casts made of dental stone are, however, susceptible to damage and loss of accuracy by external forces [10]. Following research and development to address this issue, Cadent (now Align Technology, San Jose, CA, USA) introduced the first generation of OrthoCad™ software for "digital casts" in 1999 [15]. In 2006, this was followed by the iTero Element (Align Technology, San Jose, CA, USA) intraoral scanner (IOS), which uses parallel confocal imaging and point-and-stitch reconstruction to generate three-dimensional (3D) comput-erized images. Current software applications (e.g., Onyx-Ceph3™; Image Instruments, Chemnitz, Germany) are capable of managing, analysing, and generating virtual arch models.
Reported benefits of using an IOS include better patient comfort [43], reduced storage requirements for casts [29], ability to share data easily anywhere in the world [39], extensive possibilities of analysing digital models [39], and time efficiency [14,26]. As a result, 3D imaging is today routinely used in orthodontic diagnostics and treatment planning. Young patients, in particular, tend to require maximum comfort and hence prefer the digital impression technique [5]. Digital study models can be generated in three different ways: directly by intraoral scanning, indirectly by surface scanning of stone casts, or based on a cone-beam computed tomography (CBCT) scan [40].
However, while the digital method is a clinically acceptable alternative to the analogue gold standard in analysing permanent dentitions, no comparative studies have been available for digital versus analogue measurements in children with mixed dentitions. These latter situations are different in that, rather than measuring all teeth, a limited number of permanent teeth need to be analysed along with the supporting area. Hence, longer distances must be measured when examining the supporting area, which might be more difficult to achieve accurately.
Evidence-based orthodontics relies on analysis methods known to offer validity, reliability, reproducibility, and objectivity for treatment planning to remain unaffected. The aim of validation is to prove that these requirements are met by an intended analysis technique in daily practice [39]. In this context, validity is defined as the truth value of evidence and whether what is measured equals what was intended to be measured, equating accuracy [35], intermethod reliability as the degree to which test scores are consistent when there is variation in the methods or instruments used [13] and reproducibility, being concerned with the consistency of evidence [12], as the closeness of agreement between the results of successive measurements of the same measure carried out under the same conditions [44].
Against this background, we designed the present study to assess measurements obtained on digital models as compared to the gold standard. Given the variance in how well a method performs depending on the experience of the examiner, we incorporated this factor into the study design to assess objectivity, defined as neutrality of evidence [12].
The null hypothesis of the study was that measurements on digital models would not show significant differences from that on conventional stone casts so that the digital method will be considered as accurate, reliable, reproducible, and objective.

Methods
Approval of the study design, which was in accordance with the 1975 Declaration of Helsinki as revised in 2013, was obtained by the Ethics Committee of the Medical University Innsbruck (Innsbruck, Austria; study ID: 1124/2021). Using G*power software (version 3.0.1; University of Düsseldorf, Germany), a sample size estimation was conducted as described elsewhere [16,23,25] to compare differences between the dependent groups by paired t-tests. Assuming mean differences of 0.5 mm with standard deviations < 1 mm as clinically relevant, a sample size of 30 pairs of stone casts was found to be required to achieve a power of 80%, with α set to 0.05 and β to 0.2.
Hence, we randomly retrieved 30 pre-existing pairs (upper and lower arch) of stone casts that represented mixed dentitions from the records of our Department of Orthodontics (Medical University of Innsbruck, Austria). All had been fabricated from ISO type 4 stone (Silky-Rock yellow; Whip Mix, Louisville, KY, USA). Based on dental age, we estimated that these fully anonymized orthodontic patients had been 7-13 years old.
We gathered this sample based on the criteria of mixed dentition (permanent incisors and first molars fully erupted), intact stone casts, and no previous orthodontic treatment. To ensure homogeneity of the sample, care was taken choosing only casts with fully erupted teeth. Exclusion criteria were syndromes.
Baseline measurements on the casts were performed using a vernier calliper (Zurich model; Dentaurum, Ispringen, Germany) to the nearest 0.1 mm. Then, the casts were mounted in a stand and digitized by one experienced operator using an IOS (iTero Element 2, version 1.12.0.990; Align Technology) as recommended by the manufacturer (Fig. 1). All scans were visually checked on screen, followed by re-scanning whenever a flaw was identified. Upper and lower casts were scanned separately, the scans saved as standard tessellation language (STL) files and uploaded to the internet for further transformation into digital casts.
All of the returned digital casts, accessible for download within 48 h, were measured applying the OnyxCeph3™ Pro analysis software (version 3.2.147; Image Instruments, Chemnitz, Germany) on the same computer (Fig. 2). This environment supports an effective workflow for orthodontic archiving, diagnostics, planning, and counselling, thus assisting orthodontists in optimizing administrative tasks and A digital study cast portrayed in OnyxCeph3 TM (Image Instruments) dental software with measuring points set. The red dots indicate the landmarks which must be set by the operator Abb. 3 Ein digitales Modell im OnyxCeph3 TM (Image Instruments) Bearbeitungsprogramm. In rot jene Punkte, die vom Befundenden selbst gesetzt werden müssen implementing treatment. For the segmentation process, the middle of each tooth must be selected by the examiner to simulate the shape of each tooth. Then, the software offers automated identification of 20 landmarks (12 mesiodistal widths of incisors and molars, 8 upper and lower incisal edges) and requires input for seven more points, including one point on the palatal aspect of each upper molar, one on the labial aspect of each lower incisor, and the origin of the second pair of palatal rugae on the median plane (Fig. 3).
To assess objectivity, two examiners with different levels of experience performed the same measurements independently: an orthodontist with four years of practice and experienced in scanning (examiner 1), and a dental student in his last year of training without any experience in orthodontics or scanning (examiner 2). To calibrate reproducibility and minimize interexaminer variability, all measurements on stone casts and digital models were performed in triplicate by both examiners under identical conditions with intervals of at least one week.
To evaluate the validity, measurements for maxillary and mandibular space analysis were performed to the nearest 0.1 mm, recording and calculating for each model the amounts of available space, required space, and arch length discrepancy (ALD). Overjet, overbite, and upper intermolar distance were measured to the nearest 0.5 mm: overjet from the incisal edge of the most anterior upper incisor to the labial surface of the most anterior lower incisor; overbite as the longest vertical overlap between upper and lower incisors; maxillary intermolar distance (McNamara, [22]) between the lingual surfaces of both upper first molars. Clinical relevance was assessed by subtracting the mean differences of repeated measurements obtained on the stone casts from those on the digital models. Positive values of mean differences () indicated that the measurements on the stone casts were smaller than those on the digital models. In accordance with the literature, differences ≥ 0.5 mm were regarded as clinically relevant [34].
For validity assessment, only measurements of examiner 1 were taken into consideration, in order to avoid distorted results due to inexperience. for each examiner and method were compared using paired t-tests for validity and objectivity. Method errors were evaluated based on Dahlberg's formula and intermethod reliability by Pearson correlation coefficients (r). Reproducibility was assessed as intra-examiner repeatability of measurements, using intraclass correlation coefficients (ICCs), > 0.9 indicating excellent, 0.75 to < 0.9 good, 0.5 to < 0.75, moderate, and < 0.5 poor reproducibility. Differences emerging from any of these statistical tests were regarded as significant at p < 0.05 or as highly significant at p < 0.01.   Table 1). The differences for ALD UJ, ALD LJ, and overjet were not clinically relevant (< 0.5 mm). Thus, the null hypothesis regarding the accuracy of measurements obtained from digital casts had to be rejected, at least partly. Table 2 lists both examiners' random errors of repeated measurements. On the stone casts, the experienced examiner 1 produced errors within a range of 0.03 to 0.53 mm, while the inexperienced examiner 2 had a considerably harder time, with errors ranging from 0.1 to 0.75 mm. While both did invariably better on the digital models, examiner 2 again incurred higher errors for all five parameters. Given r-values of 0.499-0.918, a positive correlation between both measuring methods could be established for all parameters ( Table 2).

Intermethod reliability
Based on these results, the null hypothesis had to be accepted. Table 3 gives an overview of intraclass correlation coefficients, which were calculated as a measure of consis-  (Table 3). Overall, the null hypothesis regarding the reproducibility of measurements for both methods had to be verified. Table 4 summarizes all mean values of measurements and the calculated interexaminer differences, evaluated both by paired t-tests to determine statistical significance and by absolute differences to determine clinical relevance. The method on stone casts was associated with significant K  (Table 4).

Objectivity
In consideration of these findings, the null hypothesis concerning the objectivity, had to be declined.

Discussion
The authors are not aware of any previous studies evaluating the validity, reliability, reproducibility, and objectivity of measurements performed on digital models of mixed dentitions. Because there is no direct way of determining the true dimensions of teeth, measurements on stone casts were considered the gold standard against which validity was to be evaluated.
Since the accuracy of digital models is known to depend on scanner types and intraoral conditions (e.g., saliva or restricted mouth opening) [6,11], our use of surface scans performed on stone casts ruled out any loss of precision that may have been incurred by clinical intraoral capturing. In other words, none of the differences between stone casts and digital models we observed can be attributed to intraoral scanning distortion.
In the hands of the experience examiner, all mean differences in repeated measurements between the stone casts and the digital models were significant (p < 0.05) or highly significant (p < 0.01), the only exception being ALD UJ (p = 0.28). However, clinically relevant differences were noted for only two parameters (overbite: 1.04 mm; transver-sal measurement according to McNamara: 1.18 mm). Differences in overbite measurements were most likely due to the delicate contact points on the digital images compared to the bulky calliper on the stone casts. Differences noted between the analogue and digital measurements of intermolar distance might be a function of the longer distance to be measured.
For three out of five measured parameters, our results, although based on a different method of digitization and different software (OnyxCeph3™), support previous findings that the validity of measurements on digital models (OrthoCad) is clinically acceptable [4,20,21,28]. Radeke et al. [29], who previously compared measurements on stone casts and digital models with OnyxCeph3™, did report differences between software-and calliper-based measurements of mesiodistal tooth width but found them to be operator-associated and not statistically significant. Zilberman et al. [45] and Santoro et al. [34] found that stone casts resulted in slightly better measurements than digital models. Both studies relied on impressions sent to OrthoCad, so that any flaws during impression-taking or subsequent alginate shrinkage could have been passed on to the digital models. Alginate shrinkage has been previously reported to account for underestimated measurements on digital models, compared to the situation in vivo [24,37], which might affect the validity.
Consistent with our own result that manual overbite measurements were significantly increased (Table 1), the manual method has repeatedly been implicated in exaggerated overbite measurements [34,38]. Thus, in absolute terms of millimetres, smaller teeth should be expected to result in smaller overbite values, although inconsistent overbite measurements could also have been introduced by other factors such as incorrect probe angulation during manipulation of the traditional casts or rounding of the digital measurements to the nearest 0.5 mm. Another explanation might lie in digital measurements of tooth size consistently falling short of the values obtained on stone casts [23,36]. In our study, this observation held true for the parameters ALD UJ, overbite, and the transversal measurement according to McNamara.
Notwithstanding, it stands to reason that the measuring tools of the OnyxCeph3™ software may allow for better defined measuring protocols resulting in more valid measurements than callipers used on stone casts. Further, automated identification of many landmarks requires the examiner merely to visually check these results (Fig. 3). Thus, one could speculate that inaccuracies of analogue measurements may be introduced by human error in the form of misreading callipers or misplacing reference points. That said, only the reference points for ALD UJ, ALD LJ, and overbite measurements are automatically set by OnxyCeph3 TM , whereas analogue versus digital analysis yielded significantly different mean values for all the parameters investigated but one (ALD UJ: p = 0.28).
Intermethod reliability indicates the agreement of results obtained with any two comparable measuring methods. A Pearson correlation coefficient of > 0.499 testifies to a high degree of agreement between the two methods in our study (Table 2). Zilberman et al. [45] and Bell et al. [2] reported mean intra-examiner errors of 0.18 or 0.17 mm, indicating excellent results on both stone casts and digital models in line with our own findings.
Intra-examiner ICC values as calculated in the present study yield estimates of reproducibility under the model of equal marginal distributions. When marginal distributions are inaccurate, the deviations are captured and rated as unreliable. Based on all parameters, consistent agreement of repeated measurements on the stone casts emerged for the experienced examiner 1, given an ICC > 0.918 as compared to > 0.666 for the inexperienced examiner 2 ( Table 3). All parameters measured by examiner 1 achieved ICCs > 0.9 (= excellent reproducibility). Most of those performed by examiner 2 also exceeded the threshold of excellent reproducibility; one digitally measured parameter was rated as "good" (ICC: 0.8775) and one analogue parameter as "moderate" (ICC: 0.666). Overall, both examiners achieved very good ICCs. The fact that those for the experienced examiner were slightly higher suggest that different levels of orthodontic experience may have an influence on reproducibility.
The reproducibility of analogue measurements performed on stone casts may not only depend on an operator's ability to identify landmarks and to accurately transfer quantitative data to a computer, but there is a need for careful handling to avoid breakage of the casts. The reproducibility of digital measurements will also depend on the hardware and software used for digitization [18]. In the present study, excellent reproducibility was noted for measurements performed on both stone casts and digital models. Similar results have been reported previously by Wiranto et al. [42], Stevens et al. [36], and Czarnota et al. [8]. Measurements of arch length discrepancy were associated with slightly higher ICCs for the digital technique. Measurements of overbite, overjet, and the transversal measurement according to McNamara involved moderately higher ICCs for the calliper-based analogue technique. Arguably, this might be because the reference points for overjet and McNamara need to be manually selected on screen, thus being more prone to misjudgement, while all landmarks for tooth-width and space analysis are identified by OnyxCeph3™ on its own.
Four out of five interexaminer differences for measurements on stone casts were statistically significant ( Table 4). All these differences, however, remained below the threshold of clinical relevance (< 0.5 mm). Based on digital measurements, significant interexaminer disagreement was seen for three parameters (ALD LJ, overbite, transversal measurement according to McNamara) and reached clinical relevance for McNamara (-1.83 mm). Given this nearly complete absence of clinically relevant differences between both examiners, both methods may be regarded as objective. The only exception just mentioned, a large interexaminer difference between the digital measurements of upper intermolar distance, may be attributed to operator experience, as the landmarks for this parameter need to be identified manually in OnyxCeph3™. This finding is consistent with Radeke et al. [29]. Intermolar distance remains a major problem of intraoral scanners, with large deviations incurred by the matching/stitching algorithm [17,32].
Our data further support findings of Dalstra et al. [9] about the effect that interexaminer variation was generally low. Rheude et al. [30] investigated how orthodontic experience might influence treatment planning. In accordance with our own results, they found statistically significantly differences in diagnostic decisions, which did not, however, result in different treatment decisions. It should be noted that the present study featured an inexperienced examiner solely to identify different measurements as a function of orthodontic expertise.
Limitations to our study arise from the fact that the intraoral scanner was used on stone casts rather than in patients' mouths and examination was done retrospectively. Though, the sample size was checked conscientiously, it might be limited for a study of diagnostic accuracy.

Conclusions
Digital models of permanent dentitions are today an accepted alternative to stone casts in orthodontics. Driven by both an emerging aesthetic idealism and intraoral scanners eliminating the need for impression-taking, the demand for orthodontic treatment of paediatric patients has been growing in recent years. Intraoral scanners can minimize discomfort, which is essential in children. This preliminary study confirms that digital models (here generated based on surface scans of physical casts) can be a reliable alternative to stone casts in analysing mixed dentitions. Measurements performed on these virtual models yield reproducible intra-and interarch relationships, but the results are affected by operator experience. We cannot confirm that longer distances have a greater deviation between measurements, even though the transversal measurement according to McNamara is one long distance to measure, but the space analysis requires many small measurements. Welldesigned investigations dealing with intraoral scanning followed by digital model generation are needed to assess the time requirements for these procedures in young patients to evaluate the feasibility of this digital work-flow. Ethical standards All procedures performed in studies involving human participants or on human tissue were in accordance with the ethical standards of the institutional and/or national research committee and with the 1975 Helsinki declaration and its later amendments or comparable ethical standards. Approval was obtained from the ethics committee of the Medical University of Innsbruck.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4. 0/.