Review

Introduction

Since the introduction of lateral cephalometric radiography in 1931 by Broadbent in the USA and by Hofrath in Germany, this radiograph and its related analyses have become a standard tool in orthodontic assessment and treatment planning [13]. Lateral cephalometric radiographs are systematically collected prior to orthodontic treatment in many European countries [3, 4]. Nevertheless, the real value of this imaging technique for diagnosis and planning of orthodontic treatment remains uncertain [27]. Some authors showed that an adequate orthodontic diagnosis and treatment plan could not be performed without comparing a lateral cephalometric radiograph before and after orthodontic treatment and that treating skeletal malocclusions without a cephalometric radiograph introduced serious errors [8]. While only a small percentage of the orthodontic treatment planning are modified based on lateral cephalometric radiographic analysis [6], it could adjust some aspects of treatment planning, such as tooth extraction, extract pattern and anchorage features [2, 9].

The controversy about the correct use of the lateral cephalometric radiograph is also present in orthodontic textbooks where guidelines for orthodontic imaging are not expressed properly [10]. Several radiographic techniques, like panoramic and full-mouth periapical radiographs, used in orthodontics are found unproductive, since it provides duplicate information [4]. The latter is an important finding as the use of ionising radiation should always be justified and kept ‘as low as reasonably achievable’ and definitely in children as radiographs are often performed at different time intervals during orthodontic treatment [11, 12]. Even when there are means to optimise radiation dose of cephalometric radiographs, the primary issue is to justify the decision to take a lateral cephalogram prior to orthodontic treatment [10, 11, 1315].

The present systematic review was initiated by the fact that three-dimensional (3D) cephalometric analysis is emerging, while there is still lack of scientific evidence on the validity and reliability of two-dimensional (2D) cephalometric imaging for orthodontic treatment planning [2, 3].

Therefore, the aims of this study were to systematically review the available scientific literature and to evaluate the existing evidence about the validation of lateral cephalometric radiograph in orthodontics. This review also studied the accuracy and reliability of lateral cephalograms and its cephalometric analysis.

Materials and methods

Information sources

A comprehensive electronic database search to identify relevant publications was conducted, and the reference lists in relevant articles were searched manually for additional literature. We set no language limitations, although we did not attempt to explore the informally published literature: conference proceedings and abstracts of research presented at conferences and dissertations. The following databases were searched: Ovid Medline (1946 to 11 January 2012), Scopus (to 11 January 2012) and Web of Science (1899 to 11 January 2012).

Search strategy

We developed the search strategy with the help of an information specialist at King's College London Dental Institute in London. The searches did not have a date limit and were not restricted to particular types of study design. The search strategy focused on the following terms: Cephalometr* and (orthodontic* or ‘orthodontic treatment planning’) and (‘efficacy’ or ‘reproducibility’ or ‘repeatability’ or ‘reliability’ or ‘accuracy’ or ‘validity’ or ‘validation’ or ‘precision’ or ‘variability’ or ‘efficiency’ or ‘comparison’) not (‘Cone-Beam Computed Tomography’ or ‘Three-Dimensional imaging’ or ‘Cone Beam Computed Tomography’ or ‘Cone Beam CT’ or ‘Volumetric Computed Tomography’ or ‘Volume Computed Tomography’ or ‘Volume CT’ or ‘Volumetric CT’ or ‘Cone beam CT’ or ‘CBCT’ or ‘digital volume tomography’ or ‘DVT’ or ‘Spiral Computed Tomography’ or ‘Spiral Computer-Assisted Tomography’ or ‘Spiral Computerized Tomography’ or ‘spiral CT Scan’ or ‘spiral CT Scans’ or ‘Helical CT’ or ‘Helical CTS’ or ‘Helical Computed Tomography’ or ‘Spiral CAT Scan’ or ‘Spiral CAT Scans’ or ‘3D’ or ‘3-D’ or ‘three dimension*’).

Study selection

At the first stage, two reviewers (experienced dento-maxillofacial radiologists) independently screened the titles of the retrieved records, and only the titles related to 2D cephalometry, radiographs for orthodontic treatment and tracings were included. Next, the abstracts of the retrieved publications were read by the two observers and categorised according to the study topic. An article had only to be justified by one observer to be included for the second selection phase. All articles of interest in languages other than English were included. Of these two were included, one article was written in Portuguese and another in French. Eligibility of potential articles was determined by applying the following inclusion criteria to the article abstracts: (1) technical efficacy, (2) diagnostic accuracy efficacy, (3) diagnostic thinking efficacy, (4) therapeutic efficacy, (5) patient outcome efficacy or any combination of the previous items as published by Fryback and Thornbury [16]. The other inclusion criteria were (1) accuracy, (2) reliability, (3) validity of lateral cephalometric radiograph, (4) landmark identification on tracings (intra- and inter-observer errors) and (5) the effect of using 2D cephalometry on the orthodontic treatment plan.

Diagnostic accuracy efficacy was defined as follows:

  1. 1.

    Observer performance expressed as overall agreement, kappa index or correlation coefficients

  2. 2.

    Diagnostic accuracy as percentage of correct landmark identification and further tracing analysis, validity and effectiveness of cephalometry in orthodontic treatment planning

  3. 3.

    Sensitivity, specificity or predictive values of landmark identification

Diagnostic thinking efficacy was defined as follows:

  1. 1.

    Percentage of cases in a series in which images were judged ‘helpful’ for the diagnosis

  2. 2.

    Difference in clinicians' subjective estimated diagnosis probabilities before and after evaluation of the cephalogram

Therapeutic efficacy was defined as follows:

  1. 1.

    Percentage of times the image was judged helpful in planning management of the patients in a case series

  2. 2.

    Percentage of times therapy-planned pre-visualization of a lateral cephalogram needed to be changed after the image information was obtained

  3. 3.

    Percentage of times clinicians prospectively stated therapeutic choices needed to be changed after evaluating a cephalogram

  4. 4.

    Whether different analyses lead to different decisions on treatment planning

  5. 5.

    Intra- and inter-observer identification errors

  6. 6.

    Reliability of landmark identification

The analysis had to be based on primary materials or comprise a review on efficacy. When an abstract was considered by at least one author to be relevant, it was read in full text. At the second stage, the full texts were retrieved and critically examined. Reference lists of publications that had been found to be relevant in the first stage were hand-searched, and articles containing the words ‘cephalometry’ , ‘lateral cephalometric radiography’ , together with ‘treatment planning’ , ‘orthodontic radiographs’ , ‘landmark identification’ and ‘error’ were selected. Book chapters and reviews were excluded since the aim of this systematic review was to evaluate primary studies.

Data extraction

Data was extracted with the aid of protocol 1 (Table S1 in Additional file 1). It was established by reading the relevant literature on how to critically evaluate studies about diagnostic methods. To minimise bias, two observers independently evaluated the quality and validity of original studies according to the quality assessment of diagnostic accuracy studies tool using protocol 2 (quality assessment of studies of diagnostic accuracy included in systematic reviews - QUADAS) (Table S2 in Additional file 1) [17]. When there was any disagreement concerning the relevance of an article, it was resolved by a discussion between the two reviewers. Each observer presented their arguments, and further discussion was held until a consensus was reached. Before the assessment, the protocols were tested for ten publications. A further five publications were read to calibrate the two reviewers regarding the criteria in protocol 2. Only publications that were found to be relevant to the reviewer in both protocols 1 (diagnostic efficacy) and 2 (level of evidence) were ultimately included. The quality and internal validity (level of evidence) of each publication was judged to be high, moderate or low according to the criteria in the following subsection [18].

Levels of evidence and criteria for evidence synthesis

High level of evidence

A study was classified with high level of evidence if it fulfilled all of the following criteria:

● There was an independent blind comparison between test and reference methods.

● The population was described so that the status, prevalence and severity of the condition were clear. The spectrum of patients was similar to the spectrum of patients on whom the test method will be applied in clinical practice.

● The results of the test method being evaluated did not influence the decision to perform the reference method(s).

● Test and reference methods were well described concerning technique and implementation.

● The judgments (observations and measurements) were well described considering diagnostic criteria applied and information and instructions to the observers.

● The reproducibility of the test method was described for one observer (intra-observer performance) as well as for several (minimum 3) observers (inter-observer performance).

● The results were presented in terms of relevant data needed for necessary calculations.

Moderate level of evidence

A study was assessed to have a moderate level of evidence if any of the above criteria were not met. On the other hand, the study was assessed not to have deficits that are described below for studies with a low level of evidence.

Low level of evidence

A study was assessed to have a low level of evidence if it met any of the following criteria:

● The evaluation of the test and reference methods was nonindependent.

● The population was not clearly described, and the spectrum of patients was distorted.

● The results of the test method influenced the decision to perform the reference method.

● The test or the reference method or both were not satisfactorily described.

● The judgments were not well described.

● The reproducibility of the test method was not described or was described for only one observer.

● The results could have a systematic bias.

● The results were not presented in a way that allowed efficacy calculations to be made.

Rating conclusions according to evidence grade

The scientific evidence of a conclusion on diagnostic efficacy was judged to be strong, moderately strong, limited or insufficient depending on the quality and internal validity (level of evidence) of the publications assessed [18, 19]:

● Strong research-based evidence: at least two of the publications or a systematic review must have a high-level of evidence.

● Moderately strong research-based evidence: one of the publications must have a high level of evidence and two more of the publications must have a moderate level of evidence.

● Limited research-based evidence: at least two of the publications must have a moderate level of evidence.

● Insufficient research-based evidence: scientific evidence is insufficient or lacking according to the criteria defined in the present study.

Synthesis of evidence

The results of this review were described narratively. No meta-analyses were attempted because of lack of original studies.

Results

The number of articles reviewed in each phase to perform this systematic review is presented in the PRISMA flow diagram (Figure 1) [20]. The initial search revealed 784 articles listed in Medline (Ovid), 1,034 in Scopus and 264 articles in the Web of Science. The second stage of the search protocol was to retrieve the reference lists of the selected articles, which yielded 14 additional articles of interest. After excluding 1,128 duplicates, 968 articles remained for review. In the first phase selection, the observers screened the articles by reading titles and abstracts. Articles that were not eligible because of irrelevant aims and were not directly related to this systematic review were excluded, thus 203 articles remained for further reading. Thirty-five articles were assessed for eligibility.

Figure 1
figure 1

Methodology followed in the article selection process (adapted from Moher et al.[20]).

After screening all the articles using protocols 1 and 2, 17 articles met the inclusion criteria and were selected for qualitative synthesis and appraised to present some level of evidence. All articles that remained after screening passed the qualitative synthesis.

These 17 articles were categorised by topics as follows: 7 studies on the role of cephalometry on the orthodontic treatment planning, 8 studies on cephalometric measurements and landmark identification and 2 studies on cephalometric analysis.

Role of cephalometry on the orthodontic treatment planning

Seven articles related to the importance and contribution of cephalometry to orthodontic treatment planning were found (Table 1). Six of the publications were found to have low levels of evidence [24, 6, 9, 10] and one classified as moderate level of evidence [7].

Table 1 Publications related to the importance and contribution of cephalometry on the orthodontic treatment planning

Cephalometric measurements and landmark identification

Only eight articles were selected as eligible in this category (Table 2). Five publications presented a moderate level of evidence [2125], while the other three were identified as having a low level of evidence [5, 26, 27].

Table 2 Publications concerning landmark identification

Cephalometric analysis

Two publications with low-level evidence were found [28, 29]. The studies did not use any reference standards, and the number of observers was not stated. The study designs were also not clearly explained (Table 3).

Table 3 Publications on cephalometric analysis

Discussion

The validity, efficacy and contribution of cephalometry in orthodontic treatment planning remain questionable [2]. In 2002, 90% of orthodontists in the USA routinely performed cephalometric radiographs [3]. This systematic review was performed to assess the validity and reliability of 2D lateral cephalometry used for orthodontic treatment planning as well as the errors that can occur on 2D tracing. Despite the abundant amount of articles found on lateral cephalometry (n = 968), it is surprising that the present systematic review could only identify very few studies (n = 16, 1.6%) on its validity and reliability. This finding underlines the need for the present study and is an important cross point, considering the fact that we are flooding into 3D cephalometric studies nowadays. Apart from our findings, 2D cephalometry has other specific limitations, such as orthognatic surgery, airway and growth assessment and skeletal maturation. In order to be included in this systematic review, publications had to satisfy pre-defined methodological criteria. Two protocols were used regarding the search strategy, one based on diagnostic methods and the second based on the QUADAS tool [17]. The ‘levels of evidence’ for assessing the quality and internal quality of each publication included in this review - how well the study was designed, how reliable its results appeared to be and the extent to which it addressed the questions posed - were modified according to the Oxford Center for Evidence-Based Medicine levels of evidence for diagnostic methods (CBEM) [18]. Only publications assessed to present a high or moderate level of evidence can form the basis for any scientific conclusions. Ten articles were identified as low level of evidence, five had moderate level and only one showed high level of evidence.

All retrieved articles, assessing the importance and contribution of lateral cephalometric radiograph in orthodontic treatment, concluded that there is no significant difference on treatment planning decision with or without the evaluation of the lateral cephalogram. However, it should be considered that the suitable studies in this review were based on small samples rather than large cohorts representing the entire population. In one study, the sample used was restricted (six patients) [2]. Furthermore, the short time lapse between observations in some studies did not allow a full washout effect, which could lead to the repetition of the results [4, 7, 10]. The latter bias is further strengthened by the fact that recognition factors were often included, e.g. the possibility of identifying patient by photographic visualisation as part of the examination. On the other hand, in one paper, only dental casts were presented to the observers, which might also lead to error since it does not mimic the clinical situation. Sample bias is also suspected based on the fact that selection of subjects is often poorly described or unclear [2, 6, 9], like the questions made to the observers that were not stated by any questionnaire [6], and in one article, observers were forced to choose yes/no answers, which again do not perfectly simulate the reality [3].

In the two articles by Atchison et al. there was the possibility to identify patients as well as sample size was very restricted (six patients). There was no repetition of the questionnaire to test the variability between answers [4, 10]. When it comes to the validity and reliability of cephalometric analysis, several errors should be considered: landmark identification, tracing and measuring, and magnification of certain anatomical structures.

Landmarks placed in anatomically formed edges are easier to identify, while some landmarks placed on curves are more prone to error. The gonion and lower incisor apex are the least consistent landmarks [21]. Furthermore, landmarks such as point A have a higher variance than others like point B because of wider variation and anatomical localisation of point A [23]. Dental landmarks tend to have poorer validity than skeletal landmarks. Also, when landmarks are located on a curve like point A, point B or pogonion, the error is larger [25]. The evidence shows that landmark identification is a great source of error in 2D lateral cephalometry [24]. Major errors in angles with dental landmarks may occur [25]. In addition, different levels of knowledge and experiences between the observers also lead to varying results on landmark identification. In a study using 18 observers, in which 13 were dental students and 5 were post-graduate students, post-graduate's revealed lower intra-observer tracing variance than dental students [27]. Patient positioning during the procedure is also very important to avoid errors on measurements and landmark identification [23, 26]. The publication of Ahlqvist et al. [26] was assessed with a low level of evidence because there was only one observer. A similar classification occurred for Bourriau et al. [5], intra-observer agreement could not be evaluated and the number of radiographs (n = 4) used was very low. Kvam and Krogstad's [27] publication also used a limited number of subjects (n = 3). The choice of the observers also plays an important role on the results. Eighteen observers, in which 13 were dental and 5 were post-graduate students, participated in their study [27]. The latter can also bias results because of the distinct level of education and expertise due to the lack of experience of the observers.

Regarding the influence of magnification, Bourriau et al. [5] could not identify significant differences between equipment with a 4-m distant cephalometric machine and a 1.5-m distant cephalometric arm. Despite that, it should be considered that distance varying between the X-ray source and the image receptor will always cause a degree of magnification, the larger the distance, the lower the magnification. A focus object distance of 4 m in 2D cephalometric equipment is usually favoured for the reduced radiation burden and lack of enlargement, while an equipment with 1.5-m arm has a direct advantage of being compact and integrated in a multimodal system as well as having an increased resolution. On the other hand, panoramic equipment with a cephalometric arm at a 1.5-m distance may present shortcomings in enlargement factors and superimposition of the bilateral structures more distant from the midsagittal plane, considering the less magnified structures on the side nearby the image receptor [30]. We were not able to identify studies correlating landmark identification errors in lateral cephalograms and their influence on the outcome of patient treatment.

Finally, in 1982, De Abreu showed that different 2D cephalometric analysis may lead to different diagnosis of the same patient, varying the diagnosis between class II and class III in 8 out of 129 cases [28]. Also, Abdullah et al. [29] found that Steiner's cephalometric analysis is not accurate enough to plan orthodontic treatment. Both publications were assessed with low levels of evidence. In both publications, the number of observers was not referred. Furthermore, the statistical method used was not mentioned in [28].

The accuracy in the evaluation of the results, as well as producing changes in the treatment compared with clinical evaluation, seems to be one of the major benefits of 2D cephalometry. Risk-benefit analysis should be carefully evaluated.

Conclusion

The existing literature suggested that lateral cephalometric radiographs have been used without adequate scientific evidence of its usefulness and are often used prior to treatment. There is a need for diagnostic accuracy studies on 2D lateral cephalometric radiograph where standardised methodological criteria for diagnostic thinking efficacy and therapeutic efficacy are incorporated. This systematic review has shown that the evidence to agree or disagree on the usefulness of this radiographic technique in orthodontics today is limited. Lateral cephalograms are used in many occasions for reasons other than clinical diagnosis or treatment, such as medico-legal reasons in a teaching environment or due to a lack of experience in the field. These conclusions are rather worrying. The use of radiation in children should be even better justified, and scientific evidence of that justification seems lacking. At present, there is a need for further studies on larger patient populations, focusing on the therapeutic efficacy of lateral cephalograms.