The initial reported association of methamphetamine (MA) abuse with advanced dental disease [1] engendered a slew of case reports describing the severe dental consequences encountered in MA users [2–8]. Popularly known by the moniker “meth mouth,” the examples of extreme dental decay and tooth loss were disseminated widely by the news media without being linked to rigorous epidemiological studies that validated the reports. As a first step towards scientifically verifying the accumulating anecdotal evidence, we previously used the infrastructure of a large multisite clinical study (the Methamphetamine Treatment Project, or MTP) to systematically examine the health consequences of chronic MA use in a representative sample of users [9]. Comprehensive medical and brief dental assessments carried out by participating physician examiners were used to define the nature and rates of medical and dental disease in a subset of 301 MA users. The dental findings were compared to the dental status of a sociodemographically similar group of non-MA-using participants enrolled in the National Health and Nutrition Examination Survey (NHANES III). One of our main findings was that overt dental disease was a key distinguishing comorbidity in chronic MA users [9]. The MA users from the MTP cohort had significantly higher rates of missing teeth and dental disease than demographically comparable controls and reported long-term unmet oral health needs.

Building on our preliminary findings, we conducted a follow-up, cross-sectional study comprising a new cohort of MA users from the community and involving detailed dental examinations by trained dentists following protocols aimed at achieving well-calibrated assessments. Primary goals were to characterize the occurrence and nature of dental caries and periodontal disease in individuals with a range of MA-use behaviors and to investigate whether the specificity of the caries patterns could be used to distinguish MA users from non-users. Because the oral health assessments involved different raters at separate collection sites, ensuring the comparability and uniformity of the data being collected was a critical prerequisite. Consider, for example, the consequences of inter-rater variability for inferences based on periodontal assessments. Variations in pocket depths tend to be relatively small (measured in millimeters), so even slightly imprecise assessments resulting from procedural errors by the dental examiners could generate substantial bias and adversely affect the validity of the corresponding inferences. Thus, an effective and robust Quality Assurance (QA) program is essential for standardized data collection across sites and for monitoring and ensuring the precision and accuracy of the data collected. This paper provides an overview of the QA program used for our cross-sectional study of the dental consequences of MA use, with particular attention to quality-control procedures. A main objective of the paper is to assess the performance of the QA program by examining the inter-rater reliability, or agreement, between the field dental examiners and the reference examiners for individual data elements.


Study setting

The study was conducted in Los Angeles County, one of the largest and most populous urban areas in the USA and one beset with high rates of MA use [10, 11]. Between February 9, 2011 and August 26, 2013, 574 MA users recruited from local communities underwent comprehensive oral examinations and psychosocial assessments at dental clinics associated with two large community health centers: a) the AIDS Project, Los Angeles (APLA) center, which primarily serves a sociodemographically diverse group of individuals with HIV/AIDS, and b) the Mission Community Hospital (Mission) in the San Fernando Valley, which caters to a large, underserved migrant population. The study sites were chosen to provide access to a diverse cohort of Los Angeles residents with a broad range of MA-use behaviors. For the main study, we used a case-control study design that compared the MA users to a control group of propensity-score matched samples from the National Health and Nutrition Examination Survey (NHANES) [12, 13], where propensity scores were used to identify a demographically comparable subset of individuals from the population-based NHANES. Study findings presented in this paper are based on a selected sub-sample of participants who underwent replicate examinations for data quality assurance purposes. The study design and data collection protocol were reviewed and approved by the UCLA Institutional Review Board. Trained dental examiners conducted the dental examinations, with the data recorded by dental assistants. In conjunction with the dental exam, an experienced bilingual interviewer conducted comprehensive assessments of psychosocial, behavioral, and substance-use characteristics of study participants.

Training and quality assurance

The study was launched with a 2-day project orientation for all persons involved in the study, including the Principal Investigator, co-investigators, project manager, reference examiners, dental examiners, dental assistants, research assistant, and the statistical and data management team. At this initial meeting, the lead investigators reviewed the study’s scientific background, objectives, inclusion and exclusion criteria, evaluation criteria, key outcome variables and study enrollment targets. The practical aspects of the study were discussed including recruitment and screening strategies, management of any adverse events, technical considerations relating to data collection and recording, as well as responsibilities with regard to protocol adherence, quality assurance and logistical support.

Training and Quality Assurance (QA) were conducted under the leadership of BD (lead reference examiner), who is the national trainer and reference examiner for NHANES. He was assisted by a locally based dental epidemiologist (local reference examiner) who provided ongoing monitoring of the dental examiners, evaluating their assessments and providing remediation when necessary. A dentist from each participating health center was selected and trained to conduct a standardized dental examination. Both dentists provided direct patient care at the participating sites but had little prior experience in clinical research. On the first day of the orientation meeting, the lead reference examiner used lecture and slide presentations to familiarize the study dental examiners with customary NHANES study protocols and assessment criteria, followed by an explanation of the examination technique and equipment use. To initiate the standardization process, demonstration examinations conducted by the lead reference examiner were followed by practice examinations conducted by the trainee dental examiners in volunteer subjects. During the standardization process, the trainee examiners were encouraged to ask questions regarding assessment criteria while conducting the study protocols. In an effort to minimize differences in examination findings, data from each standardization round were reviewed for inconsistencies, and the reference examiner led a discussion of the findings with the trainee examiners. In a subsequent calibration phase, the reference examiner and dental examiners repeated dental and periodontal assessments in the same set of 12 volunteers.

Once the field study was initiated, the local reference examiner visited each site monthly to observe data collection and to randomly replicate the dental examination in 2–3 subjects. Data from these replicate exams were used to produce inter-rater reliability statistics to evaluate examiner performance and to provide feedback. If an examiner’s performance fell below an acceptable level, retraining was conducted on site. About 1 year after study launch, the lead reference examiner returned to observe field operations and, along with the local reference examiner, conducted another round of calibration exercises with the dental examiners. The calibration visits took place at each clinical site during normal field operations and involved 12 subjects enrolled in the study. The monitoring visits helped determine whether the dental examiners were conducting the oral health examinations within the parameters of the NHANES study protocols and if the standards for examination between the dental examiners and reference examiner had been maintained.

The project manager oversaw the data collection activities at each field site, providing ongoing logistical support and supervision and ensuring that each site followed all appropriate procedures for subject recruitment and consenting. Dental examiners were clinic-based for this study. At the APLA site, where the bulk of the assessments took place, the study examiner remained for the duration of the study. The initial examiner at the Mission site departed towards the midpoint of the data collection phase, and a replacement examiner was trained and calibrated to collect data. Overall, 51 enrolled subjects, representing 9 % of the study sample, received replicate exams.

Variables and their measurement

The main oral-health outcome variables were the rates and patterns of dental caries and the periodontal disease status of the subjects. Assessments for dental caries and periodontal status adhered to NHANES examination protocols, which have been described in greater detail elsewhere [14–16]. In brief, NHANES diagnostic criteria were used to assess dental caries at the surface level, and a surface was recorded as carious only when cavitation was manifest. Examinations were conducted under artificial light with the study participant in the supine position using a standard explorer and dental mirror without additional magnification. Radiographs were not taken. A Hu-Friedy periodontal probe color-coded and graduated at 2–4–6–8–10–12 mm was used to assess periodontal status. Dental caries experience was calculated as the number of decayed (D), missing (M), and filled (F) teeth (T), and the number of untreated dental caries was calculated as the number of decayed surfaces (DS). Periodontal disease status was assessed using the case definitions recommended for periodontitis surveillance by the CDC Periodontitis Workgroup (CDC/AAP) [17]. The CDC/AAP case definitions require information from two interproximal sites (DF, MF, ML, and/or DL) and are not dependent upon the presence of an adjacent tooth. Gingival recession and pocket depth measures were made at four sites per tooth (the disto-facial (D), mid-facial (B), mesio-facial (M), and disto-lingual (DL) sites). An algorithm calculated loss of attachment from the information on gingival recession and pocket depth. All four quadrants were examined, and third molars were excluded.
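The algorithm used to derive loss of attachment is not reproduced here. As a minimal sketch only, the conventional clinical relationship (attachment loss equals pocket depth plus gingival recession) can be written as follows. The function names and the sign convention for recession are assumptions made for illustration and are not taken from the study protocol.

```python
from typing import List, Optional


def attachment_loss(recession_mm: Optional[int],
                    pocket_depth_mm: Optional[int]) -> Optional[int]:
    """Loss of attachment at one periodontal site, in whole millimeters.

    Assumed sign convention (hypothetical, for illustration): recession_mm
    is positive when the gingival margin lies apical to the cemento-enamel
    junction and negative when it lies coronal to it.
    """
    # A site with either measurement missing cannot be scored.
    if recession_mm is None or pocket_depth_mm is None:
        return None
    return pocket_depth_mm + recession_mm


def mean_attachment_loss(sites: List[Optional[int]]) -> Optional[float]:
    """Mean attachment loss per participant over all scorable sites."""
    scored = [s for s in sites if s is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)
```

Under this convention a site with 3 mm pocket depth and 2 mm recession has 5 mm of attachment loss, while unscorable sites are dropped from the per-participant mean rather than counted as zero.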

Data collection and management

To ensure standardization and quality assurance in data collection and processing, all dental and psychosocial data were captured directly on a laptop computer using a web-based data-management system developed and maintained by the UCLA-Semel Institute Statistics Core (SIStat). Data collected through the user-friendly graphical interface on the laptop was encrypted and transmitted to be stored centrally in a secure server with firewall protection. Built-in logic and data-range checks allowed data verification to prevent invalid data. Automated reports and dashboards allowed the investigators and project manager to monitor the quality of the data collected at each clinical site by generating a variety of summary reports on data completeness and questionable values. The real-time input verification facilitated the timely identification and resolution of any problems in data collection and processing.

Statistical methods

SAS software (Version 9.3; SAS Institute Inc., Cary, NC, USA) was used for statistical analysis and data handling. Demographic information was tabulated for the full sample (n = 574) as well as within each of the replicate samples at APLA (n = 33) and Mission (n = 18). Participants who indicated that they had used methamphetamine for less than 10 of the last 30 days at the time of screening were classified as being “light” methamphetamine users, while the study participants who used MA for more than 10 of the last 30 days were classified as “moderate +” users. Education level was broken into three categories: less than high school, high school completion, and more than high school. High school completion was indicated by high school graduation or by obtaining a GED.

The reliability analysis was conducted using a set of both continuously-scaled and dichotomous outcome measurements. Continuously-scaled measurements from the caries examination included DMFT (the total number of decayed, missing, or filled teeth in a participant’s mouth), DFT (number of decayed or filled teeth), DS (number of decayed surfaces), and tooth retention (number of teeth present in the mouth). Continuously-scaled outcomes from the periodontal examination included mean gum recession, pocket depth, and calculated attachment loss per participant. Average attachment loss and pocket depth were additionally stratified by the four periodontal sites (mid-facial, mesial-facial, disto-facial, disto-lingual), recognizing that different site types could have differing variability. Means and standard errors were produced for each rater on these outcomes, as well as the bias, defined as the difference between the examination rater measurement and the reference measurement. Intra-class correlation (ICC) coefficients were estimated as the primary reliability statistic for the continuously-scaled measurements. The intra-class correlation was defined as the ratio of the variance between subjects to the total variance (between and within subjects). ICCs closer to 1 (100 %) indicate greater inter-rater reliability.
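To make the ICC definition concrete, the following is a minimal sketch of a one-way random-effects ICC computed from analysis-of-variance mean squares, together with the mean bias between the field and reference raters. The choice of the one-way model, the function names, and the input layout are assumptions for illustration; the text does not state which ICC form or SAS procedure was used.

```python
from statistics import mean


def one_way_icc(ratings):
    """One-way random-effects ICC.

    `ratings` is a list with one tuple per subject, each holding the k
    measurements of that subject (here k = 2: field and reference rater).
    The estimate is (MSB - MSW) / (MSB + (k - 1) * MSW), i.e. the share of
    total variance attributable to differences between subjects.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = mean(v for subject in ratings for v in subject)
    subject_means = [mean(subject) for subject in ratings]
    # Between-subject mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject (between-rater) mean square.
    msw = sum((v - m) ** 2
              for subject, m in zip(ratings, subject_means)
              for v in subject) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def mean_bias(ratings):
    """Mean of (field measurement - reference measurement) over subjects."""
    return mean(field - reference for field, reference in ratings)


# Hypothetical DMFT scores as (field examiner, reference examiner) pairs.
example = [(10, 11), (20, 19), (30, 30)]
icc = one_way_icc(example)
bias = mean_bias(example)
```

Reporting the bias alongside the ICC matters because two raters can rank subjects almost identically, giving a high ICC under some model choices, while one rater still measures systematically higher than the other.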

Discrete outcomes from the caries examination included caries experience (defined above), having at least one surface of untreated decay, having at least one restoration or surface restoration, tooth retention (all teeth present), and having at least five anterior surfaces with untreated decay. From the periodontal examination, the dichotomous outcomes included having at least one site with attachment loss greater than or equal to a given threshold (taken to be 3, 4, or 6 mm); having at least one site with pocket depth greater than or equal to a given threshold (4, 5, or 7 mm); and indicators of periodontitis, moderate periodontitis, and severe periodontitis. Percent agreement between the examination and reference raters was computed as one metric of reliability. Cohen’s kappa statistic and a corresponding asymptotic standard error (SE) were also produced. Examiner strength of agreement using the Kappa coefficient was not evaluated by statistical hypothesis testing and instead was evaluated by applying commonly used guidelines [18].
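As an illustration of these two agreement metrics, the sketch below computes percent agreement and Cohen's kappa for a dichotomous outcome rated by a field examiner and a reference examiner. The function name and input layout are hypothetical, and the standard error shown is a simplified large-sample approximation that may differ from the asymptotic SE reported by SAS.

```python
from math import sqrt


def agreement_stats(field, reference):
    """Return (percent agreement, kappa, approximate SE) for binary ratings.

    `field` and `reference` are equal-length sequences of 0/1 ratings, one
    entry per subject. Kappa and its SE are returned as None when either
    rater places every subject in the same category, since the statistic
    is not estimable in that case.
    """
    n = len(field)
    agree = sum(1 for f, r in zip(field, reference) if bool(f) == bool(r))
    field_pos = sum(1 for f in field if f)
    ref_pos = sum(1 for r in reference if r)
    p_obs = agree / n
    if field_pos in (0, n) or ref_pos in (0, n):
        return p_obs, None, None
    p_field = field_pos / n
    p_ref = ref_pos / n
    # Agreement expected by chance, given each rater's marginal rates.
    p_exp = p_field * p_ref + (1 - p_field) * (1 - p_ref)
    kappa = (p_obs - p_exp) / (1 - p_exp)
    # Simplified approximation; an assumption, not the study's exact formula.
    se = sqrt(p_obs * (1 - p_obs) / (n * (1 - p_exp) ** 2))
    return p_obs, kappa, se


# Hypothetical ratings of "at least one surface of untreated decay".
field_ratings = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
reference_ratings = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
p_obs, kappa, se = agreement_stats(field_ratings, reference_ratings)
```

Returning None rather than a number when a rater never varies keeps percent agreement available for such outcomes without implying that kappa was estimated.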


Table 1 shows the number of persons examined in the main study and those receiving a repeat examination by the reference examiner by select characteristics. Study participants were generally older, male, Hispanic or non-Hispanic black, had lower educational attainment, were smokers, or were moderate/heavy methamphetamine users. Overall, the distribution of demographic and behavioral characteristics for individuals participating in the replicate exams was similar to that observed in the main study.

Table 1 Sociodemographic characteristics of methamphetamine users

The inter-rater reliability statistics by clinic examination site (A & B) for categorical evaluations of dental caries and periodontal status are shown in Table 2. The Kappa statistics for untreated dental caries ranged from 0.57 to 0.75, with percent agreement ranging from 83 to 88 %. For identifying untreated caries on at least 5 surfaces of anterior teeth, Kappa scores were 0.77 and 0.87, and percent agreement was 94 and 97 %. Examiners at both clinical sites performed equally when identifying caries experience (Kappa 1.00 and percent agreement 100 %). Kappa scores for various thresholds of pocket depth ranged from 0.20 to 0.77 and for attachment loss ranged from 0.29 to 0.52 (among Kappa scores that could be computed, as Kappa statistics could not be estimated when the probability of classification into a given category for one or both raters is 0 or 100 %). There were three instances where the attachment loss thresholds were identified in 100 % of the subjects by one or both raters, and thus the Kappa statistics could not be computed. The percent agreements for these three instances ranged between 72.2 and 100 %. When Kappa was calculated based on the CDC/AAP case definitions for moderate and severe periodontitis, inter-examiner reliability was higher at site B compared to site A. For severe periodontitis, the Kappa was 0.27 for site A, whereas at site B the reliability statistic was calculated at 0.67.

Table 2 Inter-rater reliability statistics for selected dental health conditions

Table 3 shows the mean values and intra-class correlation coefficients when continuous measures of periodontal status and dental caries were used. For overall attachment loss (AL) and pocket depth (PD) across all four periodontal sites, the intra-class correlation coefficients ranged from 0.87 to 0.89 and from 0.79 to 0.81, respectively. For measures of overall gingival recession (CJ mean), the ICCs ranged from 0.88 to 0.91. For attachment loss, ICCs for facial (B) measures ranged from 0.54 to 0.82 and for the disto-lingual (DL) measures from 0.64 to 0.87. For pocket depth, the ICC values were lower than for attachment loss at the B site. The ICCs for interproximal facial and disto-lingual pocket depth measures ranged from 0.43 to 0.60 and from 0.75 to 0.85, respectively. For caries experience and tooth retention, the dental examiners at both sites A and B demonstrated nearly ideal correlation, with ICCs ranging from 0.96 to 0.99.

Table 3 Dental examiner intra-class correlation coefficients (ICCs) for selected periodontal measures


Overall, quality assurance findings from this study investigating the distribution of dental caries and periodontal disease in MA-using individuals indicate that substantial agreement existed between the reference examiner and the site examiners for dental caries. However, examiner concordance was lower for specific periodontal assessments. In this study, we focus on two key measures of examiner reliability: intra-class correlation coefficients (ICCs) for continuously-scaled data and Kappa statistics for categorical data. Kappa statistics incorporate a correction for the agreement that would be expected by chance alone. Consequently, their values are typically lower than the corresponding percent agreement. Even though a number of different standards can be used to ascertain the strength of the agreement between examiners, we have relied on a widely used guideline proposed by Landis and Koch for interpreting kappa scores [18]. In summary, a kappa statistic ≤ 0 reflects “poor agreement”, > 0 but ≤ 0.20 is “slight agreement”, 0.21–0.40 is “fair agreement”, 0.41–0.60 is “moderate agreement”, 0.61–0.80 is “substantial agreement”, and > 0.80 is “almost perfect agreement”.
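The Landis and Koch bands listed above can be expressed as a small lookup, sketched here for illustration. The function name is hypothetical, and values falling between the published two-decimal bands (for example 0.205) are assigned to the higher band, which is an assumption about how to treat unrounded statistics.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa statistic to the Landis and Koch agreement category."""
    if kappa <= 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values reported in this paper, labeled with the guideline.
labels = {k: landis_koch_label(k) for k in (0.27, 0.57, 0.67, 0.77, 0.87, 1.00)}
```

Applied to the statistics reported in this paper, a kappa of 0.57 is labeled moderate, 0.27 fair, and 0.87 almost perfect.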

In this study, examiner percent agreement for selected dentition and dental caries assessments was 83 % or higher and Kappa statistics were 0.77 and higher, indicating substantial to almost perfect agreement between the site examiners and the reference examiner. Nonetheless, agreement among examiners for untreated caries at Site B was moderate (0.57). Examiner reliability performance for most of the periodontal assessments ranged from fair to substantial agreement. Overall, examiners from Site A were more in concordance with the reference examiner when assessing dental caries, and examiners from Site B were more in concordance with the reference examiner when assessing periodontal disease. Prevalence and bias can influence the calculation of the Kappa coefficient. The greater the variance in case prevalence across the sites, the more likely it is that the magnitude of the Kappa statistic will be affected. In our study, participants at Site A were more likely to have more advanced periodontal disease, and participants at Site B were less likely to have untreated dental caries.
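The influence of prevalence on kappa can be seen with a small worked example. The sketch below uses two hypothetical 2×2 tables (the counts are invented for illustration and are not study data) that have identical percent agreement but different prevalence of the condition.

```python
def kappa_from_table(a, b, c, d):
    """Percent agreement and Cohen's kappa from a 2x2 agreement table.

    a = both raters positive, b = field positive / reference negative,
    c = field negative / reference positive, d = both raters negative.
    """
    n = a + b + c + d
    p_obs = (a + d) / n
    # Chance agreement from the two raters' marginal totals.
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Hypothetical counts: same 90 % raw agreement, different prevalence.
balanced = kappa_from_table(45, 5, 5, 45)  # condition present in about half
skewed = kappa_from_table(85, 5, 5, 5)     # condition present in about 90 %
```

With balanced prevalence the kappa is 0.80, whereas with the condition present in about 90 % of subjects it falls to roughly 0.44 even though the raters disagree on the same number of subjects. This is why percent agreement and kappa are best read together when prevalence differs between sites.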

Another set of reliability statistics we have presented in this report is the ICCs. Although it has been suggested that a threshold of 0.75 or greater would represent excellent reliability [19], examiner bias (the mean difference between reference examiner and survey examiners) is also an important consideration. Using the conventional measures for caries experience and tooth retention, the examiner reliability for our study should be considered excellent. For examiners at both examination sites, the ICC statistics for DMFT, DFT, DS, and the number of retained teeth were ≥ 0.96. For the periodontal data, the mean attachment loss ICC statistics, as measured for all teeth and eligible sites, were 0.87 and 0.89 for the two examiners, indicating excellent overall reliability for AL. The ICCs calculated for pocket depth ranged from 0.79 to 0.81. Reliability for the overall recession measure was also excellent, with ICCs ranging from 0.88 to 0.91. When evaluating the individual sites, examiner reliability was lower for examiners at Site B compared to Site A at the buccal sites for attachment loss (0.54 vs. 0.82) and pocket depth (0.43 vs. 0.60).

It is unclear why examiner reliability was lower for the mid-facial sites given that interproximal sites can be more challenging to measure accurately. It may be that examiners slightly altered the placement of the probe, allowing it to slip into a posterior furcation at an angle greater than required, which distorted the actual pocket depth measure. Interestingly, examiner reliability findings from NHANES 2009–10 have indicated a similar issue, with greater variability in examiner reliability at some mid-tooth sites [16]. The current study used the Hu-Friedy PCP-12 periodontal probe, which is the same probe used in the current NHANES. The periodontal probes are marked in 2 mm increments, and examiners are trained to round down to the nearest whole millimeter. Common factors that often contribute to measurement bias include inconsistent angulation, probe pressure, and measurement rounding. Overall, reliability findings from our study indicate that periodontal measurement error was more likely to occur with pocket depth measurements and less likely with recession (and attachment loss) measurements.

Because our oral health data collection methodology used NHANES protocols, we can compare our examiner reliability findings to those published from NHANES. For this current study, the calculated examiner agreement for tooth retention and untreated dental caries was consistent with what has been observed in NHANES [14, 16]. In NHANES 2003–04, examiner Kappa statistics ranged from 0.65 to 0.73 for untreated caries (0.57–0.81 in our study) and from 0.94 to 1.00 for tooth retention (1.00 in our study). For continuously-scaled measures of periodontal disease, the overall ICCs for pocket depth were 0.61 and 0.86 for NHANES 2003–04 and 0.61 and 0.72 for NHANES 2009–10. In the current study, the calculated ICCs for pocket depth were 0.79 and 0.81. Mean attachment loss ICCs ranged from 0.86 to 0.93 in NHANES 2003–04 and from 0.82 to 0.87 in NHANES 2009–10, whereas ICCs for attachment loss in our study were 0.87 and 0.89. Given the similarity in examiner reliability statistics for general dental caries and periodontal assessments, examiner performance observed in the current study could be considered comparable to that reported in previous NHANES oral-health studies for selected oral health measures.

The oral health assessments conducted in our current study utilized standard assessments for dental caries and periodontal status that have been used in NHANES. Incorporating these assessments and employing similar quality assurance controls should facilitate comparison of our findings to those for the US national population. Although greater reliability is certainly desirable, less-than-perfect reliability does not invalidate subsequent investigations. Rather, it has implications for the precision of the corresponding evaluations, so analyses of measurements in the lower reliability categories should be interpreted with the understanding that reliability of measurement is a concern. By highlighting areas where reliability of measurement was not high, we hope to call attention to areas where additional attention to training and calibration protocols could be expected to have the greatest impact.


Overall, the quality assurance program confirmed procedural adherence and the quality of the data collected on the distribution of dental caries and periodontal disease in MA users. Examiner concordance was higher for dental caries but lower for specific periodontal assessments.