Background

Growth measurement during the first years of life is essential for optimizing child health [1, 2]. Primary care practitioners use these measurements to guide parents and caregivers when children fall outside healthy growth parameters. Weight, height and length measurements are used as growth indicators to calculate body mass index-for-age, which has been recommended as the most inexpensive, efficient and precise measure in primary care practice to determine if a child has overweight or obesity [3]. However, imprecise measurements may lead to misclassification of weight status, unnecessary interventions and referrals, and patient and parental concern [4]. The World Health Organization (WHO) and Dietitians of Canada recommend that measurement techniques be standardized: length should be measured in the recumbent position for children less than 2 years of age, and standing height should be measured for children older than 2 years of age. Further, equipment should be calibrated, and those responsible for measurement should be trained for accuracy and reliability [1]. These procedures for growth monitoring have previously been used in research studies such as the Canadian Health Measures Survey [5] in Canada and the National Health and Nutrition Examination Survey (NHANES) in the United States [6].

All growth measurements are subject to error, which can be random or systematic and may arise from human error or equipment error [7]. Reliability is the extent “to which within-subject variability is due to factors other than measurement error variance or physiological variation” [8,9,10]. The lower the variability between repeated measurements of the same subject, the greater the precision [10]. Two forms of reliability were evaluated in this study: intra-observer and inter-observer reliability. Intra-observer reliability is the ability of one observer (the term observer is used here for the person measuring the subject) to repeat measurements on the same child with little or no variability. Inter-observer reliability is the ability of two independent observers to measure the same child with little or no variability. Determining both intra- and inter-observer reliability is important in evaluating the accuracy of measurement data. The accuracy of routinely collected anthropometric data relative to measurements taken by trained personnel is currently unknown. Imprecise measurements can weaken observed associations between exposures and health outcomes, both clinically and in research. The purpose of this study was to assess the reliability of height, length and weight measurements collected during well-child visits in primary care practices. A secondary objective was to determine whether there were systematic differences in intra-observer reliability between primary care team members and research assistants.

Methods

Study setting and population

Parents or guardians of healthy children 0 to 18 years attending a scheduled well-child visit were invited to participate and informed consent was obtained. Children were recruited at two pediatric practices and one family medicine primary care practice participating in the TARGet Kids! [11] research network in Toronto, Canada (www.targetkids.ca). Based on TARGet Kids! exclusion criteria at enrollment, children were ineligible to participate if they were diagnosed with associated health conditions affecting growth (such as failure to thrive or cystic fibrosis) or if their parent or guardian was not fluent in English. Primary care team members, including nurses and clinic staff at the 3 practices volunteered to perform the routine care measurements. This study was approved by the Research Ethics Boards of the Hospital for Sick Children and St. Michael’s Hospital.

Sampling and sample size calculation

A convenience sample was used to recruit participants until the required sample size was achieved in each age group: 0 to < 2 years, 2 to < 5 years, and 5 to 18 years. Children younger than 2 years and children 2 to 5 years of age were oversampled because length and height measurements have been shown to be particularly variable in these age groups [12, 13]. The sample size per age category was determined based on previous work by Walter and colleagues using 4 replicates (measurements per subject) [14]. In the age group 0 to 2 years, with α = 0.05 and β = 0.2 (corresponding to 80% power) to rule out the possibility of a reliability < 0.7 (the minimally acceptable value), and to achieve an expected reliability coefficient (R) of at least 0.8, the number of subjects required was 68. In the age group 2 to 5 years, to achieve an expected reliability coefficient (R) of 0.85, the number of subjects required was 26. In the age group 5 to 18 years, to achieve an expected reliability coefficient (R) of at least 0.9, the number of subjects required was 12.
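The sample size calculation can be illustrated with a short sketch. It assumes the closed-form approximation commonly attributed to Walter and colleagues, k = 1 + 2(z_α + z_β)² n / ((ln C₀)² (n − 1)) with C₀ = (1 + nθ₀)/(1 + nθ₁) and θ = ρ/(1 − ρ), and a one-sided α; the function name and argument names are hypothetical and are not taken from the study. Under these assumptions the result is rounded up to the next whole subject.

```python
import math
from statistics import NormalDist


def walter_sample_size(rho0, rho1, n_replicates, alpha=0.05, beta=0.2):
    """Approximate number of subjects needed to rule out a reliability of rho0.

    rho0: minimally acceptable reliability (null value)
    rho1: expected reliability (alternative value)
    n_replicates: number of measurements per subject
    A one-sided alpha is assumed in this sketch.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    theta0 = rho0 / (1 - rho0)
    theta1 = rho1 / (1 - rho1)
    c0 = (1 + n_replicates * theta0) / (1 + n_replicates * theta1)
    k = 1 + (2 * (z_alpha + z_beta) ** 2 * n_replicates) / (
        math.log(c0) ** 2 * (n_replicates - 1)
    )
    return math.ceil(k)  # round up to a whole number of subjects


# Expected reliabilities for the three age groups, each tested against 0.7
for expected in (0.80, 0.85, 0.90):
    print(expected, walter_sample_size(0.70, expected, n_replicates=4))
```

With 4 replicates and a minimally acceptable reliability of 0.7, this sketch yields 68, 26 and 12 subjects for expected reliabilities of 0.8, 0.85 and 0.9, which agrees with the numbers reported above.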

Data collection and measurement

Research measurements were performed by a research assistant; routine measurements were performed by a primary care team member, both on the same equipment. Research assistants were trained using the WHO Training Course Growth Assessment modules [15]. These growth monitoring guidelines were adopted by multiple professional health agencies such as the Canadian Pediatric Society, and Community Health Nurses of Canada and are the guidelines primary care team members would have received in their professional training [16]. Primary care team members did not receive any further training other than what they received during their professional degrees or at the practice where they worked. The standard procedure and equipment for measurement of length in children < 2 years was in the recumbent position using a length board (SECA model#2101821009 length board), and weight of infants was measured without clothing or diapers on a digital baby scale (Healthometer 553 KL pediatric scale). Measurement of height and weight in children ≥2 years was performed in light clothes without shoes on a digital scale with standing height attachment (Healthometer 500 KL adult scale). In total, each participating child had their age and sex recorded, and 4 sets of anthropometric measurements (4 weights and 4 lengths/heights), with each observer performing the measurements twice. The first observer measured weight and length/height, and recorded the measurements on a standardized data collection form. The child was then measured by the second observer and measurements recorded on a separate form. The process was repeated to obtain a second reading from both the first and second observer. In order to limit each observer’s recall of their previous measurements, the observers alternated and recorded measurements on separate forms. 
The order of the observers was randomized, so that either the research assistant or the primary care team member could be the first observer. Each observer was blinded to the other observer’s measurements.

Statistical analysis

Descriptive statistics were used to describe the children included in this study. Intra-observer reliability was assessed by comparing each observer’s first measurement to their own second measurement. Inter-observer reliability was assessed by comparing the first observer’s first measurement to the second observer’s first measurement, and likewise the two observers’ second measurements, thereby using all four measurements from the two independent observers. Measures of central tendency (mean, median, and mode) were calculated for the absolute difference between measurements by age group. Bland-Altman plots were used to describe the differences between and within observers graphically. Statistics for infants 0 to < 2 years were calculated separately because these children were measured using different equipment. Intra- and inter-observer reliability statistics were calculated for children 0 to 2 years, 2 to 5 years and > 5 to 18 years. Intra-observer reliability statistics for each observer and Bland-Altman plots were calculated for children 2 to 18 years as a single group to maximize sample size. The technical error of measurement (TEM), the relative TEM (%TEM), and the coefficient of reliability (R) were the statistics used to assess intra- and inter-observer reliability. The TEM was defined as the standard deviation of differences between repeated measures, expressed in the unit of the measurement (e.g. the TEM for height measured in centimeters is in cm), using the following equation:

$$ TEM=\sqrt{\sum {D}^2/2N} $$

Where D is the difference between repeated measures and N is the number of individuals measured. Lower TEM values indicate greater precision of the observer performing the measurement [10]. The relative TEM (%TEM) was calculated as (TEM/mean) × 100.
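As an illustration of these two definitions, the following minimal sketch computes the TEM and %TEM from pairs of repeated measurements. The function names and the example values are hypothetical, and the sketch assumes that the mean in the %TEM is the mean of all measurements in the pairs.

```python
import math


def tem(pairs):
    """Technical error of measurement: sqrt(sum(D^2) / (2N)) over N pairs."""
    n = len(pairs)
    sum_sq_diff = sum((first - second) ** 2 for first, second in pairs)
    return math.sqrt(sum_sq_diff / (2 * n))


def relative_tem(pairs):
    """%TEM: TEM divided by the mean of all measurements, times 100."""
    values = [value for pair in pairs for value in pair]
    overall_mean = sum(values) / len(values)
    return tem(pairs) / overall_mean * 100


# Hypothetical repeated length measurements (cm) by one observer on four infants
lengths_cm = [(60.0, 60.4), (72.1, 71.9), (55.3, 55.3), (80.0, 80.6)]
print(round(tem(lengths_cm), 3), "cm")
print(round(relative_tem(lengths_cm), 2), "%")
```

For intra-observer reliability each pair would hold one observer's first and second measurement of the same child; for inter-observer reliability each pair would hold the two observers' measurements of the same child.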

R, the coefficient of reliability, is the estimated proportion of inter-subject variance that is not due to measurement error, defined by the equation:

$$ R=1-\left({TEM}^2/{SD}^2\right) $$

SD² is the total inter-subject variance for the study population. R ranges from 0 to 1, with 0 indicating that all between-subject variation is due to measurement error and 1 indicating that no measurement error is present. Higher R values are indicative of greater precision, with values above 0.95 considered acceptable, 0.8 considered sufficient and values lower than 0.7 considered minimally acceptable measurement error [10]. All three reliability measurements (TEM, %TEM, and R) were calculated to compare with published reliability statistics. Finally, an F-statistic was calculated to test differences between intra-observer reliabilities by squaring each technical error of measurement to create a variance and dividing one by the other: F = TEM²(intra1)/TEM²(intra2), with degrees of freedom = N − 1. A significance level of 0.05 was used to determine statistical significance [17].
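The coefficient of reliability and the F-statistic can be sketched in the same way. This is an illustrative sketch only: the function names and example values are hypothetical, SD² is assumed to be the sample variance of the measurements across subjects, and the p-value is left out (it could be obtained from the survival function of an F distribution, for example `scipy.stats.f.sf`, which is not used here so that the sketch needs only the standard library).

```python
from statistics import variance


def reliability_coefficient(tem_value, measurements):
    """R = 1 - TEM^2 / SD^2, with SD^2 the sample variance across subjects."""
    return 1 - tem_value ** 2 / variance(measurements)


def f_statistic(tem_1, tem_2, n_1, n_2):
    """Ratio of two squared TEMs with degrees of freedom N - 1 for each."""
    return tem_1 ** 2 / tem_2 ** 2, n_1 - 1, n_2 - 1


# Hypothetical lengths (cm) of four infants and a hypothetical TEM of 0.26 cm
lengths_cm = [60.0, 72.1, 55.3, 80.0]
print(round(reliability_coefficient(0.26, lengths_cm), 4))

# Hypothetical comparison of two intra-observer TEMs, 68 children each
f_value, df_1, df_2 = f_statistic(0.46, 0.35, 68, 68)
print(round(f_value, 2), df_1, df_2)
```

Because the between-subject variance in a sample spanning a wide age range is large relative to TEM², R will tend to be close to 1 in such samples even when the TEM itself is not small, which is one reason to report TEM and %TEM alongside R.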

Results

Sample characteristics

In total 125 children were recruited and measured 4 times, contributing 498 weight measurements and 500 length or height measurements. These measurements were performed by 6 trained research assistants (RA) from the TARGet Kids! research network and 16 primary care team members. Additional file 1: Table S1 shows the proportion of measurements each observer contributed to the study. Summary statistics of the subject characteristics are presented in Table 1. The median age was 19 months (IQR 9.0 to 53.0 months); there were 68 children (54.4%) < 2 years, 31 (24.8%) between 2 and 5 years, and 26 (20.8%) > 5 to 18 years. Boys and girls were almost equally represented, 50.4 and 49.6%, respectively. The majority of infants and children had a normal weight (between − 2 and ≤ 1 BMI z-score), 6.5% had a ‘risk of overweight’ (zBMI between 1 and ≤ 2) and 5.7% had ‘overweight’ status (zBMI ≥2). One subject < 2 years became agitated and only completed 1 set of weight measurements; therefore the sample to calculate inter- and intra-observer reliability in this age group was decreased by 2 and 1, respectively.

Table 1 Characteristics of sample population

Inter-observer reliability

Intra- and inter-observer reliability for each measurement type by age group is presented in Table 2. The absolute mean difference ranged from 0.03 to 0.15 kg for weight and from 0.52 to 0.77 cm for length/height. Overall, all %TEM values for weight, length and height were in the acceptable range of < 2% [18] and coefficient of reliability (R) values were all > 99%, representing very good reliability between repeated measurements performed by two independent observers [10]. Inter-observer reliability of length in children < 2 years had the highest TEM (0.73 cm), with a %TEM of 1.03%. Relative TEM for weight increased slightly with child age, from 0.64% in children < 2 years to 0.70% in children > 5 years. In contrast, the %TEM for length/height improved with child age, from 1.03% to 0.36%. Figure 1 shows the inter- and intra-observer differences in weight and length/height by age group (< 2 and ≥ 2 years) using Bland-Altman plots. The majority of differences were within 2 standard deviations of the mean, which is considered an acceptable level of error. The largest differences were in length measurements of children < 2 years, and the smallest differences were in weight measurements of children < 2 years.

Table 2 Intra and inter-observer reliability statistics by measurement type and age group
Fig. 1 Bland-Altman plots on the weight and length/height measurements for intra- and inter-observer reliability by age group. Top row of plots: a) inter-observer reliability for weight 0- < 2 years, b) intra-observer reliability for weight 0- < 2 years, c) inter-observer reliability for weight 2–18 years, d) intra-observer reliability for weight 2–18 years, e) inter-observer reliability for length 0- < 2 years, f) intra-observer reliability for length 0- < 2 years, g) inter-observer reliability for height 2–18 years, h) intra-observer reliability for height 2–18 years
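The quantities displayed in a Bland-Altman plot can be computed as in the minimal sketch below, which assumes limits set at the mean difference ± 2 standard deviations, in line with the 2 standard deviation criterion described above. The function name and the example weights are hypothetical.

```python
from statistics import mean, stdev


def bland_altman_limits(first, second, k=2.0):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the paired differences; the limits are bias -/+ k
    sample standard deviations of those differences.
    """
    diffs = [a - b for a, b in zip(first, second)]
    bias = mean(diffs)
    spread = stdev(diffs)
    return bias, bias - k * spread, bias + k * spread


# Hypothetical weights (kg) of five children measured by two observers
observer_1 = [10.0, 12.5, 8.2, 15.1, 9.9]
observer_2 = [10.1, 12.4, 8.2, 15.3, 9.8]
bias, lower, upper = bland_altman_limits(observer_1, observer_2)

# Count the paired differences that fall inside the limits
inside = sum(lower <= a - b <= upper for a, b in zip(observer_1, observer_2))
print(round(bias, 3), round(lower, 3), round(upper, 3), inside)
```

In the plot itself, each difference would be drawn against the mean of the two measurements for that child, with horizontal lines at the bias and at the two limits.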

Intra-observer reliability

The absolute mean difference ranged from 0.03 to 0.16 kg for weight and from 0.35 to 0.46 cm for length/height. In general, intra-observer measurements were more precise than inter-observer measurements. Relative TEM ranged from 0.61% to 0.70% for weight and from 0.19% to 0.64% for length/height. R values were all > 99%, representing very good reliability of each observer between their own repeated measurements. In further analyses, intra-observer reliabilities were calculated separately for the RA observers and for the primary care practitioners with the highest proportion of measurements, and were tested for statistically significant differences (Tables 3 and 4). TEMs ranged from 0.02 to 0.20 kg for weight, from 0.35 to 0.63 cm for length, and from 0.38 to 0.47 cm for height. Statistically significant differences between RA and primary care practitioners were seen in the weight and length intra-observer reliabilities in children < 2 years in one of the pediatric practices; the RA had lower TEMs for length and the primary care practitioner had a slightly higher TEM for weight. In the family medicine practice, length measurements taken by the RA had a lower TEM than those taken by the primary care nurse. Further data on measurement differences and calculations are presented in Additional file 2: Table S2 and Additional file 3: Table S3.

Table 3 Intra-observer results for weight and length/height measurements by individual observers
Table 4 Differences in intra-observer reliability technical error of measurement (TEM) between TARGet Kids! research assistants and primary care nurses

Discussion

This study reports the intra- and inter-observer reliability of weight, length and height measurements performed in one family medicine and two pediatric practices using multiple reliability statistics, including the technical error of measurement (TEM) and the coefficient of reliability (R). All %TEM values were < 2% and all R coefficients were > 99%, indicating that both intra- and inter-observer reliability were high and within acceptable limits. This supports the acceptability of using routinely collected weight and length/height measurements from these primary care practices.

Our findings are similar to those of previous studies that performed measurement reliability testing for quality assurance purposes involving multiple anthropometrists in large epidemiologic studies. For example, the World Health Organization Multicentre Growth Reference Standards (WHO-MGRS) [19], NHANES [17], Born in Bradford [8], and the Identification and prevention of Dietary- and lifestyle-induced health Effects In Children and Infants (IDEFICS) [9] studies have published reliability data on the observers involved in measurement. In each of these studies, all TEMs and R values for length/height and weight were in an acceptable range and indicative of good quality. In the WHO-MGRS, expert anthropometrists were used as a gold standard and had intra-observer TEMs of 0.29 cm for length and 0.23 cm for height (weight was not included). Although our TEM for length was higher (0.46 cm), our TEMs for height were comparable at 0.27 cm (2 to < 5 years) and 0.24 cm (5+ years). In one review of reliability statistics, mean intra-observer TEMs for length, height and weight were 0.35 cm, 0.38 cm, and 0.17 kg, respectively, which were similar to those in our study [10]. One main difference is that in these studies test-retest measurements were all performed after the anthropometrists had undergone standardized training. In our study, we did not provide the primary care practitioners with any training beyond what they had received in their clinical training, as our objective was to assess reliability in routine primary care.

Length had the highest %TEM for inter-observer reliability, which is consistent with previous studies that have shown increased measurement error in length [13, 20]. In the Bland-Altman plots, differences in length measurements show greater spread. One possible explanation relates to the challenge of positioning very young infants appropriately on the length board. In our study, the %TEMs for height were better for older children, who may be easier to position on the stadiometer. Conversely, the results for weight were better for younger children, although the difference in %TEM between age groups was minimal (< 0.1%). This may be due to the consistent use of digital scales in all the practices. Moreover, very young infants tend to move less on the baby scale than preschool-aged children (2 to 5 years), who may have difficulty standing still or may not be in the exact center of the scale.

There were limitations to this study. Fewer children were recruited from the family medicine practice than from the pediatric practices because of the higher volume of children seen for well-baby care in the latter. As well, the practices that volunteered to participate were already participating in research through TARGet Kids! and therefore may have had more standardized measurement protocols than other primary care practices in Ontario. We were unable to measure reliability at one of the pediatric clinics for children under 2 years. Both primary care team members and research assistants were aware of the purpose of this study and therefore may have changed their measurement behaviour. Finally, multiple primary care practitioners contributed to the overall intra-observer reliability TEMs for each measure, which may have inflated the estimates. However, since this was intended as a pragmatic examination of primary care anthropometry, we included all data collected by all observers.

While this study was able to assess the reliability of human measurements, we were not able to assess the potential measurement error resulting from equipment or measurement methods. For example, we did not compare the use of a length board with the paper and pencil method, in which a marking is made on the examining table paper at the head and feet. The paper and pencil method has been shown to systematically overestimate length, thereby increasing measurement error [21].

Conclusion

Monitoring weight, height and length measurements to calculate BMI-for-age has been recommended as the most inexpensive, efficient and precise measure in primary care to assess weight status. With the increased use of electronic medical records (EMR), these data are accessible for use outside of clinical care, such as in public health surveillance. In this study, we assessed multiple observers to calculate both intra- and inter-observer reliability and demonstrated that all values were in the acceptable range. Although length measurement had the highest TEM, it was still acceptable according to published reliability statistics [18] and comparable to that of the expert anthropometrists from the WHO-MGRS [19], meaning the magnitude of human measurement error was small. Determining the measurement reliability of length/height and weight in primary care contributes to understanding the feasibility of using routine clinical data for BMI surveillance in children. This study has identified that primary care practitioners in selected primary care practices who adhere to standardized equipment and procedures measure weight and length/height as reliably as research-trained personnel.