Background

Repeatability of measurements refers to the variation in repeated measurements made on the same subject under identical conditions. Variability in a repeatability study can therefore be ascribed only to errors due to the measurement process itself [1]. By contrast, when the measurements are performed under changing conditions, i.e. over a period of time, reproducibility is assessed. Repeatability and reproducibility are essential for clinicians for a variety of purposes [2, 3], such as aiding diagnosis, predicting future patient outcomes and choosing a personalized therapy. Several statistical methods have been developed and recommended for assessing repeatability and reproducibility, e.g. the Intraclass Correlation Coefficient (ICC) and the Bland-Altman plot, whereas others have been discouraged, for example Pearson’s correlation and the Coefficient of Variation (CV) [1, 4, 5].

This paper was motivated by a study on Anterior Active Rhinomanometry (AAR) in healthy children and in children with rhinitis. AAR is recommended as the gold standard tool for measuring nasal ventilation during a normal respiratory cycle and resistance at the nostrils in patients with upper airway obstruction symptoms [5, 6]. In clinical practice, AAR is the most widely used and readily applicable test for assessing the degree of nasal obstruction, as well as for monitoring clinical outcomes after surgical or medical procedures intended to improve nasal patency [7]. The test execution procedure is standardized according to the International Committee on Standardization of Rhinomanometry [6], with subjects sitting in an upright position and wearing a face mask, breathing only through the nose with their mouth closed.

To date, only a few studies have investigated AAR repeatability, all of them in adults, and their results conflict [8,9,10]. In particular, Carney et al. observed that single measurements had an unacceptably high CV (19–60%) in a cross-sectional study on seven adults [9], and Thulesius et al. reported rather poor long-term reproducibility (CV 27%) in a longitudinal study over 5 months on nine healthy adults [10]. Conversely, Silkoff et al. reported a high level of repeatability (CV 8.5 ± 2.8% and ICC 0.96) in a small sample of healthy subjects [8].

The aim of the present study was to show that using inappropriate statistical tools may lead to misleading results. To this end, the ICC, the Bland-Altman plot and the CV were compared both on data from healthy children and children with rhinitis and in a simulation study, as a possible reference for clinicians dealing with this type of study.

Methods

Statistical tools and underlying assumptions

This section introduces the statistical tools applied in the simulation study and to the clinical data. The ICC can be defined as the ratio of the between-subject variance to the sum of the within-subject and between-subject variances, and can be derived from a two-level random effect model [11]:

$$ ICC=\frac{\sigma_B^2}{\sigma_B^2+{\sigma}_W^2} $$

The ICC ranges from 0 to 1 and the following benchmarks can be used for interpretation: ICC < 0.20 “poor agreement”, 0.21–0.40 “fair agreement”, 0.41–0.60 “moderate agreement”, 0.61–0.80 “substantial agreement”, and > 0.80 “excellent agreement” [12,13,14]. In order to detect at least “fair agreement”, a significance test [15] can be performed to assess the following hypotheses:

$$ \left\{\begin{array}{c}{H}_0: ICC\le 0.20\\ {}{H}_1: ICC>0.20\end{array}\right. $$

The ICC suffers from a variety of methodological issues including sensitivity to assumptions of normality and equal variance [16, 17], and its use under assumption violations leads to misleading and likely inflated estimates of interrater reliability [18].
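As an illustration of the variance-ratio definition above, the following minimal Python sketch estimates an ICC from balanced repeated measurements using one-way ANOVA mean squares. It is only a sketch under stated assumptions: the one-way random-effects form, the function names `icc_oneway` and `agreement_label`, and the example data are illustrative choices, not necessarily the model or the software settings used in this study, and the significance test against ICC ≤ 0.20 is not reproduced.

```python
from statistics import mean


def icc_oneway(data):
    """Estimate a one-way random-effects ICC.

    `data` is a list of per-subject lists, each holding the same number
    of repeated measurements (a balanced design is assumed).
    """
    n = len(data)      # number of subjects
    p = len(data[0])   # repeated measurements per subject
    subject_means = [mean(row) for row in data]
    grand_mean = mean(subject_means)
    # Mean squares from a one-way ANOVA with subject as the factor
    msb = p * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(data, subject_means) for x in row
    ) / (n * (p - 1))
    return (msb - msw) / (msb + (p - 1) * msw)


def agreement_label(icc):
    """Map an ICC to the benchmark labels quoted in the text."""
    # Assumption: boundary values (e.g. exactly 0.20) fall in the lower band.
    for upper, label in [(0.20, "poor"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if icc <= upper:
            return label + " agreement"
    return "excellent agreement"


# Hypothetical data: 3 subjects, 3 repeated measurements each
example = [[1.0, 1.1, 0.9], [2.0, 2.1, 1.9], [3.0, 3.1, 2.9]]
estimate = icc_oneway(example)
print(round(estimate, 3), agreement_label(estimate))
```

Note that this moment estimator can return negative values when the within-subject mean square exceeds the between-subject one, even though the population ICC lies between 0 and 1.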

The CV is defined as the ratio between the standard deviation and the mean:

$$ {CV}_i=\frac{\sigma_i}{\mu_i} $$

where σi and μi are, respectively, the standard deviation and the mean of the measurement for subject i. CV is subject to some restrictions; for example it is meaningful only for measurements with a real zero (i.e., “ratio scales”). In addition, the values of the measurement to compute the CV always have to be positive [19]. The levels of acceptability for the CV depend on the field of application [20, 21]; however, CV < 15% is widely used [9, 10].
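The dependence of the CV on the mean can be seen in a short Python sketch. The function name `coefficient_of_variation` and the example values are hypothetical; the sample standard deviation is used as the estimate of σi.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Within-subject CV: sample standard deviation divided by the mean."""
    # The CV is only meaningful for strictly positive, ratio-scale data.
    if any(v <= 0 for v in values):
        raise ValueError("CV requires strictly positive measurements")
    return stdev(values) / mean(values)


# Hypothetical repeated measurements for one subject
low_level = [0.9, 1.0, 1.1]
# Same spread, but shifted to a higher mean
high_level = [10.9, 11.0, 11.1]

print(coefficient_of_variation(low_level))
print(coefficient_of_variation(high_level))
```

Both series have the same standard deviation, yet the second has a much smaller CV simply because its mean is larger.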

The Bland-Altman plot is used to assess the agreement between two repeated measurements [22] and to visually check possible heteroscedasticity of the data. Heteroscedasticity means that the size of the difference between two measurements changes with the size of the mean of the two measurements. Logarithmic transformation is suggested in the case of heteroscedasticity [23]. A nonparametric approach is recommended when the paired differences are not normally distributed [24].
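The quantities displayed in a Bland-Altman plot can be computed as in the following Python sketch. The function name `bland_altman` and the example values are hypothetical. The sketch returns both the usual parametric limits of agreement (bias ± 1.96 SD) and nonparametric 5th and 95th percentiles; the percentile interpolation rule chosen here is an assumption and may differ from that used by other software.

```python
from statistics import mean, stdev, quantiles


def bland_altman(first, second):
    """Return the points and summary lines of a Bland-Altman plot."""
    diffs = [a - b for a, b in zip(first, second)]          # y-axis
    means = [(a + b) / 2 for a, b in zip(first, second)]    # x-axis
    bias = mean(diffs)
    sd = stdev(diffs)
    # quantiles(..., n=20) returns the 5th, 10th, ..., 95th percentiles
    cuts = quantiles(diffs, n=20, method="inclusive")
    return {
        "means": means,
        "diffs": diffs,
        "bias": bias,
        "limits_of_agreement": (bias - 1.96 * sd, bias + 1.96 * sd),
        "percentiles_5_95": (cuts[0], cuts[-1]),
    }


# Hypothetical paired measurements on four subjects
result = bland_altman([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 5.0, 9.0])
print(result["bias"], result["limits_of_agreement"])
```

Plotting `diffs` against `means` and checking whether the spread of the differences changes along the x-axis is the visual check for heteroscedasticity described above.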

Simulation study

The simulation scenarios were inspired by our real data. We simulated data assuming two different generating mechanisms. In the first batch of simulations, we generated 1000 replicates from a normal distribution with a fixed CV, hypothesizing n = 10 subjects each with p = 5 repeated measurements. In particular, for each subject the p measurements were generated from Xi ∼ N(μi, σi), with μi ranging from 5 to 8 (10 equally spaced values), and σi = μi × CV, with CV ranging from 0.01 to 0.99 (50 equally spaced values). At each replication, the ICC was estimated.
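The first generating mechanism can be sketched in Python as follows. This is an illustrative reconstruction, not the original simulation code: the one-way ICC estimator, the function names and the seed are assumptions.

```python
import random
from statistics import mean


def icc_oneway(data):
    """One-way random-effects ICC for balanced per-subject lists."""
    n, p = len(data), len(data[0])
    subject_means = [mean(row) for row in data]
    grand_mean = mean(subject_means)
    msb = p * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(data, subject_means) for x in row
    ) / (n * (p - 1))
    return (msb - msw) / (msb + (p - 1) * msw)


def simulate_mean_icc(cv, n_subjects=10, n_repeats=5,
                      n_replicates=1000, seed=1):
    """Mean estimated ICC when data are generated with a fixed true CV."""
    rng = random.Random(seed)
    # Equally spaced subject means from 5 to 8
    mus = [5 + 3 * i / (n_subjects - 1) for i in range(n_subjects)]
    iccs = []
    for _ in range(n_replicates):
        # Each subject's measurements: normal with SD = mean * CV
        data = [[rng.gauss(mu, mu * cv) for _ in range(n_repeats)]
                for mu in mus]
        iccs.append(icc_oneway(data))
    return mean(iccs)


# 50 equally spaced CV values from 0.01 to 0.99, as in the text;
# only a few are evaluated here to keep the run short.
cv_grid = [0.01 + 0.98 * k / 49 for k in range(50)]
for cv in cv_grid[::10]:
    print(round(cv, 2), round(simulate_mean_icc(cv, n_replicates=200), 3))
```

Running the full grid with 1000 replicates gives the kind of ICC-versus-CV curve summarized in the Results.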

In the second batch of simulations, we generated 1000 replicates for n = 10 subjects each with p = 5 repeated measurements from a mixed model. In particular, for each subject the p measurements were generated from \( {\mathrm{X}}_i\sim {N}_p\left({\mu}_i,{\sigma}_W^2\right) \), with \( {\mu}_i\sim N\left({\gamma}_i,{\sigma}_B^2\right) \). Different configurations were considered by varying the overall mean γi = 1, 2, …, 10 and the between-subject variance \( {\sigma}_B^2=1,4,9 \); for fixed \( {\sigma}_B^2 \), the within-subject variance \( {\sigma}_W^2 \) was varied to obtain a sequence of true ICC values from 0.10 to 0.90 (9 equally spaced values). At each replication, the CV was estimated.
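The second generating mechanism can be sketched in Python in the same spirit. Rearranging the ICC definition gives the within-subject variance needed for a target ICC, σW² = σB²(1 − ICC)/ICC. The function names, the seed, and the choice to average the per-subject CVs over all subjects and replicates are assumptions made for illustration; the original code may have summarized the CVs differently.

```python
import math
import random
from statistics import mean, stdev


def within_variance_for_icc(icc, sigma_b2):
    """Within-subject variance that yields the target ICC for a given sigma_B^2."""
    return sigma_b2 * (1 - icc) / icc


def simulate_mean_cv(gamma, sigma_b2, icc, n_subjects=10, n_repeats=5,
                     n_replicates=1000, seed=1):
    """Mean estimated within-subject CV under the mixed model."""
    rng = random.Random(seed)
    sigma_b = math.sqrt(sigma_b2)
    sigma_w = math.sqrt(within_variance_for_icc(icc, sigma_b2))
    cvs = []
    for _ in range(n_replicates):
        for _ in range(n_subjects):
            mu_i = rng.gauss(gamma, sigma_b)            # subject-level mean
            xs = [rng.gauss(mu_i, sigma_w) for _ in range(n_repeats)]
            # Caution: for small gamma the subject mean can be close to
            # zero or negative, which makes this ratio unstable.
            cvs.append(stdev(xs) / mean(xs))
    return mean(cvs)


# One configuration: overall mean 10, sigma_B^2 = 1, true ICC = 0.90
print(round(simulate_mean_cv(gamma=10, sigma_b2=1.0, icc=0.90,
                             n_replicates=200), 3))
```

Looping over γ = 1, …, 10, σB² ∈ {1, 4, 9} and ICC from 0.10 to 0.90 would fill a grid of the same shape as the one reported in the Results.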

Clinical data

The data analysed in the present paper arise from a multicentre observational study carried out at the Pediatric Allergy and Immunology Service, Sapienza University (Rome, Italy), and at the Pulmonary and Allergy Pediatric Clinic of the CNR-IBIM (Palermo, Italy). The study was approved by the local Institutional Ethics Committee (Palermo, Italy, Approval Number: 7/2017), and informed consent was obtained from all parents before study entry. Once approved, the study was registered on ClinicalTrials.gov (ID: NCT03286049). This study was conducted in accordance with Good Clinical Practice and the Declaration of Helsinki.

The sample size was estimated according to the method illustrated by Zou [25] using the ICC.Sample.Size R package [26]. In order to test the null hypothesis of ICC ≤ 0.20, considering an expected ICC of 0.70 based on a previous study [8], five repeated measurements per subject and a 90% statistical power and a 5% significance level, a sample size of 10 subjects per group was required. Therefore, the study population comprised 50 children, i.e. 10 subjects for each of the following 5 groups:

  • Healthy Children (HC);

  • Children with non-allergic rhinitis (NAR), i.e. children with rhinitis symptoms but without allergic sensitization;

  • Children with perennial allergic rhinitis (PAR), i.e. children sensitized to perennial allergens;

  • Children with seasonal allergic rhinitis outside (SAR-O) and during (SAR-D) the pollen season, i.e. children sensitized to seasonal allergens.

All the children completed a standardized questionnaire including demographic characteristics and the core questions on rhinitis of the International Study of Asthma and Allergies in Childhood (ISAAC) [27]. The questions referred to problems with sneezing, or a runny or blocked nose, when the child did not have a cold or the flu, “ever” and “in the past twelve months”.

The inclusion criteria were the following: (1) age 10–16 years; (2) Total Five Symptoms Score (T5SS) > 5 for children with AR and NAR. The T5SS included sneezing, rhinorrhea, nasal itching, nasal obstruction and itchy eyes (each symptom scored from 0, absent, to 3, severe, so that the maximum possible score was 15); the threshold of T5SS > 5 at inclusion was set to ensure that patients were symptomatic.

The exclusion criteria were the following: medical diagnosis of nasal anatomic defects (e.g., deviated septum) or nasal polyp disease; craniofacial malformations; genetic diseases; medical diagnosis of asthma according to GINA guidelines (http://ginasthma.org); any acute illness in progress or in the month before the study; use of systemic steroids or antihistamines in the past 4 weeks; use of any nasal therapy in the past 4 weeks; active smoking.

The study involved three visits: screening (visit 1, baseline), visit 2 (after 14 ± 3 days), and a final assessment (visit 3, after 28 ± 3 days). At visit 1, patients were assessed for eligibility and recruited if they met the inclusion criteria; they then underwent a physical examination and five AAR measurements for each nostril. At visits 2 and 3, patients underwent one AAR measurement for each nostril.

The performance of AAR parameters in predicting patients’ current symptoms of rhinitis was assessed through a ROC analysis [28]. The Area Under the Curve (AUC) was estimated by nonparametric ROC analysis and its significance was tested using the method described by DeLong et al. [29]. Moreover, to avoid overestimating the test performance in the ROC analysis, we performed a five-fold cross-validation [30]. A p-value < 0.05 was considered to indicate a statistically significant effect. Statistical analyses were performed in R version 3.5.2; ICCs were computed using the R package irr [15], and the ROC curves were computed using pROC [31].
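For readers unfamiliar with the nonparametric AUC, the Python sketch below computes it as the proportion of (symptomatic, non-symptomatic) pairs in which the symptomatic subject has the higher value, counting ties as one half. The function name and the example values are hypothetical; the DeLong test and the five-fold cross-validation used in the study are not reproduced here.

```python
def auc_nonparametric(positive_scores, negative_scores):
    """Nonparametric AUC via pairwise comparison (Mann-Whitney statistic)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0      # correctly ordered pair
            elif pos == neg:
                wins += 0.5      # tie counts as half
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical parameter values for children with and without symptoms
with_symptoms = [3.0, 4.0, 5.0]
without_symptoms = [1.0, 2.0, 3.0]
print(auc_nonparametric(with_symptoms, without_symptoms))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two groups.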

Anterior Active Rhinomanometry (AAR)

AAR was performed according to the ICSR guidelines, using a RINOPOCKET ED200 (EUROCLINIC®, ITALY) rhinomanometer. The rhinomanometer was calibrated according to standard requirements. Rhinomanometry was performed in a temperature- and humidity-controlled room. A small plastic catheter was inserted through a pierced piece of tape and attached to flexible silicone tubing leading to the pressure port of the meter. The foam was placed across the contralateral nostril to measure the nasopharyngeal pressure, taking care not to interfere with the nostril being tested. The tubing was brought out around the side of the transparent mask. To perform rhinomanometry, patients were asked to wear a face mask, close their mouths and breathe through the nose. For each nostril a rhinogram was recorded, relating inspiratory and expiratory nasal airflow to transnasal pressure. A retest was performed in all patients. Measurements were performed by the same operator using the same instrument and following the standard operating procedure according to Clement [32].

With reference to Ohm’s law (R = ΔP / F), the Rinopocket uses the following: 1) a temperature-compensated differential pressure transducer (−25 to +25 kPa; −3.6 to +3.6 psi) to obtain ΔP {other features are: accuracy (0 to 85 °C) = ±5.0% VFSS; sensitivity (V/P) = Typ 90 mV/kPa; response time (t r) = Typ 1.0 ms; offset stability = Typ ±0.5% VFSS}; 2) a compensated and amplified airflow sensor (±300 SLPM) to obtain the flow F {other features are: repeatability and hysteresis = Typ ±0.035 Vdc; response time (t r) = Typ 10 ms; null voltage shift = Typ ±0.02 Vdc (25 °C to 5 °C [77 °F to 41 °F]) and Typ ±0.02 Vdc (25 °C to 60 °C [77 °F to 140 °F]); full-scale output shift = Typ ±2.5% reading (25 °C to 5 °C [77 °F to 41 °F]) and Typ ±2.5% reading (25 °C to 60 °C [77 °F to 140 °F])}; 3) a 32-bit STM32F373 CPU with internal A/D converter (3CH 16bit sigma-delta); 4) EDM software to calculate AAR resistances at 150, 100, and 75 Pa (R 150 Pa, R 100 Pa and R 75 Pa), total resistance and other parameters such as maximum pressure, maximum flow, and flow at 150, 100, and 75 Pa. According to Broms, resistance 2 (R2) is defined as the pressure-flow quotient at the standardized point where the curve crosses the circle with radius 2 [33]. For each nasal resistance, the AAR parameters considered were inspiratory (R, L and R + L), expiratory (R, L and R + L), and total combined (total inspiratory + total expiratory).
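The Ohm’s-law relation above can be illustrated with a short Python sketch that reads the flow at a reference pressure off a sampled pressure-flow curve and returns the corresponding resistance. This is not the EDM software’s algorithm: the linear interpolation, the function names, the units and the sample curve are all assumptions made for illustration.

```python
def flow_at_pressure(pressures, flows, target):
    """Linearly interpolate the flow at `target` pressure.

    `pressures` must be sorted in increasing order and be the same
    length as `flows`.
    """
    for i in range(len(pressures) - 1):
        p0, p1 = pressures[i], pressures[i + 1]
        if p0 <= target <= p1:
            f0, f1 = flows[i], flows[i + 1]
            return f0 + (f1 - f0) * (target - p0) / (p1 - p0)
    raise ValueError("target pressure outside the sampled range")


def resistance_at(pressures, flows, target):
    """Resistance R = pressure / flow at the reference pressure."""
    return target / flow_at_pressure(pressures, flows, target)


# Hypothetical inspiratory curve: pressure in Pa, flow in cm^3/s
pressures = [0.0, 50.0, 100.0, 200.0]
flows = [0.0, 100.0, 200.0, 300.0]
for ref in (75.0, 100.0, 150.0):
    print(ref, resistance_at(pressures, flows, ref))
```

Because the pressure-flow curve is not linear, the resistance generally differs between the 75, 100 and 150 Pa reference points, which is why each is reported separately.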

Results

Simulation study

Figure 1 shows the mean of the estimated ICCs for each true CV. The first batch of simulations shows that, as long as the true CV was < 15%, the ICC was greater than 0.50 even though the data were generated under the CV model; overall, the ICC decreased as the CV increased.

Fig. 1 Simulated mean of the ICCs estimated given the CVs

Table 1 reports the CVs estimated in the second batch of simulations. For a fixed ICC (i.e. for fixed σW and σB), the estimated CVs decreased as the overall mean μ increased, as expected; however, most of the CVs were ≥ 0.15 even for high ICC values. For fixed μ, the estimated CV decreased as σW decreased, as expected; the only CVs < 0.15 were observed for quite large μ values.

Table 1 Simulated means of the CVs with n = 10 and p = 5, for different σB, σW and overall mean μ
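A rough back-of-the-envelope calculation, offered here only as an approximation, shows why the CV depends so strongly on the overall mean in this setting. Rearranging the ICC definition and approximating each subject’s mean by the overall mean μ (which ignores the between-subject spread of the means and the small-sample bias of the standard deviation) gives:

$$ {\sigma}_W^2={\sigma}_B^2\frac{1- ICC}{ICC},\kern2em CV\approx \frac{\sigma_W}{\mu }=\frac{\sigma_B}{\mu}\sqrt{\frac{1- ICC}{ICC}} $$

For example, with ICC = 0.90 and \( {\sigma}_B^2=1 \), \( {\sigma}_W=\sqrt{1/9}\approx 0.33 \), so CV < 0.15 requires roughly μ > 0.33/0.15 ≈ 2.2; with \( {\sigma}_B^2=9 \), \( {\sigma}_W=1 \) and the same threshold requires roughly μ > 6.7. The same true ICC can therefore correspond to an “acceptable” or an “unacceptable” CV depending only on the scale of the measurements.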

Repeatability of AAR

At baseline, the characteristics of the children were similar in the five groups (Table 2). Table 3 shows the AAR parameters by group. Significant differences among groups were found for all AAR parameters. Table 4 reports the within-day ICCs for each AAR parameter by group. Most of the ICCs were statistically significant in all groups and were > 0.20, which is considered the cut-off value between poor and fair agreement. Table 5 reports the CV by group for all AAR parameters. Most of the CVs were ≥ 0.15, which would indicate unacceptable repeatability.

Table 2 Characteristics of children by group at the baseline visit
Table 3 Nasal resistances (R2, R 75 Pa, R 100 Pa, R 150 Pa) by group
Table 4 Within-day ICCs by group for all the measured nasal resistances (R2, R 75 Pa, R 100 Pa, R 150 Pa)
Table 5 Within-day CV by group for all the measured nasal resistances (R2, R 75 Pa, R 100 Pa, R 150 Pa)

Reproducibility of AAR

Figures 2, 3, 4 and 5 show the between-day reproducibility of total combined R2, R 75 Pa, R 100 Pa and R 150 Pa for each group of children. Specifically, the first row reports the reproducibility 14 days after baseline (visit 2), and the second row reports the reproducibility 28 days after baseline (visit 3). No evidence of heteroscedasticity was found in any group, and the statistical analysis was therefore continued without logarithmic transformation. The point distribution appeared to be random, except for SAR-D, for which a decreasing trend was observed, and SAR-O, for which most of the measurements were clustered at small values.

Fig. 2 Bland-Altman plot: the difference between the Total R2 measurements of Day 1 and Day 14 (first row) and between Day 1 and Day 28 (second row) for each group. The broken lines represent the 5th and 95th percentiles

Fig. 3 Bland-Altman plot: the difference between the Total R 75 (Pa) measurements of Day 1 and Day 14 (first row) and between Day 1 and Day 28 (second row) for each group. The broken lines represent the 5th and 95th percentiles

Fig. 4 Bland-Altman plot: the difference between the Total R 100 (Pa) measurements of Day 1 and Day 14 (first row) and between Day 1 and Day 28 (second row) for each group. The broken lines represent the 5th and 95th percentiles

Fig. 5 Bland-Altman plot: the difference between the Total R 150 (Pa) measurements of Day 1 and Day 14 (first row) and between Day 1 and Day 28 (second row) for each group. The broken lines represent the 5th and 95th percentiles

Table 6 reports the CV and ICC values between Day 1 and Day 14 and between Day 1 and Day 28 by group. Reproducibility was unacceptable, since all CVs were ≥ 0.15 and most of the ICCs were not significant.

Table 6 CV and ICC between Day 1 and Day 14 (first column) and between Day 1 and Day 28 (second column) by group

Symptom data

Table 7 reports the AUC and 95% CI for AAR parameters in predicting current symptoms of rhinitis in the overall study population. Of interest, in all the children reporting current symptoms of rhinitis, a significant association was found with two items of the T5SS, namely sneezing and nasal obstruction (p = 0.024 and p = 0.021, respectively).

Table 7 AUC and 95%CI for predicting current symptoms of rhinitis

Discussion

In this paper, two common approaches used for assessing repeatability and reproducibility were compared; the focus was on the misleading results obtained when inappropriate tools are used. In fact, although the use of the CV has largely been discouraged, this warning still appears to be ignored by most clinicians.

The simulation study showed that ICC values estimated from data generated under a given true CV indicated moderate repeatability as long as the CV was < 15%. When data were generated from a mixed model, by contrast, the CV gave conflicting results irrespective of the magnitude of the true ICC, depending especially on the combination of mean and variance used for generating the data [34]. Indeed, when the mean value is close to zero, the coefficient of variation approaches infinity and is therefore sensitive to small changes in the mean. This is often the case if the values do not originate from a ratio scale. Repeatability and reproducibility should be assessed using a statistical tool that highlights the reliability of the measurement and not the differences between subjects.

The motivating dataset provided a good example of this; indeed, until now AAR repeatability has only been studied in adults [8,9,10]. Two studies reported repeatability in terms of CV, and only one reported both CV and ICC. The CVs computed for our clinical data are similar to those of other studies on healthy adults reporting unacceptable repeatability [9] and reproducibility [10]. However, when the ICC is considered, our results suggest that AAR has good repeatability. Similarly, Silkoff et al. reported conflicting results depending on the statistical tool used: good repeatability was observed with the ICC (0.76, 0.70 and 0.96 for right, left and combined nasal resistance, respectively), whereas unacceptable or poor repeatability was obtained with the CV for right and left nasal resistance (CV = 15.9% and CV = 12.9%) [8]. On the other hand, when the ICC was used to assess reproducibility, most of the ICCs were not significant. However, to test the null hypothesis of ICC ≤ 0.20, considering an expected ICC of at least 0.70, two repeated measurements per subject, 90% statistical power and a 5% significance level, a sample size of 21 subjects per group would have been needed [35]. Therefore, the Bland-Altman plot is preferred, given its powerful visual representation of the degree of agreement and the easy identification of bias, outliers, and any relationship between the variance of the measurements and the size of the mean [4]. The Bland-Altman plots constructed for our clinical data showed no evidence of heteroscedasticity, and the point distribution appeared to be random, except for SAR-D and SAR-O. The difference in reproducibility between groups is unexplained; however, the sample size required to estimate reproducibility using the Bland-Altman plot, setting an expected mean of differences of 0.20, an expected standard deviation of differences of 0.10 and a maximum allowed difference between methods of 0.50, was 26 subjects [22]. Therefore, since AAR repeatability in children with upper airway obstructive symptoms has not been investigated before, prospective studies with larger numbers of cases and more repeated measurements are needed to better determine reproducibility.

The present paper suggests that, due to the use of inappropriate statistical tools, AAR repeatability and reproducibility may have been underestimated in previous assessments. Overall, our results highlight the clinical reliability of AAR both in healthy children and in children with rhinitis. Furthermore, we showed good performance of AAR parameters in predicting current symptoms of rhinitis in the overall study population. This suggests that a more accurate and reproducible measurement correlates well with patients’ symptoms, highlighting the additional value of performing AAR in clinical practice.

Conclusions

Physicians dealing with clinical data should carefully choose the most suitable statistical tools for assessing repeatability and reproducibility. The results of the present study support the clinical reliability of AAR parameters, which showed good repeatability both in healthy children and in children with rhinitis.