Background

Administrative health data are widely used for monitoring trends in chronic disease incidence and prevalence for entire populations. Algorithms (i.e., case definitions) to ascertain disease cases may be applied to administrative health data without considering potential changes in the data over time. Specifically, changes in clinical guidelines, diagnosis coding practices, and healthcare processes may impact how administrative health data are coded [1, 2]. Therefore, changes in observed disease trends may reflect changes in data coding rather than true changes in population health status [3,4,5,6,7]. Methods that attempt to disentangle true change from coding-related effects will benefit users of administrative health data for disease surveillance.

Originally developed to monitor industrial processes, control charts are used to graph observed data in sequential order, with a centre line representing the average or expected value [8]. Control limits set around the centre line denote the range within which variation in the process can be attributed to random error. Observations outside the control limits are deemed ‘out-of-control’, suggesting a non-random source of variation influenced the process of interest [8]. Different kinds of control charts can be applied to process data, including Shewhart charts, U′ charts, cumulative sum (CUSUM) charts, and observed-expected charts. Chart selection depends on the data characteristics and chart purpose.

There has been a steady uptake of control charts in population health and healthcare research since the 1990s, with a marked increase in recent years [9]. Applications include monitoring: mortality rates using observed-expected [10], CUSUM [11], and p charts [10]; hospital length-of-stay using exponentially weighted moving average [12], CUSUM [12], and Shewhart charts [13]; surgical infection rates using Q [14] and p charts [15]; and delivery outcomes for maternity wards using observed-expected charts [16]. In health surveillance settings, U′ charts have been used to monitor injury rates of military personnel [17], and Shewhart and CUSUM charts have been used to detect changes in child blood lead levels [18]. In addition, open-source software has already been developed to apply control charts to infectious disease surveillance using REDCap, R, and the R Shiny package [19].

Risk-adjusted control charts are of particular interest for health surveillance as they can adjust for different risk strata in the population [20, 21]. Risk-adjusted CUSUM and observed-expected control charts, which are closely related and sometimes used interchangeably [11, 21, 22], are commonly used in health research because of their ease of interpretation and their versatility with different data types (e.g., binary, count, continuous data) [9, 11, 22]. Risk-adjusted CUSUM charts incorporate observed values from previous time points into control limit calculations [22], whereas observed-expected charts may not.

Control charts could also be used to monitor chronic disease surveillance estimates obtained from administrative health data, similar to their applications to mortality data. Out-of-control disease estimates may indicate where changes in trends are due to changes in coding practices or other factors affecting the data, rather than true changes in population health. Moreover, comparing control charts across multiple algorithms that use different sources of data (i.e., hospital versus physician records) may help to reveal potential sources of non-random process variation and indicate whether some algorithms are more affected by data variations (i.e., less stable) than others.

Given this background, the purpose of this study was to apply observed-expected control charts to incidence and prevalence trends in a case study of one disease. The objectives were to a) visualize the stability of disease trends over time; and b) compare the stability of incidence and prevalence trends produced using different algorithms applied to administrative data.

Methods

Selection of algorithms

PubMed, Google Scholar, and Embase were searched up to October 2020 for juvenile diabetes algorithms for administrative health data. Juvenile diabetes was selected as the focus of this study because administrative health data have frequently been used for surveillance of this disease and multiple validated algorithms have been developed [23, 24]. Search terms included diabetes, children, juvenile, administrative health data, case definition, claims data, incidence, and prevalence. Only articles published in the English language were reviewed.

Algorithms were selected for this study if they used hospital and/or physician records, if the number of records and observation window (i.e., number of years for a diagnosis to occur within the records) for the algorithm was clearly stated, and if validation measures (e.g., sensitivity, specificity) were reported. Algorithms were excluded if they included gestational diabetes or used data other than hospital or physician records, such as prescription medications. We adopted the latter exclusion because our primary interest was in data coded using International Classification of Diseases (ICD) codes. Table 1 summarizes the 18 algorithms we identified from the literature to include in this study [23,24,25,26,27,28,29]. Some algorithms were validated in more than one province: six were validated in Manitoba, Canada; three in British Columbia; 13 in Ontario; 16 in Quebec; and one in Nova Scotia. Figure 1 provides a flowchart that describes algorithm selection.

Table 1 Validated algorithms used to identify juvenile diabetes cases in administrative health data
Fig. 1

Flowchart of juvenile diabetes algorithm selection from published literature

Data source

Algorithms were applied to data from the Manitoba Population Data Repository housed at the Manitoba Centre for Health Policy (MCHP). The study period was January 1, 1972 to December 31, 2018. Manitoba has a universal healthcare system and a population of 1.3 million residents. The Manitoba Health Insurance Registry, Hospital Discharge Abstracts, and Medical Claims/Medical Services databases were used. The Manitoba Health Insurance Registry contains health insurance coverage dates, birth date, and sex. The Hospital Discharge Abstracts and Medical Claims/Medical Services databases contain ICD codes and dates for hospital and physician visits, respectively. Three ICD versions are used to code diagnoses within these two databases: ICD Adapted (A)-8, ICD-9-Clinical Modification (CM), and ICD-10-Canadian version (CA). For hospital visits captured in Hospital Discharge Abstracts, records between January 1, 1972 and March 31, 1979 are coded using 4-digit ICDA-8 codes; records between April 1, 1979 and March 31, 2004 are coded using 5-digit ICD-9-CM codes; and records from April 1, 2004 onwards are coded using 5-digit ICD-10-CA codes. For physician visits captured in Medical Claims/Medical Services, records between January 1, 1972 and March 31, 1979 are coded using ICDA-8 and records from April 1, 1979 onwards are coded using ICD-9-CM. Diagnosis codes for physician visits are recorded at the 3-digit level until March 31, 2015 (both ICDA-8 and ICD-9-CM); 5-digit codes are used from April 1, 2015 onward (ICD-9-CM). For both databases, data from 1972 to 1974 were originally collected using the 7th revision of ICD codes and later converted to ICDA-8 by the data provider. These years were not included in the study analysis because we had no information about the conversion method used by the data provider. However, data from these years were used to establish the lookback period for defining incident cases (see Study cohort and study periods).

Incident and prevalent disease counts per year were aggregated by sex and age group (0-9 years; 10-17 years). Cell sizes less than six were suppressed, as per provincial health privacy regulations.

Study cohort and study periods

Separate cohorts were created for each algorithm. To be included in a study cohort, individuals required continuous health insurance coverage during the observation window (1 to 3 years, depending on the algorithm). Individuals in each cohort were classified as cases if they met the criteria of the respective algorithm. A 3-year lookback period was used for incidence [23], meaning only individuals with no diabetes claims in the prior 3 years were identified as incident cases.
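To make the lookback rule concrete, the following minimal R sketch (R was the analysis language; see Statistical analysis) flags candidate incident cases from a hypothetical claims table with columns `id` and `claim_year`; the full case-ascertainment logic of each algorithm is more involved.

```r
library(dplyr)

# Hypothetical input `claims`: one row per diabetes claim (id, claim_year).
# A person's first claim year is kept as a candidate incident year only if
# the full 3-year lookback window is covered by available data -- this is
# why the 1972-1974 data were retained to establish the lookback period.
incident <- claims %>%
  group_by(id) %>%
  summarise(first_claim_year = min(claim_year), .groups = "drop") %>%
  filter(first_claim_year - 3 >= 1972)
```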

Study ICD periods were defined based on the ICD version in use at the beginning of each year. There were three ICD periods: ICDA-8 (1975 to 1979), ICD-9 (1980 to 2004), and ICD-9/10 (2005 to 2018). ICD implementation periods were defined as the year a new ICD version was implemented plus the 2 years before and the 2 years after that year. There were two ICD implementation periods: ICDA-8 to -9 (1977 to 1981) and ICD-9 to -9/10 (2002 to 2006).

Statistical analysis

The estimated annual crude rate per 100,000 population was calculated; this was the number of cases per year divided by the number of individuals with continuous healthcare coverage per year, multiplied by 100,000. An average rate was calculated for each ICD period; this was the average value of the annual crude rates in that time period. The average annual rate of change in each ICD period was calculated as the total change in crude rate (annual crude rate in the last year of the ICD period minus the annual crude rate in the first year of the ICD period) divided by the number of years in the ICD period.
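As a concrete sketch of these calculations, assuming a hypothetical data frame `d` with one row per year and columns `year`, `cases`, and `covered` (the number of individuals with continuous coverage):

```r
# Annual crude rate per 100,000 population
d$crude_rate <- d$cases / d$covered * 1e5

# Average rate for an ICD period, e.g., ICD-9 (1980 to 2004)
icd9     <- d[d$year >= 1980 & d$year <= 2004, ]
avg_rate <- mean(icd9$crude_rate)

# Average annual rate of change: total change in crude rate divided by
# the number of years in the ICD period
n_years    <- max(icd9$year) - min(icd9$year) + 1
avg_change <- (icd9$crude_rate[icd9$year == max(icd9$year)] -
               icd9$crude_rate[icd9$year == min(icd9$year)]) / n_years
```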

For each algorithm, incident case counts, where observations for successive years are independent (i.e., not correlated), were modelled using negative binomial regression models. Prevalent case counts, where observations for successive years are correlated, were modelled using generalized estimating equation (GEE) models that assume a Poisson distribution; GEE produces correct estimates of the population-average model parameters (i.e., prevalence) and their standard errors in the presence of dependence between repeated observations. The GEE model adopted a first-order autoregressive correlation structure because the data modelled were time series data. For all models, age group, sex, and year were included as covariates. The natural logarithm of the cohort size was defined as the model offset. To account for potential non-linear effects of year, the shape of the year effect was tested using a restricted cubic spline [30]. Four models were applied to the data: one with year as a linear effect, and three with year as a restricted cubic spline with three, four, and five knots, respectively. Knots were placed at quintiles, as recommended by Harrell [30]. The spline model with the lowest Akaike Information Criterion (AIC) [31] or Quasi Information Criterion (QIC) value [32] was selected as the best fitting model and compared to the model with year as a linear term using a likelihood ratio test (incidence) or Wald test (prevalence). If the test indicated that the restricted cubic spline model did not fit the data significantly better than the linear model (i.e., p ≥ .05), the linear model was adopted.
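A condensed R sketch of this modelling step is shown below, assuming a hypothetical data frame `dat` with one row per year × age group × sex stratum and columns `cases`, `cohort_size`, `year`, `age_group`, and `sex`. Here `splines::ns()` (a natural cubic spline with knots at quantiles when `df` is given) stands in for the restricted cubic spline; `rms::rcs()` would match the Harrell specification exactly.

```r
library(MASS)     # glm.nb: negative binomial regression
library(geepack)  # geeglm: generalized estimating equations
library(splines)  # ns: natural (restricted) cubic spline basis

# Incidence: negative binomial models, year as linear vs. spline effect
nb_lin <- glm.nb(cases ~ year + age_group + sex +
                   offset(log(cohort_size)), data = dat)
nb_rcs <- glm.nb(cases ~ ns(year, df = 4) + age_group + sex +
                   offset(log(cohort_size)), data = dat)
AIC(nb_lin, nb_rcs)    # choose among candidate spline models by AIC
anova(nb_lin, nb_rcs)  # likelihood ratio test: spline vs. linear year

# Prevalence: Poisson GEE with AR(1) working correlation across years;
# clusters (id) are the age group x sex strata followed over time
dat$stratum <- interaction(dat$age_group, dat$sex)
dat <- dat[order(dat$stratum, dat$year), ]
gee_rcs <- geeglm(cases ~ ns(year, df = 4) + age_group + sex,
                  offset = log(cohort_size), id = stratum,
                  family = poisson("log"), corstr = "ar1", data = dat)
gee_lin <- geeglm(cases ~ year + age_group + sex,
                  offset = log(cohort_size), id = stratum,
                  family = poisson("log"), corstr = "ar1", data = dat)
QIC(gee_rcs)             # quasi-information criterion for GEE selection
anova(gee_lin, gee_rcs)  # Wald test: spline vs. linear year
```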

Model fit for the best fitting model was assessed by calculating the ratio of residual deviance to degrees of freedom (negative binomial models) or the marginal R2 value based on Zheng [33] (GEE models). If more than 10% of cells for an algorithm were suppressed, the data were not modelled. Otherwise, suppressed cells were randomly imputed with a value between one and five. Three algorithms had more than 10% of cells suppressed for incidence and one algorithm had more than 10% of cells suppressed for prevalence. Therefore, incidence counts were modelled for 15 of the identified algorithms and prevalence counts were modelled for 17.
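These fit measures and the imputation step can be sketched as follows, continuing the hypothetical objects above; the marginal R2 line uses one common formulation of Zheng's measure, which may differ in detail from the paper's calculation.

```r
# Residual deviance to degrees-of-freedom ratio for the NB model
nb_rcs$deviance / nb_rcs$df.residual  # values near 1 suggest adequate fit

# Marginal R2 for the GEE model (Zheng 2000): 1 - SSE/SST computed on the
# population-averaged fitted values
y  <- dat$cases
r2 <- 1 - sum((y - fitted(gee_rcs))^2) / sum((y - mean(y))^2)

# Random imputation of suppressed cells (counts of 1-5), applied only when
# no more than 10% of an algorithm's cells were suppressed; `suppressed` is
# a hypothetical logical flag (in the actual workflow this step precedes
# modelling)
set.seed(123)  # arbitrary seed, for reproducibility
dat$cases[dat$suppressed] <- sample(1:5, sum(dat$suppressed), replace = TRUE)
```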

Observed-expected control charts were applied by graphing model-predicted counts from the best fitting model against the observed case counts [21, 34]. Predicted values for each year, age group, and sex combination were calculated, along with their respective standard deviations. To obtain a single estimate and standard deviation (SD) for each year during the study period, predicted values were summed across groups and SDs were pooled. Control limits were calculated based on Cohen’s effect size [35] as the model-predicted value ±0.8*pooled SD. This cut-off was chosen because it has a meaningful interpretation (i.e., it detects large differences between model-predicted and observed counts) and does not incorporate a grand mean into the calculation. More information on the calculation of control limits, expected values, and SDs can be found in Additional file 1.
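A sketch of the chart construction for a negative binomial fit follows (the GEE case is analogous); the simple sum-of-variances pooling used here is an assumption on our part, as the exact calculation is given in Additional file 1.

```r
library(dplyr)

# Expected counts and prediction SDs per year x age group x sex cell
pr <- predict(nb_rcs, type = "response", se.fit = TRUE)
dat$expected <- pr$fit
dat$var_pred <- pr$se.fit^2

k <- 0.8  # Cohen's large effect size; k <- 2 gives the sensitivity analysis
chart <- dat %>%
  group_by(year) %>%
  summarise(observed  = sum(cases),
            expected  = sum(expected),
            sd_pooled = sqrt(sum(var_pred)), .groups = "drop") %>%
  mutate(lcl = expected - k * sd_pooled,   # lower control limit
         ucl = expected + k * sd_pooled,   # upper control limit
         out_of_control = observed < lcl | observed > ucl)
```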

To compare trend stability across algorithms, annual case counts were classified as ‘in-control’ or ‘out-of-control’ for the years 1975 to 2016 based on the calculated control limits. Data after 2016 were truncated, because algorithms with three-year observation windows did not have case counts beyond 2016. Data before 1975 were used to establish the lookback period for defining incident cases. The proportion of out-of-control years was calculated as the total number of out-of-control years for an algorithm divided by the number of study years (i.e., 1975-2016; 42 years).
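Continuing the sketch above, the proportion of out-of-control years reduces to:

```r
# Proportion of out-of-control years over the 1975-2016 study window
study    <- chart[chart$year >= 1975 & chart$year <= 2016, ]
prop_ooc <- mean(study$out_of_control)
```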

McNemar’s test [36] was used to test for differences in the frequency of out-of-control observations between algorithms. McNemar’s test was chosen because all algorithms were applied to the same population (i.e., repeated measurements). The algorithm of one or more hospital or physician visits in a two-year period (2: 1 + H or 1 + P) was selected as the reference algorithm because the literature review identified it as having the highest validation measures (validated using chart abstraction) and as the most common algorithm in the published literature. Out-of-control observations for the remaining algorithms were then compared to the reference algorithm to determine differences in trend stability. Trend stability was compared across the entire study period, the three ICD periods, and the two ICD implementation periods. To control the overall probability of a Type I error for each family of tests (i.e., entire study period, each ICD period, and each ICD implementation period), a Holm-Bonferroni adjustment [37] was used. This adjustment controls the Type I error rate but has greater power to detect a difference than the traditional Bonferroni adjustment [37, 38].
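A minimal sketch of the paired testing, assuming hypothetical logical vectors `flags_ref` and `flags_cmp` (one entry per study year) for the reference and a comparison algorithm:

```r
# McNemar's test on the 2 x 2 table of paired out-of-control classifications
mcnemar.test(table(flags_ref, flags_cmp))

# Holm-Bonferroni adjustment across one family of comparisons
p_raw <- c(0.030, 0.200, 0.008)  # illustrative unadjusted p-values
p.adjust(p_raw, method = "holm")
```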

To identify years that were frequently flagged as out-of-control, an agreement-by-year measure was calculated for each year of the study observation period. This was the total number of algorithms that classified a particular year as out-of-control, divided by the total number of algorithms modelled (i.e., 15 for incidence; 17 for prevalence).
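Given a hypothetical years × algorithms logical matrix `flag_mat` of out-of-control classifications, this measure is a row mean:

```r
# Agreement-by-year: proportion of modelled algorithms (15 for incidence,
# 17 for prevalence) that flagged each year as out-of-control
agreement_by_year <- rowMeans(flag_mat)
```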

In a sensitivity analysis, the control limits were set as the model-predicted value ±2*pooled SD. All data analyses were performed using R version 4.1.0. The MASS package [39] was used to fit the negative binomial models and the geepack package [40] was used to fit the GEE models. All research was performed in accordance with the relevant guidelines and regulations.

Results

Average crude incidence and prevalence rates and average annual rates of change for each ICD period are reported in Table 2. For both incidence and prevalence, the average crude rate increased from the ICDA-8 period to the ICD-9/10 period, with the exception of the 1: 4 + P algorithm, where prevalence decreased from 39.86 cases per 100,000 population in the ICDA-8 period to 38.30 per 100,000 population in the ICD-9 period. As expected, the average crude rate was lower for algorithms that required more diagnosis codes to identify cases and was higher for algorithms with longer observation windows. All algorithms had a positive average annual crude rate of change for both incidence and prevalence during the ICD-9 and ICD-9/10 periods, except for the algorithm 3: 1 + H or 1 + P, which had a negative average annual crude rate of change for incidence during the ICD-9/10 period. The direction of the average annual crude rate of change was variable across algorithms for the ICDA-8 period.

Table 2 Average crude incidence and prevalence rates and average annual rate of change per 100,000 population across ICD periods

Models

Goodness-of-fit measures for the best fitting negative binomial and GEE models for each algorithm are reported in Table 3. The ratio of residual deviance to residual degrees of freedom ranged from 1.04 to 1.23 for the negative binomial regression models; marginal R2 ranged from 0.83 to 0.98 for the GEE models. For all algorithms, a non-linear effect of year was indicated for both incidence and prevalence (i.e., the model with year as a restricted cubic spline was selected as the best fitting model).

Table 3 Goodness of fit statistics for negative binomial regression and generalized estimating equation models applied to cases ascertained by juvenile diabetes algorithms

Control charts

Figure 2 shows the observed-expected control charts for incidence and prevalence trends obtained for the reference algorithm (2: 1 + H or 1 + P). Both incidence and prevalence increased over the study period, although the rate of increase varied. The variance (i.e., range) of observed values around expected values was greater for incidence than for prevalence; correspondingly, the control limits for incidence were wider than those for prevalence. Control charts for all algorithms are found in Additional file 2: Figs. S1 and S2.

Fig. 2

Observed-expected control charts for juvenile diabetes algorithm ‘one or more hospital or physician visits in two years’. Panel a shows results for incidence; panel b shows results for prevalence. Vertical lines indicate years where a change in ICD version was implemented

Table 4 contains information about the proportion of out-of-control years for each algorithm, for both incidence and prevalence. The proportion of out-of-control years ranged from 0.57 to 0.76 for incidence and 0.45 to 0.83 for prevalence. For incidence, the algorithm 2: 5 + P had the greatest proportion of out-of-control years; 2: 3 + P had the lowest proportion of out-of-control years. For prevalence, the algorithm 1: 3 + P had the greatest proportion of out-of-control years and 2: 3 + P had the lowest proportion of out-of-control years.

Table 4 Comparisons of incidence and prevalence trend stability across juvenile diabetes algorithms

McNemar’s test with the Holm-Bonferroni correction found no significant differences in the stability of trends for the reference algorithm compared to other algorithms. The same finding was observed for analyses stratified by ICD period and ICD implementation period.

Figure 3 reports agreement-by-year. For incidence, the years 1980, 2000, and 2004 were flagged as out-of-control for all algorithms. In contrast, 1986 and 2006 were flagged as out-of-control for only four of 15 algorithms. For prevalence, the year 1997 was flagged as out-of-control for all algorithms. The years 1981, 1988, 1990, 1993, and 2001 were flagged as out-of-control for 15 of 17 algorithms. In contrast, 1987 was flagged as out-of-control for only five algorithms.

Fig. 3

Algorithm agreement-by-year for out-of-control juvenile diabetes estimates. Panel a shows results for incidence; panel b shows results for prevalence

Sensitivity analysis

The sensitivity analysis, with control limits set at the model-predicted value ±2*pooled SD, flagged fewer years as out-of-control (Table 5). Control charts for all algorithms can be found in Additional file 2: Figs. S3 and S4. The proportion of out-of-control years ranged from 0.19 to 0.33 for incidence and 0.07 to 0.52 for prevalence. For incidence, the algorithm 2: 3 + P had the lowest proportion of out-of-control years, while two algorithms, 2: 2 + P and 1: 1 + H or 3 + P, had the highest proportion. For prevalence, the algorithm 2: 2 + P had the lowest proportion and the algorithm 1: 1 + H or 4 + P had the highest proportion of out-of-control years.

Table 5 Sensitivity analysis: comparisons of incidence and prevalence trend stability across juvenile diabetes algorithms

For incidence, McNemar’s test revealed no significant differences in trend stability across algorithms (Table 5). For prevalence, differences in trend stability were revealed between the reference algorithm and 2: 1 + H or 2 + P (p = 0.010), 1: 2 + P (p = 0.049), 2: 2 + P (p = 0.008), 2: 3 + P (p = 0.049), and 2: 4 + P (p = 0.049), where these algorithms had a lower frequency of out-of-control observations. When stratified by ICD period, there was a difference between the reference algorithm and 2: 2 + P (p = 0.041) for the ICD-9 period, with 2: 2 + P having a lower frequency of out-of-control observations.

Algorithm agreement-by-year for the sensitivity analysis is reported in Additional file 2: Fig. S5. Incidence counts for the year 1980 were flagged as out-of-control for 14 of 15 algorithms; prevalence counts for the year 2001 were flagged as out-of-control for 13 of 17 algorithms.

Discussion

Observed-expected control charts applied to juvenile diabetes algorithms for administrative health data were used to investigate the stability of trends in incidence and prevalence over a 42-year period in which three ICD versions were used for diagnosis codes. The proportion of out-of-control years detected using control limits of 0.8*SD ranged from 0.57 to 0.76 for incidence and 0.45 to 0.83 for prevalence. As expected, these proportions were reduced to 0.19 to 0.33 for incidence and 0.07 to 0.52 for prevalence when control limits of 2*SD were used in a sensitivity analysis. No differences in trend stability across algorithms were observed in the main analysis. Sensitivity analyses identified five algorithms that produced a more stable prevalence trend compared to the reference algorithm.

Control limits in this analysis were set to have practical meaning and to detect a large difference between the observed and expected values, relative to the distribution of the observed data. Applying control limits based on meaningful cut-offs has been done before [11, 34]. The wider control limits used in the sensitivity analysis produced few changes to the overall study findings. Previous research that used a similar observed-expected control chart on hospital mortality data indicated poor specificity for control limits larger than 2*SD [34]. Our sensitivity analysis therefore allows users to compare control limits while maintaining reasonable specificity for defining an out-of-control observation. Control limits of 2*SD have been used previously when applying control charts to health data [10, 34, 41]. Other potential approaches to setting control limits include using a clinical database as the in-control reference or applying a validated algorithm and correcting for potential misclassification rates [42]. The former method requires a population-based clinical database to use as the reference, which may not always be available or accessible.

Our tests of statistical significance did not detect any differences in the frequency of out-of-control years across algorithms in the main analysis, indicating no difference in the stability of trends ascertained by different algorithms when compared to a reference algorithm. In contrast, the sensitivity analysis identified five algorithms with a more stable prevalence trend than the reference algorithm. Observation window and data source (i.e., hospital versus physician visits) did not appear to influence differences in observed trend stability. Results from the sensitivity analysis suggest some algorithms are more robust to changes in the coding process when estimating prevalence trends, but only when the specificity for detecting out-of-control years is lower.

Agreement-by-year indicated several years (e.g., 1980, 1981, and 2001) where all, or the majority of, algorithms produced an out-of-control estimate in both the main and sensitivity analyses. Previous research examining disease incidence trends over time has called for more studies to examine factors that influence those trends [43]. The years identified here could provide a starting point for identifying such factors. For example, the out-of-control estimates in 1980 and 1981 for the majority of algorithms likely reflect changes in coding patterns due to the switch from ICDA-8 to ICD-9-CM in 1979, rather than true changes in population health.

This analysis applied control charts to assess the stability of trends over time. While data quality was not directly assessed, trend stability has been used to assess data quality [44]. This is of particular interest, as administrative health data were not originally collected for research and surveillance, potentially impacting the data’s ‘fitness-for-use’. Previous research has used administrative health data in control charts; however, the primary interest was the quality of the healthcare process, not the data itself. Control charts have been used to monitor the quality of cancer registry data [45]; thus, there is a precedent for using control charts as a first step to investigating potential sources of systematic error in administrative data.

Strengths and limitations

Strengths of this study include the use of observed-expected control charts to assess trend stability. With this method, underlying risk strata were accounted for, and the calculation of control limits was appropriate for surveillance data (i.e., no grand mean was incorporated, making the limits suitable for data trending over time, and the method does not rely on previous case counts). In addition, using restricted cubic splines to model change over time relaxed the assumption of a linear effect without overfitting, which would have reduced the control chart’s ability to detect out-of-control observations. Good model fit was confirmed by the goodness-of-fit measures for the best fitting models.

There are some limitations to this study. Although control limits were set to have practical meaning, the accuracy of detecting true out-of-control estimates based on these limits was not tested. To account for this, multiple control limits were used, with the limits for the sensitivity analysis based on previous literature that used simulations to maximize out-of-control detection accuracy [10, 34].

Clinical data were not used to produce a known ‘in-control’ (i.e., not influenced by error in the data coding process) trend. Rather, a reference algorithm validated using chart abstraction was the comparator for the remaining algorithms. This provided an indication of how trend stability for the remaining algorithms compared to a proxy in-control process; however, results may differ when a clinical database is used as the standard for defining an in-control reference process.

Conclusions and future research

Control charts can be used to visualize the stability of chronic disease surveillance trends captured using administrative health data and indicate where potential systematic sources of error may affect surveillance estimates. Differences in trend stability across algorithms were observed for prevalence, but only at wider control limits. Potential areas of future research include identifying optimal control limits for trends ascertained with administrative health data. Future research should also apply control charts to other chronic disease surveillance estimates. Adaptation of control charts as a visual tool to inform policy and decision makers is also a potential area for future research.