Background

Chronic disease is a global public health issue. In 2014, more than two thirds of deaths worldwide (38 million) were attributed to chronic diseases [1], of which the majority (82%) were attributable to four chronic diseases: cancer, cardiovascular disease, chronic respiratory disease, and diabetes. This burden is mirrored in Canada where 60% of Canadians aged 20 years and older have at least one chronic disease [2]. Chronic disease is responsible for the decreased quality of life and over 200,000 Canadian deaths a year, 27% of which are premature (i.e., deaths under 70 years of age) [1,2,3].

While the increasing prevalence of chronic disease over time can be attributed to improved chronic disease management [4, 5], one consequence is high costs borne by the health care system due to improved care. In Canada, approximately 42% of direct medical costs and 65% of indirect costs are attributable to chronic disease [6], and individuals with at least one chronic disease are more likely to use the health care system and have higher health expenditures than individuals without chronic disease [7]. The costs are expected to increase as the prevalence of multimorbidity—the presence of two or more chronic conditions—has been increasing in recent years [8]. Multimorbid patients have higher health care expenditures than other patients due to their complexity of care [9,10,11].

The best strategy to address the chronic disease burden long-term is through prevention [12]. Well-established evidence shows that four modifiable lifestyle risk factors—alcohol consumption, cigarette smoking, unhealthy diet, and physical inactivity—are major risk factors of cancer, cardiovascular disease, chronic respiratory disease, and diabetes and that two thirds of incident cases of these chronic diseases are caused by these risk factors [12]. Despite this knowledge, designing and implementing prevention strategies to reduce chronic disease is difficult. Part of the difficulty is that there is no straightforward way to predict how a prevention strategy targeted towards unhealthy lifestyle behaviors will reduce the incidence of multiple chronic diseases simultaneously. Currently, there are population-based prediction tools that predict the incidence of individual chronic diseases based on lifestyle behaviors [13,14,15,16], but to the best of our knowledge, there is no population-based regression prediction model that predicts the incidence of multiple chronic diseases concurrently.

To address the need for a population-based regression prediction model for chronic disease incidence, we propose the development and validation of the Chronic Disease Population Risk Tool (CDPoRT). CDPoRT will use self-reported, modifiable lifestyle risk factor information for a population-based prediction of the incidence of six chronic diseases over a 15-year period: congestive heart failure (CHF), chronic obstructive pulmonary disease (COPD), diabetes, lung cancer, myocardial infarction (MI), and stroke including transient ischemic attack (TIA). A study protocol outlining the analytical details of the development and validation of a prediction model is essential to providing transparency of the model building process and improving the quality of prognostic research [17]. This study protocol outlines the analytical approach that will be used to develop and validate CDPoRT.

Methods

Data sources

CDPoRT will use population-based survey data from the Canadian Community Health Survey (CCHS) linked to health administrative data from Ontario and Manitoba, two Canadian provinces. The CCHS is a cross-sectional survey originating in 2000 that collects personal health status, health care utilization, and health determinant data [18]. The CCHS features a multistage, stratified cluster survey design and represents over 98% of the Canadian population 12 years and older. The CCHS will provide the predictors (e.g., modifiable lifestyle risk factors) for CDPoRT.

CCHS respondents will be individually linked to health administrative data in Ontario and Manitoba housed at the Institute for Clinical Evaluative Sciences and Manitoba Centre for Health Policy, respectively. Health administrative data will be used to determine chronic disease outcomes. The data holdings from the Institute for Clinical Evaluative Sciences are hospital discharge data from the Discharge Abstract Database, physician billing claims data from the Ontario Health Insurance Plan Claims Database, cancer registry data from the Ontario Cancer Registry, demographic data from the Registered Persons Database, and cause of death from the vital statistics data, the Office of the Registrar General Deaths Database. From the Manitoba Centre for Health Policy, the data holdings to be used are hospital discharge data from the Discharge Abstract Database, physician claims data from the Medical Services Database, cancer registry data from the Manitoba Cancer Registry, demographic data from the Manitoba Health Insurance Registry, and vital statistics data.

Study design

Two sex-specific CDPoRT models will be developed and validated, one for females and another for males. There will be one derivation cohort and two validation cohorts. The Ontario derivation cohort consists of 70% of the CCHS respondents from Ontario randomly selected from the first six cycles of the CCHS—cycles 1.1 (2000), 2.1 (2003), 3.1 (2005), 4.1 (2007), 2009/2010, and 2011/2012—that permitted the linkage to Ontario administrative data. The first validation cohort, the Ontario validation cohort, will consist of the other 30% Ontario CCHS respondents. This validation cohort will allow for split-sample validation. The second validation cohort, the Manitoba validation cohort, will be all CCHS respondents from Manitoba who partook in the first six CCHS cycles and permitted the linkage to the Manitoba administrative data. The second validation cohort will enable external, geographic validation. For all cohorts, survey respondents will be excluded if they were under the age of 20 years as of the CCHS interview data or had a self-reported history of the six chronic diseases of interest based on self-report from the CCHS or algorithmic diagnosis from the health administrative data. For individuals who had multiple survey responses, the earliest record after the age of 20 years was used.

Main predictors—modifiable lifestyle risk factors

The main predictors in CDPoRT are four modifiable lifestyle risk factors: alcohol consumption, cigarette smoking, diet, and physical activity. Information regarding the lifestyle factors was consistently collected across all six CCHS cycles. Alcohol consumption was measured in terms of whether alcohol was consumed in the year prior, the frequency of drinking alcohol, the days on which alcohol was consumed in the last week, and the number of drinks per day. Cigarette smoking was measured in terms of current and former smoking habits including the number of cigarettes smoked daily, the frequency of smoking, and time since quitting smoking. Diet was captured through daily fruit and vegetable consumption based on the daily frequency consumption habits for fruits, fruit juice, starchy vegetables (e.g., potatoes), salad, carrots, and other vegetables. Physical activity was measured by calculating the average metabolic equivalents of daily leisure physical activity for a variety of leisure activities, such as walking, swimming, running, and sports. The modifiable lifestyle risk factor information measured in the CCHS will be used to create a summary predictor for each lifestyle factor (described further in the “Data cleaning and coding of predictors” section). Further details about the data collected by the CCHS for each lifestyle factors can be found in Additional file 1.

Outcome—chronic disease

Through the linkage to administrative data, individuals will be followed up longitudinally for the incidence of their first chronic disease, defined as CHF, COPD, diabetes, lung cancer, MI, and stroke including TIA. These six chronic diseases were chosen based on a combination of factors, such as their prevalence, previous causal associations with the modifiable lifestyle risk factors (e.g., smoking and lung cancer), impact on morbidity and mortality, and interest from the knowledge users. All selected chronic diseases are readily identifiable from the administrative data. Except for lung cancer, all chronic diseases will be identified from the health administrative data using chronic disease algorithms based on physician diagnosis (Table 1). All chronic disease algorithms have been previously validated in Ontario (sensitivity from 60.2–95.0%, specificity from 76.5–99.2%) [19,20,21,22,23,24]. For both jurisdictions, the same algorithms will identify chronic diseases based on the International Classification of Diseases (ICD) coding scheme using hospital discharge data (ICD-9 and ICD-10) and physician claims data (ICD-9). Lung cancer will be ascertained from each province’s respective cancer registries [22, 25]. Additionally, vital statistics data will be used to supplement chronic disease incidence by attributing causes of death due to a chronic disease of interest as a diagnosis. This is essential because some chronic diseases would be underreported if only the health services databases were used. For example, some individuals suffer and die from stroke outside of the health care system, so stroke incidence would be underreported if vital statistics were not used. Respondents will be followed up from the date of the CCHS interview up to a maximum of 15 years (January 2000 to December 2014) until one of the following events: incidence of chronic disease, death free of chronic disease (a competing risk), or end of follow-up (December 31, 2014).

Table 1 Chronic disease ascertainment algorithms and their predictive accuracy

Sample size

The Ontario derivation and validation cohorts have already been created but not used for prediction modeling. The female and male derivation cohorts consist of 47,960 and 38,267 respondents with 365,771 and 291,638 years of follow-up, respectively. The minimum sample size required for survival analysis-based prediction is based on the events where the cohort should have at least ten events per degrees of freedom (df) used in the model [26]. There will be 7035 and 6220 chronic disease events for females and males in the derivation cohort, which allows for a maximum of 703 and 622 dfs in the models, respectively. We do not anticipate exceeding 622 dfs during model development.

For validation studies of prognostic models, it has been recommended that the validation cohort has a minimum of 100 events, but ideally 200 (or more) events [27]. The Ontario validation cohort will consist of 20,325 females and 16,627 males, of which 2972 and 2658 had a chronic disease event, respectively. At the time of publication, the Manitoba validation cohort has not been linked to the administrative data, but using the number of CCHS respondents from Manitoba and making assumptions about the proportion of individuals who consented to the linkage and did not meet the exclusion criteria based on the Ontario cohorts, the Manitoba validation cohort is expected to consist of 11,900 females and 9600 males. Using the sex-specific incidence of chronic disease in the Ontario cohorts, approximately 1800 females and 1600 males are expected to have chronic disease events in the Manitoba validation cohort. Both validation cohorts exceed the recommended number of events for validation.

Analysis plan

The analysis plan was developed with consideration of the guidelines provided by Harrell and Steyerberg [28, 29]. To protect against type 1 error from a data-driven variable selection of model specification, the predictor-outcome associations in any cohort will not be explored until the protocol is published. All steps in the analytical plan for the development and validation of CDPoRT are guided by the principle of creating a prediction model with high predictive accuracy (e.g., calibration, discrimination). The analytical plan also takes into consideration that CDPoRT will be used by knowledge users (e.g., public health units, health policymakers, and other decision-makers) that may not have extensive statistics training, so the practical application of CDPoRT is important. Features of CDPoRT that will make it more user-friendly include easy input of the predictors, straightforward and meaningful interpretations of the predictions overall and by subgroups, and consistent model application across time and geography. As a result, the practical considerations will impact how CDPoRT is built and will affect the aspects of model development, such as predictor selection and operationalization, missing data management, model specification, model estimation, validation, and model presentation. All data will be cleaned and manipulated using SAS. Modeling will be conducted using Harrell’s Hmisc package of function in R and the stpm2 command in Stata. The TRIPOD statement for multivariable predictive models also helped guide this study protocol and will be used to help report the estimates from CDPoRT [30].

Identification of predictors

The modifiable lifestyle risk factors were identified based on well-established evidence of their associations with the chronic diseases of interest [1, 31,32,33]. Other candidate predictors were identified based on a combination of subject matter expertise, input from the knowledge users, the group’s previous experiences with population-based prediction models [13,14,15,16, 34,35,36], and predictor availability across all CCHS cycles. During this phase, some predictors were excluded due to narrow distributions or insufficient variation, while others were excluded based on redundancy. There were sixteen candidate predictors identified: four modifiable lifestyle risk factors (alcohol consumption, cigarette smoking, daily fruit and vegetable consumption, physical activity), six sociodemographic characteristics (age, ethnicity, immigration status, household income, education, marital status), and six other health-related factors (asthma, body mass index (BMI), high blood pressure, household secondhand smoke, self-rated health, life stress). Self-reported BMI is known to be biased, so a validated correction equation will be used [37]. Because the biological underpinnings of chronic disease vary by sex, two CDPoRT models will be created.

Data cleaning and coding of predictors

Continuous predictors will be examined using descriptive statistics, histograms, and box plots to examine their distributions. Incorrect values will either be corrected or set to missing. Continuous predictors with highly skewed distributions will be truncated to the 99.5th percentile. Continuous predictors that can be meaningfully categorized will be grouped; for example, BMI will be classified according to internationally recognized groups (i.e., underweight, normal weight, overweight, and classes I to III obesity). The frequency distribution of categorical predictors will be examined, and categories which are too small (i.e., < 5%) will be grouped with another category. Based on previous experiences, we will be deriving some predictors based on a combination of survey responses. For example, alcohol consumption status will be defined based on whether the person reported drinking in the past year, the number of times drank in the past week, and the total number of drinks consumed while considering sex-specific differences (i.e., non-drinker, light drinker, moderate drinker, heavy drinker, binge drinker). Predictors will also be created with consideration for how they were defined in similar population-based prediction models [13,14,15,16, 34,35,36]. The definitions for all predictors have been pre-specified to minimize overfitting (Table 2). However, we recognize that there is some subjectivity in how these classifications are made, so sensitivity analyses of the predictive performance of other definitions of the predictors will be performed to examine the robustness of our definitions during model building and validation.

Table 2 Pre-specified definitions of CDPoRT predictors

Missing data

Variables with missing data will be categorized as a separate category. This will allow users of the tool to include all respondents during CDPoRT application. While defining a missing category may introduce some classification bias, excluding incomplete cases will add bias by making the cohort less representative of the population. We expect minimal bias from creating a missing category because most predictors have less than 1% of values missing. Income is expected to be an important predictor for chronic disease incidence, so imputed values of income will be used in the cycles in which Statistics Canada provided computed income values.

Model estimation

The initial models will be estimated using the Royston-Parmar model [38]. The distinguishing feature of this model is that the baseline cumulative hazard function is modeled as a restricted cubic spline which permits estimation of absolute measures of effect (e.g., cumulative incidence) at all time points. While the Royston-Parmar model is generally robust to the number and placement of knots [38,39,40], it is recommended to test the robustness of the baseline function by varying the number and placement of knots [41]. To do this, different baseline functions with a varying number of knots (between two and six knots) evenly distributed over the survival times will be compared via information criterion (i.e., Akaike information criterion (AIC), Bayesian information criterion (BIC)). If there is a baseline function with a clear best model fit, that function will be used. If the model fit is similar between multiple baseline functions, the function with the fewest number of knots will be selected to reduce overfitting that may occur during validation. The knot placement of the selected function will then be varied randomly to further test the robustness. The baseline cumulative hazard function will be modeled on the proportional hazards (PH) scale as hazards ratios are a common estimate for survival models. However, the Royston-Parmar model can also be modeled on the proportional odds or probit scale, and if the model fit is greatly improved on either of these scales, that scale will be selected instead.

The Royston-Parmar model estimates the cause-specific hazards ratio, but it can also account for competing risks by transforming the cause-specific hazards to a cumulative incidence function [42]. CDPoRT will consider death free of chronic disease as a competing risk. PH models assume that the relative hazard of a predictor is constant across time. However, this might not be true, and so violations of the PH model will be examined by plotting raw and smoothed scaled Schoenfeld residuals versus time versus predictors that are expected to vary with time (e.g., age). Any violations of the PH assumption that alter the model fit or predictive performance will be accounted for by including an interaction with time. The degree of overfitting will be estimated using the heuristic shrinkage estimator, which is based on the log-likelihood ratio χ2 statistic of the full model [43]. If the shrinkage is less than 0.9, the model will be adjusted for overfitting accordingly. If the shrinkage value is greater than 0.9, and the model does not perform well, data reduction techniques will be explored. All results will incorporate survey weights so that CDPoRT is representative of the underlying population. Variance estimates will be calculated using the bootstrap survey weights via balanced repeated replication [44, 45].

Model specification

Sex-specific models will be fitted initially with the pre-specified forms of the predictors (Table 2). Predictors that were originally continuous will also be modeled in a flexible fashion using restricted cubic splines with the knots placed evenly through the distribution. If the fit is greatly improved with the continuous form of the predictor versus the categorical form based on information criterion (e.g., AIC, BIC), and measures of predictive performance (i.e., overall fit, discrimination, calibration), then the continuous, centered form will be used. While we have pre-specified definitions of the categorical predictors to use, other definitions will be examined during model building. If any of these definitions greatly improve the model fit or predictive performance of the model, the alternate specification will be used instead. Ordinal predictors will be initially modeled as categorical, but if the fit improves when specified as a linear function, the ordinal variable will be modeled continuously. Collinearity between predictors will be assessed using the Varclus function in R. Two-way interactions between variables will be examined. The model that uses all the pre-specified forms of the predictors has 56 dfs (Table 2).

There will be a stepwise selection process for predictors. The initial model will consist of the modifiable lifestyle risk factors. The overall fit of the initial model will be analyzed in terms of model fit statistics (e.g., log-likelihood, AIC, BIC). The predictive performance of this initial model will be assessed using overall measures of predictive accuracy, calibration, and discrimination (described further in the “Assessment of predictive performance” section). Two sets of predictors (sociodemographic characteristics, other health-related factors) will be added to the model, one set at a time. Individual predictors within each set that do not improve the overall model fit (e.g., AIC, BIC) and/or predictive performance (e.g., discrimination, calibration) will be removed. When a set of predictors is added to the model, the predictive performance of the existing predictors will be examined, and if their predictive performance is minimal, the predictor may be removed. Variables excluded in prior rounds will be added back to the model to confirm whether their exclusion was appropriate. The overall fit and predictive performance after each set of predictors are added will be compared to the original model. The final model will consist of the four modifiable lifestyle risk factors and a combination of sociodemographic characteristics and other health-related factors.

Assessment of predictive performance

Predictive performance in the derivation and validation cohorts will be assessed and reported using overall measures of predictive accuracy, discrimination, and calibration. Accuracy will be measured with Nagelkerke’s R2 and Brier score. Discrimination will be assessed with Harrell’s concordance statistic. Calibration is important for prediction models because risk prediction in future settings is of primary interest [29]. Calibration will be assessed by comparing the observed and predicted risk of a chronic disease over deciles of risk using calibration plots at specific time points (e.g., 1, 5, 10, and 15 years). Calibration slopes will be generated by regressing the outcome in the validation cohort on the predicted chronic disease risk. Deviation from perfect calibration (i.e., slope of 1) will be tested with Wald or likelihood ratio tests. Calibration-in-the-small will also be assessed for subgroups of interest (e.g., cigarette smoking groups). We consider any relative difference of less than 20% between observed and predicted risk within subgroups that have at least a 5% chronic disease event rate as adequately calibrated.

Model validation

Once the model has been developed in the Ontario derivation cohort, the performance of the model will be validated within the Ontario context in terms of overall measures of predictive accuracy, discrimination, and calibration. This will provide an idea of the model’s optimism when validated in the Manitoba context. We will perform bootstrap validation within the Ontario derivation cohort to get an idea of how the model will perform in the Ontario validation cohort. We will also validate the model using the Ontario validation cohort via split-sample validation. The major drawback of split-sample validation is not expected to affect the validation as the sample sizes for the Ontario derivation and validation cohorts are large. While the Ontario cohort is being divided into parts for the split-sample validation, the final regression coefficients using the full Ontario dataset will be used to maximize the sample size and follow-up duration. The final combined model will maintain the same predictors and form as the derivation model. In the off chance that the derivation and validation cohort differ significantly, a cohort-specific intercept and/or interaction term will be included in the Manitoba model.

CDPoRT will undergo external validation using the Manitoba validation cohort. This will assess geographic validation. The performance of the model in the Manitoba validation cohort will be understood using predictive accuracy, discrimination, and calibration. As well, novel approaches to assess geographic validity will be used [46, 47]. Ideally, the CDPoRT model developed and validated in Ontario will have high predictive performance in Manitoba. However, the case-mix of individuals and their outcomes will be different between the jurisdictions, which may result in decreased model performance. In this scenario, CDPoRT in the Manitoba setting will be modified using updating methods (e.g., re-calibration, model revision, or model extension).

Model presentation

The final regression model of the Ontario CDPoRT consisting of the derivation and validation sample, as well as the Manitoba CDPoRT, will be presented using estimated hazards ratios and 95% confidence intervals. Absolute measures of effect such as baseline risk and cumulative incidence will also be presented, which is a feature of the Royston-Parmar model versus the Cox PH model because the baseline cumulative hazard function is modeled. Estimation of the baseline risk allows and permits the calculation of the attributable risk of the predictors based on the hazard ratios. Interactive, visual tools will also be created to help describe the model, which can improve the literacy of the model for non-technical audiences. The regression formula will also be published and used as the underpinning for web-based implementation.

Analyses beyond initial model development

Sensitivity analyses will be conducted to see how the model performs under different settings. One sensitivity analysis will be performed to see the algorithm’s performance in a population in which individuals were excluded based on self-reported chronic disease only and not based on an algorithmic diagnosis from the health administrative data. This is important to understand how the model will perform in real-world environments where users will not have access to health administrative data. A second sensitivity analysis will be performed where each chronic disease will be modeled separately to see if modeling each chronic disease as separate outcomes and combining results to get the overall chronic disease incidence improves predictive performance. This is a possibility as the predictor coefficients can be thought of as an averaged effect that reflects the frequency of each chronic disease. In a population with a different case-mix of chronic disease, the model may have worse predictive performance. As other chronic diseases (/conditions) remain of interest to the knowledge user [8], a third sensitivity analysis will see how the predictiveness of CDPoRT changes when other chronic conditions are included as outcomes.

Discussion

Chronic diseases are a public health priority and a great expenditure burden on the Canadian health care system. Accurate prediction of chronic disease incidence in the population based on modifiable lifestyle risk factors will help with the prevention strategies and health care delivery planning. However, public health officials and health policymakers do not have a straightforward way to estimate chronic disease incidence accurately for their jurisdiction. The development and validation of CDPoRT hopes to address these needs.

Limitations

One limitation of CDPoRT is that while the tool will be representative of most of the population (98%), some groups will not be covered, notably on-reserve Aboriginals. This is an important consideration because Aboriginals have been reported to be at greater risk for developing chronic disease [48], and a sizable proportion of the Manitoba population is Aboriginal (15%) [49]. The Aboriginal population is relatively smaller in Ontario (2%), so it should not impact the model development. However, the different case-mix is expected to impact the validation in the Manitoba context. To help with this matter, we will compare the case-mix between Ontario and Manitoba and use the information to understand how it will affect the calibration and discrimination [29]. Novel approaches to geographic validation will also be explored [46, 47]. If necessary, CDPoRT will be updated to the Manitoba setting.

A second limitation is that the measurement quality of the modifiable lifestyle risk factors varies. Cigarette smoking is the most complete measure as it considers current and past smoking history. However, pack-years smoked cannot be measured because not enough details about the smoking habits of former smokers were collected. Alcohol consumption measures drinking habits up to 1 year from the interview date but does not measure habits before this period. Diet is only captured by fruit and vegetable consumption, and dietary habits of other food groups (e.g., meats, dairy) or constituents (e.g., sodium, fat) are not captured. Specific cycles of the CCHS capture dietary habits in more details (e.g., CCHS 2.2), and if the data can be obtained, reclassification methods can be used to measure the incremental changes in the prediction if these predictors were added to the model [50]. Physical activity only measures leisure physical activity, and other forms of physical activity (e.g., transportation, work) are not measured.

Implications

CDPoRT will be developed to ensure the tool meets the needs of the knowledge user. Partnerships exist with knowledge users at the municipal and provincial level including prominent health system decision-makers (e.g., Medical Officers of Health; program managers, directors and executives in two provincial governments). This will enable CDPoRT to be applied in various regions and settings across Canada through the support of knowledge brokers and permit knowledge users to predict the incidence of multiple chronic diseases simultaneously in their population. The effectiveness of CDPoRT in these settings will be further evaluated to understand the utility of the tool in real-world settings for supporting decision-making and planning.

Conclusions

To the extent of our knowledge, CDPoRT will be the first population-based regression prediction model that will predict the incidence of multiple chronic diseases simultaneously at the population level. The tool will be used by public health and health policymakers to support planning and decision-making.