The role of cigarette smoking in the etiology of lung cancer is strong and has been so recognized since at least the 1960s [1]. In the intervening years, a great deal of evidence has accumulated confirming the strong impact of smoking on cancer and extending it in various directions, such as the impact of smoking on women’s risks, the nature of dose-response relationships, the impact of quitting, the particular relationships between smoking and each of the main histologic types of lung cancer and many more [2,3,4]. One might imagine that all of this evidence can be put together to derive very precise estimates of the quantitative impact of smoking on lung cancer risk. But our recent review showed us that the amount of published evidence that can be assembled for meta-analyses regarding exposure-response is rather limited [5]. While there have been quite a few publications showing exposure-response results for smoking and lung cancer, they use different metrics (duration, intensity, pack-years, categorical, continuous, etc.) and different strategies for control of confounding, so that the number of results that can fairly be juxtaposed or meta-analyzed, using a common exposure metric and type of model, is limited. Providing high quality statistical evidence on smoking–lung cancer associations, including by histologic subtypes, remains an important objective because such information will be useful for public health purposes, to build lung cancer risk prediction models that can be used to advise healthy patients about their risks related to their past and possible future smoking behaviors, to understand mechanisms of carcinogenesis, and to provide information that may be useful in a legal context, for compensation or litigation purposes.

In carrying out a large case-control study of lung cancer in the Montreal area to explore the possible etiologic role of scores of occupational and environmental factors in lung cancer [6], we of course collected a detailed smoking history from each subject. We use this dataset to address two objectives: a) provide evidence of exposure-response relationships between smoking history and lung cancer, using a variety of smoking metrics and formats that might be used in future meta-analyses; b) explore the relative impact of different dimensions of smoking history and how these interact in predicting risk [7,8,9,10].


Cases and controls

The population-based case–control study included all lung cancer cases among men and women aged 35–75 years who resided in Montreal or its surrounding suburbs and were Canadian citizens. Histologically confirmed incident cases of lung cancer diagnosed between January 1996 and December 1997 were ascertained through active monitoring of pathology reports in the 18 participating hospitals in the metropolitan Montreal region, providing almost complete (≈98%) coverage of lung cancer diagnoses in the area. Histology of lung cancer was coded according to World Health Organization/International Agency for Research on Cancer technical report 31 [11].

Controls were randomly sampled from population-based electoral lists and frequency-matched to cases by age group (±5 years), gender and residential area. Further details about the study can be found elsewhere [12]. Ethics approval was obtained from all collaborating institutions, and written informed consent was obtained from all participants.

Data collection

A face-to-face interview was conducted by one of our bilingual interviewers (English and French). If the subject was deceased or too ill to respond, we attempted to conduct the interview with a close next of kin proxy, usually the surviving spouse. The questionnaire was designed to collect information on sociodemographic and lifestyle characteristics, including smoking history, and a detailed semi-structured history of all jobs ever held.

Smoking history

Detailed self-reported information was collected about cigarette smoking habits including smoking status, ages at initiation and cessation, periods of interruption and average number of cigarettes smoked per day over the subject’s lifetime. Smokers were defined as those who smoked regularly (at least one cigarette per week) during at least 6 months, and at least 100 cigarettes in their lifetime, the others being considered “never smokers”. Since early symptoms of lung cancer can lead to changes in smoking behavior, we discounted the two years before the index date in computing each of the smoking variables, in order to avoid reverse causality bias. This cutpoint of two years was recommended by Leffondré et al. [7], based on their fitting of models with different cutpoints. Thus “current smokers” were defined as subjects who still smoked at interview or had quit smoking less than 2 years before the reference date (i.e. date of diagnosis for cases and date of interview for controls), and former smokers were those who quit at least 2 years before this reference date. Smoking duration was defined as the difference between age at index date (for current smokers) or age at cessation (for former smokers) and age at initiation, minus the total duration of any temporary cessation periods. Cumulative smoking exposure was represented by two alternative constructed variables: pack-years and cumulative smoking index (CSI). The pack-years variable was computed by multiplying the average number of cigarettes smoked per day by duration of smoking in years, and dividing by 20 (cigarettes per pack). The CSI is an index comprising all the smoking dimensions collected from study subjects in a function that is biologically motivated and that optimizes predictive power [13, 14]. Leffondré et al. proposed a modified version of CSI, adapted specifically for lung cancer, and demonstrated that the resulting aggregate exposure measure improved the fit of data, compared with conventional modeling of separate effects of different smoking components [14].

The equation is:

$$ \mathrm{CSI}=\left(1-{0.5}^{{\mathrm{dur}}^{\ast }/\tau}\right)\left({0.5}^{{\mathrm{tsc}}^{\ast }/\tau}\right)\ln \left(\operatorname{int}+1\right), $$

where

$$ {\displaystyle \begin{array}{l}\mathrm{tsc}=\mathrm{time}\ \mathrm{since}\ \mathrm{cessation},\\ {}{\mathrm{tsc}}^{\ast }=\max \left(\mathrm{tsc}-\delta, 0\right),\\ {}\mathrm{dur}=\mathrm{duration},\\ {}{\mathrm{dur}}^{\ast }=\max \left(\mathrm{dur}+\mathrm{tsc}-\delta, 0\right)-{\mathrm{tsc}}^{\ast },\\ {}\operatorname{int}=\mathrm{average}\ \mathrm{daily}\ \mathrm{amount}\ \mathrm{smoked}\ \mathrm{in}\ \mathrm{cigarettes},\\ {}\delta =\mathrm{lag}\ \mathrm{between}\ \text{`causal action'}\ \mathrm{and}\ \mathrm{disease}\ \mathrm{detection},\\ {}\tau =\mathrm{biological}\ \mathrm{half}\text{-}\mathrm{life}\ \mathrm{of}\ \mathrm{tobacco}\ \mathrm{carcinogens}.\end{array}} $$

The latter two parameters, δ and τ, are estimated by trial-and-error so as to optimize the fit to data [13, 14].
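As an illustration of the two cumulative metrics defined above, the following minimal Python sketch computes pack-years and the CSI for a single subject. The function names `pack_years` and `csi` are hypothetical and chosen here for readability; the sketch is not the code used in the study, and the half-life and lag are passed in as arguments rather than estimated.

```python
import math


def pack_years(cigs_per_day: float, duration_years: float) -> float:
    """Pack-years: average cigarettes per day times years smoked, divided by 20."""
    return cigs_per_day * duration_years / 20.0


def csi(duration: float, time_since_cessation: float, intensity: float,
        half_life: float, lag: float) -> float:
    """Cumulative smoking index following the equation given in the text.

    duration             -- years of smoking (dur)
    time_since_cessation -- years since quitting (tsc); 0 for current smokers
    intensity            -- average cigarettes per day (int)
    half_life            -- tau, in years
    lag                  -- delta, in years
    """
    tsc_star = max(time_since_cessation - lag, 0.0)
    dur_star = max(duration + time_since_cessation - lag, 0.0) - tsc_star
    return ((1.0 - 0.5 ** (dur_star / half_life))
            * 0.5 ** (tsc_star / half_life)
            * math.log(intensity + 1.0))


# Illustrative subject (hypothetical values): 40 years of smoking, 20 cigarettes/day,
# evaluated with a half-life of 26 years and a lag of 1 year.
current = csi(duration=40, time_since_cessation=0, intensity=20, half_life=26, lag=1)
former = csi(duration=40, time_since_cessation=10, intensity=20, half_life=26, lag=1)
```

In this sketch a never smoker (intensity 0) receives a CSI of 0 because ln(0 + 1) = 0, and a former smoker receives a lower CSI than a current smoker with the same duration and intensity, since the second factor decays with time since cessation.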

Other covariates

Detailed information was collected on sociodemographic characteristics, including ethnicity, education and family income.

In addition, from the detailed employment history and description of each job, a team of chemists and industrial hygienists examined each completed questionnaire and translated each job into a list of potential exposures using a checklist of 294 agents that included many IARC-recognized Group 1 Lung Carcinogens [15, 16].

Statistical analysis

Analyses were performed separately for men and women, and either with all histologies combined or by histologic type. When simply using the term “lung cancer”, we mean all histologies combined.

All associations were estimated using multivariable unconditional logistic regression. When several variables were tested simultaneously, Wald statistics were used to compare the contribution of each variable in a model while Akaike information criterion (AIC) was used to compare the goodness of fit between the different models.

We assessed the relations between lung cancer and various smoking metrics, including duration, daily intensity, time since cessation, pack-years, and CSI. For CSI, the parameters were set a priori to values established by Leffondré et al.: half-life = 26 years and lag = 1 year (males) or 0.7 year (females) [14]. Initially the smoking metrics were analyzed one at a time. Subsequently we conducted analyses with selected multiple smoking metrics in the same models. Analyses involving the time since cessation variable were performed among smokers only. For models involving all subjects, with nonsmokers being the reference group, an indicator of ever smoking was used and continuous smoking variables were centered by subtracting the mean value of the smoking variable among all smokers from each smoker's original value, while keeping 0 for never smokers [7]. For each model, the smoking variables under study and the non-smoking covariates were forced into the model.
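The centering scheme described above can be sketched as follows. This is a minimal illustration with hypothetical names (`center_among_smokers`) and made-up input values, not the study's own code: the continuous variable is centered at the mean among ever smokers, and never smokers keep the value 0 alongside an ever-smoking indicator.

```python
def center_among_smokers(values, ever_smoker):
    """Return (indicator, centered) for a continuous smoking variable.

    values      -- one value per subject (e.g. pack-years)
    ever_smoker -- one boolean per subject
    Smokers get value minus the smokers' mean; never smokers get 0.
    """
    smoker_values = [v for v, s in zip(values, ever_smoker) if s]
    if not smoker_values:
        raise ValueError("at least one ever smoker is required to compute the mean")
    mean_smokers = sum(smoker_values) / len(smoker_values)
    indicator = [1 if s else 0 for s in ever_smoker]
    centered = [v - mean_smokers if s else 0.0 for v, s in zip(values, ever_smoker)]
    return indicator, centered


# Hypothetical example: one never smoker and two smokers with 20 and 40 pack-years.
indicator, centered = center_among_smokers([0, 20, 40], [False, True, True])
```

With this coding, the coefficient of the indicator corresponds to the log odds ratio comparing a smoker at the mean value of the metric with a never smoker, and the coefficient of the centered variable describes the change in log odds per unit of the metric among smokers.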

Some analyses were conducted with continuous smoking variables transformed into categorical variables, while others were conducted on the continuous variables. For the latter, different functions were used to model the relations between continuous smoking metrics and the logit of the lung cancer risk, including (i) linear and (ii) logarithmic functions as well as fractional polynomials (FP) [17]. In FP analyses, for each continuous variable X, one or two terms of the form \( {X}^p \) were fitted with powers p chosen from (−2, −1, −0.5, 0, 0.5, 1, 2, and 3), where p = 0 conventionally denotes ln X, to optimize goodness of fit, i.e. minimize the model’s deviance [17].
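To make the fractional polynomial search space concrete, the sketch below enumerates the candidate transformations for the power set listed above. It is an illustration only: the helper names are hypothetical, the treatment of p = 0 as ln X and of a repeated power as an extra ln X factor follows the standard FP convention rather than anything stated in this paper, and the actual model selection in the study was carried out with the %MFP8 SAS macro, which this sketch does not reproduce.

```python
import math
from itertools import combinations_with_replacement

# Candidate powers, as listed in the text.
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_term(x: float, p: float) -> float:
    """Single FP term: x**p, with p = 0 read as ln(x). Requires x > 0."""
    if x <= 0:
        raise ValueError("fractional polynomial terms require x > 0")
    return math.log(x) if p == 0 else x ** p


def fp2_terms(x: float, p1: float, p2: float):
    """Two FP terms; a repeated power uses x**p and x**p * ln(x) (assumed convention)."""
    first = fp_term(x, p1)
    second = first * math.log(x) if p1 == p2 else fp_term(x, p2)
    return first, second


def candidate_power_sets():
    """All one-term and two-term power combinations to be compared by deviance."""
    one_term = [(p,) for p in FP_POWERS]
    two_term = list(combinations_with_replacement(FP_POWERS, 2))
    return one_term + two_term
```

Each candidate would then be fitted in the logistic model and the one with the lowest deviance retained; the linear model corresponds to the single power p = 1 and the logarithmic model to p = 0.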

The following covariates were included in all models: age (continuous), respondent status (self, proxy), ethnic origin (dummy variables: French / British Isles / Italian / other Europeans / other), educational level (elementary, secondary, post-secondary), socioeconomic status (SES) as measured by median household income of the residential neighborhood, derived from census information (continuous) and exposure to those IARC Group 1 occupational lung carcinogens that had at least 1% lifetime prevalence in our study population. The following occupational exposures (lifetime prevalence as indicated) satisfied these criteria: diesel engine emissions (23.8%), crystalline silica (15.9%), benzo[a]pyrene (15.3%), chrysotile asbestos (10.9%), nickel and its compounds (6.2%), chromium VI and its compounds (4.5%) and cadmium and its compounds (2.2%). These were included in the models as qualitative ordinal variables for men: no exposure, ‘non-substantial’ exposure and ‘substantial’ exposure, where the two exposure subsets were distinguished by duration of exposure, concentration of exposure and number of hours per week of exposure. Among women, due to much lower prevalence of occupational exposures, binary variables (ever vs never exposed) were preferred.

Some sensitivity analyses were carried out with study subjects restricted to those who answered for themselves, i.e. excluding proxy responses.

The population attributable fraction (PAF) was estimated as \( \mathrm{PAF}={p}_{exp}\left(\frac{OR-1}{OR}\right) \), where \( {p}_{exp} \) represents the ratio of the number of exposed cases to the total number of cases [18]. A 95% confidence interval (CI) for the PAF was derived by replacing the point estimate of the relevant OR by, respectively, the lower and upper boundaries of the corresponding 95% CI.
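The PAF formula and its substitution-based interval can be sketched in a few lines of Python. The function names are hypothetical, and the inputs in the example (the proportion of male cases who ever smoked and the male OR with its 95% CI) are taken from figures reported elsewhere in this paper purely for illustration; the resulting values are not results reported by the study.

```python
def paf(p_exp: float, odds_ratio: float) -> float:
    """Population attributable fraction: p_exp * (OR - 1) / OR."""
    if odds_ratio <= 0:
        raise ValueError("the odds ratio must be positive")
    return p_exp * (odds_ratio - 1.0) / odds_ratio


def paf_with_ci(p_exp: float, or_point: float, or_lower: float, or_upper: float):
    """Point estimate and interval obtained by plugging in the OR's CI bounds."""
    return paf(p_exp, or_point), paf(p_exp, or_lower), paf(p_exp, or_upper)


# Illustrative inputs: 97.6% of male cases ever smoked; OR = 7.82 (95% CI 4.59-13.30).
estimate, lower, upper = paf_with_ci(0.976, 7.82, 4.59, 13.30)
```

Because only the OR bounds are substituted, this interval treats the exposure prevalence among cases as fixed and therefore reflects the uncertainty in the OR alone.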

All statistical analyses were performed using SAS® 9.4 software. The %MFP8 macro was used for determining the transformation of continuous variables using fractional polynomials [19].


Selected characteristics of the study population

A total of 1434 eligible lung cancer cases were invited to participate in the study and, of those, 1203 (84%) agreed to participate. Of the 2182 population controls approached, 1513 (69%) agreed to participate. Of the 2716 participating subjects, 11 were excluded due to missing smoking information. Table 1 presents the main characteristics of the 2705 subjects included in this analysis. Briefly, 60% of the study subjects were male, a great majority were French Canadian and the mean age of our population was 63.4 years (SD = 8.5). For both genders, cases were more likely than controls to be of French ancestry (p < 0.001), to have a lower educational level (p < 0.001) and a lower reported family income (p ≤ 0.001), and to have had a proxy respond on their behalf (p < 0.001). Adenocarcinoma was the predominant histologic type among women, whereas squamous cell carcinomas were more prevalent among men. Information about lifetime prevalence of occupational exposure to IARC Group 1 lung carcinogens among the study subjects is shown in Additional file 1: Table S1.

Table 1 Selected demographic characteristics of subjects in the lung cancer study, by sex, Montreal, 1996-2000

Characteristics of smoking histories of lung cancer cases and controls

Smoking characteristics of the study population are presented in Table 2. Nearly all cases (97.6% of male cases and 93.1% of female cases) had regularly smoked cigarettes, compared to about two-thirds of controls (82.3% of men and 49.3% of women). Very few smoking cases or smoking controls had smoked for fewer than 20 years, or had averaged fewer than 20 cigarettes per day, or had accumulated fewer than 20 pack-years of smoking. As expected, compared with smoking controls, smoking cases had longer durations of smoking (p < 0.001), higher daily intensities (p < 0.001), younger ages at starting smoking (p < 0.001) and shorter periods since quitting smoking (p < 0.001).

Table 2 Characteristics of smoking histories of lung cancer cases and controls

Lung cancer risk in relation to smoking

As expected and shown in Table 3, for both sexes, subjects who had been regular smokers were at higher risk of developing lung cancer than nonsmokers. Table 3 also shows the OR of lung cancer as a function of categories of exposure for each of several smoking metrics. Risk of lung cancer increased with duration of smoking and with intensity of smoking, as well as with the cumulative exposure measures, pack-years and CSI. All of the trend tests across the categories of these variables were highly statistically significant (p < 0.0001). For both sexes, our results highlighted a positive and approximately linear association between smoking duration and the logit of the lung cancer risk. Smoking cessation at least 2 years before the reference date was associated with a reduction of the risk of lung cancer for both sexes (p < 0.0001; data not shown). For the models exploring the role of the duration, intensity, cumulative index (i.e. pack-years) or time since cessation of smoking as categorical variables, the introduction of a second smoking variable (respectively intensity, duration, smoking status ± time since cessation, or pack-years of smoking) had no noticeable impact on the risk estimates of the primary exposure metric, though as expected, the inclusion of a second source of information on smoking improved the model fit (data not shown). In both sexes, duration of smoking was a stronger predictor of risk than daily intensity of smoking, as indicated by Wald test statistics.

Table 3 OR estimates between various smoking-related variables (categorical) and lung cancer risk

Models based on continuous smoking variables are presented in Table 4. For both sexes and among all the models tested, the one built with the CSI as a linear variable provided the best fit, as indicated by the lowest AIC. Another model that included two smoking dimensions (i.e. pack-years and time since cessation) came close in terms of AIC. Whereas a linear association with the logit of the lung cancer risk was consistently observed for duration of smoking, the use of logarithms of the values of intensity or pack-years provided a better fit than the models that used untransformed values. More complex fractional polynomial models, shown in footnotes of Table 4, did not substantially improve the fit over the linear or log transformation.

Table 4 Odds ratios between various smoking-related variables (continuous) and lung cancer risk

Histological types

Table 5 shows ORs for selected smoking metrics with men and women combined to optimize power, for all lung cancers combined and for each histologic type. All histologic types of lung cancer were strongly associated with smoking. The ORs between ever regular smoking and lung cancer were close to 10, considering the statistical variability, for each of squamous cell, large cell and adenocarcinoma, and it was undefined for small cell lung cancer because there were no non-smokers among the 205 cases.

Table 5 OR estimates between various smoking-related variables (categorical) and lung cancer risk by histology, males and females combined

Sensitivity analysis limited to self-respondents

Since cases were more likely than controls to have had a proxy respond on their behalf, we adjusted for proxy status in the main analyses. But, as a further insight into the possible impact of using proxies, we re-ran some analyses among self-respondents only, and compared these with the results of combining self and proxy respondents (see Additional file 2: Table S2). Among men, the adjusted OR for a binary indicator of ever smoking and lung cancer was 7.82 (95% CI [4.59–13.30]) in analyses that included all respondents, and 6.96 (95% CI [3.82–12.68]) in analyses restricted to self-respondents. Among women the OR was 11.76 (95% CI [7.50–18.42]) in analyses including all respondents, and 12.17 (95% CI [7.45–19.85]) in analyses restricted to self-respondents. When considering the CSI index, the OR corresponding to a one unit increase in CSI was slightly lower among self-respondents than among all respondents. But in all of these contrasts, the confidence limits between self-respondent and all respondent results overlapped considerably, and none of the substantive inferences would have changed.

Population attributable fraction

Given the overall OR estimate, its confidence limits and the observed prevalence of smoking among cases (96%), we estimate that the population attributable fraction was 0.858 (95% CI 0.819–0.887). The estimates were almost identical in men and women.


Although tobacco smoking is the main cause of lung cancer in humans, there is no widely accepted estimate of the exposure-response relationship between smoking and lung cancer. This is partly due to a false impression that the smoking-lung cancer association is so well established that there is little to be learned from additional attention to the issue. Because the analysis and description of dose-response relationships involves such idiosyncratic methodologies and parametrizations, there are really not a tremendous number of published results that can be usefully assembled for meta-analyses or other attempts at synthesizing knowledge. Finally, there remains a great deal to learn about the mechanisms of carcinogenesis by studying valid and generalizable dose-response relationships. The addition of new evidence from studies such as ours will hopefully increase the likelihood that reasonably representative estimates can be derived from the world body of evidence.

As well as describing the smoking history of a North American study population at the end of the twentieth century, this paper provides new estimates of the relationship between cigarette smoking and lung cancer risk for both sexes.

There were significant changes in the design of cigarettes (filters, “light”, etc.) in the 1960s and 1970s. Many previous studies were conducted from the 1950s to the 1980s and evaluated the risks of smoking the types of cigarettes marketed in the middle of the twentieth century. In contrast, our study covers a population that was largely exposed to the types of cigarettes that came into widespread usage in the last part of the twentieth century. Cigarette smoking was a prevalent habit in this population, certainly among men, but even among women. Among men, 97.5% of cases and 82.3% of controls had ever smoked regularly, while among women the corresponding figures were 93.1% and 49.4%. These numbers are of the same order of magnitude as those observed in a previous case-control study conducted in Montreal in the 1980s [20] and in others in the United States [21]. Further, there were very few short duration or low daily intensity smokers.

To provide useful and relevant information for various purposes, different parametrizations and different statistical risk models were created. There are many more published results with categorical parametrization of smoking variables, namely duration, intensity and pack-years, than there are with any particular continuous parametrizations of smoking variables. There are more ways to model the shapes of continuous variables, while categorizations have more limited options for summarization. While there are advantages to the flexibility of continuous variable modelling in a given study, when a meta-analysis of many studies is called for, using the more common categorical variables is a useful strategy. Models were also built either with each smoking variable studied separately or combined with each other.

For the association between the binary smoker/nonsmoker variable and lung cancer, we observed an OR of 7.82 (95% CI [4.59–13.30]) for men, a risk slightly lower than the one we derived from a meta-analysis covering studies conducted in North America and Europe [5]. Our estimate is slightly higher than the one (OR = 6.18, 95% CI [5.49–6.95]) observed in a recent meta-analysis based on a larger number of studies [3], but those included studies in other parts of the world which were not comparable to ours in terms of history of smoking and ethnic profile of the population. For women, the corresponding OR is 11.76 (95% CI [7.50–18.42]), which is higher than the one (OR = 4.43, 95% CI [3.84–5.10]) computed in the meta-analysis cited above [3]. Gender susceptibility to cigarette smoking-attributable lung cancer, a topic under debate in recent years [22,23,24,25,26], will be addressed in a separate paper.

There is more published evidence concerning smoking duration and intensity as distinct predictors of risk than any other metrics. Of these, duration of regular smoking seems to be the more important predictor of lung cancer risk [4]. For both sexes, our results highlighted a positive and approximately linear association between smoking duration and the logit of the lung cancer risk. In our analyses, particular attention was paid to the reconstruction of the smoking duration variable, which excluded any temporary or permanent periods of cessation. It is conceivable that the lower predictive value of smoking intensity is due to the fact that duration of smoking may be recalled and reported with greater accuracy than average daily intensity over the lifetime smoking history. Obtaining a good estimate of smoking intensity is very challenging due to the potential variability in the true number of cigarettes smoked per day over different periods within a long smoking “career”, and the difficulty of recalling such variability over a long time span. The estimate used in our study was based on the reported average number of cigarettes smoked per day throughout the smoking history of the subject, an estimate often used in epidemiological studies [27]. Log-transformation of smoking intensity yielded the best fit to the data, suggesting that increasing daily intensity by a fixed number of cigarettes per day raises risk more for light smokers than for heavy smokers, as suggested elsewhere [10, 28].

“Pack-years” smoked, calculated as the average number of packs of cigarettes smoked per day multiplied by the cumulative number of years during which a person smoked, is the simplest and most commonly reported cumulative smoking exposure metric. Use of the “pack-year” index has been criticized [8, 29, 30] on the grounds that it gives equivalent weight to intensity and duration as contributors to the risk of lung cancer, and that it does not explicitly account for time since quitting. Nevertheless, the pack-years variable has the virtue of simplicity; it partially accounts for time since quitting, since a longer time since quitting implies a shorter duration; and it is a strong predictor of the risk of various smoking-related diseases [31,32,33]. For some purposes, this simple but powerful metric is perfectly adequate, while for other purposes, more sophisticated treatment of the components of a smoking history might be indicated. The CSI [34] is one such possible metric and, as shown by Leffondré et al. [14] and here, it is a very effective smoking metric that performed very well compared with other models in predicting lung cancer risk.

In models using continuous smoking variables, all metrics had strong effects on OR and mutual adjustment among smoking metrics did not noticeably attenuate the OR estimates, indicating that each metric carries some independent risk-related information.

In any case, each of these metrics is an imperfect measure of inhaled dose because of intra- and inter-individual variation in: (i) depth of inhalation, (ii) number of puffs taken per cigarette and (iii) retention time in the lung [4].

Finally, in our study population, we estimated that, irrespective of gender, about 86% (95% CI [81–89%]) of the lung cancer cases were attributable to cigarette smoking. That proportion is consistent with previous findings [35, 36].


Besides presenting a portrait of the smoking-lung cancer association in North America at the end of the twentieth century, this study provides a panorama of estimates of this association derived from several modeling strategies and several parameterizations of the smoking variables. This provides new material for future meta-analyses using any of the smoking metrics presented here. Among the notable substantive findings are: the high risk estimates of all types of lung cancer due to smoking despite the changes over time in composition and tar output of cigarettes; the high risk estimates among women, and consequent high attributable fractions; the clear message of increased risk with even the simplest metrics of exposure; and the apparently stronger influence of duration of smoking than daily amount smoked on risk of lung cancer.