Background

Lung cancer is a leading cause of global cancer incidence and cancer-related mortality with close to 2 million cases in 2020 [1]. In the UK, lung cancer is the third most common cancer and the most common cause of cancer mortality [2]. Computerised Tomography (CT) for lung cancer screening has been shown to reduce the risk of lung cancer mortality [3]; notably 20% reduction in the National Lung Screening Trial (NLST) trial [4] and 24% reduction in the Dutch–Belgian Lung Cancer Screening Trial (NELSON), respectively [5].

Lung cancer screening trials using CT scans have mainly used two risk factors, age and smoking history, in their inclusion criteria for the identification of high-risk populations [4,5,6,7,8]. The United States Preventive Services Task Force (USPTF) uses a ‘pack-year criteria’ and recommends lung cancer screening between age 50–80 years for those who were smokers within the past 15 years and who have a smoking history of 20 pack years or more [9]. Despite this, only 5–18% of eligible patients are being screened [10]. In recent years, various lung cancer risk prediction models have been developed and found to have high discriminatory power in identifying those with a high risk of lung cancer [3, 11]. Commonly included risk factors in these models are comprehensive information on smoking history (duration, quit time and intensity), family history of lung cancer and lifetime asbestos exposure [3, 11]. Implementation of the existing lung cancer risk prediction models into routine clinical practice would necessitate patient contact as electronic medical records (EMR) are unlikely to have the required granularity of information on component risk factors [12,13,14].

A general practice (GP) electronic health records-based risk score is important in the UK healthcare setting context as all existing cancer screening programmes (cervical, breast and bowel) in the UK are based on general practice registrations (https://www.cancerresearchuk.org/about-cancer/cancer-symptoms/spot-cancer-early/screening/what-is-cancer-screening). In the UK, lung cancer screening ‘pilots’-Targeted Lung Health Check (TLHC) projects for high-risk individuals have been implemented in certain areas by NHS England [15, 16]. High-risk participants are identified by a two-step process and those eligible on assessment are offered low-dose CT screening [15]. The first step involves identifying participants who have ever-smoked, those registered with a GP practice and those aged between 55 and 75 years, using the general practice EMR. The second step involves a comprehensive assessment, which includes a spirometry test and a discussion to assess participants’ individual lung cancer risk. The Lung Health Check programme uses two of the existing lung cancer risk prediction models for step two: Prostate Lung Colorectal and Ovarian (PLCO)M2012 and Liverpool Lung Project version 2 (LLPv2) [17]. Participants found to have a risk threshold of ≥1.51% risk of lung cancer over 6 years as the minimum threshold for PLCOM2012; and ≥2.5% risk of lung cancer over 5 years for LLPv2 in step two are eligible for a low-dose CT screening [3, 11, 15]. EMRs have been used to predict lung cancer risk in symptomatic patients [18], and other studies in the general population are underway [19]. In the UK context, cancer screening is based on general practice registrations and in pilot studies for lung cancer screening, only every-smoked criteria has been extracted from GP records for identifying those at high risk.

We hypothesize that general practice EMRs can improve the identification of individuals at high risk of lung cancer compared to the ever-smoked criteria by extracting sociodemographic, lifestyle factors and long-term conditions prevalence. The study objectives are to calculate a lung cancer risk prediction model from a routinely collected general practice data source; validate the new model in another data source; assess discrimination and calibration of the new model; and compare the new model accuracy against ever-smoked criteria and other commonly used lung cancer risk prediction models (PLCOM2012 and LLPv2).

Methods

Data sources and study population

The development cohort was identified from Secure Anonymised Information Linkage (SAIL Databank). SAIL databank has data from primary care EMRs from Wales and covers ~70% of the population of Wales, representative of the wider population in terms of age, sex, and socioeconomic deprivation [20, 21]. We identified all participants that were currently registered with a participating practice on January 1, 2011 and who had been registered for a full year prior to this date, as electronically coded data was most complete from this point [21]. These data were regarded as the start of the follow-up period for SAIL analysis. The validation cohort was UK Biobank, a population-based cohort study which includes 502,640 participants enrolled from 22 different centres across England, Scotland, and Wales between 2006 and 2010 (5% response rate). At present, linked primary care data are only available for a subset of participants. The availability of primary care data depended on the electronic medical record system used by the practice (data only currently available from certain systems) rather than any participant-level factors. This subset is representative of the UK Biobank cohort [22]. The date of recruitment for a participant to UK Biobank was regarded as the start of the follow-up period. The age range for SAIL was 55 to 75 years while in UK Biobank it was 55–73 years as that was the maximum age of participants recruited to UK Biobank. Participants with a previous history of lung cancer (based on lung cancer registry records) were excluded in both cohorts.

Predictor and outcome variables

Age at baseline was used as a continuous variable in both datasets. Sex was used as a categorical variable. The Welsh Index for Multiple Deprivation (WIMD) score [23] in SAIL and Townsend score [24] in UK Biobank, respectively, were used to measure socioeconomic status and the scores were divided into quintiles. WIMD is the Welsh Government’s official measure of relative deprivation for small areas in Wales. It identifies areas with the highest concentrations of several different types of deprivation. WIMD ranks all small areas in Wales from most deprived to least deprived [23]. The Townsend Deprivation Index is a measure of material deprivation first introduced by Peter Townsend in 1987. A Townsend score can be calculated using a combination of four census variables for any geographical area (provided census data is available for that area): households without a car, overcrowded households, households not owner-occupied, persons unemployed [24]. A previous study has found association between Townsend score and lung cancer risk among symptomatic patients attending GP practices in the UK [18]. In both datasets, smoking status was defined using general practice Read codes and divided into three categories: non-smokers, previous smokers, and current smokers, using Read codes from previously published studies [12] (please see Supplementary File Table S1 for read codes used). There was a considerable heterogeneity in the duration between recording of the smoking status and the start of the follow-up period across the two cohorts. Supplementary Table S2 shows the median duration (with interquartile range) across different smoking categories in the two cohorts.

We used a previously published list of 40 long-term conditions (LTCs), defined using Read codes in SAIL as predictors [21, 22] (please see Supplementary File Table S2 for Read2 and CTV3 codes). In addition, we also considered previously identified risk factors for lung cancer by PLCOM2012 and LLPv2 as candidate predictor variables. Painful conditions as a LTC included a broad group of conditions which included back pain, joint pain, headaches (not migraine), sciatica, plantar fasciitis, carpal tunnel syndrome, fibromyalgia, arthritis, shingles, disc problem, prolapsed disc/slipped disc, spine arthritis/spondylitis, ankylosing spondylitis, back problem, osteoarthritis, gout, cervical spondylosis, trigeminal neuralgia, and disc degeneration, using previously validated definitions [25,26,27].

In UK Biobank, participants also self-reported/underwent an examination for information on family history of lung cancer, detailed history on smoking status and body mass index, while in SAIL all candidate predictor variables were extracted from general practice records. The outcome of interest was lung cancer incidence at 6 years of follow-up, which was derived by searching for the presence and date of ICD-10 code “C34” within linked cancer registry records in both cohorts.

Statistical analysis

All statistical analysis was performed using R version 4.0.3 in R studio. For the development of lung cancer risk prediction model in the SAIL cohort, fast backward variable selection model (using P value < 0.01 as the significance level for selecting variables in the model) from rms package was used [28]. Age, sex, socioeconomic status, smoking status, and presence/absence of N = 40 LTCs were entered as predictor variables. In addition, we also considered previously identified risk factors for lung cancer by PLCOM2012 and LLPv2 as candidate predictor variables. The association of selected variables (by backward selection) with 6-year lung cancer risk was examined using logistic regression models in both development (SAIL) and validation cohorts (UK Biobank). Interaction between selected variables and smoking status was tested and if a significant statistical interaction was found, this was added to the final risk score. The discriminating ability of the newly developed EMR-based lung cancer risk score in the prediction of lung cancer was compared to ever-smoked criteria from the TLHC programme [15] in the SAIL and the UK Biobank cohorts using receiver operating characteristics (ROC) area under curve (AUC) [29]. Similarly, AUC was used to compare the EMR-based risk score against the PLCO2012 [30] and LLPv2 [17] in both cohorts. The calibration performance of the new EMR-based lung cancer risk score was evaluated using cut-off values at each 10% increment and compared against the ever-smoked criteria. This was done in both development and validation cohorts. The number of false positive, true positive, true negative and false negative cases were reported for each threshold. In addition, sensitivity, specificity, positive and negative predictive values, and balanced accuracy (formula= sensitivity + specificity/2) were also reported for each threshold [31]. The number of ever-smokers excluded, the number needed to detect one lung cancer and the number correctly excluded for every lung cancer missed were also reported for each 10% increments in both cohorts. The number needed to screen to detect one lung cancer was calculated using the formula: (false positive + true positive)/true positive cases. The number correctly excluded for every lung cancer case missed using the formula: (true negative+ false negative)/false negative. The calibration performance was compared against ever-smoked criteria (yes/no) and PLCO2012 ≥ 1.51% threshold [15].

Sensitivity analysis

The AUC analyses were repeated for the general practice EMR score and other comparator scores described above after including data from hospital admissions (in addition to general practice health records) for calculating LTC prevalence using a previously validated algorithm [32].

Sub-group analysis

The SAIL cohort was split equally into ten sub-groups for internal validation and the AUC was calculated independently to assess the discriminating power of the new score in predicting lung cancer incidence at 6 years. In addition, a sub-group analysis was conducted in SAIL cohort based on age group, sex and smoking status.

Results

Study populations

In the development cohort (SAIL), ~2.3 million patients were registered with a GP practice at the start of the follow-up period (January 1, 2011). The study population was N = 574,196, aged between 55 and 75 years, after excluding those with previous history of lung cancer. The incidence of lung cancer was 1.11% (6430 cases) at 6 years follow-up. In the validation cohort (UK Biobank), N = 137,918 met the eligibility criteria and had primary care records available, with a lung cancer incidence of 0.48% (656 cases) at 6 years (see Fig. 1 for details).

Fig. 1: Study population for the development cohort (SAIL) and validation cohort (UK Biobank).
figure 1

SAIL secure anonymised information linkage.

Table 1 describes demographic factors and smoking habits in both cohorts. SAIL cohort had a proportionately higher number of current smokers (N = 156,463; 27.25%) compared to current smokers recorded in the UK Biobank cohort (N = 19,247; 13.95%). While the UK Biobank cohort had a higher proportion of ex-smokers (N = 44,722; 32.42%) compared to ex-smokers recorded in the SAIL cohort (N = 123,274; 21.47%).

Table 1 Study participant demographics, smoking status and 6-year lung cancer incidence.

Compared to those who did not develop lung cancer, people developing lung cancer over 6 years were older in both cohorts. In both cohorts, males, current and ex-smokers and participants from socioeconomically deprived areas were observed to have a higher lung cancer incidence at 6 years. Smoking status captured by primary care records and that captured by self-report in UK Biobank had moderate correlation with Spearman correlation coefficient value of 0.405 (95% CI 0.401–0.409).

Development and validation of lung cancer risk score

Variable selection, using backward selection method, retained 17/56 predictor variables in the SAIL cohort (age, sex socioeconomic status, smoking status, family history of lung cancer, body mass index (BMI), BMI: smoking status, and presence of the following LTCs: alcohol misuse, chronic obstructive pulmonary disease (COPD), coronary heart disease, dementia, hypertension, painful condition, stroke/transient ischaemic attack (TIA), peripheral vascular disease, and history of previous cancer and previous pneumonia). The newly developed lung cancer risk score was called ALIGNED (generAL practIce lunG caNcEr moDel-aligned). The full results of variable selection are presented in Supplementary Table S4 and the results for interaction testing between selected variables and smoking status are presented in Supplementary Table S5. The relationship of these predictor variables with the 6-year risk of lung cancer incidence was assessed using multivariate logistic regression models in the development and validation cohorts, respectively, and presented in Table 2. The presence of ten LTCs was associated with a significantly higher risk of lung cancer at 6 years in SAIL cohort and five LTCs (COPD, painful condition, hypertension, peripheral vascular disease and history of previous cancer) had a significant association in both cohorts.

Table 2 Variables selected in the final model and association with 6-year lung cancer risk: multivariate logistic regression analysis.

Discrimination and calibration of ALIGNED score

The discriminating power of ALIGNED in 55–75-year-olds to detect 6-year risk of lung cancer was good, with AUC value of 80.4% in the development (SAIL) cohort (95% CI 79.9–80.9%). The discriminating power (AUC value) of ever-smoked status (yes/no) and PLCO2012 in the SAIL cohort for 55–75-year-olds was A 69.8% (95% CI 69.4–70.2%) and 80.1% (95% CI 79.6–80.6%), respectively, see Fig. 2. Table 3 reports calibration performance of the ALIGNED score using cut-off scores at each 10% decile value in SAIL. In SAIL, 70% threshold cut-off had the best value for balanced accuracy at 73.3%, while 50%, 60 and 80% thresholds, respectively, also performed better than the ever-smoked criteria (balanced accuracy=69.8%). The number needed ‘to screen’ one lung cancer case was 35, and for every 262 participants correctly excluded, one true lung cancer would be missed, with the 70% threshold of ALIGNED in SAIL (the most accurate threshold). In comparison, the number needed to screen one lung cancer case was 50 for ever-smoked criteria and 77 for PLCO2012 ≥ 1.51% in the SAIL cohort.

Fig. 2: Area under curve (AUC) of the new EMR-based lung cancer ALIGNED score for 6-year lung cancer incidence prediction.
figure 2

ALIGNED score generAL practIce lunG caNcEr model, PLCO Prostate Lung Colorectal and Ovarian PLCOM2012, LLP V2 Liverpool Lung Project version 2 (LLPv2). Left panel = findings in the development cohort (SAIL secure anonymised information linkage); right panel = findings in the validation (UK Biobank) cohort.

Table 3 Model discrimination and calibration for different cut-off values across each decile of the ALIGNED risk score and comparison against ever-smoker status in SAIL.

In the validation cohort (UK Biobank), the ALIGNED score (AUC 74.4%; 95% CI 72.3–76.6%) outperformed the ever-smoked criteria (AUC 60.2%; 95% CI 58.3–62.2%) in 55–73 years old participants (please see Fig. 2). However, the ALIGNED score underperformed PLCOM2012 (AUC 82.6%; 95% CI 80.8–84.4%) and LLPv2 (AUC 81.1%; 95% CI 79.1–83.1%) in the UK Biobank cohort. Of note, the PLCOM2012 and LLPv2 in UK Biobank was calculated using smoking information which was self-reported by the participants at the time of the recruitment. In the calibration analysis, six thresholds of the ALIGNED score from 30% onwards outperformed the ever-smoked criteria with better-balanced accuracy (see Table 4). In this cohort as well, 70% threshold of the ALIGNED score offered the best-balanced accuracy of 68% and it significantly outperformed the ever-smoked criteria (balanced accuracy=60.2%). The number needed ‘to screen’ one lung cancer case was 80, and for every 363 participants correctly excluded, one true lung cancer would be missed, with the 70% threshold of ALIGNED in UK Biobank. In both datasets, the ALIGNED score had high positive predictive values but low negative predictive values. The number of patients needed to be screened to detect one patient with lung cancer was comparatively smaller for SAIL across all the decile cut-off scores, which is expected as SAIL is the more representative cohort of the general population with higher lung cancer incidence.

Table 4 Model discrimination and calibration for different cut-off values across each decile of the ALIGNED risk score and comparison against ever-smoker status in UK Biobank.

Sensitivity analysis

We checked the accuracy of ALIGNED score after excluding patients with dementia. The AUC for both SAIL and UKB cohorts were relatively unchanged (80.3% and 74.4%, respectively) in the sensitivity analysis. The lung cancer risk scores were recalculated using hospital admission data (in addition to GP data) which improved the AUC values for ALIGNED, PLCOM2012 and LLPv2 scores in both SAIL and UK Biobank cohorts (please see Supplementary Material Table S6). In addition, the AUC for risk scores were recalculated for lung cancer incidence at 5 years in SAIL and the result trends remain unchanged (please see Supplementary Table S7).

Sub-group analysis

SAIL cohort data was split into ten equal parts, the AUC values for ALIGNED score ranged from 79.1 to 81.6% across the ten sub-groups (please see Supplementary Material Table S8). In sub-group analysis based on demographic characteristics and smoking status, ALIGNED score had the highest AUC value of 81.1% among females and lowest among current smokers at 67.1% (please see Supplementary Table S9).

Discussion

This study presents the findings of developing and validating a GP EMR-based lung cancer risk score—ALIGNED—from two large community cohorts. The new score was based on using demographic information (age, sex, and socioeconomic status), smoking status (non-smoker, ex-smoker and current smokers), body mass index, family history of lung cancer and presence of following LTCs: alcohol misuse, COPD, coronary heart disease, dementia, hypertension, painful condition, stroke/TIA, peripheral vascular disease, and history of previous cancer and previous pneumonia. The new score outperformed the ever-smoking status from TLHC in discrimination and calibration, however underperformed against the questionnaire-derived PLCOM2012 and LLPv2 risk scores. The new ALIGNED score may have a potential role in implementation of lung cancer screening at population level, crucially without requiring patient contact, while the other commonly used scores like PLCOM2012 and LLPv2 rely on information which require participant response.

A proactive approach based on individual risk prediction is being described as the future of early cancer detection, across all cancers [33]. A risk prediction model-based approach in lung cancer screening has been found to be associated with greater reduction in lung cancer mortality in clinical trials [34]. Use of EMR for lung cancer prediction has been previously examined in a lung cancer screening trial [8] and routine implementation of lung cancer screening programmes [35, 36], however, these studies have not used the EMR to generate a lung cancer risk score. A personalised lung cancer risk score generated from the EMR could potentially help with reducing perceived cancer risk—a recognised barrier in lung cancer screening participation [37, 38]—and enable those at higher risk to be provided with a personalised risk estimation. Identifying high-risk individuals at an early stage via EMR can pave the way for more personalised targeting of high-risk individuals for the next stage of risk assessment (including comprehensive smoking and family history, spirometry, +/− biomarkers/imaging) [39]. Importantly, EMR-based lung cancer risk score can be calculated without any patient contact which means that everyone can be assessed, not just those who attend for a complete assessment, which has the potential to reduce health inequalities.

We compared cumulative lung cancer incidence and AUC values in our cohorts with those from previous studies. The observed 6-year lung cancer cumulative incidence among ever-smokers was 5257 (2%) in SAIL (age 55–75 years) and 375 (0.8%) in UK Biobank (age 55–73 years), respectively. In comparison, cumulative lung cancer incidence was 0.85% (46/4380) at 5.5 years in the NELSON trial cohort [40], 0.80% (599/75,958) at 5 years in UKLS trial cohort [41], 1.84% (1463/79,209) in PLCO trial cohort and 3.73% (1925/51,527) in NLST trial cohort respectively at 6 years [11]. In our study, the new EMR-based lung cancer risk score had an AUC value of 80% (SAIL) and 74% (UK Biobank). Previous studies have reported AUC values of 67%, 73%, 73%, 74 and 81% for LLP V2 risk score in different cohorts, respectively [11, 41, 42]. Similarly, reported AUC values for PCOM2012 risk score have been 68%, 74%, 76 and 78%, respectively, in different cohorts [11, 42]. The ALIGNED score includes LTCs which have not been previously included in any previous lung cancer risk scoring system, however, previous research has found similar associations. A higher risk of lung cancer has been observed with heavy alcohol consumption: in particular, beer drinking was associated with higher risk of squamous cell carcinoma [43, 44]. A meta-analysis of 16,849 patients found a higher prevalence of lung cancer among those with peripheral vascular disease [45]. A cohort study in Germany involving 18,668 found a higher risk of intrathoracic cancers among stroke survivors, in both men and women [46].

Using a large primary care representative cohort for development (SAIL) and another large community cohort (UK Biobank) for validation is one of the key strengths of this study. However, we acknowledge some limitations, particularly relating to the UK Biobank cohort. UK Biobank participants have been found to be less socioeconomically deprived, have fewer lifestyle risk factors and lower prevalence of LTCs than the UK population [47]. However, despite lower absolute risk of lung cancer and other diseases, hazard ratios derived from UK Biobank should still be applicable to the wider UK population [48]. There was significant heterogeneity in the duration between recording of smoking status in GP EMR and the start of the study follow-up across different smoking categories in the two cohorts which would have likely influenced the accuracy of smoking status considered in both cohorts. In this study, we could not use free text data available in GP EMR for extracting more detailed information on predictor variables, including smoking behaviour as free text data was not available for research purposes in both these cohorts. Use of free text data and natural language processing methods have the potential to improve the prediction power and accuracy of lung cancer risk. The correlation between GP EMR-based classification and self-reported classification of smoking status had moderate correlation so there is a possibility of misclassification and measurement error when using GP records for capturing smoking status.

A GP EMR-based lung cancer risk score-LUCGPEHR was validated using two large UK community cohorts and comprised of demographic information (age and socioeconomic status), smoking status, body mass index, family history of lung cancer, and presence/absence of ten LTCs. The crucial benefit of the ALIGNED score is not requiring patient contact for score calculation. The ALIGNED score outperformed the ever-smoked criteria used in the TLHC programme. The ALIGNED score may have a potential role as a “first step” in the implementation of lung cancer screening by facilitating more informed decision-making with personalised information available for participants. It can also be potentially used for improving screening precision by more focused targeting of high-risk individuals for lung cancer screening, however further research is urgently needed in this area.