Background

Cardiovascular, renal, and metabolic (CRM) and mental health (MH) conditions (listed in Box 1) are amongst the most common causes of death and disability globally [1,2,3,4,5], with MH conditions alone accounting for almost a third of the global burden of years lived with disability [1]. Primary care electronic health records (EHR) databases are routinely used in observational studies of the epidemiology of these long-term health conditions [6]. Clinical Practice Research Datalink (CPRD) Aurum is a relatively new primary care EHR database, with a number of strengths stemming from the richness of the nationally representative routinely collected data, which captures patient demographics, diagnoses, test results, and prescriptions for over 19 million patients [7]. However, there are recognised limitations to EHR data, and there are inevitably disparities between self-reported health status and conditions reported in EHRs, with variation in case-detection rate according to age, sex and other demographic characteristics [8,9,10]. A recent US study found varying agreement between self-reported survey answers and EHR diagnosis data, with 81% positive agreement for type 2 diabetes and 59% positive agreement for depression [9].

Objective clinical investigations are typically used to diagnose CRM conditions (e.g. glycosylated haemoglobin (HbA1c) for diabetes or computed tomography (CT) for strokes), although there is still considerable potential for both under- and over-diagnosis of these conditions [11]. On the other hand, MH diagnoses are based on clusters of symptoms with an element of subjectivity on the part of the diagnosing clinician, especially in milder cases, and there are also a number of recognised barriers to seeking help for MH conditions, including societal stigma and difficulties in asking for and accessing support, which may lead to underdiagnosis [12]. The extent of these barriers is likely to vary according to ethnicity, sex and socio-economic status [13]. Furthermore, conditions that are primarily diagnosed in secondary care might not be as well captured in primary care records where there is inefficient information transfer between hospitals and GP practices. Studies comparing primary care EHR to hospital episode statistics have shown that only around 60% of hospital admissions for stroke were recorded in primary care EHRs [14]. These factors may lead to disparities between the prevalence of diagnoses in EHRs and screen-detected prevalence estimates for MH conditions across both socio-demographic characteristics and when compared with CRM conditions. However, there is a paucity of research that has examined the extent of these disparities in primary care records for this range of conditions, particularly in CPRD and other UK EHR databases.

It is also valuable to compare the prevalence of health conditions in CPRD Aurum with those from other sources (e.g., national health surveys, screening studies) to understand the strengths and limitations of current and future epidemiological research using CPRD Aurum and other similar EHR databases. Therefore, the primary objective of this study was to describe the prevalence of selected CRM and MH conditions within this database and assess variation in the prevalence of reported conditions by categories of age, sex, ethnicity, and socio-economic deprivation. Secondly, we aimed to compare the prevalence of the conditions in this database against the prevalence within the UK (or similar countries) general population in three other sources in the literature: (1) other primary care EHR databases; (2) self-reports of doctor-diagnosed conditions in nationally representative surveys; and (3) screening studies.

Methods

Study design, data source and population

This was a cross-sectional analysis of the CPRD Aurum database, which contains routinely collected primary care EHRs from 1,444 general practices across England using EMIS Web® patient records software [7]. Clinical observations, diagnoses and treatments are recorded as Read Version 2, SNOMED-CT, and EMIS Web® clinical codes. The full data resource profile has been described elsewhere [7]. A cross-sectional dataset was extracted for analyses using the Data Extraction for Epidemiological Research (DExtER) tool [15]. Data for these analyses included all patients who were alive and permanently registered with a participating practice on 1st January 2020 (this date was chosen so that results would not be influenced by the impact of the SARS-CoV-2 pandemic on primary care activity and data recording). Patients were only included if there were at least 12 months of acceptable data recording prior to the index date (1st January 2020). Acceptable data were determined using the “acceptable patient flag” data quality measure provided by CPRD (consistent recording of events including date of birth, practice registration date and transfer out date, and valid age and gender) [7, 16]. The dataset includes patients’ year of birth, sex, ethnicity, and their socio-economic status (index of multiple deprivation (IMD) quintile). Results of this cross-sectional analysis were compared against population prevalence of these same conditions determined from a literature review.
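The eligibility rules above can be sketched in Python as follows. This is a minimal illustration, not the DExtER extraction logic: the `Patient` field names are hypothetical rather than CPRD Aurum variable names, "at least 12 months of acceptable data recording" is approximated as registration starting on or before 1st January 2019, and a death or transfer out on the index date itself is treated as exclusion.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

INDEX_DATE = date(2020, 1, 1)
LOOKBACK_START = date(2019, 1, 1)  # 12 months before the index date


@dataclass
class Patient:
    # Hypothetical field names, for illustration only.
    registration_start: date
    registration_end: Optional[date]  # None if still registered
    death_date: Optional[date]        # None if no death recorded
    acceptable: bool                  # stands in for the CPRD acceptable patient flag


def is_eligible(p: Patient) -> bool:
    """Apply the inclusion criteria described in the text (assumed edge cases)."""
    alive = p.death_date is None or p.death_date > INDEX_DATE
    registered = p.registration_end is None or p.registration_end > INDEX_DATE
    enough_history = p.registration_start <= LOOKBACK_START
    return p.acceptable and alive and registered and enough_history
```

For example, a patient registered since March 2015 with no death or transfer-out date and an acceptable flag would be included, whereas one registered in June 2019 would not.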

Selection of cardio-renal-metabolic and mental health conditions

A recent Delphi study has identified key conditions that are important to patient and research stakeholders for inclusion in research into patients with multiple long term conditions [17]. From the results of this study, and after discussions within our clinical team and patient advisory group, eight MH and ten CRM conditions were selected for inclusion in our analyses (see Box 1). We included all recommended cardiovascular conditions from the Delphi study except for venous thromboembolic disease as we have focused on chronic rather than acute conditions. We included all recommended “mental health” conditions except for autism and dementia as these are neurodevelopmental and neurodegenerative conditions, respectively. We also added diabetes and chronic kidney disease (CKD) as these are highly prevalent chronic conditions which are closely related to cardiovascular disease.

Box 1: Included conditions

Outcome measures in CPRD Aurum

Prevalent cases for all conditions were identified using disease-specific clinical codelists. Codelists were developed through collaboration by a team of clinicians in the Universities of Birmingham and Cambridge using a rigorous, systematic process via the DExtER codebuilder tool, with search strategies recorded using a consistent coding checklist. We began by reviewing all existing Quality and Outcomes Framework (QOF) codelists [18] and published codelists for UK primary care EHR analyses, including HDRUK Phenotype library [19], OpenCodelists [20], and CPRD @ Cambridge Codelists [21]. Lists were adapted or, where they did not exist, created anew for CPRD Aurum using the hierarchical Read code system, the NHS Digital SNOMED CT term browser [22], and the DExtER codebuilder tool to search for relevant text words for symptoms, diagnoses, clinical findings, and interventions that indicated a diagnosis of the condition. Finally, codelists, conventions and queries were reviewed and agreed among the team at regular clinical coding meetings. Codelists can be found at https://github.com/THINKINGGroup/phenotypes.

For hypertension and CKD, prescriptions and clinical biomarkers were also used as a secondary method of determining prevalence estimates. Hypertension was defined (according to the same methods as the Health Survey for England [23] to enable comparison) as prescription of an antihypertensive medication in the six months prior to 1st January 2020, or most recent blood pressure within the past three years > 140/90mmHg. CKD was defined as the most recent estimated glomerular filtration rate (eGFR) < 60ml/min/1.73 m² within the past three years prior to 1st January 2020.
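A minimal Python sketch of these two secondary definitions is given below. It makes several assumptions that the text does not specify: "> 140/90mmHg" is read as systolic above 140 or diastolic above 90, the look-back windows are taken as calendar dates (1st July 2019 and 1st January 2017), and the function and argument names are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

INDEX_DATE = date(2020, 1, 1)
RX_WINDOW_START = date(2019, 7, 1)    # six months before the index date
TEST_WINDOW_START = date(2017, 1, 1)  # three years before the index date


def has_hypertension(last_antihypertensive_rx: Optional[date],
                     last_bp: Optional[Tuple[date, float, float]]) -> bool:
    """last_bp is (date, systolic, diastolic) for the most recent reading."""
    treated = (last_antihypertensive_rx is not None
               and RX_WINDOW_START <= last_antihypertensive_rx < INDEX_DATE)
    raised = False
    if last_bp is not None:
        measured_on, systolic, diastolic = last_bp
        in_window = TEST_WINDOW_START <= measured_on < INDEX_DATE
        # Assumption: either component above threshold counts as raised.
        raised = in_window and (systolic > 140 or diastolic > 90)
    return treated or raised


def has_ckd(last_egfr: Optional[Tuple[date, float]]) -> bool:
    """last_egfr is (date, eGFR in ml/min/1.73 m^2) for the most recent result."""
    if last_egfr is None:
        return False
    measured_on, egfr = last_egfr
    return TEST_WINDOW_START <= measured_on < INDEX_DATE and egfr < 60
```

Under these assumptions, a reading of exactly 140/90 mmHg without a recent prescription would not be classified as hypertension, and an eGFR of exactly 60 would not be classified as CKD.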

Outcome measures in comparator sources

A literature review was undertaken to identify, for each condition, three estimates for UK population prevalence:

UK primary care electronic records prevalence

  • Previous analyses of UK primary care electronic records databases using clinical codes to detect prevalent cases. Where available, QOF data were the ideal comparator as the Quality and Outcomes Framework programme uses data collected from 96% of general practices in England [18]. Practices are financially incentivised via QOF to keep accurate disease registers of patients with specific conditions according to nationally agreed standards. For conditions not included in QOF we used cross-sectional, or cohort studies analysing data from other UK EHRs.

Self-reported doctor-diagnosed prevalence

  • Prevalence estimates identified from studies using methods other than primary care EHRs for detection of cases that have been diagnosed by a healthcare professional. These estimates primarily came from two large cross-sectional studies: the Health Survey for England (HSE) and the Adult Psychiatric Morbidity Survey (APMS) where a nationally representative sample of the UK population were surveyed face-to-face and asked about their health conditions [14, 23].

Screen-detected prevalence

  • Prevalence estimates identified from studies that involved screening of a representative sample of the population using a reference standard diagnostic technique. For example, the Health Survey for England (HSE) used HbA1c blood tests from the representative sample to estimate population prevalence of diabetes [23].

Search strategy

A pragmatic approach was used to identify relevant sources for each condition; where available, prevalence statistics reported within Public Health England fingertips resources [24], NHS Digital resources [25], and NICE Clinical Knowledge Summaries [26] were used. Further details of the search strategy can be found in Additional file 1. Where QOF, HSE, or APMS data were not available, PubMed and Google Scholar databases were systematically searched for cross-sectional and longitudinal studies using a Boolean search strategy: “condition name” AND “prevalence” OR “epidemiology”. For EHR prevalence, we added: AND abbreviated and unabbreviated names of these established UK-based primary care EHR databases (e.g., “THIN” and “The Health Improvement Network”). For screen-detected prevalence we added: AND “screening”.
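As an illustration, the search strings described above could be assembled as in the following Python sketch. The grouping of the OR terms in parentheses is an assumption, since the text does not state operator precedence, and `build_query` is a hypothetical helper rather than part of the study's tooling.

```python
from typing import Optional, Sequence


def build_query(condition: str,
                ehr_names: Optional[Sequence[str]] = None,
                screening: bool = False) -> str:
    """Build a Boolean search string for one condition (assumed grouping)."""
    parts = [f'"{condition}"', '("prevalence" OR "epidemiology")']
    if ehr_names:
        # Abbreviated and unabbreviated database names, joined with OR.
        parts.append("(" + " OR ".join(f'"{name}"' for name in ehr_names) + ")")
    if screening:
        parts.append('"screening"')
    return " AND ".join(parts)


# EHR prevalence search for one condition
print(build_query("atrial fibrillation",
                  ehr_names=["THIN", "The Health Improvement Network"]))
```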

Study selection criteria

Studies were included if they reported the most recent available prevalence of any of the conditions using cross-sectional or cohort study data (or a meta-analysis of these), representative of the general population prior to 1st January 2020. They were excluded if they contained fewer than 500 patients or were based on a subpopulation within a specific disease. The most recent study within a large and comparable population was selected. This was ideally a UK population study, but if this was not available then studies within European or other high-income countries were used. Further details and methods of data collection for all comparator studies were summarised in Additional file 1, and in Additional Table 1, Additional Table 2, and Additional Table 3 within that additional file.

Statistical methods

CPRD Aurum prevalence analysis

Frequencies, percentages, and cross-tabulations were used to describe the prevalence of each condition across the entire population and by sociodemographic characteristics with age, sex, ethnicity groups, and deprivation quintiles all treated as categorical variables. Age at entry was categorised into the following age groups: 0–16, 17–30, 31–40, 41–50, 51–60, 61–70, and ≥ 70 years. Ethnicity was categorised into five groups based on those used in the UK Census: white, Asian, black, mixed, and other ethnicity (which includes Chinese, Middle Eastern and Pacific). Socioeconomic categories were based on the English Index of Multiple Deprivation (IMD) quintiles for the geographic area where the patient lives. Patients with missing data on ethnicity were assigned to a separate “missing” category and included in the regression analysis.
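The age categorisation and the resulting cross-tabulation can be sketched in Python as below. The band boundaries follow the text, except that the final band is assumed here to contain ages above 70 (the text lists both 61–70 and ≥ 70, so the handling of age 70 is an assumption); the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (lower, upper, label), bounds inclusive
AGE_BANDS = [
    (0, 16, "0-16"), (17, 30, "17-30"), (31, 40, "31-40"),
    (41, 50, "41-50"), (51, 60, "51-60"), (61, 70, "61-70"),
]


def age_band(age: int) -> str:
    """Map an age in whole years to its band label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for lower, upper, label in AGE_BANDS:
        if lower <= age <= upper:
            return label
    return "over 70"  # assumption: ages above 70 form the final band


def prevalence_by_band(patients: Iterable[Tuple[int, bool]]) -> Dict[str, float]:
    """patients are (age, has_condition) pairs; returns prevalence (%) per band."""
    totals, cases = Counter(), Counter()
    for age, has_condition in patients:
        band = age_band(age)
        totals[band] += 1
        cases[band] += int(has_condition)
    return {band: 100 * cases[band] / totals[band] for band in totals}
```

The same pattern applies to sex, ethnicity group, and deprivation quintile, each treated as a categorical variable.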

For each point estimate of prevalence, 95% confidence intervals (CI) for proportions were calculated using the Clopper-Pearson exact method [27]. Logistic regression was used to calculate the odds ratios of each condition by sociodemographic characteristics (with mutual adjustment). All statistical analyses were performed using Stata statistical software, V.16 (StataCorp, College Station, Texas, USA). The Stata code used for the analysis is publicly available here: https://github.com/CPRDAurumPrevalenceAnalysis/.
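For readers unfamiliar with the Clopper-Pearson interval, the following Python sketch shows the underlying idea: the lower limit is the proportion at which observing at least k cases has probability α/2, and the upper limit is the proportion at which observing at most k cases has probability α/2. It is an illustrative standard-library implementation using direct summation and bisection, suitable only for small samples; it is not the Stata code used in this study, and with millions of patients a statistical package's beta-quantile routine would be the practical choice.

```python
import math
from typing import Callable, Tuple


def _binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed in log space (small n only)."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(1.0, total)


def _solve_increasing(f: Callable[[float], float], target: float) -> float:
    """Bisection on [0, 1] for an increasing function f crossing target."""
    lo, hi = 0.0, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for k cases out of n."""
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("require n > 0 and 0 <= k <= n")
    # Lower limit: P(X >= k | p) = alpha / 2 (increasing in p)
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1.0 - _binom_cdf(k - 1, n, p), alpha / 2)
    # Upper limit: P(X <= k | p) = alpha / 2 (negated so it increases in p)
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: -_binom_cdf(k, n, p), -alpha / 2)
    return lower, upper
```

With zero cases out of n, the upper limit reduces to the closed form 1 − (α/2)^(1/n), which gives a quick check of the implementation.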

Comparator data prevalence analysis

Numerators (number of cases) and denominators (number of people sampled) and details of the data collection methods were extracted from each source identified in the literature review. Population prevalence and 95% confidence intervals for proportions were calculated for each condition in the same way as for the CPRD Aurum analysis.

Comparisons between prevalence estimates

For each comparison with the prevalence reported in the literature, a sample was created within CPRD Aurum containing all patients who matched the age profile of that population. For aortic aneurysms the only available comparator was from a screening programme that reported incidence within men in their 65th year, therefore a comparison was made with prevalence of aortic aneurysm in men aged 66 (to allow time for the diagnosis to be recorded in their records). For anxiety, the most appropriate comparator population prevalence estimates only measured prevalence of generalised anxiety disorder. Therefore, a new codelist for generalised anxiety disorder was created within CPRD Aurum for comparison. The prevalence estimates were compared using scatter graphs of observed vs. comparator prevalence using Microsoft Excel.
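The age-matched comparison can be sketched in Python as follows. The threshold of 20% relative difference is the one used to describe estimates as "similar" in the Results; taking the comparator estimate as the denominator of the relative difference is an assumption, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def prevalence_in_age_range(patients: Iterable[Tuple[int, bool]],
                            min_age: int, max_age: int) -> float:
    """Prevalence (%) among (age, has_condition) pairs within [min_age, max_age]."""
    matched = [has_condition for age, has_condition in patients
               if min_age <= age <= max_age]
    if not matched:
        raise ValueError("no patients in the requested age range")
    return 100 * sum(matched) / len(matched)


def relative_difference(cprd_pct: float, comparator_pct: float) -> float:
    """Difference of the CPRD estimate relative to the comparator, in percent."""
    return 100 * (cprd_pct - comparator_pct) / comparator_pct


def is_similar(cprd_pct: float, comparator_pct: float,
               threshold: float = 20.0) -> bool:
    """True if the absolute relative difference is below the threshold."""
    return abs(relative_difference(cprd_pct, comparator_pct)) < threshold
```

For instance, a CPRD Aurum estimate of 4.4% against a comparator of 2.0% gives a relative difference of about 120% and would not count as similar, whereas 5.0% against 5.1% would.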

Results

Cross-sectional analysis of primary care EHR

Almost 12.4 million patients within the CPRD Aurum database were eligible for inclusion in this analysis. The median length of follow up in this study was 10.2 years (IQR 4.4–20.9). Males and females were equally represented; 18% of the patients were under 16, 69% were between 16 and 70, and 13% were over 70 years old. Ethnicity was recorded for 80% of patients in the database, and of these 81% were White, 10% were Asian, 5% were Black, 2% were of other ethnicities, and 2% were of mixed ethnicity. Deprivation quintiles were equally distributed (~ 20%). Hypertension, affecting 15% of the study population, and depression, affecting 16%, were the most common CRM and MH conditions respectively. The prevalence of each CRM and MH condition in the general population and by socio-demographic characteristics is shown in Tables 1 and 2.

Table 1 Prevalence of cardio-renal-metabolic conditions overall and by socio-demographics in the CPRD Aurum database, 2020
Table 2 Prevalence of mental health conditions overall and by socio-demographics within CPRD Aurum database, 2020

For analysis of prevalence by socio-demographic variables the adjusted odds ratios for prevalence of each condition by sex, deprivation quintile, age categories and ethnicity were calculated. These are presented in forest plots in Additional File 2.

Sex

Cardio-renal-metabolic conditions were more prevalent in men, except for CKD which was more prevalent in women. There was no difference in the prevalence of PTSD between men and women. Affective (depression, anxiety and bipolar) and eating disorders were more prevalent in women, whilst there were higher odds of substance and alcohol misuse and schizophrenia in men. [See Additional File 2; Supplementary Fig. 1]

Socio-economic status

There was a clear trend of increasing prevalence of almost all conditions with increasing socio-economic deprivation, with ORs ranging from 1.4 (aortic aneurysm) to 3.9 (substance misuse) for those from the most deprived compared with the least deprived quintile. Associations were weaker between deprivation and AF, heart valve disorders, and T1DM, and prevalence decreased with increasing deprivation for eating disorders. [See Additional File 2; Supplementary Fig. 2]

Age categories

There was a general trend of increasing lifetime prevalence for all cardio-renal-metabolic conditions (except for T1DM) with increasing age. There was a marked increase in prevalence of all mental health conditions after the age of 16. There was typically a gradual increase in lifetime prevalence of each mental health condition up until the age of 40–60 followed by a gradual decrease in recorded prevalence in the oldest age categories. Lower lifetime prevalence of a MH condition in those over 60 years old was most pronounced for substance abuse and PTSD. [See Additional File 2; Supplementary Fig. 3]

Ethnicity

There was considerable variation in the prevalence of CRM and MH conditions by ethnicity. Among those of black and Asian ethnicities diabetes, hypertension, and CKD, were more prevalent than in those of white ethnicity, whilst aortic aneurysms, AF, PVD, heart valve disorders and T1DM were less prevalent in black or Asian people.

In CPRD data, mental health conditions were typically around twice as prevalent in those of white ethnicity as in those of black or Asian ethnicity, except for PTSD and schizophrenia, which were 33% more prevalent and twice as prevalent, respectively, in those of black ethnicity.

[See Additional File 2; Supplementary Fig. 4]

Comparison of prevalence of health conditions in CPRD against literature

Figures 1, 2 and 3 compare the prevalence estimates from the literature within other UK primary care EHRs (Fig. 1), surveys of self-reports of doctor-diagnosed conditions (Fig. 2) and screening studies (Fig. 3), against the prevalence of each condition in an age matched population within CPRD Aurum. Prevalence estimates from the literature, with the data sources and methods of data collection are reported in Additional File 1; Tables 1, 2 and 3.

Prevalence in UK primary care EHRs

Figure 1 shows that for 5/10 CRM conditions, the prevalence in CPRD Aurum was similar to (< 20% difference relative to) available prevalence estimates for age-matched populations in QOF and other UK primary care EHRs [18]. However, the prevalence of heart valve disorders in CPRD Aurum in 65–95-year-olds (5.2% (95%CI 5.2–5.3%)) was more than double the prevalence reported in age-matched patients in THIN data (1.6% (95%CI 1.6–1.7%)) [28]. The prevalences of IHD, T1DM, stroke and HF were between 20 and 55% higher in CPRD Aurum than in other UK primary care EHRs [18, 29,30,31].

Fig. 1

Comparison of condition prevalences in CPRD Aurum with prevalence estimates from other electronic health records. CKD = chronic kidney disease, IHD = ischaemic heart disease, AF = atrial fibrillation, HF = heart failure, PVD = peripheral vascular disease, BPAD = bipolar affective disorder

The prevalence of bipolar disorder in CPRD Aurum (0.4% (95%CI 0.4–0.4%)) was similar to (< 20% higher than) the prevalence estimate in the IQVIA Medical Research Database (IMRD) in 2018 (0.4% (95%CI 0.4–0.4%)) [32]. The prevalences of eating disorders and schizophrenia in CPRD Aurum were 20% and 34% higher, respectively, than prevalence estimates in CPRD Gold [33,34,35]. For depression and anxiety, the age-matched prevalence in CPRD Aurum was around twice as high as in QOF and THIN data [18, 36].

Self-reported doctor-diagnosed prevalence

Figure 2 shows that the prevalences of stroke, diabetes, and IHD in CPRD Aurum were similar to (< 20% difference relative to) self-reported doctor-diagnosed prevalence estimates in HSE [23, 37]. However, prevalence of CKD in over 16 year olds was more than twice as high in CPRD Aurum (4.4% (95%CI 4.4–4.4%)) than in HSE data (2.0% (95%CI 1.6–2.4%)) [38]. Prevalences of T1DM and hypertension in CPRD Aurum were 23% and 34% higher than were reported by the National Diabetes Audit and HSE respectively [23, 30]. Prevalence of PVD in CPRD Aurum was 43% lower compared with the prevalence reported in UK Biobank [39].

Fig. 2

Comparison of condition prevalences in CPRD Aurum with self-reported doctor-diagnosed prevalence estimates from the literature. CKD = chronic kidney disease, IHD = ischaemic heart disease, HF = heart failure, PVD = peripheral vascular disease, BPAD = bipolar affective disorder, T1 diabetes = type 1 diabetes, PTSD = post-traumatic stress disorder

The prevalence of depression, schizophrenia and bipolar disorder in CPRD Aurum in over 16 year olds closely matched (< 20% relative difference to) those reported in HSE and APMS [14, 37]. The prevalence of eating disorders in CPRD Aurum was 41% lower than that reported in the HSE [34], whilst for generalised anxiety disorder prevalence was 69% higher in CPRD Aurum than in the HSE [37]. Prevalence of alcohol misuse was three times higher in CPRD Aurum (5.4% (95%CI 5.4–5.4%)) than in HSE (1.2% (95%CI 1.0–1.5%)) [37]. However, prevalence of PTSD in CPRD Aurum (0.6% (95%CI 0.6–0.7%)) was three times lower than that reported by HSE (1.9% (95%CI 1.5–2.2%)) [37].

Screen-detected prevalence

Figure 3 shows that for aortic aneurysms, CKD, IHD, AF and PVD, the prevalence estimates reported in CPRD Aurum matched (< 20% difference relative to) estimates of screen-detected prevalence in the same age groups in the literature [29, 38, 40,41,42]. For diabetes, hypertension, heart failure, and heart valve disorder the prevalence estimates in CPRD Aurum were around a third lower than in screening studies [23, 43, 44].

Fig. 3

Comparison of condition prevalences in CPRD Aurum with screening study prevalence estimates from the literature. BP = blood pressure, CKD = chronic kidney disease, IHD = ischaemic heart disease, AF = Atrial fibrillation, HF = heart failure, PVD = peripheral vascular disease, BPAD = bipolar affective disorder, PTSD = post-traumatic stress disorder

For substance misuse disorder, depression, and schizophrenia the prevalence estimates in CPRD Aurum were around 30% lower than in the APMS (2014) [14]. For generalised anxiety disorder and alcohol misuse disorder, the prevalence in CPRD Aurum were around 80% higher than in the European Study of the Epidemiology of Mental Disorders and APMS respectively [14, 45]. However, for eating disorders, bipolar disorder and PTSD, prevalence reported in the APMS and HSE were 4–6 times higher than in CPRD Aurum [14, 33].

Biomarkers for hypertension and CKD

When defined by use of antihypertensive medication or most recent blood pressure reading > 140/90mmHg (to match methodology in HSE), the prevalence of hypertension in CPRD Aurum in over 16-year-olds was 31.6% (95%CI 31.6–31.6%), which was almost twice as high as the prevalence when defined by using clinical codes (19.1% (95%CI 19.0–19.1%)). However, as shown in Fig. 3, it was similar to (< 20% difference relative to) the HSE screen-detected prevalence estimate (27.9% (95%CI 26.5–29.2%)) [23].

Prevalence of CKD in CPRD in over 16-year-olds was similar (< 20% relative difference) when measured using the most recent eGFR < 60ml/min/1.73 m² (to match methodology in HSE) (5.0% (95%CI 5.0–5.0%)) to both the prevalence in CPRD estimated using clinical codes (4.4% (95%CI 4.4–4.4%)) and the screen-detected prevalence in HSE (5.1% (95%CI 4.4–5.8%)) [38].

Discussion

Main findings

This was a comprehensive analysis of the prevalence of cardio-renal-metabolic (CRM) and mental health (MH) conditions in 12 million patients in a primary care electronic health records (EHRs) database. There was a high burden of depression, anxiety, and hypertension across the population. As expected, most conditions reported in EHRs were increasingly prevalent with increasing deprivation and age, although mental health conditions were potentially under-represented in children. Most CRM conditions, schizophrenia and substance misuse were more prevalent in men, whilst anxiety, depression, bipolar and eating disorders were more common in women. Hypertension and diabetes were twice as prevalent in black patients compared with white patients and diabetes was three times as common in Asian patients. However, black and Asian patients generally had lower recorded prevalences of cardiovascular disease (aortic aneurysms, AF, PVD, HF, heart valve disorder, IHD, stroke) than white patients. Mental health conditions were reported twice as frequently in those of white ethnicity as in those of black or Asian ethnicity in EHRs, except for PTSD and schizophrenia, which were 33% more prevalent and twice as prevalent in those of black ethnicity respectively.

Estimates for prevalence of most clinically detected CRM conditions, as well as depression, anxiety, bipolar disorder, and schizophrenia in the EHR database were broadly similar to or greater than the self-reported doctor-diagnosed prevalence reported in the Health Survey for England (HSE) and Adult Psychiatric Morbidity Survey (APMS). This suggests these conditions are well represented in EHRs. However, there were sizable differences in the prevalence of hypertension, diabetes, and depression in the EHR compared to other prevalence estimates from studies screening for these conditions. Screen-detected prevalence estimates for PTSD, bipolar disorder and eating disorders were 4–6 times higher than prevalence of these conditions in primary care EHR records, potentially reflecting a significant burden of underdiagnosed or less well documented MH morbidity.

Comparisons with other literature

In the EHR, the risk factors for cardiovascular disease (i.e., hypertension and diabetes) were more prevalent in black and Asian people than white people, but paradoxically this was not typically matched by higher prevalence of cardiovascular disease itself (i.e., PVD, aortic aneurysms, stroke, IHD). This has also been reported in other cohort studies analysing variation in prevalence of aortic aneurysms and peripheral artery disease by ethnicity [46, 47]. We found that AF was recorded twice as frequently in white patients compared with black and Asian patients. A previous cross-sectional analysis also found lower prevalence of AF recorded in African American patients’ records compared with white American patients, but no difference in prevalence with systematic unbiased testing [48]. Potential explanations have included differential uptake of screening in the case of aortic aneurysms, and under-diagnosis due to language barriers or lower health literacy in Asian people regarding PVD symptoms [20].

These disparities may also reflect the higher premature death rate from IHD in Asian people compared to white people, thus susceptible Asian people do not survive long enough to develop symptoms of PVD [20]. Additionally, although these analyses of variation in prevalence by ethnicity are adjusted for age, given the strength of the association between age and cardiovascular diseases, the lower prevalence of cardiovascular disease in those of black and Asian ethnicity could reflect that in this database these populations were on average significantly younger than those of white ethnicity.

In mental health conditions there was typically a significant reduction in prevalence in the over 70-year-olds compared with those aged 40–50, which may reflect earlier mortality for those diagnosed with these conditions at younger ages [49]. Reduced prevalence in the oldest adults is especially notable in eating disorders (see Additional File 2; Supplementary Fig. 3), which is the MH condition with the highest mortality rate [50]. In the analyses of the prevalence of conditions by socio-demographic factors, it is important to note that those who had died before the index date were excluded from the sample, so those with non-fatal disease may be over-represented in the survivors.

Prevalence of MH conditions recorded in the primary care EHR was comparatively very low in children. Depression was recorded 40 times more frequently in 17–30 year-olds compared with under-16 year-olds. The latest Mental Health of Children and Young People in England survey found that one in six people aged 6–16 years had a “probable” MH condition [51]. However, this reflects a wide range of mental health symptoms from mood and anxiety to attention and hyperactivity, rather than specific diagnoses. Nevertheless, there is likely to be considerable under-representation of the true prevalence of MH conditions in children in EHRs. Qualitative research suggests that whilst parents and children do not always report mental health symptoms to GPs, [52] in turn GPs report feeling ill-equipped to diagnose MH conditions in children, and there are considerable challenges in accessing child and adolescent mental health specialists [52,53,54].

The gap between screen-detected prevalence and primary care EHR prevalence was more apparent for MH conditions than for CRM conditions, notably for depression, bipolar disorder, eating disorders and PTSD. Financial incentives for accurate coding of certain conditions may have impacted the accuracy of recording diagnoses in EHR. All but one (aortic aneurysms) of the CRM conditions are included in QOF, which financially incentivises practices to have accurate disease coding, whilst eating disorders and PTSD are not included in QOF [18]. Longitudinal analysis of atrial fibrillation coding in UK EHRs suggests that the introduction of QOF did lead to practices refining the diagnostic coding for this condition [55].

PTSD had the most notable discrepancies between both screen-detected prevalence and self-reported doctor-diagnosed prevalence compared with prevalence in the EHR, which suggests that this condition may be especially under-recognised and under-diagnosed. PTSD is typically diagnosed in secondary care and case-detection within primary care EHR is also dependent on accurate transfer of information between primary and secondary care. Studies exploring the accuracy of stroke and cancer diagnoses in primary care EHR have shown that between 10 and 40% of these diagnoses in hospital records are missing in primary care EHR [14]. Many people with symptoms of common MH conditions do not present to primary or secondary care [13]. However, self-reported screening questionnaires also consistently overestimate the prevalence of MH conditions in epidemiological studies [56], thus CPRD Aurum and other EHR databases may be more reliable for case-detection of these conditions. Results from the SAIL EHR databank showed that ten-year prevalence of depression and/or anxiety was 16.2% and of anxiety/depression symptom codes was 21.4%, which is similar to our estimates (16.0% had depression (95%CI 16.0–16.0%)) [57].

Women had double the rates of reported depression and anxiety compared with men in the primary care EHR. However, in the APMS survey screening for symptoms of depression and anxiety, prevalence of these conditions is only around 25–50% higher in women [13]. In the EHR, depression and anxiety were three times as common in those of white ethnicity compared with those of black or Asian ethnicity. However, in the APMS survey, symptoms of depression and anxiety were more common in people of black and Asian ethnicity [13]. Like previous studies, we found that black people were twice as likely to be diagnosed with schizophrenia as other ethnicities [13]. Research in this area is limited by small sample sizes. However, it is recognised that there are considerable barriers to accessing mental healthcare for people from black and minority ethnic communities, which may lead to under-diagnosis in primary care [58]. These disparities between screening prevalence and prevalence of mental health conditions in EHR likely reflect patterns of help-seeking behaviour and barriers to access, which are influenced by both gender and ethnicity [58, 59].

There was also an overall gap between screen-detected prevalence in HSE and CPRD Aurum prevalence for diabetes and hypertension, whilst doctor-diagnosed prevalence estimates were similar [23]. However, it is important to note that the methods used for screening in HSE are not diagnostic, for example, a single raised HbA1c measurement was used to estimate the prevalence of diabetes, whereas clinical guidelines state that two raised HbA1c measurements are required to confirm the diagnosis.

Replicating the screening methods used in HSE with clinical biomarkers such as blood creatinine and blood pressure produced similar prevalence rates for hypertension and CKD [23]. These biomarkers may be useful for some studies looking at short-term outcomes. A previous study in CPRD Gold found that clinical codes underestimate the prevalence of CKD and concluded that a combination of codes and test results is most appropriate to detect CKD [60]. However, for studies investigating multimorbidity and detection of disease accumulation over several years, clinical codes are likely to be more specific and more reflective of long-term conditions.

The prevalence of all CRM and MH conditions in CPRD Aurum typically ranged from 5 to 50% higher than prevalence rates reported in other UK primary care EHR databases (predominantly QOF data). Our codelists were more comprehensive than QOF codelists; for example, the codelists for heart failure and depression included more codes related to interventions, abnormal test results, disease monitoring, and referral to secondary care services. In both these conditions the prevalence estimates in CPRD Aurum were similar to the self-reported doctor-diagnosed prevalence estimates. Therefore, our codelists may be more sensitive but less specific than QOF codelists.

A diagnosis of anxiety was more prevalent in CPRD Aurum data (15.8% (95%CI 15.8–15.8%)) in comparison with a previous analysis of THIN data (7.2% (95%CI 7.1–7.2%)) [36]. However, the THIN analysis reported prevalence of anxiety codes entered between 2002 and 2004 only, whereas we included any case prior to 2020. Doctor-diagnosed prevalence of generalised anxiety disorder was also higher in CPRD Aurum (9.4% (95%CI 9.4–9.4%)) compared with self-reported doctor-diagnosed generalised anxiety in HSE (5.5% (95%CI 4.9–6.1%)) [37]. The most frequently used code within our anxiety codelist by some margin was “Anxiety with depression”, reflecting the established overlap between these two conditions.
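The very narrow confidence intervals around the CPRD Aurum estimates follow from the size of the denominator. The sketch below uses a simple normal-approximation (Wald) interval, which is an assumption and not necessarily the method used in this analysis or in the comparator studies. Both denominators are hypothetical round numbers chosen for illustration: 12,000,000 for an EHR-sized cohort and 5,500 for a survey-sized sample.

```python
import math

def wald_ci(p, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion p with denominator n."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Hypothetical denominators; not the actual sample sizes of any study.
ehr_low, ehr_high = wald_ci(0.158, 12_000_000)
survey_low, survey_high = wald_ci(0.055, 5_500)

print(f"EHR-sized cohort:    {ehr_low:.2%} to {ehr_high:.2%}")
print(f"Survey-sized sample: {survey_low:.2%} to {survey_high:.2%}")
```

With a denominator in the millions, both limits round to the point estimate at one decimal place, whereas a sample of a few thousand gives an interval roughly one percentage point wide.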

As in previous studies, the prevalence of all conditions increased with increasing socio-economic deprivation (with the exception of eating disorders) [61]. A recent systematic review showed no consistent pattern of association between socio-economic status and eating disorders, but that historically those in more affluent groups were more likely to access diagnosis and treatment, which may explain the inverse association between social deprivation and eating disorders [62].

The prevalence of alcohol misuse in CPRD Aurum in over 16-year-olds (5.4% (95%CI 5.4–5.4%)) was considerably higher than HSE reports of both self-reported doctor-diagnosed alcohol misuse (1.2% (95%CI 1.0–1.5%)) and the screen-detected prevalence of alcohol misuse in the same age group (3.1% (95%CI 2.7–3.5%)). Participants may under-report their true drinking practices in surveys, whilst GPs may be entering clinical codes for alcohol misuse but not conveying the extent of their concerns to patients [63]. On the other hand, substance misuse appears to be under-diagnosed in CPRD Aurum compared with self-reported substance misuse. The prevalence reported in CPRD Aurum (2.1% (95%CI 2.1–2.1%)) was lower than the screen-detected prevalence of drug dependence in the APMS analysis (3.1% (95%CI 2.7–3.5%)), which is in keeping with findings from other studies [64].

Strengths and limitations

This CPRD Aurum database contains EHR from over 12 million patients, reflecting a nationally representative sample of the UK population in terms of geographic spread, deprivation, age, and gender [7]. For half of the 18 conditions (almost all the CRM conditions), primary care clinicians have been financially incentivised via the QOF system since 2004 to accurately record diagnosis codes in EHRs.

Our codelists for identifying conditions within CPRD Aurum were created using a rigorous and systematic process by a team of experienced clinicians, building on a strong foundation of previous research using clinical codes in EHRs. Our findings suggest that these codelists have high sensitivity to detect the majority of CRM and MH conditions within EHRs.

The literature review took a pragmatic rather than a systematic approach, as it would not have been feasible to conduct a systematic review for each of the 18 conditions. However, the majority of the comparisons are from the latest official UK government commissioned studies or audits of disease prevalence (e.g., QOF, HSE, APMS, and the National Diabetes Audit) [30]. Comparisons with studies reliant on self-reported health status (e.g., HSE) are subject to response bias, which may have influenced their findings.

An important limitation is that the prevalences we report are lifetime prevalences, thus conditions that have resolved will still be captured in our results. This was done for comparison with the analysis periods of the comparator data sources. Many of the included conditions are likely to be lifelong conditions (e.g., type 1 diabetes or heart failure). However, others such as depression or anxiety may later resolve or follow a relapsing-remitting course, rather than having persisting symptoms. Therefore, the duration of the data collection period significantly affects the reported prevalence of these types of conditions.

For pragmatic reasons, only age (and sex in the case of aortic aneurysms) was used to stratify CPRD Aurum data to make comparisons with prevalence estimates from the literature. Where disease prevalence has changed over time, especially given the ageing population, there can be far less certainty in the comparisons with prevalence estimates from less recent studies in the literature. Caution should be taken in analysis of prevalence of conditions by ethnicity, given that these categories aggregate very diverse communities, cultural practices, and countries of ethnic origin. Where researchers wish to examine specific conditions or sub-populations in more depth, these factors may need to be explored in greater detail.

Implications for policy and practice

Primary care EHR data are a reliable source for clinically diagnosed cases of most cardio-renal-metabolic (CRM) conditions and for depression, bipolar disorder, and schizophrenia. Caution should be taken in interpreting analyses of anxiety disorders using primary care EHR data as the prevalence may be over-reported, whereas cases of PTSD and eating disorders may be under-reported. Policymakers should explore whether incentivising accurate coding for more MH conditions, for example through QOF (in the UK), could improve reporting quality [18]. Policymakers may also wish to consider how both public awareness and primary care and mental health services can be configured to improve case-detection of these more neglected MH conditions, especially in men and children and those of black or Asian ethnicity. Healthcare providers should be encouraged to adopt culturally sensitive practices to ensure that minority populations receive adequate mental health care and support.

We found that almost 40% of patients on anti-hypertensive medications or whose latest recorded blood pressure was greater than 140/90 mmHg did not have a clinical code for hypertension in their EHR. This is in keeping with other studies that have demonstrated a significant burden of hypertension that is not well documented or acted on in primary care despite financial incentivisation [65]. Practices should consider implementing more robust follow-up systems once hypertension is initially detected. Policymakers should be aware that longer consultation times and higher GP-to-patient ratios are associated with better hypertension case-detection and management, especially in more deprived areas [65, 66].
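As a rough illustration of how such a proportion could be derived, the sketch below flags patients with an indicator of hypertension (treatment or a raised latest blood pressure) and counts how many lack a diagnosis code. The records and field names are hypothetical, and reading "greater than 140/90 mmHg" as systolic above 140 or diastolic above 90 is an assumption made here, not a description of the algorithm used in this study.

```python
# Hypothetical patient records; field names are illustrative only.
patients = [
    {"id": 1, "on_antihypertensive": True,  "latest_bp": (150, 95), "hypertension_code": True},
    {"id": 2, "on_antihypertensive": False, "latest_bp": (146, 84), "hypertension_code": False},
    {"id": 3, "on_antihypertensive": True,  "latest_bp": (128, 80), "hypertension_code": False},
    {"id": 4, "on_antihypertensive": False, "latest_bp": (118, 76), "hypertension_code": False},
    {"id": 5, "on_antihypertensive": False, "latest_bp": None,      "hypertension_code": False},
]

def has_hypertension_indicator(patient):
    """True if the patient is treated or the latest blood pressure is raised.

    Assumption: "raised" means systolic > 140 or diastolic > 90 mmHg.
    """
    bp = patient["latest_bp"]
    raised = bp is not None and (bp[0] > 140 or bp[1] > 90)
    return patient["on_antihypertensive"] or raised

indicated = [p for p in patients if has_hypertension_indicator(p)]
uncoded = [p for p in indicated if not p["hypertension_code"]]
uncoded_fraction = len(uncoded) / len(indicated)
print(f"{len(uncoded)} of {len(indicated)} indicated patients have no hypertension code")
```

The denominator is restricted to patients with an indicator, so the resulting fraction describes missing codes among likely cases rather than across the whole registered population.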

Implications for future research

The variation in prevalence of conditions by sociodemographic characteristics, especially sex and ethnicity, warrants further exploration to understand the relative contribution of genetic, lifestyle, socio-cultural, and healthcare-related factors to these disparities. This requires both longitudinal analyses, stratified by these demographic subgroups, to understand how these factors mediate risk of CRM and MH conditions, and qualitative research exploring barriers to accurate case-detection at both a patient and practice level (e.g., staffing ratios, funding, and continuity of care).

For future research using EHRs, additional algorithms may be used to adjust sensitivity or specificity of a codelist for case-detection of these conditions, depending on the purpose of the analysis. These might include use of prescription data, or codes for symptoms or referrals (instead of diagnoses), and use of clinical biomarkers such as blood test results. Future research could also explore the prevalence and demographic variation of other common chronic conditions in this database, such as cancer, respiratory conditions, and autoimmune diseases.
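The trade-off described above can be sketched by comparing a broad and a strict case definition against a reference standard. Everything in this example is hypothetical: the records, the field names, the two rules, and the `true_case` reference label are illustrative assumptions and do not represent the codelists or algorithms used in this study.

```python
# Hypothetical records: flags for each data source plus a reference-standard label.
records = [
    {"diagnosis_code": True,  "prescription": True,  "symptom_code": False, "abnormal_test": False, "true_case": True},
    {"diagnosis_code": False, "prescription": True,  "symptom_code": True,  "abnormal_test": False, "true_case": True},
    {"diagnosis_code": False, "prescription": False, "symptom_code": True,  "abnormal_test": False, "true_case": False},
    {"diagnosis_code": False, "prescription": False, "symptom_code": False, "abnormal_test": False, "true_case": False},
    {"diagnosis_code": True,  "prescription": False, "symptom_code": False, "abnormal_test": False, "true_case": True},
]

def broad_case(r):
    """Sensitive rule: any diagnosis, prescription, symptom code, or abnormal test."""
    return bool(r["diagnosis_code"] or r["prescription"] or r["symptom_code"] or r["abnormal_test"])

def strict_case(r):
    """Specific rule: a diagnosis code corroborated by a prescription or abnormal test."""
    return bool(r["diagnosis_code"] and (r["prescription"] or r["abnormal_test"]))

def sensitivity_specificity(rows, rule):
    """Return (sensitivity, specificity) of a rule against the reference label."""
    tp = sum(1 for r in rows if rule(r) and r["true_case"])
    fn = sum(1 for r in rows if not rule(r) and r["true_case"])
    tn = sum(1 for r in rows if not rule(r) and not r["true_case"])
    fp = sum(1 for r in rows if rule(r) and not r["true_case"])
    return tp / (tp + fn), tn / (tn + fp)

broad = sensitivity_specificity(records, broad_case)
strict = sensitivity_specificity(records, strict_case)
print("broad  (sensitivity, specificity):", broad)
print("strict (sensitivity, specificity):", strict)
```

In this toy data the broad rule finds every true case at the cost of one false positive, while the strict rule produces no false positives but misses cases recorded without corroborating evidence. Which rule is preferable depends on the purpose of the analysis.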

Conclusion

Most clinically diagnosed conditions appeared to be well represented in primary care records. However, we found important variations in prevalence by demographic characteristics, which may reflect true variation in prevalence or systematic differences in the likelihood of both presenting to healthcare professionals and being diagnosed with these conditions. Primary care data may underrepresent the prevalence of undiagnosed conditions, particularly in mental health.