The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and its resulting disease (COVID-19) are spreading rapidly worldwide [1]. A better understanding of the predictors of developing infection is essential for health service planning (e.g. ensuring adequate facilities for those most at risk), targeting prevention efforts (e.g. targeted shielding or surveillance) and informing future modelling efforts. Age, male sex and pre-existing medical conditions are established predictors of adverse COVID-19 outcomes, as is excess adiposity [2], but the role of social determinants is poorly understood [3, 4].

Ethnicity and socioeconomic position strongly influence health outcomes for both infectious and non-communicable diseases. Previous pandemics have often disproportionately impacted ethnic minorities and socioeconomically disadvantaged populations [5, 6]. Early evidence suggests that the same may be occurring in the current SARS-CoV-2 pandemic but empirical research remains highly limited [7]. It is highly plausible that infection risk will vary across these social groups. For example, socioeconomic disadvantage is linked to living in overcrowded housing. Similarly, Bangladeshi, Indian and Chinese people are more likely to live in intergenerational households (e.g. with children, parents and grandparents) [8], which has been hypothesised to increase transmission [9].

Establishing the risk of developing infection across different social groups is challenging. A major issue is that information about ethnicity and socioeconomic position is often not well collected within routine health data. Furthermore, the size of the different social groups in the general population is also often not accurately known [10]. The ideal approach to estimating infection risk across different social groups is to analyse data from a cohort study, but most existing cohort studies which include detailed information about ethnicity and socioeconomic position are subject to long delays in data being available for analysis and are too small to provide useful estimates of infection risk.

The UK Biobank study has carried out data linkage between its study participants and SARS-CoV-2 test results held by Public Health England. We therefore aimed to investigate the relationship between ethnicity, socioeconomic position and the risk of having confirmed SARS-CoV-2 infection in the population-based UK Biobank study.


Study design and participants

Data were obtained from UK Biobank, with the methods described in detail previously [11]. In brief, over 502,000 community-dwelling individuals largely aged 40 to 70 years were recruited to the study during 2006 to 2010. Participants attended one of 22 assessment centres across England, Scotland and Wales. Data were collected on a range of topics including social and demographic factors, health and behavioural risk factors, using standardised questionnaires administered by trained interviewers and self-completion by computer.

Results of SARS-CoV-2 tests for UK Biobank participants, including confirmed cases, were provided by the Public Health England (PHE) microbiology database Second Generation Surveillance System and linked to UK Biobank baseline data [12]. Data provided by PHE included the specimen date, specimen type (e.g. upper respiratory tract), laboratory, origin (whether there was evidence from microbiological record that the participant was an inpatient or not) and result (positive or negative). Data were available for the period 16 March 2020 to 3 May 2020.

Since data on test results were only available for England, we restricted the study population to people who attended UK Biobank baseline assessment centres in England. Participants who were identified as having died prior to 31 January 2018 from the linked mortality records provided by the NHS Information Centre (N = 17,632) and those who requested to withdraw from the study prior to February 2020 (N = 30) were also excluded from the analysis. In addition to the analyses of the overall population, we also investigated positive test results amongst those who had been tested only. This allowed us to investigate the potential for bias due to differential testing between ethnic and socioeconomic groups. UK Biobank received ethical approval from the NHS National Research Ethics Service North West (11/NW/0382; 16/NW/0274).
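The eligibility rules above can be sketched as a simple filter. This is an illustrative sketch only, not the study's own code (the analysis was run in Stata); the `Participant` fields and the example records are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Cut-off taken from the text: deaths identified before this date are excluded.
DEATH_CUTOFF = date(2018, 1, 31)


@dataclass
class Participant:
    # All field names are hypothetical, chosen for illustration.
    pid: int
    centre_country: str            # country of the baseline assessment centre
    date_of_death: Optional[date]  # None if no death appears in linked records
    withdrew: bool                 # True if the participant asked to withdraw


def is_eligible(p: Participant) -> bool:
    """Apply the three exclusion rules described in the text."""
    if p.centre_country != "England":
        return False
    if p.date_of_death is not None and p.date_of_death < DEATH_CUTOFF:
        return False
    if p.withdrew:
        return False
    return True


# Hypothetical example records.
cohort = [
    Participant(1, "England", None, False),
    Participant(2, "Scotland", None, False),
    Participant(3, "England", date(2015, 6, 1), False),
    Participant(4, "England", None, True),
]
eligible_ids = [p.pid for p in cohort if is_eligible(p)]
```

Only the first hypothetical participant passes all three rules.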

Assessment of ethnicity and socioeconomic position

All exposures were derived from the baseline assessment centre data collection. Ethnicity was self-reported and categorised into white British, white Irish, other white background, south Asian, black (Caribbean or African), Chinese, mixed or others. As more data became available, we also used more refined groupings, separating south Asian into Indian, Pakistani or other south Asians (including Bangladeshi) and black into Caribbean, African or other black. Due to small numbers, analyses of the Chinese, mixed and other black groups were limited. In line with previous research, we also do not report results for the "other" group due to problems with interpretation of this highly heterogeneous group [13].
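The relationship between the refined and broad ethnic groupings can be written as a lookup table. This is a minimal sketch; the label strings are hypothetical and do not correspond to UK Biobank's coded values.

```python
# Mapping from refined to broad ethnic groups, following the categories named
# in the text. The label strings themselves are hypothetical.
REFINED_TO_BROAD = {
    "white British": "white British",
    "white Irish": "white Irish",
    "other white": "other white",
    "Indian": "south Asian",
    "Pakistani": "south Asian",
    "other south Asian": "south Asian",  # includes Bangladeshi
    "black Caribbean": "black",
    "black African": "black",
    "other black": "black",
    "Chinese": "Chinese",
    "mixed": "mixed",
    "other": "other",
}


def broad_group(refined: str) -> str:
    """Collapse a refined ethnic group into its broad analysis group."""
    return REFINED_TO_BROAD[refined]
```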

Socioeconomic position was assessed using two different measures recorded at the baseline visit. Area-level socioeconomic deprivation was assessed by the Townsend index (including measures of unemployment, non-car ownership, non-home ownership and household overcrowding), corresponding to the output area in which the respondent’s home postcode was recorded [14]. Quartiles were derived from the index, where the lowest quartile represents the most advantaged and the highest the least advantaged. Highest education level is a proxy measure for socioeconomic position and usually remains stable throughout the adult life course. It was assessed as (1) university or college degree; (2) A levels or equivalent; (3) O levels, General Certificate of Secondary Education (GCSE), vocational Certificate of Secondary Education (CSE) or equivalent; (4) others (e.g. National Vocational Qualifications or other professional qualifications); or (5) none of the above [15].
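Deriving deprivation quartiles from a continuous index can be sketched as follows. The scores are hypothetical, and the cut points here come from Python's default quantile method, which may differ slightly from the method used in the original analysis.

```python
from bisect import bisect_left
from statistics import quantiles


def assign_quartiles(scores):
    """Return a quartile (1-4) per score; 1 = most advantaged (lowest score)."""
    cuts = quantiles(scores, n=4)  # three cut points between the four groups
    return [bisect_left(cuts, s) + 1 for s in scores]


# Hypothetical Townsend scores; higher values indicate greater deprivation.
townsend = [-5.2, -3.9, -2.8, -1.5, -0.4, 0.9, 2.3, 4.1]
quartile = assign_quartiles(townsend)
```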

Ascertainment of SARS-CoV-2 outcomes

We defined our primary outcome as having a positive test within the Public Health England database available through linkage [12]. This reflects confirmed infection but does not include symptomatic individuals who have not presented to the health service or not been tested, or asymptomatic cases. Some systemic differences exist in testing threshold. For example, healthcare workers may be more likely to be tested and therefore observed differences may reflect differences in testing practices. To investigate whether differential ascertainment was biasing our results, we studied three further outcomes. We identified positive cases that had their test taken while attending hospital (i.e. either emergency departments or as inpatients—hereafter referred to as hospital cases). This group is likely to reflect more severe illness and therefore is less likely to be subject to ascertainment bias. In addition, we investigated outcomes related to testing practice by assessing the risk of being tested in the overall population and testing positive amongst only those who had been tested. Higher levels of confirmed SARS-CoV-2 infection could arise from higher rates of testing amongst some population subgroups [12]. However, if this were to occur, the likelihood of having a positive test would be lower amongst groups experiencing high rates of testing.
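The four outcomes described above can be derived from per-test records along the following lines. This is a sketch under stated assumptions: the record layout (participant id, whether the specimen came from a hospital attendance, result) and the example data are hypothetical simplifications of the linked PHE fields.

```python
from collections import defaultdict

# Hypothetical linked test records: (participant id, taken in hospital?, positive?)
tests = [
    (101, False, False),
    (101, True, True),
    (102, False, True),
    (103, False, False),
]
cohort_ids = [101, 102, 103, 104]  # participant 104 was never tested


def derive_outcomes(cohort_ids, tests):
    """Return per-participant flags for the four outcomes described above."""
    by_pid = defaultdict(list)
    for pid, in_hospital, positive in tests:
        by_pid[pid].append((in_hospital, positive))
    outcomes = {}
    for pid in cohort_ids:
        records = by_pid.get(pid, [])
        tested = bool(records)
        confirmed = any(pos for _, pos in records)
        hospital_case = any(hosp and pos for hosp, pos in records)
        outcomes[pid] = {
            "tested": tested,
            "confirmed": confirmed,
            "hospital_case": hospital_case,
            # Positivity is only defined among those who were tested.
            "positive_if_tested": confirmed if tested else None,
        }
    return outcomes


outcomes = derive_outcomes(cohort_ids, tests)
```

The first three flags are analysed in the whole cohort, whereas `positive_if_tested` is analysed only in the tested subset, which is what allows differential testing to be distinguished from differential infection.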

Potential confounders and mediators

Age group (5-year age bands), sex and assessment centre were included as potential confounding variables in all statistical models. Country of birth (UK and Ireland versus elsewhere) was also included, given its influence on cultural practices [16]. We also included several variables which could reflect potential confounding or mediation.

Baseline health status was assessed using self-reported longstanding illness, disability or infirmity (yes or no), self-reported health status (excellent, good, fair, poor) and the number of chronic health conditions self-reported from a pre-defined list of 43 conditions and top-coded at 4 or more, based on a previously published approach [17]. Behavioural factors included smoking (never, previous, current), body mass index (BMI) (weight/height², derived from physical measurements and classified into underweight, normal weight, overweight, obese) and alcohol consumption (categorised into daily or almost daily, 3–4 times a week, once or twice a week, 1–3 times per month, special occasions, former drinker or never).
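The BMI classification and the top-coding of chronic conditions can be sketched as below. The BMI cut-offs are the conventional WHO thresholds, which is an assumption: the text names the categories but does not state the cut-offs used.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    # Conventional WHO thresholds (an assumption; not stated in the text).
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal weight"
    if value < 30:
        return "overweight"
    return "obese"


def top_code_conditions(count: int, cap: int = 4) -> int:
    """Top-code the number of chronic conditions at `cap` (4 or more)."""
    return min(count, cap)
```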

Other social variables were also considered. Employment status distinguished those in paid employment or self-employment, retired, looking after home and/or family, unable to work because of sickness or disability, unemployed or other. For those in work, manual versus non-manual occupation was assessed by asking participants to report whether their job involved heavy manual or physical work (never/rarely/sometimes versus usually/always). Participants were asked about the title of their current or most recent job at baseline and these were converted to the Standard Occupational Classification (SOC 2000 [18]) by UK Biobank. Healthcare (and related) workers were identified from the SOC 2000 codes 22 (Health Professionals), 32 (Health and Social Welfare Associate Professionals), 118 (Health and Social Services Managers), 611 (Healthcare and Related Personal Services), 9221 (Hospital porters) and 4211 (Medical Secretaries). Housing tenure was categorised into owner-occupier or renter/other (including those who lived in accommodation rent free, in a care home or sheltered accommodation). Urban/rural status was derived from data on the home area population density; UK Biobank combined each participant’s home postcode with data generated from the 2001 census from the Office for National Statistics. The number of people within a household was categorised into four groups: single person, two people, three people or four or more people (which included those living in institutions, such as care homes).
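Because SOC 2000 codes are hierarchical, the healthcare worker definition above can be expressed as a prefix match. This sketch assumes that each job is stored as a four-digit unit code held as a string, which may not match how UK Biobank distributes the field.

```python
# SOC 2000 groups listed in the text as healthcare (and related) work.
HEALTHCARE_PREFIXES = ("22", "32", "118", "611", "9221", "4211")


def is_healthcare_worker(soc_code: str) -> bool:
    """True if a four-digit SOC 2000 unit code falls under a listed group.

    Assumes codes are four-digit strings, so that a sub-major group such
    as 22 matches every unit code beginning with "22".
    """
    return soc_code.startswith(HEALTHCARE_PREFIXES)
```

For example, `is_healthcare_worker("9221")` is true, while a code outside the listed groups, such as `"5231"`, is not matched.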

Statistical analyses

The association between the exposures (ethnicity and socioeconomic position) and the outcomes of interest (confirmed infection, hospital case, being tested and having a positive test amongst those tested) was explored using Poisson regression. Poisson regression was preferred over logistic regression to allow relative risks to be presented, rather than odds ratios which are often misinterpreted [19]. Robust standard errors were used to ensure accurate estimation of 95% confidence intervals and p values. Missing data were excluded from the analysis via listwise deletion. Statistical analysis was conducted using Stata/MP 15.1.
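To illustrate why relative risks and odds ratios can differ, the following sketch computes both from a single 2x2 table. It is an unadjusted analogue only, not the regression models fitted in the study, and the counts are hypothetical.

```python
import math


def risk_ratio(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Unadjusted risk ratio with a Wald 95% CI computed on the log scale."""
    rr = (cases_exp / n_exp) / (cases_unexp / n_unexp)
    se_log = math.sqrt(1 / cases_exp - 1 / n_exp + 1 / cases_unexp - 1 / n_unexp)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


def odds_ratio(cases_exp, n_exp, cases_unexp, n_unexp):
    """Unadjusted odds ratio from the same 2x2 table."""
    odds_exp = cases_exp / (n_exp - cases_exp)
    odds_unexp = cases_unexp / (n_unexp - cases_unexp)
    return odds_exp / odds_unexp


# Hypothetical counts: 300 of 1000 exposed and 100 of 1000 unexposed are positive.
rr, ci_lower, ci_upper = risk_ratio(300, 1000, 100, 1000)
or_estimate = odds_ratio(300, 1000, 100, 1000)
```

With these hypothetical counts the risk ratio is 3.0 but the odds ratio is about 3.86. The gap widens as the outcome becomes more common, which is relevant when the outcome is a positive result amongst those tested rather than in the whole cohort.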

To investigate ethnic differences, we initially adjusted for age, sex and assessment centre (model 1) and then added country of birth (model 2). Subsequent models additionally adjusted for variables which we hypothesised were likely to be at least partially mediating rather than confounding variables. Model 3 adjusted for model 2 variables and for being a healthcare worker. Model 4 additionally adjusted for social variables (namely urbanicity, number of people per household, highest education level, socioeconomic deprivation, tenure status, employment status, manual work); model 5 was adjusted for model 2 plus health status variables (self-rated health, number of chronic conditions and longstanding illness or disability); model 6 was adjusted for model 2 plus behavioural risk factors (smoking, alcohol consumption and BMI); and model 7 was adjusted for all aforementioned covariates. In post hoc analyses, we also repeated the above with the more defined ethnic groups.
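The nested adjustment sets for the ethnicity models can be laid out explicitly. The variable names below are hypothetical shorthand for the covariates listed in the text.

```python
# Covariate blocks named in the text; variable names are hypothetical.
BASE = ["age_group", "sex", "assessment_centre"]
HEALTHCARE = ["healthcare_worker"]
SOCIAL = ["urbanicity", "household_size", "education", "deprivation",
          "tenure", "employment_status", "manual_work"]
HEALTH = ["self_rated_health", "n_chronic_conditions", "longstanding_illness"]
BEHAVIOUR = ["smoking", "alcohol", "bmi"]

models = {1: BASE}
models[2] = models[1] + ["country_of_birth"]
models[3] = models[2] + HEALTHCARE          # adds healthcare worker status
models[4] = models[3] + SOCIAL              # adds social variables
models[5] = models[2] + HEALTH              # model 2 plus health status only
models[6] = models[2] + BEHAVIOUR           # model 2 plus behaviours only
models[7] = models[4] + HEALTH + BEHAVIOUR  # all covariates
```

Models 5 and 6 branch from model 2 rather than from model 4, so each isolates the attenuation associated with one block of potential mediators.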

We followed a similar approach to explore the role of socioeconomic deprivation and education level. Model 1 was adjusted for age, sex and assessment centre; model 2 added ethnicity and country of birth; model 3 also adjusted for the social variables (as above); model 4 adjusted for model 2 plus health status variables; model 5 was adjusted for model 2 plus behavioural risk factors; and model 6 was adjusted for all previous covariates.


A total of 392,116 participants were included in the study (after excluding 36,109 (8.4%) people with missing data; see Additional file Figure S1 for the flowchart and Table S1 for patterns of missing data by ethnicity and socioeconomic position). Most of the baseline UK Biobank sample in England was white British, with the next largest groups being other white, white Irish and then south Asian and black (Table 1 and Additional file Table S2). Approximately one-third (32.9%) of the sample had a degree and 16.2% had no formal qualifications. In our sample, 2658 people had been tested for SARS-CoV-2 and 948 had at least one positive test (726 received a positive test in a hospital setting suggesting more severe illness) (see Additional file Table S3 for outcomes by ethnicity, socioeconomic deprivation and education level). The geometric mean number of tests performed per participant tested was 1.53 (95% CI 1.50–1.56).
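The headline proportions can be checked directly from the counts reported above. This sketch assumes that the 8.4% is expressed as a share of participants before exclusion for missing data, an assumption that reproduces the reported figure.

```python
# Counts as reported in the text.
included = 392_116
excluded_missing = 36_109
tested = 2_658
positive = 948
positive_in_hospital = 726

# Share excluded for missing data (assumed denominator: included + excluded).
pct_missing = 100 * excluded_missing / (included + excluded_missing)
# Share of tested participants with at least one positive test.
pct_positive_of_tested = 100 * positive / tested
# Share of confirmed cases whose positive test was taken in hospital.
pct_hospital_of_positive = 100 * positive_in_hospital / positive
```

These evaluate to roughly 8.4%, 35.7% and 76.6% respectively.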

Table 1 Description of the study population

In comparison to the white British majority ethnic group, several ethnic minority groups had a higher risk of testing positive for SARS-CoV-2 infection and also testing positive while attending hospital (Fig. 1 and Additional file: Tables S4 and S5). Black participants had the highest risk (RR 3.35 (95% CI 2.48–4.53)), with adjustment for the country of birth resulting in little attenuation (RR 3.13 (95% CI 2.18–4.48)); adjustment for a history of being a healthcare worker (RR 2.66 (95% CI 1.83–3.84)) and for social factors (including measures of socioeconomic position) did additionally attenuate the risk (RR 2.05 (95% CI 1.39–3.03)). South Asians also had an elevated risk of testing positive (RR 2.42 (95% CI 1.75–3.36) in model 1), with a broadly similar pattern of attenuation as for the black ethnic group. The white Irish group also had a marginally elevated risk of having a positive test (RR 1.42 (95% CI 1.00–2.03)) which attenuated with adjustment for social variables (RR 1.23 (95% CI 0.86–1.75)). The Chinese group had imprecisely estimated risk ratios due to smaller numbers. The pattern of findings for hospital cases was similar (Additional file S5), suggesting that the higher testing rates amongst certain ethnic groups in the community were not skewing the results. Similarly, the likelihood of testing positive amongst those who had been tested was often higher or the same in these ethnic groups (Table 2 and Additional file S16), whereas a lower risk would have suggested differentially high testing.
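One descriptive way to summarise attenuation is the share of the excess relative risk that disappears after adjustment. This summary is not reported in the study and is offered only as an illustration; it ignores the uncertainty in both estimates and should not be read as a formal mediation analysis.

```python
def excess_risk_attenuation(rr_initial: float, rr_adjusted: float) -> float:
    """Percentage of the excess relative risk (RR - 1) removed by adjustment."""
    return 100 * (rr_initial - rr_adjusted) / (rr_initial - 1)


# Black participants: model 1 (RR 3.35) versus the model adjusted for
# healthcare work and social factors (RR 2.05), as reported above.
attenuation_black = excess_risk_attenuation(3.35, 2.05)
```

On these point estimates, a little over half of the excess risk is attenuated, leaving a roughly twofold relative risk unexplained by the adjusted factors.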

Fig. 1

Risk ratios for associations between broad ethnicity groups (white British as the reference category) and SARS-CoV-2. Model 1: age, sex and assessment centre. Model 2: model 1 + country of birth. Model 3: model 2 + healthcare worker. Model 4: model 3 + social variables (urbanicity, number of people per household, highest education level, deprivation, tenure status, employment status, manual work). Model 5: model 4 + health status variables (self-rated health, number of chronic conditions and longstanding illness) + behavioural risk factors (smoking, alcohol consumption and BMI). Coefficients for the Chinese and other groups are not shown

Table 2 Risk ratios for testing positive for SARS-CoV-2 amongst those tested (N = 2658) in UK Biobank

When using a more detailed ethnicity classification within the south Asian and black groups, we observed important heterogeneity in the pattern of findings between the Indian group and other south Asian groups (Fig. 2 and Additional file Tables S7-S9). Compared to white British, risks were largest in the Pakistani group (RR 3.24 (95% CI 1.73–6.07)), followed by other south Asians (RR 3.00 (95% CI 1.64–5.49)) and were more modestly increased in the Indian group (RR 1.98 (95% CI 1.26–3.09)). There were less clear differences in the estimates for black Caribbeans and black Africans: RR 3.51 (95% CI 2.39–5.15) and RR 3.11 (95% CI 1.97–4.91) in initial models and RR 2.18 (95% CI 1.43–3.32) and RR 1.53 (95% CI 0.87–2.69) in fully adjusted models respectively.

Fig. 2

Risk ratios for associations between narrow ethnicity groups (white British as the reference category) and SARS-CoV-2. Model 1: age, sex and assessment centre. Model 2: model 1 + country of birth. Model 3: model 2 + healthcare worker. Model 4: model 3 + social variables (urbanicity, number of people per household, highest education level, deprivation, tenure status, employment status, manual work). Model 5: model 4 + health status variables (self-rated health, number of chronic conditions and longstanding illness) + behavioural risk factors (smoking, alcohol consumption and BMI). Coefficients for the white Irish, white other, mixed, Chinese, black other and other groups are not shown

In comparison to the most socioeconomically advantaged quartile, living in a disadvantaged area (according to the Townsend deprivation score) was associated with a higher risk of confirmed infection, particularly for the most disadvantaged quartile (RR 2.19 (95% CI 1.80–2.66)) (Fig. 3 and Additional file: Table S10). Differences in ethnicity and country of birth, social factors, baseline health and behavioural risk factors all moderately attenuated the association in the most disadvantaged quartile. Socioeconomic deprivation was also associated with hospital cases (Additional file: Table S11). While testing was again more likely, the risk of being diagnosed positive amongst those tested also tended to be higher, rather than lower (Table 2 and Additional file: Table S17).

Fig. 3

Risk ratios for associations between Townsend deprivation score quartile (most advantaged as reference category) and SARS-CoV-2. Model 1: age, sex and assessment centre. Model 2: model 1 + ethnicity + country of birth. Model 3: model 2 + social variables (healthcare worker, urbanicity, number of people per household, highest education level, tenure status, employment status, manual work). Model 4: model 3 + health status variables (self-rated health, number of chronic conditions and longstanding illness) + behavioural risk factors (smoking, alcohol consumption and BMI)

Analyses by education level also showed a higher risk of confirmed SARS-CoV-2 infection with the lowest level of education (RR 2.00 (95% CI 1.66–2.42) for no qualifications compared to degree level educated) (Fig. 4 and Additional file: Table S13). While adjustment for ethnicity and country of birth made little difference to the association, adjustment for social factors, baseline health and behavioural risk factors all attenuated the association somewhat (RR 1.46 (95% CI 1.19–1.79) in fully adjusted model). We again observed a similar pattern in hospital cases and found little evidence of increased testing amongst the less educated groups (Fig. 4 and Additional file Tables S14 and S18).

Fig. 4

Risk ratios for associations between highest educational level (degree educated as reference category) and SARS-CoV-2. Model 1: age, sex and assessment centre. Model 2: model 1 + ethnicity + country of birth. Model 3: model 2 + social variables (healthcare worker, urbanicity, number of people per household, deprivation, tenure status, employment status, manual work). Model 4: model 3 + health status variables (self-rated health, number of chronic conditions and longstanding illness) + behavioural risk factors (smoking, alcohol consumption and BMI). Coefficients for the other groups are not shown


Several ethnic minority groups had a higher risk of both being diagnosed and testing positive in a hospital setting with laboratory-confirmed SARS-CoV-2 infection in the UK Biobank study. The black and south Asian groups were found to be at greatest risk, with Pakistani ethnicity at greatest risk within the south Asian group. Similarly, measures of socioeconomic disadvantage (area-based deprivation and lower education) were also associated with an increased risk of having confirmed infection and being a hospital case. For both ethnicity and socioeconomic position, we did not find evidence that these patterns were likely to be due to differential ascertainment, since although the likelihood of testing was increased, the likelihood of a positive test was, if anything, higher amongst ethnic minorities who had been tested. Ethnic differences in infection risk did not appear to be fully accounted for by differences in pre-existing health, behavioural risk factors or country of birth measured at baseline. Furthermore, socioeconomic differences appeared to make a moderate contribution to these ethnic differences.

Our study has several important strengths. First, by using a well-characterised cohort study, we can identify a clearly defined population at risk of experiencing SARS-CoV-2 infection. Combining data linkage with a large sample size has allowed us to provide empirical data on this crucial policy priority in a timely fashion. Ethnicity was collected using self-report, which is widely considered to be a gold standard approach [20], and the large dataset has allowed a more nuanced appreciation of the risks of infection within different members of the white majority population, as well as drilling down into more specific minority ethnic groups [21]. Our investigation of socioeconomic position has similarly benefited from being able to study different measures and assess the pattern of findings across these. The detailed data collected in this cohort has also allowed us to investigate the extent to which observed inequalities are potentially mediated by a wide range of factors, including behavioural risk factors, pre-existing health status and other social variables.

However, several potential limitations should be noted. Ascertainment bias is potentially problematic and could arise in several ways, including differential healthcare seeking, differential testing and differential prognosis. Even so, we have been unable to find any evidence to suggest that differential healthcare seeking or testing would explain the observed pattern of findings. Increased ascertainment amongst ethnic minorities would be expected to result in a lower proportion of confirmed cases amongst those tested whereas we observed the opposite. One possibility that remains is that some ethnic and socioeconomic groups have a poorer prognosis and are therefore more likely to be admitted to hospital and therefore to be tested [7]. However, if this were the case, the issue of more adverse outcomes amongst these groups remains concerning. Other limitations include the non-representativeness of the UK Biobank study population, potentially exacerbated by missing data, with those who were more advantaged being more likely to participate and ethnic minorities less well represented. There is therefore the potential that the findings in our study may not reflect the broader UK population [22, 23]. However, empirical research has found that this may not result in substantial bias in measures of association in the UK Biobank study [24]. Furthermore, estimates from other sources of inequalities in COVID-19 mortality show similar patterns of associations to our results [25, 26]. We have also been unable to fully exclude all deaths that occurred prior to the pandemic, due to lack of up-to-date linkage to mortality records at present. Our exposure data were collected some years ago, and it is therefore likely that pre-existing health, risk factors and some social variables have changed, although generally most risk factors track throughout life [27]. However, it is possible that management for chronic health conditions could have been differential across ethnic and socioeconomic groups [28] between baseline data collection and the pandemic period. Being a healthcare worker was also ascertained at baseline, although many who stopped employment in this area have now returned to work [29]. Lastly, due to sparse data, we have not explored the role of specific health conditions such as asthma, diabetes and high blood pressure, which have been shown to be associated with a higher risk of severe outcomes [3, 30] and are more prevalent amongst socioeconomically disadvantaged groups and some ethnic minority groups [31, 32]. However, these are likely to operate as mediators rather than confounders.

Administrative data from health services have recently suggested an increased risk of severe COVID-19 disease within ethnic minority groups. The UK’s Intensive Care National Audit & Research Centre (ICNARC) analysed data on 5578 patients admitted to critical care up to 16th April 2020 and found black and Asian people comprised a high proportion of total patients (11.2% and 14.9% respectively), although it was unclear whether these higher percentages were biased by most cases being initially seen in areas with high proportions of ethnic minority groups [33]. Similarly, data from the US Centers for Disease Control and Prevention also suggest a higher risk amongst black or African American people, but information on race was missing for approximately two-thirds of those diagnosed [34]. Analyses of administrative UK data have also suggested increased COVID-19 mortality in black and south Asian ethnic groups [26], which was only partly accounted for by socioeconomic differences [25]. However, the role of prior health and risk factors was not accounted for. Academic research on this topic has been limited to date. An ecological study of US counties has suggested that more socially vulnerable areas (which included greater numbers of people with socioeconomic disadvantage and ethnic minorities) were associated with higher COVID-19 case fatality rates [35]. Our study adds substantially to the evidence by finding that ethnicity appears to be an important predictor of laboratory-confirmed SARS-CoV-2 infection that is only partly attenuated by a large range of potential mediators (such as socioeconomic position), as well as addressing concerns about numerator-denominator bias.

Our results suggest there is an urgent need for further research on how SARS-CoV-2 infection affects different ethnic and socioeconomic groups. Our findings warrant replication in other datasets, ideally including representative samples and across different countries. As the pandemic evolves, there is a need to monitor infection and disease outcomes by ethnicity and socioeconomic position. However, data to allow this disaggregation are often not available; record linkage could potentially help address this gap, particularly in settings where administrative register data are available. Given the differences in health risks across occupational groups [36], understanding the risks that the full range of key workers experience is also required. Lastly, other social groups, such as homeless people, prisoners and undocumented migrants, experience severe disadvantage and research is necessary to study these highly vulnerable populations too [37, 38].


The limited evidence available suggests that some ethnic minority groups, particularly black and south Asian people, are particularly vulnerable to the adverse consequences of COVID-19. Socioeconomic disadvantage and poorer pre-existing health do not explain all of this elevated risk. There is therefore a need to determine why this increased risk occurs. An immediate policy response is required to ensure the health system is responsive to the needs of ethnic minority groups. This should include ensuring that health and care workforces, which often rely on workers from minority ethnic populations, have access to the necessary personal protective equipment (PPE) to ensure they can work safely. Timely communication of guidelines to reduce the risk of being exposed to the virus is also required in a range of languages [39]. Previous evidence suggests ethnic minorities in the UK tend to receive reasonably equitable care in many, but not all, areas [40]. However, this is not the case in many other countries (such as the USA) where the adverse consequences of SARS-CoV-2 infection may be even worse. SARS-CoV-2 therefore has the potential to substantially exacerbate ethnic and socioeconomic inequalities in health [41], unless steps are taken to mitigate these inequalities. The data from this study may be helpful to inform allocation of more aggressive therapies in people with severe disease, or targeting preventative vaccination to at-risk groups, once evidence for such approaches becomes available.