Ethnicity is increasingly important for understanding the emerging health needs of different communities. To this end, the Thames Cancer Registry (TCR) and the London Health Observatory (LHO) are working together to examine ethnicity coding in cancer.

Ethnicity information has been collected in Britain at the last two censuses, 1991 and 2001. In 1991 a simple classification with nine categories was used to show ethnicity: White, Black-Caribbean, Black-African, Black-Other, Indian, Pakistani, Bangladeshi, Chinese, and any other ethnic group. For the 2001 census the classification was changed so that White was expanded to White-British, White-Irish and White-Other, while Asian-Other and four Mixed groups were added to create a total of 16 categories. Cancer registries have included ethnicity as an optional data item since 1993. Ethnicity coding was introduced to the NHS in 1995 as part of the Hospital Episode Statistics (HES) data. Unfortunately, the availability of ethnicity data in both datasets has not improved as rapidly as hoped, for a number of reasons; for example, people are reluctant to collect data that they do not feel are being used.

Accurate coding and collecting of ethnicity is important for epidemiological research and for planning services. Following the Race Relations (Amendment) Act in 2000, the NHS developed guidance to ensure different ethnic groups have equal access to services. Certain cancers can be associated with certain ethnic groups, for example, breast cancer in Ashkenazi women [1] and prostate cancer in black males [2]. Head and neck cancers are particularly associated with Asians from the Indian sub-continent and nasopharyngeal cancers with the Chinese [3]. Such associations can be due to genetic influences or to lifestyle and environment.

There has not been much research on ethnicity in this country due to the poor availability of ethnicity information. Studies have generally been focused in particular areas and heavily reliant on name algorithms to determine ethnicity [4–7]. Country of birth can also be used as a proxy for ethnicity, but this does not work for older people born in India when it was still part of the British Empire, and it also means that the research is only looking at migrant populations and not the total ethnic group. Migrant populations tend to have an incidence somewhere between the 'home' and the 'host' nation which often approaches the 'host' nation after one or two generations, for example stomach cancers in Japanese people who move to a Western country [2]. Examining these populations is more suitable where immigration is a new phenomenon, and the 'host' population is fairly homogeneous.

The aim of this joint project was to determine the completeness of ethnicity coding on the TCR database and in the HES data as held at the LHO, and investigate factors associated with the availability of ethnicity data.


The TCR dataset for the calendar year 2002 and the HES dataset for the financial year April 2002 to March 2003 were examined. Data on whether ethnicity was recorded, sex, age, cancer network of residence, deprivation, proportion of non-whites in the local authority population, and site of cancer were available. The TCR dataset records individual tumours whilst the HES dataset records inpatient episodes. As the datasets covered different time periods and had different definitions of cancer records, no attempt was made to match the datasets to validate the ethnic code information. Ethnicity was regarded as recorded if it was a valid, non-missing code.

The deprivation data were taken from the Index of Multiple Deprivation 2000 income domain [8]. Quintiles were computed for the London area and assigned to records based on postcode of residence.
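The quintile assignment described above can be sketched as follows. This is an illustration only, not the procedure actually used: the area labels, scores and postcode lookup are hypothetical, and it assumes that quintiles contain equal numbers of areas and that a higher score means greater deprivation (quintile 5 = most deprived).

```python
def assign_quintiles(scores):
    """Return {area: quintile} by rank of score (1 = lowest, 5 = highest)."""
    ranked = sorted(scores, key=scores.get)  # areas in ascending score order
    n = len(ranked)
    # Integer arithmetic splits the ranked areas into five near-equal groups.
    return {area: (rank * 5) // n + 1 for rank, area in enumerate(ranked)}


# Hypothetical income-domain scores for ten illustrative areas.
area_scores = {
    "A": 0.04, "B": 0.07, "C": 0.10, "D": 0.13, "E": 0.17,
    "F": 0.21, "G": 0.26, "H": 0.31, "I": 0.38, "J": 0.46,
}
# Hypothetical postcode-to-area lookup.
postcode_to_area = {"XX1 1AA": "B", "XX2 2BB": "J"}

area_quintile = assign_quintiles(area_scores)


def quintile_for_postcode(postcode):
    """Return the quintile for a record's postcode, or None if unmatched."""
    area = postcode_to_area.get(postcode)
    return area_quintile.get(area)
```

Records whose postcode cannot be matched to an area return None here; how such records were handled in the study is not stated.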

The proportion of non-whites in the populations of the local authorities in London was calculated from the ONS Labour Force Survey [9] where a quarter of the labour force was surveyed in the summer of 2001 for their ethnicity. This data source was used as Census data were not available at the start of the study. The proportion of the population in each local authority which was non-white was calculated, and the local authorities were grouped into quintiles.

The proportion of patients who had ethnicity recorded was calculated for each variable collected. For the TCR data, logistic regression was then used to fit a fully adjusted model, including sex, age, cancer network of residence, deprivation, proportion of non-whites in the population and site of cancer. Results were then transformed to obtain an adjusted proportion of patients with ethnicity data provided. Tests for trend were done by fitting categorical variables as continuous, and χ² tests were used to test for heterogeneity.
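As an illustration of the heterogeneity test mentioned above, the following minimal sketch computes a Pearson χ² statistic comparing the proportion with ethnicity recorded across the categories of one variable. The counts are hypothetical, the function names are our own, and the sketch is unadjusted; it does not reproduce the fully adjusted logistic regression model.

```python
import math


def heterogeneity_chi2(groups):
    """groups: {name: (recorded, total)}. Returns (statistic, degrees of freedom)."""
    total_recorded = sum(recorded for recorded, _ in groups.values())
    total = sum(n for _, n in groups.values())
    p_overall = total_recorded / total  # pooled proportion under the null
    stat = 0.0
    for recorded, n in groups.values():
        expected_recorded = n * p_overall
        expected_missing = n * (1 - p_overall)
        stat += (recorded - expected_recorded) ** 2 / expected_recorded
        stat += ((n - recorded) - expected_missing) ** 2 / expected_missing
    return stat, len(groups) - 1


def p_value_df1(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: (patients with ethnicity recorded, all patients).
by_sex = {"men": (300, 1000), "women": (250, 1000)}
stat, df = heterogeneity_chi2(by_sex)
p = p_value_df1(stat)  # valid here only because df == 1
```

With more than two categories the statistic is computed in the same way, but the p-value requires the χ² distribution with the corresponding degrees of freedom, which is not implemented in this sketch.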


Table 1 shows the number and proportion with ethnicity recorded in each dataset. On the HES database, there were 111821 hospital admissions of London residents with a malignant cancer as the primary diagnosis. Ethnicity was recorded for 81.1% (90661) of these admissions; this figure was fairly uniform over sex and deprivation quintile. Ethnicity coding by age-group ranged from 71.8% in the 20–24 year olds to 88.9% in the 5–9 year olds, with the majority of the age groups achieving around 80%. The quintile with the largest proportion of non-whites had the lowest proportion of records with ethnicity coded (76.1%). The highest proportion, 85.3%, was recorded in the middle group. The valid ethnic coding varied between the five networks of residence from 68.2% to 93.0%. Coding over the cancer sites varied from 75.9% in pancreas cancer to 85.2% in bladder cancer.

Table 1 Number and proportion of cancer hospital admissions (HES) and cancer patients (TCR) with ethnicity recorded.

There were 25581 London residents registered on the TCR database with a malignant cancer (ICD10 C00-C97 excluding basal cell carcinomas of skin) diagnosed in 2002. A total of 22.7% (5796) had a valid ethnicity code (Table 1). The majority of patients with a valid ethnicity had an ethnic code of white (4652/5796, 80.3%; data not shown). Men were slightly more likely to have a valid ethnic code than women, 23.6% vs. 21.7%. Ethnicity coding varied by age-group from 9.5% in the under 1 year olds to 31.0% in the 5–9 year olds, with the majority of the age-groups achieving around 22%. The patients resident in the most deprived areas were most likely to have an ethnicity code, 33.2% as opposed to 17.0% in the least deprived areas. The availability of ethnicity data ranged from 16.7% in the local authorities with the smallest proportion of non-whites in the population to 33.9% in those with the largest. The availability of ethnicity coding varied between the five cancer networks of residence from 15.6% to 29.9%. Coding over the cancer sites varied from 18.7% in ovarian cancer to 25.7% in head and neck cancers.
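The headline completeness figures can be checked directly from the counts reported in the text. The short sketch below does only this arithmetic; the helper name `pct` is our own.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * numerator / denominator, 1)


# Counts as given in the text.
hes_completeness = pct(90661, 111821)  # HES admissions with ethnicity recorded
tcr_completeness = pct(5796, 25581)    # TCR patients with a valid ethnicity code
tcr_white_share = pct(4652, 5796)      # coded white, among those with a valid code
```

These evaluate to 81.1, 22.7 and 80.3 respectively, matching the reported percentages.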

The difference in ethnicity coding between the sexes in the TCR data was no longer significant when adjusted for all other variables; this was mostly due to the effect of cancer site, as predominantly male cancers (e.g. lung, prostate and bladder) had high proportions of patients with ethnicity recorded (Table 2). As age increased, patients were less likely to have ethnicity recorded, both before and after adjustment. Patients were more likely to have ethnicity recorded as the proportion of non-whites in the population increased. After adjustment this trend was entirely driven by the group with most non-whites having a very high proportion of patients with ethnicity coded; without this group there was a significant negative trend. Adjustment did not affect the associations between cancer networks of residence or deprivation and ethnicity being recorded. In the unadjusted analysis patients with cancer of the head and neck (25.7%) and melanoma of skin (25.6%) had the highest proportion of patients with ethnicity recorded. After adjustment the groups with the highest proportion of ethnicity coded were melanoma of skin (27.8%) and bladder cancer (27.5%).

Table 2 Proportion of cancer patients (TCR) with ethnicity recorded, unadjusted and adjusted for all terms in table.


There were large differences between the availability of ethnicity data in the TCR and HES datasets. A number of factors were associated with the likelihood of having ethnicity recorded. Patients resident in areas with a higher proportion of non-white residents, and those resident in the most deprived areas, were more likely to have an ethnic code on the TCR database, though this pattern was not seen with the HES data.

Ethnicity information has generally not been well recorded in the UK. However, some health data sources have a high level of completeness. The Survey of Prevalent HIV Infections Diagnosed (SOPHID) is a cross-sectional survey of all individuals who have been diagnosed with an HIV infection and attended for HIV related care at an NHS site within a calendar year. Only 5% of these patients seen in London in 2002 did not have ethnicity recorded [10].

Recording of ethnicity data has been particularly difficult in the TCR database. The proportion of valid ethnic coding on the TCR database for London residents varied by area of residence, deprivation quintile and the proportion of the population that is non-white. The largest variation in ethnic coding in the HES database was between the cancer networks of residence.

The difference in proportions of ethnicity data available in the HES and TCR data may have occurred for a number of reasons. The HES data are downloaded from the hospital patient administration system (PAS) which should include ethnicity. The TCR data come from a number of different data sources, most of which do not have ethnicity as part of the dataset. The primary source of TCR data is a mixture of PAS and pathology data with other data added from radiotherapy or chemotherapy clinic notes. However, some fields, such as ethnicity, may not be available to data collection staff, or staff may not be collecting the data item, viewing it as less important than other variables. This needs to be reviewed internally. Data are also obtained directly from death certificates, GP notes, and outpatient notes. Of all the sources of TCR data, the PAS is the only one likely to contain ethnicity data that are accessible to the TCR data collection staff.

The variation in ethnicity coding between cancer networks of residence in the HES data is likely to be due to differences in trust of admission. A study of access to revascularisation in London examined the completeness of ethnicity coding for related episodes and found wide variation between hospital trusts [11].

As the HES dataset records in-patient episodes, rather than individual patients, some patients will be recorded more than once. If these patients were more likely to have their ethnicity recorded, the results would be affected by selection bias, with HES completeness figures artificially inflated. Although the populations being examined are different, this is unlikely to explain the large differences in results found.
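The effect of counting episodes rather than patients can be illustrated with a small sketch. The episode list and patient identifiers are hypothetical, and treating a patient as having ethnicity recorded if any of their episodes carries a valid code is an assumption made for illustration, not a rule taken from this study.

```python
# Hypothetical episodes: (patient identifier, ethnicity recorded on this episode).
episodes = [
    ("P1", True), ("P1", True), ("P1", True),  # one patient admitted three times
    ("P2", False),
    ("P3", True),
    ("P4", False),
]


def episode_completeness(episodes):
    """Proportion of episodes with ethnicity recorded."""
    return sum(recorded for _, recorded in episodes) / len(episodes)


def patient_completeness(episodes):
    """Proportion of patients with ethnicity recorded on at least one episode."""
    recorded_by_patient = {}
    for patient_id, recorded in episodes:
        recorded_by_patient[patient_id] = (
            recorded_by_patient.get(patient_id, False) or recorded
        )
    return sum(recorded_by_patient.values()) / len(recorded_by_patient)


per_episode = episode_completeness(episodes)  # 4 of 6 episodes
per_patient = patient_completeness(episodes)  # 2 of 4 patients
```

In this constructed example the episode-level figure (4/6) exceeds the patient-level figure (2/4) because the repeatedly admitted patient has ethnicity recorded on every episode.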

In the TCR data the areas with the highest proportion of non-whites in the population were most likely to record ethnicity. Areas with diverse populations may be more aware of the importance of collecting ethnicity information than areas with a large ethnic majority population.


Improved recording of ethnicity in the sources of data to which the TCR has access will improve completeness, as will highlighting the importance of collecting ethnicity to data collection staff. Alternatively, information from other sources, e.g. electronic data feeds or HES data supplied directly to the TCR, should increase completeness. Efforts to collect ethnicity data should be encouraged in all healthcare settings. Future research should explore where the difficulties in collecting ethnicity information lie, whether with patients, healthcare professionals or the recording procedure, and how such problems can be overcome.