Introduction

The conventional definition of geodemographics is “the analysis of people by where they live” (Leventhal 2016 p6), usually by the classification of small areas sharing common socioeconomic characteristics. Geodemographic classification is a powerful and internationally applicable (Singleton and Spielman 2014) data reduction tool which summarises complex multivariate data and is used to assist business location decisions and the delivery of public services. In this paper we develop – and implement on a national scale for the UK – a substantive enhancement to the conventional geodemographics concept by building a classification based on the characteristics of workers and their places of work.

The majority of openly available geodemographic classifications have for decades relied on small area data from censuses of population, which enumerate people primarily referenced to their places of residence. Various commercial organisations have combined census and non-census data to produce classifications targeted at different markets and sectors of the population, but their methods and data products are rarely published openly. More recently, new forms of data from business records and administrative sources have come to be incorporated into open classifications (Singleton and Spielman 2014), but until very recently these have similarly focused on characteristics of people based on where they live, such as household and dwelling characteristics. These residence-based classifications effectively create neighbourhood types reflecting the situation when residents are at home – mostly, but not exclusively, at night. During working hours (which encompass both day and night, but predominantly day), the working population is extensively redistributed over geographical space, meaning that the characteristics of these areas can change greatly. These very different distributions also mean that residentially-based census reporting zones and associated data tend to be most detailed in areas where few people work, and least detailed in areas such as city centres and business districts with high workplace populations (Martin et al. 2013).

Many sectors require an understanding of the distribution and characteristics of the population at small area level, but for some important purposes it is the distribution and characteristics of population during the working day, rather than the (predominantly night-time) residential pattern, which is most relevant. For example, analysts and planners in the emergency, healthcare, transport and retail sectors require information about the spatial distribution of workers and workplaces in order to assess time-specific hazard scenarios, market opportunities and service delivery. Geodemographic analysis based on place of residence can provide, at best, only partial insights for these purposes.

In this paper, we propose the novel concept of a workplace-based geodemographic classification, based on the characteristics of the working population at their places of work. We describe a programme of research leading to the creation of a UK-wide Classification of Workplace Zones (COWZ-UK), building on a recently created statistical geography of Workplace Zones (WZs), and an interim classification for England and Wales (COWZ-EW). COWZ-UK was published in 2018 by the Office for National Statistics (ONS) and is available as an open data product from https://www.ons.gov.uk/methodology/geography/geographicalproducts/areaclassifications/2011workplacebasedareaclassification. UK-wide implementation has required generation of entirely new geographical units and datasets for Scotland and Northern Ireland. The classification is, nevertheless, methodologically consistent with previous UK classifications and is internationally applicable.

The remainder of this paper is structured as follows. In the next section, we review geodemographic classification, focusing on developments and data most relevant to classification of workers and workplaces. The third section describes the data and methods developed for the creation of COWZ-UK. The fourth section presents a summary of our new classification, including examples of the insights made possible. We then present a validation exercise based on spatial intersection of the new classification with a large map database indicative of different types of workplace locations. Finally, we present our conclusions and reflect on future opportunities.

Review

Geodemographic methods (Harris et al. 2005; Singleton and Spielman 2014; Leventhal 2016) comprise a variety of approaches to the identification of geographical patterns in multivariate socioeconomic data, usually by data reduction and classification. There is variation in the terms used to describe these classification methods in the academic and business literatures. For the sake of consistency, we here follow the convention adopted by the UK national statistical agencies and the most relevant academic commentators (ONS 2015; Gale et al. 2016; Leventhal 2016). Thus, geodemographic data products are here termed “classifications” and individual groupings of area types “clusters”, regardless of the specific algorithm employed in their construction. In geodemographic classification, a wide range of input variables describing demographic and social characteristics are combined through methods such as principal components analysis (PCA), k-means clustering or hierarchical agglomerative clustering (Harris et al. 2005; Leventhal 2016). The outputs of these methods are typically small numbers of distinctive clusters, with each area being allocated to the cluster to which it is closest in multidimensional variable space. These classifications often comprise two or three levels of nested clusters. Area cluster membership may be added as a contextual indicator to records relating to individual addresses, households and people. Textual and statistical descriptors are usually provided for each cluster, to provide users with summaries of its distinctive characteristics.

There is a wide range of both general-purpose and market-specific geodemographic classification systems worldwide. Leventhal (2016) provides a useful summary of these, illustrating the diversity of geographical units, numbers of clusters and information sources employed, together with application examples spanning the marketing, retail, financial services, media and market research industries as well as the public and academic sectors. Diverse examples include airport marketing in Australia (Leung et al. 2016) and assessment of exposure to natural hazards in Italy (Willis et al. 2014). Many classifications have been developed and made available as commercial products (see for example Experian™’s Mosaic (https://www.experian.co.uk/business/marketing/segmentation-targeting/mosaic/) or TransUnion’s CAMEO (https://www.transunion.co.uk/products-and-services/consumer-marketing-data/segmentation-analysis)), but these are not openly available. By contrast, Vickers and Rees (2007) and Gale et al. (2016), respectively, describe the development of the 2001 and 2011 Census-based Output Area Classifications (OAC) of the UK, both of which were entirely conceived of, and produced as, open data products.

Various current research trends are discernible in geodemographic classification, of which we note increasing variety of input data, creation of open classifications (Singleton and Spielman 2014; Singleton et al. 2016) and an increasing interest in aspects of temporality. Longley and Adnan (2016), for example, explore the creation of a geodemographic classification based on Twitter usage, making the important observation that these data are no longer tied only to residential locations. However, due to the importance of residential census data to many classifications (Burns et al. 2018), it has still been usual for residential census geographical units to form the spatial building blocks for such classifications.

Singleton et al. (2016) demonstrate one aspect of the temporal specificity of geodemographic classifications by investigating classification stability through time, presenting a new classification based on data from two consecutive censuses. This is an important innovation, but does not address the much shorter cyclical timescale in which we are interested here, namely the daily redistribution of population between residential and workplace locations. The Singleton et al. (2016) classification is based entirely on residential Output Areas (OAs, mean population 309) and limited to England, thereby avoiding the challenge of reconciling censuses held in different countries. They provide an approach to validation by exploiting ancillary secondary data, which provide alternative characterisations of residential areas.

Rather than building a classification from area characteristics, Burns et al. (2018) develop a classification of individual (person-level) characteristics based on census microdata and then geographically distribute these using spatial microsimulation to match observed area characteristics. This approach addresses the important issue that small areas are never in reality comprised of individuals and households who all share the characteristics of the area-based geodemographic cluster to which they have been assigned and more often encompass very heterogeneous populations. In theory, the classified microdata employed in Burns et al. (2018) could be allocated to appropriate workplace locations, although this aspect is not explored.

Clearly, the patterns revealed by residential geodemographics do not exploit the full richness of census data, which also include questions about place of work. Workplaces represent both a different geographical distribution and a different temporal domain for the 45.2% of the UK population recorded by the 2011 Census as engaged in employment (source: 2011 Census tables QS104UK and QS601UK, https://www.nomisweb.co.uk/), and yet, to date, there has been no geodemographic classification of UK workers and workplaces within bespoke small area units, primarily due to the lack of suitable data and boundaries.

Very few published studies have attempted to produce classifications of small areas based solely on workplace characteristics; where this has been attempted it has been in support of specific research objectives or without workplace-based small area data or boundaries. Hincks et al. (2018) produce a classification of commuting flows using census travel to work data, characterising the types of commuter travelling from origin and into destination areas. However, their analysis is based on Middle Layer Super Output Areas (MSOAs, mean population 7806) and thus does not make use of the highest spatial resolution residence or workplace data available. Debenham et al. (2003) present a novel analysis, which integrates census and non-census data about the supply- and demand-side of residential and labour markets into a single classification. They include administrative data about employment in different industries but the data integration necessitates aggregation to postcode sectors (mean population in their study area 33,691) which are not specific to either residential or workplace distributions. Manaugh et al. (2010) build separate residential and workplace classifications for analysis of commuting in Montreal, to characterise the origin and destination neighbourhoods in a travel survey. The two classifications share the same spatial grid and pool of input variables, albeit refined by a factor analysis, but there are relatively few employment-related variables and no attempt to extend this to a larger region or more general application.

The closest existing attempt to a country-wide classification of workers and workplaces based on place of work at the small area level is a UK commercial product called Workforce Acorn (CACI 2015). Workforce Acorn reweights CACI’s established classification of residential areas onto areas of work, based on travel to work flows. It is therefore not a direct classification of workers and workplaces in areas of work and is unfortunately not openly available. Other commercial products which integrate census and non-census data to produce partial classifications of workers, workplaces or areas of work, include AFD Software’s Censation (which provides a simplistic classification of commercial areas in the UK based on census and non-census data: https://www.afd.co.uk/data-sets/censation/), and Scan/US’s Daytime demographics classification (classifies data on business establishments, employees and land-use at a range of geographical scales but not for specific workplace-based areas: http://www.scanus.com/daytime-demographics-subscription.htm).

The four countries of the UK are the setting for the empirical work described in this paper. Three separate censuses are conducted: by ONS for England and Wales, by National Records of Scotland (NRS) for Scotland and by the Northern Ireland Statistics and Research Agency (NISRA) for Northern Ireland. Although the censuses are comparable, there are important differences, such as in the questions asked, output categories, confidentiality thresholds and OA design. Census OAs (known as Small Areas in Northern Ireland, and with slightly different design criteria in each of the UK censuses) are designed to represent approximately equal numbers of residents and households and thus contain widely varying workplace populations, which in the 2001 Census ranged from 0 to 80,145. This immense variation and associated disclosure control concerns meant that much of the data about workers and workplaces collected by the 2001 Censuses could not be published.

A major innovation of the 2011 Census in England and Wales was the creation of an entirely new set of geographical units (WZs) designed specifically for the release of census workplace data (ONS 2014a). Using automated zone design methods developed by Martin et al. (2013) and previously applied to OAs (Cockings et al. 2011), ONS generated 53,578 WZs (mean worker count 493) by splitting, merging or retaining the 2011 OAs. Design of WZs respected confidentiality thresholds of 200 workers and 3 workplace postcodes, permitting 21 tables of 2011 Census aggregate outputs about the workplace population and workplaces to be published, based on geographical aggregation of each person in employment to the WZ of their place of work. WZs are quite different in scale and purpose from the various types of functional labour market, travel to work or transportation zones used in the UK and internationally (Adams et al. 1999; Coombes and Bond 2008; FHA 2010). The value of these new workplace-based data and bespoke geographical units to a range of sectors has been demonstrated by various analysts. Waddington et al. (2017, 2019) highlight their value to retailers, particularly in the convenience sector, who are able to take account of the characteristics of the population working, as well as living, in the vicinity of existing or potential new stores, and to develop more time-specific demand and revenue prediction models. Woods (2017) uses the WZ demographic data as the framework for an analysis of workplace accidents. In local government, Hampshire County Council (2014) demonstrates the new insights to be gained into the characteristics of the workforce in local employment centres, while Hackney London Borough Council (2015) employs WZs and associated data to inform their Transport Strategy for 2015–2025.

An interim classification of WZs for England and Wales (EW), called COWZ-EW, has been developed by Cockings et al. (2015; http://cowz.geodata.soton.ac.uk/cowz-ew/). This has already been utilised by Berry et al. (2016) in retail, by Greater London Authority (2016) in local government, and by Martin et al. (2018) in travel-to-work analysis. COWZ-EW has also since been refined by Singleton and Longley (2019) to produce a bespoke classification for London (UK). At the time of development of COWZ-EW, while there was strong user interest in development of a UK-wide classification, this was not possible due to the lack of WZs and corresponding census outputs for Scotland and Northern Ireland. The rest of this paper describes the development of a new, UK-wide open geodemographic classification of workers and workplaces, including creation of the necessary geographical units and aggregated census data for Scotland and Northern Ireland.

Creating a Classification of Workplace Zones for the UK (COWZ-UK)

Data

To enable a UK-wide version of COWZ, WZ boundaries for Scotland and Northern Ireland were created by ONS on behalf of NRS and NISRA using the same software and, as far as possible, the same methods as for England and Wales (Martin et al. 2013; ONS/NRS/NISRA 2016). Capitalising on knowledge gained from the creation of COWZ-EW, and addressing specific differences between the UK censuses, the authors specified custom extracts of Scottish and Northern Irish 2011 census workplace microdata. These microdata, supplied by NRS and NISRA to ONS and accessed by the authors under secure conditions, were then aggregated to the newly created WZs for Scotland and Northern Ireland. Published aggregate WZ-level data for England and Wales were downloaded from NOMIS (https://www.nomisweb.co.uk/census/2011/workplace_population), thus providing the basis for a UK-wide geodemographic classification of 60,709 WZs.

As a by-product of this process, NRS and NISRA subsequently released openly available WZ boundaries for Scotland and Northern Ireland (available separately from NRS: https://www.nrscotland.gov.uk/statistics-and-data/geography/our-products/census-datasets/2011-census/2011-boundaries, or NISRA: https://www.nisra.gov.uk/support/geography/northern-ireland-workplace-zones, or as a complete UK set from the ONS Open Geography portal: http://geoportal.statistics.gov.uk/). NRS also released limited WZ-level data for Scotland (https://www.scotlandscensus.gov.uk/ods-web/data-warehouse.html#additionaltab), but NISRA were unable to publish WZ-level data due to disclosure concerns related to their previous release of data at the small area level.

The population base for COWZ-UK is the 2011 Census workplace population, defined as ‘All usual residents aged 16 to 74 in employment in the area the week before the census’, which includes any paid work (including casual or temporary work) as employees, self-employed (with or without employees), people on a government sponsored training scheme, people working for their own or family’s business, people on sick leave, maternity leave, holiday or temporarily laid off, and full-time students who are working. Those usually resident in one country (e.g. England) but working in another (e.g. Scotland), having workplaces outside the UK (including offshore installations), not usually resident in the UK and full-time students who are not working are excluded from the workplace population, in part due to the separate processing of the censuses by the different national agencies. Census respondents answer questions related to their main job (most hours), which means that secondary employment is not included. Those who work mainly at or from home, or with no fixed place of work are georeferenced to their area of usual residence, while workers who report to a depot are georeferenced to the depot. In the UK, a census ‘workplace’ is thus defined as the place of work recorded by a census respondent and may not correspond to a workplace found in other lists of businesses, enterprises or companies. The location of a workplace was determined by matching its postcode to the georeferenced postcode directory held by the relevant statistical agency. Workplaces recorded in the individual census records for Scotland and Northern Ireland were georeferenced by postcode using the May 2012 version of the ONS Postcode Directory (ONSPD). 
However, some postcodes spanning local authority boundaries had previously been split by NRS and the fragments assigned a modified postcode: for the purposes of COWZ-UK, these postcode parts were georeferenced using a lookup file provided by NRS and then allocated to a WZ using a point-in-polygon operation, with the microdata being aggregated to WZs.
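The allocation and aggregation step described above can be illustrated with a minimal Python sketch. It assigns georeferenced workplace points to zones using a standard ray-casting point-in-polygon test and then counts workers per zone. The zone identifiers, coordinates and record layout are hypothetical, and the sketch is not the software used by the authors or the statistical agencies, whose processing took place on secure microdata.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def allocate_to_zone(x, y, zones):
    """Return the id of the first zone containing the point, or None."""
    for zone_id, polygon in zones.items():
        if point_in_polygon(x, y, polygon):
            return zone_id
    return None

# Hypothetical zones (two adjacent squares) and hypothetical workplace coordinates,
# one per worker record, as would be obtained from a postcode lookup.
zones = {
    "WZ_A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "WZ_B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
workers = [(2.0, 3.0), (5.5, 9.0), (12.0, 4.0), (25.0, 5.0)]

counts = Counter()   # workers aggregated per zone
unallocated = 0      # records falling outside every zone
for x, y in workers:
    zone_id = allocate_to_zone(x, y, zones)
    if zone_id is None:
        unallocated += 1
    else:
        counts[zone_id] += 1
```

In practice a GIS or spatial database with a spatial index would be used for this operation; the linear search over zones here is only for clarity.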

Methods

The generic concept of a workplace-based geodemographic classification can be implemented using a variety of methods and datasets. The rest of this section describes its implementation in COWZ-UK. The aim is to construct a new classification which is based entirely on 2011 Census outputs, is consistent with the design of the 2001 and 2011 residence-based OACs, addresses the interests of research users and statistical agencies, and is publishable as an open data product. Greater detail than can be reproduced here is documented in ONS (2018a), particularly the definitions of census variables.

Identification of Relevant Domains

As with any geodemographic classification, careful consideration of real-world phenomena and detailed exploration of candidate variables are required in order to define domains and select variables for inclusion. Both 2001 and 2011 versions of the residence-based OAC include census variables organised into five domains, representing key characteristics of residential areas: demographic structure, household composition, housing, socio-economic group and employment. These are clearly not all appropriate for the characterisation of workers and workplaces, but it is possible to identify four comparable domains: composition of the workplace population, composition of the built environment, socio-economic characteristics of the workplace population and employment characteristics of the workplace population. There is no direct workplace equivalent of household composition because, unlike households, individual workplaces are not units of observation in the UK censuses, so there are no data about workforce size or industry sector for individual workplaces, nor is there a direct count of workplaces. The composition of the built environment domain in COWZ-UK is approximately equivalent to the housing domain in OAC, but is intended specifically to capture the relative workplace/residential mix of a WZ. As in OAC, distance travelled and mode of travel to work are included within the socio-economic characteristics of workplace population domain, based on the behaviour of individual workers.

Selection and Preparation of Variables for Classification

When producing the interim COWZ-EW classification, an initial long list of 501 candidate variables, from 13 of the 21 published WZ-level tables, was identified, covering the four domains (see Table 1 in Cockings et al. 2015). Three additional bespoke variables (the number of workplace postcodes (obtained directly from ONS), density of workplace postcodes (per hectare), and ratio of the number of OAs to WZs) were also explored as proxies for the composition of the built environment as this was not otherwise reflected in the published data. The remaining eight WZ-level tables were excluded because they did not fall within the scope of the classification, or because they were highly correlated with variables in other tables. Following in-depth analysis, this long list was reduced first to 63 and then to 48 variables, which formed the final inputs to the COWZ-EW cluster analysis.

Table 1 Final set of 48 variables input to COWZ-UK classification, grouped by domain

In producing COWZ-UK, the full long list of 504 variables is not re-evaluated, as the previous COWZ-EW analysis had already shown that many of these contribute little useful information to a workplace classification and there is no reason to expect this to be different for the whole of the UK. Analysis for COWZ-UK instead focuses on the 63 variables which comprised the intermediate set of candidate variables for COWZ-EW. These were constructed by the authors for Scotland and Northern Ireland from the microdata extracts supplied by NRS and NISRA.

The definitions of the vast majority of variables are consistent across the UK, or can readily be mapped to equivalents. However, special treatment is required for four variables whose differences originate from variations in the census questions and/or coding of responses. Variable names refer to those shown in Table 1. The categorisation of White British (WP201_WhiteBrit) ethnicity is slightly different in the Scotland and Northern Ireland censuses compared to England and Wales and it is not possible to fully reconcile the Northern Ireland definition with the others. Similarly, there are slight differences in the classification of highest level of qualification (WP501_GE_L4, WP501_L3, WP501_NoQual), for which the Scottish definition cannot be completely reconciled with the others. Full-time students (WP601_FT_Stud) are identified slightly differently in all three censuses, although a comparable count for each country can be obtained from: the employment status variable for England and Wales; a separate student status variable for Northern Ireland; and socio-economic classification for Scotland. The England and Wales census asked only about travel to place of work (WP702 and WP703), but in Scotland and Northern Ireland the equivalent question asked about travel to place of work or study. For full-time students also in employment, it is not possible to determine whether the response relates to place of work or study and this cannot be inferred from either the routing of the questions, or from post-enumeration processing. In order to reduce the impact of this discrepancy on COWZ-UK, full-time students were excluded from just the calculation of travel to work variables for Scotland and Northern Ireland. Full details of the differences and associated solutions are provided in ONS (2018a). 
All counts are expressed as percentages of the workplace population (all usual residents aged 16 to 74 in employment in the week before the census), except for Approximated social grade (WP613) for which the denominator is all usual residents aged 16 to 64 in employment in the week before the census.
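As a simple illustration of how counts are converted to percentages with variable-specific denominators, the following Python sketch uses hypothetical counts for a single WZ. The variable codes WP601_Emp and WP613_AB are taken from the text; the denominator names, the counts and the treatment of an empty zone are assumptions made for the example.

```python
def to_percentage(count, denominator):
    """Express count as a percentage of denominator (0.0 for an empty zone, by assumption)."""
    if denominator == 0:
        return 0.0
    return 100.0 * count / denominator

# Hypothetical counts for one Workplace Zone.
zone = {
    "workers_16_74": 480,  # workplace population (default denominator)
    "workers_16_64": 450,  # denominator for Approximated Social Grade variables
    "WP601_Emp": 408,
    "WP613_AB": 135,
}

# Variables whose denominator differs from the default workplace population.
DENOMINATORS = {"WP613_AB": "workers_16_64"}

percentages = {}
for var in ("WP601_Emp", "WP613_AB"):
    denom_key = DENOMINATORS.get(var, "workers_16_74")
    percentages[var] = to_percentage(zone[var], zone[denom_key])
```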

Despite the essentially data-driven nature of geodemographic classification, the analyst must make a number of key decisions concerning the selection and refinement of variables. For COWZ-UK, comparable methodological steps to those set out by Vickers and Rees (2007) in their development of the original 2001 OAC are followed. The statistical and geographical distributions and inter-relationships between the 63 intermediate variables are explored using a combination of simple summary statistics (mean, median, range, standard deviation), histograms, Quantile-Quantile (Q-Q) (normal probability) plots, maps at various geographical scales, and a correlation matrix.

Decisions concerning whether to normalise variables prior to cluster analysis, and which variables to include/exclude, are non-trivial (Spielman and Singleton 2015). Geographically-aggregated census data typically exhibit many zero counts and outliers for specific variables. Such skewed distributions can be problematic for cluster analysis. Variables for both the 2001 and 2011 OACs were normalised prior to clustering. Following Gale et al. (2016), the log, Box-Cox and inverse hyperbolic sine (IHS) methods of transformation are evaluated for COWZ-UK, with the Box-Cox method producing the best overall results and therefore being implemented for all variables. Pair-wise correlation between the 63 normalised variables is then assessed using Pearson’s Product Moment Correlation coefficient (see Online Resource 1). Rather than employing rigid correlation coefficient or significance thresholds, final inclusion/exclusion decisions are based on statistical and mapped analysis and assessment of each variable’s overall contribution to the classification.
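The comparison of transformations and the correlation assessment can be sketched in Python as follows. This is an illustrative sketch only: the data are invented, the Box-Cox parameter is fixed at an arbitrary value rather than estimated, and an offset of 1 is added before the log and Box-Cox transforms so that zero percentages can be handled. None of these choices is documented in the text and all should be read as assumptions.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform for x > 0; reduces to the natural log when lam == 0."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def skewness(values):
    """Moment-based sample skewness (third central moment over variance^1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def pearson(a, b):
    """Pearson product-moment correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical right-skewed percentage variable with a zero and an outlier.
raw = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 20.0, 60.0]
OFFSET = 1.0   # assumed shift so that zeros can be transformed
LAMBDA = 0.25  # assumed Box-Cox parameter (normally estimated from the data)

transformed = {
    "log": [math.log(v + OFFSET) for v in raw],
    "box_cox": [box_cox(v + OFFSET, LAMBDA) for v in raw],
    "ihs": [math.asinh(v) for v in raw],  # inverse hyperbolic sine is defined at zero
}

raw_skew = skewness(raw)
skews = {name: skewness(vals) for name, vals in transformed.items()}
```

Comparing `skews` with `raw_skew` gives one simple numerical indication of how far each transformation reduces skew, alongside the histograms and Q-Q plots described earlier.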

One such example includes two highly correlated groups of variables: Approximated Social Grade (ASG) (variables WP613_AB to WP613_DE) and National Statistics Socio-economic Classification (NS-SeC) (WP607_1 to WP607_7). Both ASG and NS-SeC are derived variables, but NS-SeC is a National Statistics product and incorporates information on supervisory relationships and size of employer. The NS-SeC variables are therefore retained (with the exception of WP607_4 (Small employers and own account workers) which is very strongly correlated with other variables such as Self-employed with employees, Self-employed without employees, No fixed place of work and Work from home), while the ASG variables are excluded.

A further example includes composite and component variables: Employed (WP601_Emp) and Self-employed (WP601_SEmp) are strongly negatively correlated. Employed exhibits distinct spatial patterning, particularly concentrated in urban areas and positively correlated with intermediate occupations and travelling greater than 20 km. The two component variables of Self-employed, Self-employed with employees (WP601_SEmp_w_E) and Self-employed without employees (WP601_SEmp_wo_E), demonstrate important relationships with other variables: the former positively correlated with longer and shorter working hours, Work from home, No fixed place of work, Construction and Education; the latter with industries such as Arts and entertainment. Employed, Self-employed with employees and Self-employed without employees are therefore retained whereas the composite Self-employed variable is excluded. Following these detailed evaluations, the 63 intermediate variables are refined to a final input set of 48, shown in Table 1 (grouped by domain).

Various researchers note the importance of standardising variables prior to cluster analysis to avoid those with a much greater range in their values carrying a disproportionate weight in the classification. As per Vickers and Rees (2007), Spielman and Singleton (2015) and Gale et al. (2016), the final 48 variables for COWZ-UK are range standardised to produce values in the range 0 to 1 prior to clustering.
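Range standardisation rescales each variable as (x − min) / (max − min). A minimal Python sketch follows; the handling of a constant variable, which would otherwise cause division by zero, is an assumption made for the example.

```python
def range_standardise(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant variable: no spread to rescale (assumed behaviour for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical transformed values of one variable across five zones.
example = range_standardise([2.0, 4.0, 6.0, 3.0, 5.0])
```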

Cluster Analysis

There is a range of algorithms for creating geodemographic classifications, including top-down and bottom-up approaches (Spielman and Singleton 2015). To maintain consistency with previous open residence-based classifications, COWZ-UK employs a top-down k-means clustering method, with repeated application to create a nested hierarchy: the top level of the hierarchy is created first and each cluster is then further sub-divided to create the next level down. This is implemented in R using the kmeans function and the default Hartigan-Wong algorithm, with 10,000 random starts. The squared Euclidean distance is used to evaluate the degree of similarity within and between clusters.
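The top-down nested procedure can be sketched in pure Python as below. This is not the authors' R implementation: it uses a simple Lloyd-style iteration rather than the Hartigan-Wong algorithm, 20 random starts rather than 10,000, and invented two-dimensional data. The function names and parameters are hypothetical. It does, however, follow the same logic of keeping the best of several random starts by within-cluster sum of squared Euclidean distances, then re-clustering the members of each top-level cluster.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """One Lloyd-style k-means run from randomly sampled initial centres."""
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points]
        if new_labels == labels:
            break  # assignments unchanged: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an emptied cluster keeps its previous centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    wss = sum(sq_dist(p, centres[l]) for p, l in zip(points, labels))
    return labels, centres, wss

def kmeans(points, k, n_starts=20, seed=0):
    """Best of n_starts runs, judged by within-cluster sum of squares."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_starts)), key=lambda r: r[2])

def nested_kmeans(points, k_top, k_sub, n_starts=20, seed=0):
    """Cluster into k_top clusters, then split each one into (up to) k_sub sub-clusters."""
    top_labels, _, _ = kmeans(points, k_top, n_starts, seed)
    pairs = [None] * len(points)
    for c in range(k_top):
        idx = [i for i, l in enumerate(top_labels) if l == c]
        subset = [points[i] for i in idx]
        sub_labels, _, _ = kmeans(subset, min(k_sub, len(subset)), n_starts, seed)
        for i, s in zip(idx, sub_labels):
            pairs[i] = (c, s)  # (top-level label, sub-level label)
    return pairs

# Hypothetical standardised data: two well-separated blobs, each with two sub-blobs.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (11.0, 11.0), (11.1, 11.0), (11.0, 11.1),
]
pairs = nested_kmeans(points, k_top=2, k_sub=2)
```

Each element of `pairs` plays the role of a (Supergroup, Group) assignment for one zone. In the published classification the number of Groups varies between Supergroups, whereas this sketch uses a single `k_sub` for brevity.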

For the top level of the COWZ-UK hierarchy (termed Supergroups in line with the 2001 and 2011 OACs), solutions containing two to twelve clusters are evaluated. Each of these solutions is then further subdivided into two to six clusters (termed Groups). Evaluation of potential cluster analysis solutions typically involves application of both objective and subjective criteria. Here, solutions are evaluated using: prior expectations based on the literature and the authors’ knowledge of specific locations; statistical and graphical diagnostics (such as compactness of cluster solutions, homogeneity of cluster sizes, and stability of clusters); and mapping using geographical information systems (GIS) and visual imagery. Analysis takes place at various geographical scales (local, regional and national) to ensure a thorough understanding of the spatial and statistical patterns.

Scree plots showing the mean distance to the cluster centre relative to the number of clusters prove generally unhelpful here in determining the most suitable number of clusters, as the gradients of the curves are relatively smooth throughout. Similarly, while it can be useful to aim for homogeneity of cluster size throughout the hierarchy (expressed as the range in the number of members (WZs) per cluster, relative to the number of clusters), some clusters will inevitably have slightly higher/lower membership due to their distinctiveness and it is often desirable to retain this specificity. The most insightful diagnostics include: clustergrams (see for example Fig. 1), radial plots (Fig. 2), and sense-checking using tools such as Google Maps and Street View to corroborate the authors’ expectations of the classification of specific areas.
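The two numerical diagnostics mentioned above, the mean distance to cluster centre that is plotted in a scree plot and the range in cluster membership, can be computed from any candidate solution along the following lines. The points, labels and centres are hypothetical, and the function names are introduced only for this sketch.

```python
import math
from collections import Counter

def mean_distance_to_centre(points, labels, centres):
    """Mean Euclidean distance from each point to the centre of its assigned cluster."""
    total = sum(math.dist(p, centres[l]) for p, l in zip(points, labels))
    return total / len(points)

def cluster_size_range(labels):
    """Difference between the largest and smallest cluster membership counts."""
    sizes = Counter(labels).values()
    return max(sizes) - min(sizes)

# Hypothetical two-cluster solution for five zones described by two variables.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (10.0, 4.0)]
labels = [0, 0, 1, 1, 1]
centres = {0: (0.0, 1.0), 1: (10.0, 2.0)}

compactness = mean_distance_to_centre(points, labels, centres)
size_range = cluster_size_range(labels)
```

Repeating the calculation for each candidate number of clusters yields the values that would be plotted against the number of clusters in a scree plot.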

Fig. 1

Clustergram showing potential clusters in one- to twelve-cluster solutions for the Supergroup level of the COWZ-UK hierarchy. Rows represent solutions. Circles represent clusters. Area of circle proportional to number of WZs in the cluster. Thickness of arrows proportional to percentage of WZs contributed from relevant cluster in solution above. Labels for seven-cluster solution indicate named Supergroups in final COWZ-UK classification: A Retail; B City and business parks; C Metro suburbs; D Suburban services; E Manufacturing and distribution; F Rural; G Servants of society

Fig. 2

Illustrative radial plot: Supergroup E - manufacturing and distribution. Values of cluster centres for final 48 variables plotted on radial axes and joined to form (blue) irregular polygon. UK mean values shown as (red) circle. Points falling outside the circle indicate higher than average values for that variable and vice versa. Codes for variables labelled as per Table 1, but reordered as per final ONS published products

Figure 1 is a bespoke clustergram (Schonlau 2004) of the potential Supergroup solutions, which permits evaluation of the stability of potential clusters and solutions. Each row represents a solution, with each circle representing a cluster within that solution i.e. row 1 shows all 60,709 WZs in one cluster, row 2 shows the two-cluster solution, row 3 the three-cluster solution, and so on. For each cluster, the area of the circle is proportional to the number of members (WZs) in that cluster. The thickness of the arrows between clusters shows the percentage of WZs in the cluster in row n contributing to the relevant cluster in row n + 1. For example, of the 60,709 WZs in the one-cluster solution, 54.3% contributed to the left-hand cluster in the two-cluster solution, and 45.7% to the right-hand cluster. To aid clarity, only contributions greater than 5% are drawn.
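The arrow widths described above can be derived from the cluster labels of two successive solutions. The Python sketch below is a minimal illustration, assuming that each solution is held as a list of labels in the same WZ order. The function name clustergram_flows and the parameter min_pct are hypothetical.

```python
from collections import Counter

def clustergram_flows(labels_n, labels_next, min_pct=5.0):
    """Percentage of each row-n cluster's members passed to each row-(n+1) cluster."""
    pair_counts = Counter(zip(labels_n, labels_next))
    parent_sizes = Counter(labels_n)
    flows = {}
    for (parent, child), count in pair_counts.items():
        pct = 100.0 * count / parent_sizes[parent]
        if pct > min_pct:  # only contributions greater than the threshold are drawn
            flows[(parent, child)] = pct
    return flows
```

The percentages are calculated relative to the size of the contributing cluster in row n, so the flows leaving any one cluster sum to 100% before the threshold is applied.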

Using Fig. 1, it is possible to identify the levels at which clusters emerge and remain stable. For example, the two-cluster solution effectively splits the WZs into “Rural” (right) and “Urban” (left) clusters. The “Rural” cluster remains relatively distinct throughout the two- to seven-cluster solutions, whereas the “Urban” cluster splits off into various combinations of groupings, which gradually coalesce into stable solutions at around the six- to eight-cluster solutions. By evaluating clustergrams for both the Supergroup and Group levels of the hierarchy, it is also possible to identify clusters from the Supergroup solutions which become clusters in their own right at the Group level of the hierarchy. For example, in the eight-cluster solution in Fig. 1, the “Rural” cluster splits off into two distinct new clusters: in the final COWZ-UK classification, these both become Groups. Fundamentally, beyond the seven-cluster solution, any sub-divisions of the Supergroups become Groups in the next level of the hierarchy. The seven-cluster solution is therefore selected for COWZ-UK, and Fig. 1 shows the clusters at this level labelled according to the codes used in the published classification (and as per Table 2).

Table 2 COWZ-UK Supergroups and groups: Codes, names and descriptive statistics

Figure 2 is an illustrative radial plot. Each radial axis of the plot relates to one of the 48 variables used in COWZ-UK, labelled according to the codes in Table 1 but ordered as per the final ONS published products. The value of the cluster centre for each variable is plotted on these axes and these points are then joined to form an irregular polygon (blue) which reveals the distinctive characteristics of the cluster, relative to the UK mean values for the variables (indicated by the (red) circle). If the point lies outside the circle then that variable has a higher than average value, and vice versa. Thus, Fig. 2 (for Supergroup E: Manufacturing and distribution, taken from the final COWZ-UK classification) shows that this cluster is characterised by higher than average percentages in Manufacturing (WP605_C), Transport and storage (WP605_H), Employee (WP601_Emp), Lower supervisory and Routine occupations (WP607_5 and WP607_7) and Travelling over 20 km to work (WP702_GE20), together with lower than average percentages of Females (WP1101_F_16_74), High qualifications (WP501_GE_L4) and Work from home (WP702_WkFromHome).
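The geometry of such a plot can be sketched as follows. This Python example is illustrative and rests on several assumptions that are not taken from the paper: that cluster centres and UK means are expressed on a common standardised scale, that the reference circle is drawn at an arbitrary base_radius, and that axes are equally spaced in the order supplied. The function name radial_polygon is hypothetical.

```python
import math

def radial_polygon(centre, uk_mean, base_radius=1.0):
    """Polygon vertices for a radial plot, plus variables above/below the UK mean.

    centre and uk_mean map variable code -> value on the same (assumed) scale.
    """
    codes = list(centre)
    vertices, higher, lower = [], [], []
    for i, code in enumerate(codes):
        diff = centre[code] - uk_mean[code]
        radius = base_radius + diff  # the reference circle sits at base_radius
        angle = 2 * math.pi * i / len(codes)  # equally spaced radial axes
        vertices.append((radius * math.cos(angle), radius * math.sin(angle)))
        if diff > 0:
            higher.append(code)  # point falls outside the circle
        elif diff < 0:
            lower.append(code)   # point falls inside the circle
    return vertices, higher, lower
```

Joining the returned vertices in order gives the irregular polygon, and the two lists correspond to the variables that would be read from the plot as higher or lower than average.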

Once the number of levels in the hierarchy and clusters per level has been determined, a pen portrait for each cluster is produced and the clusters are named. This naming process is challenging and often contentious (Vickers and Rees 2007). Names are proposed by the authors and then jointly reviewed by themselves and ONS, using experience gained from the production and naming of other official area classifications. The names are intended to be: accurate in terms of the input variables; neutral rather than complimentary or derogatory; consistent throughout the hierarchy; consistent with, but not duplicative of, their use in other classifications; and independent of placenames to avoid undue specificity.

Results: A Classification of Workplace Zones for the UK

COWZ-UK is a two-tiered geodemographic classification of WZs, comprising 7 Supergroups and 29 Groups. Table 2 shows the hierarchy of Supergroups and Groups, together with the codes and names assigned to each cluster and various descriptive statistics. The full open datasets, including cluster memberships, squared Euclidean distance from centroids and pen portraits for the Supergroups and Groups are available from https://www.ons.gov.uk/methodology/geography/geographicalproducts/areaclassifications/2011workplacebasedareaclassification. Interactive maps facilitating exploration of COWZ-UK are provided by Consumer Data Research Centre (CDRC) Maps: https://maps.cdrc.ac.uk/#/geodemographics/cowzuk11/default/BTTTFFT/10/-0.1500/51.5200/.

As expected, Table 2 shows that the rural Groups cover the greatest land areas, with Rural with Mining and Quarrying at 42.55% and Traditional Countryside at 28.39%, although in combination they account for only 12.8% of WZs and 8.6% of workers. The Suburban Services Supergroup contains the two Groups with the largest percentage of workers, at 14.79% in combination, in just 3.62% of the land area. The highest spatial concentration of workers is achieved in the Global Business Group, with 3.77% of workers in just 0.01% of land area (a density of 599 workers per hectare).

Figure 3 maps the national distribution of the Supergroups. As highlighted by Table 2, it is clear that the land area is dominated by the Rural Supergroup, with all others being concentrated in urban areas. In Fig. 4, the spatial distributions of Groups for three coastal cities in different parts of the UK are illustrated, masked with a buildings layer to aid interpretation. Belfast and Aberdeen are characterised by extensive industrial and docks-related areas categorised as Groups of the Manufacturing and Distribution Supergroup. Both have city centres reflecting a mix of Groups primarily within the Retail, City and Business Parks and Servants of Society Supergroups, but in Belfast this is much more closely ringed by the Suburban Services Groups, while Aberdeen demonstrates a more complex patchwork of Groups across the whole urban area. The Brighton map contrasts greatly, having only a small area of port-related activity but a centre of intense Retail and City and Business Park Groups, surrounded by a wide band of Metro Suburbs, which tend to have higher status occupations than those of the extensive Suburban Services seen in Belfast. The complex internal structure of each city is reflected in the classification, while the substantial differences between them are also captured, reflecting their unique industrial, social and settlement histories.

Fig. 3

COWZ-UK Supergroups for the UK

Fig. 4

COWZ-UK Groups for three coastal cities in the UK: (a) Belfast, (b) Aberdeen, (c) Brighton, masked by a buildings layer to aid interpretation

Validation of the New Classification

Spielman and Singleton (2015) and Openshaw et al. (1980) note that the accuracy of classifications cannot really be measured; rather, they can only be evaluated relative to the purpose for which they were designed. Vickers and Rees (2011) describe ‘ground-truthing’ a geodemographic classification by recruiting a panel of 61 peer reviewers able to comment on results in areas with which they are familiar. As a preliminary validation step, the COWZ-UK categories of approximately 100 specific places (such as large shopping centres, industrial areas, educational and health sites, suburbs, high streets, different types of rural areas) are sense-checked against the authors’ prior expectations using maps and visual imagery. This proves generally reassuring in terms of anticipated classifications but is clearly limited in scope, coverage and objectivity.

Leventhal (2016, p55) suggests that, as well as mapping and ‘sense checking’ the locations of the clusters, validation of geodemographic classifications should include comparison against other datasets. There are many datasets which could potentially be intersected with COWZ-UK, but none which offers a definitive way to assess ‘accuracy’ per se, given that there is no ‘true’ classification against which to compare the results. One approach is to compare the COWZ-UK outputs with independent location data for identifiable workplace types such as hospitals, business parks, farms etc. While COWZ-UK is a classification of small areas rather than individual workplaces, large workplaces or groups of similar workplaces will strongly influence the classification: comparing patterns of association between the location of such features and the COWZ-UK categories therefore provides an independent source of verification. Open source datasets such as OpenStreetMap cover the entire UK, but lack consistency of feature coverage and definitions. Ordnance Survey GB publishes a Points of Interest® (POI) database (Ordnance Survey 2014) containing 4,258,081 geocoded features (March 2018 release), using a detailed 3-level classification. This database permits identification of relevant features for comparison with the COWZ-UK classification. Northern Ireland is covered by a separate national mapping agency with no directly comparable dataset available: the following analysis is therefore necessarily restricted to England, Wales and Scotland. POI features are date-stamped with the date of data supply but it is not possible to extract the feature set current at a historical date (such as 2011 for the census), so any comparison is inevitably impacted by changes in POI features over time.

The POI database is general purpose and therefore contains many natural and built physical features, such as ponds and wind turbines, which are not associated with a workplace type and will rarely be places of usual employment: these are excluded from the analysis. Similarly, many POI features which probably do represent places of work but which are likely too small (in terms of number of workers) to influence the classification of a WZ are not included. For example, the high level Education and Health code 05 includes a range of facilities ranging from code 05280368 ‘Dental surgeries’ to 05280370 ‘Hospitals’: while an individual hospital’s workforce will often be large enough to influence the classification of a WZ, that of a dental surgery will not.

Table 3 identifies 17 POI feature codes, or combinations of codes, likely to be associated with distinctive, large, workplaces. These account for 1,054,697 POIs, 24.9% of the total. These point features are extracted and intersected with WZ boundaries to identify the COWZ-UK Group to which they are allocated. Less than 0.01% of POIs cannot be matched, mostly because they are offshore and fall outside the WZ geography (e.g. POI code 07390512 ‘Fish and shellfish’, within 0739 ‘Farming’, comprises mostly offshore mussel beds). This results in 1,053,645 POI records with a matched COWZ-UK Group code (shown broken down by POI code group in the final column of Table 3).
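The intersection of point features with zone boundaries can be illustrated with a simple ray-casting test. This Python sketch is not the procedure used in the paper, which would normally be carried out in GIS software. It assumes simple polygons without holes, coordinates in a common projected system, and hypothetical names (point_in_polygon, assign_groups) and data structures.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices (no holes)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:  # crossing lies to the right of the point
                inside = not inside
    return inside

def assign_groups(pois, zones):
    """Map each POI id to the Group of the first zone containing it, else None.

    pois: dict of id -> (x, y); zones: list of (group_code, polygon) pairs.
    """
    matched = {}
    for poi_id, (x, y) in pois.items():
        matched[poi_id] = next(
            (group for group, poly in zones if point_in_polygon(x, y, poly)),
            None)  # None marks a POI outside the zone geography, e.g. offshore
    return matched
```

Features returned with None correspond to the unmatched POIs described above, which fall outside the WZ geography.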

Table 3 Ordnance Survey GB points of interest (POI) code groups selected for COWZ-UK validation

The 17 POI code groups are then cross-tabulated with the 29 COWZ-UK Groups. In order to account for the very different number of features in each comparison group, expected values, calculated as (row total × column total) / grand total, are derived for each intersection, assuming a totally uniform distribution of POI code groups across the COWZ-UK Groups (i.e. no association between the two). The number of features observed in each intersection is then compared to the expected value, with differences expressed as the ratio (observed − expected) / (observed + expected), which is bounded between −1 and +1, thus allowing groups with much greater (or lower) than expected prevalence to be identified.
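The calculation can be expressed compactly in code. The following Python sketch assumes the cross-tabulation is held as a list of rows (POI code groups) by columns (COWZ-UK Groups). The function name association_ratios is hypothetical, and the treatment of cells where both observed and expected values are zero is an assumption.

```python
def association_ratios(observed):
    """(observed - expected) / (observed + expected) for each cell of a cross-tabulation."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    ratios = []
    for r, row in enumerate(observed):
        out = []
        for c, obs in enumerate(row):
            # expected count if row and column categories were unrelated
            exp = row_totals[r] * col_totals[c] / grand_total
            # assumption: report 0.0 where both observed and expected are zero
            out.append((obs - exp) / (obs + exp) if obs + exp > 0 else 0.0)
        ratios.append(out)
    return ratios
```

A cell with no observed features but a positive expected value takes the value −1, a cell matching its expected value takes 0, and values approach +1 as the observed count increasingly exceeds the expected count.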

The results are summarised as a heatmap in Fig. 5, which provides a clearer visualisation of the relationships than a simple numerical table. Higher than expected values are shown in red; lower than expected in blue. The POI code groups are re-ordered from Table 3 to aid interpretation. An exact relationship between POI code groups and COWZ-UK Groups is not to be expected, but there are clear structural relationships, with a broad diagonal of strongest association through retail and business POI code groups in the relevant COWZ-UK Groups, through industrial and rural activities to education, health and public administration. The extractive industry and farming POI code groups are strongly present in the Rural COWZ-UK Groups and strongly absent from all other types. There are also isolated hotspots such as the strong presence of government and hospitals in the COWZ-UK Administrative centres Group. Clothing and accessories retail is under-represented in almost all COWZ-UK Groups except those associated with retail and business (and highest in ‘Shop until you drop’). As might be expected, suburban areas offer some of the least clear patterns, but show some overrepresentation of food and multi-item retail, sport, entertainment and education, reflecting local community services and businesses. Taken as a whole, this analysis provides further confidence in the COWZ-UK classification.

Fig. 5

Heatmap of points of interest (POI) code groups cross-tabulated against COWZ-UK groups

Discussion and Conclusions

This paper has introduced the novel concept of a workplace-based geodemographic classification and demonstrated its implementation via COWZ-UK, a national small area level classification of WZs for the UK. While validation of such classifications is difficult, the analysis presented here suggests that COWZ-UK performs as per a priori expectations and that users can have confidence in its outputs. The concept of a workplace-based geodemographic classification is generic and could readily be implemented by census or statistical agencies in other countries where the characteristics of workers and their workplaces are recorded at the individual level. Refinements to the domains and variables are likely to be required, depending on context- and country-specific phenomena. The classification could be produced for any geographical units to which the individual level data can be aggregated. A link between place of work and place of residence also permits the inclusion of variables related to travel to work.

COWZ-UK is an openly available data product, consistent with other existing census-based classifications. As described in the Data section, not all of the source data employed in the creation of COWZ-UK are publicly available, primarily due to its retrospective creation following the processing and release of 2011 Census data in the UK. While the methods are transparent, users are therefore unable to fully replicate its production.

COWZ-UK shares some limitations with all census-derived datasets. Despite the overall high quality of census data, questions relating to topics such as place of work and industry of employer are among the least accurately completed (ONS 2014b). Inaccuracies are also introduced both by the assignment of workers with no fixed place of work to their residential OA, and incorrect allocations of some workers to business headquarters rather than actual places of work. A specific direct consequence of the separate production of 2011 workplace census statistics by ONS, NRS and NISRA is that cross-border flows are entirely missing from the source datasets and therefore from COWZ-UK. Only people living and working within a specific country are included – people living in England and working in Scotland, for example, are not included. Nevertheless, the analysis presented here suggests that COWZ-UK provides an original and meaningful characterisation of workers and workplaces at the small area level.

COWZ-UK is a classification of both workers and workplaces. 31 out of the 48 input variables relate to workers, with variables which are more geographically concentrated notably driving the classification. For example, ethnic group/country of birth and the presence of public transport systems result in clear differentiation between metropolitan and non-metropolitan areas. London is so distinct in a range of variables compared to the rest of the country that, relatively, it lacks internal differentiation. Singleton and Longley’s (2019) recent London-specific classification emphasises this local variation compared to the national COWZ-UK version.

While 17 of the input variables relate to workplaces, 15 of these are a typology of industry. The influence of industry type on COWZ-UK is therefore quite profound. A broader range of workplace-related variables would be useful but, as noted previously, the UK census is not a register of businesses and only indirectly records information about employers and businesses via workers’ reporting of them. Industry type also arguably exerts a strong influence on the validation of COWZ-UK presented in this paper. Much subjective a priori expectation of the classification of small areas is based on the type of industry present, probably because it is the most easily observed visually, compared to the less obvious characteristics of the workforce in an area. Similarly, the POI feature-type categorisations are largely based on industry type rather than any workforce characteristics. Independent observation, ground-truthing and measurement of the characteristics of the workforce are therefore more challenging in this respect.

Current UK statistical agency plans are to update OAs and WZs for the 2021 Censuses, thus maintaining stable geographies (HM Government 2018). This presents a real opportunity to harmonise definitions and processing across the three UK statistical agencies, and to address some of the limitations of the current WZ and COWZ outputs such as definitional and geocoding differences and the treatment of cross-border flows. Unlike in 2011, WZs and associated data should be able to be released with the rest of the census outputs, increasing user awareness and uptake. It is reasonable to expect that new versions of COWZ will be produced following the 2021 Censuses, thus providing users with a valuable time series of workplace-based classifications upon which to base analyses. WZs have also recently been produced by the Central Statistics Office for Ireland (CSO 2018); it would therefore now also be possible to extend the COWZ suite of products to Ireland.

As with any census-derived data product, the accuracy of COWZ-UK will degrade over time. The rate of change in the distribution and characteristics of workers and workplaces is likely to be more rapid than that of residents and households. It is highly likely that such changes will be geographically concentrated, with certain COWZ-UK categories being affected more than others, due to their varying relative concentrations in different sectors of the economy. The maintenance required for WZs may be higher than that predicted for OAs and will therefore need to be factored into Census 2021 plans.

Looking further ahead, beyond the 2021 Census, it can be anticipated that workplace, business and employment data will form an important aspect of the broader transformation of traditional census products to make much greater use of administrative and survey data (ONS 2018b), which in turn has the potential to address some of the challenges identified here and offer more frequent updating of derived classifications such as COWZ-UK.