Introduction

Geodemographics is widely defined as the analysis of people based on where they live. It is concerned with segmenting the population into homogeneous groups based on a range of characteristics to enable the profiling of neighbourhood areas for commercial and public service planning applications (Longley 2017). In the UK, the concept can be dated back to the 1970s and evolved from coarse scale census-based classifications to systems nowadays that make use of a myriad of individual and lifestyle data (Harris et al. 2005). However, the notion of area classification is much older and has been around since the late nineteenth century with the work of Charles Booth (1889).

Commercial systems in the UK typically operate at postcode level (Petersen et al. 2011) with an average population of forty residents across each of the 1.75 million postcodes in the country (ONS 2014). Other countries use differing levels of granularity but many remain restricted to areal units. However, the use of areal units creates spatial aggregation problems. Geodemographic classifications assume that each class is largely homogeneous based on the premise that "birds of a feather flock together" (Harris et al. 2005, p.16); hence people with similar traits tend to gravitate to similar locations. Whilst application-specific geodemographic systems such as those in health (Abbas et al. 2009), crime (Ashby and Longley 2005) and education (Singleton and Longley 2009) use carefully selected inputs, many of the commercial general-purpose systems include huge arrays of inputs designed to describe a range of lifestyle, behavioural and socioeconomic conditions. As more variables are added, both census/socioeconomic (e.g. age and housing type) and lifestyle/behaviour (e.g. expenditure and leisure activities), the potential for ambiguity increases, as demonstrated in Fig. 1.

Fig. 1 Problems in aggregate level classification caused by increasing variables and people traits

Two hypothetical areas exist: Area A and Area B (Fig. 1). If Area A were statistically assessed independently and clustered into a crisp ‘best fit’ grouping based only on the shading of the individuals, this area would be assigned to a ‘red shaded’ cluster. Even using only this one variable, the overall ambiguity is apparent given that not all members of this area fit this typology. There are in fact four people-types resident in this area. When a second variable is introduced, that of person height, the collective ambiguity increases yet further as evidenced in Area B. If this area were to be assigned to a single ‘best fit’ cluster, it may fall into a ‘tall blue’ grouping (or similar). In this instance, the uncertainty is greater and the ability to classify with minimum divergence becomes more difficult. Furthermore, important variations in the data are hidden and in some applications it is these exceptions that are of greatest interest. This demonstrates the need to keep variable numbers to a minimum when adopting a cluster-led approach to data analysis, which is also noted by Openshaw and Wymer (1995). However, limiting the number of variables used in a classification may be detrimental to a rich and holistic depiction of neighbourhood conditions, so the two considerations need to be balanced. Further details on clustering and in particular the commonly used k-means approach can be found in Vickers and Rees (2006) and Burns (2017).

There has been much written on the projected path of geodemographics, including the work of Adnan et al. (2010), which explores more innovative geodemographic visualisation techniques and real-time segmentation systems. More recently, research by Singleton and Spielman (2014) discusses how ‘open data’ and alternative data sources to the census will underpin such systems in the future. This is particularly important in the UK, where a traditional census will be undertaken for the final time in 2021 and followed by an analysis of existing and administrative data (Cadman 2014). However, there is still limited coverage in the literature on the development of geodemographic methodologies based around the individual. The ability to classify individuals based solely on personal characteristics may result in more homogeneous clusters than area-based geodemographic systems. There has been work on this in the commercial sector, which has led to various household classifications; Acxiom’s PersonicX and Experian’s Mosaic system are two such examples (Acxiom 2016; Experian 2015), but the details of how these proprietary systems are built are not published. In the UK, the smallest level of geography at which data are released is the output area with an average population of 297 persons in 2001 (ONS 2012). When classified, these data are subject to the effects of ecological fallacy and generalisation. Despite Farr and Webber’s (2001) work, which describes the benefits to be gained from moving from areal unit classification to systems capable of working at the level of the individual as being “intuitively obvious” (p.58), no work in the academic sector has previously been undertaken to test this. There is appreciation, however, of the potential loss of neighbourhood-level effects with such an approach; examples include voting behaviour or newspaper readership, something area-based geodemographics are able to capture.

This paper aims to address this particular gap in the geodemographic literature by first demonstrating the need for an individual-based classification and then providing a framework for the development of such a system that uses only census data. The framework is then applied to the creation of an individual-based classification for the city of Leeds using data from the 2001 census and further validated using individual and household survey data from the British Household Panel Survey.

Demonstrating the Need for an Individual-based Classification

To demonstrate why individual-based geodemographic classifications could be useful, an examination was undertaken on the freely available 2001 Output Area Classification (OAC) (Vickers and Rees 2007). This involved contrasting the OAC segmentation results with 2001 Census data for Leeds, a northern UK city. A summation of the total number of people who possess each characteristic per output area is contrasted with the OAC. The 2001 OAC comprises seven key groups (tier 1), known as ‘Supergroups’. One of these is named “Multicultural” (Supergroup #7). This cluster is disaggregated into two smaller standard groups (tier 2): “Asian Communities” (Group 7a) and “Afro-Caribbean Communities” (Group 7b), which are further split into three and two subgroups, respectively. However, given that these final-tier groups possess no explicit label, this assessment will only consider tier 2 of the OAC.

Given the names of these groups, i.e. “Asian Communities” or “Afro-Caribbean Communities”, one may expect any output area selected from the 2438 that comprise the district of Leeds to contain a relatively high concentration of these ethnic groups. Furthermore, one may expect any “Asian Communities” area classified within the “Multicultural” supergroup to contain a higher percentage of persons of Asian ethnicity than those of Afro-Caribbean, and vice versa. In Leeds there are 273 output areas classified as “Multicultural”. Of these, 217 are in the “Asian Communities” subgroup where 30 (13.82%) actually contain higher concentrations of Afro-Caribbean residents. Similarly, of the 55 areas categorised within the “Afro-Caribbean Communities” subgroup, 22 (40%) include a higher percentage of Asian inhabitants.

The patterns observed above are not dissimilar to those seen when analysing other area-based geodemographic systems. For example, in the early ‘SuperProfiles Lifestyle’ classification, Birkin (1995) points to clusters labelled “Young Married Suburbia” and “Metro Singles” and emphasises how these names are not always representative of cluster composition. The former cluster accounts for over one quarter of the population whose age is 45 plus. Meanwhile, the latter cluster encompasses only 21% of single workers, which is unrepresentative given that the cluster label contains the words “metro” and “single”. Although the above focused on the explicit naming conventions assigned to clusters, the complete 2001 OAC pen portraits do not diverge far from this shorthand descriptor.

Both of these examples show that clusters are not as homogeneous as might be expected. From a commercial point of view, this means that the wrong type of consumer may be targeted at times, or more importantly, this may result in the misallocation of resources when used by public sector bodies for decision making.

Individual-level classifications should go some way to reducing the impact of such generalisation and hence may result in clusters with greater levels of homogeneity which are easier to label. The importance of geography in a geodemographic system should not be lost, however, as such systems are more than simple sociological analyses of people. In many ways, geography remains a useful means of presenting the output and often drives individual-level characteristics. In the next section, a framework for the development of an individual-based classification is provided, which is applied to the city of Leeds using data from the 2001 census.

A Framework for the Development of an Individual-based Classification

The framework presented in the subsequent sections follows the conventional phases of geodemographic system development, such as those proposed by Gibson and See (2006), but it has been adapted to reflect the handling, processing and presentation of individual-level data (see Fig. 2). Note that the framework was developed using data from the 2001 census but any census data can be used provided there are accompanying microdata as outlined below.

Fig. 2 Complete individual-level geodemographic system framework (following process detailed by Gibson and See 2006)

Defining the Purpose

The overall purpose is to produce a generic, individual-level geodemographic classification that can be applied across a broad set of applications. For this reason a selection of variables from the 2001 UK census is used. To test the functionality of the classification, we have developed it for two contrasting areas: Leeds and Richmondshire, both in the Yorkshire region, UK. However, for this paper, we present only the results for the city of Leeds. Leeds was selected as it is close to the national average for a range of socio-economic variables (CASWEB 2001; Leeds City Council 2002).

Selecting the Input Data

All of the variables used to create the classification were obtained from the freely available Small Area Microdata (SAM) file in the UK census. SAM is an individual-level sample of anonymised records, which were extracted from the 2001 Census (ONS 2008). The SAM is similar to the SAR (Sample of Anonymised Records) with regard to variable inclusion; however, broader banding/categorising is adopted to preserve individual confidentiality given the personal nature of this multivariate dataset. The range of variables is also more restricted when compared to the national census dataset, thus creating a trade-off between rich individual-level data and a diverse pool of variables to select from. Furthermore, the SAM provides a finer level of geography, i.e. the local authority (government) level, as opposed to the then larger government office regions (GORs) in the standard SAR, the latter discontinued in 2011. Local authorities are smaller, with population sizes ranging from ~1 million to 27,000 (Local Government Boundary Commission 2014). The SAM sample accounts for 5% of the population and contains circa 2.9 million records from people in the UK and ~35,000 for the Leeds metropolitan district (SAM code 67), which had a 2001 population of 715,402. A broad range of census topics is covered, including employment, personal demographics and residential arrangements (ONS 2008).

The data held in this file vary by type, with each variable assigned to one of three categories: (1) individual, (2) household, or (3) family. The file contains a total of seventy-four census variables, one unique identifier, plus thirteen ONS / Department for Environment, Food and Rural Affairs (DEFRA) variables and additional imputed variables. This research considers only the census variables. Examples of such variables include: number of cars / vans owned or available for use (household category), presence / number of dependent children in family (family category) and fundamental demographic variables such as age / sex / ethnicity / socio-economic classification of respondents (individual category). Table 1 contains the range of variables selected for inclusion in the classification together with the associated theme (individual, household or family) and data types.

Table 1 Selection of variables for SAM individual-level classification

Transforming and Re-scaling Variables

Classification, or clustering, is the process of partitioning objects into clusters (or groups) such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity (Kaufman and Rousseeuw 1990). To be able to do this effectively, the data first need to be pre-processed to (i) convert the dichotomous and categorical-nominal variables to continuous data, (ii) undertake any variable recoding and (iii) ensure polarity (data direction).

The variables in the SAM are of three kinds: dichotomous, categorical – nominal and categorical – ordinal, and do not lend themselves to typical clustering algorithms, which are generally used for segmenting continuous and/or ordinal variables only. The clustering algorithm used here is K-Means, which is an iterative relocation algorithm based upon an error sum of squares measure (Jain and Dubes 1988), and also requires these data types. Thus, the first data pre-processing step was to convert the dichotomous and categorical-nominal variables to continuous data. Several adaptations to the algorithm have been proposed for handling mixed variable types (Ahmed and Day 2007; San et al. 2004) but a different approach to aligning all variables was adopted here. The conversion was undertaken using variables that reflect income and/or wealth. With no such variables present in the census (excluding proxy measures), the British Household Panel Survey (BHPS) was selected (now part of Understanding Society, see: Understanding Society 2015) instead. The ‘Monthly Gross Income’ variable [in British Pound Sterling] (BHPS variable reference: RPAYG) was the most complete with regards to individual responses and therefore selected.

The equivalent SAM variable in the BHPS was extracted and the average gross monthly income for persons falling into each variable sub-category (e.g. marital status: single, widowed, divorced etc) was calculated. For this transformation to work, each of the variables selected for use in the classification had to be present in the BHPS. Table 2 shows the original SAM data and Table 3 shows the same data transformed into a monetary continuous scale for a small subset of the Leeds records. As examples, Central Heating: 1 = yes, 2 = no. Sex: 1 = male, 2 = female.

Table 2 Original SAM data prior to conversion to gross monthly income
Table 3 Newly created gross monthly income values for SAM categories based on BHPS, in British Pound Sterling

One should also note that the gross income figures listed above are averaged across the entire population (including those out of work, e.g. under 16’s, retired and unemployed) and are therefore lower than official average salary estimates noted elsewhere, such as in the Annual Survey of Hours and Earning (ONS 2013). Without incorporating the entire population, the process would be misleading. However, for the purpose of effective clustering, it is the magnitude and level of difference between variables and variable sub-categories which is of most importance – more so than producing accurate salary estimates.
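The conversion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used to build Tables 2 and 3: the rows, income figures and the key name `income` are hypothetical, and only the coding of Sex (1 = male, 2 = female) is taken from the text. People without earnings are kept with an income of zero so that the averages cover the entire population.

```python
from collections import defaultdict


def mean_income_by_category(survey_rows, variable, income_key="income"):
    """Average income of all survey respondents in each category of `variable`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in survey_rows:
        totals[row[variable]] += row[income_key]
        counts[row[variable]] += 1
    return {category: totals[category] / counts[category] for category in totals}


def to_monetary(records, variable, lookup):
    """Replace each record's category code with the matching mean income."""
    return [lookup[record[variable]] for record in records]


# Hypothetical BHPS-style rows; non-earners are included with income 0.
survey = [
    {"sex": 1, "income": 2000.0},
    {"sex": 1, "income": 0.0},
    {"sex": 2, "income": 1200.0},
    {"sex": 2, "income": 400.0},
]
# Hypothetical SAM-style records carrying only the category code.
sam = [{"sex": 2}, {"sex": 1}, {"sex": 2}]

lookup = mean_income_by_category(survey, "sex")
sam_income = to_monetary(sam, "sex", lookup)
print(lookup)      # {1: 1000.0, 2: 800.0}
print(sam_income)  # [800.0, 1000.0, 800.0]
```

The same lookup would be built once per variable, which is why each variable used in the classification has to be present in both datasets.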

In order to achieve this process fully, re-coding of the data was necessary due to differences in the structure / aggregation of the data between the SAM and BHPS. For example, some variables in the BHPS are continuous (e.g. age, hours worked per week, number of care hours provided) and therefore needed to be aggregated up to match the categories as put forward by the SAM. This was a simple summation process. However, other BHPS variables are also categorical, and where the categories in the BHPS failed to match those in the SAM (e.g. the BHPS contained a far greater number of groupings for Marital Status than the SAM), some data matching was necessary. Using this example, the SAM assigns all individuals into one of three [legal] categories: Single (never married); Married, re-married; and Separated (but still legally married), Divorced or widowed. The BHPS separates individuals into nine groupings, including several extra categories not covered by the SAM, for example, Living as a couple; Have a dissolved civil partnership; Separated from a civil partnership; and Surviving partner of a civil partnership. Matching these individuals into the categories as put forward by the SAM required some decisions to be made prior to transformation to a continuous, monetary value. Figure 3 illustrates the process through which Marital Status within the BHPS was matched to that in the SAM. A similar matching process was also necessary for several other variables, including: Relationship to household reference person (HRP) (30 BHPS vs. 6 SAM categories) and NS-SEC (33 BHPS vs. 8 SAM categories).

Fig. 3 Example of matching data between the BHPS and SAM categories
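The two kinds of re-coding can be illustrated with a short sketch. The lookup below is hypothetical: the BHPS labels are abbreviated, and the assignment of each BHPS grouping to a SAM category (in particular placing "Living as a couple" under the legal status "Single") is an assumption made for illustration rather than the matching shown in Fig. 3. The age bands are those mentioned in this paper; `band_for` is an illustrative helper.

```python
# The three SAM marital-status categories named in the text.
SAM_SINGLE = "Single (never married)"
SAM_MARRIED = "Married, re-married"
SAM_SEPARATED = "Separated, divorced or widowed"

# Hypothetical BHPS-to-SAM lookup; every assignment here is an assumption.
BHPS_TO_SAM = {
    "Never married": SAM_SINGLE,
    "Living as a couple": SAM_SINGLE,  # assumed: not legally married
    "Married": SAM_MARRIED,
    "Separated": SAM_SEPARATED,
    "Divorced": SAM_SEPARATED,
    "Widowed": SAM_SEPARATED,
}


def band_for(value, bands):
    """Return the (lower, upper) band containing `value`, or None if unmatched."""
    for lower, upper in bands:
        if lower <= value <= upper:
            return (lower, upper)
    return None


age_bands = [(0, 4), (5, 9), (10, 15)]
print(BHPS_TO_SAM["Living as a couple"])  # Single (never married)
print(band_for(7, age_bands))             # (5, 9)
print(band_for(40, age_bands))            # None
```

Returning `None` for unmatched values keeps any gaps in the lookup visible instead of silently assigning a default category.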

Determining Suitability of Re-scaled Variables

A conversion to monetary values resulted in certain variables losing their desired meaning; age is the most noteworthy example. Once converted to a continuous scale, persons in the lower age groupings (0–4, 5–9, 10–15) were captured in ways identical to those in older categories and following retirement (generally 70+), as neither age group earns any form of salary (benefits and pensions excluded). For this reason, and given the impact this would have when clustering, variables deemed to be ordinal in their structure and of a format suitable for clustering in their original form were not transformed. These variables included: (1) Age, (2) Number of Hours Worked Per Week, (3) Number of Care Hours Provided, (4) Number of Residents in Household and (5) Number of Cars/Vans Available for Use. These variables are recorded as interval data in the SAM. Modifying these intervals to include only the first value within the interval ensured that the data were transformed to an ordinal format. For example, somebody in the 0–4 age group is recorded as having an age of 0 and somebody in the 5–9 age group is recorded as having an age of 5. With regard to the remaining variables, such as ‘Hours Worked per Week’ and ‘Number of Care Hours Provided’, the top value in the range is utilised for classification. Therefore, for somebody working 1–15 hours per week, fifteen is put forward for clustering.
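As a small worked example of this step, the sketch below converts a banded value to a single number using either the lower bound (as for age) or the upper bound (as for hours worked). The function name `band_to_value` and the string format of the bands are assumptions; open-ended bands such as "70+" are not handled.

```python
def band_to_value(band, use="lower"):
    """Convert a band label such as '5-9' to its lower or upper bound."""
    # Accept both a plain hyphen and an en dash as the range separator.
    lower, upper = (int(part) for part in band.replace("–", "-").split("-"))
    return lower if use == "lower" else upper


ages = [band_to_value(band) for band in ["0-4", "5-9", "10-15"]]
hours = band_to_value("1-15", use="upper")
print(ages)   # [0, 5, 10]
print(hours)  # 15
```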

The National Socio-Economic Classification (NS-SEC, 8 Classes) variable also underwent some refinements. This variable was considered suitable for classifying under its original data structure (despite being nominal by definition) in the same way as the five variables stated above (e.g. 1 - Large employers & higher managerial occupations, 2 - Higher professional occupations, 3 - Lower managerial and professional occupations, etc). This variable was left in its original form due to the fact that the final two categories were wholly incomplete, resulting in zero average earnings per month. Although this variable is not designed to be hierarchical, it was decided to keep this in a continuous monetary format and estimate the two missing categories based on a combined average of the two nearest categories. Similar to the importance of gauging how (dis)similar individuals in the marital status category are, the same process was applied here.

Finally, the data were normalised and polarity was ensured, i.e. high values in all variables were positive and low values were negative (excluding any variables that may be regarded as neutral).
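The normalisation method is not specified here, so the following sketch assumes z-scores, with an optional sign flip for variables whose high raw values are treated as negative. The choice of z-scores, the function name `standardise` and the treatment of care hours as a negative-polarity variable are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardise(values, higher_is_positive=True):
    """Z-score a column; flip the sign when high raw values count as negative."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    sign = 1.0 if higher_is_positive else -1.0
    return [sign * (value - mu) / sigma for value in values]


# Hypothetical columns: income (high is positive) and care hours
# (assumed here, for illustration only, to be high is negative).
income = standardise([800.0, 1000.0, 1200.0])
care_hours = standardise([0, 20, 50], higher_is_positive=False)
print(income)
print(care_hours)
```

After this step every variable has mean zero, so values above the (polarity-adjusted) average are positive and values below it are negative.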

Classifying the Data

As noted in the Transforming and Re-scaling Variables section, K-Means was employed as the clustering algorithm. The algorithm functions iteratively, moving a case from one cluster to another to see if the move would reduce the sum of squared deviations within each cluster (Aldenderfer and Blashfield 1984). The case will then be allocated (or re-allocated) to the cluster to which it brings the maximum improvement. The next iteration takes place when all the cases have been processed. A stable classification is therefore achieved when no moves occur during a full iteration of the data. After clustering is complete, it is then possible to inspect the means of each cluster (i.e. the cluster centres) to gauge the distinctiveness of the clusters (Everitt et al. 2011). K-Means was applied to the data and the number of clusters was varied based on suggestions made by Milligan (1996) and Gibson and See (2006), whereby the classification was repeated to identify the change in the slope of the scree. A classification with five clusters was chosen to evidence the functionality of the framework. This decision was largely down to the data loss that is generally experienced when extending to a higher number of groupings, something illustrated by the percentage of Within cluster Sums of Squares (%WSS), and to ensure ease of comparability between districts.
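The clustering and scree inspection can be sketched as follows. This is an illustrative implementation, not the software used in this study: it uses the batch (Lloyd-style) variant of K-Means, in which all cases are reassigned before the centres are updated, rather than the case-by-case relocation described above, and it takes the first k cases as starting centres for the sake of a reproducible example. The data points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Batch K-Means; returns (centres, labels, within-cluster sum of squares)."""
    centres = [list(point) for point in points[:k]]  # simple deterministic start
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centres[c]))
            for point in points
        ]
        if new_labels == labels:  # no case moved: the classification is stable
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the previous centre if a cluster empties
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(squared_distance(p, centres[label]) for p, label in zip(points, labels))
    return centres, labels, wss


# Hypothetical two-variable data containing three well-separated groups.
points = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0), (0.0, 1.0), (10.0, 11.0), (21.0, 0.0)]
scree = {k: k_means(points, k)[2] for k in range(1, 5)}
for k, wss in scree.items():
    print(k, round(wss, 2))
```

In this toy example the within-cluster sum of squares falls sharply up to three clusters and only marginally afterwards, which is the kind of change in slope that a scree inspection looks for.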

Of the sixteen variables chosen, the only variable not deemed to add value was the Country variable since the system was only developed for Leeds (and Richmondshire) at this point, which are both in England. Should this framework be used to develop a UK-wide classification or to contrast two areas in different constituent countries of the UK (England, Wales, Scotland or Northern Ireland) then such a variable is worthy of use, hence its inclusion as part of the wider framework. The resulting classification is therefore based on the remaining fifteen individual-level variables from Table 1.

Adding the Geographical Component

The final phase in constructing the classification involved the addition of geography. It is important to emphasise the value of geography in order to avoid producing a purely sociological classification of individuals and ensure any neighbourhood effects present in traditional areal systems are captured.

The classification was linked to an individual population for Leeds that was generated using spatial microsimulation. Microsimulation creates a synthetic population, drawn from an anonymous sample of individual-level data, that ‘realistically’ matches the observed population (see Fig. 4 for an illustration). Spatial microsimulation allows neighbourhood effects to be captured.

Fig. 4 Schematic outlining the basic process of creating a synthetic population using microsimulation (from Crooks et al. 2018)

The process follows the steps detailed below:

  1. A population of individuals, termed the sample, is obtained (normally from the UK Census). This sample represents a higher spatial level, such as a country or one of its statistical areas.

  2. To create a population for a smaller geographical area (such as Leeds’ output areas used in this study) weights are applied to each member of the sample. For example, if the small area is multi-ethnic, we may wish to apply high weights to members of the sample whose country of birth is outside the UK.

  3. For each small area, a series of constraint tables that count the distribution of characteristics in the population for a range of attributes are used.

The overall objective of the microsimulation is to generate a set of weights so that when the sample population is aggregated, the goodness of fit between the model distributions and the equivalent constraints is maximised. There are several algorithms that can be used to achieve this, including deterministic reweighting (Ballas et al. 2005), conditional probabilities (Birkin and Clarke 1988) and combinatorial optimisation (Voas and Williamson 2000). Following the recommendation of Harland et al. (2012), the combinatorial optimisation approach was used. This algorithm uses the simulated annealing approach to optimise the number of matches in the synthetic population. The synthetic population was generated using the Flexible Modelling Framework (Harland 2013).
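To make the combinatorial optimisation idea concrete, the sketch below selects sample members for one small area by simulated annealing, minimising the total absolute error between the simulated counts and a single constraint table. It is a toy illustration under stated assumptions and not the algorithm implemented in the Flexible Modelling Framework: the sample, the constraint counts, the cooling schedule and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def total_absolute_error(selection, sample, constraints):
    """Sum of |simulated count - constraint count| over all categories."""
    counts = Counter(sample[i] for i in selection)
    return sum(abs(counts[cat] - target) for cat, target in constraints.items())


def anneal(sample, constraints, n_people, steps=2000, temp=5.0, cooling=0.995, seed=0):
    """Choose n_people sample members (with replacement) to fit the constraints."""
    rng = random.Random(seed)
    current = [rng.randrange(len(sample)) for _ in range(n_people)]
    current_err = total_absolute_error(current, sample, constraints)
    best, best_err = list(current), current_err
    for _ in range(steps):
        if best_err == 0:
            break
        # Propose swapping one selected person for a random sample member.
        candidate = list(current)
        candidate[rng.randrange(n_people)] = rng.randrange(len(sample))
        cand_err = total_absolute_error(candidate, sample, constraints)
        # Always accept improvements; accept worse swaps with a probability
        # that shrinks as the temperature cools.
        accept_worse = rng.random() < math.exp((current_err - cand_err) / temp)
        if cand_err <= current_err or accept_worse:
            current, current_err = candidate, cand_err
            if current_err < best_err:
                best, best_err = list(current), current_err
        temp *= cooling
    return best, best_err


# Hypothetical sample (one attribute per person) and constraint table.
sample = ["young", "young", "middle", "old"]
constraints = {"young": 3, "middle": 5, "old": 2}
best, best_err = anneal(sample, constraints, n_people=10)
print(best_err, Counter(sample[i] for i in best))
```

A real application would evaluate several constraint tables at once and repeat the procedure for every output area; the acceptance rule is what distinguishes simulated annealing from a simple hill climb.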

The 715,402 individuals (2001) were synthesised at output area level (2438 areas) using constraint data acquired from the Census of Population via CASWEB (2001) and survey data courtesy of the British Household Panel Survey. These synthetic individuals, which are geographically referenced via output area, were then linked to the SAM-based classification using common variables in order to assign a cluster code from the classification to each individual.

The purpose of this link is to attribute each member of the complete population a cluster code based on the classification generated on the modified SAM data. This then ensures all members of the population have a cluster code (indicative of their behaviour) and an output area reference enabling the capture of the aforementioned notion of neighbourhood. Should this be based purely on the SAM data, any analysis would be restricted to the local authority level and hence the influence of neighbourhood would be lost.

The eight variables common between the classification and microsimulated dataset were first converted into SAM-identical format (i.e. monetary income values or ordinal equivalents). Then, the Euclidean distance between each individual and each of the SAM cluster centres was calculated (the final distance was divided by 10,000 in all cases to reduce the magnitude of the values and allow for ease of interpretation). The cluster with the minimum distance was then assigned to each microsimulated individual. Figure 5 illustrates this process. Although the chosen method adopts crisp clustering, it does provide a fuzzy-like representation of individuals in each output area, thereby reducing the ecological fallacy associated with a purely area-based crisp geodemographic classification.

Fig. 5 Visual illustration of SAM classification to microsimulated dataset linking process
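The linking step amounts to a nearest-centre assignment, which can be sketched as below. The cluster centres and the individual are hypothetical two-variable examples (the study used eight common variables), and `nearest_cluster` is an illustrative name; the division by 10,000 follows the scaling described above.

```python
import math


def nearest_cluster(person, centres, scale=10_000):
    """Return (cluster code, scaled Euclidean distance) for the closest centre."""
    distances = {
        code: math.dist(person, centre) / scale for code, centre in centres.items()
    }
    code = min(distances, key=distances.get)
    return code, distances[code]


# Hypothetical centres: (monetary value, age) for two clusters.
centres = {1: (2500.0, 37.0), 2: (0.0, 12.0)}
code, distance = nearest_cluster((300.0, 14.0), centres)
print(code)  # 2
```

Because the scaling divides every distance by the same constant, it changes the magnitude of the reported values but not which cluster is nearest.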

Visualising and Understanding the Output

Given the addition of geography undertaken in phase 6, visualisation was carried out based on the modal cluster membership of the individuals in each Leeds output area. Although this involved the aggregation of individual-level data and hence a move back towards area-based geodemographics, it was executed here entirely for the purpose of understanding the predominant geographical distribution of individuals.
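A minimal sketch of the modal aggregation is given below, assuming each synthetic individual is held as an (output area, cluster code) pair. The area identifiers are placeholders and `modal_cluster_by_area` is an illustrative name; ties are resolved in favour of the cluster encountered first.

```python
from collections import Counter, defaultdict


def modal_cluster_by_area(individuals):
    """Map each output area to its most common cluster code."""
    by_area = defaultdict(Counter)
    for area, cluster in individuals:
        by_area[area][cluster] += 1
    return {area: counts.most_common(1)[0][0] for area, counts in by_area.items()}


# Hypothetical individuals as (output area, cluster code) pairs.
individuals = [
    ("OA_A", 1), ("OA_A", 1), ("OA_A", 4),
    ("OA_B", 5), ("OA_B", 5), ("OA_B", 3),
]
modal = modal_cluster_by_area(individuals)
print(modal)  # {'OA_A': 1, 'OA_B': 5}
```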

Validation and Enrichment

This final phase represents an opportunity to both validate and enrich the results with supplementary information. The ability to link the final individual-based classification to external non-census datasets provides a means of profiling in far greater depth against behavioural and lifestyle information. Not only does this add value to a classification through enrichment, but it can also be used for validation purposes. For example, one might expect any cluster categorised as being predominantly young, city-living types to be technologically advanced or physically active. Given that a system built entirely on census data cannot benchmark against such variables, the ability to link the classification to survey datasets like the BHPS, which contains variables of this nature, adds real value. It also gives users of these external datasets an alternative method through which to view their data.

Through a process of statistical matching (identical to phase 6), the cluster codes from the classification were appended onto other datasets. It was possible to match the cluster codes onto the BHPS dataset and profile the results against other variables (principally behavioural) such as an individual’s propensity to dine out of an evening or take flights abroad during the course of a twelve-month period. Such outcomes not only add value and enrich the classification but also offer an opportunity to corroborate the clustering process.
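One common way to profile a behavioural variable by cluster is to index each cluster's rate against the overall rate, with 100 representing the average. The sketch below assumes this convention, which is not stated in this paper; the records, the variable name `dines_out` and the function name `profile_index` are hypothetical.

```python
def profile_index(records, behaviour):
    """Index each cluster's rate for a yes/no behaviour against the overall rate."""
    overall = sum(record[behaviour] for record in records) / len(records)
    if overall == 0:
        raise ValueError("behaviour never occurs, so no index can be formed")
    by_cluster = {}
    for record in records:
        by_cluster.setdefault(record["cluster"], []).append(record[behaviour])
    return {
        cluster: 100 * (sum(values) / len(values)) / overall
        for cluster, values in by_cluster.items()
    }


# Hypothetical survey records with an appended cluster code.
records = [
    {"cluster": 1, "dines_out": 1},
    {"cluster": 1, "dines_out": 1},
    {"cluster": 5, "dines_out": 0},
    {"cluster": 5, "dines_out": 1},
]
index = profile_index(records, "dines_out")
print({cluster: round(value, 1) for cluster, value in index.items()})  # {1: 133.3, 5: 66.7}
```

Values well above or below 100 indicate behaviours that distinguish a cluster, which is also what makes such profiles usable as a plausibility check on the cluster labels.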

Results

This section shows the results of the individual-based classification, i.e. phases 7 and 8 of the framework (Fig. 2), for Leeds.

Results from Phase 7: Analysis of Clusters

Cluster centres are an important way of analysing cluster composition. The results from the final cluster centres provide broad indications as to the typical population characteristics within each cluster and are listed in Table 4 for Leeds. Table 5 interprets the output from Table 4 and illustrates this in a more understandable format.

Table 4 Final cluster centres for Leeds SAM classification
Table 5 Interpreted cluster centres for Leeds SAM classification

From the output in Tables 4 and 5 and through an assessment of variables relative to a global UK average, it is possible to develop pen portraits of clusters and devise naming conventions in a way similar to that adopted in conventional area-based geodemographics. The profiles for Leeds are detailed below:

  • Cluster 1: Affluent Managers

This cluster is a middle-aged cluster with an average age of thirty-seven years. Typically households are quite affluent as reflected by access to two cars, being largely employed in managerial capacities. Members of this cluster provide some weekly care for relatives and work typical hours. Members tend to be married and live in households with circa four people, likely to include children. Individuals in this cluster are typically of White British ethnicity and well educated.

  • Cluster 2: Young People living with Family

This cluster contains a youthful and healthy demographic with an average age of twelve years. These individuals live with their parents who are married, have good general health and are of White British ethnicity. The household has access to one car, is heated and on average houses around four people. They are the son/daughter of the head of household.

  • Cluster 3: Co-habiting Couples

This cluster is characterised by young individuals with start-up families. Members of this cluster tend to be single by legal definition but may be cohabiting. Individuals are in their mid/late twenties, have access to a car and work predominantly in semi-routine occupations with employment taking up to circa forty hours per week. Members have some education and are typically in good to fair health.

  • Cluster 4: Average Resident

This cluster is characterised by individuals in their mid-thirties who are married with children. Health is recorded as fair and members have some education and work typical-length weeks. Households typically contain three individuals, with care provided for family on a weekly basis. Education levels are fair and access to a car is common.

  • Cluster 5: Nearing Retirement

This cluster contains an elderly demographic with a typical age of sixty-two. Members tend to be married without children at home and in fair health. Of those still working, most work in lower managerial occupations and have some education. The average sized household is two persons with most married and of a White British Ethnicity.

A sample of ten output areas is shown in Table 6 to provide an overview of how the cluster allocation process was carried out.

Table 6 First ten Leeds output areas (sorted A-Z) and associated cluster codes

The linking process does differentiate between individuals by linking them with different clusters. As this matching process makes use of circa half of the SAM classification variables present in the microsimulated dataset, there is clearly scope for improvement.

Cluster 1, categorised by individuals in higher managerial occupations and in 2+ car households, tends to be distributed in the more affluent areas of the city, in particular to the north and with some presence in the east. By contrast, Clusters 3 and 4, which may be regarded as the less affluent cluster types given the semi-routine occupations (probably leading to longer working weeks, as also identified in the classification), fair health and, in the case of Cluster 3, persons sharing houses who are unrelated, show different patterns. These clusters are focused more around the inner city, particularly in the case of Cluster 4 and, to a lesser extent, Cluster 3, the latter also being more sporadic in its spatial pattern. Cluster 5, typified by elderly populations, arguably does not follow the conventional spatial patterns one would expect in a UK city; however, it is principally occupied by people in menial employment, which may hinder mobility.

Figure 6 presents an alternative way of visualising the true demographic composition of an area. The proportion of each cluster is highlighted within a given output area, which introduces a level of fuzziness to the presentation. The ten output areas listed in Table 6 are shown in Fig. 6 which demonstrates the variability of individual types across space but can also be used as a tool for the exploration of patterns.

Fig. 6

Ten Leeds 2001 output areas based on cluster membership. Includes Census Area Statistic ward boundaries for partial context. Contains National Statistics data © Crown copyright and database right 2012. Contains Ordnance Survey data © Crown copyright and database right 2012
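The proportional view of cluster membership shown in Fig. 6 can be illustrated with a short sketch. It is not the code used in this study: the function name `cluster_shares`, the output area labels and the individual records are all hypothetical, and the only assumption is that each simulated individual carries an output area code and one of the five cluster codes.

```python
from collections import Counter, defaultdict


def cluster_shares(records, n_clusters=5):
    """Return {output_area: [share of cluster 1, ..., share of cluster n]}."""
    counts = defaultdict(Counter)
    for output_area, cluster in records:
        counts[output_area][cluster] += 1

    shares = {}
    for output_area, cluster_counts in counts.items():
        total = sum(cluster_counts.values())
        # A Counter returns 0 for clusters that are absent from an area.
        shares[output_area] = [
            cluster_counts[k] / total for k in range(1, n_clusters + 1)
        ]
    return shares


# Hypothetical (output area, cluster code) pairs, one per simulated individual.
people = [
    ("OA_A", 1), ("OA_A", 1), ("OA_A", 4), ("OA_A", 5),
    ("OA_B", 2), ("OA_B", 3), ("OA_B", 3), ("OA_B", 3),
]

shares = cluster_shares(people)
for output_area in sorted(shares):
    print(output_area, shares[output_area])
```

Each list of shares sums to one, so an output area is described by its mix of individual types rather than by a single dominant cluster label.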

Results from Phase 8: Validating and Adding Value

Through adopting a process akin to the statistical matching method discussed when linking the SAM cluster codes to the simulated datasets and small-area geography, it is possible to link the classification cluster codes to external datasets. The only requirement is the presence of common variables between the two datasets to enable the codes to be matched. To illustrate this process, a link to the BHPS (wave 18) was established. This link enabled each of the individual records in the BHPS to be assigned to one of five SAM clusters.
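One plausible reading of this linkage is a nearest-centre assignment restricted to the variables that both datasets share. The sketch below works under that assumption and should not be taken as the exact matching procedure used here. The function `assign_cluster`, the variable names and all numeric values are invented for illustration, and the variables are assumed to have been standardised in the same way in both files.

```python
import math


def assign_cluster(record, centres, common_vars):
    """Return the code of the cluster centre closest to `record`.

    Distance is Euclidean and uses only the variables in `common_vars`,
    i.e. those present in both the classification and the external file.
    """
    def distance(centre):
        return math.sqrt(
            sum((record[v] - centre[v]) ** 2 for v in common_vars)
        )

    return min(centres, key=lambda code: distance(centres[code]))


# Hypothetical standardised cluster centres (only two clusters shown).
centres = {
    1: {"age": 0.1, "cars": 1.2, "hours": 0.4, "tenure": 0.9},
    2: {"age": -1.5, "cars": 0.0, "hours": -1.0, "tenure": 0.3},
}

# "tenure" is treated as absent from the external survey, so it is excluded.
common_vars = ["age", "cars", "hours"]

survey_record = {"age": -1.2, "cars": 0.1, "hours": -0.8}
print(assign_cluster(survey_record, centres, common_vars))
```

Dropping a variable that is missing from the external file changes the distances, so the fewer common variables there are, the less reliably a record will be matched to the cluster it would have joined in the original classification.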

Variables present in the BHPS are designed to describe socio-economic conditions at both individual and household level (ISER 2011). Variable categories include household organisation, employment, accommodation, tenancy, income and wealth, housing, health, socio-economic values, residential mobility, marital and relationship history, social support, and individual and household demographics (ISER 2011), and hence add value over and above the variables present in the original SAM file. Furthermore, the extensive choice of variables within the BHPS made it easy to ensure a robust link between this data file and the SAM classification. Of the fifteen variables used to create the SAM classification, fourteen were present in the BHPS. The results are shown in Table 7.

Table 7 Contrasting SAM Classification with BHPS Individuals and extracting new information

As can be seen from Table 7, when the full cluster descriptors are taken into consideration, the results appear to corroborate the cluster characteristics to some degree. However, there is a series of anomalies that may be explained by the methods adopted.

Cluster 2 is categorised by predominantly young individuals (aged circa 12) living with family. One should therefore not be surprised that this cluster is one of the least likely to dine out on a monthly basis and is one of the more active clusters when it comes to sport and physical activity but one of the least active when it comes to attending costly sporting events. Furthermore, the categorisation of individuals in this cluster, being of non-voting age, is supported by the BHPS statistic which denotes that 18.6% of individuals in this cluster are ineligible to vote in elections. The results referred to here suggest some degree of success with regard to this matching process as far as validation goes. However, one must also consider the wider impact of the household. Dining out is likely to be a function of family decisions and finance rather than decisions made by the individuals specifically assigned to this cluster. Hence this links back to coarser spatial units influencing the individual (household, neighbourhood, environment etc).

A second example can be seen from assessing Cluster 1 (Affluent Managers). As many of the variables presented in Table 7 can be linked to availability of disposable income, it is unsurprising that members of this cluster have a high tendency to dine out once per month (greater than other clusters) and attend sporting venues. The high proportion of members willing to vote (the only cluster where ‘No Vote’ is not highly ranked) in addition to an alignment towards the Conservative Party are also statistics that corroborate the cluster output. The low percentage partaking in sport is rather surprising.
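Within-cluster percentages of the kind discussed in these two examples amount to a simple cross-tabulation of cluster code against a survey response. The following is a minimal sketch of that calculation only. The function `percent_by_cluster`, the response labels and the records are hypothetical and do not reproduce any figure from Table 7.

```python
from collections import defaultdict


def percent_by_cluster(records, category):
    """Return {cluster: % of that cluster's records answering `category`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cluster, answer in records:
        totals[cluster] += 1
        if answer == category:
            hits[cluster] += 1
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


# Hypothetical (cluster code, voting response) pairs for linked survey records.
records = [
    (1, "votes"), (1, "votes"), (1, "no vote"), (1, "votes"),
    (2, "ineligible"), (2, "votes"), (2, "ineligible"), (2, "no vote"),
]

profile = percent_by_cluster(records, "ineligible")
for cluster in sorted(profile):
    print(cluster, profile[cluster])
```

Comparing such percentages across clusters is what allows a behavioural variable held only in the external survey to be read against the census-based cluster descriptions.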

A key observation to be highlighted is the use of Leeds’ final cluster centres when classifying the complete BHPS (wave 18). Naturally, different parts of the UK look rather different in terms of their demographic profiles and the use of Leeds’ cluster centres may have impacted on the results of this BHPS linkage process – particularly given that the BHPS is UK-wide. Furthermore, adopting the complete BHPS file as opposed to a more regionalised subset may also have had some bearing on the results presented in Table 7.

Discussion and Conclusions

The framework presented and discussed in this paper is one of the first fully open and transparent methodologies geared towards individual-level classification that uses only data from the UK census. It achieves the goal of producing a geodemographic classification at the person unit. Inevitably, however, the framework is not the finished product nor is it without its problems but it does provide reason to maintain the pursuit of individual-level classification as the ultimate in geodemographic analysis. Although it was demonstrated using data from the 2001 census, it can be applied to any census where there are small area microdata.

The proposed framework combines added discrimination with reduced ecological fallacy impact through operating at the level of the individual. If a system is deemed to discriminate better than alternatives, then it will, as a consequence, reduce the level of ecological fallacy as the clusters are likely to be more homogeneous. As highlighted previously, as the quantity of variables increases in an aggregate-data classification, the scope for misrepresentation also increases as fewer people are likely to fit the described cluster demographic. At the level of the person, although this is also the case, it is easier to maintain a greater level of homogeneity as one individual can easily be re-classified should he/she not fit a given cluster definition. It is only through supplementing with geography that the issue of ecological fallacy really arises.

In terms of weaknesses, certain variables within this framework do not appear to differentiate between individuals particularly well. Such variables include ethnic group and gender. Even though these variables may be termed fundamental census characteristics, later versions of this framework may need to make more detailed decisions on the variables included. Furthermore, a means of handling high-valued continuous monetary variables alongside low-valued ordinal variables will be important if this framework is to evolve successfully. Nonetheless, this research acts as the first piece of academic work to attempt to classify individual person-level data in this way and incorporate small-area geography and linkages to external datasets for both validation and deeper profiling.

The inter-disciplinary opportunities that profiling at this level generates, in particular the ability to profile against external datasets, give further research using this framework a broad appeal. This work has demonstrated an ability to link to the BHPS and explore behavioural datasets over and above pure census characteristics held directly within the classification. Opportunities therefore exist to profile against datasets such as the Health Survey for England, the Crime Survey for England and Wales and the National Travel Survey. When one considers the refinement of the framework in addition to such diverse profiling opportunities, the scope for research extension is clear, and the policy implications of more accurate and finer-level classifications offer incentives to pursue this research direction.