As we pointed out in Chap. 7, there is frequent debate in the literature over the relative contributions of composition and context in the statistical explanation of individual-level outcomes, such as self-reported health and the incidence and prevalence of disease or mortality. This tutorial applies the insights from Chap. 7 to the patterning of the prevalence of cardiovascular diseases in Scotland. In particular, we consider whether the prevalence of disease is related to an individual social determinant (occupational social class), an individual biological determinant (current smoking status) or an area-based social determinant. As the area-based social determinant we use area deprivation measured by the Carstairs score, a Census-based variable derived from the social class of the heads of households, male unemployment, lack of car ownership and overcrowding (Carstairs 1995; Carstairs and Morris 1990).

As with the previous two chapters, the software used in this chapter is MLwiN. Further details on multilevel modelling and MLwiN are available from the Centre for Multilevel Modelling. The materials have been written for MLwiN v3.01. The teaching version of the software is available from

The Data

The data are contained in the worksheet ‘CVD-data.wsz’ and are taken from the 1998 Scottish Health Survey; the analysis is related to a published paper (Leyland 2005). The data refer to 8804 respondents aged between 18 and 64. The outcome considered is a self-report of a doctor-diagnosed cardiovascular disease (CVD) condition (angina, diabetes, hypertension, acute myocardial infarction, etc.). This is a binary response: whether (1) or not (0) the respondent has a CVD condition.

figure a

The independent variables at individual level on which we focus in the tutorial are social class and smoking status. Occupational social class is used in three categories: high social class (1 and 2: professional and managerial), intermediate (3: skilled workers), and low (4 and 5 and missing: semiskilled and unskilled manual workers and those for whom social class was missing). Smoking has been categorised as never smoked, light smokers (<10 cigarettes per day), moderate (10–19) and heavy (20+) smokers as well as former smokers. Age and sex are used as control variables in all analyses. At the area level the Carstairs index is used as a continuous variable.

The survey was cluster-sampled, with respondents clustered within 312 small areas (postcode sectors, with an average population of about 5500).

Structure of the Analysis

As a first exploratory step in the analysis, examine the mean Carstairs score by social class and current smoking, and also smoking patterns by social class, to see the dependency between the variables.

After that, we are going to examine a series of models with a view to determining the relationship between the prevalence of CVD diseases and individual social class, current smoking and area deprivation. We will conduct these analyses with a table in mind, filling in the table as we progress (see Table 13.1).

Table 13.1 Outline of a table to report the analysis to untangle context and composition

Estimating the Null Model

The first model to fit is a null model. We will adjust all of the models we fit for age and sex, but we are not going to report the estimates associated with these factors; they are ‘nuisance variables’ that we include only to control for differences between areas in their age and sex composition.

We then set up a two-level model with the response variable CVDDEF and with levels defined by AREA (level 2) and ID (level 1). This is a binomial response with a logit link function and with the denominator given by the constant CONS. We will add CONS to the fixed part of the model and allow for random intercepts across areas by letting the coefficient of CONS vary at random at level 2 (i.e. across areas). It is important that we have a well-fitting model at individual level; otherwise unmeasured individual effects might appear as contextual effects. We have used fractional polynomials in age (Royston et al. 1999) together with interactions with sex to find a parsimonious model that adequately controls for age and sex; these are already included in the model that can be found in the Equations window. We can start off by fitting this model using the first-order MQL approximation but then move on to the second-order PQL approximation. This is then the null model on which we base subsequent analyses.
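Written out, a random intercept model of this kind takes the following general form. This is a sketch in our own notation rather than a copy of the Equations window: y_ij denotes CVDDEF for individual i in area j, f(age, sex) is shorthand for the fractional polynomial and interaction terms (which are not spelled out here), and the level 2 variance is written with the same symbol as in the R-squared formula later in this section.

$$ {y}_{ij}\sim \mathrm{Binomial}\left(1,{\pi}_{ij}\right),\kern1em \mathrm{logit}\left({\pi}_{ij}\right)={\beta}_0+f\left({\mathrm{age}}_{ij},{\mathrm{sex}}_{ij}\right)+{u}_{0j},\kern1em {u}_{0j}\sim N\left(0,{\sigma}_{u0}^2\right) $$

The denominator of 1 corresponds to the constant CONS, and the area effects u_0j are the random intercepts.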

figure b

We can estimate the ICC from this model using the approximation that the individual-level variance is given by π²/3 (= 3.290). So a level 2 variance of 0.043 gives an ICC of 0.043/(0.043 + 3.290) = 0.013; just over 1% of the variation in the prevalence of CVD diseases is attributable to differences between areas.
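The arithmetic can be checked outside MLwiN with a few lines of Python. This is only a sketch: the function name `icc_logistic` is ours and is not part of MLwiN, and the level 2 variance of 0.043 is the estimate quoted above.

```python
import math


def icc_logistic(level2_variance: float) -> float:
    """ICC for a two-level logistic model under the approximation
    that the individual-level variance equals pi^2 / 3."""
    level1_variance = math.pi ** 2 / 3  # approximately 3.290
    return level2_variance / (level2_variance + level1_variance)


icc = icc_logistic(0.043)
print(f"ICC = {icc:.3f}")
```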

A useful diagnostic measure is the R-squared which indicates how much of the total variation has been explained by the fixed part of the model. For multilevel logistic regression, we approximate the explained variation by the variance of the linear predictor (that is, the variance of the fixed part of the model which is on a log odds scale) and get the total variance by adding the variance of the linear predictor to the variance at the higher levels plus our estimate of the variance at the individual level. In other words,

$$ {R}^2=\mathrm{VLP}/\left(\mathrm{VLP}+{\sigma}_{u0}^2+{\pi}^2/3\right) $$

where VLP is the variance of the linear predictor. We can calculate the linear predictor using the Predictions window and including all variables in the fixed part (but not the random part).

figure c

We can use the Averages and correlations window to estimate the standard deviation of this prediction as 0.921. The variance is the square of the standard deviation; this gives VLP = 0.848 and so R-squared = 0.848/(0.848 + 0.043 + 3.290) = 20.3%.
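The same calculation can be sketched in Python as a check on the figures. The function name `r_squared_logistic` is ours; the inputs are the standard deviation of the linear predictor (0.921) and the level 2 variance from the null model (0.043) reported in the text.

```python
import math


def r_squared_logistic(sd_linear_predictor: float, level2_variance: float) -> float:
    """Approximate R-squared for a two-level logistic model:
    VLP / (VLP + level 2 variance + pi^2 / 3)."""
    vlp = sd_linear_predictor ** 2  # variance of the linear predictor
    total_variance = vlp + level2_variance + math.pi ** 2 / 3
    return vlp / total_variance


r_squared = r_squared_logistic(0.921, 0.043)
print(f"VLP = {0.921 ** 2:.3f}, R-squared = {100 * r_squared:.1f}%")
```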

The values of the ICC, VLP and R-squared can be obtained for any two-level multilevel logistic regression model by running the macro ‘modeldiag.txt’. (To run the macro make sure that the output window of the Command interface is open, then open the macro using the File menu and click Execute.)

Fixed Effects

The first model that we want to fit is the model containing individual social class (variable SC). There are three categories of social class; we will fit two dummy variables keeping social class 1 and 2 as the reference category.

figure d

The parameter estimate for social class 3 is a log odds ratio; we can convert this to an odds ratio by exponentiating: exp{0.100} = 1.105, so the odds of CVD diseases are 10.5% higher in social class 3 than in social classes 1 and 2. Similarly we can obtain a 95% confidence interval as exp{0.100 ± 1.96 × 0.064} = (0.975, 1.253). Since this interval includes 1, the odds of CVD diseases in social class 3 are not significantly different from those in social classes 1 and 2.
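For readers who want to reproduce the conversion by hand, the following Python sketch applies the same formula to the estimate (0.100) and standard error (0.064) quoted above. The function name `odds_ratio_ci` is ours and is not an MLwiN command.

```python
import math


def odds_ratio_ci(estimate: float, std_error: float, z: float = 1.96):
    """Convert a log odds ratio and its standard error into an
    odds ratio with a confidence interval (95% when z = 1.96)."""
    odds_ratio = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return odds_ratio, lower, upper


odds_ratio, lower, upper = odds_ratio_ci(0.100, 0.064)
print(f"OR = {odds_ratio:.3f} (95% CI {lower:.3f}, {upper:.3f})")
```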

Odds ratios and 95% confidence intervals can be obtained for all parameter estimates from any logistic regression model by running the macro ‘or.txt’.

Although the odds ratio for social class 3 is not significantly different from that for social classes 1 and 2, that for social classes 4 and 5 is significant (the 95% confidence interval does not include 1). Since we would expect the social class effect to increase across social class categories (CVD prevalence is likely to be higher in social class 3 than in social classes 1 and 2, and higher still among social classes 4 and 5 than in social class 3), we test for a linear trend in the social class variable. We do this by removing the categorical social class variable from the model, fitting social class using a continuous variable created for this purpose (i.e. with values 1, 2 and 3) and testing for the significance of this single variable. This can be done using the Intervals and tests window from the Model menu.
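A test of a single coefficient such as the trend term is commonly carried out as a Wald test, which compares the squared ratio of the estimate to its standard error with a chi-squared distribution on 1 degree of freedom. The sketch below illustrates the idea in Python. The category labels, the estimate (0.12) and the standard error (0.04) are hypothetical values chosen for illustration; they are not results from the CVD data.

```python
import math

# Assumed coding of the three social class categories as a linear score.
TREND_SCORE = {"1 and 2": 1, "3": 2, "4 and 5": 3}


def wald_test(estimate: float, std_error: float) -> tuple[float, float]:
    """Single-parameter Wald test: returns the chi-squared statistic
    (1 degree of freedom) and the corresponding two-sided p-value."""
    z = estimate / std_error
    chi_squared = z ** 2
    # For 1 degree of freedom, P(chi-squared > z^2) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return chi_squared, p_value


# Recode a few hypothetical respondents to the linear score.
scores = [TREND_SCORE[c] for c in ["1 and 2", "4 and 5", "3"]]

# Hypothetical trend estimate and standard error (NOT from the tutorial data).
chi_squared, p_value = wald_test(0.12, 0.04)
print(scores, f"chi-squared = {chi_squared:.2f}, p = {p_value:.4f}")
```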

We can now continue by fitting models containing just smoking and just deprivation (again including age and sex as these were contained in the null model). (Click on a variable in the Equations window and choose Delete term to remove it from the current model.)

figure e

Compared to the reference group of never smokers, the prevalence of CVD diseases is no higher in any of the current smoking categories but is significantly higher among the ex-smokers. Because this is a prevalence study, this may reflect an increased likelihood of giving up smoking once a respondent has been told by a doctor that they have a cardiovascular disease. The categories of smoking are not ordered and so testing the significance of this variable involves testing the significance of differences between categories rather than a test for trend.

figure f

Area deprivation is coded with positive values indicating areas of higher deprivation and negative values indicating areas of lower deprivation. The effect of deprivation is clearly significant; we can consider whether the effects of social class and smoking are significant after controlling for area deprivation. At the same time we will see whether the effect of area deprivation remains significant once individual factors are taken into account. The significant effect of individual social class is attenuated and becomes non-significant when area deprivation is taken into account, whilst area deprivation remains significantly related to the prevalence of CVD diseases. The effect of individual smoking status remains non-significant following adjustment for area deprivation.

With these models we can complete Table 13.1 such that it becomes Table 13.2. This presents a neat summary of the fixed and random parts of the models that we have fitted. The strong influence of the context can be seen through the persistent significance of the area deprivation score even after adjustment for individual factors.

Table 13.2 Estimates from model

Additional Models

There are a variety of other models that we may wish to fit. One of the reasons for the closer relationship between the Carstairs score and the prevalence of CVD diseases may be because the Carstairs score is a continuous variable—indicating a broad range of deprivation—whilst our measure of occupational social class is categorical with just three categories. To satisfy our curiosity that this is not just a measurement issue, we can categorise the deprivation measure into three approximately equal groups and fit some of these models again.
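One way to form three approximately equal groups is to cut the score at its tertiles. The Python sketch below shows the idea on a handful of hypothetical Carstairs scores; the function name `tertile_groups` is ours, and whether the cut points are computed over areas or over respondents is a choice the analyst has to make.

```python
import statistics


def tertile_groups(scores: list[float]) -> list[int]:
    """Assign each score to group 1 (least deprived), 2 or 3
    (most deprived) using the two tertile cut points."""
    cut1, cut2 = statistics.quantiles(scores, n=3)
    return [1 if s <= cut1 else 2 if s <= cut2 else 3 for s in scores]


# Hypothetical deprivation scores for nine areas (not the survey data).
example_scores = [2.0, -4.0, 0.0, 3.0, -1.0, -3.0, 4.0, 1.0, -2.0]
example_groups = tertile_groups(example_scores)
print(example_groups)
```

The resulting group indicator can then be entered into the model as a categorical variable in the same way as social class.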

As we discussed in Chap. 3, contextual variables may be direct observations made on areas detailing, for example, the provision of services. They may be derived from alternative data sources (as in this case: the Carstairs score is based on Census variables). Another possibility is to create contextual variables through the aggregation of individual variables collected in the study. Think about creating a contextual variable describing the social class of the neighbourhood. A simple example would be the proportion of the survey respondents in each area who were in social classes 4 and 5; an alternative might be the difference between the proportion in social classes 4 and 5 and the proportion in social classes 1 and 2. Such variables can be created using the Multilevel data manipulations window found under the Data manipulation menu. These variables permit further examination of the relative importance of composition versus context, given that both descriptors are derived from the same source, but also illustrate how an important contextual descriptor can be created within the data set in the absence of an externally validated measure such as the Carstairs score.
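The two aggregate variables suggested above can be sketched in Python as follows. This is an illustration only: the function name, the area identifiers and the coding of social class as 1 (classes 1 and 2), 2 (class 3) and 3 (classes 4 and 5 and missing) are assumptions made for the example, and in practice the variables would be created within MLwiN.

```python
from collections import defaultdict


def area_social_class_measures(records):
    """For each area, return (proportion in the low category,
    proportion low minus proportion high).

    records: iterable of (area_id, social_class) pairs, with
    social_class assumed coded 1 = high, 2 = intermediate, 3 = low.
    """
    counts = defaultdict(lambda: {"low": 0, "high": 0, "total": 0})
    for area, social_class in records:
        counts[area]["total"] += 1
        if social_class == 3:
            counts[area]["low"] += 1
        elif social_class == 1:
            counts[area]["high"] += 1
    return {
        area: (c["low"] / c["total"], (c["low"] - c["high"]) / c["total"])
        for area, c in counts.items()
    }


# Hypothetical respondents in two areas (not the survey data).
example = [("A", 3), ("A", 1), ("A", 3), ("A", 2), ("B", 1), ("B", 1)]
measures = area_social_class_measures(example)
print(measures)
```

Each respondent in an area then receives the same value of the aggregate variable, which is what makes it a level 2 (contextual) descriptor.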

The aggregation of an individual variable to an area level can change its interpretation. We can construct an area-based smoking score to illustrate this. If an individual is given a score of 3 for a heavy smoker, 2 for a moderate smoker, 1 for a light smoker and 0 for an ex-smoker or a non-smoker, then the average of this score at an area level provides information about current smoking behaviour in an area in terms both of smoking prevalence and dose. The relationship of such a variable to the prevalence of CVD diseases is different to the relationship between individual smoking behaviour and CVD disease prevalence; the area smoking score—just like the area social class score—acts as a marker of area deprivation.
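The scoring scheme just described can be sketched in Python as follows. The category labels and the function name `area_smoking_score` are hypothetical; only the scores (3, 2, 1 and 0) come from the text.

```python
from collections import defaultdict

# Scores as described in the text; the category labels are assumed.
SMOKING_SCORE = {"heavy": 3, "moderate": 2, "light": 1, "ex": 0, "never": 0}


def area_smoking_score(records):
    """Average individual smoking score within each area.

    records: iterable of (area_id, smoking_status) pairs.
    """
    scores_by_area = defaultdict(list)
    for area, status in records:
        scores_by_area[area].append(SMOKING_SCORE[status])
    return {area: sum(s) / len(s) for area, s in scores_by_area.items()}


# Hypothetical respondents in two areas (not the survey data).
example = [
    ("A", "heavy"), ("A", "never"), ("A", "light"), ("A", "ex"),
    ("B", "moderate"), ("B", "light"),
]
area_scores = area_smoking_score(example)
print(area_scores)
```

Note that ex-smokers and never smokers are indistinguishable in this score, so it describes current smoking in the area only.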