Where Is the Geography? A Study of the Predictors of Obesity Using UK Biobank Data and Machine Learning

Zhou, Yunqi; Harris, Richard; Tranos, Emmanouil

doi:10.1007/s41651-023-00142-4

Research Paper
Open access
Published: 12 June 2023

Volume 7, article number 17, (2023)

Journal of Geovisualization and Spatial Analysis

Abstract

In this study, we used individual baseline data from the UK Biobank cohort of participants aged 40–69 across the UK to explore whether there is evidence that geography is related to health disparities in obesity. First, we used multilevel models to decompose the variation in body mass index (BMI) values and to examine the presence of spatial clustering of individual BMI values at various geographic scales. Next, we looked at whether key predictors of obesity, such as physical activity and dietary habits, differ across six cities in England by using a machine learning approach. To do this, we trained random forest models in one city and used them to predict BMI values in other cities to see whether the models were spatially transferable. Subsequently, we turned to socio-economic status, which is of direct interest in the literature on obesity, and combined it with multilevel models to check for the existence of spatially varying effects. The results of the multilevel null models indicate that most of the variance of BMI is due to individual variation, suggesting little evidence of geographical clustering at any geographical scale. The machine learning prediction results show that the effects of the main identified risk factors for obesity are stable (spatially stationary) across cities, based on approximately the same predictive power and broadly constant effect sizes of the main factors. Multilevel models taking socio-economic status into account further support that individual and neighbourhood deprivation levels display limited geographical variation in their effects on obesity across the study areas. Contrary to our expectations, the models together suggest a limited association of geographical context with obesity among the UK Biobank participants.

Introduction

The rapidly increasing prevalence of obesity and interventions to prevent it have become a major health challenge worldwide (Reilly et al. 2018; Sassi 2010; OECD 2017, 2018). The ways that health outcomes, including the risk of being obese, are physically and socially shaped by places are usually described as “neighbourhood or contextual effects” (Bambra et al. 2019; Elliott 2018). Evidence of these effects is used to establish that context matters. If health outcomes are, at least in some regard, location dependent then “one-size-fits-all” prevention strategies should be avoided in favour of tailoring them to the individual and their context. According to the second Law of Geography (Goodchild 2004), “geographic variables exhibit uncontrolled variance”, which is the principle of spatial heterogeneity. The associations may vary over space, particularly over large or sophisticated geographical areas. Given that context is place-based then the expectation is that the predictors of obesity may be different or impact differentially in their effects from one place to another, suggesting the need for geographically tailored interventions.

However, these geographical variations are not always evident. Schuurman et al. (2009) found no evidence of significant spatial clustering of obesity at either the city or neighbourhood level of their analysis across eight suburban neighbourhoods in Metro Vancouver. Moreover, finding geographical variation in any health outcome does not necessitate that the predictors of that relationship also vary from place to place. The use of geographically weighted regression, which allows the functional relationship between dependent and independent variables to change with location, failed to improve the global model significantly in Schuurman et al.’s study, suggesting the relationship is spatially stationary (the same model could be used with the same broad predictive accuracy anywhere within the study region).

Many studies focusing on the neighbourhood effect influencing obesity risk were limited by potential ecological fallacies as they only consider the aggregated, neighbourhood-level variables. Sun et al. (2020b) adopted spatial regression models to explore to what extent spatial inequalities in childhood obesity are attributable to spatial inequalities in socioeconomic characteristics in England using aggregated data. Their analysis indicates positive spatial dependence for childhood obesity prevalence as well as significant associations with aggregated socioeconomic variables across England (Sun et al. 2020b). Similarly, Neelon et al. (2017) harnessed the national prevalence of childhood obesity rather than individual-level health outcomes by adopting a multinomial logistic regression and a geographically weighted logistic regression to examine the association between food insecurity and obesity by area-level deprivation in children across England. The area deprivation was associated with the perceptions of obesity and food insecurity. Importantly, the influence of area deprivation on obesity and food insecurity varied over space in England (Jia et al. 2017).

In contrast to the literature that emphasizes the essential influence of geographical context on obesity (Chi et al. 2013; Ha and Xu 2022; Huang 2021; Jia et al. 2017; Lee et al. 2019; Oshan et al. 2020; Qin et al. 2019; Shahid and Bertazzon 2015), this paper reaches the opposite conclusion: there is evidence neither of any substantial geographical patterning to where more and less obese participants in the UK Biobank data live nor in the predictors of obesity at a more regional scale. We employ data from individuals aged 40–69 years between 2006 and 2010 who participated in a large-scale health survey in Great Britain (UK Biobank 2022). This conclusion is contrary to our expectations and is despite using methods that we assumed would reveal geographical variation. Most studies examining the neighbourhood effect on obesity harness various types of regression methods, including Poisson regression (Lovasi et al. 2013; Nguyen et al. 2021), logistic regression (Alexander et al. 2013; Dadvand et al. 2016; Rossi et al. 2019) and multilevel models (Gilliland et al. 2012; Robinson et al. 2021).

We use multilevel models in the first stage of the analysis to decompose the variation in body mass index (BMI) values and quantify how much of it is linked to the various geographical scales of the model, a step that was neglected in another study that also looks for geographical heterogeneity using data from the UK Biobank (Mason et al. 2021). It is essential to explore the potential geographical clustering of obesity before exploring the potential spatially varying effects of obesity predictors, which this study does. Identifying spatial clustering patterns is a first step towards understanding the spatial autocorrelation of obesity. Both spatial dependence and spatial heterogeneity require attention when modelling dynamic spatial data (Anselin 1988). Understanding the patterns of spatial variation helps to incorporate the variability of related characteristics into model construction (Jacquez 2008), which benefits the further investigation of spatial heterogeneity.

At the second stage, the paper turns to whether the main predictors of obesity, such as dietary habits and physical activities, vary geographically for those measured at the different assessment centres at which the Biobank data are collected. Other studies that focus on the possibility of spatial dependencies and/or variations have adopted geographically weighted regression (Fraser et al. 2012; Oshan et al. 2020) or spatial regression models (Leonard et al. 2014; Bonnet et al. 2022). Here we adopt a different approach that is rooted in the growth of geographic data science and its interest in machine learning (Andrienko et al. 2017; Calafiore et al. 2021; Gahegan 2000; Singleton and Arribas-Bel 2021). Compared with traditional regression approaches, machine learning approaches are efficient and powerful in handling big high-dimensional datasets as a data-driven approach to recognize potential patterns within data themselves (Cracknell and Reading 2014; Feizizadeh et al. 2023). It is essential in spatial analysis to develop data-adaptive, non-linear and multivariate models based on high-dimensional spatial datasets (Kanevski et al. 2008, 2009), which fits well with the scope of machine learning approaches (Du et al. 2020). Machine-based approaches usually involve training a model with a random subset of the complete data to then assess how it performs for the remaining data. The potential problem with such an approach is that it is not geographical if the training strategy does not allow for geographical variation in what it is being trained for.

Additionally, traditional machine learning methods rely on statistical probability principles and assume that data are independently and identically distributed (L’heureux et al. 2017); this assumption does not hold for datasets with spatial heterogeneity. We therefore adopt a more directly spatial training and prediction strategy: random forest models are trained in one city to predict individual BMI values in other cities to examine the possibility of spatially varying predictors of obesity. In detail, the degree of heterogeneity of the predictors is assessed through general evaluations of the models’ predictive performance from city to city, including R-squared, variable importance and coefficient estimates. Previous studies that also employed the UK Biobank data mainly used regression methods to explore the associations between risk factors and obesity. This paper is, to our knowledge, the first study to use machine learning methods to examine predictors of obesity and whether they vary regionally across England. Finally, we used multilevel models, including random intercept and random slope models, as a further check on whether the effects of socio-economic status, at both the individual and neighbourhood deprivation levels, vary across England. The adoption of multilevel models allows neighbourhood measures of deprivation to be taken into consideration.

Data and Method

Study Area and Dataset

The UK Biobank survey is a large prospective, observational cohort study (UK Biobank 2022). Between 2006 and 2010, it recruited 502,656 volunteer participants, mainly aged 40–69 years, who visited one of the 22 assessment centres throughout the UK. It has collected a wide range of individual-level data about those participants, including demographic characteristics, lifestyle habits, socio-economic status, mental health status and neighbourhood environments. These data form the baseline assessment, which is used in this study. Those registered with the National Health Service and living within 25 miles of one of the 22 UK assessment centres were invited to participate.

To explore the geography of participation, we used kernel density estimation (KDE) to evaluate the participation density across Great Britain (Fig. 1). The exact coordinates of each participant’s home address are not known to us, as these have been degraded to a 1 km by 1 km grid to protect participants’ privacy and avoid individual data disclosure (UK Biobank 2022). It should, however, be noted that although KDE reveals “the density” of where participants live, it does not, in its simple form, control for the underlying population size. In other words, it reveals incidences, not rates. Residents living in areas where the invitation buffers overlap have more than one chance of receiving an invitation to participate. Generally, high densities of respondents are concentrated in major cities and towns, that is, closer to the assessment centres.

Six cities in England were chosen as the study areas: Bristol, Birmingham, London, Newcastle, Nottingham and Manchester. Each selected city is located in a different region of England, and the cities have relatively varied contexts in terms of population demographics, urban planning, transportation and behavioural habits. This makes it feasible and meaningful to conduct the spatial analysis, including investigating whether predictive models of obesity are spatially transferable and whether the associations between risk factors and obesity vary spatially across these cities. The high density of UK Biobank participation in these cities (Fig. 1) helps to increase the reliability of our findings.

A range of data were considered when applying the machine learning models. These data can be grouped into four domains of predictors (local environment exposures, interpersonal lifestyle habits, socio-economic status and demographic characteristics), with continuous log BMI values as the health outcome. The multilevel models turned attention to the spatially varying effect of socio-economic status, which was combined with demographic characteristics to predict the risk of obesity.

Local Environment Exposures

UK Biobank provides neighbourhood environment measures covering greenspace exposure, distance to the coast and traffic scores. Linked to the UK Biobank is the UK Biobank Urban Morphometric Platform (UKBUMP), a high-resolution spatial database based on morphometric analysis of the built environment (Sarkar et al. 2015). The proximity to various health-relevant destinations (GP practices, hospitals and fast-food outlets) was adopted in the study. Another complementary dataset, the index of Access to Health Assets and Hazards (AHAH), is a multi-dimensional index developed to measure how “healthy” neighbourhoods are (Green et al. 2018). Three AHAH variables, namely accessibility to fast-food outlets, pubs and tobacco stores, were linked to the risk of obesity.

Interpersonal Lifestyle Habits

Interpersonal lifestyle habits comprise predictors on dietary habits, physical activity, drinking habits, smoking status, sleeping patterns and TV watching. Predictors in the dietary habits domain were collected through questionnaires and interviews about the intake and frequency of eating various types of food, as well as dietary patterns. Physical activity is represented by usual walking pace, as reported in the short-form International Physical Activity Questionnaire. Sleeping patterns include whether participants snore and the duration of sleep. Drinking habits were first quantified in drinking units and then classified into five groups according to previous literature (Perreault et al. 2017): (1) never drinking; (2) previous drinking; (3) within the UK low-risk drinking guidelines (< 14 units (women); < 21 units (men)); (4) hazardous drinking (14–35 units (women); 21–49 units (men)) and (5) harmful drinking (> 35 units (women); > 49 units (men)). Smoking status was coded as non-smoker, previous smoker or current smoker. The time spent watching TV per day represents sedentary habits.
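
The five drinking groups can be written as a small classification rule. The sketch below is only an illustration of the thresholds listed above; the function name, the status labels and the assumption that units are counted per week are ours and are not taken from the UK Biobank coding.

```python
def drinking_group(status, weekly_units, sex):
    """Assign one of the five drinking groups described in the text.

    status: "never", "previous" or "current" (hypothetical labels)
    weekly_units: alcohol units, assumed here to be per week
    sex: "female" or "male"
    """
    if status == "never":
        return "never drinking"
    if status == "previous":
        return "previous drinking"
    # Lower and upper bounds of the hazardous band, by sex.
    low, high = (14, 35) if sex == "female" else (21, 49)
    if weekly_units < low:
        return "within low-risk guidelines"
    if weekly_units <= high:
        return "hazardous drinking"
    return "harmful drinking"


print(drinking_group("current", 20, "male"))
print(drinking_group("current", 20, "female"))
```

The same number of units can fall into different groups for women and men, which is why sex is a required argument.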

Socio-economic Status

Deprivation was measured at both the individual and the neighbourhood level. Individual deprivation was represented directly by household income, which was reported through touchscreen questionnaires. Neighbourhood deprivation was measured by the Index of Multiple Deprivation (IMD) and the Townsend deprivation index, using the release closest to the participant’s year of recruitment (UK Biobank 2022). The higher the IMD value, the more deprived the area. For the Townsend deprivation index, positive values indicate areas with high material deprivation, while negative values indicate relative affluence. A score of 0 represents an area with overall average deprivation.

Demographic Characteristics

Age, gender, ethnicity and family status provide the broad characteristics of these participants, which were obtained through touchscreen questionnaires in the UK Biobank Survey. UK Biobank gathered information about family status, including cohabitation status, marital divorce status and a spouse’s death. Participants who answered “Husband, wife or partner” in response to the question “How are the other people who live with you related to you?” were classified as “living with a spouse or partner”. Participants also answered questions about whether they had experienced the marital separation/divorce and death of a spouse/partner in the last 2 years.

Outcomes: Body Mass Index (BMI, kg/m²)

Body mass index (BMI, kg/m²) values were directly provided by UK Biobank baseline data, which were calculated from height and weight measured during the baseline assessment centre visit. Log BMI values rather than absolute BMI values are treated as the continuous health outcomes to offset the observed positive skewness of BMI values.

Descriptive Sample Characteristics

This study analyses a demographically specific subgroup of 117,108 UK Biobank participants aged 45 to 72 years living in or near one of the six cities (Table 1). Their mean age is 56 years; 51% are male and 87.5% are of white ethnic background. The average BMI is 27.16 kg/m². The NHS follows the World Health Organization’s guidance in defining obesity as a BMI greater than or equal to 30 (NHS 2017; WHO 2018). Adult obesity prevalence in England ranged from approximately 24 to 26% between 2006 and 2010, the baseline data collection period of UK Biobank. The obesity prevalence across cities ranges from 13.1 to 29.9%, with a mean of 22.93% in this study.

Table 1 Summary of sample characteristics (N = 117,108)

Data Analysis

This study follows a multi-stage analysis strategy, with the detailed methodological flow chart shown in Fig. 2. There are three main stages. First, multilevel null models were used to examine the existence of spatial clustering patterns of BMI values. Second, random forest models were trained in one city to predict BMI values in other cities, to investigate whether the models are spatially transferable and whether key predictors of obesity have spatially varying effects. Finally, we focused on the spatially varying effect of socio-economic status by constructing multilevel models across the six cities.

At the first stage, multilevel null models are used to examine the existence of spatial patterns of log BMI values at various geographical scales. In the UK Biobank, participants are nested within coordinates, which are nested within Middle Layer Super Output Areas (MSOA^{Footnote 1}), which are in turn nested within assessment centres (cities). For all cities apart from London (three assessment centres), there is a one-to-one match between cities and assessment centres. This study considered a total of six cities (eight assessment centres): Bristol, Birmingham, London, Newcastle, Nottingham and Manchester. Two-level, three-level and four-level multilevel null models were constructed (some combination of city-MSOA-coordinate-id). The variance partition coefficient (VPC), calculated from null models without any predictor variables (just the response variable, log BMI), was used to explore the extent to which the variation in individual BMI values clusters geographically. Higher VPC values reveal stronger clustering of log BMI values at the geographical level concerned. For a two-level model of individuals nested within geographical coordinates, the standard formula for calculating the VPC is formula (1) below.

$$\mathrm{VPC}_{\mathrm{u}}=\frac{\sigma_{\mathrm{u}}^{2}}{\sigma_{\mathrm{e}}^{2}+\sigma_{\mathrm{u}}^{2}}$$

(1)

where ${\sigma }_{\mathrm{u}}^{2}$ is the level 2 (coordinate level) variance, ${\sigma }_{\mathrm{e}}^{2}$ is the level 1 (individual level) variance and ${\mathrm{VPC}}_{\mathrm{u}}$ is the coordinate level VPC (Merlo et al. 2005)

For a three-level model that considers the hierarchical geographical structure of individuals nested within coordinates, which are then nested within MSOAs, the standard formula for calculating the VPC at the MSOA level is formula (2) below.

$$\mathrm{VPC}_{\mathrm{MSOA}}=\frac{\sigma_{\mathrm{MSOA}}^{2}}{\sigma_{\mathrm{e}}^{2}+\sigma_{\mathrm{u}}^{2}+\sigma_{\mathrm{MSOA}}^{2}}$$

(2)

where ${\sigma }_{\mathrm{MSOA}}^{2}$ is the level 3 (MSOA level) variance, ${\sigma }_{\mathrm{u}}^{2}$ is the level 2 (coordinate level) variance, ${\sigma }_{\mathrm{e}}^{2}$ is the level 1 (individual level) variance and ${\mathrm{VPC}}_{\mathrm{MSOA}}$ is the MSOA-level VPC (Merlo et al. 2005)

For a four-level model of individuals nested within coordinates, MSOAs and cities, the standard formula for calculating the VPC at the city level is formula (3) below.

$$\mathrm{VPC}_{\mathrm{city}}=\frac{\sigma_{\mathrm{city}}^{2}}{\sigma_{\mathrm{e}}^{2}+\sigma_{\mathrm{u}}^{2}+\sigma_{\mathrm{MSOA}}^{2}+\sigma_{\mathrm{city}}^{2}}$$

(3)

where ${\sigma }_{\mathrm{city}}^{2}$ is the level 4 (city level) variance, ${\sigma }_{\mathrm{MSOA}}^{2}$ is the level 3 (MSOA level) variance, ${\sigma }_{\mathrm{u}}^{2}$ is the level 2 (coordinate level) variance, ${\sigma }_{\mathrm{e}}^{2}$ is the level 1 (individual level) variance and ${\mathrm{VPC}}_{\mathrm{city}}$ is the city-level VPC (Merlo et al. 2005)
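
Formulas (1) to (3) share one pattern: the VPC of a level is its variance divided by the sum of the variances of all levels in the model. The minimal sketch below applies that pattern to a dictionary of variance components. The function name and the numbers are illustrative assumptions and are not estimates from this study; in practice the components would be read from a fitted multilevel null model.

```python
def variance_partition(variances):
    """Return the VPC of every level: its variance over the total variance."""
    total = sum(variances.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {level: var / total for level, var in variances.items()}


# Made-up variance components for a four-level null model (not study results).
components = {
    "individual": 0.0285,  # sigma_e^2
    "coordinate": 0.0006,  # sigma_u^2
    "msoa": 0.0008,        # sigma_MSOA^2
    "city": 0.0001,        # sigma_city^2
}
vpc = variance_partition(components)
for level, share in vpc.items():
    print(f"{level}: {share:.3f}")
```

Passing only the individual and coordinate components reproduces formula (1), and adding the MSOA component reproduces formula (2).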

Having used the multilevel null models to look for geographical patterning in the log BMI values, we next turn to the machine learning approach to explore any spatially varying influence of predictors and whether trained models are spatially transferable. The flow chart containing the entire procedure for constructing random forest models in all six cities is shown in the Appendix as Fig. S2. Because the procedure is similar and repeated across cities, the training of unweighted random forest models in Birmingham and Nottingham is shown as an example in Fig. 3. First, a random forest model was trained in one city and then used to predict continuous log BMI values in another city. For instance, we trained our model using the London data and then used this model to make predictions for Newcastle, Manchester, Birmingham, Bristol and Nottingham. The logic is that if the predictors and their effects vary according to city context, then the models should not be spatially transferable. Our interest is in whether they are. Subsequently, models were trained in each of the other cities in sequence to predict log BMI values in the remaining cities.
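
The train-in-one-city, predict-in-the-others procedure amounts to filling a table of scores indexed by (training city, prediction city). The sketch below shows only that loop structure. It is not the authors' code: `cross_city_scores`, `fit_mean` and the toy values are hypothetical, and the stand-in model simply predicts the training mean, whereas the study used random forest models in its place.

```python
def cross_city_scores(data_by_city, fit, score):
    """Train on each city in turn and score the model on every other city.

    data_by_city: {city: (X, y)}
    fit(X, y) returns a predict function; score(y_true, y_pred) returns a float.
    """
    results = {}
    for train_city, (X_train, y_train) in data_by_city.items():
        predict = fit(X_train, y_train)
        for test_city, (X_test, y_test) in data_by_city.items():
            if test_city == train_city:
                continue  # only out-of-city predictions are of interest
            results[(train_city, test_city)] = score(y_test, predict(X_test))
    return results


def fit_mean(X, y):
    """Stand-in model (NOT a random forest): always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda X_new: [mean] * len(X_new)


def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Made-up toy data: one predictor column and log BMI-like outcomes per city.
toy = {
    "London": ([[0], [1]], [3.2, 3.4]),
    "Bristol": ([[0], [1]], [3.3, 3.3]),
    "Newcastle": ([[0], [1]], [3.1, 3.5]),
}
scores = cross_city_scores(toy, fit_mean, mae)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

If the scores in each row of this table are similar, the model trained in that city transfers to the others, which is the pattern the analysis looks for.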

The prediction performance of the random forest models was evaluated by four measures: mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE) and R-squared. Their calculations are shown in formulas (4) to (7). The predictive power of the random forest models was compared using the values of these four measures. Low values of MAE, MSE and RMSE represent high prediction accuracy. The higher the R-squared, the better the prediction accuracy.

$$\mathrm{MAE}=\frac{1}{n}\sum_{i}\left|y_{i}-\widehat{y}_{i}\right|$$

(4)

$$\mathrm{MSE}=\frac{1}{n}\sum_{i}\left(y_{i}-\widehat{y}_{i}\right)^{2}$$

(5)

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i}\left(y_{i}-\widehat{y}_{i}\right)^{2}}$$

(6)

$$R^{2}=1-\frac{\sum_{i}\left(y_{i}-\widehat{y}_{i}\right)^{2}}{\sum_{i}\left(y_{i}-\overline{y}\right)^{2}}$$

(7)

where $y_{i}$ are the observed continuous log BMI values, $\widehat{y}_{i}$ are the log BMI values predicted by the random forest models, $\overline{y}$ is the mean of the observed log BMI values and $n$ is the number of observations (Rustam et al. 2020).
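
As a worked check of formulas (4) to (7), the four measures can be computed directly from paired observed and predicted values. This is a minimal sketch in plain Python; the function name and the four toy values are illustrative assumptions, not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R-squared as in formulas (4) to (7)."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    y_bar = sum(y_true) / n  # mean of the observed values
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Made-up values: three perfect predictions and one that is off by 1.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

RMSE is always the square root of MSE, so the two rank models identically; R-squared additionally scales the squared error by the spread of the observed outcome.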

Furthermore, variable importance plots were built to examine whether and how the importance of predictors for predicting log BMI values varies across space. Variable importance is used to exhibit spatially varying effects because machine learning methods do not provide coefficient estimates that can be compared directly. A permutation feature importance method was adopted to assess variable importance; it is a well-established variable importance measure for random forest models (Strobl et al. 2007). The mechanism is to randomly permute the values of a variable and compare the prediction accuracy before and after the permutation (Strobl et al. 2007). The larger the permutation importance of a variable, the more predictive the variable (Breiman 2001; Chen and Ishwaran 2012).
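
The permutation mechanism can be illustrated with a few lines of plain Python. This is a generic sketch of the idea rather than the implementation used in the study: the function name, the toy predictor that uses only the first column, and the number of repeats are all assumptions made for illustration.

```python
import random


def permutation_importance(predict, X, y, error, n_repeats=10, seed=0):
    """Mean increase in error after shuffling each column of X in turn."""
    rng = random.Random(seed)
    baseline = error(y, predict(X))
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild X with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(error(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy model that depends only on the first column (hypothetical).
def toy_predict(X):
    return [2 * row[0] for row in X]


X = [[1, 5], [2, 3], [3, 9], [4, 1]]
y = [2, 4, 6, 8]
importance = permutation_importance(toy_predict, X, y, mse)
print(importance)
```

Shuffling the second column leaves the predictions unchanged, so its importance is zero, while shuffling the first column increases the error. Comparing such importance values between models trained in different cities is what the variable importance plots do.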

UK Biobank is not representative of the UK population, with selection bias arising from voluntary participation. Van Alten et al. (2022) constructed a weight for each participant to offset the volunteer bias, using the UK Census (2011) as the reference, and they verified that adopting these weights reduces the volunteer bias by 78% on average. We adopted these weights (Van Alten et al. 2022) in the machine learning models, but not in the multilevel models, to investigate whether the significance of geographical context would change due to volunteer bias. Weighted and unweighted random forest models were compared regarding their prediction ability and the importance of variables. Multilevel models require weights at all levels to obtain unbiased parameter estimates (West et al. 2015). However, Van Alten et al. (2022) provide weights only at the individual level. Thus, the multilevel models in this study were not weighted.

The decision-making processes of machine learning methods are somewhat of a black box, meaning that users typically know the inputs and outputs but the procedures inside are not clear (Savage 2022; Watson et al. 2019). In machine learning models, spatial comparisons of the effect sizes of variables are not directly available. Thus, we turned to variables that are of direct interest in the wider literature on obesity and combined them with multilevel models to examine their geographical variation from city to city. In the UK, the prevalence of obesity in low-income neighbourhoods is more than twice that in wealthy neighbourhoods (NHS 2018). Scholars have argued that poor people are limited by economic resources and threatened by poor health conditions (Averett and Smith 2014; Black et al. 2010; Fan et al. 2020; Macintyre et al. 1993). Therefore, socio-economic status, including both individual and neighbourhood deprivation levels, was chosen to investigate whether its association with obesity varies across cities, and ultimately to examine spatial heterogeneity.

The preceding machine learning models include a mixture of variables, some of which may indeed contribute to obesity, but some of which may be symptoms of obesity. For example, the direction of the causal relationship between snoring and obesity remains contested (Fraire et al. 2021; Taylor et al. 2021). Furthermore, the influence of socio-economic status on obesity could be masked by the influence of physical activity, dietary habits and neighbourhood environments. The substitution property of food is affected by income elasticity (Cawley et al. 2010; Monteiro et al. 2004; Moreno-Franco et al. 2018). Additionally, poor populations may live in low-income neighbourhoods with more barriers to exercise due to restricted resources and facilities (Romero 2005; Robert and Reither 2004; Ruel et al. 2010; Salmasi and Celidoni 2017). Thus, in order to compare the effect size of socio-economic status clearly and directly from city to city, demographic characteristics were adjusted for, while other predictors that may partly mask the association between poverty and obesity were not considered.

As participants in the same neighbourhood share similar levels of neighbourhood deprivation, we adopted multilevel models that consider the nested hierarchy of individuals sharing coordinates, which are in turn nested in cities. For each city, we built four two-level (coordinate-id) models in sequence, each adjusting for demographic characteristics (age, gender and ethnicity). Model Income includes the individual deprivation level, that is, household income. Model Townsend Deprivation includes the neighbourhood deprivation level, that is, the Townsend deprivation index. Model Income & Townsend Deprivation includes both household income and the Townsend deprivation index. Model Socio-Economic Status adds other individual socio-economic indicators to Model Income & Townsend Deprivation: the number of vehicles, highest educational attainment and employment status. The effect sizes of these socio-economic variables were compared to investigate whether their influence on obesity was constant or varied across cities. After constructing separate random intercept models in each city, the final step was to build three-level (city-coordinate-id) random slope models to check whether the slopes of socio-economic status differ among cities, as a further investigation of spatially varying influences of deprivation levels.

Results

Null Multilevel Results (Evaluations on VPC Values)

Having fitted the multilevel null models to explore the variation in the log BMI values at the various levels, we found little evidence of any geographical variation at any scale above coordinates, as indicated by the extremely low VPC values for them in the selected six cities (Null Multilevel Model 1 (Model NM1) to Null Multilevel Model 5 (Model NM5)). The calculated VPC values for the null multilevel models (Model UK1 to Model UK5) involving all 22 assessment centres are listed in the Appendix (Table S1). Regardless of the coverage of assessment centres considered, the VPC values at any geographical level are very low, suggesting that there is no substantial clustering pattern in the log BMI values. It is clear that participants who shared coordinates had different BMI values, because over 95% of the overall variation from the null multilevel models is found within the coordinate level. Compared to the variance at the city level or the coordinate level, the MSOA-level variance was marginally larger. The calculated VPC values for BMI values in the selected six cities (Model B1 to Model B5) are also listed in the Appendix (Table S2). Similarly, no matter how many or which geographical scales were included in the multilevel models, the variances between units at all geographic scales were much more limited than the variance within the coordinate level (Table 2 and Table S2).

Table 2 VPC of Null multilevel models in the selected six cities (N = 117,108) for predicting log BMI values

Prediction Abilities of Random Forest Models Across Cities (Comparisons of MAE, MSE, RMSE and R-squared)

We first fitted random forest models in one city using the set of predictors and then used the trained models to predict continuous log BMI values in other cities. The whole prediction procedure was conducted from city to city in sequence. A VIF diagnosis with a threshold of 5 was carried out to avoid collinearity issues among predictors (Akinwande et al. 2015; Salmerón et al. 2018; Vatcheva et al. 2016). MAE, MSE, RMSE and R-squared were used to compare prediction ability across cities, and it was assumed that similar MAE, MSE, RMSE and R-squared values across cities for the same trained model indicate that the model transfers across spatial domains with similar predictive ability.

Additionally, the use of IP-weights did not substantially change the prediction ability, as evaluated by MAE, MSE, RMSE and R-squared, for modelling log BMI values across cities. Thus, only the results from the unweighted models (Model RF1 to Model RF6) are listed in Table 3. The prediction abilities of the weighted random forest models (Model W1 to Model W6) are shown in the Appendix (Table S3). The R-squared values across cities for the different random forest models are quite similar, between 18 and 22% (Table 3). MAE, MSE and RMSE values are also fairly close across cities, at around 0.110, 0.020 and 0.143 respectively. The differences among MSE values are the smallest compared with the other error indicators. MAE measures the average magnitude of the residuals, while MSE measures the average of the squared residuals. The MSE values for predicting London and Manchester are relatively high compared with other cities. It is worth noting that MSE is sensitive to outliers, and a few large errors between predicted and actual log BMI values may result in high MSE values. It is therefore useful to consider both MSE and R-squared to evaluate the prediction accuracy, as R-squared is less sensitive to outliers. The R-squared for predicting log BMI values in London remained relatively low, whereas the R-squared for predicting log BMI values in Manchester was relatively high.
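To make the difference between the error measures concrete, the following sketch computes MAE, MSE, RMSE and R-squared from their standard definitions and compares two sets of predictions with the same MAE but different MSE. The function name and all values are hypothetical and are not taken from Table 3.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Standard definitions of MAE, MSE, RMSE and R-squared."""
    residuals = [y - p for y, p in zip(y_true, y_pred)]
    n = len(residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": sqrt(mse), "R2": r2}


y_true = [3.2, 3.3, 3.4, 3.5]      # hypothetical log BMI values
spread = [3.3, 3.2, 3.5, 3.4]      # four small errors of 0.1 each
one_large = [3.2, 3.3, 3.4, 3.9]   # a single large error of 0.4

a = regression_metrics(y_true, spread)
b = regression_metrics(y_true, one_large)
print(a)
print(b)
```

Both prediction sets have an MAE of 0.1, but the single large error raises the MSE roughly fourfold, which is the sensitivity to outliers noted above. The same definitions also link the reported figures: the square root of an MSE of 0.020 is about 0.141, close to the reported RMSE of about 0.143.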

Table 3 MAE, MSE, RMSE and R-squared for predicting log BMI values in unweighted random forest models

Regardless of which city was used to train the prediction models, the R-squared for predicting log BMI values for Birmingham and Manchester is approximately 20%. There are slight differences in predicting log BMI values in Newcastle and Nottingham based on different models. The model trained on London performs best in predicting Bristol, and vice versa: the model trained on Bristol performs best in predicting London. The model based on Manchester consistently has relatively low explanatory power compared with the other models. One possible explanation is that Manchester has the smallest sample size.

Model transferability is the application of parameter estimates from a previous model to a new context (Karasmaa 2001). Spatial transferability has been applied to habitat suitability models and travel demand models to explore the stability and performance of existing models when applied to new geographical areas (Lauria et al. 2015; Yasmin et al. 2015). However, spatial transferability has rarely been explored in the field of health disparities; our analysis addresses this by comparing MAE, MSE, RMSE and R-squared for predicting log BMI values from city to city. We are interested in whether the predictive power varies according to the city context, in order to examine whether the models are spatially transferable. Overall, the prediction abilities of the models do not vary greatly, suggesting stability in predicting log BMI values from one context to another. The models trained in different cities can therefore be regarded as spatially transferable, regardless of the error measure considered.

Variable Importance Plots of Random Forest Models Across Cities

Note that similar MAE, MSE, RMSE and R-squared values across models for different cities do not necessarily mean that the models agree on which variables are important. It is, therefore, important to examine the possibility of spatially varying effects, and exploring the variable importance of predictors across cities helps to do so. Even though the use of weights affects the variable importance of neighbourhood environments more than that of other types of predictors, the influence of neighbourhood environments is still limited compared with lifestyle habits and socio-economic status. In summary, the overall influence of the weights is limited across cities. Thus, only the variable importance plots from unweighted models (Model RF1 to Model RF6) are displayed in Fig. 4, and the variable importance plots from weighted models are displayed in the Appendix (Fig. S1).

Variables with the symbol “_” represent one category of a categorical variable. For instance, Usual walking pace_Brisk pace represents the category “Brisk pace” of the Usual walking pace variable, and the reference category is Usual walking pace_Slow pace. The choice of reference category for categorical variables depends on their attributes and contents. The first type of categorical variable was classified by frequency of occurrence (none, once a week, twice a week etc.) or by intensity or magnitude (low, medium, high). The second type was classified by their properties (British White, Black, and other ethnic groups). For the first type, the lowest intensity, magnitude or frequency was used as the reference level. Otherwise, the category with the largest population was chosen as the reference category; for example, British White is the reference category for ethnicity. Variables without the symbol “_” are continuous variables.
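The naming convention for categorical predictors can be illustrated with a small dummy-coding sketch. It is only an illustration: the function is hypothetical, and the third category label ("Steady average pace") is an assumption not stated in the text above.

```python
def dummy_code(variable, values, reference):
    """One 0/1 column per non-reference category, named '<variable>_<category>'."""
    categories = sorted(set(values) - {reference})
    return {
        f"{variable}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical responses; "Steady average pace" is an assumed category label.
pace = ["Slow pace", "Brisk pace", "Steady average pace", "Brisk pace"]
columns = dummy_code("Usual walking pace", pace, reference="Slow pace")
for name, column in columns.items():
    print(name, column)
```

The reference category has no column of its own: a participant in the reference category has zeros in every column for that variable, so each remaining column is interpreted relative to it.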

For all six cities, the usual walking pace, whether the participant snores, gender, and time spent watching TV are the most important predictors of continuous log BMI values. Except in London, the education deprivation index has relatively high variable importance. This may be because neighbourhoods in London have lower educational deprivation compared with other cities. Individual food intake is less important than the frequency of overall dietary changes. The influence of neighbourhood effects varies across cities, but their impact is limited compared with lifestyle habits. For example, accessibility to the nearest fast-food outlets has a relatively high variable importance in some cities (ranked in the top 10 most important variables), while it has a marginal effect in London (Fig. 4). This is possibly because fast-food outlets are found nearly everywhere in London, at a high density. The variable importance of drinking and smoking habits is limited compared with physical activities and sleeping patterns, but they are more important than some food intake variables.

The overall variable importance across different cities has more similarities than differences. Even though neighbourhood environments perform differently among cities, their influence is relatively limited compared with lifestyle habits. Across the six cities, the most important predictors are lifestyle habits related to physical activities and sleep patterns. The absolute variable importance values of these predictors may change from city to city; however, their rankings remain unchanged across cities. In summary, important variables consistently rank high in predicting log BMI values, while relatively unimportant variables have varying rankings in different cities. The similar importance of these key variables across cities corroborates the transferability observed in the previous section on the basis of similar prediction abilities.
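One way to quantify how far importance rankings agree between two cities is a Spearman rank correlation. The paper compares rankings visually, so this is a swapped-in summary measure rather than the authors' method, and the importance values and variable names below are hypothetical.

```python
def ranks(importance):
    """Rank variables from 1 (most important) downwards; assumes no ties."""
    order = sorted(importance, key=importance.get, reverse=True)
    return {name: position for position, name in enumerate(order, start=1)}


def spearman(importance_a, importance_b):
    """Spearman rank correlation for two importance dictionaries without ties."""
    rank_a, rank_b = ranks(importance_a), ranks(importance_b)
    n = len(rank_a)
    d2 = sum((rank_a[name] - rank_b[name]) ** 2 for name in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical importance scores for three cities (not the values behind Fig. 4).
city_x = {"walking_pace": 0.30, "snoring": 0.20, "gender": 0.15,
          "tv_time": 0.12, "fast_food_access": 0.05}
city_y = {"walking_pace": 0.28, "snoring": 0.22, "gender": 0.14,
          "tv_time": 0.13, "fast_food_access": 0.02}
city_z = {"walking_pace": 0.31, "snoring": 0.19, "gender": 0.16,
          "tv_time": 0.04, "fast_food_access": 0.06}

print(spearman(city_x, city_y))
print(spearman(city_x, city_z))
```

In this toy example the first pair of cities has identical rankings despite different absolute values, while the second pair differs only in the order of the two least important variables.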

Results from the Multilevel Models Containing Socio-economic Status and Demographic Characteristics

In the wider literature, there is interest in the relationship between deprivation and obesity; we therefore explored whether the association between deprivation and obesity varies spatially. Participants who live in deprived areas have higher BMI values regardless of individual socio-economic status across the six cities. The effect sizes of the Townsend deprivation index are nearly the same for Birmingham, Manchester, Newcastle and Nottingham with or without the consideration of individual deprivation (household income), suggesting a fairly constant effect of neighbourhood deprivation on obesity (Table 4, Fig. 5). In the Model Townsend Deprivation, an increase of one unit in the Townsend deprivation index is associated with a 0.006-kg/${\mathrm{m}}^{2}$ increase in log BMI values across these four cities. In the Model Income & Townsend Deprivation and Model Socio-Economic Status, which consider individual socio-economic status, there are slight differences in the coefficient estimates for the Townsend deprivation index. However, they are still between 0.005 and 0.006 kg/${\mathrm{m}}^{2}$ for Birmingham, Manchester, Newcastle and Nottingham, which is close to the estimates in the Model Townsend Deprivation. The effect sizes can therefore still be regarded as spatially stationary for these four cities.
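Because the outcome is log BMI, a coefficient can also be read as an approximate proportional change in BMI. The sketch below shows the back-transformation, assuming the natural logarithm was used (the text does not state the base); the function name is hypothetical.

```python
import math


def percent_change(beta):
    """Percentage change in the outcome per unit of the predictor,
    for a coefficient estimated on the natural-log scale (an assumption)."""
    return (math.exp(beta) - 1) * 100


for beta in (0.005, 0.006):
    print(f"beta = {beta}: about {percent_change(beta):.2f}% higher BMI per unit")
```

Under that assumption, coefficients of 0.005 to 0.006 correspond to BMI values that are roughly 0.5 to 0.6% higher per one-unit increase in the Townsend deprivation index.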

Table 4 Associations between Townsend deprivation index with log BMI values (estimated using two-level (coordinate-id) regression models) in Birmingham (n = 13,195), Bristol (n = 25,035), London (n = 31,536), Manchester (n = 7331), Newcastle (n = 20,821) and Nottingham (n = 19,190)

For Bristol, the Townsend deprivation index coefficient estimates are between 0.002 and 0.003 kg/${\mathrm{m}}^{2}$ in the three models, which is approximately half the effect found for the other cities (Table 4, Fig. 5). For London, the effect of neighbourhood deprivation on obesity is an order of magnitude greater than for the other cities. The effect size of the Townsend deprivation level is slightly larger in the model that contains only the neighbourhood deprivation level than in the models with both neighbourhood and individual socio-economic status. In the Model Townsend Deprivation, a one-unit increase in the Townsend deprivation level is associated with a 0.073 kg/${\mathrm{m}}^{2}$ rise in log BMI values. In the Model Income & Townsend Deprivation and Model Socio-Economic Status, the coefficient estimates for the Townsend deprivation level are 0.055 and 0.063 kg/${\mathrm{m}}^{2}$.

Three-level (city-coordinate-id) random slope models, which allow the slopes to vary randomly across cities, were fitted to explore whether the effect sizes of the Townsend deprivation index were constant in these cities. The results of the random slope models are shown in the Appendix (Table S4). The slope did not vary across cities: the estimated slope variance was smaller than 0.001 (standard errors < 0.001) whether or not individual socio-economic status was considered, suggesting no differential effect of the Townsend deprivation level across space. Additionally, compared with the single slope models, the corresponding random slope models did not bring statistically significant changes in Deviance, indicating that the varying slopes did not provide a better fit for predicting log BMI values.
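The Deviance comparison between a single slope model and its random slope counterpart can be sketched as a likelihood-ratio test. The sketch assumes that the random slope adds two parameters (a slope variance and an intercept-slope covariance) and uses a plain chi-square reference distribution, which tends to be conservative when a variance is tested on its boundary. The deviance values and function names are hypothetical, not those in Table S4.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def deviance_test(deviance_single, deviance_random, df=2):
    """Drop in deviance between nested models and its chi-square p-value."""
    drop = deviance_single - deviance_random
    return drop, chi2_sf_even_df(max(drop, 0.0), df)


# Hypothetical deviances for a single slope and a random slope model.
drop, p_value = deviance_test(-52340.0, -52341.2)
print(f"deviance drop = {drop:.1f}, p = {p_value:.3f}")
```

A small drop in deviance with a large p-value, as in this toy case, is the pattern described above: allowing the slope to vary across cities does not improve the fit.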

As household income increased, the regression coefficients decreased monotonically for all three models in all six cities, suggesting that lower household income is associated with higher BMI values in these six cities (Table 5). For Birmingham, in the Model Income, compared with the lowest household income < £18,000 (reference category), the log BMI values were 0.039 kg/${\mathrm{m}}^{2}$ lower for those with the highest household income > £100,000. In the Model Income & Townsend Deprivation, which takes the neighbourhood-level deprivation level (Townsend deprivation index) into account, those with the highest household income have lower log BMI values (β = − 0.026 kg/${\mathrm{m}}^{2}$, P < 0.001) compared with those with the lowest household income. Taking other individual socio-economic statuses into consideration, those with the highest household income have log BMI values 0.012 kg/${\mathrm{m}}^{2}$ lower than those with the lowest household income. The effect of household income is thus reduced as other socio-economic conditions are accounted for, and the other individual socio-economic statuses have a more evident influence than the neighbourhood deprivation level.

Table 5 Associations between household income with log BMI values (estimated using two-level (coordinate-id) regression models) in Birmingham (n = 13,195), Bristol (n = 25,035), London (n = 31,536), Manchester (n = 7331), Newcastle (n = 20,821) and Nottingham (n = 19,190)

For London, in the Model Income, those with the highest household incomes (> £100,000) had log BMI values 1.349 kg/${\mathrm{m}}^{2}$ lower (β = − 1.349; 95% CI: − 1.56264, − 1.13536; p < 0.001) than those with the lowest incomes (< £18,000). For Newcastle, Nottingham and Manchester, by contrast, those with the highest household income had log BMI values approximately 0.04 (0.047, 0.041, 0.046) kg/${\mathrm{m}}^{2}$ lower than those with the lowest incomes, without consideration of other socio-economic statuses. The pattern is similar for Model Income & Townsend Deprivation and Model Socio-Economic Status: the log BMI differences between income categories were most evident in London when compared with the other five cities, which is consistent with the largest effect sizes of the Townsend deprivation index being found in London. When three-level (city-coordinate-id) random slope models allowed household income to have a different effect in each city, both the estimated variance and the standard error of the household income slopes were smaller than 0.001, suggesting that the effect of household income is broadly stationary, whether or not other socio-economic statuses are taken into account. Furthermore, the Deviance differences between the single slope models and the matched random slope models were not statistically significant, suggesting there were no differences between cities in the relationships between household income levels and log BMI values.

Discussion

Summary of Findings

Although many studies link place and health to explain health disparities across space, we did not find evidence of an effect of geographic context, or of substantial spatially varying effects of predictors, on obesity for middle-aged and older adults across six cities. We found no geographical clustering of log BMI values at any geographic level, given the extremely low VPC values at every geographical scale. Furthermore, we found no substantial differences in the predictive power of machine learning models across cities in predicting BMI values, suggesting that the models were spatially transferable from one context to another in England. Compared with lifestyle habits, including dietary habits and physical activities, the spatially varying neighbourhood effect on obesity is marginal. We also found that even when severe volunteer bias was reduced by using the IP-weights proposed by van Alten et al. (2022), the models were still transferable from one city context to another, the effects of risk factors remained constant and neighbourhood effects remained marginal. Finally, multilevel models incorporating socio-economic status and demographic characteristics indicated that the associations of both neighbourhood and individual deprivation levels with obesity were consistent across space.

Interpretations and Discussion of Results

The extremely low VPC values from the multilevel null models illustrate that the majority of the variance in individual BMI values lies within coordinates rather than between them. In other words, the BMI values of participants who share the same coordinate vary greatly. Mason et al. (2021) also noted that it is not surprising that most of the variation in BMI would be expected to be explained by individual-level factors. Scholars have debated the significance of geography more for obesity prevalence at the aggregate level than for the prediction of individual BMI values, as in this study. In this study, because coordinates were rounded to 1 km, over 300 participants may share the same coordinates, and their individual BMI values may vary considerably because being obese is the consequence of a complex and multifactorial social gradient in the interactions of context factors and interpersonal behaviours (Barton and Grant 2006; Black et al. 2010; Colberg et al. 2016; Hamasaki 2016; Healy et al. 2015; Kriska et al. 2003; Nguyen et al. 2017; Pesta et al. 2018; Reilly et al. 2018). It is reasonable to assume that it is more difficult to detect the significance of geographical context when predicting individual BMI values than when predicting the regional prevalence of obesity.

Given that there is no apparent clustering of individual BMI values at any geographical level, the similar prediction abilities of the random forest models across cities suggest that the models are spatially transferable from one city context to another. Furthermore, there are more similarities than differences in the variable importance of predictors across cities, indicating constant associations between predictors and risk of obesity over space. The R-squared values of the models are not very high, but they are acceptable for social science research when most predictors are statistically significant (Ozili 2023). Predicting continuous BMI values is also more difficult than predicting binary outcomes (obese or not). The analysis was conducted at the individual level with a large number of participants, and R-squared values get smaller when the sample size increases (Reisinger 1997; Ozili 2023). High R-squared values are more common in predicting aggregated obesity outcomes, such as the prevalence of obesity (Shrestha et al. 2013; Sun et al. 2020a). Additionally, the major objective of this work is to examine whether geographical context influences obesity, based on model prediction performance and variable importance across cities, rather than to predict BMI values accurately.

Mason et al. (2021), who also used the UK Biobank, proposed that the relationships between the availability of neighbourhood physical activity facilities and BMI, and between fast-food proximity and BMI, varied from place to place across urban England. Nevertheless, they did not attempt to account for the influence of lifestyle habits, such as physical activities and dietary habits, which were considered in our study. In this study, the magnitude of the effect of neighbourhood variables varies across cities, but their impact is limited compared with lifestyle habits, such as the usual walking pace. Overall, our results indicate that the spatially varying effect of neighbourhood factors explains less of the individual disparities in obesity than lifestyle habits do. Rossi et al. (2019) also implied that the influence of neighbourhood effects on obesity may be marginal compared with personal habits such as physical activities and diets.

Spatially varying performance of key factors would support geographic heterogeneity; however, this was not found in this study. The separate two-level random intercept models further illustrated the constant influence of both individual and neighbourhood deprivation levels in Birmingham, Manchester, Newcastle and Nottingham, based on the constant effect sizes. We acknowledge that the coefficient estimates for socio-economic status reveal a much more pronounced effect on obesity for London than for the other cities. One possible explanation is that London is more racially diverse, with more highly educated and higher-income participants. If geographic context did cause differences in the performance of variables, a random slope model would improve the predictive power, with evidently varying slopes among cities. However, the three-level random slope models indicated that, when all six cities were considered as a whole, the effect sizes for deprivation remained constant: both the variance and the standard errors were close to 0, and the Deviance differences between the single slope models and the random slope models were not statistically significant. This is not to conclude that geography cannot make any difference to the effect of socio-economic status; rather, the finding depends primarily on the study region and sample. In summary, the effect of geography on the association between deprivation levels and obesity is limited for the selected cities from the UK Biobank data.

Strengths and Limitations

In contrast to other studies that stressed the significance of spatial heterogeneity and geographical patterns, such as clustering (hot spots and cold spots) of health outcomes, this paper proposes that there is no substantial geographical patterning between areas with more and less obese participants because most of the variation the models find is at the individual level. Accordingly, this paper questions the significance of geographical context and suggests stationary associations between the main risk factors and obesity. It also suggests that random forest models are spatially transferable, based on similar prediction abilities across cities. This study used a machine learning method, random forests, to examine whether there are spatially varying effects, based on prediction results among different cities.

Multilevel models were also used to offset the “black-box” disadvantages of machine learning methods and to look for spatial variations in socio-economic status. Previous UK-based studies on contextual factors are either ecological analyses (Neelon et al. 2017; Sun et al. 2020a) or only focused on small areas with limited observations (Fraser et al. 2010; Fraser et al. 2012). This paper employs individual-level data from the UK Biobank (UK Biobank 2022) and linked UKBUMP, to look at both micro- and meso-level analysis in the UK to explore if there are any geographical patterns in obesity and the factors explaining obesity.

Although scholars have questioned the representativeness and sample bias of the UK Biobank, few studies have examined whether and how volunteer bias affects the observed findings. This paper compared the weighted and unweighted associations to explore whether volunteer bias influences the findings on geographical context, using the weights proposed by van Alten et al. (2022). We found that the weights did not bring evident changes in the prediction ability and variable importance suggested by the random forest models.

There are limitations to this study. The results should be viewed as a case study on the importance or otherwise of geographical context in the studied areas, rather than as suggesting that the limited effects of geographical context would necessarily hold in other circumstances. The analysis is cross-sectional rather than longitudinal. UK Biobank did not update the data for each participant after the baseline assessment visits during 2006 to 2010. Thus, the overall analysis is restricted to the period from 2006 to 2010. Other scholars who have used UK Biobank data have also noted that the datasets are cross-sectional and that their use is limited by the collection method and time (Burgoine et al. 2018; Cassidy et al. 2017; Dadvand et al. 2014; Healy et al. 2015; Mason et al. 2018; Mason et al. 2020; Rauber et al. 2021). The study does not consider how macro-level processes affect obesity over time. Additional variables such as local taxes (on, for example, high sugar drinks) and local policies reflecting social and economic context may be beneficial in understanding spatial variations in obesity in, for example, cross-national studies (Black 2014).

Another potential limitation is the imprecise georeferencing of participants. The rounded coordinates for home addresses may lead to inaccurate classification of participants living near a Lower Super Output Area (LSOA) boundary, and thus to matching with inaccurate neighbourhood variables measured at LSOA level, such as the AHAH variables. In addition, potential edge effects are ignored in this paper; that is, residents may be affected by their neighbours and by neighbouring environments (Van Meter et al. 2010). Furthermore, the explanatory power of the selected variables is limited, so more variables are required to explain the spatial variations in obesity.

Despite the aforementioned limitations, a strength of this study is that the use of both a machine learning approach and multilevel models helps to show that there are no substantial geographical patterns in where more and less obese UK Biobank participants live, and that the predictors of obesity do not vary substantially across space. The insignificance of geographical context for obesity remains stable even after applying IP-weights that offset particular sampling bias in UK Biobank data. Future research is needed to extend the study coverage to the whole UK rather than England alone, to consider edge effects and to adopt more macro variables, such as local policies, local taxes and regional poverty.

Conclusion

Most studies on neighbourhood effects implicitly assume that such effects are critical in producing health disparities. However, using the UK Biobank, we found no substantial evidence for the importance of geographical context in understanding differences in BMI values or their predictors. The possible exception is neighbourhood environments, including accessibility to fast food outlets, health services, pubs and tobacco stores, whose relationships with BMI values vary from place to place in England. However, their importance in predicting individual BMI values is limited when compared with interpersonal lifestyle habits, such as physical activities. Furthermore, individual and neighbourhood socio-economic status have an almost constant influence on obesity in Birmingham, Manchester, Newcastle and Nottingham, while their influence is more pronounced in London. However, when the six cities are considered as a whole, the associations between socio-economic status and obesity remain constant and limited compared with interpersonal behaviours.

Additionally, there are no apparent clustering patterns of individual BMI values at a number of geographical scales, specifically city, MSOA and coordinates, suggesting that the variation in individual BMI values mainly lies within geographical units, at the level of the individual, rather than between units at any geographical scale. We do not conclude that our results support a “one-size-fits-all” policy in relation to health interventions, because our study areas are relatively developed cities in England, and deprived cities and towns are not considered in our analysis owing to data scarcity. Furthermore, we find that ethnic minority populations and deprived areas are each associated with higher obesity risks, thereby requiring attention from policy makers. This paper contributes to the wider debate on the importance of geography for health disparities; in this study, evidence for neighbourhood or contextual effects was somewhat elusive, which suggests the need for a deeper understanding of the magnitude and scope of geography and neighbourhood effects. Future work could extend to cover a wide range of deprived areas of the UK, taking into account edge effects and using more macro and time-varying variables to examine the temporal association of obesity risk across a wider area.

Data Availability

The datasets generated and analysed during the current study are not publicly available because UK Biobank requires a formal application procedure to access the datasets. Interested researchers can apply for access to UK Biobank datasets through https://www.ukbiobank.ac.uk/register-apply/ and http://www.ukbiobank.ac.uk/using-the-resource/.

Notes

A Middle Layer Super Output Area (MSOA) is a geographical area with a minimum population of 5000, defined to improve understanding of small areas in England and Wales (Middle Layer Super Output Area 2022).

References

Akinwande MO, Dikko HG, Samson A (2015) Variance inflation factor: as a condition for the inclusion of suppressor variable(s) in regression analysis. Open J Stat 5(07):754
Alexander DS, Huber LRB, Piper CR, Tanner AE (2013) The association between recreational parks, facilities and childhood obesity: a cross-sectional study of the 2007 National Survey of Children’s Health. J Epidemiol Community Health 67(5):427–431
Averett SL, Smith JK (2014) Financial hardship and obesity. Econ Hum Biol 15:201–212
van Alten S, Domingue BW, Galama TJ, Marees AT (2022) Reweighting the UK Biobank to reflect its underlying sampling population substantially reduces pervasive selection bias due to volunteering. Preprint at medRxiv. https://doi.org/10.1101/2022.05.16.22275048
Andrienko G, Andrienko N, Weibel R (2017) Geographic data science. IEEE Comput Graphics Appl 37(5):15–17
Anselin L (1988) Spatial econometrics: methods and models (Vol. 4). Springer Science & Business Media
Bambra C, Smith KE, Pearce J (2019) Scaling up: the politics of health and place. Soc Sci Med 232:36–42
Barton H, Grant M (2006) A health map for the local human habitat. J Royal Soc Prom Health 126(6):252–252
Black NC (2014) An ecological approach to understanding adult obesity prevalence in the United States: a county-level analysis using geographically weighted regression. Appl Spat Anal Policy 7(3):283–299
Black JL, Macinko J, Dixon LB, Fryer GE Jr (2010) Neighborhoods and obesity in New York City. Health Place 16(3):489–499
Bonnet C, Détang-Dessendre C, Orozco V, Rouvière E (2022) Spatial spillovers, living environment and obesity in France: evidence from a spatial econometric framework. Soc Sci Med 305:114999. https://doi.org/10.1016/j.socscimed.2022.114999
Breiman L (2001) Random forests. Mach Learn 45:5–32
Burgoine T, Sarkar C, Webster CJ, Monsivais P (2018) Examining the interaction of fast-food outlet exposure and income on diet and obesity: evidence from 51,361 UK Biobank participants. Int J Behav Nutr Phys Act 15(1):1–12. https://doi.org/10.1186/s12966-018-0699-8
Calafiore A, Palmer G, Comber S, Arribas-Bel D, Singleton A (2021) A geographic data science framework for the functional and contextual analysis of human dynamics within global cities. Comput Environ Urban Syst 85:101539
Cassidy S, Chau JY, Catt M, Bauman A, Trenell MI (2017) Low physical activity, high television viewing and poor sleep duration cluster in overweight and obese adults; a cross-sectional study of 398,984 participants from the UK Biobank. Int J Behav Nutr Phys Act 14(1):1–10
Cawley J, Moran J, Simon K (2010) The impact of income on the weight of elderly Americans. Health Econ 19(8):979–993
Chen X, Ishwaran H (2012) Random forests for genomic data analysis. Genomics 99(6):323–329
Chi SH, Grigsby-Toussaint DS, Bradford N, Choi J (2013) Can geographically weighted regression improve our contextual understanding of obesity in the US? Findings from the USDA Food Atlas. Appl Geogr 44:134–142
Colberg SR, Sigal RJ, Yardley JE, Riddell MC, Dunstan DW, Dempsey PC, Horton ES, Castorino K, Tate DF (2016) Physical activity/exercise and diabetes: a position statement of the American Diabetes Association. Diabetes Care 39(11):2065–2079
Cracknell MJ, Reading AM (2014) Geological mapping using remote sensing data: a comparison of five machine learning algorithms, their response to variations in the spatial distribution of training data and the use of explicit spatial information. Comput Geosci 63:22–33
Dadvand P, Villanueva CM, Font-Ribera L, Martinez D, Basagaña X, Belmonte J, Vrijheid M, Gražulevičienė R, Kogevinas M, Nieuwenhuijsen MJ (2014) Risks and benefits of green spaces for children: a cross-sectional study of associations with sedentary behavior, obesity, asthma, and allergy. Environ Health Perspect 122(12):1329–1335
Du P, Bai X, Tan K, Xue Z, Samat A, Xia J, Li E, Su H, Liu W (2020) Advances of four machine learning methods for spatial data handling: a review. J Geovisualization Spat Anal 4(1):1–25
Elliott SJ (2018) 50 years of medical health geography(ies) of health and wellbeing. Soc Sci Med 196:206–208. https://doi.org/10.1016/j.socscimed.2017.11.013
Fan JX, Wen M, Li K (2020) Associations between obesity and neighborhood socioeconomic status: variations by gender and family income status. SSM-Popul Health 10:100529. https://doi.org/10.1016/j.ssmph.2019.100529
Feizizadeh B, Omarzadeh D, KazemiGarajeh M, Lakes T, Blaschke T (2023) Machine learning data-driven approaches for land use/cover mapping and trend analysis using Google Earth Engine. J Environ Plan Manage 66(3):665–697
Fraire JA, Deltetto NM, Catalani F, Orden AB, Mayer M (2021) Prevalence of sleep-disordered breathing among adolescents and its association with the presence of obesity and hypertension. Arch Argent Pediatr 119(4):245–250
Fraser LK, Clarke GP, Cade JE, Edwards KL (2012) Fast food and obesity: a spatial analysis in a large United Kingdom population of children aged 13–15. Am J Prev Med 42(5):e77–e85
Gahegan M (2000) On the application of inductive machine learning tools to geographical analysis. Geogr Anal 32(2):113–139
Gilliland JA, Rangel CY, Healy MA, Tucker P, Loebach JE, Hess PM, He M, Irwin JD, Wilk P (2012) Linking childhood obesity to the built environment: a multi-level analysis of home and school neighbourhood factors associated with body mass index. Can J Public Health 103(3):S15–S21
Goodchild MF (2004) The validity and usefulness of laws in geographic information science and geography. Ann Assoc Amer Geogr 94(2):300–303
Green MA, Daras K, Davies A, Barr B, Singleton A (2018) Developing an openly accessible multi-dimensional small area index of ‘Access to Healthy Assets and Hazards’ for Great Britain, 2016. Health Place 54:11–19
Ha H, Xu Y (2022) An ecological study on the spatially varying association between adult obesity rates and altitude in the United States: using geographically weighted regression. Int J Environ Health Res 32(5):1030–1042
Hamasaki H (2016) Daily physical activity and type 2 diabetes: a review. World J Diabetes 7(12):243
Healy GN, Winkler EA, Brakenridge CL, Reeves MM, Eakin EG (2015) Accelerometer-derived sedentary and physical activity time in overweight/obese adults with type 2 diabetes: cross-sectional associations with cardiometabolic biomarkers. PLoS ONE 10(3):e0119140
Huang H (2021) A spatial analysis of obesity: interaction of urban food environments and racial segregation in Chicago. J Urban Health 98(5):676–686
Jacquez GM (2008) Spatial cluster analysis. In: Fotheringham S, Wilson J (eds) The Handbook of Geographic Information Science. Blackwell, Oxford, pp 395–416
Jia P, Cheng X, Xue H, Wang Y (2017) Applications of geographic information systems (GIS) data and methods in obesity-related research. Obes Rev 18(4):400–411
Kanevski M, Pozdnoukhov A, Timonin V (2008) Machine learning algorithms for geospatial data. Applications and software tools. In: 4th International Congress on Environmental Modelling and Software. IEMS, Barcelona, Spain
Kanevski M, Pozdnoukhov A, Timonin V (2009) Machine learning for spatial environmental data: theory, applications, and software. EPFL Press
Karasmaa N (2001) The spatial transferability of the Helsinki metropolitan area mode choice models. In: Selected Proceedings of the 9th World Conference on Transport Research World Conference on Transport Research Society. Worldcat, Seoul, Korea
Kriska AM, Saremi A, Hanson RL, Bennett PH, Kobes S, Williams DE, Knowler WC (2003) Physical activity, obesity, and the incidence of type 2 diabetes in a high-risk population. Am J Epidemiol 158(7):669–675
L’heureux A, Grolinger K, Elyamany HF, Capretz MA (2017) Machine learning with big data: challenges and approaches. IEEE Access 5:7776–7797
Lauria V, Power AM, Lordan C, Weetman A, Johnson MP (2015) Spatial transferability of habitat suitability models of Nephrops norvegicus among fished areas in the Northeast Atlantic: sufficiently stable for marine resource conservation? PLoS One 10(2). https://doi.org/10.1371/journal.pone.0117006
Lee KH, Heo J, Jayaraman R, Dawson S (2019) Proximity to parks and natural areas as an environmental determinant to spatial disparities in obesity prevalence. Appl Geogr 112:102074. https://doi.org/10.1016/j.apgeog.2019.102074
Leonard T, McKillop C, Carson JA, Shuval K (2014) Neighborhood effects on food consumption. J Behav Exp Econ 51:99–113
Lovasi GS, Schwartz-Soicher O, Quinn JW, Berger DK, Neckerman KM, Jaslow R, Lee KK, Rundle A (2013) Neighborhood safety and green space as predictors of obesity among preschool children from low-income families in New York City. Prev Med 57(3):189–193
Macintyre S, Maciver S, Sooman A (1993) Area, class and health: should we be focusing on places or people? J Soc Policy 22(2):213–234
Mason KE, Pearce N, Cummins S (2018) Associations between fast food and physical activity environments and adiposity in mid-life: cross-sectional, observational evidence from UK Biobank. Lancet Public Health 3(1):e24–e33
Mason KE, Pearce N, Cummins S (2020) Do neighbourhood characteristics act together to influence BMI? A cross-sectional study of urban parks and takeaway/fast-food stores as modifiers of the effect of physical activity facilities. Soc Sci Med 261:113242
Mason KE, Pearce N, Cummins S (2021) Geographical heterogeneity across England in associations between the neighbourhood built environment and body mass index. Health Place 71:102645
Merlo J, Yang M, Chaix B, Lynch J, Råstam L (2005) A brief conceptual tutorial on multilevel analysis in social epidemiology: investigating contextual phenomena in different groups of people. J Epidemiol Community Health 59(9):729–736
Middle Layer Super Output Area (2022) NHS choices. NHS. Available at: https://www.datadictionary.nhs.uk/nhs_business_definitions/middle_layer_super_output_area.html. Accessed 3 Jan 2022
Monteiro CA, Moura EC, Conde WL, Popkin BM (2004) Socioeconomic status and obesity in adult populations of developing countries: a review. Bull World Health Organ 82(12):940–946
Moreno-Franco B, Pérez-Tasigchana RF, Lopez-Garcia E, Laclaustra M, Gutierrez-Fisac JL, Rodríguez-Artalejo F, Guallar-Castillón P (2018) Socioeconomic determinants of sarcopenic obesity and frail obesity in community-dwelling older adults: the Seniors-ENRICA Study. Sci Rep 8(1):10760
Neelon SEB, Burgoine T, Gallis JA, Monsivais P (2017) Spatial analysis of food insecurity and obesity by area-level deprivation in children in early years settings in England. Spat Spatio-Temporal Epidemiol 23:1–9
Nguyen QC, Brunisholz KD, Yu W, McCullough M, Hanson HA, Litchman ML, Li F, Wan Y, VanDerslice JA, Wen M, Smith KR (2017) Twitter-derived neighborhood characteristics associated with obesity and diabetes. Sci Rep 7(1):16425
Nguyen TH, Götz S, Kreffter K, Lisak-Wahl S, Dragano N, Weyers S (2021) Neighbourhood deprivation and obesity among 5656 pre-school children—findings from mandatory school enrollment examinations. Eur J Pediatr 180(6):1947–1954
NHS (2017) www.nhs.uk. Available at: https://www.nhs.uk/conditions/obesity/. Accessed 26 May 2021
NHS (2018) NHS Summary report. Available at: https://digital.nhs.uk/data-and-information/publications/statistical/health-survey-for-england/2018/summary#overweight-and-obesity. Accessed 8 Mar 2022
OECD (2017) Obesity update. OECD Publishing, Paris. Available at: http://www.oecd.org/els/health-systems/Obesity-Update-2017.pdf. Accessed 27 June 2021
OECD (2018) Obesity among adults. In: Health at a glance: Europe 2018: State of Health in the EU Cycle. OECD Publishing, Paris/European Union, Brussels
Oshan TM, Smith JP, Fotheringham AS (2020) Targeting the spatial context of obesity determinants via multiscale geographically weighted regression. Int J Health Geogr 19(1):1–17
Ozili PK (2023) The acceptable R-square in empirical modelling for social science research. Soc Res Methodol Publ Results 134–143. https://doi.org/10.4018/978-1-6684-6859-3.ch009
Perreault K, Bauman A, Johnson N, Britton A, Rangul V, Stamatakis E (2017) Does physical activity moderate the association between alcohol drinking and all-cause, cancer and cardiovascular diseases mortality? A pooled analysis of eight British population cohorts. Br J Sports Med 51(8):651–657
Pesta D, Bobrov PD, Zaharia OP, Bódis K, Karusheva Y, Markgraf DF, Burkart V, Müssig K, Strassburger K, Szendroedi J, Roden M (2018) Relationship of physical activity behavior with insulin sensitivity over five years after diagnosis of type 2 diabetes. Diabetes 67(Supplement_1). https://doi.org/10.2337/db18-748-P
Qin W, Wang L, Xu L, Sun L, Li J, Zhang J, Shao H (2019) An exploratory spatial analysis of overweight and obesity among children and adolescents in Shandong, China. BMJ Open 9(8):e028152
Rauber F, Chang K, Vamos EP, da Costa Louzada ML, Monteiro CA, Millett C, Levy RB (2021) Ultra-processed food consumption and risk of obesity: a prospective cohort study of UK Biobank. Eur J Nutr 60(4):2169–2180. https://doi.org/10.1007/s00394-020-02367-1
Reilly JJ, El-Hamdouchi A, Diouf A, Monyeki A, Somda SA (2018) Determining the worldwide prevalence of obesity. Lancet 391(10132):1773–1774
Reisinger H (1997) The impact of research designs on R² in linear regression models: an exploratory meta-analysis. J Emp General Market Sci 2(1):1–12
Robert SA, Reither EN (2004) A multilevel analysis of race, community disadvantage, and body mass index among adults in the US. Soc Sci Med 59(12):2421–2434. https://doi.org/10.1016/j.socscimed.2004.03.034
Robinson TN, Matheson D, Wilson DM, Weintraub DL, Banda JA, McClain A, Sanders LM, Haskell WL, Haydel KF, Kapphahn KI, Pratt C (2021) A community-based, multi-level, multi-setting, multi-component intervention to reduce weight gain among low socioeconomic status Latinx children with overweight or obesity: the Stanford GOALS randomised controlled trial. Lancet Diabetes Endocrinol 9(6):336–349
Romero AJ (2005) Low-income neighborhood barriers and resources for adolescents’ physical activity. J Adoles Health 36(3):253–259
Rossi CE, de PatríciaFragas H, Corrêa EN, Neves das J, de Vasconcelos FD (2019) Association between food, physical activity, and social assistance environments and the body mass index of schoolchildren from different socioeconomic strata. J Publ Health 41(1):e25–e34
Ruel E, Reither EN, Robert SA, Lantz PM (2010) Neighborhood effects on BMI trends: examining BMI trajectories for Black and White women. Health Place 16(2):191–198
Rustam F, Reshi AA, Mehmood A, Ullah S, On BW, Aslam W, Choi GS (2020) COVID-19 future forecasting using supervised machine learning models. IEEE Access 8:101489–101499
Salmasi L, Celidoni M (2017) Investigating the poverty-obesity paradox in Europe. Econ Hum Biol 26:70–85
Salmerón R, García CB, García J (2018) Variance inflation factor and condition number in multiple linear regression. J Stat Comput Simul 88(12):2365–2384
Sarkar C, Webster C, Gallacher J (2015) UK Biobank Urban Morphometric Platform (UKBUMP)–a nationwide resource for evidence-based healthy city planning and public health interventions. Ann GIS 21(2):135–148
Sassi F (2010) Obesity and the economics of prevention: fit not fat. OECD Publishing, Paris. https://doi.org/10.1787/9789264084865-en
Savage N (2022) Breaking into the black box of artificial intelligence. Nature
Schuurman N, Peters PA, Oliver LN (2009) Are obesity and physical activity clustered? A spatial analysis linked to residential density. Obesity 17(12):2202–2209
Shahid R, Bertazzon S (2015) Local spatial analysis and dynamic simulation of childhood obesity and neighbourhood walkability in a major Canadian city. AIMS Public Health 2(4):616
Shrestha R, Mahabir R, Di L (2013) Healthy food accessibility and obesity: case study of Pennsylvania, USA. In: 2013 Second International Conference on Agro-Geoinformatics (Agro-Geoinformatics). IEEE, Fairfax, VA, pp 329–333
Singleton A, Arribas-Bel D (2021) Geographic data science. Geogr Anal 53(1):61–75
Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics 8(1):1–21
Sun Y, Wang S, Sun X (2020a) Estimating neighbourhood-level prevalence of adult obesity by socio-economic, behavioural and built environment factors in New York City. Public Health 186:57–62
Sun Y, Hu X, Huang Y, On Chan T (2020b) Spatial patterns of childhood obesity prevalence in relation to socioeconomic factors across England. ISPRS Int J Geo-Inf 9(10):599
Taylor C, Kline CE, Rice TB, Duan C, Newman AB, Barinas-Mitchell E (2021) Snoring severity is associated with carotid vascular remodeling in young adults with overweight and obesity. Sleep Health 7(2):161–167
UK Biobank (2022) UK Biobank - UK Biobank. Available at: https://www.ukbiobank.ac.uk/. Accessed 3 Jan 2022
UK Census (2011) Census data - Office for National Statistics. Available at: https://www.ons.gov.uk/census/2011census/2011censusdata. Accessed 24 Aug 2022
Van Meter EM, Lawson AB, Colabianchi N, Nichols M, Hibbert J, Porter DE, Liese AD (2010) An evaluation of edge effects in nutritional accessibility and availability measures: a simulation study. Int J Health Geogr 9:1–12
Vatcheva KP, Lee M, McCormick JB, Rahbar MH (2016) Multicollinearity in regression analyses conducted in epidemiologic studies. Epidemiology 6(2). https://doi.org/10.4172/2161-1165.1000227
Walker RE, Block J, Kawachi I (2014) The spatial accessibility of fast food restaurants and convenience stores in relation to neighborhood schools. Appl Spat Anal Policy 7(2):169–182
Walker BB, Shashank A, Gasevic D, Schuurman N, Poirier P, Teo K, Rangarajan S, Yusuf S, Lear SA (2020) The local food environment and obesity: evidence from three cities. Obesity 28(1):40–45
Watson DS, Krutzinna J, Bruce IN, Griffiths CE, McInnes IB, Barnes MR, Floridi L (2019) Clinical applications of machine learning algorithms: beyond the black box. BMJ 364:l886. https://doi.org/10.1136/bmj.l886
Weier J, Herring D (2000) Measuring vegetation (NDVI & EVI). NASA Earth Observatory, Washington DC
Wen TH, Chen DR, Tsai MJ (2010) Identifying geographical variations in poverty-obesity relationships: empirical evidence from Taiwan. Geospatial health 4(2):257–265
West BT, Beer L, Gremel GW, Weiser J, Johnson CH, Garg S, Skarbinski J (2015) Weighted multilevel models: a case study. Am J Public Health 105(11):2214–2215
WHO (2018) World Health Organization. Available at: https://www.who.int/health-topics/obesity#tab=tab_1. Accessed 26 May 2021
Yasmin F, Morency C, Roorda MJ (2015) Assessment of spatial transferability of an activity-based model, TASHA. Transp Res Part A Policy Pract 78:200–213

Acknowledgements

This research has been conducted using the UK Biobank Resource under application number 58421.

Funding

This work was supported by a South West Doctoral Training Partnership (SWDTP) ESRC Studentship.

The authors were compliant with the ethical standards.

Author information

Authors and Affiliations

School of Geographical Sciences, University of Bristol, Bristol, UK
Yunqi Zhou, Richard Harris & Emmanouil Tranos

Corresponding author

Correspondence to Yunqi Zhou.

Ethics declarations

Ethical Approval

UK Biobank received ethical approval from the North West Multi-centre Research Ethics Committee (MREC). All participants gave written informed consent before enrolment in the study. Direct dissemination of the results to participants is not applicable. This study was performed under UK Biobank application number 58421.

Informed Consent

The authors have given informed consent to publish this article.

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 1.39 MB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhou, Y., Harris, R. & Tranos, E. Where Is the Geography? A Study of the Predictors of Obesity Using UK Biobank Data and Machine Learning. J geovis spat anal 7, 17 (2023). https://doi.org/10.1007/s41651-023-00142-4

Accepted: 08 May 2023
Published: 12 June 2023
DOI: https://doi.org/10.1007/s41651-023-00142-4
