1 Introduction

Heart disease and diabetes are two major public health issues in the United States. According to official CDC data, about 647,000 Americans die of heart disease each year, which is nearly one-quarter of all deaths in the country (Heron 2019). According to the CDC, in 2014 and 2015 the United States spent $219 billion on healthcare services, medicine, and lost productivity due to death. Medical conditions and lifestyle choices affect the likelihood of developing heart disease. A review of the literature indicates a strong association between type-2 diabetes and coronary heart disease mortality (PAN et al. 1986). About 34.2 million people in the United States have diabetes, approximately 10.5% of the total population (Centers for Disease Control and Prevention, 2022).

Several leading factors can contribute to heart disease, such as alcohol use (Vogel, 2019), smoking (Barengo et al. 2017), household income (Xiang et al. 2018), urbanization rate (Allender et al. 2008), blood pressure and depression (Brunström and Carlberg 2018, Li et al. 2020), as well as obesity and physical inactivity (Arsenault, 2010). Many of these factors, such as urbanization rate (Fei et al. 2016), obesity, physical inactivity (Eaton and Eaton 2017), household income (Maty et al. 2005), smoking (Akter et al. 2017) and drinking habits (Holst et al. 2017), can also lead to diabetes. However, existing research does not consider the spatiotemporal variation of the risk factors associated with heart disease or type-2 diabetes. For instance, does one geographic area have more heart disease and diabetes patients than other areas? If so, why? What are the leading environmental and demographic factors that cause heart disease and diabetes within a specific region? What kinds of public and health care services are needed in that region in order to reduce the disease risk?

The Geographically Weighted Regression (GWR) model has been used to model non-stationary spatial variation and has been applied in various domains to support spatial decision-making (Jiang et al., 2021). It explores spatial changes in the object of study, and the factors driving them, at a given scale by fitting a local regression equation at each point in the study area. GWR has been used in health and disease-related research to understand how determinants vary across space, for example in targeting the spatial context of obesity determinants (Oshan et al. 2020) and modeling the transmission of hand, foot, and mouth disease (Hu et al. 2020). Some studies have also used GWR to model diabetes and heart disease. For example, Siordia et al. (2012) used GWR to estimate how poverty affects diabetes prevalence. Ford and Highfield et al. (2016) applied GWR to measure the association between social deprivation and cardiovascular disease mortality.

While much of the literature focuses on applications of GWR to diabetes and heart disease (Hu et al., 2020), none has studied the spatiotemporal variation of the leading factors that may cause these diseases. Furthermore, the traditional GWR method assumes that all of the modeled processes operate at the same spatial scale, constraining the local relationships within each model to vary at that scale (Yang 2014). Bandwidth in GWR represents the spatial range within which data points can affect each other, based on the distance between them. This range can differ among independent variables, because the spatial relationship between the dependent variable and each independent variable may differ. GWR, however, is limited to fitting a single optimal bandwidth that reflects an “average” of the best bandwidths for the individual processes (or variables). Multiscale Geographically Weighted Regression (MGWR) overcomes this limitation by allowing covariate-specific bandwidths to be optimized (Fotheringham et al. 2017).

In this research, we developed an MGWR-based approach to analyze the non-stationary spatiotemporal pattern of heart disease and diabetes using multiple environmental and demographic factors such as urbanization rate, household income, obesity percentage, blood pressure level, medical cost, physical inactivity percentage, smoking, and alcohol consumption habit in each county of the United States. The alcohol consumption habit data describe the proportion of adults (21 years and older) who have had, on average, more than one (for women) or two (for men) alcoholic drinks per day during the previous month. By using data from two years, the trend of the major contributing factors can be determined. Showing the correlation between diabetes and heart disease risk factors will aid Federal and State agencies in developing educational campaigns, deciding where to allocate resources, and predicting future heart disease and diabetes rates based on adjusted risk factor levels. The rest of the article is organized as follows: section two introduces the background, theories, and data processing methods. Section three illustrates the methods of applying MGWR to the datasets in observing spatial variations of heart disease and diabetes patterns. Section four presents the research methods and results. Finally, sections five and six discuss the results and draw conclusions.

2 Methodology and theory

2.1 Study area and data processing

Our study area covers eight U.S. Census Divisions: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, and Mountain. We selected all the regions in those eight divisions. For the Pacific Division, we selected only California, Oregon, and Washington. The medical datasets, such as heart disease mortality rate and the number of diabetes patients, were collected from the Centers for Disease Control and Prevention (CDC) (2022). We also included datasets that describe people’s healthy living habits, such as health care costs, moderate drinking and smoking habits, and the frequency of physical activity (IHME, 2014). Table 1 describes the datasets used in this project. In addition to the health-related datasets, we considered the urbanization rate as an important indicator (Kumar et al. 2006). Here, we used 300 m resolution classified land cover data from the European Space Agency Climate Change Initiative (ESA-CCI-LC). We calculated the percentage of urban area in each county as an index of urbanization rate. Figs. 1 and 2 visualize the variables and the study area.

Table 1 Descriptions of the datasets
Fig. 1 Visualization of the study area and variables (year 2005)

Fig. 2 Visualization of the study area and variables (year 2015)

In this study, we included datasets collected in 2005 and 2015 to observe how the spatial variation of these two diseases changed between these two time points. The 10-year study period allows us to capture significant changes over time in the spatial pattern of the prevalence of these two diseases and the corresponding risk factors across the country. We used the following variables: the median household income as the income level variable; the proportion of adults aged 18 and older who smoke cigarettes as the smoking rate variable; the proportion of adults aged 21 and older who have had, on average, more than one (for women) or two (for men) alcoholic drinks per day as the moderate alcohol consumption rate variable; the age-adjusted percentage of diagnosed diabetes patients as the diabetes variable; the age-adjusted percentage of leisure-time physical inactivity as the physical inactivity variable; the age-adjusted percentage of obesity as the obesity variable; and the percentage of urban land cover in each county as the urbanization variable.

A comparison of Figs. 1 and 2 shows that most variables increased across the country between 2005 and 2015, with the exception of urbanization, because the United States had largely completed its urbanization process by the 1980s and the rate of urban growth slowed thereafter. The smoking and obesity rates experienced the largest increases among all variables. For each factor, all data values were normalized to the range 0 to 1 using Eq. 1 so that the data are comparable within each attribute.

$${x}_{norm}=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$
(1)
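As a minimal sketch of Eq. (1), assuming each variable is held as a plain list of county values: the function name and the sample values are hypothetical and are not taken from the study's data, and mapping a constant column to 0 is an assumption the text does not address.

```python
def min_max_normalize(values):
    """Rescale a sequence of numbers to the [0, 1] range following Eq. (1)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical county-level obesity percentages (illustrative only).
obesity_pct = [22.0, 31.5, 27.0, 35.0]
normalized = min_max_normalize(obesity_pct)
```

Because the minimum and maximum are taken per variable, the rescaled values are comparable within an attribute but not automatically across years unless both years share the same minimum and maximum.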

2.2 Ordinary least squares (OLS) and multiscale geographically weighted regression (MGWR)

In the first step, we applied an ordinary least squares (OLS) model to the datasets to observe the global distribution pattern of both diseases. The OLS equation is given below:

$${Heart}_{i}, {Diabetes}_{i}={\alpha }_{0}+{\alpha }_{1}{OBE}_{i}+{\alpha }_{2}{SMO}_{i}+{\alpha }_{3}{DR}_{i}+{\alpha }_{4}{INAC}_{i}+{\alpha }_{5}{Income}_{i}+{\alpha }_{6}{Urban}_{i}+{\varepsilon }_{i}$$
(2)
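The global model in Eq. (2) can be sketched with NumPy's least-squares solver. This is only an illustration under stated assumptions: the data are synthetic, the column order (OBE, SMO, DR, INAC, Income, Urban) is assumed, and `fit_ols` is a hypothetical helper rather than the software used in the study.

```python
import numpy as np


def fit_ols(X, y):
    """Fit y = alpha_0 + X @ alpha + eps; return the coefficients and R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


# Synthetic, normalized predictors standing in for
# OBE, SMO, DR, INAC, Income, Urban (assumed column order).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 6))
true_alpha = np.array([0.2, 0.5, 0.3, -0.1, 0.4, -0.3, 0.05])
y = true_alpha[0] + X @ true_alpha[1:] + rng.normal(0, 0.01, size=200)

coef, r2 = fit_ols(X, y)
```

A single coefficient vector is returned for the whole study area, which is exactly the 'global' averaging that the local models in this section are meant to relax.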

In this article, several MGWR models were developed to observe non-stationary correlation between each type of disease (heart disease and diabetes) and various environmental and demographic risk factors. Table 2 illustrates the description of dependent and independent variables. In this case, the dependent variables are heart disease mortality rate and the number of diabetes patients; the independent variables include various environmental and sociodemographic indicators. In this study, we also looked at the temporal variation of the disease pattern by including two years of observations, 2005 and 2015.

Table 2 Description of dependent and independent variables

In this study, we employed Multiscale Geographically Weighted Regression (MGWR) to analyze how the relationship between heart disease and diabetes, as well as the relationships between each of these diseases and its risk factors, vary spatially.

Geographically Weighted Regression (GWR) has often been used to explore spatially varying relationships. It is a spatial regression model that can be used to model spatial variations and non-stationary relationships between a dependent variable and a set of independent variables. Traditional statistical methods, such as correlation analysis and ordinary least squares (OLS) regression, can hide the impact of local variations (Bacha 2003, Batisani and Yarnal 2009, Geri et al. 2010) because they produce ‘average’ or ‘global’ parameters to estimate the spatial relationships (Ali et al. 2007). In light of this, GWR was developed to build regression models that explore how one dependent variable changes in response to one or more independent variables at the local scale (McMillen 1996, Fotheringham et al. 1998, Leung et al. 2000, Yu and Wu 2004, Deller and Lledo 2007, Waller et al. 2007). The outcomes of the GWR model depend on the observations in close proximity to the subject point, so they reveal the relationships within the neighborhood (Fotheringham et al. 2001, Foody 2004, Bickford and Laffan 2006). Fotheringham et al. (1998, 2001) presented a general form of the basic GWR model:

$${Y}_{i}={\alpha }_{0}\left(i\right)+{\sum }_{k=1}^{p} {\alpha }_{k}\left(i\right){X}_{ik}+{\varepsilon }_{i}, i=1,\dots ,n$$
(3)

It allows the parameters in the model to vary by location i. The GWR estimator is:

$${\alpha }^{\prime}\left(i\right)={\left({X}^{T}W\left(i\right)X\right)}^{-1}{X}^{T}W\left(i\right)Y$$
(4)
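Equation (4) can be sketched numerically as one weighted least-squares solve per location, with W(i) stored as a vector of diagonal weights. The kernel and the data here are assumptions: the text does not state which kernel was used, so a fixed Gaussian distance-decay kernel stands in for it, and `gaussian_weights`, `gwr_fit`, and the synthetic coordinates are hypothetical.

```python
import numpy as np


def gaussian_weights(coords, i, bandwidth):
    """Diagonal of W(i): weights decay with distance from location i.
    The Gaussian kernel is an assumption, not taken from the source."""
    distances = np.linalg.norm(coords - coords[i], axis=1)
    return np.exp(-0.5 * (distances / bandwidth) ** 2)


def gwr_fit(coords, X, y, bandwidth):
    """Solve Eq. (4) at every location; returns an (n, p + 1) parameter array."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # intercept plus predictors
    params = np.empty((n, design.shape[1]))
    for i in range(n):
        w = gaussian_weights(coords, i, bandwidth)
        xtw = design.T * w  # X^T W(i), with W(i) diagonal
        params[i] = np.linalg.solve(xtw @ design, xtw @ y)
    return params


# Synthetic data (hypothetical): the slope increases from west to east.
rng = np.random.default_rng(42)
coords = rng.uniform(0, 1, size=(300, 2))
x = rng.uniform(0, 1, size=300)
true_slope = 1.0 + 2.0 * coords[:, 0]
y = 0.5 + true_slope * x + rng.normal(0, 0.05, size=300)

local_params = gwr_fit(coords, x, y, bandwidth=0.15)
global_params = gwr_fit(coords, x, y, bandwidth=1e6)  # effectively equal weights
```

With the very large bandwidth every observation receives almost the same weight, so each local fit collapses to the global OLS solution. The small bandwidth lets the estimated slope follow the west-to-east gradient.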

W(i) is a matrix of weights specific to location i (longitude, latitude) such that observations nearer to i are given greater weight than observations further away. The basic principle of GWR is that the data is ‘borrowed’ from nearby locations, weighted by the proximity of the location from which the data is being borrowed to the location for which the local regression is calibrated. This allows models to be calibrated specifically to location i. To minimize bias in the results, data from nearby locations is weighted more heavily than from more distant ones. In order to evaluate the performance of the GWR model, the following concepts were considered:

R2: A measure of goodness of fit of the model with values varying from 0.0 to 1.0, with higher values being preferable. It may be interpreted as the proportion of dependent variable variance accounted for by the regression model (Fotheringham, Charlton et al. 1998).

AIC: The optimal bandwidth was obtained in each GWR calibration through an iterative process to minimize the corrected Akaike Information Criterion (AICc) value calculated as:

$${AIC}_{c}=2n{log}_{e}\left(\widehat{\sigma }\right)+n{log}_{e}\left(2\pi \right)+2n\left[\frac{n+tr\left(S\right)}{n-2-tr\left(S\right)}\right]$$
(5)

where n denotes the sample size, \(\widehat{\sigma }\) is the estimated standard deviation of the error term, and tr(S) denotes the trace of the hat matrix S.

Spatial autocorrelation: Global spatial autocorrelation describes the spatial characteristics of attribute values throughout the region. Spatial autocorrelation in the regression residuals is often interpreted to mean that (1) an important independent variable is missing from the regression, or (2) an underlying spatial process that induces spatial autocorrelation in some of the variables is missing from the model. Geographically Weighted Regression (GWR) can be used when there is spatial autocorrelation in the regression residuals, or when the regression coefficients might change from one location to another (i.e., the regression coefficients are not stationary). It is therefore critical to carry out a spatial autocorrelation analysis before applying GWR methods.

Moran's Index is the most commonly used method to measure global spatial autocorrelation and to quantify the similarity of outcome variables between regions defined as spatially related (Fu et al. 2014). It can be applied to detect departures from spatial randomness. A departure from spatial randomness indicates spatial patterns such as clustering or trends across space. The value of Moran's Index ranges from -1 to 1. A value of zero indicates no clustering; a positive value indicates positive spatial autocorrelation, which means that adjacent locations have similar, clustered values; a negative value indicates negative spatial autocorrelation, which means that adjacent locations have different values (Pfeiffer et al. 2008). According to Lee and Wong (2001), Moran's Index can be obtained using Eq. (6).

$$I=\frac{n}{{S}_{0}}\frac{{\sum }_{i=1}^{n} {\sum }_{j=1}^{n} {W}_{ij}({x}_{i}-\overline{x})({x}_{j}-\overline{x})}{{\sum }_{i=1}^{n} {({x}_{i}-\overline{x})}^{2}}$$
(6)

where xi is the input data of pixel i, xj is the input data of pixel j, \(\overline{x}\) is the average value, n is the number of observations, Wij is the spatial weight between pixel i and pixel j, and S0 is the sum of all spatial weights. If pixel i and pixel j are adjacent, the corresponding element Wij of the weight matrix is 1; otherwise it is 0.
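Equation (6) can be checked on a toy example. The sketch below assumes a binary adjacency matrix for four units arranged in a row. The layout, the values, and the function name `morans_i` are hypothetical, and no significance test is included.

```python
import numpy as np


def morans_i(x, W):
    """Global Moran's I for values x and spatial weight matrix W (Eq. 6)."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()   # deviations from the mean
    s0 = W.sum()       # sum of all spatial weights
    return (x.size / s0) * (z @ W @ z) / (z @ z)


# Four units in a row; each unit is adjacent only to its immediate neighbours.
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

trend = morans_i([1, 2, 3, 4], W)        # similar neighbours: positive I
alternating = morans_i([1, 4, 1, 4], W)  # dissimilar neighbours: negative I
```

The smoothly increasing values give a positive index and the alternating values give a negative one, matching the interpretation of the sign of Moran's Index.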

The hypothesis test for Moran's I is carried out as follows:

H0: there is no spatial autocorrelation.

H1: there is positive spatial autocorrelation (Moran's I is positive), or

H1: there is negative spatial autocorrelation (Moran's I is negative).

Although GWR captures any spatial heterogeneity in relationships, it does so under the assumption that all such relationships vary at the same spatial scale across all covariates. MGWR is a significant improvement over GWR because it relaxes the “same spatial scale” assumption and allows covariate‐specific bandwidths to be optimized. It is formulated as (Fotheringham et al. 2017):

$${y}_{i}={\sum }_{j=0}^{m} {\beta }_{bwj}({u}_{i},{v}_{i}){x}_{ij}+{\varepsilon }_{i}$$
(7)

where bwj in \({\beta }_{bwj}\) indicates the bandwidth used for calibration of the jth conditional relationship. MGWR thus allows different processes to operate at different spatial scales by deriving separate bandwidths for the conditional relationships between the response variable and the different predictor variables. MGWR is calibrated using a back‐fitting algorithm as described in Fotheringham et al. (2017). The back‐fitting process is initialized with GWR parameter estimates. Starting from these initial values, the calibration proceeds iteratively, and in each iteration all local parameter estimates and optimal bandwidths are re-evaluated. Iteration terminates when the difference between the parameter estimates from successive iterations falls below a specified threshold (we selected 1e‐5 in this study) (Fotheringham, et al. 2017, Oshan et al. 2020).
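The back-fitting idea can be illustrated with a deliberately simplified sketch. It assumes fixed, user-supplied bandwidths for each covariate (the bandwidth search in each iteration is omitted), a Gaussian kernel, and a maximum-absolute-change stopping rule. All function names and the synthetic data are hypothetical, and this is not the implementation of Fotheringham et al. (2017).

```python
import numpy as np


def kernel_matrix(coords, bandwidth):
    """Gaussian weights between all pairs of locations (assumed kernel)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return np.exp(-0.5 * (d / bandwidth) ** 2)


def gwr_estimates(X, y, K):
    """Single-bandwidth GWR estimates, used only to initialise back-fitting."""
    n, k = X.shape
    betas = np.empty((n, k))
    for i in range(n):
        xtw = X.T * K[i]
        betas[i] = np.linalg.solve(xtw @ X, xtw @ y)
    return betas


def backfit_mgwr(coords, X, y, bandwidths, init_bandwidth, tol=1e-5, max_iter=200):
    """Back-fit one locally weighted term per covariate (X includes the intercept)."""
    n, k = X.shape
    kernels = [kernel_matrix(coords, bw) for bw in bandwidths]
    betas = gwr_estimates(X, y, kernel_matrix(coords, init_bandwidth))
    for _ in range(max_iter):
        previous = betas.copy()
        for j in range(k):
            xj = X[:, j]
            # Partial residual: remove the fitted contribution of all other terms.
            partial = y - (X * betas).sum(axis=1) + xj * betas[:, j]
            # Locally weighted regression of the partial residual on x_j alone.
            betas[:, j] = (kernels[j] @ (xj * partial)) / (kernels[j] @ (xj * xj))
        if np.max(np.abs(betas - previous)) < tol:  # assumed stopping rule
            break
    return betas


# Synthetic data (hypothetical): constant intercept, slope varying west to east.
rng = np.random.default_rng(7)
n = 250
coords = rng.uniform(0, 1, size=(n, 2))
x1 = rng.normal(0, 1, size=n)
X = np.column_stack([np.ones(n), x1])
true_slope = 1.0 + 2.0 * coords[:, 0]
y = 1.0 + true_slope * x1 + rng.normal(0, 0.05, size=n)

betas = backfit_mgwr(coords, X, y, bandwidths=[10.0, 0.15], init_bandwidth=0.15)
```

Giving the intercept a large bandwidth and the slope a small one lets the first term behave almost globally while the second varies locally, which is the multiscale behaviour that a single-bandwidth GWR cannot express.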

The selection of the bandwidth is a trade-off, because the bandwidth sets the spatial range within which data points can affect each other. If the bandwidth is too large, the model cannot reflect the spatial non-stationarity of the relationship between the dependent variable and the independent variables, which causes large bias in the local estimates (i.e., if the bandwidth equals N, the number of observations in the dataset, the model has no spatial heterogeneity and is the same as the OLS model). Conversely, too small a bandwidth leads to large variance in the local estimates. In this study, the associations between health conditions and the factors that influence them are expected to be spatially heterogeneous because of differences in lifestyle, diet, income, etc. across the US.

3 Results

3.1 Results of OLS and spatial autocorrelation

The detailed information of the dependent and independent variables can be found in Table 2. The results of the OLS model for the years 2005 and 2015 are illustrated in Tables 3 and 4.

Table 3 The statistical analysis results of ordinary least squares regression model using datasets collected in 2005
Table 4 The statistical analysis results of ordinary least square estimation using the datasets collected in 2015

The results of calibrating the model for each disease are shown in Tables 3 and 4, which include normalized parameter estimates, significance test results, and spatial autocorrelation statistics. The p values are low, which indicates that the variables are significantly correlated with both diseases, except for the urbanization variable (see Table 4). According to Table 4, the R2 value for diabetes reaches 0.776, which is relatively high and indicates that the model performs well for the diabetes variable in 2015. For heart disease, the R2 value is relatively low (around 0.5) for the 2015 data. For the 2005 data, R2 is low for both diseases. The spatial autocorrelation results also show that there is a less than 1% likelihood that the clustered pattern could be the result of random chance, which indicates spatial heterogeneity among the independent variables. Therefore, MGWR should be used for more advanced analysis.

3.2 Results of spatial coefficient analysis using MGWR

Because of the limitations shown in the OLS analysis, we applied MGWR with the same dependent and independent variables to observe the spatial variations of heart disease and diabetes for the 2005 and 2015 data. The statistical results of the MGWR model are summarized in Tables 5 and 6. Figures 3 and 4 illustrate the spatial distribution of R2 for both years. According to Fig. 3, most areas have high R2 values, except for small areas in New York, Ohio, Texas, and Indiana, which indicates that the model explains a high share of the variance in most of the country. For the 2015 data, the results are similar, except that Florida and Michigan have low R2 values. The spatial interpretation of the coefficient analysis results is given later in this section. When visualizing the local coefficients of selected variables spatially, we excluded counties (colored in grey) where the variable is locally not statistically significant (p > 0.1).

Table 5 Statistical analysis results of MGWR estimation using the datasets collected in 2005
Table 6 Statistical analysis results of MGWR estimation using the datasets collected in year 2015
Fig. 3 Spatial distribution of local R2 for the MGWR model’s results using the year 2005 datasets. Left: Diabetes; right: Heart disease; X/Y axis: projected coordinate system. Note: counties in grey represent areas with no data; the gradient color in the legend shows the local R2 of the MGWR model in ascending order

Fig. 4 Spatial distribution of local R2 for the MGWR model’s results using the year 2015 datasets. Left: Diabetes; right: Heart disease; X/Y axis: projected coordinate system. Note: the gradient color in the legend shows the local R2 of the MGWR model in ascending order

According to Tables 5 and 6, MGWR performed better than the OLS model: the MGWR models produced markedly higher R2 values and lower AICc values than the corresponding OLS models.

The spatial coefficient analysis results for heart disease in 2005 and 2015 are shown in Figs. 5 and 6. For heart disease, the obesity variable is statistically significant across the entire country in 2005, but not in 2015. Drinking habit was negatively correlated with heart disease in the Pacific coast area in 2005. This impact had increased considerably by 2015, especially in most of the Southeast and in the areas where the Southeast, Midwest, and Southwest of the U.S. meet. According to Figs. 5 and 6, smoking was associated with a markedly higher risk of heart disease along the Gulf, Pacific, and East coasts of the U.S. in 2005. However, the situation improved across the entire country between 2005 and 2015, as the coefficient values were lower in 2015 than in 2005. Lack of physical exercise was the leading factor for heart disease in East Coast areas in 2005. The correlation became more evident in 2015, especially in states such as Virginia, North Carolina, New Mexico, and Colorado. Furthermore, income is negatively correlated with heart disease in many parts of the Midwest and Southeast of the U.S. and in California. Finally, East Coast areas show a positive correlation between urban cover and heart disease in 2005, whereas the urban cover variable was not statistically significant for the 2015 data.

Fig. 5 Illustration of spatial distributions of the coefficients for heart disease analysis (2005 data). X/Y axis: projected coordinate system. Note: counties in grey are those where the variable is not statistically significant (p > 0.1); the gradient color in the legend shows the value of the coefficients for each variable in ascending order

Fig. 6 Illustration of spatial distributions of the coefficients for heart disease analysis (2015 data). X/Y axis: projected coordinate system. Note: counties in grey are those where the variable is not statistically significant (p > 0.1); the gradient color in the legend shows the value of the coefficients for each variable in ascending order

The 2005 and 2015 datasets produced similar spatial patterns for the diabetes analysis (see Figs. 7 and 8). For example, obesity has been a leading risk factor for diabetes in the east, south-central, and south Atlantic regions of the U.S. On the other hand, smoking and alcohol consumption were the primary concerns in the northern part of the U.S., such as North and South Dakota, Montana, and Wyoming, in 2005. In 2015, the alcohol consumption level had improved, but the smoking level remained the same in those regions and showed a more significant impact on diabetes in the neighboring regions. During the ten-year period, lack of physical exercise became a significant risk factor for diabetes in the Northeast and West of the U.S.

Fig. 7 Illustration of spatial distributions of the coefficients for diabetes analysis (2005 data). X/Y axis: projected coordinate system. Note: counties in grey are those where the variable is not statistically significant (p > 0.1); the gradient color in the legend shows the value of the coefficients for each variable in ascending order

Fig. 8 Illustration of spatial distributions of the coefficients for diabetes analysis (2015 data). X/Y axis: projected coordinate system. Note: counties in grey are those where the variable is not statistically significant (p > 0.1); the gradient color in the legend shows the value of the coefficients for each variable in ascending order

Furthermore, Tables 5 and 6 show that the standard deviations of the variable coefficients changed from 2005 to 2015, but the trends differ between heart disease and diabetes. For heart disease, the standard deviations of the lifestyle variables (drinking, smoking, obesity, and inactivity) increased considerably over the ten years, while those of the income and urban area variables decreased. For diabetes, however, only the standard deviations of the inactivity and smoking variables increased.

4 Discussion

The impact of the variables on both diseases exhibits strong spatial heterogeneity across the country. The direction of most variables is in line with our hypothesis, except for the moderate alcohol consumption variable. Moderate alcohol consumption has a negative coefficient for both diseases, which is a controversial finding. However, in our dataset, moderate drinking is defined as “among all adults aged 21 and older, the proportion who have had, on average, more than one (for women) or two (for men) alcoholic drinks per day during the previous month.” This is the same criterion for moderate drinking as defined by the Dietary Guidelines for Americans: 2015–2020. A previous study showed a significant decreasing trend in diabetes risk as alcohol consumption increased, with the risk of diabetes especially low at an alcohol consumption of 8–14 drinks/week (He et al. 2019). Similarly, a cohort study in China by Zhang et al. (2017) found that men who consumed 20.01–40 g of ethanol per occasion, fewer than 5 times per week, had a 24% lower risk of coronary heart disease (CHD) incidence than non-drinkers.

Furthermore, different variables may share a similar spatial pattern of coefficients for the same disease across the ten years. For instance, the drinking and smoking variables have high coefficient values concentrated in northern states for diabetes (e.g., in Montana, North Dakota, and Wyoming), while obesity has high coefficients in southern states, such as Florida and Louisiana. These patterns may result from the low population density, shared culture, and shared climate of these states. The relationship of the physical inactivity variable to both diseases also showed similar spatiotemporal patterns from 2005 to 2015. For diabetes, the low coefficients of physical inactivity are concentrated in West North Central states in 2005, whereas in 2015 they are concentrated in southern states such as Texas, Louisiana, and Florida. The same trend can be found in the results for heart disease, which means that the influence of physical inactivity on both diseases is very similar. Another important finding is that urban land cover is not significant in any area for the heart disease death rate, which was contrary to our expectation.

By comparing the analysis results for 2005 and 2015, we captured the temporal trend of the spatial distribution of the coefficients for both diseases. We found that the inactivity variable played an increasingly important role from 2005 to 2015 for both diseases. On the other hand, the coefficient of obesity decreased from 2005 to 2015 for both diseases. However, this does not mean that obesity had a lower effect on both diseases in 2015 than in 2005, because the exact values of the coefficients are not comparable between years. In other words, the lower coefficient value only means that obesity had a lower impact relative to other risk factors (e.g., inactivity) in 2015. This study also shows the spatial variations of different independent variables for both diseases across the country. Notably, the effect of income on the prevalence of diabetes became more uniform across the country between 2005 and 2015. The absolute values of the income coefficients also decreased across the country, indicating that the impact of income on the prevalence of diabetes decreased overall. These changes might be caused by advances in biotechnology, rising living standards, and the spread of public medical insurance.

5 Conclusion and limitations

Understanding the leading factors of heart disease and diabetes is key to addressing many spatial decision problems related to disease control and public health management. Geographic Information Systems have played an important role in supporting spatial decision-making in various application domains (Zhang et al., 2017; Zhang et al., 2021A; Zhang et al., 2021B; Zhang et al., 2014). This article introduced an MGWR model into the risk factor analysis of heart disease and diabetes. We used urbanization rate, obesity, and healthy living habits such as moderate drinking and smoking habits and the frequency of physical activity as independent variables to evaluate their impacts on heart disease and diabetes. The MGWR model was applied at the county level to eight census divisions of the United States: New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, and Mountain. The analysis results illustrate the spatial variations of different risk factors for these two diseases. The findings can inform the development of an intelligent decision support system for Federal and State agencies to facilitate the allocation of resources to combat these diseases and to predict future heart disease and diabetes rates based on adjusted risk factor levels. The findings can also inform public education campaigns.

The major limitation of this study is the lack of measurements for other types of risk factors for type-2 diabetes and heart disease, such as high blood pressure and high cholesterol. Although MGWR can capture spatial heterogeneity, area-based characteristics are not a good representation of individual characteristics, as each county is affected by the surrounding counties. Nevertheless, the MGWR model can serve as a good solution for reviewing the spatial distribution and historical trend of the coefficients of risk factors. In the future, a predictive model can be developed to forecast future trends based on the MGWR results.