1 Introduction

The service sector, generally referred to as the tertiary sector of the economy, includes the provision of services to other businesses, including final consumers. In the European Union (EU), services account for approximately 70% of the Union's gross domestic product (GDP) and employment (Eurostat 2022). In some countries, such as the United States, it could be as high as 80% (World Bank Group 2022). Some of the most common areas of the service sector are tourism (e.g., accommodation, and travel agents), hospitality (e.g., food services), education, real estate, transport, and banking. The wide range of activities available in this sector makes it extremely difficult to assess the health status of their workforce. For example, the hospitality industry offers employment opportunities to minority groups, such as immigrants, women, or youth with low educational attainment. In many cases, these activities are characterized as labor-intensive and are related to long working hours and a high workload (Rydzik and Anitha 2020; Xu et al. 2020).

The impact of COVID-19 has exacerbated this situation (Chang et al. 2021). Globally, projections of investment in the health of the workforce appear inadequate, which undermines the future sustainability of the workforce and health systems (World Health Organization 2016). In practice, the tasks of men and women are often different, which creates health risks at work for each gender. Despite the advances carried out by medical geography and spatial epidemiology to predict spatial patterns of disease incidence, the abundance and accuracy of occupational health risk maps are still very limited (Gerassis et al. 2021). The introduction of geostatistical modeling to support occupational health decision-making is relatively new. Notably, Dos Santos et al. (2020) used a geostatistical wave model to determine healthy workspaces for rural workers exposed to tractor noise. Other application examples include the use of Bayesian geostatistical binary regression to model the 2-week disease prevalence rate among workers (Wen et al. 2021) or the geostatistical analysis of mental health in construction workers (Yuvaraj and Thulasimala 2022). While significant progress is being achieved, it remains a challenge how best to combine big data from occupational health with data from other domains to examine the relationships between workers and their environments. This is partly due to the multiple variables to be represented without a holistic approach to unify the field of the problem, and even more, to discover these differentiating variables.

Medical geography is cross-disciplinary (Jerrett et al. 2010). In practice, medicine has been integrated into a range of disciplines, including sociology, economy, history, ecology, biology, anthropology, and political science. The geography of health illustrates the importance of medical geography by identifying geographic variations within health and healthcare systems. Health geography has evolved from medical geography in recent decades, and the construction process continues (Moon 2020). In this study, medical and health geography is developed in occupational health to support public health policy and planning in the service sector. Here, the impact on the health of climate change (Orlov et al. 2020) is understood to be a key consideration.

Climate change is already influencing the intensity, severity, and frequency of heat waves (Perkins-Kirkpatrick and Lewis 2020), which points to increased cardiovascular and respiratory diseases and, subsequently, mortality, particularly during extreme heat events (Ho et al. 2015). Rising temperatures are expected to open the door to a growing number of diseases in the coming years whose effects could be worse at work. This could have a major impact on service sector performance, as related surveys reveal effects on morbidity, reduced productivity of individuals, and increased sick leave (Ebi et al. 2021; Wondmagegn et al. 2021). At present, an important question remains unanswered: what is the real impact of climate variables on the health of the service sector? To clarify, annual mean temperature and annual rainfall were introduced as covariates, enabling insight into the effect of location and environmental exposure on the workers’ health.

The concrete goal of this study is to introduce a methodological decision-to-visualization process to understand and measure the occupational health risk factors, together with local climatic conditions, leading to unhealthy workers that may be unfit to perform their duties. This is achieved by taking advantage of the latest advances in probabilistic models of structural equations (PSEMs) using Bayesian modeling and geostatistics. The occupational health data were obtained from the annual workers’ medical checks, socioeconomic covariates were obtained from the Spanish national agencies, and bioclimatic variables from the WorldClim database (Fick et al. 2017; Panagos et al. 2017). This survey involved research of big data using machine learning techniques, interpreted within a framework of spatial organization, aimed at supporting health policy development and targeting occupational strategies of disease monitoring at work. By putting theory into practice, a progressive learning approach is proposed to build a methodological process that leads to informed decision-making. Bayesian networks were selected as a proxy to manage the large number of variables associated with this type of complex problem. Specifically, Bayesian machine learning was used to develop a PSEM where the health status is the target node of the model. The use of a hierarchical Bayesian structure allows one to obtain a compact representation of the probabilistic dependencies between the multiple variables used to characterize the health status (Gao et al. 2022; Njah et al. 2021; Letta et al. 2022). Importantly, the Bayesian equation model offers the possibility of inserting latent variables into the structure, facilitating the identification of new relationships between variables and, consequently, reducing the complexity of the problem (Peterson et al. 2020; Keter et al. 2022).

The introduced methodological approach is understood as a unified and renewed conceptualization of a series of prior works. Modeling geospatial uncertainty with Bayesian models, including advanced deep learning techniques, is something already extended in the literature (e.g., Hoffimann et al. 2022; Kirkwood et al. 2022). Structural equation models (SEMs) have been an essential tool for causal analysis in social and behavioral sciences for more than 50 years (Pearl 1998).

Combining both domains leads to the PSEMs recently popularized by Conradi and Jouffe (2015) with the recent scale-up of dedicated software for automated machine learning (AutoML) both in research and industrial applications. Furthermore, mapping is a precise way of simplifying reality (Lahr and Kooistra 2010; Albuquerque et al. 2017); however, a two-dimensional representation may only aggregate and display a limited number of attributes. Consequently, when examining complex scenarios, such as environmental or epidemiological characterization, there is a need to reduce the dimensionality of the problem. Risk maps, which are widely cited in the literature, are highly relevant to the visualization of spatial models, and powerful tools to support policy development within a complex risk assessment framework. Examples of these practices include the distribution of pollutant levels or vulnerability assessment. Geostatistical techniques are based on the theory of regionalized variables (Matheron 1971) whereby variables within a region have spatially structured and random properties (Journel and Huijbregts 1978). Geostatistics is based on an extensive methodological approach and goes beyond the simple development and application of mathematical (probabilistic) models and methods. A key challenge is to analyze the practical problems to be solved and to formalize them in terms of concepts. In anticipating risk, it is imperative to emphasize the appropriateness of the likelihood that future estimated values will exceed the maximum permissible values. The delineation of zones of high and low impact requires the interpolation of the selected covariates to the nodes of a regular grid, making possible the assessment and prediction of prevalent spatial patterns, as guidance to a more sustainable management (Goovaerts 1997).

In that manner, the added value is the possibility to identify and characterize those variables that may have a differentiating impact that is not meaningful from a mathematical point of view (Kiebish et al. 2020; Mohamed et al. 2021). All in all, the results of this research work are expected to be one more contribution towards the medical services of the future, where the patient’s health status will no longer be subject to only a series of traditional medical tests and underlying medical conditions (Awotunde et al. 2021).

The remainder of the manuscript is organized as follows. Section 2 explains the methodology employed to develop the PSEM based on a Bayesian hierarchical structure, as well as the probabilistic induction of latent factors. In addition, a preliminary discussion on data description is currently under consideration. The methodological approach to geostatistical modeling is also discussed in this section. Section 3 shows the results of the PSEM and the spatial representation of the most significant variables identified to characterize the health status of service sector workers, namely, annual mean temperature, annual rainfall, spine health, limb health, cholesterol, age, and sleep quality. Finally, Sect. 4 argues how the use of the PSEM improves the interpretability of complex problems, its combination with geospatial modeling being a key tool for health decision-making.

2 Material and Methods

2.1 Data Characterization

A total of 74,401 occupational health surveillance tests gathered from workers belonging to the service sector in the period between 2012 and 2016 throughout the Spanish territory were used as a medical data source for this study. More specifically, the workers for this research database carried out activities related to administrative and auxiliary services (31,894), financial and insurance services (12,958), education (13,938), and hospitality (15,611). Each clinical examination was conducted in compliance with Spanish occupational health legislation (Ley 31/1995). Data were anonymized and released only after a period which does not interfere with processes performed by the relevant occupational health organizations and hospital services conducting the medical tests and gathering major information about the state of workers’ health. Health status is defined by the major health risk factors underlying the disease, including key physical conditions and health patterns. Importantly, this study goes beyond traditional occupational health surveillance analyses, adding to the medical record of each worker a cross-prediction with climatic and socioeconomic factors as an instrument to better characterize and predict those factors disrupting the health status in the future. The WorldClim database was used for bioclimate data extraction (Fick et al. 2017), and socioeconomic data were obtained from the National Statistical Institute (INEbase). Figure 1 in Sect. 2.2 (Analytical Steps) and Table 1 in Sect. 3 (Results and Discussion) give a comprehensive overview of the medical, bioclimatic, and socioeconomic variables used.

Fig. 1
figure 1

The methodological process was implemented with five levels of analysis. The colors shown in the description of the probabilistic latent factor induction represent each cluster of variables identified and the corresponding number of variables analyzed

Table 1 Results of cluster analysis

2.2 Analytical Steps

Procedurally, this research is conducted through a five-level approach, as illustrated in Fig. 1. First, an unsupervised Bayesian network is constructed to uncover direct probabilistic relationships between the initial 48 manifest variables (level 1). Secondly, latent class modeling is introduced to define representative subgroups for analysis (level 2). Thirdly, the probabilistic structural equation model (PSEM) is developed, in which latent factors and target variable health status provide an overall representation of the field of study (level 3). Later, for each activity group, the health status also acts as a target node for which the relevant patterns, associated with the type of work performed, are ascertained (level 4). These four levels are associated with the development of a Bayesian methodology for which BayesiaLab v.10.2 (www.bayesialab.com) was used. Finally, at level 5, a three-stage geostatistical approach to the computation of distribution maps was conducted, to deepen the network's knowledge from a spatial point of view. Ordinary kriging, followed by a G-group analysis, was used in the four service activity groups under analysis. ArcGIS v-10.2.2 and SpaceStat v-4.1.26 were used for computation.

2.3 Probabilistic Latent Factor Induction

This section introduces levels 1 and 2 of the analysis, which aim to exploit and interpret complex data by capturing direct probabilistic relationships and latent subgroups within the dataset:

  1. 1.

    Level 1: Induction of latent factors (unobserved variables) begins with the development of an exploratory Bayesian network (Pearl 1988). For this purpose, an unsupervised network is built to represent the strongest probabilistic relationships that exist between the manifest or observed variables under analysis. This approach has great potential to create a global framework that reveals hidden trends to healthcare providers. In a formal sense, the joint distribution \(p\left(x\right)\) of a random set of variables \({X=\left({X}_{1}, \dots , {X}_{m}\right)}^{T}\) may be described as follows (Murphy 2012)

    $$ p\left( x \right) = p\left( {x_{1} , \ldots , x_{m} } \right) = \mathop \prod \limits_{i = 1}^{m} p\left( {x_{i} {|}x_{\pi i} } \right), $$
    (1)

    which denotes that \({x=\left({x}_{1}, \dots , {x}_{m}\right)}^{T}\) is a realization of X, and \({x}_{\pi i}\) represents a realization of parent variables \({X}_{\pi i}\) of each \({X}_{i}\).

    At this stage, workers' health status is excluded from the learning process for variable clustering. The reason is that the worker’s health status will be used as a target variable in the conclusion of the PSEM; therefore, it is not relevant to add it to the local learning rather than to analyze its liaison with those relevant clusters of variables.

  2. 2.

    Level 2: The goal is to define the best-fit clusters of variables, for which an agglomerative hierarchical clustering algorithm supported in BayesiaLab v.10.2 is used. Specifically, arc force between manifest variables is used as a probabilistic measure to gradually group highly correlated variables (Conrady and Jouffe 2015). On this basis, once the global Bayesian model is built, because of the machine learning process aimed to discover significant relationships in the problem space search, the Kullback–Leibler (KL) divergence is used as a measure of strength in the relationship between two nodes that are directly connected by an arc. Under a formal approach, allow P and Q to represent the distribution of two common probabilities defined for the same set of variables or X nodes (van Erven and Harremos 2014).

    $$ D_{{{\text{KL}}}} \left( {P||Q} \right) = \sum\limits_{{x \in X}} P \left( x \right)\log _{2} \frac{{P\left( x \right)}}{{Q\left( x \right)}}. $$
    (2)

    Once the variables were grouped, a latent class model was introduced to define representative subgroups for the analysis. For this purpose, a naive architecture was created, in which the factor variable is the parent of the manifest variables. The established latent factors provide a homogeneous, compact, and stable representation of the local joint probability distributions (JPDs) defined by the associated manifest variables. Mathematically, an expression of a latent cluster model is provided by

    $$ P_{Yi} = \mathop \sum \limits_{n = 1}^{N} P_{Xn} P_{{\left( {Yi{|}Xn} \right)}}, $$
    (3)

    where \({P}_{Xn}\) describes the probability that an observation from the set of observed variables \(({Y}_{1},\dots ,{Y}_{n})\) describes the latent variable X, which belongs to a latent class (\(n=1, 2, ..., N\)). On the other hand, \({P}_{\left({Y}_{i}|{X}_{n}\right)}\) is the conditional probability of getting an observation with a response pattern \(Yi = ({y}_{1},..., {y}_{n}), \) while belonging to a class \(n\) of a latent variable X (Yousefi and Tucker 2022). Additionally, to characterize the probabilistic relationships between the latent factor and its manifest variables an expectation–maximization (EM) algorithm is used. This criterion for estimating the maximum likelihood permits us, through the Bayes theorem, to obtain the subsequent probability of an observation belonging to a given class n.

    $$ P_{{\left( {Xt{|}Yi} \right)}} = P_{{\left( {Yi{|}Xn} \right)}} \frac{{P_{{\left( {Xn} \right)}} }}{{P_{{\left( {Yi} \right)}} }}. $$
    (4)

2.4 Probabilistic Structural Equation Modeling (PSEM)

Subsequently, levels 3 and 4 involve the final construction of the Bayesian network using supervised health condition analysis. Once the observed variables have been linked to the underlying factors, the excluded target node (health status) mentioned in Sect. 2.3 is incorporated into the final network structure to finalize the probabilistic structural equation model (PSEM).

Specifically for level 3, the relationships between factors, marginal variables, and the target node were established based on the combination of the presented probabilistic theory and expert criteria. To characterize the target node, the relative weight value is displayed as a fraction of the maximum KL divergence value. Similarly, these weights can be represented as the global contribution percentage of each arc to the target node, quantifying the value between two directly connected nodes DKL (parent|child), and the sum of all KL divergence values across the network. This analysis enables the identification of clusters that have the greatest impact on the health of workers and enables the implementation of targeted interventions and strategies to improve their well-being. The opportunity to introduce expert criteria opens the door to a more flexible approach where different probabilistic configurations can be studied.

To validate the PSEM, the contingency table fit (CTF) metric was used to measure the quality of the representation of the joint probability distribution via the latent variable. BayesiaLab’s CTF is defined as the entropy value (HB) of the created network B compared to the value HC of the fully connected network C. In addition, HU represents the entropy value of the data considering the disconnected network U (BayesiaLab, n.d.)

$$ C_{B} = 100\frac{{H_{U} \left( D \right) - H_{B} \left( D \right)}}{{H_{U} \left( D \right) - H_{C} \left( D \right)}}. $$
(5)

Therefore, if the current system can generate a precise representation of the fully interconnected joint probability distribution, the CTF will assume a value of 100. Conversely, if the network represents a completely disconnected structure where all variables are marginally independent, the CTF will be equal to 0.

2.5 Supervised Machine Learning Techniques for Target Characterization

Recent advances in computer science offer the possibility to couple machine learning with traditional statistical methods such as Bayesian networks (Benavoli et al. 2017). Bayesian networks have shown their potential in problem domains with manifold variables of different typologies, where the medical and occupational health domain is a showcase of their performance (Abad et al. 2019; Gerassis et al. 2019). Concretely, information theory in combination with Bayesian networks can be used to respond to the different stages of this study, allowing one to quantify the reduction of uncertainty brought by each medical variable to the knowledge of the health state.

Accordingly, a set of supervised networks corresponding for each working group is established. This higher degree of granularity, in which health has acted as a target node, reveals trends associated with specific activities performed by workers. The relative mutual information value is employed at this level to quantify how much information a variable provides (\({X}_{n}\)) knowledge about the characterization of the patient's health status (\({X}_{T}\)). Mathematically, the formula used in the calculation was as follows

$$ I_{R} \left( {X_{T} ,X_{n} } \right) = \frac{{I\left( {X_{T} ,X_{n} } \right)}}{{H\left( {X_{T} } \right)}} = \frac{{H\left( {X_{T} } \right) - H\left( {X_{T} {|}X_{n} } \right)}}{{H\left( {X_{T} } \right)}}. $$
(6)

2.6 Spatial Representation

Spatial models of selected attributes, annual mean temperature, annual rainfall, spinal health, limb health, cholesterol, age, and quality of sleep have been constructed using three-step geostatistical modeling:

  1. 1.

    Experimental isotropic variograms were computed, and theoretical models were fitted.

  2. 2.

    Ordinary kriging (OK) was used as an interpolating algorithm for the original rank values.

  3. 3.

    Finally, local G clustering (Getis and Ord 1992) allowed us to measure the degree of association that results from the concentration of weighted points (or region represented by a weighted point), and all other weighted points included within a radius of distance from the original weighted point. Considering a given zone divided into n regions, I = 1, 2, …, n, where each neighbor is distinguished by a point for which the Cartesian coordinates are known. Each I has associated with it a value x (a weight) taken from a variable X. The variable has a natural origin and is positive. The statistic of G(i) developed below allows us to test the assumptions on the spatial concentration of the sum of the values x associated with the points j in d of the ith point. The following statistical information is obtained

    $$ G_{i} \left( d \right) = \frac{{\mathop \sum \nolimits_{j = 1}^{n} W_{ij } \left( d \right)x_{j} }}{{\mathop \sum \nolimits_{j}^{n} x_{j} }},\;j\;{\text{not}}\;{\text{equal}}\;{\text{to}}\;i, $$

    where Wij is a symmetric one/zero spatial weight matrix with ones for all links defined as being within distance d of a given i; all other links are zero, including the link of point i to itself. The numerator is the sum of all xj inside of i but without including xi. The divisor is the sum of all xj, except xi.

3 Results and Discussion

This section presents the results obtained from a PSEM based on a Bayesian machine learning construct to the subsequent spatial assessment using a geostatistical methodology. The results describe the findings in identifying the most significant occupational health risks.

3.1 Probabilistic Latent Factor Induction

At the first level, an unsupervised overall model was created from the initial 48 marginal variables, excluding the worker's health status. The main purpose of the established network was to find clusters in terms of probabilistic relations between nodes. Furthermore, as the construction of an unsupervised network is the first step of a PSEM, it was necessary to set a maximum number of variables per cluster. Specifically, to get an understandable representation of the domain, 10 initial clusters were selected with groups between 2 and 13 variables. This process was carried out with a hybrid approach; that is, based on the algorithmic clustering proposal, the expert criterion was introduced for an adjusted model of the analysis scenario. Finally, Table 1 provides the results of the cluster analysis, including the computed node strength value for each cluster. The node force value represents the sum of the arc forces of all interconnected arcs to the cluster, providing a quantitative measure of the individual influence of each cluster for the domain under analysis.

The clustering variables of the initial Bayesian model are aggregated into a higher-level subnetwork corresponding to the four main clusters under study: specific medical variables, socioeconomic variables, bioclimatic variables, and general medical variables. Multiple clustering algorithms within BayesiaLab create a discrete factor for every subset of grouped marginal variables. In this case, the optimum number of states to represent the JPD of the marginal variables is automatic. The only limitation, as recommended by Conrady and Jouffe (2015), is to set the number of classes from 2 to 5. The spatial representation of the domain is depicted in Fig. 2.

Fig. 2
figure 2

Multiple Bayesian models were generated from a data clustering analysis. The colors of the marginal nodes are established following the groups obtained after the variable group analysis. The induced latent factors for each group of marginal variables are shown in the top layer in yellow, while the main latent factors are shown in gray

3.2 Probabilistic Structural Equation Model (PSEM)

Once the multiple clustering is completed, the PSEM design shows four Bayesian subnetworks surrounded by the induced latent factors and their marginal variables (Fig. 3). To analyze the influence of all latent factors on workers' health conditions, the connection between health status (target node) and the four main clusters under study (specific medical variables, socioeconomic variables, bioclimatic variables, and general medical variables) was manually assigned (expert criteria).

Fig. 3
figure 3

Final PSEM structure with two distinct levels of complexity built using a semi-supervised Taboo algorithm

Based on the PSEM, the statistical association between health status and each cluster of the model was further investigated. The most representative parent–child relationships are shown in Table 2, which accounts for all service activities (administrative and ancillary services, financial and insurance services, education, and hospitality) in the analysis.

Table 2 The parent–child relationship and the analysis of the contribution between the clusters of the model on the characterization of the health status of workers in the service sector

The cluster of specific medical variables provides more insight into the target node than the rest of the latent cluster (80%). Likewise, general medical variables provide 10% of the knowledge required to characterize the health status of workers. From the point of view of predictive significance, the marginal variable workplace (location) contributes 96.6820% to the reduction of the uncertainty of the latent cluster in which it is located, compared to 1.0118% of the recognition type node. The high significance of the location variable within the latent cluster defined as general medical variables highlights the importance of providing spatial patterns that represent how the variables with the greatest impact on the health status of workers in the tertiary sector are distributed. Lastly, the authors consider that it is very important to emphasize the importance of bioclimatic variables on workers' health indicators. Obtaining a contribution of nearly 9% highlights the real influence of climate.

To identify the most dominant variables in terms of the latent factor uncertainty, the relative mutual information (RMI) of the arcs was computed exclusively between the manifest latent factors and the target node health state (Fig. 4). The results show that the cluster of specific medical variables is the factor that most reduces uncertainty about having a healthy or unhealthy patient by an average of 5.3556%. For clusters of bioclimatic and general health variables, the results are similar with percentages of 0.8840% and 0.7298%, respectively. Social and economic variables have the lowest MI values.

Fig. 4
figure 4

PSEM arc’s mutual information (RMI child color blue and RMI parent color red) and contribution visualization

3.3 Model Validation

The CTF can evaluate the quality of the induced factors. The great advantage of advanced software such as BayesiaLab lies in the possibility for researchers and health professionals to directly calculate this metric-normalized set of values ranging from 0 to 100%. Table 3 shows the CTF values obtained for the latent factors induced in the multi-network design phase.

Table 3 Performance indices of factors induced in multiple aggregations

3.4 Local Impact Analysis of Data

At the fourth level, four supervised Bayesian networks were established, corresponding to each of the defined service activities and whose common node was the worker's state of health. The application of a naïve Bayes algorithm allowed the generation of a pragmatic network structure for the analysis of the influence of each variable on the health status of the workers. The characterization of the target node revealed that age, location, and total cholesterol, previously identified as the most significant factors in the general network of the service sector, also present a high impact on all the concrete service activities under study. In that context, the authors have considered the need to deepen the understanding of those variables that are a priori not that significant, but which may hold key differentiating aspects within each population group.

When looking at the distribution of contributions of each variable to the characterization of the state of health, it is found that the nervous system (15–19%) matches to a high extent the characterization of the medical examinations of healthy workers (64–70%). The most important medical conditions affecting these two states are age, total cholesterol, and location, while hearing problems and drug use are always reflected as differential variables. As an example, after an inference analysis on patients with high levels of total cholesterol belonging to hostelry services, a greater impact could be seen on elderly workers (> 50) belonging to the autonomous community of the Basque Country (38.26% of registered cases) located in the north of Spain. In contrast, it can be concluded that there is a strong need to provide a higher level of granularity on the musculoskeletal (8–11%) and cardiovascular (6–9%) pathologies, as here the differences among possible additional differential variables, even if relevant from a mathematical point of view, cannot be that meaningful from a policy perspective (Table 4).

Table 4 Relative binary mutual information analysis of data over the target node for the states representing musculoskeletal and cardiovascular systems by service activity

The great horizontality of variables such as age, location, and total cholesterol directed this study toward the need to add value to those differentiating variables of the musculoskeletal and cardiovascular systems. In addition, according to Table 4, the binary mutual information for the four variables tends to be similar across service activities. This situation leads to the spatial representation of the variable’s spine observation, annual precipitation (BIO 12), limb observation, and annual mean temperature (BIO 1) under an ordinary kriging approach (Fig. 5). These variables were selected as the highest contribution to musculoskeletal and cardiovascular systems (Table 4). This approach through OK allows the identification of both a spatial distribution of spinal problems and potentially related extremities, and two differentiated regions where these problems, as well as the systems they are related to, have a higher impact; especially in the northeast of Spain, apart from Catalonia, and the south, with vascular problems such as the presence of varicose veins. As in the western part of Spain, a higher rate of spinal disorders, derived from muscle contraction or other minor discomforts, is identified. Based on Bayesian results, it can be demonstrated that this type of injury is related to a great extent to pathologies of the musculoskeletal system which is potentially present in in-service activities such as hospitality (31.23%) and administration (32.05%). Moreover, we also see that these pathologies are also an underlying cause for problems in the end.

Fig. 5
figure 5

Distribution maps for annual mean temperature (°C), annual rainfall (mm), and spine and limb observation variables using rate data between 2012 and 2016 interpolated by ordinary kriging

The estimated distribution maps and highly significant clusters were computed by administrative zone for cholesterol, age, and sleep quality (Figs. 6, 7, and 8), giving patterns of high values (red rings) and low values (blue rings). To begin with, as shown in Fig. 6, cholesterol levels will not change with the work area. However, there is a tendency associated with geography. The north is more cholesterol-rich than the south. When the spatial distribution of cholesterol is compared with the distribution of temperatures in Spain, a direct correlation can be identified: the colder the temperature, the higher the cholesterol levels. Previous research indicates that cold increases blood pressure and favors an increase in cholesterol levels (Davis et al. 2022). Another factor can be the intake of more caloric meals required in northern areas against cold (Tien et al. 2016). However, there are many more factors; the lifestyle of workers and lack of physical activity could also be key factors. In this context, southern regions have a warmer climate that invites more development of physical activities (Bernard et al. 2021).

Fig. 6
figure 6

a Distribution maps for cholesterol by service activity group using rate data between 2012 and 2016 through ordinary kriging; b high- and low-significance clusters

Fig. 7
figure 7

a Distribution maps for age by service activity group using rate data between 2012 and 2016 through ordinary kriging; b high- and low-significance clusters

Fig. 8
figure 8

a Distribution maps for sleep quality by service activity group using rate data between 2012 and 2016 through ordinary kriging; b high- and low-significance clusters

There is insufficient evidence to establish a correlation between cholesterol (Fig. 6) and age (Fig. 7). In any case, a trend can be observed in which the higher the age, the higher the cholesterol level, a pattern that has been observed in recent studies (Mansoori et al. 2023). Depending on the economic sectors, education has the highest age, except in Murcia, which shows the ageing of the sector in Spain. Looking further into the matter, Aragon has the highest average age and highest cholesterol levels. However, Catalonia and Navarra have the lowest average ages and the lowest cholesterol levels in the north. In the financial sector, Aragon and Navarre have the lowest average ages and lowest cholesterol levels in the north. However, Galicia and Catalonia have the highest average ages and the highest cholesterol levels in the north (with some exceptions such as Asturias). In the administrative section, La Rioja is the youngest community with the lowest cholesterol levels.

The quality of sleep based on employee service activity is not significantly different (Fig. 8). All data show a similar relationship in terms of color scale variation, except for the education group, which is less blue than the rest. This suggests that the education sector has poorer sleep quality than the rest, except the Navarre region. Freitas et al. (2020) found a similar correlation and linked it to the strong psychological demand teachers have during their psychosocial relationship with students at work. As for correlations with other variables, it is curious that the quality of sleep does not seem to depend upon age. However, sleep disturbances prevail in patients with high diabetes (Huang et al. 2023). This can be corroborated by these data, as regions with the lowest sleep quality can be observed to coincide with regions with high cholesterol levels in the north. Inhabitants of the south of Spain (Andalusia) also present bad sleep quality; in this case, this might be related to high temperatures (Lan et al. 2017).

4 Conclusions

This survey exposes the potentialities of a couple of Bayesian machine learning and geostatistical methodologies in the form of a renewed PSEM to translate the complex problem of determining the occupational health risks of workers in the service sector. The goal is to obtain feasible visual analyses, typical of the geography of health, that can feed the evolution of medical policies at different levels of the national health system in Spain. This methodology fully applies to the remaining geographic areas. With this methodology, it is important to note that quantifying the influence of bioclimatic and socioeconomic variables becomes a reality. Furthermore, if estimated bioclimatic scenarios are introduced for the coming years, the change in climate attributes may be addressed and consubstantiate insight for future medical decision-making and occupational health.

Concretely, the results of this study revealed that variables such as age, location, and cholesterol, with contributions to the general network between 9 and 17%, are generally critical for the characterization of the health status of workers in the service sector. To a second extent, it was possible to identify a series of differentiating variables such as spine and limb observation, sleep quality, annual precipitation (BIO 12), or annual mean temperature (BIO 1) that, despite not being extremely significant from a mathematical point of view, play a key role and show a great impact in health risk maps at a regional level. It is also worth noting the weight of the bioclimatic variables on the health status of the worker, with a contribution value of approximately 9%. Further analysis is required to measure the uncertainty associated with considering or not considering other groups of variables, including worker behavioral aspects.