Introduction

This paper outlines the methodology and the results of a study concerning the relationships among three well-being domains (Education, Economic Well-Being and Health) measured on Italian provinces. Data come from the Italian system of indicators on Equitable and Sustainable Well-Being (Benessere Equo e SostenibileBES) proposed by the National Institute of Statistics (ISTAT 2018). BES represents a well established reference database in the national and international debate on the research of alternative well-being measures. The present paper proposes an advancement of the work elaborated in Davino et al. (2017, 2018), where a hierarchical composite model was used to study the relationships among components of the BES. The proposal exploits quantile regression (QR) (Koenker and Basset 1978; Koenker 2005) to obtain the best predictions in a network of simultaneous equations.

The introduction of structural equation modeling has been a turning point in the analysis of complex relationships among unobservable variables, first in its hard modeling, or covariance-based approach (Jöreskog 1978), and then in the soft modeling, or composite-based approach (Wold 1982; Tenenhaus et al. 2005). This paper embraces the soft modeling approach, which does not require any distributional assumption on the variables and exploits non-parametric methods to estimate the model parameters. Partial Least Squares Path Modeling (PLS–PM), proposed by Wold (1985), Tenenhaus et al. (2005), is one of the most widespread composite-based method for structural equation modeling. Nowdays, PLS–PM is a well established method both in statistical literature (Esposito Vinzi et al. 2010) and in applied research in several disciplines (Henseler et al. 2009; Hair et al. 2012; Sarstedt et al. 2017; Di Napoli et al. 2019).

Recently a quantile approach to PLS–PM called Quantile Composite-based Path Modeling (QC–PM) was proposed by Davino and Esposito Vinzi (2016) to broaden the potential of PLS–PM. To this end, QC–PM exploits QR (Koenker and Basset 1978; Koenker 2005) in all the steps of the PLS–PM estimation algorithm. This allows to highlight if and how the relationships among variables change according to the explored quantile of interest. It is worth to emphasize that QC–PM is not an alternative to PLS–PM, but rather its ideal completion. PLS–PM aims to estimate the effect of the involved variables on the conditional mean of the responses, QC–PM extends the focus to the whole conditional distribution; PLS–PM provides an effective summary of the dependence structure, QC–PM is a useful tool to magnify it.

The present study deals with the use of composite-based models for predictive purposes. In fact, PLS–PM cannot easily be used as a predictive modeling because the network of relationships is complex and the identification of a single direction to be explained is troublesome. At this regard, see Evermann and Tate (2016), Shmueli et al. (2016) and Dolce et al. (2017). Starting from the explicit and general formulation of the predictive model introduced by Dolce and Hanafi (2017) and used in Dolce et al. (2018), the present paper proposes a predictive-oriented QC–PM, namely a model able to provide the best prediction for each statistical unit. It extends the approach proposed by Davino and Vistocco (2015, 2018) which was aimed at identifying a typology in a dependence model. The authors introduced the “best quantile” for each unit, i.e. the quantile associated with the conditional model that provides the best estimate of the response variable (the best model for each unit). This paper exploits the “best quantile” approach in the composite-based path modeling in order to obtain accurate predictions by estimating several path models at different quantiles.

Even if the risk of overfitting is always lurking when we deal with prediction, it does not actually matter for the case concerned here since we aim to provide the best in-sample predictions and not to generalize on different data. The analyzed dataset contains indeed all the population units, namely the Italian provinces, and we aim to define the most accurate model for each Health indicator and the best predictive model for each province.

The Reference Framework

The empirical application focuses on the prediction of health levels and health inequalities in a regional well-being perspective. BES is the reference framework. It consists of a wide set of about 130 statistical indicators produced by the Italian National Institute of Statistics (Istat) to describe and monitor the progress of Italian society from a social and environmental point of view in a comprehensive way (ISTAT 2013, 2018).

Health is one of the 12 domains of well-being considered in the BES framework, together with Education and training, Work and life balance, Economic well-being, Social relationships, Politics and Institutions, Safety, Subjective well-being, Landscape and cultural heritage, Environment, Innovation, research and creativity, and Quality of services.

In the BES framework, Health is seen as a central element in life and an essential condition for people’s well-being and prosperity of populations. In fact, Health outcomes are related to many dimensions of the individual and social well-being. Among the multiple relationships, which link Health to the other BES domains and assets, we focus here on two related domains: Education and training and Economic well-being. In a well-being perspective, Education does not only have an intrinsic value but it directly affects other well-being domains. People with higher education levels have higher standards of living and more possibilities to find work, they live longer and better because they have healthier lifestyles, easier access to services and more opportunities to find less risky jobs. Similarly, Economic well-being is both an asset of BES and a driver of the well-being outcomes in other domains. Indeed, earning capacities and economic resources ensure that an individual can obtain and support a specific standard of living.

Recent studies (Costa et al. 2014; ISTAT 2019b; Murtin et al. 2017; Petrelli et al. 2019) focused on the relationships between health and socio-economic conditions at individual level. They confirm that regional disparities in health outcomes are still marked in Italy, both in terms of life expectancy at birth and mortality risk. Health inequalities among Italian regions arise regardless of age, gender and socioeconomic status, but they clearly appear to be related to socio-economic factors, as they have a higher impact in the poorer southern regions of Italy. Furthermore, lower education levels explain a considerable proportion of mortality risk, although with different effects by geographical area: males with a lower education level throughout Italy have a life expectancy at birth that is 3 years less than those with higher education; residents in southern Italy lose an additional year in life expectancy, regardless of education level. Other studies highlight that health inequalities are more severe within the southern Italian regions than within the northern ones. The single and joint effects of education and income factors are remarkable: mortality inequalities between better educated and less educated people explain globally about 25% of deaths among men and more than 10% among women. The differences in health outcomes among the Italian regions also result from local policies, as the Italian Regional Administrations have the main power to regulate and organize the public health services; instead, focusing on the sub-regional level, the other differentiating factors gain greater importance. Therefore, the impact of economic factors and of education levels is bearer of interesting analysis. In particular, it would be useful to obtain a model to predict health outcomes controlling for the factors that affect Education and Economic Well-Being.

Methodological Framework

Partial Least Squares Path Modeling

PLS–PM, originally developed by Wold (1982, 1985), is a powerful multivariate statistical method that can be applied to the study of the relationships among K blocks of observed variables. Such blocks of observed or manifest variables (MVs), \(\mathbf{X} = [\mathbf{X} _1, \ldots , \mathbf{X} _K ]\), measure K latent variables (LVs), \(\xi _1, \ldots , \xi _k\), usually named as components or composites. PLS–PM is commonly considered an alternative approach to the covariance structure analysis (Jöreskog 1978) of Structural Equation Modeling (SEM), although these two approaches belong to two different families of statistical methods.

PLS–PM focuses on LV scores computation, accounting for variances of MVs and correlations between LVs. Each block of MVs is summarized in a component, or a composite (i.e. an exact linear combination of the MVs), that maximizes the explained variance of the set of MVs. Therefore, PLS–PM is commonly referred to as a component-based, composite-based or variance-based approach. Great flexibility, robustness, few demands concerning distributional assumptions and requirement for identification are the main features of PLS–PM, and underpin its widespread dissemination in many areas (Esposito Vinzi et al. 2010; Hair et al. 2014).

More formally, let us consider that P variables are collected in a table \(\mathbf{X} \) of data partitioned in K blocks:

$$\begin{aligned} \mathbf{X} = [\mathbf{X} _1, \ldots , \mathbf{X} _J , \mathbf{X} _{J+1}, \ldots , \mathbf{X} _{J+Q}, \mathbf{X} _{J+Q+1}, \ldots , \mathbf{X} _K ] \end{aligned}$$

Let \(\mathbf{X} _k = \{x_{ip_k}\}\) be the generic block, where

  • the input blocks are in the first J positions,

  • the intermediate blocks run from block \(J+1\) to block \(J+Q\),

  • the output blocks run from block \(J + Q + 1\) to block K

  • \(i = 1, \ldots , n\), with n denoting the number of observations,

  • \(p_k = 1, \ldots , P_k\), with \(P_k\) being the number of MVs in the k–th block.

We denote by \(\xi _k = \{\xi _{ik}\}\) the corresponding LVs for each block of variables. A generic MV is instead denoted by \(\mathbf{x} _{p_k} = \{x_{ip_k}\}\). The path diagram in Fig. 1 shows an example of a simple path model with an input, an intermediate and an output block of manifest variables.

Fig. 1
figure 1

A path model with an input, an intermediate and an output block of manifest variables

The general model consists of two sub-models: the inner (or structural) model and the outer (or measurement) model. The measurement model relates each MV to its own LV by the following equation:

$$\begin{aligned} \mathbf{x} _{pk} = \lambda _{pk0}+ \lambda _{pk} \varvec{\xi }_k+ \varvec{\epsilon }_{pk}, \end{aligned}$$
(1)

where \(\lambda _{pk0}\) is a location parameter, \(\lambda _{pk}\) is the loading coefficient that captures the effect of \(\varvec{\xi }_k\) on \(\mathbf{x} _{pk}\), and \(\varvec{\xi }_k\) is the measurement error variable. The structural model captures and specifies the dependence relationships among LVs. A generic dependent LV is linked to the corresponding explanatory LVs by the following model:

$$\begin{aligned} \varvec{\xi }_k = \beta _{k0}+\sum _{k' \rightarrow k}\beta _{kk'}\varvec{\xi }_{k'}+\varvec{\zeta }_k, \end{aligned}$$
(2)

where \(\beta _{kk'}\) is the so-called path coefficient that captures the effects of the predictor LV \(\varvec{\xi }_{k'}\) on the dependent LV \(\varvec{\xi }_{k}\) , and \(\varvec{\zeta }_k\) is the inner residual variable.

The following weighting relation defines the casewise scores of each LV as a weighted aggregate of its own MVs:

$$\begin{aligned} \varvec{\hat{\xi }}_k = \sum _{p=1}^{P_k} \hat{w}_{kp}{} \mathbf{x} _{pk}, \end{aligned}$$
(3)

where \(\hat{w}_{kp}\) is the outer weight obtained through the PLS–PM iterative algorithm.

Since in PLS–PM there are different kinds of residual variables, a set of partial (or local) least squares (PLS) criteria are defined and the optimal solution is found by an iterative algorithm (Lohmöller 1989). In particular, the estimation of the model parameters in Equations (1) and (2) proceeds in two stages. The first stage computes the outer weight vectors \(\hat{w}_{k}\) in Equations (3), and consequently the composite \(\varvec{\xi }_k\), through an iterative algorithm alternating OLS simple or multiple regressions. The second stage estimates the loading coefficients \(\lambda _{pk}\) and the path coefficients \(\beta _{kk'}\) through classical OLS regressions.

The statistical and numerical properties of PLS–PM were deeply investigated and interesting results were found in terms of global optimization criteria and convergence properties (Glang 1988; Mathes 1993; Hanafi 2007; Krämer 2007; Tenenhaus and Tenenhaus 2011). Furthermore, recent methodological developments introduced interesting features of the method, starting on which it is possible to generate predictions from PLS path models. The following subsection details prediction through PLS–PM.

Predictive-Oriented PLS-PM

PLS–PM is a powerful method both for explorative and predictive purposes. This is a distinctive feature compared to covariance-based SEM, which mainly focuses on obtaining valid inferences for population parameters. However, since its origin PLS–PM has been almost exclusively used as an explanation oriented technique. Only in the recent years the predictive ability of PLS–PM started to gain increasing interest from researchers (Evermann and Tate 2016; Shmueli et al. 2016; Dolce et al. 2017; Danks and Ray 2018; Shmueli et al. 2019; Sharma et al. 2019). The limitation of the PLS–PM as explanatory modeling was due to the lack of an explicit formulation of the predictive model, because of the complexity of the PLS path model. Two models are considered in PLS–PM, i.e. Equations (1) and (2), and data are partitioned into three kinds of blocks: input (only used for prediction), intermediate (used for prediction and as dependent blocks) and output blocks (only used as dependent blocks). As a matter of fact, prediction in PLS–PM has been considered a difficult task because a choice should be made between either prediction from the measurement model or from the structural model. Moreover, prediction of individual observations may refer to either individual LV score observations or individual observations of the MVs in the dependent blocks. Finally, intermediate blocks pose a special challenge in the predictive context, because they play a twofold role in the model: they are both predictor variable blocks and dependent variable blocks.

Lohmöller (1989) defines five different sorts of predictions from PLS–PM:

  1. 1.

    communality prediction: each MV is predicted by the corresponding LV—Equation (1);

  2. 2.

    structural prediction: the prediction of each LV is obtained using the related predictor LVs—Equation (2);

  3. 3.

    validity prediction: the prediction of each LV is obtained using their MVs—Equation (3);

  4. 4.

    redundancy prediction: each MV is predicted by the predictor LVs that is directly connected to its own LV;

  5. 5.

    operative prediction: each MV is predicted using only the MVs of the predictor blocks (all the LVs are replaced with their corresponding weight relation)—Equation (3).

Despite this complexity, it is possible to generate predictions from PLS–PM since appropriate schemes were recently proposed Shmueli et al. (2016), Dolce and Hanafi (2017). The present paper uses the explicit and general formulation of the predictive model proposed in Dolce and Hanafi (2017), which incorporates both the measurement and structural model in an unique model, and requires only MVs as predictors and outcomes, following the operative prediction defined in Lohmöller (1989).

Quantile Composite-Based Path Modeling

QC–PM has been proposed by Davino and Esposito Vinzi (2016) to complement PLS–PM. QC–PM exploits QR to explore the whole distribution of dependent variables as function of the set of predictors. Since PLS–PM is based on simple and multiple OLS regressions, its coefficients focus on the conditional means of the dependent variables. Even if it provides an effective summary, in some cases the estimates of coefficients may vary along the distribution of the dependent variable. This happens in presence of heteroscedastic variances of the errors or highly skewed dependent variables. In such cases, PLS–PM may give an incomplete picture of the relationships among variables. The quantile approach is instead able to model the location, the scale and the shape of the responses.

QR was introduced by Koenker (2005) to extend the regression model from the conditional mean to any conditional quantile of interest. In linear models, QR estimates have the same interpretation of any other linear model. The intercept measures the response value when all the regressors are set to zero, the slopes measure the rates of change in the response per a unit change in the value of the correspondent regressor, keeping all the others constant. Since QR estimates a set of coefficients (intercept and slopes) for each considered quantile, coefficients must be interpreted in terms of the quantiles of the response. A dense grid of equally spaced quantiles can provide a fairly accurate approximation of the whole quantile process (Furno and Vistocco 2018) and a reconstruction of the whole conditional distribution of the response variable, because each conditional quantile predicts the correspondent location of the response variable (Davino et al. 2013).

QC–PM uses QR instead of OLS regression in all the estimation steps. In particular, QC–PM consists in introducing a QR in the steps described in Sect. 3.1 to estimate model parameters—Equations (1), (2) and (3). For each quantile \(\theta \) of interest, \(\theta \in (0, 1)\), the first stage computes the outer weights vectors \(w_{k}\) and each composite \(\varvec{\xi }_k\), through an iterative algorithm alternating simple QR or multiple QR. The second stage estimates the loading coefficients \(\lambda _{pk}(\theta )\) and the path coefficients \(\beta _{kk'}(\theta )\) through QR:

$$\begin{aligned} \mathbf{x} _{pk}(\theta )= & {} \lambda _{pk0}(\theta )+ \lambda _{pk}(\theta ) \varvec{\xi }_k+ \varvec{\epsilon }_{pk}, \end{aligned}$$
(4)
$$\begin{aligned} \varvec{\xi }_k (\theta )= & {} \beta _{k0}(\theta )+\sum _{k' \rightarrow k}\beta _{kk'}(\theta )\varvec{\xi }_{k'}+\varvec{\zeta }_k. \end{aligned}$$
(5)

Similarly to the unconditional quantile minimisation (Fox and Rubin 1964), the conditional quantile estimator is obtained by minimizing a weighted sum of residuals. The function to minimize in case of equation (4) is:

$$\begin{aligned} {{\hat{\varvec{\lambda }}}}\left( \theta \right) =argmin_{{\varvec{\lambda }}(\theta )} \sum _{i=1}^n \rho _{\theta } \left( x_i-\mathbf {\xi }_{i}^{T} {\varvec{\lambda }}\left( \theta \right) \right) \end{aligned}$$
(6)

where the first element of \(\xi \) is equal to 1 to include the intercept, and \(\rho _{\theta } \left( .\right) \) is the check function, which asymmetrically weights positive and negative residuals, namely:

$$\begin{aligned} \rho _{\theta }(u)= \left\{ \begin{array}{lll} \theta \, u &{} \qquad \text{ if } &{} {u > 0} \\ \left( 1-\theta \right) u &{} \qquad \text{ if } &{} u\le 0 \end{array} \right. \end{aligned}$$
(7)

The same holds in case of Equation 5.

For each quantile of interest, QC–PM provides a set of outer weights, loadings and path coefficients. Therefore, it offers a more complete picture of the relationships among variables both in the outer model (as the outer weights measure the effects of each MV on the corresponding construct) and in the inner model (as the path coefficients quantify the impact of lower-order constructs on higher order constructs). However, in order to compare path coefficients estimated over quantiles, measurement invariance has to be satisfied in the models (Henseler et al. 2016). In other words, for each MV, all the loading should be very similar across quantiles and compared to the one estimated by PLS–PM, because the same LV should be measured across quantiles. If loadings change across quantiles, there is no guarantee that a LV is measuring the same concept and path coefficients estimated at different quantiles cannot be reliably compared.

In this situation, a possible solution may be to fix the quantile to the median in the measurement model and letting quantiles change just in the structural part, following the approach proposed by Wang et al. (2016) in factor-based structural equation modeling. This approach is justified if we consider that the role of the measurement model is to relate the MVs to LV and to construct a score for the LV. Finally, measures of goodness of fit and tests for evaluating the statistical significance of the coefficients typically used in PLS–PM can be easily extended to the QC–PM approach (see Davino et al. 2016 for details).

Predictive-Oriented QC-PM

Exploiting the ability of QR to model the whole conditional distribution of a dependent variable, a Quantile Composite-based Path Modeling can be used in a prediction perspective. As above stated, the use of a dense grid of equally spaced quantiles provides an accurate approximation of the whole quantile process (Davino et al. 2013). The predictive model and the proposed procedure aims to provide the best in-sample predictions (often called fitted values). The computation of in-sample predictions requires estimating the parameters of a PLS model by using a given data sample and then using the model to predict values for cases of the same sample (Shmueli et al. 2016).

This paper proposes a two-step procedure to provide the best predictions of the outcome MVs:

  1. 1.

    Estimation of several path models

    The first step aims to estimate the specified path model for a dense grid of equally spaced quantiles through the QC–PM algorithm. QC–PM provides m estimates for each parameter of the model, m LV scores and m previsions of the outcome MVs, where m is the number of the chosen quantiles. In particular, for the empirical analysis proposed in this paper, we exploited a grid of quantiles \(\theta \) varying from 0.01 to 0.99 with a step of 0.01.

  2. 2.

    Identification of the best model for each outcome MV and for each unit

    The second step aims to define the most predictive model for each outcome MV and the best accurate model for each statistical observation (henceforth best quantile). To achieve this goal it is necessary to compute the predictions corresponding to each quantile. Considering the outcome blocks, a partitioned table of predictions \(\hat{\mathbf{X }}\) is obtained for each \(\theta \): \( \hat{\mathbf{X }}(\theta ) = [ \hat{\mathbf{X }}_{J+Q+1}(\theta ), \ldots , \hat{\mathbf{X }}_K(\theta )]\). The generic element of a MV prediction, \(\hat{\mathbf{x }}_{p_k}(\theta ) = \{\hat{x}_{ip_k}(\theta )\}\), represents the prediction value of the MV \(\mathbf{x} _{p_k}\), for the i-th unit according to the \(\theta \)-th quantile.

    The best model for each unit i and for each dependent MV of the k-th block, \(x_{p_k}\), is identified by the quantile that best predict the variable, namely by the quantile which minimizes the absolute difference between the observed value and the estimated value:

    $$\begin{aligned} \theta ^{best}_{ipk}=argmin _{\theta =1,\ldots , m}\left| x_{ipk}-\hat{x}_{ipk}(\theta )\right| \end{aligned}$$
    (8)

    where \(\theta ^{best}_{ip_k}\) represents the quantile associated to the best predictive model for each unit i, each indicator \(p_k\) in each block k, while \(\hat{x}_{ip_k}(\theta )\) is the correspondent best prediction for \(x_{ip_k}\).

The denser the quantile grid is, the more accurate the forecasts provided by the predictive approach to QC–PM are. The best quantile provides an estimation of the unit position in the conditional distribution of the outcome variable. A comparison between the vector of the best quantiles for a given dependent variable and the corresponding unconditional quantile (namely the position of each unit in the observed MV) allows to understand what is the effect of the structure of relationships in predicting each outcome variable.

A very simple example can be used to clarify the added value of the proposal and the different information that unconditional and conditional quantiles convey. Figure 2 shows the scatterplot of two variables observed on a sample of 3000 units. The dependent variable, represented on the vertical axis, has been generated from a Gamma model. Its unconditional distribution is represented through a boxplot, a dotplot and an histogram on the right-hand side of the figure using gray color. The regressor, represented on the horizontal axis, is a numerical variable with six values. The conditional distributions of the response on the six values of the regressor are depicted through the first six dotplots, starting from the left-hand side. The plot portrays a scenario in which the standard deviation exhibits a linear growth in the response variable as the regressor increases, the skewness and the excess kurtosis being instead constant and positive. Two observations are highlighted as an example: the diamond denotes an observation located above the median of the dependent variable distribution, with an unconditional quantile equal to 0.61; the triangle is placed in the lower tail, its unconditional quantile being 0.35. This is evident from the position of the two points in the gray right-hand boxplot. Starting from a dense grid of quantiles, the correspondent QR models were estimated, in order to identify the best quantile for each observation, namely the one that minimizes the function in Equation (8). The two lines depicted in Fig. 2 are the best QR models identified for the two example observations, respectively at the quantiles 0.31 for the diamond and 0.58 for the triangle. In this simple example, it is easy to prove that the identified best quantiles correspond to the estimation of the position of the two points in the conditional distributions. The two points are indeed highlighted according to their regressor value, namely in correspondence of the second and fifth conditional distribution (from left). They lie on the two best models whose \(\theta \) values are equivalent to their position in the conditional distribution. This is further highlighted superimposing a boxplot in order to assess their position.

Fig. 2
figure 2

An illustrative scatterplot to visualize the unconditional rank-quantiles (right grey doplot) of two points (diamond and triangle) and their corresponding positions in the conditional distributions (second and fourth dotplot, from the left)

An Empirical Analysis: The Prediction of Health Indicators

Data Description

This paper proposes an exploration of the relationships among Health, Education and training (henceforth EDU) and Economic well-being (henceforth ECO) using the “BES measures at local level” (ISTAT 2019a), a subset of the BES indicators, that are measured and regularly updated by Istat on the 110 Italian provinces and metropolitan cities (i.e. at NUTS3 levelFootnote 1). The interest in such an application concerns both advances in knowledge about the dynamics producing the well-being outcomes at local level (multiplier effects or trade-offs) and a more complete evaluation of regional inequalities of well-being.

At the local level the well-being assets can strengthen each other affecting multiple disadvantages or advantages. For this reason it is important to consider both the levels and the relationships among BES indicators. Furthermore, equal results can be achieved in very different contexts and conditions. So, in assessing or comparing the well-being outcomes, the conditions within outcomes arise should be properly considered, adopting a multidimensional approach, able to support an accurate evaluation of the regional performances.

Figure 3 shows the specified network of relations. EDU, ECO and Health are the unobserved complex concepts that are measured as composites of the corresponding MVs (squares in Fig. 3 detailed in Tables 3 and 4 in the “Appendix”). Even though the model could be enriched by including further measures or domains, it still considers the most of the BES indicators that are currently produced by Istat at the NUTS3 level.

Fig. 3
figure 3

The PLS-PM model

In the path model in Fig. 1, Health variables are placed as response variables of EDU and ECO. The underlying hypothesis, supported by literature and empirical studies (Costa et al. 2014; Mackenbach et al. 2008; Murtin et al. 2017; Petrelli et al. 2019), is that EDU has a direct effect on Health and an indirect effect mediated by ECO. In fact human capital is both a factor of economic competitiveness and well-being, as higher education offers more income opportunities, and promotes lower vulnerability to health risks.

With respect to the MVs, life expectancy at birth of males (O.1.1M) and females (O.1.1F) and infant mortality rate (O.1.2.MEAN_aa) are the three indicators used to measure the main global outcomes in the Health domain. EDU consists of indicators of qualification (O.2.2; O.2.3), competences (O_2.7_2.8; O_2.7_2.8_AA), participation in education and long-life learning (O.2.4; O.2.5aa; O.2.6). ECO is measured by indicators of income and wealth (O.4.1, O.4.2, O.4.3, O.4.5) and economic difficulties (O.4.4aa; O.4.6aa). All the indicators were positively oriented towards the BES (the higher is the indicator value, the greater is the BES) to provide an easier interpretation of the results. The “aa” suffix denotes those indicators that we reversed for this purposeFootnote 2. Data refer to the latest update available (reference year is given in the last column of Table 4 in the “Appendix”).

A preliminary analysis of the distribution of the MVs provides an examination of the heterogeneity that can be observed in the distribution of well-being in the Italian provinces. Figures 4, 5 and 6 show the violin plots, a combination of a box plot and a density plot. It is realized rotating and placing symmetrically on each side two density plots. The length of the vertical axis of each graph allows to appreciate the range of the observed values while the shape highlights how values are distributed in terms of variability and skewness. The black dot in the middle is the median value. Note that violin plots in different panels are not always comparable, as the variables have different unit of measurement and scales.

As expected, life expectancy at birth has a similar distribution for females (O.1.1F) and males (O.1.1M). However the median value is lower for men (80.5 years) compared to women (85.0). The gender gap (4.5 years comparing median values) is wide even looking at the ranges of the distributions: the maximum for men (82.1 years) is smaller than the minimum for women (82.8 years). In both cases the Italian provinces fall into ranges of equal width, but the male’s life expectancy has a more regular shape. Infant mortality is a rare phenomenon, so the corresponding MV (O.1.2 MEAN_aa) has a high territorial and temporal variability; for this reason the model was calculated on a three-year average. Even this more aggregate measure reveals large differences among the Italian provinces. The range is 5.1 points per thousand between the province with the worst result (equal to zero in the chart because the indicator was reversed) and the one with the best result. Most of the Italian provinces thicken in the centre of the distribution, with few cases placed in upper and lower ends. Therefore the major differences concern a small number of cases. The territorial heterogeneity in health outcomes does not have a clear geographical gradient. The provinces of central and northern Italy have more often better results than those of the south and islands, with many positive and negative exceptions (ISTAT 2019b). Considering the Health predictors MVs, those of ECO are clearly the most discriminating. In particular all the indicators of income and wealth (O.4.1, O.4.2, O.4.3, O.4.5) have very asymmetric and polarized distributions, showing a sharp division between the group of provinces with the best economic outcomes and the group of the most penalized ones. The density of the median class is always lower compared to these two opposite groups. The shape of these graphs reflects a clear separation between the richer provinces of northern and central Italy, and the group of the southern and islands ones. The same division concerns the territorial distribution of low-income pensioners (O.4.4aa) which is very asymmetrical. In the Education block, the widest asymmetries emerge about the competences of young students (O_2.7_2.8) and the participation in education (O.2.5aa); both this measures oppose the southern Italian provinces, more disadvantaged, to the northern and central ones. Looking at the participation in lifelong learning (O.2.6) and at the highest qualification levels of the population (O.2.3) the territorial heterogeneity has a quite different sharp, that becomes thinner and longer moving towards the best outcomes, which therefore concern a few leading Italian provinces (namely the north-eastern ones for O.2.6 indicator and the northern metropolitan cities for O.2.3 indicator).

Fig. 4
figure 4

Violin plot of the Health indicators

Fig. 5
figure 5

Violin plot of the Education indicators

Fig. 6
figure 6

Violin plot of the Economic well-being indicators

After examining the distribution of each MV, it is necessary to check the internal consistency of each block of MVs through the Cronbach’s \(\alpha \) and Dillon-Goldstein’s \(\rho \), which need to be greater than 0.7. For the Cronbach’s \(\alpha \), Confidence Intervals are also reported following a recent approach (Trinchera et al. 2018). Moreover, the average variance extracted (AVE) is also considered (Tenenhaus et al. 2005). Table 1 shows that, for all the blocks, internal consistency is satisfied and all the AVE values are greater than 0.5.

Table 1 Reliability and internal consistency measures

PLS-PM and QC–PM Results

The model in Fig. 3 was estimated using the classical PLS–PM and QC–PM fixed to the three quartiles (\(\theta \)=[0.25, 0.5, 0.75]). Given the territorial heterogeneity expressed by the model MVs, the aim is to explore whether estimates vary across different parts of the variable distributions.

Firstly, the estimated loadings in the measurement models are examined. Table 2 presents the loadings for each MV estimated using conventional PLS–PM and QC–PM at the three defined quantiles.

Table 2 Measurement model results

Except for few cases, loadings are very similar across quantiles and compared to the one estimated by PLS–PM. However, for each block of MVs, measurement invariance should be verified to evaluate any potential difference in the structural relationships across the various quantiles (i.e., comparison among path coefficients). QC-PM still lacks a statistical test for measurement invariance, but we applied the variant of QC-PM fixing the quantile in the measurement model to the median and found no relevant differences in results, hence the measurement of the LV remains essentially the same for all quantiles, which allows a reliable comparison of path coefficients across the various quantiles.

Bars in each panel of Fig. 7 represent (from the top to the bottom) the path coefficients and the standard errors measuring respectively the effects on the conditional average and on the conditional quartiles of Health. It is interesting to note how QC–PM results complement PLS–PM results. Looking at the average of the distribution (PLS–PM), Education is the most important driver of Health, but QC–PM reveals that its importance is greater where the Health is lower (that is lower or equal to the median) while it decreases as the Health grows. In essence, the effect of EDU on health conditions increases moving from provinces with good to worse results. With regard to ECO, the additional information provided by QC–PM is also interesting because the PLS–PM results suggests that ECO does not contribute to Health while the QC–PM reveals an high path coefficient in those provinces where Health scores are the highest ever. Therefore ECO plays a discriminating role in explaining the best absolute outcomes.

Fig. 7
figure 7

Path coefficients linking Economic well-being and Education to Health

The territorial heterogeneity of the MVs is often associated with geographical differences; therefore it may be useful to add the geographical location of the provinces in the analysis of the model results (Davino et al. 2017). A possible source of heterogeneity could be, for example, the geographical area considering that Italian provinces are usually grouped into four areas: north-east (20%), north-west (23%), centre (20%) and south and islands (37%). Figure 8 shows all the possible scatter plot combining the three composites obtained by estimating the model in Fig. 3 with PLS–PM. In each panel a different color and shape of the points is used to distinguish the effect of the area. The lines represent the regression lines estimated in the four subgroups of provinces considering the variable represented on the vertical axis as dependent variable. The boxplots on the right (top) side of each panel show the distribution of the composites represented on the vertical (horizontal) axis distinguishing once again the provinces by geographical area. Considering the univariate distributions of the composites, it is clear that the southern provinces are lagging behind in all the three contexts analysed; the third quartile for the southern provinces is always well below the first quartile for the other provinces. The gap is broader for EDU and ECO.

As expected, given the distribution of EDU and ECO MVs (Sect. 5.1), the composites distribution has a clear geographical orientation: values increase moving from the south to the north of Italy, with the north-east group leading. Conversely, the Health distribution follows a different geographical progression. The group of central Italian provinces, although very heterogeneous, tends to have better scores than all the other groups, including the north-eastern one.

Considering the relationships among Health, ECO and EDU, the three composites are highly correlated at a national level but differently at local level. In Fig. 8, for each couple of scores, a regression line is estimated in the four geographical areas and superimposed to the scatter plot. The simultaneous representation of the scatter plot and regression lines allows to capture both the trend and the heterogeneity of the relationship. The correlation between the ECO and EDU composites is equally strong in all the four geographical areas despite the heterogeneity observed within each group of provinces (Fig. 8, bottom right panel). The correlation between Health and ECO it is by far the strongest in the south and islands group, according to the greater path coefficient of ECO on the Health worst outcomes (note that the south and islands provinces always lie in the Health distribution queue, with just one exception). Instead the strongest correlation between Health and EDU scores is that of the center group. Looking at the PLS–PM results shown in Fig. 8, the assumptions underlying the model are not confirmed for all the north-eastern provinces: also due to the high dispersion and heterogeneity of this provinces, no correlation arise between Health and EDU outcomes, while that between Health and ECO (very weak) still has a negative sign.

Fig. 8
figure 8

Education, Economic well-being and Health distributions according to the geographic area. PLS-PM results

The results of the QC–PM can provide a better definition of the characteristics of this heterogeneity. Focusing on the Health composite and on its three conditional quartiles, it is possible to analyse similarities and differences among the geographical areas at different health conditions. Figure 9 shows the distribution of the Health composite for each area (different panels) and for each model (rows in each panel). The density plot, the dot diagram and the boxplots allow to explore all the features of the distributions. In each line a segment joins the averages of the composite at the three quartiles. Considering that the global averages of the composites provided by the PLS–PM and by QC–PM at each quartile are equal respectively to 0, -0.47, 0.01 and 0.47, it is possible to note that the averages of the southern provinces distributions are always below the global average, while north-eastern provinces (and partially also the north-western ones) show an opposite behavior.

Fig. 9
figure 9

Distribution of the Health composite from a PLS-PM (top in each panel) and QC–PM estimated at the three quartiles, according to the geographic area

Prediction Results

In order to exploit the model in Fig. 3 in an operative predictive perspective, it is necessary to consider that Health is composed by three outcome variables: life expectancy at birth of males (O.1.1M) and females (O.1.1F) and infant mortality rate (O.1.2.MEAN_aa). In deepening the presentation of the prediction results, we will just focus on O.1.1M, which, as seen above (Sect. 5.1) is the most informative among the three Health indicators we used in the model. In fact it is more robust than the infant mortality rate (that is affected by extra-variability) and, compared to females rates, it is more able to explain both the differences among Italian provinces and the improvements achieved over time in the general level of the total life expectancy in Italy. As we deal with in-sample prediction, the post-analysis of the results can be based on the comparison between the observed and the prediction values. As described in the methodological sections, the proposed best quantile approach computes the predicted vector of each MV by selecting the quantile model providing the best prediction for each province. In case of a simple regression model (with one dependent variable and one regressor) the identification of the best quantile allows to exactly reconstruct the observed variable. In a more complex model as the network of relationships in Fig. 3 is, the goal is to identify the best prediction.

Figure 10 shows a smoothed version (using a linear smoother) of the scatterplot of observed and predicted values for the O.1.1M variable where the predicted values derives from a PLS–PM (solid line) and from the best quantile approach (dashed line). The gray line depicts the bisector, namely the place of the points where the observed values and the expected values coincide perfectly. It results that the best quantile predictions are much more accurate than PLS–PM, but the marginal gain in accuracy decreases at the distribution tails. This consideration is also confirmed in the analysis by geographical area (Fig. 11): the higher accuracy of the best quantile predictions is particularly evident for the north-east and south and islands areas (the dashed lines are very closed to the bisector) while the lower accuracy of the best quantile predictions in the extreme parts of the distribution is more evident in north-west (low tail) and center (high tail).

Fig. 10
figure 10

A smoothed version (using a linear smoother) of the scatterplot of observed and predicted values for the O.1.1M variable obtained using a PLS–PM (solid line) and the best quantile approach (dashed line)

Fig. 11
figure 11

A smoothed version (using a linear smoother) of the scatterplot of observed and predicted values for the O.1.1M variable obtained using a PLS–PM (solid line) and the best quantile approach (dashed line) according to the geographic area

The proposed quantile model-based prediction approach can provide useful information to understand, for each province, what is the contribution of the system of relationships in the model in Fig. 3 to the prediction of health levels. In essence, the comparison between conditional and unconditional quantiles for the Health MVs tells us if the observed results are in line with the starting conditions (in terms of ECO and EDU).

The scatter plot in Fig. 12 visualizes all the provinces according to the assigned best quantiles and to the unconditional quantiles. The unconditional quantile (horizontal axis) is the position of each province in the MV distribution without considering/controlling the effect played by EDU and ECO, while the best quantile (vertical axis) represents the position of each province compared to all the other provinces that have similar EDU and ECO levels (i.e., the position in the conditional distribution).

If on one hand, looking at the unconditional quantiles of O.1.1M one can find vertically aligned provinces that have the same value of the life expectancy of males also starting from different EDU and ECO levels, on the other hand looking at the horizontal alignment of the points, one can find provinces that perform similarly while having different socioeconomic levels. The furthest points from the bisector of the scatterplot identify the territories where the differences between observed and “potential” results are greatest. These divergences are the most interesting cases to study. By numbering the quadrants counterclockwise and starting from the quadrant at the top right, we can identify two “critical” situations: at the top left (second quadrant) fall those provinces that get better results than the expected ones (the best quantile is greater than the unconditional) and at the bottom right (fourth quadrant) fall those territories with an important negative gap (the best quantile is much lower than the observed one). The geographical area has some influence on the relationship between unconditional and conditional quantiles. In fact, in the second quadrant we find almost exclusively southern provinces (there are only Latina, Frosinone and Fermo for the center) while in the fourth quadrant we find only northern provinces together with Rome. The scatter confirms the advantage of the center and northen areas and the penalisation of the south and islands also for the males life expectancy at birth. However in this multimensional perspective the dimensions of the advantages and disadvantages are quite different from those we could appreciate in a context of univariate analysis (i.e. considering the single indicators and not also their interrelationships).

Fig. 12
figure 12

The scatter plot of the provinces according to the unconditional and conditional best quantiles of O.1.1M. The color and shape of the points represent the geographic area

For a more analytical illustration of the additional information that the model provides, an extreme example can be isolated: Bologna vs Ravenna. Bologna and Ravenna share quite similar health outcomes, considering the O.1.1M variable, but the predicted values are quite different as shown by their positions respectively in the fourth quadrant (Bologna) and in the second one (Ravenna). If we consider the subset of the provinces of the north-eastern area, it is more evident the effect induced by the model in the Health results of the two provinces. The slope graph in Fig. 13 shows the sub-group of provinces in the north-east of Italy ranked according to their position in the original (unconditional, left-hand side) and estimated (conditional, left-hand side) distribution of O.1.1M. The slope of the lines joining the unconditional and conditional position of each province clearly visualize how much taking into account the levels of EDU and ECO can affect the life expectancy of males. The limit case is represented by an horizontal line: it would mean that ECO and EDU levels make no contribution to the knowledge of Health. Both for Bologna and Ravenna, the results in terms of life expectancy of males, estimated in itself, are excellent, among the highest: the two provinces share the 83-th percentile in Italy. However, the effects of the estimated model are different in the two provinces resulting in an improvement in the position occupied by Ravenna (increasing slope of the stick) and a worsening for Bologna (decreasing slope of the stick), the two best quantiles being 0.20 and 0.89, respectively.

Fig. 13
figure 13

The slope graph of the sub-group of provinces in the north-east of Italy. Provinces are ranked according to their position in the original (unconditional, left-hand side) and estimated (conditional, left-hand side) distribution of O.1.1M

To interpret the different effect played by the model, with similar observed results, it is necessary to go back to the distribution of the original indicators. Figure 14 and 15 show some univariate statistics of the ECO and EDU through parallel coordinates. The grey double lines join the quartiles of the indicators while the thin lines represent the averages by geographical area. The broken lines representing Ravenna and Bologna are highlighted and allow to contextualize why the performance of Bologna appears less brilliant. If on one hand the levels of EDU and ECO in Bologna are among the highest, on the other hand in the group of provinces with these highest levels of EDU and ECO, Bologna ranks among the last in terms of life expectancy of males, as highlighted by the best quantile that is much lower than the unconditional one. This difference gives us a measure of the gap between potential and actual results, which in this case is negative. Reading the results from the point of view of Ravenna, it appears that Ravenna has the same life expectancy of males as Bologna (81.5 years), but a much less favorable ECO context, especially concerning the disposable income of households (O.4.1), the incomes of employees (O.4.2), pensioners (O.4.3) and, the wealth of families (O.4.5), and it is shown above that in the model ECO has a higher path coefficient on the higher part of the Health distribution. The gaps in terms of EDU are even more marked in particular on the rates of graduates (O.2.3), transition to university (O.2.4) and participation in lifelong learning (O.2.6); moreover we know that EDU in the model has a high path coefficient on the higher part of the Health distribution. Given its advantage over Ravenna in terms of EDU and ECO, Bologna should have a far better result in terms life expectancy of males than what is observed.

Fig. 14
figure 14

Distribution of the univariate statistics of the Economic well-being indicators: quartiles (grey double lines), averages by geographical area (thin lines). The broken lines representing Ravenna and Bologna are highlighted

Fig. 15
figure 15

Distribution of the univariate statistics of the Education indicators: quartiles (grey double lines), averages by geographical area (thin lines). The broken lines representing Ravenna and Bologna are highlighted

An opposite pattern can be exemplified by Cosenza and Catanzaro (Fig. 16). Cosenza falls in the group of provinces with low results (81st with 79.9 years), but gets a better position than expected, given ECO and EDU, as its conditional quantile is higher than the unconditional and the positive gap is quite wide. So we could say that Cosenza performs better than Catanzaro, as it gets the same result but in a more unfavorable context, mostly due to the lower levels of the indicators of ECO (see Fig. 17).

Fig. 16
figure 16

Distribution of the univariate statistics of the Education indicators: quartiles (grey double lines), averages by geographical area (thin lines). The broken lines representing Catanzaro and Cosenza are highlighted

Fig. 17
figure 17

Distribution of the univariate statistics of the Economic well-being indicators: quartiles (grey double lines), averages by geographical area (thin lines). The broken lines representing Catanzaro and Cosenza are highlighted

Conclusions and Further Developments

The analysis of the relationships among complex and unobservable factors can be enhanced using a quantile approach to PLS–PM, which allows to highlight the unobserved heterogeneity that could be overlooked by the classic estimation of the average effects. In this paper QC–PM is also proposed in a predictive perspective providing the best estimation, and thus the best model, associated to each statistical unit. For a given statistical unit, the quantile associated to the best model, in the paper named ”best quantile”, condenses the effect played by the regressors on the position of the unit in the conditional distribution of the dependent variable. QC-PM lacks a statistical test for measurement invariance, which allows a reliable comparison among path coefficients estimates over quantiles. However, as above-mentioned, a possible variant of QC-PM can be used, fixing the quantile in the measurement model to the median and changing only the quantile in the structural model. We applied this variant, but we did not find relevant differences in results, probably because there are not evident differences among loadings over quantiles.

The potential arising from a joint use of a PLS–PM and a QC–PM are exploited to explore the relationships among the Health outcomes and the levels of Economic well-being and Education in Italian provinces. The model is defined using a subset of dimensions and indicators of the well-known BES dataset produced by ISTAT at NUTS3 level (ISTAT 2019a). The underlying idea is that health levels and health inequalities at local level can be assessed more in depth taking into account both the observed and the unobserved heterogeneity. In fact similar levels of health can result from very different performances, when they are achieved in different socio-economic conditions. The study provided a multidimensional analysis of health inequalities at local level, in the effort to capture the unobserved heterogeneity that can be explained taking into account the relationships among Health, Economic well-being and Education.

The results of the PLS–PM confirmed that there is a relationship between Education and Health, as we hypothesized in the theoretical model. The QC–PM also revealed the existence of a relevant relationship between Economic well-being and high levels of Health and a decreasing impact and contribution of Education to increasing levels of Health. The geographical area also provided useful information for understanding differences in Health levels and in the relations among Health, Economic well-being and Education. Globally, the PLS–PM confirmed that the three well-being domains are highly correlated in all geographical areas, except the north-east. Deepening the analysis in a predictive perspective the best quantile predictions resulted much more efficient than PLS–PM, especially concerning the north-east subgroup. The observed health results of each province could then be assessed taking into account jointly its placement in the unconditional distribution, the results of the best quantile prediction and its geographical location. Looking at life expectancy of males, many provinces, almost all located in the south and islands, get low but better results than expected; in contrast, the provinces that get high but lower results than expected are less numerous and none of them is southern or insular.

Future research will explore the strengths of QC-PM for prediction outside the data sample used for estimating the model (out-of-sample prediction) and will try to deepen the knowledge about the determinants of health differences at the local level, including in the model the relationships among health and other well-being assets, such as the environment, the quality of health services, the exposure to risky jobs and other vulnerability factors.