1 Introduction

The present study applies a “Multiverse”-based model to the Italian Equitable and Sustainable Well-being indicators (a.k.a. “Benessere Equo e Sostenibile”, BES). The proposed exploratory analysis concerns the hypothetical claim that variation in future fertility can be explained by indicators of well-being.

The concept of the Vibration of Effect (VoE), originally introduced by Patel et al. (2015), has been applied to biomedical databases (Tierney et al., 2021; Palpacuer et al., 2019). The underlying idea is to test a hypothesis of dependency of an outcome variable on various explanatory factors, all gathered from a single, large source. Thus, the main purpose of VoE is to test the evidential strength of a causal claim about a variable within the dataset in relation to the outcome variable. This is achieved by constructing a synthetic sample of estimates that constitute, at least theoretically, a multidimensional density distribution. Unlike bootstrap-based methods, this distribution has no element of randomness. It is instead fully determined by a process of differentiation of the specifications of the model, similarly to the methodologies for sensitivity analysis proposed by Leamer (1983, 1985), Raftery (1995), and Sala-I-Martin (1997), and to other computationally intensive methods of inference.

The ultimate goal of performing a VoE analysis on a dataset is to identify whether or not regressors are evidently relevant variables in an explanatory model of the outcome. While other techniques for reducing the number of variables in a multivariate model exist, such as penalised regressions (LASSO, Ridge, etc.), the VoE methodology allows for the evaluation of informative statistics that directly express the uncertainty regarding the effectiveness of individual regressors, similarly to quasi-experimental methods. Compared to penalised regressions, which tend to select a single model that is functional for decision-making (James et al., 2017; Oelker & Tutz, 2017; Young, 2019), VoE enables testing the hypothesis that there is no empirical basis for a dataset to explain an outcome. Furthermore, in this application, the VoE methodology has been expanded with some later additions derived from the methods of Multiverse Analysis (Gelman & Carlin, 2014; Steegen et al., 2016; Muñoz & Young, 2018; Simonsohn et al., 2020). These additions were necessary to better adapt the method to the panel structure, which is also particularly common in socio-demographic inferential data analysis.

Section 2 offers an account of this theoretical framework and of its implementation in the adopted methodology, showing how the underlying principles are applied in the process of inference. Adopting a multiversal model to sample the distribution of estimates for a parameter makes it possible not only to quantify the uncertainty regarding the estimate of that parameter, but also to assess how much of the hypothetical variation in that estimate is related to the sample distribution of the fit data (e.g. the standard error), and how much is instead related to the theoretical choices of the analyst in specifying the statistical model (e.g. the so-called modelling error). In a regular analysis, the latter component of variation of the estimate is often not explicit and even neglected, so the uncertainty behind a scientific claim is systematically underestimated. The adoption of a multiversal model as an extended framework for VoE is aimed at correcting the statistical estimation of the hypothetical variance of the estimate, in order to better assess the overall robustness of statistically significant results. Particular attention is reserved for quantifying the risk of claiming an effect when a considerable number of alternative, statistically significant specifications of the model would lead to claiming an effect of a different sign. This case, particularly relevant for therapeutic choices as well as for social policies, is formalised into a multiversal estimator of this risk (the “Type Sign Error”, or TSE), and this estimator is employed to corroborate the analysis of the suitability of the social indicators of BES as determinants or predictors of future fertility.

In Sect. 3 the hypothesis that sustainable well-being predicts future fertility is tested on a panel dataset of 48 BES indicators observed at Province level over a time span of 7 years. For each indicator, a multiverse of 1024 specifications of the regression is fit, based on an enriched version of the so-called Vibration of Effect (VoE) model of multiversal sampling. The results of the global analysis are not robust enough to assert that there is evidence for linear effects of BES indicators on Italian fertility. Another relevant result is that the model-based component of the variance of the multiversal estimates is higher than the sampling-based component. This means that, at least for the proposed task of investigating the effectiveness of well-being on fertility, the statistical significance of a linear effect depends more on the pertinence of the theoretical background of the model than on the observed variability in BES data. The multiverse of estimates for some individual indicators is represented through volcano plots; these are discussed with the aim of highlighting common issues in modelling causal effects on fertility.

The employment of a multiversal method also serves to illustrate the problem of legitimising ‘science-based’ (‘evidence-based’, etc.) policy decisions. Since the scientific knowledge behind the adoption of a public policy can be characterised a posteriori as ideologically driven, excessive sensitivity of the estimates to the assumptions of the model specification should be made more explicit, in order to allow a better-informed scientific evaluation of the evidence for the proposed policy. This problem is discussed in Sect. 4, together with some methodological developments based on ‘Multiverse’-based designs.

2 Theoretical Background

2.1 Vibration of Effect and other Multiversal Methods

Multiversal methods include all the methods based on the practice of fitting multiple specifications of the same conceptual model of a scientific theory on the same data (Footnote 1), and then reaching a scientific result that accounts for the variability observed across the sample of these specifications (Gelman & Loken, 2014; Patel et al., 2015; Steegen et al., 2016; Young & Holsteen, 2017; Simonsohn et al., 2020). All multiversal methods have in common the collection and comparison of fit statistics from a systematically differentiated multiplicity of specifications of the same conceptual model.

The classic example of systematic differentiation of a single conceptual model into multiple specifications is the following: for any binary regression that adopts a logit link function, there also exists a binary regression that adopts a probit link function (Gelman & Loken, 2014). The estimates (and the associated \(\mathrm{{p}}\)-values) of these two specifications are expected to be close to each other, yet not identical. Other elements of the regression, such as the operational definition of the variables, the control structure and the functional form of the model (i.e. the “formula”), the estimators, etc., are further sources of differentiation of the specifications of the same conceptual regression model (Simonsohn et al., 2020; Del Giudice & Gangestad, 2021). The difference between estimates is quantifiable, hence it has a variance. This variance is considered the model-based component of the hypothetical variance of estimates of the parameter.

Among the other fit statistics that can be collected per specification, there are the p-values, \(R^2\), etc. After differentiation of many specifications into a set \({\hat{J}}\), this set can be treated as a (non-random; see Footnote 2) ‘multiversal’ sample. Statistics can be derived from a ‘multiversal’ sample with a loose analogy to regular sampling. For example, assuming that the model variance in the estimate is finite, it makes sense to compute a multiversal average, multiversal standard deviation, etc. (Young & Holsteen, 2017). The commonly analysed statistics in multiversal methods are the estimates of the coefficient of the ‘first regressor’ \({\hat{\beta }}_x\) and the p-values of their tests of statistical significance in simple or multiple regression.

Vibration of Effects analysis (VoE) has been developed as a confirmatory method to check the statistical robustness of etiological models in observational biomedical databases (Patel et al., 2015; Palpacuer et al., 2019). Tierney et al. (2021) extended the goals of the method to elicit hypotheses of causal structure in observational studies. In the latter case, VoE acts as a formalisation of the customary research practice of checking the coefficients of effect size on a single outcome variable (a ‘Y’) of many potential linear predictors from a pre-collected database (the ‘main regressors’, X). As an exploratory method, it looks for empirical insights before formalising any explicit hypothesis on the elementary biomarkers (Footnote 3). The procedure to collect a multiversal sample for VoE Analysis is the following:

  1. A coherent database of indicators (or biomarkers, etc.) is identified.

  2. A model of regression is fully specified.

  3. An outcome variable Y is identified. An exemplary Y is a variable of mortality in the statistical units (e.g. patients, etc.). In Medicine, mortality is both a relevant issue and a reliable one: it is a naturally binary state with no ambiguity in measurement, e.g. compared to psychometric indicators. These are excellent characteristics of an ideal Y.

  4. A set \({\textbf{Z}}: \{Z_1, Z_2,...\}\) of control variables is identified within the variables of the dataset.

  5. The other variables in the database that are not \(Z \in {\textbf{Z}}\) are identified as independent X.

  6. Data are fit on each regression \(Y \sim X\), adopting all the elements of the power set \(2^{{\textbf{Z}}}\) as control structures in the functional form (pre-specified in the regression model). Hence, each vector \({\textbf{x}}\) of the database is fit \(2^{\mid {\textbf{Z}} \mid }\) times (Footnote 4). This operation is the “vibration”, while the “effect” refers to the parameter \(\beta _x\).

  7. The vectors of all the estimates \({\hat{\beta }}_x\), and of their associated standard errors (\(se({\hat{\beta }}_x)\)) and p-values, grouped per X (\(2^{\mid {\textbf{Z}} \mid }\) rows for each X), make the \({\hat{J}}\) multiversal sample for the VoE analysis.
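Steps 6 and 7 can be illustrated with a minimal sketch. It is not the estimation routine of this study: it runs on synthetic cross-sectional data, uses plain OLS rather than panel estimators, and approximates the two-sided p-value with a normal distribution. The helper names `ols` and `vibrate` are hypothetical and introduced only for illustration.

```python
import itertools
import math
import numpy as np

def ols(y, X):
    """OLS with an intercept; returns coefficients and their standard errors."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

def vibrate(y, x, Z):
    """Fit y ~ x + S for every subset S of the columns of Z (steps 6 and 7)."""
    rows = []
    k = Z.shape[1]
    for r in range(k + 1):
        for subset in itertools.combinations(range(k), r):
            X = np.column_stack([x] + [Z[:, j] for j in subset])
            beta, se = ols(y, X)
            t = beta[1] / se[1]  # index 1 is the coefficient of x
            # Two-sided p-value under a normal approximation (an assumption).
            p = math.erfc(abs(t) / math.sqrt(2))
            rows.append((subset, beta[1], se[1], p))
    return rows

# Synthetic data (assumption): three controls, one of which confounds x and y.
rng = np.random.default_rng(0)
n, k = 200, 3
Z = rng.normal(size=(n, k))
x = 0.5 * Z[:, 0] + rng.normal(size=n)
y = 0.3 * x + Z[:, 0] + rng.normal(size=n)

J = vibrate(y, x, Z)
print(len(J))  # one row per element of the power set of Z
```

With three controls the sketch returns \(2^3 = 8\) rows, one per control structure; each row holds the subset of controls, the estimate of \(\beta _x\), its standard error and its approximate p-value.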

VoE assumes that if these seven steps are conducted ‘reasonably well’, then the sample statistics of the multiverse are representative of, or at least informative about, the latent uncertainty of the inferential model (Footnote 5). In the application (Sect. 3), the VoE template is altered in two ways: (i) \({\hat{\beta }}_x\) is estimated through four general models and not only one, and (ii) Y is specified as 8 different fertility rates.

Indeed, in Young and Holsteen (2017) the sample variance of the standard estimates \(s^2({\hat{\beta }}_x)\) (the estimator for “model variance”) is explicitly assumed to be informative about the sensitivity of the inferred effectiveness of the relation between the independent X and the outcome Y: a lower model variance is indicative of a robust linear effect, mirroring the standard error of the estimate.

Behind this assumption is the idea that the interplay of total survey errors (Groves & Lyberg, 2010) and misspecification error can induce a rate of false positives higher than the significance level \(\alpha \) of the test, helping mischievous (or just unaware) authors to present as statistically validated some theories that would not be validated under a more rigorous set of assumptions. This specific malpractice in scientific research is known as p-hacking (Head et al., 2015; Imbens, 2021).

A VoE analysis also checks for the so-called Type Sign Error (Gelman & Carlin, 2014), or “Janus effect” (Patel et al., 2015). This error concerns the case in which the sign (i.e. positive vs. negative) of \({\hat{\beta }}_x\) is not unique across the sub-sample of the statistically significant estimates. A high Type Sign Error (TSE) indicates that the estimate is sensitive to conditional adjustments of the model.
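As an illustration only, one possible way to operationalise a sign-error statistic is the share of statistically significant estimates that carry the minority sign. This is an assumption made for the sketch, not the formal TSE estimator of this study, and the function name `type_sign_error` is hypothetical.

```python
def type_sign_error(estimates, p_values, alpha=0.05):
    """Share of significant estimates carrying the minority sign.

    Returns 0.0 when all significant estimates agree in sign, values up to
    0.5 when the signs are evenly split, and None when no estimate is
    significant. This definition is an illustrative assumption.
    """
    signif = [b for b, p in zip(estimates, p_values) if p < alpha]
    if not signif:
        return None
    n_pos = sum(1 for b in signif if b > 0)
    n_neg = sum(1 for b in signif if b < 0)
    return min(n_pos, n_neg) / len(signif)

# Toy multiverse of five specifications (made-up numbers).
betas = [0.4, 0.3, -0.2, 0.1, -0.5]
pvals = [0.01, 0.03, 0.04, 0.60, 0.20]
tse = type_sign_error(betas, pvals)
print(tse)
```

In the toy example three estimates are significant at the 5% level and one of them is negative, so the statistic equals one third.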

2.2 Sustainable Well-Being and Fertility

In the words of Alaimo et al. (2022), the Italian Equitable and Sustainable Well-being (Benessere Equo e Sostenibile, BES) is “probably the worldwide most advanced attempt to pursue the beyond GDP perspective effectively”. The purpose of this panel dataset is to develop a multidimensional approach that complements the indicators related to economic measures with measures of inequality and sustainability. The aim is to depict this complex system of indicators in a comprehensive way through a very wide, annually updated set of measures, organised into 12 conceptual “domains”, each corresponding to related dimensions along the path of well-being (Footnote 6).

Mostly collected by the Italian National Institute of Statistics, the coverage of indicators keeps growing over time, including data from social surveys and administrative records. These records offer some thematic and analytical insights both at the territorial level and with respect to the different socio-demographic characteristics of the population. They are provided at different levels of territorial administration: some indicators are also recorded at Provincial level (Chelli et al., 2016) although these may suffer from missingness of data.

This work is constantly updated to reflect all improvements in the production of statistical measures within the National Statistical System and the progressive extension of the “mapping” of the indicators proposed by United Nations Inter-agency and Expert Group on Sustainable Development Goals (UN-IAEG-SDGs).

In the present study, the indicators of Italian well-being are hypothesised to be determinant causes of fertility, that is, the average number of children per woman of fertile age. This hypothesis is not novel in the literature on BES: De Rose et al. (2020) found significant results for both a positive impact on fertility of some composite indicators of sustainable well-being (e.g. “quality of work”, “quality of services”) and a negative impact of other dimensions (“level of employment”, “level of education”). The authors admit that these results are truly substantial only for Northern territories and that, in general, estimates may differ across territorial clusters of the country. This territorial differentiation seems in line with similar findings on determinants of health (Ciccarelli et al., 2022).

Including literature on the impact of well-being on fertility in other countries, there are three reasons for adopting a multiversal approach to check the soundness of the theory that variation in fertility rate can be determined by sustainable well-being:

  1. Studies on human fertility, and in particular on the estimation of the impact of social determinants on national fertility, are of particular relevance in the Social Sciences. There is a conspicuous international programme of research devoted to understanding why fertility rates are low in Western countries. Historically, this programme has been related to variation in economic performance and economic inequalities (Billari & Kohler, 2004; Sobotka et al., 2011; Balbo et al., 2013; Barbieri et al., 2015), yet these economic models should always face concurrent, and possibly more complex, explanations for the variation of fertility across time and nations. For example, cultural shifts or changes in family structure are well-researched, too (Tropf & Mandemakers, 2017; Rodrigo-Comino et al., 2021; Giuntella et al., 2022). For BES or similar projects willing to “go beyond the GDP perspective”, the development of rich-but-robust models of fertility may be one of the most prestigious challenges. Italy, with its persistent yet paradoxical (Kertzer et al., 2009) low fertility, is a good candidate to approach the riddle of fertility in the interplay of economic, sociological, and cultural determinants.

  2. By the admission of almost all authors who approached it, a paradigmatic theory like Demographic Transition (Galor, 2012) may not be sufficient to fully explain (and govern) the observed shifts in fertility and in the age pyramids of nations (Graham, 2021; Sigle, 2021). Given the abundance of studies, this theoretical impasse should not be attributed to a lack of effort but to the inherent complexity of human behaviour regarding reproductive choices. Another, related yet different, issue regards the operational definition of fertility itself. The Total Fertility Rate (TFR) accounts for the number of newborns divided by the number of women within the range of years in which women are considered fertile. This indicator lends itself to confounding those effects that are localised across specific age ranges. Let us assume, just for example, that an increase in female employment implies a decrease in fertility under 30, but also an increase above 40. This dynamic of ‘postponement’ (Billari & Borgoni, 2005; Ni Bhrolchain & Beaujouan, 2012) would be lost by any estimator of a linear effect adopting TFR as the unique Y. There are many age-specific fertility rates to test, so testing them all requires an analytical framework that accounts for the risk of p-hacking or of TSE. This gap can be filled by multiversal methods.

  3. Demographic models of fertility, as background for social policies, tend to be objects of political controversy. For example, in Hungary, policies implemented for raising fertility have been accused of reflecting a biased political agenda (Szekulesz, 2022). More generally, hypotheses on determinants of fertility can often be characterised a posteriori with a political bias. This is largely an epiphenomenon of the complexity and the relevance of fertility: many variables may predict variation in fertility with only a spurious connection to it, so different political actors have preferences for some theories and some policies that are more or less related to religious, moral, or ideological beliefs. Under these conditions, there are personal incentives for p-hacking a significant effect in order to appear “to be right about it”. In Saraceno et al. (2021), multiversal models have been successfully employed to channel a diversity of personal ‘priors’ into a scientific methodology, and parties converged towards a shared and univocal interpretation of results.

3 Application

3.1 Aims of the Application

Compared to other multiversal methods, VoE is convenient for checking the risks associated with modelling complex dynamics in databases that collect an extensive number of variables but a relatively low number of observations and statistical units. These features make VoE a desirable ‘template model’ for investigating large socio-economic observational datasets.

Building an enriched model of VoE on the BES database has two main goals. The first is to perform a global exploratory evaluation regarding the hypothesis that some indicators of sustainable well-being can explain or predict the variation in Italian fertility. The second goal is to propose the multiversal method as a general methodology for detecting variables that are:

  1. robust predictors of future fertility,

  2. definitely not related to future fertility,

  3. ambiguous, because they could lead to false, ideologically driven claims. For example, a variable with a high rate of significant estimates across \({\hat{J}}\) that is also affected by high TSE (Janus effect) is an ambiguous predictor.

The prior expectation is that this exploratory analysis should not identify a high number of significant variables, because most BES variables have theoretically no direct connection with human fertility. No prior expectation can be elicited on the frequency of the Janus effect. An ideal and desirable result is to separate a small cluster of significant variables (with a unique sign for each) from a majority of never-significant variables.

3.2 Data Sources

Data on fertility have been imported from the Health For All database (HFA) edited by the Italian National Institute of Statistics (ISTAT; see Footnote 7). A fertility rate is defined as follows: given an interval of age (in years, e.g. from 15 to 19 years old), a fertility rate is the total number of newborns from mothers aged within the interval, divided by the number of women aged within that interval. HFA reports 8 intervals of mother’s age: 15–49 (Total Fertility Rate, TFR); 15–19, 20–24, 25–29, 30–34, 35–39, 40–44, 45–49. TFR can be assumed as the ‘main’ fertility rate, while the other operative definitions are its alternative specifications.

Two other variables are imported from HFA: frgn_rate, the percentage of registered inhabitants without citizenship of the European Union; and marrg_rate, the rate of marriages per capita, a measure indicative of the propensity of people to form new families. In Italy, new families and a higher presence of households of immigrants are positively correlated with fertility in the population.

The statistical units of the BES are 100 Italian Provinces (Footnote 8). The panel regards recorded values in the BES database from the year 2013 to the year 2020. In the years before 2013, Italy sustained a considerable period of economic and social crisis and initiated significant reforms with an impact on the job market, demographic dynamics, and lifestyles of its inhabitants. After 2020, job markets, demographic trends, and lifestyles have been severely impacted by the pandemic threat of COVID-19. This is also the same timespan considered by De Rose et al. (2020).

The criterion for a BES variable to be included in the analysis is the following: there are recorded values for at least two consecutive years between 2013 and 2020, for at least one Province. This criterion is functional to adopting advanced procedures of estimation (see Sect. 3.3). The indicator for the extension of “Urban Green” (extension of non-historical parks and gardens in urban areas) has been excluded too, because the data are unreliable, showing excessive variance. After this selection, of the original 78 variables of the BES at the Province level, 51 BES variables are left, for a total of 53 variables after adding frgn_rate and marrg_rate. Of these, in Sect. 3.4, 5 variables have been selected as Z controls, so the analysis regards only 48 independent X. A list of these 48 independent variables is in Table 2.

3.3 Type of Regressions and Estimators

3.3.1 Lagged Fixed Effects (LFE)

A panel dataset is characterised by a vector of identifier values \({\textbf{i}}: \{i_1, i_2,...\}\) of the statistical unit and by a vector of time units \({\textbf{t}}: \{t_1, t_2,...\}\). A common way to ascertain whether the value \(y_i\) of variable Y depends on the value \(x_i\) of an independent variable X is to estimate the standardised linear coefficient \(\beta _x\) of the lagged regression:

$$\begin{aligned} y_t = \alpha _{i} + \beta _x x_{t-1} + \beta _{{\textbf{z}}} {\textbf{z}}_{t-1} + \epsilon _{i,t} \end{aligned}$$
(1)

where: \(\alpha _{i}\) is a parametric effect that varies across statistical units, but is invariant (‘fixed’) across time; \({\textbf{z}}: \{{\textbf{z}}_1,{\textbf{z}}_2,...\}\) is the set of covariates that function as control variables, with \(\beta _{\textbf{z}}\) being the vector of their coefficients; and \(\epsilon _{i,t}\) is the error term (Reed, 2015; Bellemare et al., 2017). \(\alpha _{i}\) and \(x_{t-1}\) are the salient features of the model in Eq. 1, which in this context takes the name of 1-Lagged Fixed Effect model (1LFE).

If a statistical test finds evidence to reject the null hypothesis \(\beta _x = 0\) in Eq. 1, the statistical significance of the estimate \({\hat{\beta }}_x\) needs an explanation. Indeed, the lagged regression is adopted in order to connect the predictive model with a causal interpretation of the estimates: \(x_{t-1}\) happens before \(y_{t}\), so the hypothesis that the estimate \({\hat{\beta }}_x\) is influenced by the effect of y on x, i.e. by “reverse causality” (Leszczensky & Wolbring, 2022), can generally be excluded. Through lagging x, and assuming the absence of serial correlation over time (autocorrelation) in X and Y (Footnote 9), the model excludes hypotheses of reverse and reciprocal causality (Fig. 1, models A and B).

Since a period of pregnancy separates conception from birth, results from 1LFE should be compared with a 2-Lagged Fixed Effect model (2LFE):

$$\begin{aligned} y_t = \alpha _{i} + \beta _x x_{t-2} + \beta _{{\textbf{z}}} {\textbf{z}}_{t-2} + \epsilon _{i,t} \end{aligned}$$
(2)
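A minimal numerical sketch of the lagged fixed-effects idea in Eqs. 1 and 2 is given below. It assumes a balanced panel stored as arrays of shape (units, periods), omits the controls \({\textbf{z}}\) and the standardisation of variables, and uses the within (demeaning) transformation to remove \(\alpha _i\). The function name `lagged_within_beta` and the synthetic data are hypothetical.

```python
import numpy as np

def lagged_within_beta(y, x, lag=1):
    """Within (fixed-effects) estimate of beta_x for y_t on x_{t-lag}.

    y and x have shape (units, periods), with no missing values and lag >= 1.
    Controls z are omitted to keep the sketch short.
    """
    y_cur = y[:, lag:]    # y_t
    x_lag = x[:, :-lag]   # x_{t-lag}, aligned with y_t
    # Demeaning within each unit removes the time-invariant effect alpha_i.
    y_dm = y_cur - y_cur.mean(axis=1, keepdims=True)
    x_dm = x_lag - x_lag.mean(axis=1, keepdims=True)
    return float((x_dm * y_dm).sum() / (x_dm ** 2).sum())

# Synthetic, noiseless panel (assumption): y_t = alpha_i + 0.5 * x_{t-1}.
rng = np.random.default_rng(1)
units, periods = 100, 8
alpha = rng.normal(size=(units, 1))
x = rng.normal(size=(units, periods))
y = np.empty_like(x)
y[:, 0] = alpha[:, 0]
y[:, 1:] = alpha + 0.5 * x[:, :-1]

beta_hat = lagged_within_beta(y, x, lag=1)
print(beta_hat)
```

Because the synthetic data contain no noise, the 1-lag estimate recovers the coefficient 0.5 up to floating-point error; calling the function with `lag=2` gives the analogue of Eq. 2.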

3.3.2 Lagged First Differences (LFD)

The case of covariation of x and y can also be explained through triadic causal structures called forks: a third value u determines both x and y (Pearl, 2015). In this case, even if x has a non-zero mediating effect on y, \({\hat{\beta }}_x\) would be biased (the correlation is ‘spurious’) unless it holds that if \(u \in U\) then \({\textbf{u}} \in {\textbf{Z}}\), where \({\textbf{u}}\) is a vector in the panel that is a representative sample of U.

This bias is commonly referred to as ‘confounder bias’: the U in the fork, when not controlled, is the ‘confounder’ (Fig. 1, model C). Assuming \({\textbf{Z}}\) does not contain \({\textbf{u}}\), then U is ‘hidden’: a necessary condition for the unbiasedness of lagged models is that the effects of U on X and of U on Y are simultaneous, which is the canonical case for forks. However, if U is autocorrelated and the effect of U on Y is delayed, compared to the effect of U on X, then lagged models are biased by the confounder U. This latter case is atypical, yet it is structurally equivalent to the canonical case of spurious serial correlation over time in X (see Fig. 1, model G).

While the assumption that values in the vector \({\textbf{y}}\) are not autocorrelated in the short term is sound, it is not plausible to assume that it holds for each \({\textbf{x}}\) in the BES. To de-bias the estimate, one can adopt the estimator for \({\hat{\beta }}_x\) in the 1-Lagged First Difference model (1LFD):

$$\begin{aligned} \begin{aligned}&\Delta (y) = y_t - y_{t-1}; \\&\Delta (x) = x_{t-1} - x_{t-2}; \\&\Delta ({\textbf{z}}) = {\textbf{z}}_{t-1} - {\textbf{z}}_{t-2}; \\&\Delta (\epsilon ) = \epsilon _t - \epsilon _{t-1}; \\&\Delta (y) = \beta _x \Delta (x) + \beta _{{\textbf{z}}} \Delta ({{\textbf{z}}}) + \Delta (\epsilon ) \end{aligned} \end{aligned}$$
(3)

which is resilient to serial correlations in X and to delayed confounding effects of hidden U: even if u biases x, this bias affects \(x_{t-1}\) and \(x_{t-2}\) equally. Since these terms enter the regression with opposite signs, the bias on \(\beta _x\) from u is stochastically neutralised, i.e. it holds

$$\begin{aligned} {\textbf{E}}(bias(x_{t-1}) - bias(x_{t-2})) = 0 \end{aligned}$$
(4)

By the same principle, since the \(\alpha _i\) are assumed time-invariant, they are neutralised in the LFD, i.e. it holds \(\Delta (\alpha _i):= 0\) for all statistical units i (Allison, 2009; Levanon et al., 2009).
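The cancellation of \(\alpha _i\) can be made explicit with a short derivation, sketched here under the assumption that Eq. 1 holds at both time t and time t-1, with the unit index left implicit on y, x and \({\textbf{z}}\) as in Eq. 1. Writing the model at the two times and subtracting gives:

$$\begin{aligned} y_t&= \alpha _{i} + \beta _x x_{t-1} + \beta _{{\textbf{z}}} {\textbf{z}}_{t-1} + \epsilon _{i,t} \\ y_{t-1}&= \alpha _{i} + \beta _x x_{t-2} + \beta _{{\textbf{z}}} {\textbf{z}}_{t-2} + \epsilon _{i,t-1} \\ y_t - y_{t-1}&= (\alpha _{i} - \alpha _{i}) + \beta _x (x_{t-1} - x_{t-2}) + \beta _{{\textbf{z}}} ({\textbf{z}}_{t-1} - {\textbf{z}}_{t-2}) + (\epsilon _{i,t} - \epsilon _{i,t-1}) \\ \Delta (y)&= \beta _x \Delta (x) + \beta _{{\textbf{z}}} \Delta ({\textbf{z}}) + \Delta (\epsilon ) \end{aligned}$$

The last line coincides with the regression in Eq. 3: the unit effect drops out because it enters both periods with the same value.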

Again, by analogy to Eq. 2, the lag in Eq. 3 can be raised up to two years (2LFD):

$$\begin{aligned} \begin{aligned}&\Delta (y) = y_t - y_{t-1}; \\&\Delta (x) = x_{t-2} - x_{t-3}; \\&\Delta ({\textbf{z}}) = {\textbf{z}}_{t-2} - {\textbf{z}}_{t-3}; \\&\Delta (\epsilon ) = \epsilon _t - \epsilon _{t-1}; \\&\Delta (y) = \beta _x \Delta (x) + \beta _{{\textbf{z}}} \Delta ({{\textbf{z}}}) + \Delta (\epsilon ) \end{aligned} \end{aligned}$$
(5)

Fig. 1 Models of causal effects. A black straight arrow symbolises a correctly identified causal effect. A dashed line represents a hypothetical causal connection that is not specified. A dotted line with no arrows represents correlation. The red colour represents the induction of a bias. A red dotted line is a spurious correlation. A black line ending in a square represents that the effect of a variable has been successfully controlled. In the model D, the reciprocal dashed arrows represent that the model does not identify univocally the direction of the effect. This identification happens instead in model F, through the inclusion of lagged time variable. In model E, controlling for Z induces a collider bias (Pearl, 2015). Collider bias induces an artificial correlation between X and Y, which is wrongly identified as a causal effect of X on Y, hence the empty arrow. In model G delayed confounding effects from autocorrelated U affect both X and Y over time. These effects induce spurious correlation between X and Y and spurious serial correlations in X and Y. (Color figure online)

In this regard, as demonstrated by Vaisey and Miles (2017), opting for an LFD model prevents confounding if the lags in a fork are correctly identified by the lags in the Deltas (Footnote 10).

3.4 Control Structure

Vibration of Effects analysis (VoE) requires identifying the set \({\textbf{Z}}: \{Z_1, Z_2,...\}\) in order to expand it into its power set \(2^{{\textbf{Z}}}\). The elements of \(2^{{\textbf{Z}}}\), combined by addition, are the specifications of the control structure of the VoE model.

The causal relation between Y and Z is not the object of investigation of the VoE. Indeed, even if the Zs are technically treated as control variables, their role is not to adjust the estimate \({\hat{\beta }}_x\) towards a value that is a priori assumed in the hypothesis of the specification to be closer to the true value. Instead, across their power set, they alternate their presence and absence in the estimation function, allowing \({\hat{\beta }}_x\) to express variation across the specifications. Since the variation is to be attributed to the elements of \(2^{{\textbf{Z}}}\), the choice of \({\textbf{Z}}\) is neither trivial nor a-theoretical. In the scientific literature on multiversal methods, Zs are usually chosen because they are consistently associated with the outcome variable Y or because there is a strong rationale for Z being a cause of Y.

In the current application, five control variables are elicited. The first two are already mentioned in Sect. 3.2: frgn_rate and marrg_rate. The rate of foreigners is consistently positively correlated with an increase in fertility in Italy, to the point that the literature suggests that foreigners sustain a healthier age pyramid in Italy (Strozza et al., 2016; Benassi & Carella, 2022). The relationship between marriage and newborns is more ambiguous because it is rooted in cultural schemes that may change over time. However, the presence of marriage rates as controls is pervasive in the literature about fertility in Italy, hence the inclusion in \({\textbf{Z}}\).

The third variable is the rate of graduated inhabitants of the Province (grad_rate), which is included in BES. It has been selected as a proxy value to measure the general level of education in the population. Education is considered a strong predictor of fertility, especially in the long run, both because it has a direct effect (e.g. through ‘postponement’) and because a variation in the rate of highly educated people spuriously reflects cultural changes (Impicciatore & Dalla Zuanna, 2017; Hahn et al., 2018; Impicciatore & Tomatis, 2020).

The fourth and fifth variables are included in BES, too. These are income_pc (the average income per capita) and employment (the rate of people with an occupation). Income and employment are associated with lower fertility in the popular “Theory of Demographic Transition”, which states that as incomes and standards of living increase through modernisation, the fertility rate decreases; as people become more economically secure, they concentrate investments in the higher education of a small number of children. Additionally, income rises because a larger share of people, including a fair share of women, is employed. As a result, marriages and pregnancies are delayed, and families have fewer children overall (Galor, 2012). Recent developments question whether this relationship should be modelled as a linear effect: there is also evidence that in advanced societies, richer families have more children because they can assure a sufficiently high standard of living for all of them (Turner & Robbins, 2022). So, as a variant of \({\textbf{z}} = \texttt {income\_pc}\), the quadratic term \(\texttt {income\_pc}^2\) is also included; the same addition is made for \(\texttt {employment}\) (Cazzola et al., 2016). In addition, these two variables are collinear: in provinces with higher employment, a higher family income is expected. In a regular control structure, this fact would not be ideal, yet given that VoE is an analysis of robustness, stressing the model by introducing collinearity among the predictors helps to track how it would inflate variation in the estimates.

Table 1 reports the standardised coefficient estimates, with the associated standard errors and p-values, for the five \({\textbf{z}}\) in the multivariate regression of y on \({\textbf{z}}\).

Table 1 Multivariate regression of the control structure on TFT

Table 1 helps to determine the control structure of the enriched VoE model: assuming \(\alpha =.05\), frgn_rate reaches significance in three models, yet it is positively associated in only one case out of three. marrg_rate presents a much more robust pattern: it is consistently associated when the lag is minimal, in accordance with the theory that variation in the rate of marriages determines natality in the following year, because people marry already planning a future pregnancy. grad_rate exhibits a double Janus effect: it is significant for 1LFE and 2LFD, but with opposite signs. Income shows strong significance in Fixed Effects models but not in First Difference models, and the hypothesis that it could take a quadratic form (convex effect) is unsubstantiated. Employment, on the other hand, shows significant p-values for the quadratic term in 1-Lagged models. The identified control set \({\textbf{Z}}\) will contain these five variables, with the only addition that employment will always appear together with its quadratic term.

3.5 Evaluation Measures

3.5.1 Robustness Ratio

A summary statistic that is helpful to evaluate independent variables is the Young-Holsteen Robustness Ratio, or RR (Young & Holsteen, 2017):

$$\begin{aligned} RR(X) = |\bar{\beta _x} |\cdot \frac{1}{ \sqrt{ ({\bar{se}}({\hat{\beta }}_x))^2 + (s_{{\hat{J}}}(\beta _x))^2} } \end{aligned}$$
(6)

where \(\bar{\beta _x}\) represents the average estimate of \(\beta _x\), \({\bar{se}}({\hat{\beta }}_x)\) is the average parametric standard error of \(\beta _x\) and \(s_{{\hat{J}}}(\beta _x)\) is the sample standard deviation of the vector of estimates of \(\beta _x\), which is also called ‘model error’ (Young & Holsteen, 2017).

Young and Holsteen (2017) give a parametric interpretation of RR(X): given an exploratory analysis including many X independent variables, \(X \mid RR(X) < 2\) should not be claimed as causal determinants of Y. As mentioned in Sect. 3.1, the ideal case is to observe a few variables with a high RR and many others with a low RR, since this would help to identify patterns of causality while excluding unrelated variables.

Since the composition is still linear, if \(s_{{\hat{J}}}(\beta _x) > {\bar{se}}({\hat{\beta }}_x)\), then the estimates are very sensitive to model variance, which usually remains hidden in a regular analysis. This is a severe red flag, signalling that only a multiverse analysis can test the statistical significance of the relation between the variables.
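As an illustration of Eq. 6, the following is a minimal Python sketch of how RR could be computed for one regressor from the vector of its multiversal estimates and the vector of their parametric standard errors. The function name `robustness_ratio` and the input values are hypothetical, and the sample standard deviation is assumed to use the usual \(n-1\) denominator; it is not the code used in the study.

```python
import math
from statistics import mean, stdev


def robustness_ratio(estimates, std_errors):
    """Robustness Ratio of Eq. 6 for a single regressor X.

    estimates  -- one coefficient estimate per specification
    std_errors -- the parametric standard error of each estimate
    Returns (RR, average standard error, model error).
    """
    if len(estimates) != len(std_errors) or len(estimates) < 2:
        raise ValueError("need two or more specifications of equal length")
    mean_beta = mean(estimates)        # average estimate of beta_x
    mean_se = mean(std_errors)         # average parametric standard error
    model_error = stdev(estimates)     # sample SD across specifications (assumed n-1)
    total_error = math.sqrt(mean_se ** 2 + model_error ** 2)
    return abs(mean_beta) / total_error, mean_se, model_error


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    rr, mean_se, model_error = robustness_ratio(
        [0.10, 0.20, 0.30, 0.40], [0.05, 0.05, 0.05, 0.05]
    )
    print(f"RR = {rr:.2f}; model error exceeds average SE: {model_error > mean_se}")
```

Returning the two error components alongside RR makes it easy to check the condition \(s_{{\hat{J}}}(\beta _x) > {\bar{se}}({\hat{\beta }}_x)\) discussed above.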

3.5.2 Type Sign Error Estimator

To estimate TSE, the following procedure is proposed: let

$$\begin{aligned} n_{+^\bigstar }(x) = n({p_{{\hat{\beta }}_x}<\alpha } \mid {\hat{\beta }} >\beta _{H_0}) \end{aligned}$$
(7)

hence \(n_{+^\bigstar }(x)\) is the number of specifications associated with estimates that are simultaneously (i) statistically significant and (ii) higher than the null value \(\beta _{H_0}\) (e.g. \(\beta _{H_0} = 0\)), and let:

$$\begin{aligned} f_{+^\bigstar }(x) = \frac{n_{+^\bigstar }(x)}{n_J(x)} \end{aligned}$$
(8)

where \(f_{+^\bigstar }(x)\) is the relative frequency over the whole number of specifications of X (\(n_J(x)\)).

Then for asymptotically large \(n_J(x)\), it is assumed:

$$\begin{aligned} Pr.(\beta _x > \beta _{H_0}) \sim f_{+^\bigstar }(x) \end{aligned}$$
(9)

Equation 9 can be treated as an estimator of the probability of drawing by chance a significant result higher than the null value among all the specifications sharing the same modelling process (i.e. the VoE model) that sampled the \({\hat{J}}\) multiverse.

Likewise, the probability of drawing by chance a significant result lower than the null is estimated in the same way, switching the operators in Eqs. 7 and 8:

$$\begin{aligned} \begin{aligned}&n_{-^\bigstar }(x) = n({p_{{\hat{\beta }}_x}<\alpha } \mid {\hat{\beta }}< \beta _{H_0}); \\&f_{-^\bigstar }(x) = \frac{n_{-^\bigstar }(x)}{n_J(x)}; \\&Pr.(\beta _x < \beta _{H_0}) \sim f_{-^\bigstar }(x) \end{aligned} \end{aligned}$$
(10)

To quantify the risk of TSE, consider the two relevant hypotheses:

  1. \(\beta _x > \beta _{H_0}\). When this hypothesis is true, the TSE has a probability \(\sim f_{-^\bigstar }(x)\).

  2. \(\beta _x < \beta _{H_0}\). When this hypothesis is true, the TSE has a probability \(\sim f_{+^\bigstar }(x)\).

The probabilities of TSE should be weighted by the probabilities of the hypotheses being true. Equations 9 and 10 provide a method to estimate the probabilities of these hypotheses being true. It follows:

$$\begin{aligned}&Pr._{TSE}(X)\\ &\quad \sim f_{+^\bigstar }(x) \cdot f_{-^\bigstar }(x) + f_{-^\bigstar }(x) \cdot f_{+^\bigstar }(x) \\ & \quad =2(f_{+^\bigstar }(x) \cdot f_{-^\bigstar }(x)) \end{aligned}$$
(11)

From Eq. 11 it follows that \(max(Pr._{TSE}(X)) =.5\). This is coherent with the assumption that if \(f_{+^\bigstar }(x) >.5\), then \(f_{-^\bigstar }(x) <.5\) by definition (see Eqs. 8 and 10). Indeed, it holds that if \(f_{+^\bigstar }(x) =.5\) and \(f_{-^\bigstar }(x) =.5\), then

$$\begin{aligned} 2(f_{+^\bigstar }(x) \cdot f_{-^\bigstar }(x)) =.5 = max(Pr._{TSE}(X)) \end{aligned}$$
(12)
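The procedure of Eqs. 7 to 11 can be sketched in a few lines of Python. This is only an illustrative reading of the estimator: the function name `tse_probability`, its argument names, and the example values are assumptions introduced here, not part of the original analysis.

```python
def tse_probability(estimates, p_values, alpha=0.05, beta_null=0.0):
    """Estimate f_plus, f_minus and Pr_TSE (Eqs. 7-11) for one regressor.

    estimates -- one coefficient estimate per specification
    p_values  -- the p-value attached to each estimate
    """
    n_j = len(estimates)
    if n_j == 0 or n_j != len(p_values):
        raise ValueError("estimates and p_values must be non-empty and aligned")
    # Eq. 7: significant specifications above the null value
    n_plus = sum(1 for b, p in zip(estimates, p_values) if p < alpha and b > beta_null)
    # Eq. 10: significant specifications below the null value
    n_minus = sum(1 for b, p in zip(estimates, p_values) if p < alpha and b < beta_null)
    f_plus = n_plus / n_j      # Eq. 8
    f_minus = n_minus / n_j    # Eq. 10
    pr_tse = 2 * f_plus * f_minus  # Eq. 11
    return f_plus, f_minus, pr_tse


if __name__ == "__main__":
    # Hypothetical multiverse of four specifications, for illustration only.
    betas = [0.3, 0.2, -0.4, 0.1]
    pvals = [0.01, 0.20, 0.03, 0.04]
    print(tse_probability(betas, pvals))
```

Because every specification is counted in at most one of the two groups, \(f_{+^\bigstar }(x) + f_{-^\bigstar }(x) \le 1\), so the product term cannot push the result above the bound of Eq. 12.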

3.6 Multiversal Model

For each of the 48 X, a multiversal sample of 1,024 specifications is generated, for a total of 49,152 estimates. This model, while keeping the general assumptions of VoE, includes three additions to how the specifications ‘vibrate’:

  • The estimation technique of \(\hat{\beta _x}\) vibrates across two types of panel regression: Fixed Effects and First Differences.

  • Lag vibrates across 1 year of lag and 2 years of lag.

  • Y vibrates across 8 specifications of the rate of fertility across different age ranges.

As a consequence:

$$2{\mathcal{E}} \times 2{\mathcal{L}} \times 8{\mathcal{Y}} \times 2^{5} {\mathcal{Z}} = 1,024{\mathcal{J}}$$
(13)

where \({\mathcal {E}}\) are the estimation methods, \({\mathcal {L}}\) are the lags, \({\mathcal {Y}}\) are the specifications of the dependent variable, \({\mathcal {Z}}\) are the control variables and \({\mathcal {J}}\) is the number of specifications.
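The count in Eq. 13 can be verified by enumerating the grid explicitly. The sketch below is illustrative: the estimator and outcome labels (`"FE"`, `"FD"`, `"y_1"` to `"y_8"`) are placeholders invented here, while the five control names are those of the control set \({\textbf{Z}}\) described in the text (with employment understood to carry its quadratic term).

```python
from itertools import product

ESTIMATORS = ("FE", "FD")                          # Fixed Effects, First Differences
LAGS = (1, 2)                                      # years of lag
OUTCOMES = tuple(f"y_{k}" for k in range(1, 9))    # placeholder labels for the 8 fertility rates
CONTROLS = ("frgn_rate", "marrg_rate", "grad_rate", "income_pc", "employment")


def control_subsets(controls):
    """Every in/out combination of the controls: 2**len(controls) subsets."""
    subsets = []
    for mask in product((False, True), repeat=len(controls)):
        subsets.append(tuple(c for c, keep in zip(controls, mask) if keep))
    return subsets


def specification_grid():
    """All combinations of estimator, lag, outcome and control subset."""
    return list(product(ESTIMATORS, LAGS, OUTCOMES, control_subsets(CONTROLS)))


if __name__ == "__main__":
    grid = specification_grid()
    print(len(grid))        # specifications per regressor X
    print(48 * len(grid))   # estimates over the 48 regressors
```

Each element of the grid identifies one regression to be run for a given X, so the multiverse is fully determined by the enumeration, with no random sampling involved.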

3.7 Results

The list of 48 independent variables with their evaluation statistics is presented in Table 2. Of these, 23 have a positive average of estimates in the multiverse and 25 have a negative one. Assuming \(\alpha =.05\), a median rate of significance \(f_{\pm ^\bigstar } =.2\) implies that, globally, the evidence to claim linear effects of the independent X on Italian fertility is very poor. For all X, the sample standard deviation of their estimates in the multiverse is higher than their average parametric standard error.

Table 2 Multiverse statistics on the 48 main regressors from BES database

It is possible to qualify the BES database as an insufficient empirical source to claim that, in general, sustainable well-being has a direct impact on Italian fertility: none of the 48 variables reaches the goal of \(RR > 2\) (Young & Holsteen, 2017). The highest RR is reached by the variable “Family debts” (\(RR =.65\); see Footnote 11). As a consequence, the model presented in De Rose et al. (2020) deserves further investigation (see Footnote 12).

Nonetheless, this outcome was not unexpected, given the heterogeneity of social indicators in BES. The risk of claiming a causal effect in the wrong direction is relatively contained: the estimates for \(Pr._{TSE}\) are rarely above .05. However, the Kendall correlation between \(f_{\pm ^\bigstar }\) and \(Pr._{TSE}\) is .77; given the structure of the estimator of \(Pr._{TSE}\) (see Sect. 3.5), this result indicates that most of the variance in the risk of observing a significant wrong sign depends on the probability of observing a significant result, which is another red flag.

3.7.1 Individual Analysis

These results lead to the individual investigation of those variables showing a severe “Janus effect” (high \(Pr._{TSE}\) with high \(f_{\pm ^\bigstar }\)).

There are only 6 variables with a rate of significant specification \(>.45\). These are: “Retirement incomes” (.61, \({\bar{\beta }}_x = -0.23\)), “Low income pensions” (.46, \({\bar{\beta }}_x = 0.12\)), “Not violent crimes” (.51, \({\bar{\beta }}_x = 0.20\)), “Violent crimes” (.47, \({\bar{\beta }}_x = 0.05\)), “Women local administrators” (.49, \({\bar{\beta }}_x = -0.02\)), and “Specialised medics” (.46, \({\bar{\beta }}_x = -0.20\)).

Two hypothetical patterns are worthy of mention: (i) a rise in the allocation of pensions after retirement predicts (small) decreases in future fertility and (ii) higher crime is positively associated with future fertility. Even after adopting lagged models (see Sect. 3.3), neither hypothesis necessarily configures a linear causal effect. Consider that “Retirement incomes”, the index of per-capita incomes after retirement (e.g. pensions), has the highest \(Pr._{TSE}\) of the whole BES at .16, and the correlated quota of “Low-income pensions” has a \(Pr._{TSE}\) on the higher end of the distribution (.9), too. This condition can be visualised in Fig. 2.

Fig. 2 Volcano plot of estimates of the effect sizes of per-capita incomes after retirement on Italian fertility. Different colours denote 7 age-specific fertility rates, hence 896 specifications are represented in the plot. Darker colours are associated with fertility in older women. The dashed line is the significance level \(\alpha =.05\) (notice that the y-axis is log-scaled)

The difference in the magnitude of the estimated effect sizes between Lagged Fixed Effects (LFE) and Lagged First Difference (LFD) is striking. This pattern repeats for the models of the impact of the diffusion of non-violent crimes (Fig. 3) and of the rate of women as local administrators (Fig. 4) on generational fertility. Other variables exhibit a similarly patterned ‘Janus effect’, even if less accentuated.

Fig. 3 Volcano plot of estimates of the effect sizes of diffusion of non-violent crimes on Italian fertility

Fig. 4 Volcano plot of estimates of the effect sizes of the rate of women as local administrators on Italian fertility

Observing the volcano plots, the principal cause of the Janus effect seems to be the adoption of LFE models. This model, which is adopted in De Rose et al. (2020), inflates the estimates compared to LFD. Another common pattern in these three plots is that the Janus effect in LFE models is also connected to the age of the mothers (darker colours on one side, lighter on the other). For example, a causal interpretation of the effect of the rise of women administrators on Italian fertility based exclusively on LFE models could lead one to think that the administration of women directly promotes fertility in older women, but at the cost of disadvantaging it in young women (‘postponement’ effect on fertility). However, the LFD estimates tell an entirely different story: the presence of women as local administrators has a slight, hard-to-detect positive impact on the fertility of young and middle-aged women.

Which estimation method should be trusted? Across all X, the first difference is more conservative in the magnitude of the effect. Compared to Fixed Effects, it suffers less from lag-induced confounding effects (since it ‘neuters’ these) and more from the misspecification of the lag structure itself. Overall, the observed results are coherent with the following explanation: LFE models cannot properly de-confound the estimate from latent structures. So, re-connecting this last assertion to the example of the effect of women administrators on fertility, one could say that the patterns observed in Fig. 4 are coherent with a phenomenology where women administrators have a very tiny positive direct effect on fertility (e.g. they promote pro-fertility policies), but the prerequisites for a rise in the rate of women administrators (e.g. ‘women emancipation’) also induce a latent, lagged, ‘postponement’ effect in fertility - which is responsible for the technical negative average effect on fertility.
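To make the two estimation strategies concrete, the following is a deliberately simplified Python sketch of a lagged Fixed Effects and a lagged First Difference estimator for a single regressor, with no controls and no time effects. All function names and the toy panel are hypothetical, and the sketch is not the estimation code of the study: it only shows which transformation each method applies before fitting the slope.

```python
def lagged_pairs(x, y, lag):
    """Pair x observed at time t - lag with y observed at time t (one unit)."""
    return list(zip(x[:-lag], y[lag:]))


def slope(xs, ys):
    """OLS slope without intercept, for data already centred or differenced."""
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)


def lagged_fixed_effects(panel, lag=1):
    """Within estimator: subtract each unit's own means, then pool."""
    xs, ys = [], []
    for x, y in panel.values():
        pairs = lagged_pairs(x, y, lag)
        mean_x = sum(p[0] for p in pairs) / len(pairs)
        mean_y = sum(p[1] for p in pairs) / len(pairs)
        xs += [p[0] - mean_x for p in pairs]
        ys += [p[1] - mean_y for p in pairs]
    return slope(xs, ys)


def lagged_first_difference(panel, lag=1):
    """First-difference estimator: regress year-on-year changes, pooled."""
    xs, ys = [], []
    for x, y in panel.values():
        pairs = lagged_pairs(x, y, lag)
        for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
            xs.append(x1 - x0)
            ys.append(y1 - y0)
    return slope(xs, ys)


def toy_panel():
    """Noiseless toy data: y_t = unit intercept + 2 * x_{t-1} (hypothetical)."""
    x_by_unit = {"A": [1, 2, 4, 3, 5, 6], "B": [2, 1, 3, 5, 4, 7]}
    intercepts = {"A": 10.0, "B": -5.0}
    panel = {}
    for unit, x in x_by_unit.items():
        a = intercepts[unit]
        y = [a] + [a + 2.0 * x_prev for x_prev in x[:-1]]
        panel[unit] = (x, y)
    return panel


if __name__ == "__main__":
    panel = toy_panel()
    print(lagged_fixed_effects(panel, lag=1), lagged_first_difference(panel, lag=1))
```

On noiseless data generated with a correctly specified lag, both estimators recover the same slope; the divergence between LFE and LFD discussed above arises only with real data, where latent structures and lag misspecification affect the two transformations differently.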

It is also possible to illustrate the analysis of a variable not affected by the Janus effect: the impact of non-performing family loans (“Family debts” in Table 2) on fertility (Fig. 5).

Fig. 5 Volcano plot of estimates of the effect sizes of non-performing family loans on Italian fertility

This is a peculiar case because the average of the effect size is very small and negligible (\({\bar{\beta }}_x = 0.04\)), yet it is very consistent across specifications and even across methods of estimation. A very small but robust effect size can be interpreted as the reflection of a persistent, but not very extensive, social dynamic. For example, poor families contract relevant debts during pregnancy. In this sense, debts would not be the cause of fertility, but a small contingency that manifests 1 year before birth. Indeed, a careful eye can catch that no specification lagged by 2 years is above the threshold of significance (the dashed line in Fig. 5). This is another perk of the enhanced VoE employed in the present analysis, as an exploratory method: it helps to isolate short-term effects (when only 1-lagged models are significant) from effects with an unclear lag structure (when both 1-lagged and 2-lagged models are significant; see Footnote 13).
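The distinction between short-term effects and effects with an unclear lag structure can be expressed as a simple rule on the p-values grouped by lag. The sketch below is one possible operationalisation: the function name `lag_structure`, the labels it returns, and the choice of treating a lag as ‘significant’ when one or more of its specifications falls below \(\alpha\) are assumptions made here for illustration.

```python
def lag_structure(p_values_by_lag, alpha=0.05):
    """Classify a regressor from the p-values of its specifications.

    p_values_by_lag -- dict mapping lag (1 or 2) to a list of p-values
    A lag counts as significant if any of its specifications has p < alpha
    (an assumption of this sketch, not a rule stated in the text).
    """
    sig_lag1 = any(p < alpha for p in p_values_by_lag.get(1, []))
    sig_lag2 = any(p < alpha for p in p_values_by_lag.get(2, []))
    if sig_lag1 and not sig_lag2:
        return "short-term"
    if sig_lag1 and sig_lag2:
        return "unclear lag structure"
    return "other"  # cases not discussed in the text


if __name__ == "__main__":
    # Hypothetical p-values, for illustration only.
    print(lag_structure({1: [0.01, 0.03], 2: [0.20, 0.40]}))
    print(lag_structure({1: [0.01, 0.03], 2: [0.02, 0.40]}))
```

A stricter rule, for example requiring a minimum share of significant specifications per lag, could replace the `any` condition without changing the structure of the function.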

4 Conclusive Discussion and Future Directions

Results of Sect. 3 are coherent with the hypothesis that the effect of ‘well-being’ is too small to justify a policy based on well-being to raise fertility in Italy. The major argument to support this thesis is the moderate magnitude of the average multiversal estimates: no \(|{\bar{\beta }}_x |>.3\) appears in Table 2. Moreover, the standard deviation of the multiversal estimates \(s_{{\hat{J}}}(\beta _x)\) is consistently higher than the average standard error of the estimated coefficient of linear effect, so much of the total variance of the multiversal estimates depends on the hidden model-dependent variance and not on sampling variance.

However, individual analyses on some of the most promising variables help to understand that “there is no evidence for linear effects of BES indicators on Italian fertility” does not imply that the effect of well-being on fertility is uniformly zero for all women. Two different arguments are provided to clarify why reasonable policies may be ineffective in supporting fertility in Italy.

The first argument regards the implication of ‘linearity’ of the effect. In Figs. 2, 3, and 4, it can be noticed that the darker cluster of the scatterplot of estimates, referring to older women, goes in one direction, while the lighter cluster, referring to younger women, goes in another. Without this distinction, this could be referred to as a Janus effect, but through the differentiation across ages this effect is entirely explainable, in the sense that the same regressor may induce a negative effect on the reproductive decision in one cluster of women and a positive effect in another. Here the role of the methodology is to visualise this duality clearly. This statistical behaviour would be coherent with the hypothesis of postponement of the firstborn in young women. Given this specific configuration, it would also be possible to weight the multiverse estimates based on the actual contribution of different age clusters of women to fertility. However, it should be noted that this solution would not be free from technical, operational, and even moral issues. From a technical standpoint, a correction criterion based on cluster-wise estimators (Abadie et al., 2023) would be necessary when computing the multiverse error. In practice, there is still no theoretical formulation of the Young and Holsteen (2017) RR statistic that harmonises standard error with cluster error. From an operational perspective, this solution would completely evade the concept of sustainability implicit in the BES. The idea of promoting social policies to support women’s fertility should be aimed at a specific purpose, which is to rebalance the decision-making power of women who would like to have children but are not in a position to bear the costs. However, a criterion that favours certain clusters over others seems to be out of line with the foundational spirit of the BES and more in line with the aforementioned initiatives of the Hungarian government.
In this sense, multiverse analysis can also be a starting point for an ethical debate on reproductive rights. A clear case for this argument is that, adopting the LFE model in Fig. 4, it could be argued that women should face more obstacles to reaching administrative positions, in order to raise fertility in young women. This is unlikely to be an empowering solution for young women! In other words, the first argument against the claim that well-being is a determinant of fertility is that, even when local evidence is found for it, this effect will be nested within a complexity of other variables, so the policy could easily backfire (or ‘vibrate back’) in the long run.

The second argument against claiming that policies of well-being can help to regularise fertility in Italy regards the extent to which the collected evidence is truly sufficient for rejecting the null hypothesis of no effect. The study considered a panel of 100 provinces, across 6 years, and two different modelling approaches to account for the panel structure. The differences in the magnitude of the values between LFE and LFD estimates are striking. Aside from the interpretation of these differences (LFD may be seen as more conservative), there are more differences between designs of the inference than within them. This condition alone implies that the collected evidence ‘is not sufficient’, because it is not the main reason for a specification to be statistically significant: the significance of a result depends more on the estimator than on the estimated.

In this regard, a limitation of the study is that, contrary to De Rose et al. (2020), the modelling of estimation never involves multi-level assumptions. The Italian provinces are treated as an unstructured collection of statistical units, without the typical clustering into geographic macro-areas. Multi-level multiversal modelling has been attempted by Harder (2020), and it represents an ambitious future direction for multiversal modelling in demography.

From this consideration follows a second limitation, regarding the possibility of ecological fallacies across the multiple inferences (Gnaldi et al., 2018). For example, the model accounts for the presence of immigrants in the province, but then the outcome is the pooled fertility rate between natives and immigrants. In this regard, it can be said that the general interpretation of the results is more related to evaluating the uncertainty regarding a very general hypothesis rather than the possibility of simultaneously testing multiple hypotheses. If the research question specifically aims to quantify the uncertainty regarding the estimation of the effects of policies aimed at promoting the sustainability of well-being on fertility for Italian or foreign-born individuals, then it would be absolutely necessary to diversify the output variable.

On the extendibility of the methodology for purposes other than the evaluation of the robustness of a scientific claim, there is an ongoing debate about the usability of RR for model selection (Slez, 2019; Young, 2019). In this application, the VoE model produced 1024 specifications of the same conceptual model. How should one pick one? Aside from traditional methods based on Information Theory (Burnham & Anderson, 2002), Prof. Breznau wrote extensively on this problem from the perspective of the social scientist (Breznau, 2021). He favours the crowdsourced (or, meta-analytical) solution to multiversal sampling: the same dataset and the same research question are provided to multiple research teams, in line with the spirit of the “Many Labs” protocol (Stroebe, 2019). Researchers send back a sample of specifications and the whole multiverse is the union of the proposals. This crowdsourced method is technically superior: the same specification can be drawn in the sample more than one time (up to times equal to the number of teams), so the assumptions about the random structure of the multiverse are relaxed. BES has excellent features to be the empirical basis for reproducing this crowdsourced method; it could answer some of the most complex questions regarding the social life of people living in Italy. However, this path is not without adversities. Breznau’s experiment demonstrated that a relevant research question, namely: “Does immigration induce variation in the desirability of welfare?” still exhibits an irreducible Janus effect even in the crowdsourced sample (Breznau et al., 2022). In this sense, the proposal of Slez (2019) to adopt a system of weightings sounds sensible.

To conclude, as mentioned above, the present study adapts a template that originated in the Biomedical Sciences to the new frontiers of Social Research. A relevant future direction of the research lies in reconnecting the Multiverse with the developments of the research on BES indicators. Part of the research on BES is focused explicitly on methods for aggregating the elementary indicators into composite indicators of well-being for the 12 latent domains, with the Adjusted Mazziotta–Pareto method (Mazziotta & Pareto, 2016), the one referred to in De Rose et al. (2020), being prevalent (Ciommi et al., 2017; Alaimo & Maggino, 2020), but not the only one (Davino et al., 2018; D’Urso & Vitale, 2021; Tomaselli et al., 2021; Giacalone et al., 2022; Monte & Schoier, 2022). The robustness of the inferences based on these composite indicators can be checked by adopting different alternatives and checking whether these analytical choices, for the operational definition of the latent variables, induce significant differences in variance or effect size, and even whether some methods of composition are consistently at risk of TSE.