1 Introduction

In the field of applied statistics and data science, two prevailing cultures have emerged: the data modeling culture, which focuses on understanding the data-generating process, and the algorithmic modeling culture, which emphasizes accurate predictions through techniques like neural networks, as outlined by Breiman [1]. As datasets grow larger and more intricate, the demand for models that go beyond accurate predictions and offer clear, interpretable explanations intensifies. This need is underscored by regulations such as the General Data Protection Regulation (GDPR) [2]. Specifically, the right to explanations of decisions made by algorithmic approaches is laid out in Art. 22 https://gdpr-info.eu/art-22-gdpr/ and Recital 71 https://gdpr-info.eu/recitals/no-71/. The GDPR requires transparent explanations for decisions made by automated systems, so that a specific decision can be challenged by the individual it was applied to. Many modern decision-making tools, such as deep neural networks [3], lack the transparency needed to explain their decisions. Although tools like SHAP [4] and SAGE [5] exist to explain, locally and globally respectively, the decisions made by black-box algorithms, models that are transparent by design are still required. The limitation of these methods is that an approximate explanation is produced for an approximation of the data-generating mechanism made by a black-box model; thus, uncertainty about the decisions, as well as full mechanistic control of the decision-making, may be lost in the process. While simple linear models satisfy these requirements to a greater extent, they may deliver worse predictive performance and hence less accurate decision-making [6, 7], which creates the need for a more flexible yet still transparent approach to predictive modeling that is also capable of handling uncertainty. One prominent approach to these issues is symbolic regression [8]. Symbolic regression aims to build a potentially complicated and nonlinear, yet still closed-form, functional tree that links the covariates and the responses. However, while symbolic regression resolves the transparency issue, it generally lacks uncertainty handling in both the functional trees and the predictions.

To resolve these issues and obtain an approach that is flexible on the one hand and explainable on the other, the Bayesian Generalized Nonlinear Model (BGNLM) was introduced by Hubin et al. [6]. Designed for flexibility and interpretability, BGNLMs extend generalized linear models [9]: the distribution of the observations is assumed to come from the exponential family, but the mean parameter is linked to the covariates through a hierarchical nonlinear function combined with automatic Bayesian selection or averaging of linear and nonlinear effects. While the feature hierarchy of BGNLM is inspired by concepts from neural networks, the model distinguishes itself by striking a balance between predictive power, uncertainty handling, and interpretability. Unlike their neural network counterparts, BGNLMs offer simplicity and transparency without sacrificing predictive accuracy. As shown in Hubin et al. [6], the approach can recover Kepler’s laws and complicated logical interactions corresponding to the data-generating model in an M-closed setting [10], while still predicting well in the M-open world [11].

Yet, the computational demands inherent in many Bayesian approaches, including BGNLMs, pose a challenge. Traditionally, their application has been confined to smaller problems with limited observations and covariates due to the computational overhead of algorithms such as (among others) mode jumping Markov chain Monte Carlo (MJMCMC) [12] or integrated nested Laplace approximations (INLA) [13]. Efforts to enhance scalability, such as subsampling MJMCMC [14], have made it possible to scale to tall data. However, these approaches still struggle with datasets containing many covariates (wide data) due to their combinatorial nature in an exponentially large model space. Recently, an R package, FBMS, was published on CRAN for BGNLMs [15], but BGNLMs are still not widely used in practice. One of the reasons is that scalability must be improved before they can be used on the massive datasets that are becoming increasingly common in machine learning and data science.

Resolving this challenge motivates this paper, which proposes an innovative fitting algorithm that leverages stochastic variational Bayes, both in its mean-field form and with flexible normalizing flows, to replace MJMCMC within the GMJMCMC algorithm. This approach aims at enhancing the scalability of BGNLMs, making them more amenable to datasets characterized by a multitude of covariates and observations. We show that this does not come with a significant deterioration of performance compared to the costly, asymptotically exact approaches based on Monte Carlo sampling. Moreover, methodologically speaking, to the best of our knowledge, this is the first time evolutionary algorithms are combined with variational Bayes, thus filling a broader research gap than just inference on BGNLMs.

2 Bayesian generalized nonlinear models (BGNLM)

The core model for this paper is the Bayesian Generalized Nonlinear Model (BGNLM). Developed by Hubin et al. [6], this model aims to offer flexibility and interpretability, combining nonlinear regression with a rich class of nonlinearities and interactions. Formally, we model the relationship between p explanatory variables and a response variable based on n samples from a dataset. For \(i = 1,...,n\), let \(Y_i\) denote the response and \(\textbf{x}_i = (x_{i1},..., x_{ip})\) the corresponding covariate vector.

The model resembles a GLM [16] but introduces flexibility by incorporating a range of nonlinear transformations, called features. These features, denoted as \(F_j(\textbf{x}_i, \varvec{\alpha }_j)\) for \(j=1,...,q\), will be introduced shortly. Conditional on the full hierarchy of features, the BGNLM is defined as follows:

$$\begin{aligned} Y_i|\mu _i,\phi&\sim f(y|\mu _i, \phi ), \end{aligned}$$
(1)
$$\begin{aligned} h(\mu _i)&= \beta _0 + \sum _{j=1}^{q} \gamma _j \beta _j F_j(\textbf{x}_i, \varvec{\alpha }_j). \end{aligned}$$
(2)

Here, \(f(\cdot |\mu _i,\phi )\) represents the probability distribution density (mass) from the exponential family, and \(h(\mu _i)\) is the link function relating the mean to the features. Features enter the model with coefficients \(\beta _j \in \mathbb {R}\) for \(j= 1,...,q\), with a binary variable \(\gamma _j \in \{0, 1\}\) indicating inclusion of a specific feature in a model. The \(\varvec{\alpha }_j\)’s are any potential internal parameters of the features and will be discussed in the next section.

2.1 Features

To understand Equation (2), we need to specify the features \(F_j(\textbf{x}_i, \varvec{\alpha }_j)\), \(j=1,...,q\), and define the feature hierarchy: A feature is a covariate, an interaction, or a nonlinear transformation using functions from \(\mathcal {G} = \{g_1,...,g_k\}\). Starting with the input variables \(\textbf{x}_i\) as features, denoted as \(F_j(\textbf{x}_i, \varvec{\alpha }_j) = x_{ij}\) for \(j \in \{1,...,p\}\), we recursively define the whole hierarchy of features. Assume a set of features \(\{F_k(\cdot , \varvec{\alpha }_k), k \in A\}\) is available at a given level of recursion. New features for the next level are defined as follows:

$$\begin{aligned} F_{j}(\textbf{x}_{i}, \varvec{\alpha }_{j}) = {\left\{ \begin{array}{ll} g_{j}\Big (\alpha _{j,0}^{out} + \sum \limits _{k \in A_{j}} \alpha _{j,k}^{out} F_{k}(\textbf{x}_{i}, \varvec{\alpha }_{k}^{in})\Big ) &{} \text {projection,} \\ g_{j}\big (F_{k}(\textbf{x}_{i}, \varvec{\alpha }_{k}^{in})\big ) &{} \text {modification,} \\ F_{k}(\textbf{x}_{i}, \varvec{\alpha }_{k}^{in})\, F_{l}(\textbf{x}_{i}, \varvec{\alpha }_{l}^{in}), \quad k, l \in A &{} \text {multiplication.} \end{array}\right. } \end{aligned}$$
(3)

Features are projections, modifications, or multiplications, introducing flexibility and parsimony. The first transformation, known as projection, has a similar definition to those used in neural networks. It involves taking a linear combination over a subset of features before applying a nonlinear transformation selected from \(\mathcal {G}\). Note the distinction between the parameters defining the current projection, \(\varvec{\alpha }_j^{out}\), and the parameters contained within the previously defined features nested inside the projection, \(\varvec{\alpha }_j^{in}\). Nonetheless, all \(\varvec{\alpha }_j\) parameters originate from projection transformations and need to be determined inside these. Hubin et al. [6] proposed three strategies for doing so; they are not discussed here, but Strategy 2 is the one implemented in this paper.

Even though the definitions may look complicated, the resulting features can be quite interpretable, as they can be represented as functional trees, which, in turn, may correspond to phenomena that are easy to understand, like nonlinear physical laws. Toy examples of tree representations of BGNLM features are given in Fig. 1.

Fig. 1

Illustration of how features are composed. On the left, the resulting feature is \(x_1\text {cos}(x_2) (x_3 + \text {exp}(x_4))\). On the right, the resulting feature is \((x_1^2x_2)^{\frac{1}{3}}\). This is one form of Kepler’s third law (Hubin et al. [6])

All features have specific characteristics: depth, local width, and operations count. The depth, \(d_j\), of a feature \(F_j\) is determined by the minimum number of nonlinear transformations applied recursively to generate it. For example, if a feature \(F_j\) is defined as \(F_j(\textbf{x}_i, \varvec{\alpha }_j) = h(u(v(x_{i1})) + w(x_{i2}))\), for some nonlinear functions \(h\), \(u\), \(v\), and \(w\), then its depth is 3. Conversely, if a multiplication operation is applied, the depth is defined as one plus the sum of the depths of the operands. For example, \(F_k(\textbf{x}_i, \varvec{\alpha }_k) = x_{i2} u(x_{i1})\) has depth \(d_k = 2\), using that the depth of a linear component is zero. Hubin et al. [6] showed that the number of features grows super-exponentially with depth, and in practice, the modeler limits the depth to be small even for problems with few covariates. Further, the local width, \(lw_j\), of a feature is the number of other features used to generate it. The value of \(lw_j\) depends on the type of operation used: \(|A_j|\) for a projection, 1 for a modification, and 2 for a multiplication. The operations count, \(oc_j\), of a feature is the total number of algebraic operations used in its representation. For instance, \(F_j(\textbf{x}_i, \varvec{\alpha }_j) = x_{i1}\) has \(oc_j = 0\), while \(F_j(\textbf{x}_i, \varvec{\alpha }_j) = v(u(x_{i1}))\) has \(oc_j = 2\).
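As a toy illustration of these characteristics, the sketch below (not part of the paper's implementation; all names are hypothetical) represents features as small expression trees and computes depth, local width, and a simple operations count in which every nonlinearity or multiplication contributes one operation.

```python
# Minimal sketch: features as expression trees with the characteristics defined above.
# The accounting of additions inside projections is a simplifying assumption here.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    kind: str                       # "covariate", "projection", "modification", or "multiplication"
    label: str = ""                 # covariate name, or the nonlinearity g_j
    children: List["Feature"] = field(default_factory=list)

    def depth(self) -> int:
        if self.kind == "covariate":
            return 0                # linear components have depth zero
        if self.kind in ("projection", "modification"):
            return 1 + max(c.depth() for c in self.children)
        return 1 + sum(c.depth() for c in self.children)        # multiplication

    def local_width(self) -> int:
        return len(self.children)   # |A_j| for projection, 1 for modification, 2 for multiplication

    def operations_count(self) -> int:
        own = 0 if self.kind == "covariate" else 1
        return own + sum(c.operations_count() for c in self.children)


# F_j = v(u(x1)): two nested modifications, so depth = 2 and oc_j = 2, as in the text.
f = Feature("modification", "v", [Feature("modification", "u", [Feature("covariate", "x1")])])
print(f.depth(), f.local_width(), f.operations_count())          # 2 1 2
```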

Our priors align with those in the original paper, to mitigate the risk of overfitting arising from the vast and flexible feature space generated by (3). Thus, we adopt a Bayesian approach with priors \(p(\gamma _j)\) favoring simplicity. Further, we incorporate three hard constraints: (1) the depth of any feature is at most \(D\); (2) the local width of any feature is at most \(L\); (3) the number of features in a model is at most \(Q\). These constraints ensure a finite feature space, curbing both the number of features and the number of models. To integrate our model (2) into a Bayesian framework, we assign prior probabilities to all parameters. The structure of a specific model is dictated by the vector \(\varvec{\gamma }= (\gamma _1,..., \gamma _q)\). Our initial step involves establishing prior probabilities for \(\varvec{\gamma }\):

$$\begin{aligned} p(\varvec{\gamma }) \propto \text {I}(|\varvec{\gamma }| \le Q)\prod _{j=1}^q p(\gamma _j). \end{aligned}$$
(4)

Here \(|\varvec{\gamma }| = \sum _{j=1}^q \gamma _j\), and the indicator limits the number of features included in the model to the maximum allowed, Q. The terms \(p(\gamma _j)\) assign lower prior probabilities to more complex features, using the form \(p(\gamma _j) = a^{\gamma _j \text {c}(F_j(\cdot , \varvec{\alpha }_j))}\), where \(0< a < 1\) and \(\text {c}(F_j(\cdot , \varvec{\alpha }_j)) \ge 0\) is a complexity measure of feature j.

The choice of a and the complexity measure play a pivotal role in determining the prior penalty on models. We mainly use \(a = e^{-2}\) for prediction and \(a = e^{-\log n}\) for model identification, corresponding to AIC- and BIC-type prior penalties, respectively. For \(\text {c}(F_j(\cdot , \varvec{\alpha }_j))\), we use \(1+oc_j\). To complete our Bayesian model, priors for the components of \(\varvec{\beta }\) with \(\gamma _j = 1\) and, if necessary, the dispersion parameter \(\phi\) are specified. We use independent Gaussian priors with the same variance for all \(\beta _j\), which corresponds to \(l_2\) regularization:

$$\begin{aligned} p(\beta _j|\gamma_j = 1) = \mathcal {N}(\beta _j; 0,\sigma ^2_\beta ). \end{aligned}$$
(5)

Alternative approaches, including Jeffreys prior and mixtures of g-priors, were considered in Hubin et al. [6, 17].
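As a worked illustration of the prior in Eq. (4) with the complexity measure \(1 + oc_j\), the short sketch below (a hypothetical helper, not the paper's code) evaluates the unnormalized log prior of a model for a chosen penalty a and the hard constraint on the number of features.

```python
# Sketch of the unnormalized log model prior in Eq. (4), assuming c(F_j) = 1 + oc_j.
import math


def log_model_prior(gamma, oc, a=math.exp(-2), Q=20):
    """gamma: inclusion indicators; oc: per-feature operation counts;
    a: penalty (AIC-type by default); Q: maximum number of features."""
    if sum(gamma) > Q:                          # hard constraint 3: I(|gamma| <= Q)
        return -math.inf
    return sum(g * (1 + c) * math.log(a) for g, c in zip(gamma, oc))


# Two included features of complexity oc = 0 (a covariate) and oc = 2 (e.g., v(u(x))):
print(log_model_prior([1, 1], [0, 2]))          # -2*(1) + -2*(3) = -8.0
```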

3 Genetically modified MJMCMC (GMJMCMC)

While MCMC or MJMCMC [18] can efficiently explore the marginal space of models, applying them directly to navigate the feature space from Equation (2) presents challenges. Firstly, the model space, with a size of \(2^q\), grows exponentially with the number of features q. Secondly, the increase in q is exponential with the depth of the features. Consequently, an alternative solution becomes imperative. In Hubin et al. [19], this issue is addressed within the context of Bayesian logic regressions, introducing the Genetically Modified MJMCMC (GMJMCMC) algorithm. This adaptive algorithm embeds MJMCMC into a genetic programming framework, a concept extended to BGNLM in Hubin et al. [6].

The process begins with an initial population of features \(\mathcal {S}_{0}\), comprising the essential covariates. MJMCMC is then deployed to explore the model space defined by this population. A new population emerges through random filtration, retaining features with high marginal inclusion probabilities. Subsequently, replacements are generated by applying transformations corresponding to the feature-generating process from Hubin et al. [6] to the remaining features. This iterative process is repeated a predefined number of times T, the number of populations in that particular run. Moreover, the algorithm is equipped with an embarrassingly parallel version in Hubin et al. [6, 19], and it is proven to be irreducible under practical assumptions. However, running MJMCMC on a combinatorial discrete space of models is computationally demanding, which restricts the approach to data with a relatively small number of features in each population. Moreover, the original GMJMCMC implementation from Hubin et al. [6] does not scale in the sample size either. Below, we present a way to improve this.

4 Evolutionary variational Bayes

In this paper, we propose to replace MJMCMC with variational Bayes within the populations of GMJMCMC resulting in what we call an evolutionary variational Bayes approach (EVB).

Algorithm 1 presents the pseudocode of the inference method suggested in this paper. For further illustration, it is accompanied by Diagrams 1 and 2. In what follows, we give a detailed verbal description of how it works: The algorithm commences with all p covariates as features, forming the first generation \(\mathcal {S}_0\), just like in GMJMCMC. For subsequent populations after \(\mathcal {S}_0\), we allow the size of each population, \(q^*\geq Q\), to surpass p, expediting the exploration of diverse features.

Following GMJMCMC [6], selecting features for the next generation involves two steps: First, we estimate the marginal inclusion probability \(\kappa _k\) of each feature in the population and retain all members of \(\mathcal {S}_t\) with an inclusion probability above a threshold \(\rho _{del}\). Those with inclusion probability below \(\rho _{del}\) are retained with a probability proportional to their marginal inclusion probability. Removed features are replaced with new ones generated randomly through projections, modifications, and multiplications as per (3), or by selecting an input covariate. Newly generated features that are already present in, or linearly dependent on features in, \(\mathcal {S}_t\) are excluded from the next population, \(\mathcal {S}_{t+1}\). Thus, the next population comprises high-probability features from \(\mathcal {S}_t\) plus the newly generated ones.

When replacing features, the replacement is randomly generated through the projection transformation with probability \(P_{pr}\), modification transformation with probability \(P_{mo}\), multiplication transformation with probability \(P_{mu}\), or an input covariate with probability \(P_{in}\), where \(P_{pr} + P_{mo} + P_{mu} + P_{in} = 1\). This allows excluding a certain transformation by setting \(P_{pr}\), \(P_{mu}\), \(P_{mo}\), or \(P_{in}\) to zero. If a projection or modification is added, a nonlinearity is chosen from \(\mathcal {G}\) with probabilities \(P_\mathcal {G}\).
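A minimal, self-contained sketch of this population update is given below. Features are plain strings, the generators are toy stand-ins for Eq. (3), and all names (NONLIN, new_feature, next_population) are illustrative rather than taken from the paper's implementation.

```python
# Toy sketch of one EVB/GMJMCMC-style population update; not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
NONLIN = ["sigmoid", "log1p_abs", "exp_neg_abs"]      # a toy G, chosen with uniform P_G


def new_feature(op, parents, covariates):
    """Generate a replacement feature according to the chosen transformation."""
    if op == "covariate" or not parents:
        return str(rng.choice(covariates))
    if op == "modification":
        return f"{rng.choice(NONLIN)}({rng.choice(parents)})"
    if op == "multiplication":
        a, b = rng.choice(parents, size=2)
        return f"({a})*({b})"
    subset = rng.choice(parents, size=min(2, len(parents)), replace=False)   # projection
    return f"{rng.choice(NONLIN)}(lin[{', '.join(subset)}])"


def next_population(pop, kappa, covariates, rho_del=0.2, p_ops=(0.25, 0.25, 0.25, 0.25)):
    """Keep features with inclusion probability >= rho_del, keep the rest with
    probability kappa_k, then refill the population with newly generated features."""
    keep = [f for f, k in zip(pop, kappa) if k >= rho_del or rng.uniform() < k]
    new_pop = list(dict.fromkeys(keep))               # drop exact duplicates
    ops = ["projection", "modification", "multiplication", "covariate"]
    while len(new_pop) < len(pop):
        cand = new_feature(rng.choice(ops, p=p_ops), keep, covariates)
        if cand not in new_pop:                       # crude stand-in for the redundancy check
            new_pop.append(cand)
    return new_pop


covs = ["x1", "x2", "x3"]
print(next_population(covs + ["sigmoid(x1)"], kappa=[0.9, 0.05, 0.5, 0.1], covariates=covs))
```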

Algorithm 1

EVB

4.1 Model for each generation

In generation \(t \in \{1,..., T\}\) of the genetic algorithm described above, we essentially fix a subset of features defined as a population of features \(\mathcal {S}_t = \{F_{t_1},..., F_{t_{q^*}}\} \subseteq \{F_{1},..., F_{q}\}\). This makes inference within each population equivalent to Bayesian model averaging within a GLM context, with \(F_{t_1},..., F_{t_{q^*}}\) acting as independent variables. For simplicity, let \(\textbf{v}_j\) denote the values of a feature \(F_{j}\).

In practice, we use the \(\textbf{v}_j\)’s as inputs to a Bayesian GLM, aiming to estimate the joint posterior of the weights \(\varvec{\beta }= (\beta _{t_1},..., \beta _{t_{q^{*}}})\) and binary model vectors \(\varvec{\gamma }= (\gamma _{t_1},..., \gamma _{t_{q^{*}}})\). Then, our linear predictor for observation i, denoted as \(u_i\), takes the following form:

$$\begin{aligned} u_i = {b} + \sum _{j=1}^{q^*} \gamma _{j}\beta _{j}{v}_{ij}, \quad \; i = 1, ..., n. \end{aligned}$$
(6)

This model formulation is a linear model exactly corresponding to the settings of Carbonetto and Stephens [20]. But it can also be seen as a simplified version of the latent binary Bayesian neural networks (LBBNN) from Hubin and Storvik [21] and Skaaret-Lund et al. [22], utilizing a single hidden layer with one node. Throughout this section, we leverage the inference techniques provided in those papers.

4.2 Mean-field approximation

The mean-field Gaussian, commonly employed in practice, is the initial choice for the approximate posterior. For an arbitrary regression with \(q^*\) parameters (for simplicity of notation), it is defined over the weights as:

$$\begin{aligned} q(\varvec{\beta }) = \prod _{j=1}^{{q^*}}\mathcal {N}(\beta _j; \tilde{\mu }_{j}, \tilde{\sigma }_{j}^2). \end{aligned}$$
(7)

The mean-field approach was extended to include binary inclusion variables (and thus handle model uncertainty) in Carbonetto and Stephens [20] and Hubin and Storvik [21], in the contexts of linear models and neural networks, respectively, through the following variational approximation:

$$\begin{aligned} q(\varvec{\beta }|\varvec{\gamma })&= \prod _{j=1}^{q^*}\Big [\gamma _{j}\mathcal {N}(\beta _{j}; \tilde{\mu }_{j}, \tilde{\sigma }_{j}^2) + (1 -\gamma _{j})\delta (\beta _{j})\Big ], \end{aligned}$$
(8)
$$\begin{aligned} q(\gamma _{j})&= \text {Bernoulli}(\gamma _{j};\tilde{\kappa }_{j}), \end{aligned}$$
(9)

where \(\delta (\cdot )\) is Dirac’s delta function.
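For concreteness, a short sketch of drawing coefficients from this spike-and-slab mean-field family (Eqs. (8)–(9)) is shown below; variable names are illustrative.

```python
# Sketch: sample beta from q(beta|gamma)q(gamma) with gamma_j ~ Bernoulli(kappa_j)
# and, given gamma_j = 1, beta_j ~ N(mu_j, sigma_j^2); otherwise beta_j = 0 exactly.
import numpy as np

rng = np.random.default_rng(1)


def sample_mean_field(mu, sigma, kappa, n_samples=5):
    mu, sigma, kappa = map(np.asarray, (mu, sigma, kappa))
    gamma = rng.random((n_samples, kappa.size)) < kappa           # inclusion indicators
    beta = rng.normal(mu, sigma, size=(n_samples, mu.size))       # slab draws
    return gamma * beta                                           # spike at zero when gamma_j = 0


print(sample_mean_field(mu=[1.0, -0.5, 0.0], sigma=[0.1, 0.2, 0.3], kappa=[0.9, 0.5, 0.05]))
```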

4.3 Flow approximation

The mean-field distribution is likely to lack the flexibility needed to approximate the true posterior, as it completely ignores dependencies between the parameters. This creates the need for a more versatile distribution. Following Skaaret-Lund et al. [22], we adopt the multiplicative normalizing flow (MNF) [23]. The idea relies on introducing a latent variable \(\textbf{z}\) that models dependencies between the parameters. A comparison between the mean-field and flow models is illustrated in Fig. 2.

Fig. 2

Mean-field (left) and flow-based (right) approximate posteriors

The variational posterior of the flow models for the weights is now given by:

$$\begin{aligned} q(\varvec{\beta }|\varvec{\gamma })&= \prod _{j=1}^{q^*}\Big [\gamma _{j}\mathcal {N}(\beta _{j}; z_j\tilde{\mu }_{j}, \tilde{\sigma }_{j}^2) + (1 -\gamma _{j})\delta (\beta _{j})\Big ], \end{aligned}$$
(10)

while the approximation for the inclusion indicators remains the same as in Equation (9). This is a version of the variational distribution from Skaaret-Lund et al. [22] adapted to a GLM instead of an LBBNN. The key difference from the mean-field approximation is the introduction of the latent variable \(\textbf{z}\): the variational posterior becomes more flexible through normalizing flows, which apply transformations to the weights through \(\textbf{z}\). The flow follows the inverse autoregressive flow design with numerically stable updates from Kingma et al. [24]. The chain of flows is initialized by \(\textbf{z}_0\) derived from the input features, and each flow component \(f_{\phi _k}\) is a so-called MADE [25], which retains an autoregressive structure. The resulting log density of \(\textbf{z} = \textbf{z}_K\) from the last layer of the flow is given as follows:

$$\begin{aligned} \log q (\textbf{z}) = \log q(\textbf{z}_0) - \sum _{k=1}^{K} \log \Big | \text {det} \frac{\partial f_{\phi _k}}{\partial z_{k-1}}\Big |, \end{aligned}$$
(11)

where the base density \(q(\textbf{z}_0)\) is an isotropic Gaussian distribution.
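To illustrate only the bookkeeping in Eq. (11), the sketch below uses a single hand-written affine autoregressive layer in place of the MADE-based inverse autoregressive flow used in the paper; the triangular Jacobian makes the log-determinant a simple sum of log scales.

```python
# Illustrative change-of-variables computation as in Eq. (11) with K = 1 flow layer.
import numpy as np

rng = np.random.default_rng(2)


def affine_autoregressive(z, shift_w, log_scale_w):
    """z_out[j] = z[j]*exp(s_j) + m_j, with (m_j, s_j) depending only on z[:j]."""
    out, log_det = np.empty_like(z), 0.0
    for j in range(z.size):
        masked = np.concatenate([z[:j], np.zeros(z.size - j)])    # enforce autoregression
        m, s = shift_w[j] @ masked, log_scale_w[j] @ masked
        out[j] = z[j] * np.exp(s) + m
        log_det += s                                              # triangular Jacobian
    return out, log_det


d = 3
z0 = rng.normal(size=d)                                           # base draw, isotropic Gaussian
W_m, W_s = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
zK, log_det = affine_autoregressive(z0, W_m, W_s)
log_q_z0 = -0.5 * (z0**2 + np.log(2 * np.pi)).sum()
log_q_zK = log_q_z0 - log_det                                     # Eq. (11)
print(zK, log_q_zK)
```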

4.4 Stochastic variational Bayes

Within each generation of the EVB algorithm, we learn the variational parameters that make the approximate posterior as close as possible to the true posterior. Closeness is measured through the Kullback–Leibler (KL) divergence,

$$\begin{aligned} \text {KL} \left[ q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma })||p(\varvec{w},\varvec{\gamma }|\mathcal {D})\right] = \sum _{\varvec{\gamma }} \int _{\varvec{w}} q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma })\log \dfrac{q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma })}{p(\varvec{w},\varvec{\gamma }|\mathcal {D})}\,d\varvec{w}, \end{aligned}$$
(12)

where \(q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma }) = q(\varvec{\beta }|\varvec{\gamma })q(\varvec{\gamma })\) and \(\varvec{\theta }\) is a vector of all parameters of variational approximating distributions \(q(\varvec{\beta }|\varvec{\gamma })\) and \(q(\varvec{\gamma })\).

Minimizing the KL divergence (with respect to \(\varvec{\theta }\)) is equivalent to maximizing the evidence lower bound (ELBO):

$$\begin{aligned} \text {ELBO}_{\varvec{\theta }} = \mathbb {E}_{q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma })}\left[ \log p(\mathcal {D}|\varvec{w},\varvec{\gamma })\right] - \text {KL}\left[ q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma })||p(\varvec{w},\varvec{\gamma })\right] . \end{aligned}$$
(13)

The objective is thus to maximize the expected log-likelihood while penalizing the KL divergence between the variational posterior and the prior. We further apply stochastic optimization combined with the local reparametrization trick, as in Skaaret-Lund et al. [22], to estimate the parameters of our variational distribution by maximizing \(\text {ELBO}_{\varvec{\theta }}\), or equivalently minimizing \(\text {KL} \left[ q_{\varvec{\theta }}(\varvec{w},\varvec{\gamma })||p(\varvec{w},\varvec{\gamma }|\mathcal {D})\right]\).
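As a simplified sketch of this objective, the snippet below maximizes a one-sample Monte Carlo estimate of the ELBO in Eq. (13) for a Gaussian linear model with a fully Gaussian variational family; the inclusion indicators, the spike-and-slab structure, and the local reparametrization trick of the actual method are omitted, and PyTorch is only an assumed backend here.

```python
# Minimal stochastic ELBO maximization (reparameterization trick, Adam) on toy data.
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
n, p, sigma_y, sigma_beta = 200, 5, 1.0, 1.0
X = torch.randn(n, p)
y = X @ torch.tensor([2.0, -1.0, 0.0, 0.5, 0.0]) + sigma_y * torch.randn(n)

mu = torch.zeros(p, requires_grad=True)             # variational means
rho = torch.full((p,), -2.0, requires_grad=True)    # softplus(rho) = variational std
opt = torch.optim.Adam([mu, rho], lr=0.05)
prior = Normal(0.0, sigma_beta)

for epoch in range(500):
    opt.zero_grad()
    q = Normal(mu, torch.nn.functional.softplus(rho))
    beta = q.rsample()                               # reparameterized draw from q
    log_lik = Normal(X @ beta, sigma_y).log_prob(y).sum()
    kl = kl_divergence(q, prior).sum()
    loss = -(log_lik - kl)                           # negative ELBO, Eq. (13)
    loss.backward()
    opt.step()

print(mu.detach())                                   # approaches the true coefficients
```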

To emphasize the theoretical gains in computational time, Table 5 reports the computational complexities of several popular methods. The most important comparison for BGNLM fitted by EVB is with GMJMCMC: we go from exponential complexity within each population to linear complexity, at the price of using approximations instead of exact sampling.

5 Simulation studies

In testing our BGNLM fitting algorithms, we conducted three simulation studies with a focus on variable selection and model detection in an M-closed framework [10], emphasizing the accuracy of our approximate posteriors in learning the data-generating process over prediction accuracy. For each study, we performed 100 simulation runs, evaluating the algorithms using commonly used measures for model selection, namely power and false discovery rate (FDR). Our simulation studies follow the Aims–Data-Generating Mechanism–Estimands and Targets–Methods–Performance Metrics–Results (ADEMP) convention from Morris et al. [26]. Neither here nor in the “small data” real data examples do we report computational time, because it is theoretically (and practically) unfair to compare implementations in different programming languages (R and Python) that use different data structures for storing results, with or without a C++ backend, and that are run on GPUs and CPUs on a shared computational node (used by a whole department) with potentially different loads at different times. However, we do report the inference time for the tall data example, to check the feasibility of the approaches and to give a sense of the computational time one may expect (not as a strict comparison) in a challenging situation.

The hyperparameters and tuning parameters of the BGNLM and EVB for all of the experiments are listed in Table 6 in the Appendix of the paper. If we omit reporting any of them in the text, interested readers can refer to the table for further information.

5.1 Simulation study 1

Aims The primary aim of this simulation study was to evaluate the performance of novel algorithms in variable selection within Bayesian linear regression using independent normal data with varying noise levels.

Data generating mechanism The data were generated according to the following process:

$$\begin{aligned} {x}_{ij} \sim \text {N}(0, 1), \quad Y_i \sim \text {N}(\varvec{\beta }^T\textbf{x}_i, \sigma ^2), i = 1,...,n, j = 1,...,p, \end{aligned}$$
(14)

where \(n = 15,000\) and \(p=20\), and the regression coefficients were fixed as:

$$\begin{aligned} \varvec{\beta } = (0,0,0,0,0, 1.5, 4, 3, -0.2, 1, 0,0,0,0,0, -2, 1.3, 0.3, -0.8, 3), \end{aligned}$$

while the noise level, \(\sigma ^2\), was varied across the simulations.
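For reference, the data-generating mechanism of Eq. (14) can be reproduced with a few lines (a sketch; the seed and noise level shown are arbitrary):

```python
# Sketch of the simulation design in Eq. (14).
import numpy as np

rng = np.random.default_rng(42)
n, p, sigma2 = 15_000, 20, 1.0                      # sigma2 was varied across simulations
beta = np.array([0, 0, 0, 0, 0, 1.5, 4, 3, -0.2, 1,
                 0, 0, 0, 0, 0, -2, 1.3, 0.3, -0.8, 3], dtype=float)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
```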

Estimands and targets The main goal was to estimate the set of nonzero regression coefficients using the median probability rule and to check whether we correctly identify them as nonzero. An additional goal was to check whether, for the nonzero coefficients, the data-generating effect sizes fall within the 95% credible intervals.

Methods We applied both mean-field and flow-based approximate posteriors. For the model prior, we used the AIC-prior penalty \(a = e^{-2} \approx 0.135\) for all \(j\). For both algorithms, we used a batch size of \(1000\) and \(300\) epochs. The flow-based approximate posterior was computed using \(K = 2\) components with \(2\) hidden layers of size \(75\) for each component. For both of the methods, we used \(T=1\) thus reducing our EVB to a simple VB algorithm on a given set of covariates, which is sufficient in this experiment as we are only interested in variable selection.

Performance metrics We evaluated the methods based on statistical power and FDR. Additional metrics included the coverage of credible intervals for \(\varvec{\beta }\) (results in Figs. 5 and 6 in the Appendix).

Results As shown in Fig. 3 (left), both methods achieved perfect recovery of the true model as the signal strength increased. The flow-based model did not outperform the mean-field approximation in terms of power, but the mean-field approximation exhibited a faster reduction in FDR. This outcome was expected since the data generative model did not introduce correlations between covariates. Also, as demonstrated in Figs. 5 and 6 in the Appendix, as the signal increases the true effects become located within the 95% credible intervals for both the mean-field and the flow-based approaches.

Fig. 3

Power and FDR for mean-field and flow-based approximate posterior for different values of \(\sigma ^2\) and independent data (left) and of \(\alpha\) for correlated data (right). The dashed lines represent 95% confidence intervals across 100 runs

Implications The results suggest that while both methods are effective for variable selection in independent data scenarios, the mean-field approach may be more efficient in reducing false discoveries when no correlations exist among covariates (Fig. 3).

5.2 Simulation study 2

Aims The aim of this study was to challenge the algorithms for fitting BGNLMs with correlated data, thereby testing their robustness and performance under different data dependencies. Introducing correlation between covariates adds complexity to the variable selection task. The mean-field approximation assumes independence among covariates, and we wanted to demonstrate the need for a more complex approximation in cases with correlation.

Data generating mechanism The data were generated in the same way as in the previous study. However, we introduced a correlation between two variables, a true and a false effect, by defining \({x}_{i3}^*\) as a combination of \({x}_{i6}\) and \(x_{i3}\):

$$\begin{aligned} {x}_{i3}^* = \alpha {x}_{i6} + (1 - \alpha ) x_{i3}, \quad 0< \alpha < 1. \end{aligned}$$
(15)

The noise level \(\sigma ^2\) was fixed at 1 because, at this noise level, both methods were able to accurately report only the true effects in the previous simulation study with independent effects. Then, \({x}_{i3}^*\) replaced \({x}_{i3}\) in the design matrix. Any false discoveries now arise solely from the incorrect detection of \({x}_{i3}\).

Estimands and targets Similar to the first study, the target was to correctly identify the nonzero regression coefficients under the new correlation structure.

Methods The same Bayesian regression model was employed with both mean-field and flow-based approximate posteriors.

Performance metrics Statistical power and FDR were again used as the primary metrics. Additional metrics included the coverage of the 95% credible intervals for \(\beta _3\) (results in Fig. 7 in the Appendix).

Results The right panel of Fig. 3 illustrates that the flow-based approximation outperformed the mean-field approximation in detecting the true model with correlated data. Specifically, the mean-field approximation identifies \(\gamma _3\) as significant when \(\alpha \ge 0.2\), leading to an FDR of 0.1, as all false discoveries originate from the incorrect detection of \(\gamma _3\). Conversely, the flow-based approximation does not select \(\gamma _3\) for \(\alpha \le 0.6\). Consequently, the flow-based approach correctly determines that \(x_{i3}\) should not be part of the true model under higher levels of correlation.

Implications This highlights the advantage of flow-based approximations in handling correlated data, where traditional mean-field approximations may fail due to their assumption of independence among covariates. The choice of approximation method should hence take the correlation structure of the data into account. In scenarios where covariates are likely to be correlated, flow-based approximations provide a more robust and reliable option. These findings are consistent with previous research (e.g., Carbonetto and Stephens [20]).

5.3 Simulation study 3

Aims This study aimed to assess the performance of the algorithms on a dataset with a complex model structure, using the ART study design from Royston and Sauerbrei [27], assessing fractional polynomials models using a real breast cancer dataset (as covariates) from Schmoor et al. [28].

Data generating mechanism The data are available at http://biom131.imbi.uni-freiburg.de/biom/Royston-Sauerbrei-book/Multivariable_Model-building/downloads/. Following https://www.mdpi.com/2504-3110/7/9/641, we modified the outcome model to make it more challenging for variable selection algorithms. The data generative process is thus given by

$$\begin{aligned} Y_{i} =&\; |x_{i1}|^{0.5} + x_{i1} + |x_{i3}|^{-0.5} + |x_{i3}|^{-0.5} \log (|x_{i3}| + \varepsilon ) \\&\; + x_{i4a} + x_{i5}^{-1} + \log (|x_{i6}| + \varepsilon ) + x_{i8} + x_{i10} + \epsilon _{i}, \end{aligned}$$
(16)

where \(x_{i4a}\) is the second level of \(x_{i4}\) and \(\epsilon _i \sim \text {N}(0, \sigma ^2)\). We used a small number, \(\varepsilon = 10^{-5}\), for numerical stability. Further, \(\sigma ^2\) was varied from \(10^2\) to \(10^{-10}\).
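For clarity, the outcome construction in Eq. (16) is sketched below. Loading of the real breast cancer covariates is omitted; random placeholder columns (and a placeholder sample size) are used so the snippet runs on its own, and the variable names simply mirror the indices in Eq. (16).

```python
# Sketch of the outcome model in Eq. (16) with placeholder covariates.
import numpy as np

rng = np.random.default_rng(0)
n, eps, sigma2 = 500, 1e-5, 1.0                     # placeholder size; sigma2 was varied
x1, x3, x5, x6, x8, x10 = rng.normal(loc=2.0, size=(6, n))   # stand-ins for the real columns
x4a = rng.binomial(1, 0.5, size=n)                  # indicator for the second level of x4

y = (np.abs(x1) ** 0.5 + x1
     + np.abs(x3) ** -0.5 + np.abs(x3) ** -0.5 * np.log(np.abs(x3) + eps)
     + x4a + x5 ** -1.0 + np.log(np.abs(x6) + eps)
     + x8 + x10
     + rng.normal(scale=np.sqrt(sigma2), size=n))
```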

Estimands and targets The focus was on variable selection power and FDR.

Methods We utilized mean-field and flow-based approximations within BGNLM, comparing them to GMJMCMC (https://www.mdpi.com/2504-3110/7/9/641), Bayesian fractional polynomials (BFP), and multivariate fractional polynomials (MFP). One hundred EVB runs were performed for each setting to fit the BGNLM. Here \(T = 15\) and \(\mathcal {G} = \{\mathcal {G}_0,\mathcal {G}_1,\mathcal {G}_2\}\) with \(\mathcal {G}_0 = \{x\}\), \(\mathcal {G}_1 = \{x^{-2}, x^{-1}, x^{-0.5}, \log x, x^{0.5}, x^{2}, x^3\}\), and \(\mathcal {G}_2 = \{ x^{-2}\log x, x^{-1} \log x, x^{-0.5}\log x, \log x \log x, x^{0.5}\log x, x \log x, x^{2} \log x, x^3 \log x\}\). Other hyper- and tuning parameters are listed in Table 6 in the Appendix.

Performance metrics Statistical power and FDR were again used as the primary metrics.

Results The results of this study confirmed the findings of the second study. As can be seen in Fig. 4, the flow-based approximation performed slightly better in terms of both power and FDR than the mean-field approximation when explanatory variables were correlated. In this study, not only were the covariates correlated, but also the polynomial terms were heavily correlated. Interestingly, the flow-based approximation even outperformed the GMJMCMC-based approximation for BGNLM in terms of the median performance for stronger signals. On the other hand, the mean-field approximation was on par with GMJMCMC in terms of power and outperformed it in terms of FDR for stronger signal levels. Both mean-field approximations and flow-based approximations outperformed all versions of the older implementation of Bayesian fractional polynomials (BFP) by Sabanés Bové and Held [29] in terms of both power and FDR under all model priors within BFP. Additionally, both of our variational approximations outperformed the frequentist multivariate fractional polynomials model (MFP) from Royston and Sauerbrei [27] in terms of FDR. They also outperformed MFP in terms of power for weaker signals and were on par with MFP for stronger signal levels.

Fig. 4

Power (left) and FDR (right) for the mean-field, flow-based, and GMJMCMC approximate posteriors within BGNLM, as well as BFP and MFP, for different values of \(\sigma ^2\) on the ART data. The dashed lines represent 95% confidence intervals across 100 runs

Implications These results demonstrate the robustness of flow-based approximations in complex real-world data scenarios, where they offer significant advantages over traditional methods.

5.4 Summary of the simulations

The overall conclusion is that our BGNLM inference algorithms, both mean-field and flow-based, show promise in variable selection tasks across different simulation scenarios. The flow-based approach tends to outperform the mean-field approximation when the effects are correlated, but as expected does not improve in the independent case.

6 Real data applications

In this section, we first apply our novel EVB algorithms to the real-world datasets used in Hubin et al. [6] for comparison with GMJMCMC and the other methods studied there. All datasets proposed in Hubin et al. [6] are relatively small, with at most a few thousand observations; that is why we additionally give a “tall data” example in this section. Here, we assume M-open settings [11] and focus on predictive performance rather than recovery of the data-generating processes. In all of the real data examples, we ran EVB for \(T = 15\) generations with the population size defined as \(p+40\), where p is the number of covariates, with ("_NF") and without ("_MF") normalizing flows. Following Hubin et al. [6], the AIC-prior inclusion penalty was used for binary classification, while for the regression problem we additionally report results for the BIC-prior penalty.

The sets of nonlinearities \(\mathcal {G}\) correspond to those from Hubin et al. [6] and are described in the text for each experiment, while all other hyper- and tuning parameters of BGNLM and EVB are reported in Table 6 in the Appendix.

6.1 Prediction of metric outcome

For predicting a metric outcome, we use a BGNLM with a Gaussian response and identity link function. We benchmark our methods against various regression models, reporting root-mean-squared error (RMSE), mean absolute error (MAE), and correlation (CORR) on the test set aggregated based on 100 runs of each method with different seeds.

Abalone age dataset

The Abalone data are available in the UCI repository at https://archive.ics.uci.edu/ml/datasets/Abalone. The goal is to predict the age of abalone from physical measurements, with covariates including sex (categorical), length (continuous), diameter (continuous), height (continuous), whole weight (continuous), shucked weight (continuous), viscera weight (continuous; gut weight after bleeding), and shell weight (continuous; after drying), resulting in \(p = 9\). For this dataset, we have \(n = 4177\) observations, of which 3177 were chosen randomly for training, and the remaining were kept for testing. Nonlinearities were selected from \(\mathcal {G} = \{\text {sigmoid}(x), \log (|x| + 1), \text {exp}(-|x|), |x|^{7/2}, |x|^{5/2}, |x|^{1/3} \}\) with uniform probability, and we used all transformations with \(P_{mu} = P_{mo} = P_{pr} = P_{in} = 1/4\).

From Hubin et al. [6], we know that nonlinearities and interactions are important for making predictions on these data, with the more complicated methods outperforming the linear models. We confirm this finding for BGNLMs trained with the suggested EVB approach. For this specific task, all BGNLMs outperform the other models in terms of the addressed metrics on the test set; however, GMJMCMC-based inference remains favorable compared to EVB in the majority of cases. Also, the normalizing flows-based variational approximation seems to give better predictive performance than the mean-field approach. Note, however, that we are not concerned with outperforming every other method on every task, but rather focus on demonstrating the robustness of our approach in various scenarios. We further report predictive credible intervals in Fig. 8 in the Appendix to demonstrate the ability to handle uncertainty.

Table 1 Root-mean-squared error (RMSE), mean absolute error (MAE), and correlation (CORR) of various regression algorithms

6.2 Binary classification

For binary classification, we consider a BGNLM with a Bernoulli-distributed response variable, using the logit link function and dispersion parameter \(\phi = 1\). We compare our methods with various classification algorithms, reporting accuracy (ACC), false-positive rate (FPR), and false-negative rate (FNR) on the test set aggregated over 100 runs on different seeds. In both of the classification examples, we used \(\mathcal {G} = \{\text {gauss}(x), \text {sigmoid}(x), \sin (x), \cos (x), \tanh (x), \tan ^{-1}(x), \log (|x| + \varepsilon ), \text {exp}(x), |x|^{7/2}, |x|^{5/2}, |x|^{1/3}\}\).

Wisconsin Breast Cancer dataset

This is a classical dataset from the UCI repository, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). It has 357 observations with benign and 212 with malignant tissue. The dataset contains 30 explanatory variables derived from digitized images of fine needle aspirates of breast mass. We used 142 observations as the training set, with the remaining used for the test set. A detailed discussion of this dataset and the results for GMJMCMC and other methods are given in Hubin et al. [6]. There, it was also found that linear models perform best on these data and that BGNLM recovers only linear structures. We observe the same performance of BGNLM fitted through EVB, with both mean-field approximations and normalizing flows, as in the case of using GMJMCMC for inference. The mean-field variational approximation here achieves the best median performance, although the differences are not significant. We would like to once again emphasize that we are not concerned with beating all other predictive methods, but rather with demonstrating the robust performance of BGNLM trained with EVB.

Table 2 Accuracy (ACC), false-negative rate (FNR), and false-positive rate (FPR) of various classification algorithms

Spam Dataset

The second classification task involves the utilization of the Spam data from https://archive.ics.uci.edu/ml/datasets/spambase. The term ’Spam’ encompasses a wide array of content, ranging from product advertisements and money-making schemes to chain letters and the dissemination of inappropriate images or videos. In this dataset, the collection of spam emails comprises messages actively marked as spam by users, while non-spam emails are categorized as work-related or personal. The dataset encompasses 4601 emails, with 1813 labeled as spam. Each email is associated with 58 characteristics serving as explanatory input variables, including 57 continuous and one nominal variable. Most of these characteristics pertain to the frequency of specific words or characters. Additionally, three variables offer diverse measurements of the sequence length of consecutive capital letters. The data were randomly split into a training set comprising 1536 emails and a test set containing the remaining 3065 emails.

It can be observed that, in general, nonlinear models perform better than linear models, indicating the presence of nonlinear relationships between the response and the covariates. The Bayesian Generalized Nonlinear Model (BGNLM) fitted by all algorithms is superior to the linear models and thus adapts to the patterns of the data, demonstrating its robustness. However, for this particular example, all BGNLMs perform somewhat worse than the other nonlinear models. Finally, there is no significant difference between the BGNLMs fitted by the GMJMCMC and EVB algorithms, both with and without flows.

Table 3 Accuracy (ACC), false-negative rate (FNR), and false-positive rate (FPR) of various classification algorithms

6.3 Tall data example

In this experiment, we study the scalability and feasibility of the suggested algorithm for fitting BGNLM to a very tall dataset consisting of 253,680 observations. Here, we also focus on predictive performance across resampled data rather than the stability and robustness of the fitting algorithms across different seeds. Thus, instead of running inference 100 times on different seeds with a fixed training set, we perform cross-validation across 10 splits of the data. We further report computational time as a reference (not for comparisons). Finally, we report AUCs in addition to accuracies, due to the large class imbalance in this dataset. In addition to the full sample GMJMCMC, in this experiment we also ran the novel subsampling GMJMCMC approach from Lachmann and Hubin [14]; the model fitted by the subsampling GMJMCMC algorithm is denoted as BGNLM_SMC, and 1% of the data was used for subsampling. In both implementations of the inference algorithm, we used \(\mathcal {G} = \{\text {gauss}(x), \text {sigmoid}(x), \sin (x), \cos (x), \tanh (x), \tan ^{-1}(x), \log (|x| + \varepsilon ), \text {exp}(x), |x|^{7/2}, |x|^{5/2}, |x|^{1/3}\}\).

6.3.1 Heart disease health indicators dataset

The Heart Disease Health Indicators Dataset (https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset), sourced from the 2015 Behavioral Risk Factor Surveillance System, encompasses 253,680 survey responses reflecting the health-related behaviors and conditions of Americans. In the US, heart disease claims around 647,000 lives annually, making it a frequent cause of death. With 21 covariates derived from participant responses, the dataset is a comprehensive repository of factors influencing heart disease. It also features a distinct class imbalance: 229,787 respondents without heart disease and 23,893 with a history of heart disease.

Table 4 Area under the curve (AUC), accuracy (ACC), and computational time (TIME) of various classification algorithms

It is worth noting that in this experiment there is only small variability in performance across the methods. However, the BGNLMs fitted using all of the methods, just as in the previous examples, show robust predictive performance. Additionally, these methods are feasible for inference on a tall dataset. As expected, the variational EVB fitting runs faster than the full sample GMJMCMC, and the mean-field approximation runs quicker than the flow-based method. The subsampling GMJMCMC runs faster than the full sample approach when using the same hyperparameters. Overall, both GMJMCMC and EVB are slower than the non-Bayesian baselines, which is also expected. Yet, different server loads may have introduced additional variation in the run times, as may the fact that different methods are implemented in R or Python, with or without a C++ backend, and run on CPUs or GPUs. In that sense, we would like to emphasize that the computational times here should be interpreted carefully.

Summary of the real data examples. In real data predictive tasks, we have found that our BGNLM inference algorithms, both mean-field and flow-based, are robust and adaptable. We tested these algorithms on both short and tall datasets with varying numbers of covariates, in both regression and classification tasks, where nonlinearities are important in some cases but not in others. Regarding predictive properties, we found no significant differences between the flow-based EVB and the mean-field EVB. Moreover, the EVB fits did not result in a decline in predictive performance compared to GMJMCMC. The EVB fits were also faster, in both theory and practice, than the full sample GMJMCMC (subject to the considerations regarding computational times discussed above).

7 Discussion

In this paper, we have introduced the use of stochastic variational Bayes inside the populations of the GMJMCMC algorithm for inference on BGNLM. The novel algorithm is called evolutionary variational Bayes, or EVB. Two variational approximations are suggested for scalable inference: the mean-field one from Carbonetto and Stephens [20] and Hubin and Storvik [21], and the normalizing flows-based one from Skaaret-Lund et al. [22]. The resulting algorithm is compared to GMJMCMC and other relevant approaches in M-closed settings of data-generating process recovery in linear regressions and fractional polynomials. Further, predictive inference in M-open settings was addressed. Finally, one example focusing on tall data was run to check the feasibility of inference based on the suggested approach in settings with massive data, where standard MCMC becomes too slow in practice. These experiments demonstrated the robust performance of the proposed approach in all of the settings, with either the mean-field or the normalizing flow approximation. The inference yields models that are good at both model discovery and prediction, in settings with independent and correlated covariates, and in cases where either linear structure or complicated nonlinearities are important. We also demonstrated the predictive uncertainty handling ability of the BGNLMs fitted by EVB. All of the experiments are accompanied by implementations and code at https://github.com/sebsommer/BGNLM, yet a fully documented Python library remains an important future task.

In Table 6, we provide an overview of the hyperparameters and tuning parameters applied in the applications. Various considerations were taken into account when selecting the model prior and the hyperparameter \(a\). We discovered that a prior favoring a sparser model is generally preferable. When determining the number of generations to run (T), it is essential to acknowledge that an excessive number of generations can result in a time-consuming training process, while too few generations may overlook crucial features. We found it advantageous to include a substantial number of features in each generation (\(q^*\)) to facilitate rapid exploration of different features. However, too many features may lead to time-consuming feature generation, given the requirement that features within each population should not be collinear. Regarding the tuning of batch sizes (B) and epochs (E), our algorithm appears to be sensitive to these parameters, emphasizing the need for careful consideration based on the characteristics of the datasets. It is evident, however, that moderately small batches and more epochs generally prove to be a beneficial approach.

As the main limitation of the paper, it is first and foremost important to mention that we mainly provide empirical results and acknowledge the lack of theoretical guarantees on the tightness of the bounds of the suggested variational approach. Within the simple variational families we consider, such guarantees would not be feasible under any reasonable regularity conditions. Thus, the results are approximate as compared to MCMC sampling. This is the price we have to pay for a more scalable approach. That said, although in theory our results are approximations, in practice the performance of the trained models appeared on par with MJMCMC and GMJMCMC in the empirical studies.

While the proposed methods perform well in inferential tasks within a specified context, it is also important to acknowledge their limitations and the scenarios in which their utility might be constrained. Primarily, unlike the original GMJMCMC for BGNLM, we do not propose the variational approach for handling mixed-effect models, which are useful for spatiotemporal dependencies or data with repeated measurements on specific groups. Thus, the suggested approach so far only works for conditionally independent tabular data. The focus on tabular data also implies that the application of the suggested method to other data formats, such as unprocessed raw images, audio, or text, has not been explored; such data would require different approaches or adaptations of the current method.

In addition, the complexity and resource requirements of the methods, particularly the flow adaptation, remain a limitation. Depending on the size and complexity of the dataset, the method’s demand for computational power and memory can still be substantial at the training stage. This restricts its applicability in resource-constrained environments or for practitioners lacking access to high-performance computing resources. While the method performs well in scenarios necessitating accurate inference, its utility might be limited when the need for precision does not outweigh the associated computational costs. At test time, however, the method provides extremely sparse solutions that can be run even on very low-resource devices.

It is also important to mention a practical limitation: the implementation is not yet wrapped in an easy-to-use library and is presented as raw source code for now. This is because the existing packages for GMJMCMC are implemented in R, while the current paper provides a Python implementation; hence, it will be important either to build an independent Python library or to implement the EVB algorithms in R to make them easily usable.

Future research can first address the limitations above. Other directions include expanding the variational approach to a computational graph covering the full feature space of BGNLM, without having to use the evolutionary component for inference at all. Extending the suggested variational approaches to structure learning in other model spaces is another possible direction. Last, but not least, it is of general interest to extend the approach to multidimensional responses such as multinomial, Dirichlet, or multivariate Gaussian.