1 General comments

First, I would like to congratulate the authors for having produced a very valuable contribution to a still quite topical field, located today between statistics and computer science. Since the pioneering work of Pearl (1985, 1988, 2000) and Neapolitan (1989), Bayesian networks have become increasingly popular in many areas for solving seemingly quite different problems, including estimation, classification, prediction, and causal or network analysis. For reviews, one may simply consult the authors' own publication lists, see (Blanco et al. 2003; Bielza et al. 2011; Bielza and Larrañaga 2014a, b), to mention only a few of their surveys. While at present, at least in terms of dissemination and practical use, they are perhaps not as popular as other machine learning methods such as neural networks or random forests, a major advantage is their clear and intuitive structure and interpretation. This makes them particularly attractive not just for causal analysis, in both directions, i.e., for identifying causal links and for estimating so-called treatment effects (Frölich and Sperlich 2019), but more generally for interpretable machine learning, in contrast to building black boxes that we afterwards try to understand. This is not a minor point, since making black-box models explainable has become a major research area. Thanks to a well-written description of the state of the art, their contribution is clearly delineated: combining parametric hybrid Bayesian networks with nonparametric ones, including implementation, computational aspects, and an illustration of its use. All this makes it a topical contribution to a relevant and broadly applicable method that has already proven very useful in practice. My main comments focus (a) on the gap between the two extremes combined here, namely linear Gaussian models (LG) and kernel density estimation (KDE), and (b) on smoothing over discrete variables.

2 Between linear Gaussian and nonparametrics

For the inclusion of continuous variables in the Bayesian network, the authors consider only two options. The first one is to estimate a conditional density of the type

$$\begin{aligned} f_{CLG}\left( x_i | \varvec{y}, \varvec{z}\right) := \mathcal{N} \left( \beta _{\varvec{z},0}+\sum ^r_{j=1} \beta _{\varvec{z},j} y_j , \sigma ^2_{\varvec{z},i} \right) \ , \end{aligned}$$
(1)

(keeping the notation of their article) where \(x_i\) is a given value in the i-th dimension of the full vector of variables, \(\varvec{y}= (y_1, ... , y_r)\) are given values of the continuous parent variables, and \(\varvec{z}= (z_1,...,z_p)\) are given values of the discrete parent variables. The \(\beta \)- and \(\sigma ^2\)-parameters have to be learned from the data. If all densities are of this type, we obtain the "classic" (linear) Gaussian Bayesian network. If we consider hybrid networks with very few continuous but many discrete variables, which moreover can take many different values, then structure (1) can be said to be quite flexible, also because the distributions of the discrete variables are (in most cases) estimated nonparametrically, possibly modified by a uniform Bayesian Dirichlet equivalent, see equation (14) of their article. It may then even be over-parameterized, with a tendency to overfit, see the next section.
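To fix ideas, a minimal sketch (not the authors' implementation; all names and the array layout are hypothetical) of how the \(\beta \)- and \(\sigma ^2\)-parameters of (1) could be learned by maximum likelihood, i.e., by one ordinary least squares fit per observed configuration of the discrete parents, is the following:

```python
import numpy as np

def fit_clg(x_i, Y, Z):
    """Learn the CLG parameters of (1): one OLS regression of the child
    variable on its continuous parents for every observed configuration
    of the discrete parents.
    x_i : (N,) child values, Y : (N, r) continuous parents,
    Z   : (N, p) discrete parents (integer-encoded).
    Returns a dict mapping each configuration z to (beta, sigma2)."""
    params = {}
    for z in np.unique(Z, axis=0):
        mask = np.all(Z == z, axis=1)                 # subsample with Z^j = z
        X = np.column_stack([np.ones(mask.sum()), Y[mask]])
        beta, *_ = np.linalg.lstsq(X, x_i[mask], rcond=None)
        resid = x_i[mask] - X @ beta
        params[tuple(z)] = (beta, resid @ resid / mask.sum())  # ML variance
    return params
```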

If, in contrast, we face many continuous variables, then it is questionable to give, in such a conditional density, all the flexibility to the discrete variables and only a very limited one to the continuous ones. Consequently, Atienza et al. (2022) propose to replace (1) by a kernel density estimate of the type

$$\begin{aligned} \hat{f}_{CKDE} \left( x_i | \varvec{y}, \varvec{z}\right) := \frac{\sum _{j=1:\varvec{z}^j=\varvec{z}}^{N} K_{\varvec{H}(\varvec{z})} \left( \begin{bmatrix} x_i \\ \varvec{y}\end{bmatrix} - \begin{bmatrix} x^j \\ \varvec{y}^j \end{bmatrix} \right) }{\sum _{j=1:\varvec{z}^j=\varvec{z}}^{N} K_{\varvec{H}_{-i} (\varvec{z})} (\varvec{y}- \varvec{y}^j)} \end{aligned}$$
(2)

where \(K_{\varvec{H}(\varvec{z})}\) and \(K_{\varvec{H}_{-i} (\varvec{z})}\) denote kernel functions with their corresponding \(\varvec{z}\)-dependent bandwidth matrices \(\varvec{H}\) and \(\varvec{H}_{-i}\), and \(N\) is the sample size.
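For illustration only, a stripped-down evaluation of (2) with product Gaussian kernels and fixed diagonal bandwidths, instead of the full \(\varvec{z}\)-dependent bandwidth matrices of the article, could look as follows (all names are hypothetical):

```python
import numpy as np

def ckde_density(x_i, y, z, X, Y, Z, h_joint, h_y):
    """Evaluate a simplified version of (2) at (x_i, y) given z.
    X : (N,) child values, Y : (N, r), Z : (N, p) parent samples;
    h_joint : (r+1,) bandwidths for the joint kernel, h_y : (r,) for
    the marginal kernel. Ratio of means equals the ratio of sums in (2)
    because both run over the same subsample."""
    mask = np.all(Z == z, axis=1)                       # subsample with Z^j = z
    joint_pts = np.column_stack([X[mask], Y[mask, :]])  # points (x^j, y^j)
    query = np.concatenate([[x_i], y])
    # numerator: joint KDE of (x_i, y) on the subsample
    u = (query - joint_pts) / h_joint
    num = np.mean(np.prod(np.exp(-0.5 * u**2) / (np.sqrt(2*np.pi) * h_joint), axis=1))
    # denominator: marginal KDE of y on the same subsample
    v = (y - Y[mask, :]) / h_y
    den = np.mean(np.prod(np.exp(-0.5 * v**2) / (np.sqrt(2*np.pi) * h_y), axis=1))
    return num / den
```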

There is no doubt about the advantages of a fully nonparametric distribution for many types of data analysis. Inside a potentially large network, however, this might be recommendable only when the subsamples generated by the conditioning \(\{ \varvec{z}^j=\varvec{z}\}\) are sufficiently large and the resulting computational time stays reasonable. It is therefore worth thinking about compromises between these two extremes (equations 1 and 2). To what extent such compromises are reasonable or useful will always be context-dependent. An obvious and probably straightforward one would be to substitute a generalized LG for equation (1) above, namely

$$\begin{aligned} f_{CgLG}\left( x_i | \varvec{y}, \varvec{z}\right) := \mathcal{N} \left( \beta _{\varvec{z},0}+\sum ^{s}_{j=1} \beta _{\varvec{z},j} g_j(\varvec{y}) , \sigma ^2_{\varvec{z},i} \right) \ , \end{aligned}$$
(3)

where \(g_1,...,g_s\) are (known) functions, which also allows for modeling interactions of the parents \(\varvec{Y}\) in the mean of \(X_i\).

One may also consider functions \(g_k\) that depend on unknown parameters, separable from the \(\beta \)s and learnable from the data. The main challenge will then be to find efficient implementations of the necessary parameter learning methods. More attractive might be to use only known \(g_k\), but to include many of them, resulting in a large \(s\). This obviously provides the link between parametric and nonparametric estimation via series estimators, i.e., by approximation instead of smoothing. While there are many arguments in favor of smoothing, in the present context and for computational reasons the (nonparametric) approximation approach seems to be an interesting alternative, as sketched below.
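As a small illustration of the series idea with known \(g_k\), one could expand the continuous parents into a fixed basis, say polynomial and pairwise interaction terms, and then reuse the same parameter learning as for (1); the concrete basis below is purely illustrative:

```python
import numpy as np

def basis_expand(Y, degree=2):
    """Expand the continuous parents Y (shape (N, r)) into known basis
    functions g_1,...,g_s: here, polynomial terms up to 'degree' plus
    all pairwise interactions (one possible, hypothetical choice)."""
    cols = [Y**d for d in range(1, degree + 1)]
    r = Y.shape[1]
    inter = [Y[:, [j]] * Y[:, [k]] for j in range(r) for k in range(j + 1, r)]
    return np.column_stack(cols + inter)

# The generalized LG (3) then amounts to applying the CLG machinery of (1)
# to the expanded design, e.g. fit_clg(x_i, basis_expand(Y), Z) with a
# fitting routine such as the one sketched after (1).
```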

3 Smoothing over discrete variables

As already indicated in the last section, the proposed hybrid semiparametric Bayesian network gives full flexibility to the impact of the discrete variables, denoted by \(\varvec{Z}\). This is only relaxed when the authors introduce the uniform Bayesian Dirichlet equivalent, see their equation (14), to counter numerical problems when estimating the probabilities of discrete variables. No similar remedy is proposed when conditioning on discrete variables. That is, Atienza et al. (2022) propose to strictly split the data set along the different configurations of the vector \(\varvec{z}\) when conditioning on them, recall (1) and (2) above. As said, this may work well (though it is not efficient) when each resulting subsample is sufficiently large. In practice, people often treat discrete variables like continuous ones once they take more than three or four different values. Li and Racine (2003) showed that in finite samples with both continuous and discrete variables, such smoothing over the discrete ones is indeed efficient. Later, Chu et al. (2017) showed that this still holds true when facing only discrete variables; intuitively not too surprising, as it simply generalizes ideas of the Stein estimator.

More specifically, in (2) the sum(s) would then run over the entire sample without the constraint(s) \(\{ j:\varvec{z}^j = \varvec{z}\}\), while the kernels \(K_{\varvec{H}_{-i} (\varvec{z})}\) and \(K_{\varvec{H}(\varvec{z})}\) get multiplied by discrete kernels, see (Li and Racine 2003) and (Chu et al. 2017) for details. The same procedure could be applied to the LG densities (1). For them, one may argue that it seems odd to smooth over the discrete but not the continuous variables. In practice, however, it may well happen that one observes more distinct values of \(\varvec{Z}\) than of \(\varvec{Y}\) in the sample.
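To sketch the idea (assuming Aitchison-Aitken-type kernels as used by Li and Racine 2003), the hard split \(\{ j:\varvec{z}^j = \varvec{z}\}\) would be replaced by product weights of the following form, which then multiply the continuous kernels in weighted versions of the sums in (1) and (2); the discrete bandwidths \(\lambda _k\) have to be selected, see below:

```python
import numpy as np

def discrete_kernel_weights(z, Z, lam, n_cats):
    """Aitchison-Aitken product kernel weights over the discrete parents:
    weight 1 - lam_k if Z^j_k == z_k, else lam_k / (c_k - 1), where c_k >= 2
    is the number of categories of the k-th discrete parent.
    lam_k = 0 recovers the hard split {Z^j = z}; lam_k = (c_k - 1)/c_k
    smooths the k-th variable away entirely.
    z : (p,) query configuration, Z : (N, p) sample, lam, n_cats : (p,)."""
    same = (Z == z)
    w = np.where(same, 1.0 - lam, lam / (n_cats - 1.0))
    return np.prod(w, axis=1)   # one weight per observation j = 1,...,N
```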

Note that such smoothing over discrete variables is not only interesting when conditioning on them; it is also an alternative to the uniform Bayesian Dirichlet equivalent. For simple (plug-in) bandwidth choices for the discrete part, see (Chu et al. 2015).

4 Minor comments

Some minor comments follow. In the kernel estimation literature, the Gaussian kernel is typically recommended when, inside a method, one has to divide by some of the kernel density estimates. This also justifies its use in the article of Atienza et al. (2022). Otherwise, it is actually not recommended, for various reasons, efficiency and implementation being perhaps the main ones, see (Härdle et al. 2004).

For many network-based methods it is emphasized that all variables should be standardized beforehand or even brought to the same scale. This is not needed here, at least not for the kernel estimates, as the proposed bandwidth matrices do this job. Nonetheless, it has been noted in various papers that prior transformations of \(\varvec{Y}\) beyond standardization can significantly improve the estimators' performance, especially when strong asymmetries and long tails are present in the original data, see, e.g., Buch-Larsen et al. (2005) and references therein.
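As a simple stand-in for such a transformation step (Buch-Larsen et al. 2005 use a modified Champernowne transform; the log transform below is only a toy example for positive, right-skewed data), one estimates the density on the transformed scale and maps it back with the corresponding Jacobian:

```python
import numpy as np

def kde_log_transformed(x_eval, sample, h):
    """Transformation KDE sketch for positive, right-skewed data:
    estimate the density of log(X) with a Gaussian KDE and transform
    back via f_X(x) = f_log(log x) / x (the Jacobian of the log map).
    x_eval : (m,) evaluation points, sample : (N,) data, h : scalar bandwidth."""
    t_eval, t_sample = np.log(x_eval), np.log(sample)
    u = (t_eval[:, None] - t_sample[None, :]) / h
    f_log = np.mean(np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h), axis=1)
    return f_log / x_eval
```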

It is understandable that computational considerations suggest using plug-in bandwidths for the nonparametric parts, maximum likelihood and Dirichlet priors for the parameter learning, and k-fold cross-validation for structure learning. However, these are all model selection problems, and the different selection criteria often refer to different objective functions. The problem with methods for which one has no clear asymptotic theory (so-called soft methods) is that changing the objective function(s) during estimation, learning, training, selection or trimming destroys a clear interpretation of what one is actually minimizing or maximizing. More discussion of this issue would be very welcome. For the different objective functions regarding bandwidth selection, one may consult (Heidenreich et al. 2013) and (Köhler et al. 2014).

Last but not least, when Atienza et al. (2022) speak of asymptotic theory, they refer to computational aspects, not to the statistical behavior of the final estimates. Instead, as is quite common in this literature, they directly propose sampling, and consequently Monte Carlo strategies, to assess the estimators' behavior and to obtain standard errors or confidence intervals. In non- and semiparametric statistics this is actually a quite serious issue that has led to an extensive literature on wild bootstrap procedures. The really tricky problem is then to adjust the various regularization, smoothing or bandwidth parameters (Sperlich 2014). A more detailed discussion of how the method(s) proposed and used by Atienza et al. (2022) relate(s) to that literature would be very much appreciated.
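Purely for concreteness, a minimal wild bootstrap sketch for the regression part of (1) within a single discrete configuration, using Rademacher multipliers on the residuals, could look as follows; it is only meant to illustrate the principle, not what the cited literature would ultimately recommend for the full semiparametric network:

```python
import numpy as np

def wild_bootstrap_betas(x_i, Y, B=500, rng=None):
    """Wild bootstrap for the regression coefficients of (1) in one
    discrete configuration: resample x* = fit + v * residual with
    Rademacher multipliers v and refit, returning the B bootstrap draws
    (e.g. np.std(draws, axis=0) gives bootstrap standard errors)."""
    rng = np.random.default_rng(rng)
    X = np.column_stack([np.ones(len(x_i)), Y])
    beta, *_ = np.linalg.lstsq(X, x_i, rcond=None)
    resid = x_i - X @ beta
    draws = []
    for _ in range(B):
        v = rng.choice([-1.0, 1.0], size=len(x_i))   # Rademacher weights
        x_star = X @ beta + v * resid
        b_star, *_ = np.linalg.lstsq(X, x_star, rcond=None)
        draws.append(b_star)
    return np.array(draws)
```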