Abstract
This note comments on the article by David Atienza, Pedro Larrañaga and Concha Bielza, in which they first review recent contributions to Bayesian networks and then introduce a new hybrid version. Their method combines parametric and nonparametric density estimates for continuous variables while simultaneously allowing for discrete parents. We discuss straightforward extensions of the linear Gaussian parts and potential smoothing over the outcomes of discrete parents, and conclude with some minor comments.
1 General comments
First, I would like to congratulate the authors for having produced a very valuable contribution to a still quite topical field, located today between statistics and computer science. Since the pioneering work of Pearl (1985, 1988, 2000) and Neapolitan (1989), Bayesian networks have become increasingly popular in many areas for solving seemingly quite different problems, comprising estimation, classification, causality and network analysis, and prediction. For reviews, one may simply consult the authors' publication lists, see Blanco et al. (2003), Bielza et al. (2011) and Bielza and Larrañaga (2014a, b), to mention only some of the various surveys. While presently, at least regarding dissemination and practical use, they are maybe not as popular as other machine learning methods such as neural networks or random forests, a major advantage is their clear and intuitive structure and interpretation. This makes them particularly attractive not just for causality analysis, in both directions, i.e., for identifying causal links and for estimating so-called treatment effects (Frölich and Sperlich 2019), but even more generally for interpretable machine learning, in contrast to creating black boxes that we afterward try to understand. This is not a minor point, since making black-box models explainable has become a major research area. Thanks to a well-done description of the state of the art, their contribution is clearly identified: combining parametric hybrid Bayesian networks with nonparametric ones, including implementation, computational aspects and an illustration of its use. All this makes it a topical contribution to a relevant and broadly applicable method that has already proven very useful in practice. My main comments will focus (a) on the gap between the two extremes combined here, namely linear Gaussian models (LG) and kernel density estimation (KDE), and (b) on the smooth handling of discrete variables.
2 Between linear Gaussian and nonparametrics
For the inclusion of continuous variables in the Bayesian network, the authors consider only two options. The first one is to estimate a conditional density of the type
\[
f(x_i \mid \varvec{y}, \varvec{z}) = \mathcal{N}\!\left(x_i;\; \beta_0(\varvec{z}) + \sum_{k=1}^{r} \beta_k(\varvec{z})\, y_k,\; \sigma^2(\varvec{z})\right) \qquad (1)
\]
(keeping the notation of their article) where \(x_i\) is a given value in the \(i\)-th dimension of the full vector of variables, \(\varvec{y} = (y_1, \ldots, y_r)\) are given values for continuous, and \(\varvec{z} = (z_1, \ldots, z_p)\) given values for discrete parent variables. The \(\beta\)- and \(\sigma^2\)-parameters have to be learned from the network. In case all densities are of this type, we obtain the 'classic' (linear) Gaussian Bayesian network. If we consider hybrid networks with very few continuous but many discrete variables that, furthermore, can take many different values, then structure (1) can be said to be quite flexible, also because the distributions of the discrete variables are (in most cases) estimated nonparametrically, possibly modified by a uniform Bayesian Dirichlet equivalent, see equation (14) of their article. It may then even be over-parameterized, with a tendency to overfitting, see the next section.
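To fix ideas, here is a minimal sketch (purely illustrative, not the authors' implementation) of how the \(\beta\)- and \(\sigma^2\)-parameters of such a linear Gaussian conditional could be learned: split the data along the configurations of the discrete parents and run ordinary least squares within each cell. All function and variable names are made up for this note.

```python
# Sketch: per-configuration OLS for a linear Gaussian conditional density.
# Not the authors' code; names and data layout are illustrative assumptions.

def fit_lg(samples):
    """samples: list of (x, y, z) with y a tuple of continuous parent values
    and z a hashable discrete-parent configuration.
    Returns {z: (beta, sigma2)} with beta = (intercept, slopes...)."""
    by_z = {}
    for x, y, z in samples:
        by_z.setdefault(z, []).append((x, y))
    params = {}
    for z, sub in by_z.items():
        X = [[1.0, *y] for _, y in sub]          # design matrix with intercept
        t = [x for x, _ in sub]
        p, n = len(X[0]), len(X)
        # normal equations A beta = b
        A = [[sum(X[m][a] * X[m][b] for m in range(n)) for b in range(p)]
             for a in range(p)]
        b = [sum(X[m][a] * t[m] for m in range(n)) for a in range(p)]
        beta = solve(A, b)
        resid = [t[m] - sum(beta[k] * X[m][k] for k in range(p)) for m in range(n)]
        sigma2 = sum(r * r for r in resid) / n   # ML variance estimate
        params[z] = (beta, sigma2)
    return params

def solve(A, b):
    """Tiny Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[col][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]
```

For noiseless data generated as \(x = 1 + 2y\) within one cell, the fit recovers the intercept 1 and slope 2 exactly, with \(\hat\sigma^2 \approx 0\).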
If, in contrast, we face many continuous variables, then it is arguable to give in such conditional density all flexibility to the discrete variables, and just a very limited one to the continuous ones. Consequently, Atienza et al. (2022) propose to replace (1) by a kernel density estimate of type
\[
\hat{f}(x_i \mid \varvec{y}, \varvec{z}) = \frac{\frac{1}{N}\sum_{j:\, \varvec{z}^j = \varvec{z}} K_{\varvec{H}(\varvec{z})}\!\big((x_i, \varvec{y}) - (x_i^j, \varvec{y}^j)\big)}{\frac{1}{N}\sum_{j:\, \varvec{z}^j = \varvec{z}} K_{\varvec{H}_{-i}(\varvec{z})}\!\big(\varvec{y} - \varvec{y}^j\big)} \qquad (2)
\]
where \(K_{\varvec{H}_{-i}(\varvec{z})}\) and \(K_{\varvec{H}(\varvec{z})}\) denote kernel functions with their corresponding \(\varvec{z}\)-dependent bandwidth matrices \(\varvec{H}_{-i}\) and \(\varvec{H}\), and \(N\) is the sample size.
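A toy version of this conditional KDE may clarify the hard split along \(\varvec{z}\). This is only a sketch under simplifying assumptions: product Gaussian kernels and a diagonal, \(\varvec{z}\)-dependent bandwidth vector instead of a full bandwidth matrix.

```python
# Sketch of the conditional KDE: ratio of a joint KDE in (x, y) and a
# marginal KDE in y, both restricted to observations with z^j = z.
# Illustrative only; the article uses full bandwidth matrices.
import math

def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def cond_kde(x, y, z, samples, h):
    """samples: list of (x_j, y_j, z_j) with y_j a tuple.
    h: dict mapping z to a bandwidth vector for (x, y_1, ..., y_r)."""
    hx, *hy = h[z]
    num = den = 0.0
    for xj, yj, zj in samples:
        if zj != z:
            continue  # the hard split along the discrete configuration
        w = 1.0
        for yk, yjk, hk in zip(y, yj, hy):
            w *= gauss((yk - yjk) / hk) / hk
        den += w
        num += w * gauss((x - xj) / hx) / hx
    return num / den if den > 0.0 else 0.0
```

By construction the estimate integrates to one over \(x_i\) for any fixed \((\varvec{y}, \varvec{z})\), since the \(\varvec{y}\)-weights in numerator and denominator coincide.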
There is no doubt about the advantages of a fully nonparametric distribution for different types of data analyses. Inside a potentially large network, however, this might only be recommendable when the subsamples generated by the conditioning \(\{ \varvec{z}^j=\varvec{z}\}\) are sufficiently large and the resulting computational time relatively small. Therefore, it is worth thinking about compromises between these two extremes (equations 1 and 2). To what extent such compromises are reasonable or useful will always be context dependent. An obvious and probably straightforward one could be to substitute a generalized LG for equation (1) above, namely
\[
f(x_i \mid \varvec{y}, \varvec{z}) = \mathcal{N}\!\left(x_i;\; \beta_0(\varvec{z}) + \sum_{k=1}^{s} \beta_k(\varvec{z})\, g_k(\varvec{y}),\; \sigma^2(\varvec{z})\right) \qquad (3)
\]
where the \(g_1, \ldots, g_s\) are (known) functions; this also allows for modeling interactions of the parent \(\varvec{Y}\)s in the mean of \(X_i\).
One may also discuss the use of functions \(g_k\) that depend on unknown parameters separable from the \(\beta\)s and learnable from the data. The main challenge will then be to find efficient implementations of the necessary parameter learning methods. More attractive could be to use only known \(g_k\), but including many of them, resulting in a large \(s\). It is obvious that this provides the link between parametric and nonparametric estimation via series estimators, i.e., by approximation instead of smoothing. While there are many arguments in favor of smoothing, in the considered context and for computational reasons the (nonparametric) approximation approach seems to be an interesting alternative.
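The series-estimator idea can be made concrete with a small sketch: regress \(X_i\) on an intercept plus many known basis functions of \(\varvec{y}\). The particular basis below (linear, quadratic and pairwise-interaction terms) is only an illustrative assumption, not a recommendation from the article.

```python
# Sketch of a series estimator for the mean in the generalized LG:
# OLS of x on (1, g_1(y), ..., g_s(y)) with a hand-picked polynomial basis.
# All names are illustrative; in practice the basis grows with the sample size.

def basis(y):
    """g_1,...,g_s at y = (y_1,...,y_r): linear, squares, pairwise interactions."""
    g = list(y)
    g += [a * a for a in y]
    g += [y[i] * y[j] for i in range(len(y)) for j in range(i + 1, len(y))]
    return g

def fit_series(samples):
    """samples: list of (x, y). Returns OLS coefficients via normal equations."""
    X = [[1.0, *basis(y)] for _, y in samples]
    t = [x for x, _ in samples]
    p, n = len(X[0]), len(X)
    A = [[sum(X[m][a] * X[m][b] for m in range(n)) for b in range(p)]
         for a in range(p)]
    c = [sum(X[m][a] * t[m] for m in range(n)) for a in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    M = [row[:] + [c[i]] for i, row in enumerate(A)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col and M[col][col] != 0.0:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]
```

If the true mean lies in the span of the basis, e.g. \(x = 2 + 3\, y_1 y_2\), the fit reproduces it exactly; otherwise the basis only approximates it, which is precisely the approximation-versus-smoothing trade-off mentioned above.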
3 Smoothing over discrete variables
As already indicated in the last section, the proposed hybrid semiparametric Bayesian network gives full flexibility to the impact of discrete variables, denoted by \(\varvec{Z}\). This is only breached when they introduce the uniform Bayesian Dirichlet equivalent, see their equation (14), to counter numerical problems when estimating the probabilities of discrete variables. No similar remedy is proposed when conditioning on discrete variables. That is, Atienza et al. (2022) propose to strictly separate the data set along the different configurations of the vector \(\varvec{z}\) when conditioning on them, recall (1) and (2) from above. As said, this may work well (though it is not efficient) when each resulting subsample is sufficiently large. In practice, people often treat discrete variables like continuous ones if they take more than three or four different values. Li and Racine (2003) showed that in finite samples with both continuous and discrete variables, smoothing over the discrete ones is indeed efficient. Later, Chu et al. (2017) showed that this still holds true when facing only discrete variables; intuitively not too surprising, as it simply generalizes ideas of the Stein estimator.
More specifically, in (2) the sum(s) would then run over the entire sample without the constraint(s) \(\{ j:\varvec{z}^j = \varvec{z}\}\), while the kernels \(K_{\varvec{H}_{-i}(\varvec{z})}\) and \(K_{\varvec{H}(\varvec{z})}\) get multiplied by discrete kernels, see Li and Racine (2003) and Chu et al. (2017) for details. The same procedure could be applied to the LG densities (1). For these, one may argue that it looks odd to smooth over the discrete but not the continuous variables. In practice, however, it may well happen that one observes more distinct values for \(\varvec{Z}\) than for \(\varvec{Y}\).
Note that such smoothing over discrete variables is not only interesting when conditioning on them; it is also an alternative to the uniform Bayesian Dirichlet equivalent. For simple (plug-in) bandwidth choices for the discrete part, see (Chu et al. 2015).
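In the spirit of Li and Racine (2003), such discrete smoothing replaces the hard indicator split by a product of discrete kernels, for instance of Aitchison–Aitken type. The following sketch (illustrative only; bandwidth choice is a separate matter, cf. Chu et al. 2015) shows both the weighting and the smoothed cell-probability estimate mentioned as an alternative to the uniform Bayesian Dirichlet equivalent.

```python
# Sketch: Aitchison-Aitken discrete kernel weighting instead of a hard split,
# and a kernel-smoothed estimate of P(Z = z). Names are illustrative.

def aa_kernel(zj, z, lam, c):
    """Aitchison-Aitken kernel for one discrete variable with c categories.
    lam = 0 recovers the hard indicator 1{zj == z}."""
    return 1.0 - lam if zj == z else lam / (c - 1)

def discrete_weight(zj, z, lams, cs):
    """Product kernel over the components of the discrete vector."""
    w = 1.0
    for a, b, lam, c in zip(zj, z, lams, cs):
        w *= aa_kernel(a, b, lam, c)
    return w

def smoothed_pmf(z, sample_z, lams, cs):
    """Smoothed estimate of P(Z = z): every observation contributes,
    with a weight that shrinks as its configuration differs from z."""
    return sum(discrete_weight(zj, z, lams, cs) for zj in sample_z) / len(sample_z)
```

With bandwidths \(\lambda = 0\) this reduces to the ordinary relative frequencies; for \(\lambda > 0\) all cells receive positive probability, which is exactly the numerical remedy discussed above, and the estimated probabilities still sum to one.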
4 Minor comments
Some minor comments are the following. In the kernel estimation literature, the Gaussian kernel is typically recommended when, inside a method, one has to divide by some of the kernel density estimates. That also justifies its use in the article of Atienza et al. (2022). Otherwise it is actually not recommended, for various reasons, efficiency and ease of implementation being maybe the main ones, see Härdle et al. (2004).
For many network-based methods it is emphasized that all variables should be standardized beforehand or even brought to the same scale. This is not needed here, at least not for the kernel estimates, as the proposed bandwidth matrices do this job. Nonetheless, it has been noted in various papers that prior transformations of \(\varvec{Y}\) beyond standardization can significantly improve the estimators' performance, especially when strong asymmetries and long tails are present in the original data, see, e.g., Buch-Larsen et al. (2005) and references therein.
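The transformation idea is simple: estimate the density on a transformed scale where the data are closer to symmetric, then map back with the Jacobian. The sketch below uses a log transform for positive, right-skewed data purely as a stand-in; Buch-Larsen et al. (2005) use a Champernowne transform instead.

```python
# Sketch of transformation kernel density estimation:
# KDE of the transformed data, back-transformed with the Jacobian.
# The log transform here is an illustrative simplification.
import math

def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(t, ts, h):
    """Plain Gaussian KDE at point t from transformed data ts."""
    return sum(gauss((t - tj) / h) for tj in ts) / (len(ts) * h)

def log_tkde(x, xs, h):
    """Density estimate at x > 0: KDE of the log-data, times Jacobian 1/x."""
    ts = [math.log(v) for v in xs]
    return kde(math.log(x), ts, h) / x
```

Because the Jacobian is carried along, the back-transformed estimate is again a proper density on the original scale, typically with much better tail behavior than a direct KDE of heavy-tailed data.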
It is understandable that computational considerations suggest using plug-in bandwidths for the nonparametric parts, maximum likelihood and Dirichlet priors for the parameter learning, and k-fold cross-validation for structure learning. However, these are all model selection problems, and different selection criteria often refer to different objective functions. The problem with methods for which one has no clear asymptotic theory (so-called soft methods) is that changing the objective function(s) during estimation, learning, training, selection or trimming destroys a clear interpretation of what one is actually minimizing or maximizing. More discussion of this issue would be very welcome. For the different objective functions regarding bandwidth selection, one may consult Heidenreich et al. (2013) and Köhler et al. (2014).
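To make explicit what "different objective functions" means for bandwidth choice, here is one concrete criterion among the many reviewed in Heidenreich et al. (2013): least-squares cross-validation for a univariate Gaussian-kernel KDE, sketched with illustrative names.

```python
# Sketch: least-squares cross-validation (LSCV) for a KDE bandwidth.
# One specific objective function; plug-in rules optimize a different one.
import math

def gauss(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def lscv(h, xs):
    """Estimate of the integrated squared error (up to a constant) at bandwidth h."""
    n = len(xs)
    s2 = math.sqrt(2.0)
    # integral of fhat^2 in closed form: the convolution of two Gaussians
    # with bandwidth h is a Gaussian with bandwidth h*sqrt(2)
    int_f2 = sum(gauss((a - b) / (h * s2)) for a in xs for b in xs) / (n * n * h * s2)
    # leave-one-out cross term
    loo = sum(gauss((a - b) / h) for i, a in enumerate(xs)
              for j, b in enumerate(xs) if i != j) / (n * (n - 1) * h)
    return int_f2 - 2.0 * loo

def best_h(xs, grid):
    """Pick the grid bandwidth minimizing the LSCV criterion."""
    return min(grid, key=lambda h: lscv(h, xs))
```

A plug-in bandwidth instead minimizes an asymptotic approximation of the same risk; the two selectors can disagree noticeably in finite samples, which is exactly the interpretability issue raised above.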
Last but not least, when Atienza et al. (2022) speak of asymptotic theory, they refer to the computational aspects, not to the statistical behavior of the final estimates. Instead, as is quite common in this literature, they directly propose sampling and consequently Monte Carlo strategies to assess the estimators' behavior and to obtain standard errors or confidence intervals. In non- and semiparametric statistics this is actually a quite serious issue that has led to an extensive literature on wild bootstrap procedures. The really tricky problem is then to adjust the various regularization, smoothing or bandwidth parameters (Sperlich 2014). A more detailed discussion of how the method(s) proposed and used by Atienza et al. (2022) relate(s) to that literature would be very much appreciated.
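For readers unfamiliar with the wild bootstrap, a minimal sketch of the idea follows: resample residuals with random signs (Rademacher weights), refit, and take the spread of the refitted statistic as a standard error. This is shown for a generic scalar statistic; in the nonparametric setting the delicate points (choice of weights, re-smoothing bandwidths) are exactly those discussed in Sperlich (2014), and all names here are illustrative.

```python
# Sketch of a wild bootstrap standard error with Rademacher weights.
# Illustrative only; not the scheme used by Atienza et al. (2022).
import random

def wild_bootstrap_se(fit, xs, ys, B=200, seed=0):
    """fit: function (xs, ys) -> scalar statistic.
    Returns the bootstrap standard error of that statistic."""
    rng = random.Random(seed)
    theta = fit(xs, ys)
    yhat = [theta for _ in ys]                  # fitted values (constant fit here)
    resid = [y - f for y, f in zip(ys, yhat)]
    stats = []
    for _ in range(B):
        # flip each residual's sign at random, keep its magnitude (wild bootstrap)
        ystar = [f + r * rng.choice((-1.0, 1.0)) for f, r in zip(yhat, resid)]
        stats.append(fit(xs, ystar))
    m = sum(stats) / B
    return (sum((s - m) ** 2 for s in stats) / (B - 1)) ** 0.5
```

Because each bootstrap response keeps the magnitude of its own residual, the scheme preserves heteroscedasticity, which is its main advantage over a naive residual bootstrap.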
Change history
08 September 2022
Missing Open Access funding information has been added in the Funding Note.
References
Atienza D, Bielza C, Larrañaga P (2022) Hybrid semiparametric Bayesian networks. TEST
Bielza C, Li G, Larrañaga P (2011) Multi-dimensional classification with Bayesian networks. Int J Approx Reason 52:705–727
Bielza C, Larrañaga P (2014a) Discrete Bayesian network classifiers: a survey. ACM Comput Surv 47(1):1–43
Bielza C, Larrañaga P (2014b) Bayesian networks in neuroscience: a survey. Front Comput Neurosci 8:131
Blanco R, Inza I, Larrañaga P (2003) Learning Bayesian networks in the space of structures by estimation of distribution algorithms. Int J Intell Syst 18:205–220
Buch-Larsen T, Nielsen JP, Guillén M, Bolancé C (2005) Kernel density estimation for heavy-tailed distributions using the Champernowne transformation. Statistics 39:503–518
Chu C-Y, Henderson DJ, Parmeter CF (2015) Plug-in bandwidth selection for kernel density estimation with discrete data. Econometrics 3:199–214
Chu C-Y, Henderson DJ, Parmeter CF (2017) On discrete Epanechnikov kernel functions. Comput Stat Data Anal 116:79–105
Frölich M, Sperlich S (2019) Impact evaluation: treatment effects and causal analysis. Cambridge University Press
Härdle W, Müller M, Sperlich S, Werwatz A (2004) Nonparametric and semiparametric models. Springer Series in Statistics, Springer, Heidelberg
Heidenreich NB, Schindler A, Sperlich S (2013) Bandwidth selection methods for kernel density estimation: a review of fully automatic selectors. AStA - Adv Stat Anal 97:403–433
Köhler M, Schindler A, Sperlich S (2014) A review and comparison of bandwidth selection methods for kernel regression. Int Stat Rev 82:243–274
Li Q, Racine J (2003) Nonparametric estimation of distributions with categorical and continuous data. J Multivar Anal 86:266–292
Neapolitan RE (1989) Probabilistic reasoning in expert systems: theory and algorithms. Wiley
Pearl J (1985) Bayesian Networks: A Model of Self-Activated Memory for Evidential Reasoning (UCLA Technical Report CSD-850017). Proceedings of the 7th Conference of the Cognitive Science Society University of California, Irvine, CA. 329–334
Pearl J (1988) Probabilistic reasoning in intelligent systems. Morgan Kaufmann
Pearl J (2000) Causality: models, reasoning, and inference. Cambridge University Press
Sperlich S (2014) On the choice of regularization parameters in specification testing: a critical discussion. Empir Econ 47:427–450
Funding
Open access funding provided by University of Geneva.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This comment refers to the invited paper available at: 10.1007/s11749-022-00812-3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Sperlich, S. Comments on: hybrid semiparametric Bayesian networks. TEST 31, 335–339 (2022). https://doi.org/10.1007/s11749-022-00817-y