This is an interesting paper that combines structure learning in Bayesian networks (BNs) with kernel methods in a quest for more flexible distributional assumptions. Conditional (linear) Gaussian Bayesian networks (CGBNs) have been well explored in the literature for some time, to the point that they now appear in many recent textbooks (Koller and Friedman 2009; Scutari and Denis 2021; Kjærulff and Madsen 2013). The authors address one of the key limitations of CGBNs, namely that they can only capture linear dependencies between the continuous variables they contain, and remove it by replacing (mixtures of) linear regression models with more general kernel densities. Dependencies between discrete variables were already flexible, because the conditional probability tables that parametrise them essentially act as a saturated model (Rijmen 2008). It is not obvious that more flexibility will produce better models for whatever task we have in mind: it can also lead to overfitting, instability and hyperparameter tuning problems. However, the reconstruction accuracy demonstrated by the proposed Hybrid Semiparametric BNs (HSPBNs) is encouraging.
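To make the contrast concrete, here is a minimal sketch (not the authors' implementation) of the two kinds of local distribution for a continuous node \(Y\) with a single continuous parent \(X\): the linear-Gaussian model of a CGBN, and a conditional kernel density estimate in the spirit of HSPBNs. The toy data-generating process and all names are illustrative assumptions.

```python
# A minimal sketch, assuming a single continuous parent X of a continuous
# node Y; the nonlinear data-generating process is purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.sin(x) + rng.normal(scale=0.3, size=500)

# CGBN local distribution: Y | X = x is Gaussian with a linear mean.
b1, b0, *_ = stats.linregress(x, y)
sigma = (y - (b0 + b1 * x)).std(ddof=2)
def cg_density(y0, x0):
    return stats.norm.pdf(y0, loc=b0 + b1 * x0, scale=sigma)

# Kernel-based local distribution: f(y | x) = f(x, y) / f(x), with both
# densities estimated nonparametrically by Gaussian kernel density estimates.
kde_xy = stats.gaussian_kde(np.vstack([x, y]))
kde_x = stats.gaussian_kde(x)
def kernel_density(y0, x0):
    return (kde_xy([x0, y0]) / kde_x(x0)).item()

# At x = 2 the linear model misses the curvature that the KDE picks up.
print(cg_density(np.sin(2.0), 2.0), kernel_density(np.sin(2.0), 2.0))
```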

Like all good papers, it raises interesting questions in its construction.

How to measure structural distances? Common distributional assumptions in BNs, including CGBNs, assign the same type of distribution to a node in all possible network structures. Therefore, the presence of an arc denotes the same general type of dependence in all possible structures as well. However, this is no longer the case in HSPBNs, because continuous nodes can have either parametric or nonparametric characterisations for the same arcs. The authors acknowledge this by complementing the Structural Hamming Distance (SHD; Tsamardinos et al. 2006) with a Type Hamming Distance (THD) in their experimental evaluation. Should we combine them in a single measure by adding colours to the arcs and extending the SHD to count colour mismatches as errors? And how should we weight such errors compared to false positive and false negative arcs? Then there is also the question of whether we should update the definition of equivalence classes: it is used in constructing the SHD, and it has wide implications for our interpretation of BNs. In a CGBN, the continuous variables are assumed to be jointly distributed as a multivariate normal: that ensures that arcs that are not compelled can be oriented in either direction while producing networks in the same equivalence class. It is not obvious that this is the case with nonparametric nodes. We would also assume that any single node must have the same distribution in all the BNs in the same equivalence class, which means extending the characterisation of equivalence classes to consider arc colours as well.
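One possible starting point is the hypothetical "coloured" SHD sketched below, which adds a weighted count of node-type mismatches to the usual arc errors. The function name, the graph representation (CPDAG adjacency matrices) and the weight \(w\) are all my assumptions rather than anything proposed in the paper; choosing \(w\) is precisely the open weighting question raised above.

```python
# A hypothetical "coloured" SHD, assuming graphs are given as 0/1 CPDAG
# adjacency matrices (mutual 1s encoding undirected arcs) plus a vector of
# node types ("parametric"/"nonparametric"). The weight w trading off type
# errors against arc errors is an open choice.
import numpy as np

def coloured_shd(adj_true, adj_est, types_true, types_est, w=1.0):
    shd = 0
    n = adj_true.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            t = (adj_true[i, j], adj_true[j, i])   # arc between i, j in truth
            e = (adj_est[i, j], adj_est[j, i])     # and in the estimate
            if t != e:                             # missing, extra, reversed or
                shd += 1                           # wrongly (un)directed arc
    # type errors: nodes whose local distribution has the wrong "colour"
    thd = sum(a != b for a, b in zip(types_true, types_est))
    return shd + w * thd
```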

How to implement constraint-based learning with kernel-based nonparametric local distributions? Since the heuristic algorithms used in constraint-based learning are distribution-agnostic, the question becomes how to define a suitable conditional independence test. The answer is not trivial because we cannot easily construct likelihood-ratio tests from the likelihoods used in the cross-validated score \(S^k_{\mathrm{CV}}(\mathcal{D}, \mathcal{G})\). Firstly, those likelihoods are only defined for continuous variables conditional on discrete ones, which is problematic when testing the independence of discrete variables conditional on continuous ones. Secondly, it is unclear what the degrees of freedom of the test would be: computing effective degrees of freedom from the kernel transform is possible (Hastie et al. 2009), but not obviously appropriate. One option is to draw inspiration from existing kernel tests that have been extended to test conditional independence, such as the Hilbert–Schmidt independence criterion (HSIC; Gretton et al. 2008; Doran et al. 2014).
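For concreteness, here is a minimal sketch of the unconditional HSIC test with a permutation null, assuming Gaussian kernels with median-heuristic bandwidths; the conditional variants in Gretton et al. (2008) and Doran et al. (2014) build on the same statistic but are considerably more involved. Function names are mine.

```python
# A minimal sketch of an (unconditional) HSIC permutation test for two
# univariate samples x and y, assuming Gaussian kernels and the median
# heuristic for the bandwidth; names and defaults are illustrative.
import numpy as np

def _gram(x):
    d2 = (x[:, None] - x[None, :]) ** 2
    bw = np.median(d2[d2 > 0]) or 1.0              # median heuristic
    return np.exp(-d2 / bw)

def hsic_test(x, y, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n            # centring matrix
    K, L = _gram(x), _gram(y)
    Kc = H @ K @ H
    stat = np.trace(Kc @ L) / n**2                 # biased HSIC estimate
    # permuting y's Gram matrix simulates the null of independence
    null = [np.trace(Kc @ L[np.ix_(p, p)]) / n**2
            for p in (rng.permutation(n) for _ in range(n_perm))]
    return stat, np.mean([s >= stat for s in null])  # statistic, p-value
```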

A second option would be to look at BN structure learning approaches such as that of Handhayani and Cussens (2020), which uses kernels for both continuous and discrete variables. The resulting flexibility in defining the nature of conditional independence relationships in the BN could also address the remaining limitation of CGBNs that persists in HSPBNs: the constraint that continuous nodes cannot be parents of discrete nodes. The impact of this limitation cannot be overstated: it prevents CGBNs and HSPBNs from being used as causal models in the general case, because the direction of arcs connecting discrete and continuous variables is fixed and has nothing to do with the cause–effect relationships present in the data we learn such BNs from. In turn, the directions of adjacent arcs may not reflect cause–effect relationships either, because of the cascading effects of incorrect arc inclusions in structure learning.
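To illustrate why this matters, the toy check below shows how such a constraint operates during structure learning, under my own assumptions about the representation: any candidate arc from a continuous node into a discrete node is rejected outright, so the orientation of those arcs is decided by the parametric family rather than by the data.

```python
# A toy illustration, assuming node types are known in advance; the function
# name is hypothetical. Continuous -> discrete arcs are simply forbidden.
def arc_allowed(parent, child, node_type):
    """node_type maps each node to 'discrete' or 'continuous'."""
    return not (node_type[parent] == "continuous"
                and node_type[child] == "discrete")

node_type = {"A": "discrete", "X": "continuous"}
print(arc_allowed("A", "X", node_type))   # True: discrete -> continuous is fine
print(arc_allowed("X", "A", node_type))   # False: forced to orient as A -> X,
                                          # whatever the causal direction is
```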