First, I want to congratulate the authors on this highly interesting paper, which proposes a new class of hybrid Bayesian networks and shows that they can be learned from data (both the parameters and the structure) and that inference can be carried out by means of a Monte Carlo sampling procedure. The results are highly significant, not only for the new tools they provide for analyzing and extracting information from data, but also because they open up a new field of possibilities for improving the expressiveness of probabilistic graphical models.

Bayesian networks have experienced huge growth in the past decades, with many applications in different fields. There are three main reasons for this success:

  1. They not only encode quantitative information, but also qualitative information that can be interpreted in terms of conditional independence relationships between the problem variables (Pearl 1988) and, in some situations, as causal relationships (Pearl 2009).

  2. The models can be built with the help of experts providing conditional independence and/or causal relationships, but they can also be learned from data, both the qualitative part and the parameters (Neapolitan 2004).

  3. There are efficient algorithms, both exact and approximate, to compute conditional probabilities from these models, even in cases involving a very large number of variables (Koller and Friedman 2009). Exact algorithms are based on local computation techniques, which are very efficient if the directed acyclic graph of a Bayesian network is sparse.

Although the third reason is important, it has perhaps been given too much weight relative to reasons 1 and 2, to the point of becoming an obstacle to the development of more expressive models, in particular hybrid Bayesian networks containing both discrete and continuous variables. Thus, traditionally only conditional linear Gaussian models have been considered, and the more general models that have been proposed, such as MTE potentials (Rumí et al. 2006), were constrained by the premise that exact propagation algorithms must exist. This paper is important because it moves in the direction of freeing model construction from that requirement, so that the models can be more faithful to a given dataset. The authors show that very general nonparametric conditional distributions can be used and effectively estimated from data, and that it is also possible to learn which distribution, nonparametric or Gaussian, is the most appropriate for each variable. This is demonstrated in the extensive experimental part of the paper. The price is that no exact inference algorithms are available, but it is shown that approximate Monte Carlo methods can be applied. More concrete comments on the different parts of the paper follow.

1 The semiparametric model

It is a great idea to use a nonparametric model to estimate the conditional densities while retaining the capability of using conditional Gaussian distributions when they are more appropriate. In that sense, and as the authors write in the Conclusions section, other families of parametric densities can be incorporated into this framework. Also, the restriction that discrete variables have no continuous parents seems a bit artificial, given that approximate computation procedures can be applied. So, a new range of possibilities is open for the future.
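To make this per-variable choice concrete, here is a minimal sketch in plain Python/numpy. The class name `LinearGaussianCPD` and its `fit`/`logpdf` interface are my own illustration, not the authors' implementation; a kernel-based estimator exposing the same interface would be the drop-in nonparametric alternative.

```python
import numpy as np

class LinearGaussianCPD:
    """Conditional linear Gaussian: X | pa(X) ~ N(b0 + b^T pa(X), sigma^2)."""

    def fit(self, x, parents):                       # x: (n,), parents: (n, k)
        A = np.column_stack([np.ones(len(x)), parents])
        self.beta, *_ = np.linalg.lstsq(A, x, rcond=None)
        self.sigma2 = np.mean((x - A @ self.beta) ** 2)
        return self

    def logpdf(self, x, parents):
        mu = np.column_stack([np.ones(len(x)), parents]) @ self.beta
        return -0.5 * (np.log(2 * np.pi * self.sigma2) + (x - mu) ** 2 / self.sigma2)

# Each variable keeps whichever estimator describes it best, e.g.:
node_types = {"X1": "LinearGaussian", "X2": "CKDE", "X3": "LinearGaussian"}
```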

The authors propose a procedure to estimate conditional densities based on kernel density estimation. This method works and produces good results in the experiments. However, there are alternative procedures, especially for high-dimensional data (Wang and Scott 2019), that could be considered in future research. One of them is based on data partitioning, which we used for estimating MTE potentials in Rumí et al. (2006). In that case, we wanted exact computation, which posed some extra problems and limitations; but when approximate computation is accepted, partitioning the conditioning data space becomes an opportunity to extend these methods. For example, we only considered inequalities involving a single variable, but in this setting there would be no important difficulty in using inequalities involving a linear combination of the conditioning variables.
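As a sketch of the kind of extension I have in mind (the function, the split rule, and the data below are hypothetical, not taken from the paper or from Rumí et al. 2006), the conditioning space can be partitioned by an inequality on a linear combination of the conditioning variables, fitting a separate kernel estimate on each side:

```python
import numpy as np
from scipy.stats import gaussian_kde

def partitioned_density(y, Z, w, t):
    """Fit one kernel density estimate of y on each side of the linear
    split w^T z <= t of the conditioning space."""
    mask = Z @ w <= t
    kde_low, kde_high = gaussian_kde(y[mask]), gaussian_kde(y[~mask])

    def pdf(y_new, z_new):
        return (kde_low if z_new @ w <= t else kde_high)(y_new)
    return pdf

# Hypothetical usage: Y behaves differently on the two sides of the
# hyperplane z1 + 2*z2 = 0 in the conditioning space.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 2))
w = np.array([1.0, 2.0])
y = np.where(Z @ w <= 0, rng.normal(-1.0, 1.0, 500), rng.normal(2.0, 0.5, 500))
pdf = partitioned_density(y, Z, w, t=0.0)
print(pdf(0.5, np.array([-1.0, 0.3])))   # evaluated with the low-side estimate
```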

Another point that could be considered is the direct estimation of conditional densities. The paper is based on methods for estimating a joint density, from which each conditional density is obtained as a quotient: the full joint divided by the joint of the conditioning variables. There are, however, kernel procedures aiming at a direct estimation of the conditional density that could be considered in the future (Ambrogioni et al. 2017). In fact, in Expressions (21) and (22) of the paper, the quotient is transformed into a mixture of Gaussian kernels whose weights depend on the conditioning variables. In Ambrogioni et al. (2017), such weights are directly optimized using deep neural networks.
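To make the structure of that mixture explicit, the following is a minimal univariate sketch in plain Python/numpy; the fixed scalar bandwidths are chosen purely for illustration, whereas the paper estimates bandwidths from data and handles multivariate conditioning sets:

```python
import numpy as np

def conditional_kde(y_train, x_train, h_y, h_x):
    """f(y|x) as a quotient of two Gaussian kernel joint estimates, which
    simplifies to a mixture of kernels on y with x-dependent weights."""
    def pdf(y, x):
        # Kernel weight of each training point at the conditioning value x;
        # normalizing by the sum is the division by the denominator joint.
        w = np.exp(-0.5 * ((x - x_train) / h_x) ** 2)
        w /= w.sum()                                   # weights w_i(x) sum to one
        # Mixture of Gaussian kernels centred at the training values y_i.
        k = np.exp(-0.5 * ((y - y_train) / h_y) ** 2) / (h_y * np.sqrt(2 * np.pi))
        return np.sum(w * k)
    return pdf

rng = np.random.default_rng(0)
x_train = rng.normal(size=300)
y_train = np.sin(x_train) + 0.2 * rng.normal(size=300)
f = conditional_kde(y_train, x_train, h_y=0.2, h_x=0.3)
print(f(np.sin(1.0), 1.0))   # high density near the conditional mode
```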

2 Learning

From my point of view, the most remarkable contribution in the learning algorithms lies in the smart use of cross-validation for model selection, since classical scores are not appropriate for comparing alternative graphs when very heterogeneous estimation procedures, including kernel methods, are involved. This way of scoring has a double safeguard against overfitting: an extra validation set on top of the cross-validation performance. The alternative of using a different partition at each step of the algorithm has the problem that the same candidate model can receive different scores at different steps of the learning algorithm. My comment concerns how severe the overfitting problem actually is when only one partition is used, as there are no experiments showing it. Also, the possibility of using not one but a set of different partitions to avoid overfitting (summing the scores over the considered partitions) could be explored. A final question concerns a procedure based on the leave-one-out method. In many cases, this can be computationally expensive, but in other situations a closed form for the score can be obtained and efficiently computed.
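As a schematic of the kind of score involved (written against the hypothetical `fit`/`logpdf` interface sketched above, not the authors' code), a k-fold cross-validated log-likelihood for one candidate conditional model could be computed as follows:

```python
import numpy as np

def cv_score(make_model, x, parents, k=10, rng=None):
    """k-fold cross-validated log-likelihood of a candidate conditional
    model for x given its parents; higher is better."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.permutation(len(x))
    score = 0.0
    for test in np.array_split(idx, k):
        train = np.setdiff1d(idx, test)
        model = make_model().fit(x[train], parents[train])
        score += model.logpdf(x[test], parents[test]).sum()
    return score

# Hypothetical usage with the LinearGaussianCPD sketched earlier:
# score = cv_score(LinearGaussianCPD, x, parents)
```

Setting k equal to the sample size recovers leave-one-out; for unconditional kernel density estimates, that score has the well-known closed form f_(-i)(x_i) = (n f(x_i) - K_h(0)) / (n - 1), so only the full-sample estimate needs to be computed.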

3 Sampling

A sampling procedure is proposed, showing that inference is possible by means of Monte Carlo methods. This simulation method is simple and effective. However, it would be convenient to devise alternative procedures for the case of observed variables. I believe it would be very simple to transform it into a likelihood weighting algorithm. These methods can suffer, however, when the number of observed variables is large; in that case, Markov chain Monte Carlo methods could be more appropriate.
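To support the claim that this transformation is simple, here is a schematic likelihood weighting loop; the node interface (`parents`, `sample`, `logpdf`) is the hypothetical one used in the earlier sketches, not the authors' sampler:

```python
import numpy as np

def likelihood_weighting(nodes, order, evidence, n_samples=10_000):
    """Forward-sample unobserved nodes in topological order; observed nodes
    are clamped to their evidence value and contribute their conditional
    density to the sample weight."""
    samples, log_w = [], np.zeros(n_samples)
    for s in range(n_samples):
        state = {}
        for name in order:                        # topological order of the DAG
            node = nodes[name]
            pa = [state[p] for p in node.parents]
            if name in evidence:
                state[name] = evidence[name]      # clamp to the observation
                log_w[s] += node.logpdf(evidence[name], pa)
            else:
                state[name] = node.sample(pa)     # draw from f(x | pa(x))
        samples.append(state)
    w = np.exp(log_w - log_w.max())               # normalize in log space
    return samples, w / w.sum()
```

The weight degeneracy mentioned above shows up here directly: each observed variable multiplies the weight by a density value, so with many observations most weights collapse toward zero, which is what motivates the Markov chain Monte Carlo alternative.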