1 Introduction

Density estimation is the unsupervised task of learning an estimator of the joint probability distribution over a set of random variables (RVs) that generated the observed data. Once such an estimator is learned, it is used to perform inference, i.e., to compute the probability of queries about certain states of the RVs. Since a perfect estimate of the true distribution would allow one to solve many learning tasks exactly once they are reframed as different kinds of inference, density estimation qualifies as one of the most general tasks in machine learning [13].

The main challenge in density estimation is balancing the representation expressiveness of the learned model against the cost of learning it and performing inference with it. Probabilistic Graphical Models (PGMs), like Bayesian Networks (BNs) and Markov Networks (MNs), are able to model highly complex probability distributions. However, exact inference with them is generally intractable, i.e., not solvable in polynomial time, and even some approximate inference routines are intractable in practice [23]. With the aim of performing exact inference in polynomial time, a series of tractable probabilistic models (TPMs) has recently been proposed: either by restricting the expressiveness of PGMs by bounding their treewidth [24], e.g., tree distributions and their mixtures [18], or by exploiting local structures in a distribution [4]. It is worth noting that inference tractability is not a global property, but is associated with classes of queries. For instance, computing exact marginals on a TPM may be feasible, while MPE inference may not be [1]. TPMs like Arithmetic Circuits [6], Sum-Product Networks (SPNs) [19], and Cutset Networks (CNets) [21] promise a good compromise between expressive power and tractable inference by compiling high-treewidth distributions into compact and efficient data structures. Even though learning such TPMs can be done in polynomial time, thanks to several recent algorithmic schemes, making these algorithms scale to high dimensional data is still an issue. We focus on CNets since they (i) exactly and tractably compute several inference query types such as marginals, conditionals, and MPE [7], and (ii) promise faster learning times when compared to other TPMs.

CNets were introduced in [21] as weighted probabilistic model trees with tree-structured models as the leaves of an OR tree. They exploit context-specific independencies (CSIs) [2] by embedding Pearl’s conditioning algorithm. While the learning algorithm originally proposed in [21] follows a heuristic approach, it still requires quadratic time w.r.t. the number of RVs to select each inner node of the OR tree to condition on. A theoretically principled and more accurate version, presented in [9], overcomes many of the issues of the initial version, such as its tendency to overfit. However, in doing so, it increases the complexity of performing a single split to cubic time. We tackle the problem of scaling CNet learning to high dimensional data while preserving inference accuracy.

Here we introduce Extremely Randomized CNets (XCNets), CNets that can be learned in a simple, fast, and yet effective way by performing random conditioning to grow the OR tree. In this way, selecting a node to split on reduces to constant time w.r.t. the number of features. As we will see, while the likelihood of a single XCNet is not greater than that of an optimally learned CNet, ensembles of XCNets outperform state-of-the-art density estimators on a series of standard benchmark datasets, while requiring only a fraction of the time needed to learn the competitors. To further reduce the learning complexity, we investigate the use of a naive factorization as the leaf distribution in XCNets. As a result, we can build an extremely fast mixture of density estimators that is more accurate than several CNets and comparable to a BN exploiting CSI [3].

2 Background

Notation. Let RVs be denoted by upper-case letters, e.g., X, and their values by the corresponding lower-case letters, e.g., \(x \sim X\). We denote sets of RVs as \(\mathbf X\), and their combined values as \(\mathbf x\). For a set of RVs \(\mathbf X\), we denote by \(\mathbf X_{\setminus i}\) the set \(\mathbf X\) deprived of \(X_i\), and by \(\mathbf X_{|\mathbf Y}\) the restriction of \(\mathbf X\) to \(\mathbf Y\subseteq \mathbf {X}\) (the same applies to assignments \(\mathbf x\)). W.l.o.g., we assume the RVs we deal with in the following to be binary valued.

Density Estimation. Let \(\mathcal D = \{\xi ^j\}_{j=1}^m\) be a set of m n-dimensional samples drawn i.i.d. according to an unknown joint probability distribution \(\mathsf p(\mathbf X)\), with \(\mathbf {X}=\{X_{i}\}_{i=1}^{n}\). We refer to \(\xi ^{j}[X_i]\) as the value assumed by the sample \(\xi ^{j}\) for the RV \(X_i\). We are interested in learning a model \(\mathcal {M}\) from \(\mathcal {D}\) such that its estimate of the underlying distribution, denoted as \(\mathsf {p}_{\mathcal {M}}(\mathbf X)\), is as close as possible to the original one [13]. Generally, this closeness is measured via the log-likelihood function, or one of its variants, defined as \(\ell _{\mathcal {D}} (\mathcal {M}) =\sum _{j=1}^{m} \log \mathsf {p}_{\mathcal {M}}(\xi ^{j})\). In the next sub-sections we review the approaches to density estimation that serve as the building blocks of the XCNets we propose in Sect. 4.

2.1 Product of Bernoulli Distributions

The simplest representation assumption for \(\mathsf {p}(\mathbf {X})\) over RVs \(\mathbf X\) that allows tractable inference is to consider all RVs in \(\mathbf {X}\) to be independent: \(\mathsf {p}(\mathbf {x}) =\prod _{i=1}^{n}\mathsf {p}(x_{i})\). For binary RVs, this naive factorization leads to the product of Bernoulli distributions (PoBs) model, where building \(\mathsf {p}_{\mathcal {M}}\) amounts to estimating the parameters \(\theta _{i}^{0}=\mathsf {p}_{\mathcal {M}}(x_{i}^{0})\).

Proposition 1

(LearnPoB time complexity). Learning a PoB from \(\mathcal {D}\) over RVs \(\mathbf {X}\) has time complexity O(nm), where \(m=|\mathcal {D}|\) and \(n=|\mathbf {X}|\).

Proof

For each Bernoulli RV \(X_{i}\in \mathbf {X}\), estimating \(\theta _{i}\) requires a single pass over \(\{\xi ^{j}[X_{i}]\}_{j=1}^{m}\), hence taking O(m). Consequently, for all RVs in \(\mathbf {X}\), it takes O(mn).
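To make this estimation step concrete, the following is a minimal NumPy sketch of it; the (m, n) 0/1 data layout and the Laplace pseudo-count alpha are illustrative assumptions of ours, not details prescribed above.

```python
import numpy as np

def learn_pob(D, alpha=1.0):
    """Estimate a product of Bernoulli distributions from an (m, n) 0/1 matrix D.

    One pass over each column suffices, hence O(nm) overall (Proposition 1).
    alpha is an assumed Laplace smoothing pseudo-count.
    """
    m, _ = D.shape
    theta = (D.sum(axis=0) + alpha) / (m + 2.0 * alpha)  # theta[i] = p(X_i = 1)
    return theta

def pob_loglik(theta, D):
    """Log-likelihood of the samples in D under the fully factorized model."""
    return float(np.sum(D * np.log(theta) + (1 - D) * np.log(1 - theta)))
```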

Similarly to what Naive Bayes provides for classification, PoBs deliver a cheap and very fast baseline for tractable density estimation, even if the total independence assumption clearly does not hold on real data. Moreover, mixtures of PoBs, sometimes simply referred to as mixtures of Bernoulli distributions (MoBs), have proved to be an effective way to increase the representation expressiveness of PoBs [16]. However, while inference on MoBs is still tractable, learning them in a principled way requires running the EM algorithm for k iterations and r restarts, thus increasing the complexity up to O(rkmn) [16].

2.2 Probabilistic Tree Models

A directed tree-structured model [18] over \(\mathbf {X}\) is a BN in which each node \(X_{i}\in \mathbf {X}\) has at most one parent, \(\mathrm {Pa}_{X_i}\). It encodes a distribution that factorizes as \(\mathsf {p}(\mathbf {x}) = \prod _{i=1}^n\mathsf {p}(x_i|\mathrm {Pa}_{x_i})\), where \(\mathrm {Pa}_{x_i}\) denotes the projection of the assignment \(\mathbf x\) onto the parent of \(X_i\). By modeling such dependencies, tree-structured models can be more expressive than PoBs, while still allowing exact complete and marginal inference in O(n) [18]. To learn a model \(\mathcal {M}=\langle \mathcal {T}, \{\theta _{i|\mathrm {Pa}_{X_i}}\}_{i=1}^{n} \rangle \), one now has to estimate both a tree structure \(\mathcal {T}\) and the conditional probabilities \(\theta _{i|\mathrm {Pa}_{X_i}}=\mathsf p_{\mathcal M}(X_{i}|\mathrm {Pa}_{X_i})\). Growing a model that is optimal w.r.t. the KL-divergence can be done by employing the classical result of Chow and Liu [5]. We will refer to tree-structured models as Chow-Liu trees, or CLtrees, assuming the Chow-Liu algorithm (LearnCLTree) has been employed to learn them.

Proposition 2

(LearnCLTree time complexity [5]). Learning a CLtree from \(\mathcal {D}\) over RVs \(\mathbf {X}\) has time complexity \(O(n^2 (m + \log n))\), where \(m=|\mathcal {D}|\) and \(n=|\mathbf {X}|\).

Proof

The mutual information (MI) between all pairs of RVs in \(\mathbf {X}\) can be estimated from \(\mathcal D\) in \(O(mn^{2})\) steps. Building a maximum spanning tree on the weighted graph induced by the MI adjacency matrix takes \(O(n^2\log n)\). Lastly, arbitrarily rooting the tree, traversing it, and estimating the conditional probabilities \(\theta _{i|\mathrm {Pa}_{X_i}}\) can be done in O(n).
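As an illustration of these three steps, here is a rough NumPy/SciPy sketch of the structure part of LearnCLTree; the smoothing scheme and the use of scipy.sparse.csgraph for the spanning tree and the rooting traversal are choices made for this example, not details taken from [5].

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def cltree_structure(D, alpha=0.01):
    """Return parents[i], the parent of X_i in a Chow-Liu tree over the
    (m, n) 0/1 matrix D, with -1 for the arbitrarily chosen root X_0."""
    m, n = D.shape
    # smoothed marginal and pairwise statistics: O(m n^2)
    p1 = (D.sum(axis=0) + 2 * alpha) / (m + 4 * alpha)   # p(X_i = 1)
    p11 = (D.T @ D + alpha) / (m + 4 * alpha)            # p(X_i = 1, X_j = 1)

    MI = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            pi, pj, pij = p1[i], p1[j], p11[i, j]
            joint = np.array([pij, pi - pij, pj - pij, 1 - pi - pj + pij])
            indep = np.array([pi * pj, pi * (1 - pj),
                              (1 - pi) * pj, (1 - pi) * (1 - pj)])
            MI[i, j] = MI[j, i] = float(np.sum(joint * np.log(joint / indep)))

    # a maximum spanning tree over MI is a minimum spanning tree over -MI
    W = -(MI + 1e-12)
    np.fill_diagonal(W, 0.0)
    mst = minimum_spanning_tree(W).toarray()
    adj = ((mst != 0) | (mst.T != 0)).astype(float)

    # arbitrarily root the tree at X_0 and read off the parents via a BFS
    order, preds = breadth_first_order(adj, i_start=0, directed=False)
    parents = np.full(n, -1)
    parents[order[1:]] = preds[order[1:]]
    return parents
```

Estimating the conditional probability tables \(\theta _{i|\mathrm {Pa}_{X_i}}\) then only requires one more pass over the data along the returned parent vector.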

All in all, the complexity of learning a CLtree is quadratic in n. While this is a huge gain w.r.t. learning a higher-order dependency BN, it still poses a practical issue when LearnCLTree is applied as a routine in larger learning schemes and on datasets with thousands of features. Nevertheless, CLtrees have been employed as the core components of many tractable probabilistic models, ranging from mixtures of them [18] to SPNs [26] and CNets [8, 9, 21]. We specifically tackle the problem of scaling CNet learning in the following sections.

3 Cutset Networks

Cutset Networks are TPMs introduced in [21] as a hybrid of OR trees and CLtrees, with the latter as the leaves of the former. Here we generalize their definition to allow generic TPMs as leaf distributions. A CNet \(\mathcal C\) over a set of RVs \(\mathbf X\) is a probabilistic weighted model tree defined by a rooted OR tree \(\mathcal G\) and a set of TPMs \(\{\mathcal M_i\}_{i=1}^{L}\), in which each \(\mathcal M_i\) encodes a distribution \(\mathsf {p}_{\mathcal M_i}\) over a subset of \(\mathbf X\), called its scope and denoted as \(\mathsf {sc}(\mathcal M_i)\). The scope of a CNet \(\mathcal {C}\), \(\mathsf {sc}(\mathcal C)\), is the set of RVs appearing in it. A CNet may be defined recursively as follows.

Definition 1

(Cutset network). Given binary RVs \(\mathbf X\), a CNet is: (1) a TPM \(\mathcal {M}\), with \(\mathsf {sc}(\mathcal {M})=\mathbf X\); or (2) a weighted disjunction of two CNets \(\mathcal C_0\) and \(\mathcal C_1\) graphically represented as an OR node conditioned on RV \(X_i \in \mathbf X\), with associated weights \(w_i^0\) and \(w_i^1\) s.t. \(w_i^0 + w_i^1 = 1\), where \(\mathsf {sc}(\mathcal C_{0})=\mathsf {sc}(\mathcal C_{1})=\mathbf X_{\setminus i}\).

A CNet over binary RVs is shown in Fig. 1: each circled node is an OR tree node labeled by a variable \(X_i\). Each edge emanating from it is weighted by the probability \(w_i^0\), resp. \(w_i^1\), of conditioning \(X_i\) on the value 0, resp. 1. The distribution encoded by a CNet \(\mathcal C\) can be written as:

$$\begin{aligned} \mathsf {p}(\mathbf {x}) = \mathsf {p}_l(\mathbf x_{|\mathsf {sc}(\mathcal C)\setminus \mathsf {sc}(\mathcal M_l)})\mathsf {p}_{\mathcal M_l}(\mathbf x_{| \mathsf {sc}(\mathcal M_l)}), \end{aligned}$$
(1)

where \(\mathsf {p}_l(\mathbf x_{|\mathsf {sc}(\mathcal C)\setminus \mathsf {sc}(\mathcal M_l)})=\prod _i (w_i^0)^{1-x_i}(w_i^1)^{x_i}\) is a factor obtained by multiplying all the weights attached to the edges of the path in the OR tree that starts from the root of \(\mathcal C\) and reaches a unique leaf node l, while \(\mathsf {p}_{\mathcal M_l}(\mathbf x_{| \mathsf {sc}(\mathcal M_l)})\) is the distribution encoded by the reached leaf l. \(\mathsf {p}_{\mathcal M_l}\) can be interpreted as a conditional distribution \(\mathsf {p}(\mathbf x_{| \mathsf {sc}(\mathcal M_l)} | \mathbf x_{|\mathsf {sc}(\mathcal C)\setminus \mathsf {sc}(\mathcal M_l)})\).
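To make Eq. 1 concrete, here is a minimal Python sketch of this evaluation; the ORNode class and the leaf_logprob callback (standing for whatever leaf TPM is used, e.g. a CLtree or a PoB) are illustrative names of ours, not an API defined in the paper.

```python
import math

class ORNode:
    """Inner OR node of a CNet: it conditions on the binary RV with index `var`,
    with branch weights w0 + w1 = 1 and one sub-CNet per branch."""
    def __init__(self, var, w0, w1, child0, child1):
        self.var, self.w0, self.w1 = var, w0, w1
        self.children = (child0, child1)

def cnet_logprob(node, x, leaf_logprob):
    """log p(x) as in Eq. 1: follow the unique OR path selected by x, accumulate
    the log of the branch weights, then evaluate the reached leaf distribution
    (leaf_logprob is assumed to restrict x to the leaf scope)."""
    logp = 0.0
    while isinstance(node, ORNode):
        if x[node.var] == 0:
            logp += math.log(node.w0)
            node = node.children[0]
        else:
            logp += math.log(node.w1)
            node = node.children[1]
    return logp + leaf_logprob(node, x)
```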

Fig. 1. Example of a CNet over binary RVs. Inner (rounded) nodes on variables \(X_i\) are OR nodes, while leaf (squared) nodes represent CLtrees.

3.1 Learning CNets

Learning both the structure and the parameters of a CNet from data amounts to searching the space of all probabilistic weighted model trees. This would require exponential time: for a dataset \(\mathcal D\) over RVs \(\mathbf X\), learning a full binary OR tree of height k has time complexity \(O(n^k 2^k(n^2 (m + \log n)))=O(m2^kn^{k+2})\), with \(m=|\mathcal D|\) and \(n=|\mathbf X|\). In practice, this problem is tackled in a two-stage greedy fashion: (i) first performing a top-down search in the space of weighted OR trees, and then (ii) learning TPMs as leaf distributions on the conditioned subsets of the data. The first structure learning algorithm for CNets was introduced in [21]; it leverages a heuristic approach to induce the OR tree and relies on post-pruning to combat overfitting. A subsequent approach was introduced in [9], growing the OR tree through a principled Bayesian search maximizing the data likelihood. In the following, we introduce a general scheme to learn CNets, showing how, by properly choosing the splitting criterion used to grow the OR tree, one can recover both the algorithm from [21] and that from [9]. This, in turn, highlights how the time complexity of the splitting criterion determines that of learning the whole OR tree, and hence the whole CNet. In Sect. 4, we propose a variation of the splitting procedure that drastically reduces its cost.

Algorithm 1. LearnCNet: the general CNet structure learning scheme.

General Learning Scheme. Algorithm 1 reports a general approach to CNet structure learning. In particular, the procedure tries to select a variable \(X_i\) on the input data slice \(\mathcal D\) (line 4). If such a variable exists (line 5), it then recursively (line 8) tries to decompose the two new slices \(\mathcal D_0\) and \(\mathcal D_1\) over \(\mathbf X_{\setminus i}\). When the slice \(\mathcal D\) has too few instances, or is defined over too few variables, a leaf distribution is learned instead (line 10). Both algorithms reported in [9, 21] use CLtrees as leaf distributions, i.e., the \(\mathsf {learnDistribution}\) procedure on line 10 corresponds to calling the LearnCLTree algorithm.
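Since the pseudocode of Algorithm 1 is not reproduced here, the following Python sketch reconstructs its control flow from the description above, reusing the ORNode class from the inference sketch in Sect. 3; the stopping thresholds min_instances and min_features, the Laplace smoothing of the OR weights, and the data layout are illustrative assumptions of ours.

```python
def learn_cnet(D, scope, select, learn_distribution,
               min_instances=10, min_features=3):
    """Sketch of the general LearnCNet scheme.

    D: list of samples indexable by RV id; scope: list of RV ids still in play;
    select(D, scope) implements the splitting criterion (Algorithm 1, line 4);
    learn_distribution(D, scope) learns a leaf TPM (Algorithm 1, line 10).
    """
    # too few samples or too few RVs: learn a leaf distribution (line 10)
    if len(D) < min_instances or len(scope) < min_features:
        return learn_distribution(D, scope)

    var = select(D, scope)            # choose the RV to condition on (line 4)
    if var is None:                   # no acceptable split found (line 5)
        return learn_distribution(D, scope)

    # split the slice on var and estimate the (smoothed) OR weights
    D0 = [x for x in D if x[var] == 0]
    D1 = [x for x in D if x[var] == 1]
    w0 = (len(D0) + 1.0) / (len(D) + 2.0)
    rest = [v for v in scope if v != var]

    # recurse on the two conditioned slices over X deprived of X_var (line 8)
    return ORNode(var, w0, 1.0 - w0,
                  learn_cnet(D0, rest, select, learn_distribution,
                             min_instances, min_features),
                  learn_cnet(D1, rest, select, learn_distribution,
                             min_instances, min_features))
```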

By deriving the time complexity of both growing the OR tree and learning the leaf distributions, one can derive the overall time complexity of LearnCNet. In turn, the time complexity of growing the OR tree clearly depends on the cost of selecting the RV to split on at each step. Assuming that the variations of LearnCNet grow equally sized OR trees, the time complexity of each implementation of select determines the complexity of the whole OR tree growing phase. Concerning the leaf distributions, their complexity is determined by the cost of learning a single distribution, which in the case of CLtrees is \(O(n^{2}(m+\log n))\) (see Proposition 2). As a consequence, learning L leaves of a tree takes \(O(Ln^{2}(m+\log n))\) for all variations. In the following sections we revise and analyze the two variations of LearnCNet reported in [9, 21].

Proposition 3

Growing a full binary OR tree with LearnCNet on \(\mathcal D\) over RVs \(\mathbf X\) has time complexity \(O(k(S+m))\), where \(m = |\mathcal D|\), \(n=|\mathbf X|\), k is the height of the OR tree, and \(S=T(m, n)\), assumed to grow linearly w.r.t. m holding n constant, is the time required to compute the OR split node selection procedure on \(\mathcal D\) (select function in Algorithm 1, line 4).

Proof

A set \(\mathcal D^h_t \subset \mathcal D\) of samples falls into each internal node t at height h, such that \(\forall i \ne j: \mathcal D^h_i \cap \mathcal D^h_j = \emptyset \) and \(\cup _{i=1}^{2^h} \mathcal D^h_i = \mathcal D\). Furthermore, for each internal node t at height h, \(T(|\mathcal D_t^h|,n-h)\) is the time required to compute the OR split selection, and \(|\mathcal D_t^h|\) is the time required to split the samples in \(\mathcal D_t^h\). Assuming that T(m, n) grows linearly w.r.t. m holding n constant, for each height h the time complexity equals \(O( \sum _{i=1}^{2^h} (T(|\mathcal D_i^h|,n-h) + |\mathcal D_i^h| ))=O( T(|\mathcal D|,n-h)+m)\). Since the OR tree has height k, the overall time is \(O ( \sum _{i=0}^{k-1} ( T(|\mathcal D|,n-i) + m ) )=O(k(T(|\mathcal D|, n)+m))\).

Information Gain Splitting Heuristic. The algorithm to learn CNet structures proposed in [21], which we will here call entCNet, performs a greedy top-down search in the space of OR trees that can be reframed in Algorithm 1. It implements the select function as a procedure determining the RV \(X_{i}\) that maximizes a generative reformulation of the information gain from decision tree theory. Since computing the joint entropy over RVs \(\mathbf {X}_{\setminus i}\) would be unfeasible, it is heuristically approximated by the average of the marginal entropies.
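As a rough illustration of this heuristic (the precise gain formulation of [21] may differ in its details), the following sketch scores each candidate RV by the reduction in the average marginal entropy of the remaining RVs; the slice is assumed to be an (m, k) 0/1 matrix whose columns are aligned with the list scope.

```python
import numpy as np

def avg_marginal_entropy(D):
    """Average of the marginal (Bernoulli) entropies of the columns of D."""
    if D.shape[0] == 0 or D.shape[1] == 0:
        return 0.0
    p = np.clip(D.mean(axis=0), 1e-12, 1 - 1e-12)
    return float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def entropy_select(D, scope):
    """Return the RV in scope whose conditioning maximizes the drop in average
    marginal entropy of the other RVs, or None if no candidate reduces it:
    O(mn) per candidate, hence O(mn^2) overall, matching Proposition 4."""
    m, k = D.shape
    best_gain, best = 0.0, None
    for i in range(k):
        rest = np.delete(np.arange(k), i)
        D0 = D[D[:, i] == 0][:, rest]
        D1 = D[D[:, i] == 1][:, rest]
        cond = (len(D0) * avg_marginal_entropy(D0)
                + len(D1) * avg_marginal_entropy(D1)) / m
        gain = avg_marginal_entropy(D[:, rest]) - cond
        if gain > best_gain:
            best_gain, best = gain, scope[i]
    return best
```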

To cope with the systematic overfitting shown by CNets learned by entCNet, a post-pruning method on a validation set is also introduced in [21]. Borrowing this decision tree technique, the fully grown CNet is processed bottom-up: leaves are pruned and inner nodes left without children are replaced with a CLtree (which needs to be learned from data) whenever the validation data likelihood of the network after this operation is higher than that scored by the unpruned network.

Proposition 4

(select time complexity in entCNet [21]). The time complexity for selecting the best splitting node on a slice \(\mathcal D\) over RVs \(\mathbf X\) in entCNet is \(O(mn^2)\), where \(m = |\mathcal D|\) and \(n = |\mathbf X|\).

Corollary 1

Growing a full binary OR tree for entCNet when learning a CNet on \(\mathcal D\) over RVs \(\mathbf X\) has time complexity \(O(kmn^2)\), where \(m = |\mathcal D|\), \(n=|\mathbf X|\), and k is the height of the OR tree.

Proof

From Propositions 3 and 4, the overall time complexity to grow a full binary OR tree is \(O(k(mn^2 + m))= O(km(n^2+1))\).

dCSN: Likelihood-Guided Splitting. In [9], the authors proposed the dCSN algorithm, which takes a different approach from that in [21]: it avoids decision tree heuristics and chooses the best variable by directly maximizing the data log-likelihood. As already reported in [9], the log-likelihood function of a CNet can be decomposed as follows. Given a CNet \(\mathcal C\) learned on \(\mathcal {D}\) over \(\mathbf {X}\), its log-likelihood is \( \ell _{\mathcal {D}} (\mathcal {C} ) = \sum _{\xi \in \mathcal {D}} \sum _{i=1,\ldots ,n} \log \mathsf p(\xi [X_i]|\xi [\mathrm {Pa}_{X_i}])\) when \(\mathcal C\) corresponds to a CLtree, while, in the case of an OR tree rooted on the variable \(X_i\), the log-likelihood is:

$$\begin{aligned} \ell _{\mathcal {D}} (\mathcal {C} ) = \sum _{j=0,1} {m}_j \log w_i^j + \ell _{\mathcal D_j} ( \mathcal C_j ), \end{aligned}$$
(2)

where \(\mathcal C_j\) is the sub-CNet involved in the OR node, \(\mathcal D_j = \{ \xi \in \mathcal {D} : \xi [ X_i] = j\}\), \({m}_j = |\mathcal {D}_{j}|\), and \(\ell _{\mathcal D_j} (\mathcal C_j ) \) is the log-likelihood of \(\mathcal C_j\) on the slice \(\mathcal D_j\), for \(j=0,1\).

By exploiting this recursive nature of CNets, a CNet is grown top-down, allowing a further expansion, i.e., the substitution of a CLtree with an OR node, only if it improves the structure log-likelihood: it is easy to see that maximizing the second term in Eq. 2 results in maximizing the global score.

As reported in [9], one starts with a single CLtree learned from \(\mathcal D\) over \(\mathbf X\), and then checks whether there is a decomposition, i.e., an OR node on the best variable \(X_i\) applied to two CLtrees, providing a better log-likelihood than that scored by the initial tree. If such a decomposition exists, then the decomposition process is recursively applied to the sub-slices \(\mathcal D_0\) and \(\mathcal D_1\) over \(\mathbf X_{\setminus i}\), testing each leaf for a possible substitution.
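The following sketch captures this test for a single expansion step; learn_cltree and cltree_loglik are assumed helpers (e.g. along the lines of the CLtree sketch in Sect. 2.2), and current_loglik is the log-likelihood of the CLtree currently sitting at the leaf under consideration.

```python
import numpy as np

def dcsn_select(D, scope, current_loglik, learn_cltree, cltree_loglik):
    """Return the RV whose OR decomposition (Eq. 2) most improves on the
    log-likelihood of the current CLtree leaf, or None if no split helps.
    Learning two CLtrees per candidate yields the O(n^3 (m + log n)) cost
    of Proposition 5."""
    m, k = D.shape
    best_ll, best = current_loglik, None
    for i in range(k):
        rest = np.delete(np.arange(k), i)
        D0 = D[D[:, i] == 0][:, rest]
        D1 = D[D[:, i] == 1][:, rest]
        if len(D0) == 0 or len(D1) == 0:
            continue                  # degenerate split, skip it
        w0, w1 = len(D0) / m, len(D1) / m
        # Eq. 2: OR weight terms plus the LL of the two conditioned CLtrees
        ll = (len(D0) * np.log(w0) + len(D1) * np.log(w1)
              + cltree_loglik(learn_cltree(D0), D0)
              + cltree_loglik(learn_cltree(D1), D1))
        if ll > best_ll:
            best_ll, best = ll, scope[i]
    return best
```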

Proposition 5

(select time complexity in dCSN). The time complexity for selecting the best splitting node on a slice \(\mathcal D\) over RVs \(\mathbf X\) in dCSN is \(O(n^3(m + \log n))\), where \(m = |\mathcal D|\) and \(n = |\mathbf X|\).

Proof

For each variable \(X_i \in \mathbf X\), two CLtrees have to be computed, on \(\mathcal D_0\) and \(\mathcal D_1\), leading to a splitting complexity of \(O(n^2(m + \log n))\). Since n splits have to be checked, the overall complexity of selecting the best split is \(O(n^3(m + \log n))\).

Corollary 2

Growing a full binary OR tree on \(\mathcal D\) over RVs \(\mathbf X\) with dCSN has time complexity \(O(kmn^3)\), where \(m=|\mathcal D|\), \(n=|\mathbf X|\), and k is the height of the OR tree.

Proof

From Propositions 3 and 5, the overall time complexity to grow a full binary OR tree is \(O(k(mn^3 + m))=O(km(n^3+1))\).

3.2 Learning Ensembles of CNets

To mitigate issues like the limited accuracy of a single model and its tendency to overfit, CNets have been employed since [21] as the components of a mixture of the form \( \mathsf {p}(\mathbf X) = \sum _{i=1}^{c} \lambda _i \mathcal C_i(\mathbf X), \) where \(\lambda _i \ge 0: \sum _{i=1}^c \lambda _i = 1\) are the mixture coefficients. The first approach to learning such a mixture employs EM to alternately learn both the weights and the mixture components. With this approach, the time complexity of learning CNets grows by at least a factor of ct, where t is the number of EM iterations. All the classic issues about the convergence and instability of EM make this approach less practical than the following ones. A more efficient method to learn mixtures of CNets, presented in [9], adopts bagging as a cheap and yet more effective way to increase the time complexity by a factor of c only. For bagged CNets, the mixture coefficients are set to be equal and the mixture components can be learned independently on different bootstrapped data samples. An approach adding random subspace projection to bagged CNets learned with dCSN has been introduced in [8]. While its worst-case complexity is the same as for bagging, the reduction of the cost of growing the OR tree due to random sub-spacing is effective in practice. Mixtures of CNets have also been learned by exploiting the three boosting approaches proposed in [20], with time complexity equal to that of bagging or even worse.

4 Extremely Randomized CNets

XCNets (Extremely Randomized CNets) are CNets built by LearnCNet where the OR split node selection procedure (the select function in Algorithm 1, line 4) is simplified in the most straightforward way: selecting a RV uniformly at random. We denote this algorithmic variant of LearnCNet as XCNet. As a consequence, the cost of the new select function in XCNet no longer depends directly on the number of features n and can be considered constant.
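In code, the entire selection criterion reduces to a one-liner; plugged into the learn_cnet sketch of Sect. 3.1 (where the stopping conditions live), it turns that general scheme into the XCNet variant.

```python
import random

def random_select(D, scope):
    """XCNet splitting criterion: pick a RV uniformly at random from the current
    scope, in O(1) time and independently of the data slice D."""
    return random.choice(scope)
```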

Proposition 6

(select time complexity in XCNet). The time complexity for selecting the splitting node on a slice \(\mathcal D\) over \(\mathbf X\) in XCNet is O(1).

Proof

It is the time required to randomly choose a number in \(\{1,\ldots ,|\mathbf X|\}\).

Corollary 3

Growing a full binary OR tree on \(\mathcal D\) over \(\mathbf X\) with XCNet has time complexity O(km), where k is the height of the OR tree.

Proof

From Propositions 3 and 6, the overall time complexity to grow a full binary OR tree is \(O(k(1 + m))\).

While we introduce this variation with the obvious aim of speeding up the learning of a CNet's OR tree, we argue that XCNets should still provide accurate density estimators. We support this conjecture with the following motivations.

A CNet can be seen as a sort of mixture of experts in which the role of the gating function is delegated to the OR tree, the leaf distributions act as the local experts, and the gating function operates by selecting only one expert per input sample. Let \(g:\mathbf {X}\rightarrow \{\mathcal {M}_i\}_{i=1}^{L}\) be a gating function that associates each configuration \(\xi \sim \mathbf {X}\) with exactly one leaf model, \(\mathcal {M}_\xi \). For a CNet \(\mathcal {C}\), g can be built by associating to each \(\xi \) a path p in the OR tree structure \(\mathcal G\) of \(\mathcal {C}\). A path \(p=p_{(1)}p_{(2)}\cdots p_{(k)}\) of length k is grown as a sequence of observed values \(v_{1} v_{2} \cdots v_{k}\) in the same fashion as one performs inference according to Eq. 1: starting from the root of \(\mathcal {C}\), for each OR node i traversed, corresponding to RV \(X_{p(i)}\), the branch corresponding to the value \(v_{i} = \xi [X_{p(i)}]\) is followed. At the end of the path p, a leaf model \(\mathcal {M}_{p}=\mathcal {M}_{\xi }\) is reached. Alternatively, one can express g as a function of all possible combinations one can build over a set of observed RVs \(\mathbf {X}\): \(g(\mathbf {\xi })=\sum _{p\in \mathcal {G}}\prod _{i=1}^{|p|} \mathbbm {1}\{\xi [X_{p(i)}]=v_{i}\}\mathcal {M}_{p}\). From this construction of g, one can see that permuting the order of appearance of the RV values \(v_{i}\) does not change the value of g. In the same way, from the factorization in Eq. 1, it follows that neither does the joint probability mass associated with the configuration \(\xi \) change after such a permutation. This follows from the fact that the portion of the likelihood assigned to \(\xi \) that depends on the path p can be exactly recovered by choosing another sequence of conditionings, since different applications of the chain rule of probability still model the same joint distribution. This permutation invariance suggests that, given a way to associate a sample with a leaf distribution, the way in which conditionings are performed can be irrelevant. Clearly, while this is true for an already learned CNet, for algorithms inducing the OR tree in a top-down fashion the order in which conditionings are performed during learning obviously matters. Nevertheless, in practice, it may matter less than expected. From another perspective, building an OR tree, and hence g, amounts to performing a clustering of all possible sample configurations. For all LearnCNet variants, this clustering performs a trivial aggregation of samples based on their sharing the observed values of the conditioned RVs. This is one of the reasons why algorithms like entCNet are very prone to overfitting. For XCNets, however, the randomization introduced in this clustering phase behaves as a regularizer and helps to overcome the aforementioned issue. All in all, we argue that estimating good distributions at the leaves matters more than over-optimizing the gating function.

Moreover, an additional motivation for the introduction of XCNets comes from ensemble theory. Under the interpretation of CNets as mixtures of experts, the leaf distributions of a CNet act as an ensemble of density estimators. Employing a randomized selection criterion increases the diversification of the leaf distributions, and a strong diversification helps ensembles to generalize better [12]. To better understand this aspect, consider a run of entCNet in which the select function has chosen RV \(X_{i}\) instead of RV \(X_{j}\) to condition on because the former reduces the model entropy more than the latter. In both branches \(x_{i}^{0}\) and \(x_{i}^{1}\) of such a conditioning, it is likely that RV \(X_{j}\) would still be considered one of the top-ranked RVs to split on in the following iterations. By repeating this argument, it is likely that the leaf distributions appearing in the sub-trees generated by conditioning on \(x_{i}^{0}\) and \(x_{i}^{1}\) would have very similar scopes.

When constructing ensembles of CNets, we expect this diversification effect introduced by randomization to be even more prominent and effective. In ensemble methods like bagging, one employs bootstrapping as a source of randomness to diversify the ensemble components [12]. This is also the case for mixtures of CNets built by bagging (see Sect. 3.2). Differently from bagged CNets, ensembles of XCNets do not need an additional way to produce strongly different components. Therefore, when learning mixtures of XCNets, we aggregate the components by learning each of them independently on the full dataset.
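A minimal sketch of this aggregation scheme follows; learn_xcnet stands for any single-model learner (e.g. learn_cnet with random_select) and cnet_logprob is the evaluation routine sketched in Sect. 3, both reused here purely for illustration.

```python
import numpy as np

def learn_xcnet_mixture(D, scope, c, learn_xcnet):
    """Learn a c-component mixture of XCNets with uniform coefficients 1/c:
    each component is trained independently on the full dataset, without
    bootstrapping, since randomization alone diversifies the components."""
    return [learn_xcnet(D, scope) for _ in range(c)]

def mixture_logprob(components, x, cnet_logprob):
    """log p(x) = log((1/c) * sum_i p_{C_i}(x)), computed in log-space."""
    lps = np.array([cnet_logprob(comp, x) for comp in components])
    return float(np.logaddexp.reduce(lps) - np.log(len(components)))
```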

Lastly, we review Extremely Randomized Trees, or simply ExtraTrees [10], as they are similar in spirit and by name to XCNets. An ExtraTree is a decision tree learned by considering only a random subset of features for the introduction of a split node (as in random forests [12]) and by randomly selecting a threshold for the actual split. Among those randomly generated hyperplanes, the best one according to an optimization criterion is chosen. XCNets differ from ExtraTrees in several respects. First, they are density estimators, and therefore each OR node in them has to split over all the possible values the chosen RV is defined on; otherwise the modeled distribution would not be a valid probability density. Second, an OR node in an XCNet is selected totally at random, while for ExtraTrees the best of the random selection is actually employed. Lastly, an XCNet only slightly underperforms a corresponding non-random model, while a single ExtraTree is generally a weak learner whose “raison d'être” is to be a component in an ensemble [10].

It is tempting to further reduce the complexity of XCNet by substituting CLtrees with even simpler models. As stated in Proposition 1, learning PoBs reduces the leaf-learning complexity to linear in n. Clearly, we do not expect a CNet with PoBs as leaves to achieve a better likelihood than one with CLtrees. Nevertheless, we intend to measure how much the likelihood degrades with less expressive leaf distributions and, at the same time, how much faster this variant can be.

5 Experiments

The research questions we aim to validate are: (Q1) how much does extreme randomization affect the performance of an XCNet when compared to the optimal one learned with dCSN on real data? (Q2) how accurate are ensembles of XCNets and how do they compare against all other CNet ensembling techniques and state-of-the-art density estimators? (Q3) how scalable are XCNets and how much time do they actually save in practice?

Table 1. Datasets used and their statistics.

We answer all the above questions by performing our experiments on 20 de-facto standard benchmark datasets for density estimation. Introduced in [15] and [11], they are binarized versions of real data from different tasks like frequent itemset mining, recommendation, and classification. We adopt their classic splits for training, validation (hyperparameter selection), and testing. Detailed names and statistics are reported in Table 1. Additionally, for the qualitative experiments in Sect. 5.1 we employ the first 10000 training \(28\times 28\) pixel images of digits from MNIST, binarized as in [14].

5.1 (Q1) Single Model Performances

Likelihood Performances. Table 2 reports the results, as average test log-likelihoods on all benchmarks, for an entropy-based CNet (entCNet) as reported in [9], a CNet learned with dCSN, and an XCNet (XCNet). Furthermore, we learned a CNet (\(\mathsf {dCSN_{PoB}}\)) and an XCNet (\(\mathsf {XCNet_{PoB}}\)) with PoBs as leaf distributions. For the two XCNet variants, the results reported for each dataset are the average and standard deviation over 10 different runs. Clearly, the best scores are achieved by dCSN, with entCNet following soon after. Nevertheless, all the log-likelihoods achieved by XCNet are only slightly worse and always of the same order of magnitude as those of the non-random models, while the PoB variants perform considerably worse. In Fig. 2 we plot the training and test log-likelihoods achieved by the dCSN and XCNet models, both run with \(\delta = 100\), as nodes are added during learning. It is possible to note how, on those datasets, dCSN grows CNets that start overfitting much earlier, while the aleatory nature of XCNet slows the process down and mitigates the effect.

Table 2. Average test log-likelihoods for all models (for XCNet models, mean and standard deviation over 10 runs are reported).
Fig. 2. Negative log-likelihood during the learning of CNets and XCNets.

The worst performance is obtained on Ad, with XCNet scoring a relative decrease of \(14.46\%\) in log-likelihood w.r.t. dCSN, while the PoB variants degrade it by up to \(126.07\%\). These results are very encouraging, although not highly surprising given our interpretation of CNets as mixtures of experts. Moreover, this stresses the difference between XCNets with CLtrees and ExtraTrees [10], since a single extremely randomized tree performs much worse than a non-random tree, a behavior we can associate with XCNets with PoBs as leaf distributions.

Generating Samples. It is worth investigating how good an XCNet is at generating samples w.r.t. a CNet learned by dCSN. While the results from the previous section can give us a fairly confident estimate according to sample log-likelihoods, these values may not align with the human evaluation of sample quality [25]. For this reason, we perform a qualitative evaluation on samples drawn from XCNets and CNets learned on the first 10000 samples of a binarized version of MNIST with fixed parameters \(\delta =50\), \(\alpha =0.01\), and \(\sigma =4\).

We randomly sampled 25 digits from both models, comparing them to their nearest neighbors in the training set to ensure that the generated samples are not simple memorizations, as reported in Fig. 3. It is evident that both models have not memorized the training samples. Since it is not possible to visually spot very relevant differences between the two sample sets, we can confirm that close log-likelihoods correspond to qualitatively similar samples for XCNets and CNets.

Fig. 3. Samples obtained from a CNet (a), resp. an XCNet (c), learned on samples of the binarized MNIST dataset, and their nearest neighbors in the training set (b), resp. (d).

Table 3. Average test log-likelihoods for all ensembles and other competitors.

5.2 (Q2) Ensemble Performances

To investigate the performance of ensembles of XCNets, we build ensembles of 40 components to be comparable with the approaches reported in Sect. 3.2 and introduced in [9, 20, 21]. We report in the first half of Table 3 the best results for ensembles of bagged (\(\mathsf {CNet}^{40}\)) and boosted (\(\mathsf {CNet}^{40}_{\mathsf {boost}}\)) entropy-based CNets taken from [20]. Additionally, we learn an ensemble of 40 bagged CNets learned with dCSN as in [9] (\(\mathsf {dCSN}^{40}\)) with a grid search over \(\delta \in \{1000,2000\}\), \(\alpha \in \{0.1,0.2\}\) and \(\sigma =4\). Lastly, we train an ensemble of 40 XCNets (\(\mathsf {XCNet^{40}}\)) and another ensemble of 40 XCNets with PoBs as leaf distributions by running a grid search over \(\delta \in \{300,500,1000,2000\}\), \(\alpha \in \{0.1,0.2,0.5,1,2\}\) and \(\sigma =4\). For these two random models, Table 3 reports the average and standard deviation over 10 different runs. Note that we are not performing bagging for our XCNet ensembles, since we do not draw bootstrapped samples of the data. This is motivated by the intuition that randomization is a form of diversification in the ensemble by itself, and it has been confirmed by a preliminary experimentation.

Next, we compare CNet ensembles to other state-of-the-art TPMs employing much more sophisticated models, namely ID-SPN [22] and ACMN [17]. The former learns a complex hybrid architecture of SPNs and ACs, while the latter learns high-treewidth MNs represented as tractable ACs. Lastly, we employ the WinMine toolkit (WM) [3]. WM learns a treewidth-unbounded BN exploiting context-sensitive independencies by modeling its CPTs as trees. The results for these models are taken from [22]. The 40-component ensemble \(\mathsf {XCNet}^{40}\) already delivers log-likelihoods comparable to those of the aforementioned models on more than half of the datasets. Nevertheless, we investigate the effect of building a larger ensemble, up to 500 components (\(\mathsf {XCNet}^{500}\)), by running a grid search over \(\delta \in \{300,500,1000,2000\}\), \(\alpha = 0.1\) and \(\sigma =4\). On many datasets the log-likelihood scores of such an ensemble are the best achieved in the literature. Compared to \(\mathsf {XCNet}^{40}\), the results of \(\mathsf {XCNet}^{500}\) generally improve; however, on datasets like Nltcs and KDDCup2k the improvement saturates, suggesting that adding more components does not diversify the ensemble any further. It is worth noting that \(\mathsf {XCNet}_{\mathsf {PoB}}^{40}\) is competitive on half of the datasets against a far more complex model like \(\mathsf {WM}\), while outperforming it in terms of speed of learning and inference.

Table 4. Number of victories of the algorithms on the rows w.r.t. those on the columns.
Table 5. Times (in seconds) taken to learn the best models on each dataset for dCSN, XCNet, \(\mathsf {dCSN_{PoB}}\), \(\mathsf {XCNet_{PoB}}\), their ensembles, and ID-SPN with default parameters.

We summarize the comparisons among the algorithms in the first half of the table (resp. all algorithms) through rankings over the twenty datasets. For each dataset, we ranked the performance of the algorithms in the first half of the table (resp. all the algorithms) from 1 to 5 (resp. 9). The average rank of each algorithm is reported in the last two rows of Table 3, showing that a mixture of XCNets performs best. Finally, Table 4, reporting the number of victories of each algorithm w.r.t. the others, again shows the strong performance of mixtures of XCNets against the competitors, with 16.62 victories on average.

5.3 (Q3) Running Times

We derived the complexity of all considered variants of the CNet learning scheme, thus proving that XCNets are the ones that scale best w.r.t. the number of features. Nevertheless, we empirically analyze XCNet learning times since we want (i) to evaluate whether and how much learning the leaf distributions actually impacts on real data, and (ii) to compare the learning times of the density estimators employed in the previous sections. While a non-theoretical comparison may fall into the pitfalls of comparing different programming languages and optimization schemes, we provide it as a rule of thumb for practitioners deciding which off-the-shelf density estimator toolbox to use.

In Table 5 we report the time, in seconds, spent by each algorithm to learn the best model on each dataset. Even when increasing the number of components by one order of magnitude beyond what competitors are able to handle in a reasonable time, XCNets still learn competitive models (see Table 3) in less time than the competitors (see, for instance, the comparison w.r.t. ID-SPN).

6 Conclusions

We introduced XCNets, which simplify CNet learning through random conditioning. When learned in ensembles, XCNets achieve new state-of-the-art results for density estimation on several benchmark datasets. Due to their simplicity of implementation, fast learning times, and accurate inference performance, XCNets set the new baseline to compare against for density estimation with TPMs. As future work, we plan to exploit their mixture-of-experts interpretation to devise more expressive gating functions that still allow exact and fast inference.