1 Introduction

Conditional Density Estimation (CDE) is a fundamental statistical task. Given a domain $${\mathcal {X}}$$, a codomain $${\mathcal {Y}}$$, and a joint Probability Density Function (PDF) $$\rho (\cdot ,\cdot )$$ over $${\mathcal {X}}\times {\mathcal {Y}}$$, the CDE task is to estimate, for each $$x\in {\mathcal {X}}$$, the Conditional (probability) Density Function $$\rho (\cdot | x)$$. CDE estimators are inductively learned from a training set $$Z$$, which is a collection of $$n$$ pairs $$(x,y)\in {\mathcal {X}}\times {\mathcal {Y}}$$ drawn i.i.d. from the distribution arising from $$\rho (\cdot ,\cdot )$$.

Classical regression estimates only the conditional expectation $${\mathbb {E}}[y | x]$$, whereas CDE estimates the conditional distribution of y given x. Depending on the application, conditional density estimates can be used as-is, or their quantiles, moments, and other statistics can be computed, making CDE more flexible than regression. Regressors often assume homoskedasticity, while CDE methods handle heteroskedasticity, and can thus describe complicated phenomena like skew or multimodality. CDE can also be used when regression is meaningless, e.g., when conditional densities exhibit heavy-tailed power-law distributions with undefined expectation.

We focus our attention on interpretable CDE, where the task is to train an accurate CDE model such that both the model and its estimates are easy to understand by a human analyst. Interpretability is difficult to quantify, but in our context, having small representation size and low query complexity is a necessary condition, and a reasonable proxy for interpretability, as the analyst should be able to conceptualize or visualize the entire model, and mentally follow the process by which queries are answered. Decision trees and random forests naturally satisfy these requisites. It is also necessary that density estimates are interpretable, as understanding the model and query process is only beneficial if the analyst can also understand the actual predictions. Simple parametric distributions are interpretable at a glance, but large mixture models, non-parametric estimates, and complex graphical models, while computationally convenient, are largely uninterpretable.

Existing tree-based CDE techniques learn uninterpretable models, and often select splits that do not yield even local improvements to CDE accuracy. These techniques must store all training labels associated with each leaf in order to answer queries, yielding high storage costs and query time complexities. Probabilistic graphical models with tree structure address some of these issues, and bear some resemblance to decision trees, but inference on them is far more complicated, and the learned models are less interpretable to the human analyst.

Contributions We present CaDET, a CDE algorithm based on decision trees and random forests that overcomes the above limitations of existing tree-based CDE approaches, and produces interpretable parametric conditional density estimates.

• CaDET trees are standard decision trees that use parametric distributions stored at the leaves to answer conditional density queries. While parametric CDE methods are less expressive than non-parametric methods, they usually require less training data, better leverage domain knowledge, and are more interpretable, as they store fewer parameters and produce simpler estimates.

• CaDET trees use the empirical cross-entropy impurity criterion for tree growth, which directly incentivizes splits that lead to more accurate estimates than the criteria used by existing tree-based CDE techniques. We show that CaDET generalizes information-gain classification trees and mean-square-error-minimizing regression trees to a broad family of parametric CDE estimators.

• CaDET is the first tree-based CDE technique that can answer queries without requiring complex graphical model operations or iterating over stored training labels. Instead, each leaf stores a fixed-size sufficient statistic for the labels of training points mapped to it, which allows CaDET to perform Maximum Likelihood Estimation (MLE) as though it had access to the training labels.

• By selecting parametric families with appropriate support, CaDET can handle both univariate and multivariate CDE, as well as CDE on more exotic spaces, such as directional spaces, probability simplices, and mixed spaces, whereas non-parametric methods are generally restricted to particular domain types.

• Our experimental evaluation on real datasets shows that CaDET produces models that are generally more accurate, less prone to overfitting, and are more interpretable than existing tree-based CDE techniques.

Outline The paper is organized as follows. An introduction to decision trees and random forests is given in Sect. 2. We discuss related work in Sect. 3. CaDET is presented in Sect. 4, followed by extensions to the basic algorithm in Sect. 5. We present our experimental comparison of CaDET to existing tree-based CDE techniques in Sect. 6. Some conclusions complete the work in Sect. 7.

2 Decision trees and random forests

We now define the key concepts about decision trees and random forests. Our description of these data structures and the learning procedure is sufficiently general to encompass various learning tasks, including regression, classification, and CDE.

Decision trees As in the introduction, consider a domain $${\mathcal {X}}$$ and a codomain $${\mathcal {Y}}$$, and let $$Z$$ be the training set, which is a collection of $$n$$ pairs $$(x,y)\in {\mathcal {X}}\times {\mathcal {Y}}$$. A decision tree $$T$$ is a strict rooted binary tree such that:

1. each non-leaf node v stores a split rule $${\mathsf {s}}_v$$ that maps each element of $${\mathcal {X}}$$ to either the left or the right child of v, splitting $${\mathcal {X}}$$ into two. For any node u (leaves included), there is a subset of $${\mathcal {X}}$$ that is mapped to u. Any $$x\in {\mathcal {X}}$$ is mapped to all the nodes found by walking down the tree $$T$$ starting from the root and following the split rule at each encountered non-leaf node. For any node u, we denote with $${\mathsf {T}}(u;T,Z)$$ the subset of $$Z$$ that is mapped to u. For any $$x\in {\mathcal {X}}$$ we denote with $${\mathsf {tl}}(x;T)$$ the leaf of $$T$$ that x is mapped to;

2. each leaf $$\ell$$ stores some information $${\mathsf {L}}(\ell ;T)$$, a set of values whose role we describe below. $${\mathsf {L}}(\ell ;T)$$ is a function of $${\mathsf {T}}(\ell ;T,Z)$$, the elements of $$Z$$ that $$T$$ maps to $$\ell$$.

As an example, in standard regression trees with numeric features, a split rule $${\mathsf {s}}_v$$ at a non-leaf node v is a univariate threshold function, which is an indicator function for an inequality on the value of a single feature, such as “$$\text {age} \le 4$$.” Elements that satisfy the condition are mapped to the left child of v, the others to the right child. In the same scenario, the information $${\mathsf {L}}(\ell ;T)$$ stored at a leaf $$\ell$$ is the mean of the $${\mathcal {Y}}$$ components of $${\mathsf {T}}(\ell ;T,Z)$$.

We are usually interested in the leaf information or in the set of training points associated with the leaf containing some query point $$x \in {\mathcal {X}}$$, so we abuse notation, taking $${\mathsf {L}}(x;T)$$ to mean $${\mathsf {L}}\bigl ({\mathsf {tl}}(x;T);T\bigr )$$, and $${\mathsf {T}}(x;T,Z)$$ to mean $${\mathsf {T}}\bigl ({\mathsf {tl}}(x;T);T,Z\bigr )$$.
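To make the notation concrete, the following Python sketch (with hypothetical names `Node`, `tl`, and `q`, not taken from any library) implements a regression tree with univariate threshold splits, the leaf-mapping function $${\mathsf {tl}}(x;T)$$, and leaf information given by stored label means:

```python
# Minimal decision tree: internal nodes hold a split rule s_v (a univariate
# threshold), leaves hold the leaf information L(ell; T).
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, info=None):
        self.feature, self.threshold = feature, threshold  # split rule s_v
        self.left, self.right = left, right                # children (None at a leaf)
        self.info = info                                   # leaf information L(ell; T)

def tl(x, node):
    """Walk from the root to the leaf of T that x is mapped to."""
    while node.left is not None:  # non-leaf: follow the split rule
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node

def q(x, tree):
    """Regression query: return the leaf information L(x; T)."""
    return tl(x, tree).info

# Tree splitting on "age <= 4"; each leaf stores the mean of its training labels.
T = Node(feature=0, threshold=4,
         left=Node(info=1.5),    # mean label of points mapped to the left leaf
         right=Node(info=7.0))   # mean label of points mapped to the right leaf
```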

Query answering Decision trees and random forests are used to answer queries. For a decision tree $$T$$ (we discuss forests later) that makes predictions in some codomain $${\mathcal {U}}$$, queries are answered with the function $${\mathsf {q}}(\cdot ; T): {\mathcal {X}}\rightarrow {\mathcal {U}}$$, where for $$x\in {\mathcal {X}}$$, $${\mathsf {q}}(x; T)$$ is computed using the information $${\mathsf {L}}(x;T)$$ stored at the leaf to which $$T$$ maps x. In univariate regression, $${\mathcal {U}} = {\mathcal {Y}}= {\mathbb {R}}$$, and the query response is simply $${\mathsf {q}}(x; T) = {\mathsf {L}}(x; T)$$, but in general $${\mathcal {U}}$$ may be different from $${\mathcal {Y}}$$ (e.g., in probabilistic classification, $${\mathcal {Y}}$$ is a discrete set, and $${\mathcal {U}}$$ contains distributions over $${\mathcal {Y}}$$), and the leaf information may be used in various ways to respond to queries.

Impurity criterion The split rule in each non-leaf node is learned using the training set. Before describing the learning procedure, we introduce impurity criteria, which are functions $${\mathsf {m}}: {\mathcal {Y}}^{n} \rightarrow {\mathbb {R}}$$. For a set of training labels $$Y \in {\mathcal {Y}}^{n}$$, $${\mathsf {m}}(Y)$$ is the impurity value of Y, which is usually a proxy for the average loss that any constant prediction would incur over Y. We often abuse notation, taking $${\mathsf {m}}(Z)$$ for $$Z \in {({\mathcal {X}}\times {\mathcal {Y}})}^n$$ to simply ignore $${\mathcal {X}}$$, and compute the impurity over the $${\mathcal {Y}}$$ components of Z.

The Mean Square Error (MSE) impurity, used in regression trees (Breiman et al. 1984), is

\begin{aligned} {\mathsf {m}}_{\mathsf {mse}}(Y) = \frac{1}{|Y|} \sum _{y \in Y} {\bigl ( y - {\bar{Y}} \bigr )}^2,\ \text {where}\ {\bar{Y}} = \frac{1}{|Y|} \sum _{y \in Y} y . \end{aligned}
(1)

Taking $$\hat{{\mathbb {P}}}(i)$$ to be the sample frequency (in Y) of $$i \in {\mathcal {Y}}$$, the (discrete) entropy impurity, used in information-gain classification trees (Quinlan 1986), is

\begin{aligned} {\mathsf {m}}_{\mathsf {H}}(Y) = -\sum _{i \in {\mathcal {Y}}} \hat{{\mathbb {P}}}(i) \ln \bigl (\hat{{\mathbb {P}}}(i)\bigr ) . \end{aligned}
(2)
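Both impurities are straightforward to compute from a list of labels; the following sketch implements (1) and (2) directly (the function names are ours, for illustration):

```python
from collections import Counter
from math import log

def m_mse(Y):
    """MSE impurity (1): mean squared deviation from the sample mean of Y."""
    ybar = sum(Y) / len(Y)
    return sum((y - ybar) ** 2 for y in Y) / len(Y)

def m_entropy(Y):
    """Discrete entropy impurity (2), using sample label frequencies P_hat(i)."""
    n = len(Y)
    return -sum((c / n) * log(c / n) for c in Counter(Y).values())
```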

These impurities correspond to the square loss and cross entropy loss of regressors and probabilistic classifiers, respectively, though they may also be interpreted as measures of dispersion at the leaves of a decision tree. Under either interpretation, by selecting splits to minimize total leaf impurity, decision trees seek to explain as much variation in $${\mathcal {Y}}$$ through $${\mathcal {X}}$$ as possible.

Some tree-based CDE methods (Pospisil and Lee 2018) use the Mean Integrated Square Error (MISE) impurity, defined as

\begin{aligned} {\mathsf {m}}_{\mathsf {mise}}(Z') = \frac{1}{|Z'|} \sum _{(x,y) \in Z'} \int _{{\mathcal {Y}}} {\bigl ( {\hat{\rho }}_{Z'}(y') - \rho (y'|x) \bigr )}^2 \; \mathrm {d}y', \end{aligned}

where the impurity is computed over a set $$Z'$$ of training pairs (the $${\mathcal {X}}$$ components are needed, as the true conditional densities depend on x), $${\hat{\rho }}_{Z'}(\cdot )$$ is the estimated density computed using $$Z'$$, and $$\rho (\cdot |x)$$ is the true conditional density given x. While this impurity criterion incentivizes returning $$\rho (\cdot |x)$$ as the estimate, computing $${\mathsf {m}}_{\mathsf {mise}}(Z')$$ requires knowledge of the true conditional densities themselves, so Pospisil and Lee (2018) approximate them with a cosine or tensor basis non-parametric estimate. Since estimating $$\rho (\cdot |x)$$ is the goal of CDE, using the MISE impurity creates a cyclic dependency that is not easily resolved.

Learning procedure The learning procedure builds the tree starting from the root. It chooses a split rule $${\mathsf {s}}_v$$ for the current node v and creates its two children. To choose $${\mathsf {s}}_v$$, it finds the partitioning of $${\mathsf {T}}(v;T,Z)$$ into L and R that maximizes, over some family of partitionings (such as the univariate thresholds mentioned above for regression trees), the impurity reduction w.r.t. $${\mathsf {m}}(\cdot )$$, defined as

\begin{aligned} \bigl |{\mathsf {T}}(v;T,Z) \bigr |{\mathsf {m}}\bigl ({\mathsf {T}}(v;T,Z)\bigr ) - \bigl (|R |{\mathsf {m}}(R) + |L |{\mathsf {m}}(L)\bigr ) . \end{aligned}
(3)

The split rule $${\mathsf {s}}_v$$ stored at v is then chosen in such a way as to be consistent with the partitioning of $${\mathsf {T}}(v;T,Z)$$ into L and R. The procedure recursively splits each child until a stopping criterion is met. Example criteria include the depth of the node exceeding a user-specified threshold, the impurity reduction falling short of a user-specified threshold, or the family of partitionings for the node being empty.
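As an illustration, the following sketch searches univariate threshold splits of a one-feature sample and selects the one maximizing the impurity reduction (3) w.r.t. the MSE impurity (all names are ours, and the exhaustive search is for exposition only):

```python
def m_mse(Y):
    """MSE impurity (1)."""
    ybar = sum(Y) / len(Y)
    return sum((y - ybar) ** 2 for y in Y) / len(Y)

def impurity_reduction(parent, L, R, m):
    """Impurity reduction (3) of splitting `parent` into L and R w.r.t. criterion m."""
    return len(parent) * m(parent) - (len(L) * m(L) + len(R) * m(R))

def best_threshold_split(Z, m):
    """For a one-feature sample Z = [(x, y), ...], return the univariate
    threshold with the largest impurity reduction, and that reduction."""
    Y = [y for _, y in Z]
    best_t, best_red = None, float("-inf")
    for t in sorted({x for x, _ in Z})[:-1]:  # candidate thresholds
        L = [y for x, y in Z if x <= t]
        R = [y for x, y in Z if x > t]
        red = impurity_reduction(Y, L, R, m)
        if red > best_red:
            best_t, best_red = t, red
    return best_t, best_red
```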

Random forests A random forest $$F$$ (Breiman 2001) is a collection of trees $$T_1,\cdots ,T_t$$, where each tree is trained using the procedure described above on its own resampled training set. This bagging of the original training set $$Z$$ is done with the goal of increasing diversity and lowering variance. To further promote diversity within the forest, a random subset of the family of partitionings is searched at each node. Given a query point $$x\in {\mathcal {X}}$$, the leaf information $${\mathsf {L}}(x;T_j)$$ for each tree $$T_j \in F$$ is used to compute an ensemble response to queries. In the running example of regression trees, the query response is a simple average of tree predictions, namely

\begin{aligned} {\mathsf {q}}(x; F) = \frac{1}{t}\sum _{T\in F} {\mathsf {L}}(x; T) . \end{aligned}
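A minimal sketch of bagging and the forest-averaged query, with each "tree" reduced to its leaf-information lookup $${\mathsf {L}}(x; \cdot )$$ (hypothetical names, for illustration only):

```python
import random

def bagged_sample(Z, rng):
    """Resample the training set with replacement (bagging)."""
    return [rng.choice(Z) for _ in Z]

def q_forest(x, forest):
    """Forest regression query: the simple average of per-tree predictions."""
    return sum(leaf_info(x) for leaf_info in forest) / len(forest)

# Two stub "trees", each represented by its leaf-information lookup.
forest = [lambda x: 1.0, lambda x: 3.0]
```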

3 Related work

Rosenblatt (1969) first describes CDE with kernel CDE, which applies Kernel Density Estimation (KDE) to the CDE problem, by reporting $$\smash {{\hat{\rho }}}(y | x) = \smash {\frac{{\hat{\rho }}(y, x)}{{\hat{\rho }}(x)}}$$, for each $$\smash {{\hat{\rho }}}(\cdot )$$ estimate on the RHS made with KDE. Kernel CDE and many other nonparametric estimators require that the joint distribution is absolutely continuous, to ensure that densities over $${\mathcal {Y}}$$ and $${\mathcal {X}}\times {\mathcal {Y}}$$ exist. Generalized Linear Models (GLM) (Nelder and Wedderburn 1972) are CDE methods that essentially generalize linear regression beyond the fixed-variance Gaussian case. They do not require absolute continuity, although $${\mathcal {Y}}$$ must be continuous. Low-dimensional GLM are generally interpretable but inflexible, while generalizations like import vector machines (Zhu and Hastie 2002) are flexible but uninterpretable.
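For concreteness, a minimal sketch of Rosenblatt-style kernel CDE with product Gaussian kernels (the bandwidth h, the kernel choice, and all names here are our own assumptions, not taken from the original work):

```python
from math import exp, pi, sqrt

def gauss(u, h):
    """Gaussian kernel with bandwidth h."""
    return exp(-u * u / (2 * h * h)) / (h * sqrt(2 * pi))

def kernel_cde(y, x, Z, h=0.5):
    """Kernel CDE: rho_hat(y | x) = rho_hat(y, x) / rho_hat(x),
    with both KDE estimates built from the sample Z of (x, y) pairs."""
    joint = sum(gauss(x - xi, h) * gauss(y - yi, h) for xi, yi in Z)
    marginal = sum(gauss(x - xi, h) for xi, yi in Z)
    return joint / marginal
```

Note that the estimate is a proper density in y for each fixed x: the ratio reduces to a convex mixture of Gaussian kernels centered at the sample labels.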

By their inherently probabilistic nature, graphical models are well-suited for CDE. Cutset networks (Rahman et al. 2014; Di Mauro et al. 2017) are OR trees with tractable probabilistic models at their leaves, and mixed-sum product networks (Molina et al. 2018) are graphical models with tree structure for mixed data. Each bears some resemblance to decision trees, and admits more efficient induction and inference than general graphical models. However, they must be large enough to represent conditional density relationships between all variables, since they make no distinction between features and labels. Answering CDE queries on these models requires, despite their tree structure, a complicated global process of marginalization, conditioning, and related operations, often spanning the entire network. This procedure is less efficient and more recondite to the human analyst than standard decision tree queries, which occur locally along a single root-to-leaf path.

Decision trees are lauded for their simplicity, efficiency, and interpretability, but current tree-based CDE techniques lack these properties. Chaudhuri et al. (2002) propose the first tree-based Conditional Quantile Estimation (CQE) technique, and Meinshausen (2006) introduces the first forest-based CQE approach, Quantile Regression Forests (QRFs). QRFs minimize standard regression impurity criteria to select split rules, which essentially only consider the means of the target variable y in the subsets resulting from the split, rather than taking into account the entire sample distribution of the target variable. These impurity criteria are ill-suited for CDE, as they do not incentivize splits that improve CDE estimates (discussed further in Sect. 4.2). Pospisil and Lee (2018) introduce Random Forests for Conditional Density Estimation (RFCDE), which are largely equivalent to QRFs, except they use estimated MISE impurity (whose issues were discussed in Sect. 2), and output KDE (effectively kernel-smoothed quantile estimates). RFCDE and QRF queries operate on the training labels mapped to each leaf, which must be stored and processed explicitly, incurring high storage and query costs.

Hothorn and Zeileis (2017) propose the transformation forest (TF), which chooses split rules using null-hypothesis testing. It is not clear that conservatively chosen splits benefit forests, as ensemble methods thrive on diverse weak learners. TFs fit distributions using transformation families: given a fixed univariate PDF $$\psi$$ they pick an invertible transformation function$$\phi : {\mathbb {R}}\rightarrow {\mathbb {R}}$$, producing the density estimate $$(\psi \circ \phi )(y) = |\smash {\frac{\mathrm {d}}{\mathrm {d}y}}\phi (y)|^{-1}\psi \bigl (\phi (y)\bigr )$$. The learned $$\phi$$ can be complicated, yielding uninterpretable models even for simple $$\psi$$, and TFs must also store and process raw labels to answer forest queries (see also Sect. 5).

CaDET overcomes these limitations with a parametric approach, learning interpretable trees that make parametric density estimates. It uses the empirical cross-entropy impurity criterion, which incentivizes effective splits for CDE. CaDET attains low storage and query costs by storing sufficient statistics of the training labels associated with each leaf, requiring bounded memory and computation. CaDET estimates parametric densities within a user-selected family, which are generally more interpretable, and learning them requires fewer samples than nonparametric estimates. Finally, as CaDET makes no assumptions on the underlying probability space, it can be instantiated directly on arbitrary probability spaces (including multivariate, mixed, and other exotic cases).

4 CADET: interpretable parametric CDE with trees and forests

CaDET is a specific instantiation of the decision tree and random forest models (Sect. 2). It makes heavy use of sufficient statistics, so we first discuss this concept.

4.1 Sufficient statistics

Let $${\mathcal {F}}$$ be a parametric family of PDFs over $${\mathcal {Y}}$$, with parameter space $$\varTheta$$, and take $$\theta \in \varTheta$$. The member of $${\mathcal {F}}$$ identified by $$\theta$$ is denoted as $$\rho (\cdot ; {\mathcal {F}}, \theta )$$. We omit $${\mathcal {F}}$$ from this and other notation when clear from context.

Let $$Y \in {\mathcal {Y}}^{n}$$ for some sample size n, sampled i.i.d. from the distribution arising from some unknown $$\rho (\cdot ; \theta )\in {\mathcal {F}}$$. A sufficient statistic for $$\varTheta$$ (alternatively referred to as a sufficient statistic for $${\mathcal {F}}$$) is a vector-valued function $${\mathsf {w}}^{(n)}: {\mathcal {Y}}^{n} \rightarrow {\mathbb {R}}^{\dim ({\mathsf {w}})}$$ (where $$\dim ({\mathsf {w}})$$ is the codomain dimension of $${\mathsf {w}}(\cdot )$$) such that $${\mathsf {w}}^{(n)}(Y)$$ is as informative as Y for the purpose of estimating the unknown $$\theta$$ that determines the unknown PDF $$\rho (\cdot ; \theta )$$ (Casella and Berger 2002, Sect. 6.2). For example,

\begin{aligned} {\mathsf {w}}^{(n)}(Y)= \Biggl ( \ \sum _{y \in Y} y, \sum _{y \in Y} y^2 \Biggr ) \end{aligned}

is a sufficient statistic for the Gaussian family, with MLE mean and variance $${\hat{\mu }} = \frac{{\mathsf {w}}^{\smash {(n)}}_1(Y)}{n}$$ and $${\hat{\sigma }}^2 = \frac{{\mathsf {w}}^{\smash {(n)}}_2(Y)}{n} - {\hat{\mu }}^2$$. A sufficient statistic for the Pareto family is

\begin{aligned} {\mathsf {w}}^{(n)}(Y)= \Biggl ( \min (Y), \prod _{y \in Y} y \Biggr ) . \end{aligned}
(4)
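Both statistics are cheap to compute and to invert into MLE parameters; the following sketch covers the Gaussian case (using the closed-form MLE mean and variance given above) and the Pareto statistic (4) (function names are ours):

```python
from math import prod

def w_gauss(Y):
    """Gaussian sufficient statistic: (sum of y, sum of y^2)."""
    return (sum(Y), sum(y * y for y in Y))

def p_gauss(w, n):
    """MLE mean and variance recovered from the sufficient statistic alone:
    mu_hat = w_1 / n, sigma2_hat = w_2 / n - mu_hat^2."""
    mu = w[0] / n
    return mu, w[1] / n - mu * mu

def w_pareto(Y):
    """Pareto sufficient statistic (4): (min of Y, product of Y)."""
    return (min(Y), prod(Y))
```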

The Fisher-Neyman factorization theorem (Halmos et al. 1949) shows that for any PDF $$\rho (\cdot ;\theta ): {\mathcal {Y}}\rightarrow {\mathbb {R}}_{0+}$$ from a family $${\mathcal {F}}$$ with sufficient statistic $${\mathsf {w}}^{(1)}(\cdot ): {\mathcal {Y}}\rightarrow {\mathbb {R}}^{\dim ({\mathsf {w}})}$$, there exists a base measure$${\mathsf {h}}(\cdot ): {\mathcal {Y}}\rightarrow {\mathbb {R}}_{0+}$$ and a factorization function$$\mathrm {F}(\cdot ; \cdot ): {\mathbb {R}}^{\dim ({\mathsf {w}})} \times \varTheta \rightarrow {\mathbb {R}}_{0+}$$ such that

\begin{aligned} \rho (\cdot ;\theta ) = {\mathsf {h}}(\cdot )\mathrm {F}\bigl ({\mathsf {w}}^{(1)}(\cdot ); \theta \bigr ) . \end{aligned}
(5)

We now define $${\mathsf {p}}^{(n)}(\cdot ): {\mathbb {R}}^{\dim ({\mathsf {w}})} \rightarrow \varTheta$$ to be the function that selects $$\theta \in \varTheta$$ to maximize the likelihood of an i.i.d. sample $$Y \in {\mathcal {Y}}^{n}$$ given $${\mathsf {w}}^{(n)}(Y)$$:

\begin{aligned} {\mathsf {p}}^{(n)}\Bigl ({\mathsf {w}}^{(n)}(Y)\Bigr ) = \arg \max _{\theta \in \varTheta } \prod _{y \in Y} \rho (y; \theta ) = \arg \max _{\theta \in \varTheta } \sum _{y \in Y} \ln \bigl (\mathrm {F}({\mathsf {w}}^{(1)}(y); \theta )\bigr ), \end{aligned}
(6)

where the rightmost equality follows from (5). We omit the sample-size superscript from both $${\mathsf {w}}(\cdot )$$ and $${\mathsf {p}}(\cdot )$$ when clear from context, and further abuse notation when discussing trees, letting $${\mathsf {w}}^{(n)}(Z)$$ ignore the $${\mathcal {X}}$$ elements of a sample $$Z \in {({\mathcal {X}}\times {\mathcal {Y}})}^{n}$$.

Exponential class A natural sufficient statistic is a sufficient statistic for $${\mathcal {F}}$$ such that, for i.i.d. samples $$Y \in {\mathcal {Y}}^{n}$$, $$Y' \in {\mathcal {Y}}^{n'}$$, and their concatenation $$Y \uplus Y'$$, it holds (Casella and Berger 2002, Thm. 6.2.10) that

\begin{aligned} {\mathsf {w}}^{(n+n')}(Y \uplus Y') = {\mathsf {w}}^{(n)}(Y) + {\mathsf {w}}^{(n')}(Y') . \end{aligned}
(7)
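Property (7) is easy to check numerically; a sketch for the Gaussian natural statistic $$(\sum y, \sum y^2)$$ (with our own function names):

```python
def w_gauss(Y):
    """Natural sufficient statistic for the Gaussian family: (sum y, sum y^2)."""
    return (sum(Y), sum(y * y for y in Y))

def combine(w1, w2):
    """Property (7): the statistic of a concatenated sample is the
    component-wise sum of the per-sample statistics, so two samples can be
    merged without ever touching the raw labels."""
    return tuple(a + b for a, b in zip(w1, w2))
```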

A distribution family $${\mathcal {F}}$$ with parameter space $$\varTheta$$ and support $${\mathcal {Y}}$$ is said to be in the exponential class if it admits a factorization into a natural sufficient statistic $${\mathsf {w}}^{(1)}(\cdot ): {\mathcal {Y}}\rightarrow {\mathbb {R}}^{\dim ({\mathsf {w}})}$$, base measure $${\mathsf {h}}(\cdot ): {\mathcal {Y}}\rightarrow \smash {{\mathbb {R}}_+}$$, parameter function $$\eta (\cdot ): \smash {\varTheta \rightarrow {\mathbb {R}}^{\dim ({\mathsf {w}})}}$$, and log-partition function $${\mathsf {A}}(\cdot ): \smash {\varTheta \rightarrow {\mathbb {R}}}$$, such that any PDF $$\rho (\cdot ; \theta ) \in {\mathcal {F}}$$ can be written as

\begin{aligned} \rho (\cdot ; \theta ) = {\mathsf {h}}(\cdot )\exp \bigl (\eta (\theta ) \cdot {\mathsf {w}}^{(1)}(\cdot ) - {\mathsf {A}}(\theta )\bigr ) . \end{aligned}
(8)

The exponential class contains many well-known (thus interpretable to a human analyst) distribution families, including the Gaussian, exponential, gamma, beta, Dirichlet, geometric, and Poisson families. Sufficient statistics and combination functions like (7) are key to the performance guarantees of CaDET, so naturally one might wonder under which conditions they exist. The Pitman-Koopman-Darmois theorem (Koopman 1936) shows that if a family $${\mathcal {F}}$$ has fixed support and a bounded-dimensional sufficient statistic, then $${\mathcal {F}}$$ is in the exponential class.

Among variable-support families with a bounded-dimensional sufficient statistic $${\mathsf {w}}(\cdot )$$, some admit a combination function $${\mathsf {g}}(\cdot , \cdot )$$ such that

\begin{aligned} {\mathsf {w}}^{(n+ n')}(Y \uplus Y') = {\mathsf {g}}\bigl ({\mathsf {w}}^{(n)}(Y), {\mathsf {w}}^{(n')}(Y')\bigr ), \end{aligned}
(9)

which generalizes (7) beyond the exponential class. The Pareto and uniform interval families admit such $${\mathsf {g}}(\cdot , \cdot )$$; the reader is invited to derive one for the Pareto family, starting from the sufficient statistic in (4). CaDET estimates conditional densities by storing sufficient statistics at each leaf of the decision tree, which, through (6), are isomorphic to MLE distribution estimates.

4.2 Decision trees for interpretable parametric CDE

CaDET is an instantiation of the decision tree model described in Sect. 2. It is parameterized by a parametric family $${\mathcal {F}}$$, which determines the class of densities that a CaDET tree or forest can predict. Bounded-dimensional sufficient statistics and combination functions are needed to efficiently train CaDET trees and to aggregate tree information into forest queries, so here we assume these exist for $${\mathcal {F}}$$. Their nonexistence does not impact the theory behind CaDET, thus with minor changes, CaDET may be instantiated for parametric families lacking bounded-dimensional sufficient statistics or combination functions, although in this case, training time, forest memory, and forest query time costs may be higher.

Impurity criterion Let $${\mathcal {F}}$$ be a parametric family of PDFs, with bounded-dimensional sufficient statistic $${\mathsf {w}}$$ and parameter space $$\varTheta$$. CaDET minimizes the Empirical Cross Entropy (ECE) impurity, defined as

\begin{aligned} {\mathsf {m}}_{\mathsf {ece}}(Y; {\mathcal {F}}) = -\frac{1}{|Y|}\sum _{\smash {y \in Y}} \! \ln \bigl ( \rho (y; {\mathsf {p}}({\mathsf {w}}(Y))) \bigr ) . \end{aligned}
(10)

The ECE impurity is parametric in the sense that it depends on the hyperparameter $${\mathcal {F}}$$ (omitted when clear from context). This dependence is key, as it allows $${\mathsf {m}}_{\mathsf {ece}}$$ to incentivize splits that lead to the data being well-fit by $${\mathcal {F}}$$. The ECE impurity should be contrasted with the MSE loss from (1), which Hothorn and Zeileis (2017) argue is ineffective for CDE, as it is not sensitive to changes over $${\mathcal {X}}$$ of the conditional distribution of $${\mathcal {Y}}$$, but only to changes of the conditional expectation of $${\mathcal {Y}}$$.
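As an illustration, the following sketch computes the ECE impurity (10) when $${\mathcal {F}}$$ is the univariate Gaussian family, for which the MLE fit has closed form (the function name is ours):

```python
from math import log, pi

def m_ece_gauss(Y):
    """ECE impurity (10) with F = Gaussian: the average negative log-likelihood
    of Y under the MLE Gaussian fit to Y itself."""
    n = len(Y)
    mu = sum(Y) / n
    var = sum((y - mu) ** 2 for y in Y) / n  # MLE mean and variance
    return sum(0.5 * log(2 * pi * var) + (y - mu) ** 2 / (2 * var) for y in Y) / n
```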

The ECE is the impurity-criterion counterpart of the cross entropy loss, often used in neural networks (Goodfellow et al. 2016, Ch. 5.5) and binomial regression models (Weisberg 2005, Ch. 12). Cross entropy loss is theoretically motivated, both from decision-theoretic and coding-theoretic perspectives. In decision theory, a strictly proper scoring rule is a loss function that is uniquely minimized by predicting the true density. The cross entropy (often called the logarithmic scoring rule), is the only such rule (up to affine transformation) that is also local, meaning that given label y and estimated distribution $${\hat{\rho }}(\cdot )$$, it may be computed as a function of $${\hat{\rho }}(y)$$ (Shuford et al. 1966). From a coding theory perspective, cross entropy is a measure of the degree of inefficiency of using one distribution to encode symbols from another. The source coding theorem (Shannon 1948) shows that maximal efficiency is attained when the encoding distribution matches the true distribution.

The entropy of a PDF $$\psi$$ with support $${\mathcal {Y}}$$ is

\begin{aligned} {\mathsf {H}}(\psi ) = -\int _{\mathcal {Y}}\psi (y)\ln \bigl (\psi (y)\bigr ) \; \mathrm {d}y . \end{aligned}

We now show that ECE impurity and the entropy of the MLE distribution often coincide in the exponential class.

Lemma 1

Suppose $$Y \in {\mathcal {Y}}^{n}$$, and $${\mathcal {F}}$$ a member of the exponential class, with base measure $${\mathsf {h}}(\cdot )$$ and sufficient statistic $${\mathsf {w}}(\cdot )$$. Let $$\theta = {\mathsf {p}}^{(n)}\bigl ({\mathsf {w}}^{(n)}(Y)\bigr )$$, $$\smash {{\hat{B}}} = \smash {\frac{1}{n}} \sum _{y \in Y} \ln \bigl ({\mathsf {h}}(y)\bigr )$$, $$y'$$ drawn with density $$\rho (\cdot ; \theta )$$, and $$B = {\mathbb {E}}_{y'}\bigl [\ln \bigl ({\mathsf {h}}(y')\bigr )\bigr ]$$. Then

1. if $$\ln \bigl ({\mathsf {h}}(\cdot )\bigr )$$ is an affine function of $${\mathsf {w}}^{(1)}(\cdot )$$, then $${\mathsf {m}}_{\mathsf {ece}}(Y) = {\mathsf {H}}\bigl (\rho (\cdot ; \theta )\bigr )$$; and

2. in general, $${\mathsf {m}}_{\mathsf {ece}}(Y) = (B - {\hat{B}}) + {\mathsf {H}}\bigl (\rho (\cdot ; \theta )\bigr )$$.

Proof

We first show Case 2, from which Case 1 follows. Expanding the exponential-class form (8) of $$\rho (\cdot ; \theta )$$ in the definition (10) of the ECE impurity,

\begin{aligned} {\mathsf {m}}_{\mathsf {ece}}(Y)&= -\frac{1}{n}\sum _{y \in Y} \ln \bigl (\rho (y; \theta )\bigr ) = -{\hat{B}} - \eta (\theta ) \cdot \frac{1}{n}{\mathsf {w}}^{(n)}(Y) + {\mathsf {A}}(\theta ) \\&= -{\hat{B}} - \eta (\theta ) \cdot {\mathbb {E}}_{y'}\bigl [{\mathsf {w}}^{(1)}(y')\bigr ] + {\mathsf {A}}(\theta ) = (B - {\hat{B}}) - {\mathbb {E}}_{y'}\bigl [\ln \bigl (\rho (y'; \theta )\bigr )\bigr ] = (B - {\hat{B}}) + {\mathsf {H}}\bigl (\rho (\cdot ; \theta )\bigr ), \end{aligned}

where the second line applies the Maximum Likelihood step $$\frac{1}{n}{\mathsf {w}}^{(n)}(Y) = {\mathbb {E}}_{y'}\bigl [{\mathsf {w}}^{(1)}(y')\bigr ]$$.

The Maximum Likelihood step holds since in MLE, sample sufficient statistics are always preserved in the fitted distribution (this property is evident in the maximum entropy interpretation of MLE, where it holds by definition).

The additional hypothesis in Case 1 implies the existence of $$\beta \in {\mathbb {R}}, \alpha \in {\mathbb {R}}^{\dim ({\mathsf {w}})}$$ such that $$\smash {\ln \bigl ({\mathsf {h}}(\cdot )\bigr ) = \beta + \alpha \cdot {\mathsf {w}}^{(1)}(\cdot )}$$. It then holds that

\begin{aligned} {\hat{B}} = \frac{1}{n}\sum _{\smash {y \in Y}} \ln \bigl ({\mathsf {h}}(y)\bigr ) = \beta + \alpha \cdot \frac{1}{n}{\mathsf {w}}^{(n)}(Y) = \beta + \alpha \cdot {\mathbb {E}}_{y'}\bigl [{\mathsf {w}}^{(1)}(y')\bigr ] = B, \end{aligned}

and via Case 2, noting that here $$B - {\hat{B}} = 0$$, we obtain Case 1. $$\square$$

Case 1 of Lemma 1 applies to many families of interest, such as the Gaussian, gamma, and Von-Mises families, where $${\mathsf {h}}(\cdot )$$ is constant, and the beta, Dirichlet, and log-Gaussian families, where $$\ln \bigl ({\mathsf {h}}(\cdot )\bigr )$$ is an affine function of $${\mathsf {w}}^{(1)}(\cdot )$$. When Case 1 holds, the splits chosen by CaDET are the same that would be chosen by minimizing entropy, as done in information gain trees. These trees select splits that explain as much variation in $${\mathcal {Y}}$$ as possible, leading to more homogeneous leaves to which more accurate distributions can be fit. When the ECE and entropy do not coincide, an argument can be made for using either as an impurity criterion, and CaDET can be adapted to instead select entropy-minimizing splits if so desired.

A more practical consequence of Lemma 1 is that the impurity reduction (see (3)) w.r.t. $${\mathsf {m}}_{{\mathsf {ece}}}(\cdot )$$ of any split at any node with training labels Y can be computed from $${\mathsf {w}}(Y)$$ without having to iterate over Y or knowing $${\hat{B}}$$. Furthermore, $${\mathsf {m}}_{{\mathsf {ece}}}(Y)$$ can be computed from $${\mathsf {H}}\bigl (\rho (\cdot ; \theta )\bigr )$$ even in Case 2, if $$\smash {{\hat{B}}}$$ (the sum of log base measures) is computed along with $${\mathsf {w}}(Y)$$. Similar results can often be derived for $${\mathcal {F}}$$ not in the exponential class; the reader is invited to confirm that for the uniform interval distribution (over $${\mathbb {R}}$$ or $${\mathbb {Z}}$$), it holds that $${\mathsf {m}}_{{\mathsf {ece}}}(Y) = {\mathsf {H}}\bigl (\rho (\cdot ; \theta )\bigr )$$.
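Case 1 of Lemma 1 can be verified numerically for the Gaussian family (where $${\mathsf {h}}(\cdot )$$ is constant), whose MLE entropy is $$\frac{1}{2}\ln (2\pi e{\hat{\sigma }}^2)$$. A sketch, with our own function names:

```python
from math import e, log, pi

def gaussian_ece(Y):
    """ECE impurity (10) under the MLE Gaussian fit to Y."""
    n = len(Y)
    mu = sum(Y) / n
    var = sum((y - mu) ** 2 for y in Y) / n
    return sum(0.5 * log(2 * pi * var) + (y - mu) ** 2 / (2 * var) for y in Y) / n

def gaussian_entropy(var):
    """Differential entropy of a Gaussian: (1/2) ln(2 pi e var)."""
    return 0.5 * log(2 * pi * e * var)
```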

Leaf information The information $${\mathsf {L}}(\ell ;T)$$ stored at the leaf $$\ell$$ of a CaDET tree $$T$$ is the number of training points mapped to the leaf $$\left|{\mathsf {T}}(\ell ;T,Z)\right|$$, and the sufficient statistic $${\mathsf {w}}\bigl ({\mathsf {T}}(\ell ;T,Z)\bigr )$$ of the training elements $${\mathsf {T}}(\ell ;T,Z)$$ that $$T$$ maps to $$\ell$$. For notational convenience, we take $${\mathsf {L}}(\cdot ; \cdot )$$ to be a vector, where the 0th component is the sample size, and the remaining components are the sufficient statistic, i.e.,

\begin{aligned} {\mathsf {L}}_{0}(\ell ;T) = \bigl |{\mathsf {T}}(\ell ;T,Z)\bigr |, \ \text {and}\ {\mathsf {L}}_{1:\dim ({\mathsf {w}})}(\ell ;T) = {\mathsf {w}}\bigl ({\mathsf {T}}(\ell ;T,Z)\bigr ), \end{aligned}

where $$V_{a:b}(\cdot )$$ is vector slice notation, corresponding to codomain indices $$a, \dots , b$$ of the vector-valued function $$V(\cdot )$$. Because CaDET stores only $${\mathsf {w}}\bigl ({\mathsf {T}}(\ell ;T,Z)\bigr )$$ at each leaf $$\ell$$, it has lower storage and query time costs than current tree-based CDE methods, which must store and process raw training labels to answer forest queries.

Response to queries Given a tree $$T$$, the response $${\mathsf {q}}(\cdot ; x, T)$$ to a query at $$x\in {\mathcal {X}}$$ is the MLE PDF w.r.t. $${\mathcal {F}}$$ on the $${\mathcal {Y}}$$ components of $${\mathsf {T}}(x;T,Z)$$:

\begin{aligned} {\mathsf {q}}(\cdot ; x, T) = \rho \Bigl ( \, \cdot ; {\mathsf {p}}^{(N)}\Bigl ({\mathsf {w}}^{(N)}\bigl ({\mathsf {T}}(x;T,Z)\bigr )\Bigr )\Bigr ) = \rho \Bigl ( \, \cdot ; {\mathsf {p}}^{(N)}\bigl ({\mathsf {L}}_{1:\dim ({\mathsf {w}})}(x;T)\bigr )\Bigr ), \end{aligned}

taking $$N = \left|{\mathsf {T}}(x;T,Z)\right| = {\mathsf {L}}_{0}(x;T)$$. Since CDE responses are PDFs, which are themselves functions, we write $${\mathsf {q}}(\cdot ; x, T)$$, where the first argument is an element of the domain of the PDF, the second the query point, and the third the tree.

This response is well-motivated, as $${\mathsf {T}}(x; T, Z)$$ should be an approximately independent sample from approximately the conditional distribution at x. The “approximate” qualification is needed as split choice induces some dependence, and the conditional distribution changes as $$x$$ varies throughout the leaf. Ignoring the approximation, it is then reasonable to return the MLE estimate for this sample.

The careful reader may notice that one could just store this PDF at the leaf, in place of the sufficient statistic of the training set mapped to this leaf. For trees, either suffices, but we will require sufficient statistics to answer queries with forests.
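Continuing the univariate-Gaussian illustration (names are ours): a query is answered purely from the stored leaf vector, by mapping the summed sufficient statistic to MLE parameters and returning the corresponding PDF:

```python
import math

# Continuing the univariate-Gaussian illustration (names ours): the
# query response is the MLE PDF recovered from the leaf vector alone;
# the raw training labels are never needed.
def mle_params(N, s1, s2):
    mean = s1 / N
    var = s2 / N - mean * mean   # (biased) MLE variance
    return mean, var

def gaussian_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def query(leaf_vector):
    N, s1, s2 = leaf_vector
    mean, var = mle_params(N, s1, s2)
    return lambda y: gaussian_pdf(y, mean, var)   # q(.; x, T)

q = query([4, 8.0, 20.0])   # four labels with sum 8 and sum of squares 20
assert abs(q(2.0) - 1.0 / math.sqrt(2 * math.pi)) < 1e-12
```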

4.3 Random forests

Consider a random forest $$F$$ composed of CaDET trees $$T_1,\cdots ,T_t$$, with training sets $$Z_1,\cdots ,Z_t$$, and shared distribution family $${\mathcal {F}}$$. Here the response $${\mathsf {q}}(\cdot ; x, F)$$ to the query at $$x\in {\mathcal {X}}$$ is

\begin{aligned} {\mathsf {q}}(\cdot ; x, F) = \rho \!\left( \cdot \,; {\mathsf {p}}^{(N)}\!\!\left( \!\, {\mathsf {w}}\left( \biguplus _{i = 1}^{t} {\mathsf {T}}\bigl (x; T_i, Z_i\bigr )\right) \right) \! \right) \! = \! \rho \!\left( \cdot \,; {\mathsf {p}}^{(N)}\!\!\left( \!\, \sum _{T\in F} \! {\mathsf {L}}_{1:\dim ({\mathsf {w}})}(x;T)\right) \! \right) \!, \end{aligned}

where $$\displaystyle N = \sum _{i = 1}^{t} \bigl |{\mathsf {T}}\bigl (x; T_i, Z_i\bigr )\bigr |= \sum _{T\in F} {\mathsf {L}}_0(x; T)$$,

and for exponential-class $${\mathcal {F}}$$, the sum is from (7), and must be replaced by repeated applications of $${\mathsf {g}}(\cdot , \cdot )$$ from (9) for $${\mathcal {F}}$$ not in the exponential class.

If each training set $$Z_i$$ for each $$T_i$$ were drawn i.i.d., then sample concatenation across the trees would be well-motivated: for any $$x \in {\mathcal {X}}$$, by the same reasoning as in the tree case, each $${\mathsf {T}}(x; T_i, Z_i)$$ is an approximately i.i.d. sample from the true conditional density at x, thus the MLE estimator for their sample concatenation should be better than any of the individual trees' estimates. When instead each $$Z_i$$ is created by bagging the original training data, the samples at each leaf are more dependent (duplicates are more likely), and MLE should behave similarly to a parametric bootstrap estimate, but the same reasoning, of combining small approximately i.i.d. samples into one large sample and performing MLE, holds.
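Under the same univariate-Gaussian illustration (names ours), the forest response requires only a component-wise sum of the leaf vectors reached in each tree:

```python
# Forest illustration (names ours): the per-tree leaf vectors reached by
# the same query point are summed component-wise, pooling sample sizes
# and sufficient statistics; the MLE is then computed once from the
# pooled vector, exactly as in the single-tree case.
def merge_leaf_vectors(vectors):
    pooled = [0.0] * len(vectors[0])
    for v in vectors:
        for i, component in enumerate(v):
            pooled[i] += component
    return pooled

# Leaf vectors [N, sum y, sum y^2] reached by the same query x in 3 trees.
per_tree = [[2, 3.0, 5.0], [3, 6.0, 14.0], [1, 1.0, 1.0]]
assert merge_leaf_vectors(per_tree) == [6, 10.0, 20.0]
```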

4.4 Discussion

On domains and parametric families CaDET can be instantiated with any parametric family with a bounded-dimensional sufficient statistic over any $${\mathcal {Y}}$$. In contrast, non-parametric techniques are generally tied to a particular codomain, often $${\mathbb {R}}^d$$. Although many spaces (e.g., discrete, simplicial, spherical, or cyclic) can be embedded in $${\mathbb {R}}^d$$, interpreting a nonparametric model over $${\mathbb {R}}^d$$ in $${\mathcal {Y}}\subseteq {\mathbb {R}}^d$$ may invalidate density estimates, as densities over $${\mathbb {R}}^d$$ are not necessarily densities over $${\mathcal {Y}}$$.

Specifically, if $${\mathcal {Y}}\subseteq {\mathbb {R}}^d$$, but densities in $${\mathcal {Y}}$$ are interpreted w.r.t. the Lebesgue or Borel measures in $${\mathbb {R}}^d$$, then often the total mass over $${\mathcal {Y}}$$ is less than 1. Furthermore, if $${\mathbb {R}}^d$$ and $${\mathcal {Y}}$$ do not share a measure (as in simplicial or spherical domains, where $${\mathcal {Y}}$$ is ($$d-1$$)-dimensional), the total mass of estimated densities can even exceed 1. Workarounds like transformation functions exist, though they have their own issues (see Sect. 5), whereas CaDET can handle tasks directly in their original space, using simple probabilistic models designed to work well for a particular setting.

Parametric versus non-parametric sample complexity CaDET’s restriction to parametric families with bounded-dimensional sufficient statistics necessarily limits the representative power of its CDEs: if $${\mathcal {F}}$$ poorly models true conditional densities, then nonparametric CDE trees may outperform CaDET given enough training data. However, CaDET will generally perform better with small sample sizes, as MLE exhibits faster convergence than nonparametric techniques. We show an example of this behavior in Sect. 6.

This faster rate is particularly important in CDE-trees, since each leaf requires enough data to accurately estimate conditional densities. CaDET trees thus require fewer samples at each leaf than nonparametric methods, allowing them to better model conditional density structure with additional splits. Even with additional splits, CaDET generally remains more interpretable than nonparametric methods, as splits are easily understood, whereas complicated nonparametric distribution estimates are not.

Generalizing prior art Let $${\mathcal {F}}_\mathrm {c}$$ be the categorical family, and $${\mathcal {F}}_\mathrm {G}$$ the unit-variance Gaussian family. It holds by Lemma 1 that $${\mathsf {m}}_{{\mathsf {ece}}}(\cdot ; {\mathcal {F}}_\mathrm {c}) = {\mathsf {m}}_{{\mathsf {H}}}(\cdot )$$, and $${\mathsf {m}}_{{\mathsf {ece}}}(\cdot ; {\mathcal {F}}_\mathrm {G}) \propto {\mathsf {m}}_{{\mathsf {mse}}}(\cdot )$$. Thus, with these family choices, CaDET makes the same splits as entropy-minimizing classification trees (Quinlan 1986) and MSE-minimizing regression trees (Breiman et al. 1984), respectively. CaDET therefore generalizes two classic decision-tree models to a broad class of parametric estimation problems.

4.5 Training time complexity

Consider the training of a decision tree using a training set $$Z\in {({\mathcal {X}}\times {\mathcal {Y}})}^n$$, where splits are chosen from all univariate threshold functions over a constant number of features to minimize either $${\mathsf {m}}_{{\mathsf {H}}}(\cdot )$$ (for classification) or $${\mathsf {m}}_{{\mathsf {mse}}}(\cdot )$$ (for regression). The time necessary for the training is in the best case $${\varvec{\Theta }}\bigl (n\log n\bigr )$$, and in the worst case $$\smash {{\varvec{\Theta }}\bigl (n^2\bigr )}$$. TF (Hothorn and Zeileis 2017) and RFCDE (Pospisil and Lee 2018) require $${\varvec{\Omega }}\bigl (\left|{\mathsf {T}}(v;T,Z)\right|\bigr )$$ time to evaluate each potential split of node v, thus training them takes time $$\smash {{\varvec{\Omega }}\bigl (n^2\bigr )}$$ in the best case and $$\smash {{\varvec{\Omega }}\bigl (n^3\bigr )}$$ in the worst case. These times are worse than the ones mentioned above by a factor $$\smash {{\tilde{\varvec{\Omega }}}(n)}$$.

Training CaDET trees with a family $${\mathcal {F}}$$ attains the faster training time complexities of $${\mathsf {m}}_{\mathsf {H}}(\cdot )$$ and $${\mathsf {m}}_{\mathsf {mse}}(\cdot )$$ trees, as long as $${\mathcal {F}}$$ has a sufficient statistic $${\mathsf {w}}(\cdot )$$ and combination function $${\mathsf {g}}(\cdot , \cdot )$$ (see (9)) such that $${\mathsf {g}}(\cdot , \cdot )$$, $${\mathsf {w}}^1(\cdot )$$, and $${\mathsf {H}}\bigl (\rho (\cdot ; {\mathsf {p}}(w))\bigr )$$ for any $$w \in \smash {{\mathbb {R}}^{\dim ({\mathsf {w}})}}$$ can all be evaluated in $${\varvec{\Theta }}(1)$$ time. CaDET attains these time complexities because sufficient statistics can be updated via $${\mathsf {g}}(\cdot , \cdot )$$ in amortized constant time at each potential split, and entropies can be efficiently computed (by assumption), matching the cost of computing discrete entropy or variance in classification or regression trees. Without bounded-dimensional sufficient statistics or combination functions, CaDET generally must perform $${\varvec{\Omega }}\bigl (\left|{\mathsf {T}}(v;T,Z)\right|\bigr )$$ work to evaluate a split at node v, exactly as in RFCDE and TF; in this case, CaDET attains the slower training time complexities of these algorithms.
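The constant-time bookkeeping can be sketched as follows. This is a simplified, single-feature illustration assuming a univariate Gaussian family (all names are ours); minimizing the weighted child entropy is equivalent to maximizing impurity reduction at a fixed node:

```python
import math

# Simplified single-feature split search assuming a univariate Gaussian
# family. Left/right sufficient statistics (count, sum, sum of squares)
# are updated in O(1) per candidate threshold, so scoring all splits of
# a node costs O(n log n) for the sort plus O(n) for the sweep.
def gaussian_entropy(n, s1, s2):
    var = max(s2 / n - (s1 / n) ** 2, 1e-12)   # clamp for numerical safety
    return 0.5 * math.log(2 * math.pi * math.e * var)

def best_split(xs, ys, min_leaf):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n = len(xs)
    ln, ls1, ls2 = 0, 0.0, 0.0
    rn, rs1, rs2 = n, sum(ys), sum(y * y for y in ys)
    best_score, best_threshold = math.inf, None
    for k in range(n - 1):
        y = ys[order[k]]
        ln, ls1, ls2 = ln + 1, ls1 + y, ls2 + y * y   # O(1) updates
        rn, rs1, rs2 = rn - 1, rs1 - y, rs2 - y * y
        if ln < min_leaf or rn < min_leaf:
            continue
        # weighted empirical cross entropy of the two children
        score = (ln * gaussian_entropy(ln, ls1, ls2)
                 + rn * gaussian_entropy(rn, rs1, rs2)) / n
        if score < best_score:
            best_score = score
            best_threshold = (xs[order[k]] + xs[order[k + 1]]) / 2
    return best_threshold

# Two well-separated clusters: the chosen threshold lies between them.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
ys = [1.0, 1.1, 0.9, 9.0, 9.1, 8.9]
assert 0.2 < best_split(xs, ys, min_leaf=2) < 5.0
```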

5 Extensions

Parametric distributions with few parameters, such as univariate Gaussians, are generally interpretable. However, the distribution families one might naturally consider in high-dimensional or unfamiliar spaces may have many parameters, thus becoming less interpretable. We now discuss three methods to construct rich parametric families over complex domains from simple constituent families over familiar domains, without sacrificing interpretability:

• product families, which are multivariate distributions built from univariate constituent distributions;

• transformation families, which can be used to produce distributions with restricted support to suit domain-specific requirements; and

• union families, which enable performing MLE over multiple families.

Product families We often want to estimate multivariate densities, i.e., $${\mathcal {Y}}$$ is a product space with $${\mathcal {Y}}= {\mathcal {Y}}_1 \times \cdots \times {\mathcal {Y}}_d$$, but have domain-specific knowledge about each $${\mathcal {Y}}_i$$ (e.g., whether the support is discrete, real, or semireal) which standard primitive distributions, such as the multivariate Gaussian, would ignore. Product families compute the joint density over $${\mathcal {Y}}$$ as a product of density estimates (thus treating each multiplicand as an independent random variable) over each $${\mathcal {Y}}_1, \dots , {\mathcal {Y}}_d$$. Computation over product families is particularly convenient, as sufficient statistics, densities, and entropies can all be computed from univariate densities, and the exponential class is closed under finite products.

Product-family CaDET should be contrasted with CaDET applied separately on each $${\mathcal {Y}}_i$$, estimating joint densities as products of univariate densities. In both cases, joint CDEs are product distributions; however, in the first case, CaDET uses impurity reduction across all $${\mathcal {Y}}_1, \dots , {\mathcal {Y}}_d$$ to select splits, whereas in the second case, splits are learned separately for each $${\mathcal {Y}}_i$$. If the conditional densities of each $${\mathcal {Y}}_i$$ vary similarly over $${\mathcal {X}}$$, then this additional information allows better split selection in the first case. Additionally, the product-family tree is simpler than the collection of trees for each $${\mathcal {Y}}_i$$, thus more interpretable and less prone to overfitting.
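A minimal sketch of a product family (the constituent choices and helper names are ours): a Gaussian for a real-valued component and an exponential for a non-negative component, with the joint entropy given by the sum of constituent entropies:

```python
import math

# Sketch of a product family over R x R_+ (constituent choices ours):
# a Gaussian for the real component and an exponential for the
# non-negative component. The joint density is the product of the
# constituent densities, and the joint entropy is the sum of theirs.
def gaussian_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def exponential_pdf(y, rate):
    return rate * math.exp(-rate * y) if y >= 0 else 0.0

def product_pdf(y):
    return gaussian_pdf(y[0], 0.0, 1.0) * exponential_pdf(y[1], 2.0)

def product_entropy():
    gaussian_h = 0.5 * math.log(2 * math.pi * math.e)  # N(0, 1)
    exponential_h = 1.0 - math.log(2.0)                # Exp(rate = 2)
    return gaussian_h + exponential_h

assert abs(product_pdf([0.0, 0.0]) - 2.0 / math.sqrt(2 * math.pi)) < 1e-12
```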

Transformation families Often $${\mathcal {Y}}$$ is not $${\mathbb {R}}^d$$ or some space with a plethora of convenient well-known distribution families. For example, $${\mathcal {Y}}$$ could be the unit sphere, unit simplex, or some compact subset of $${\mathbb {R}}^d$$. Transformation families contain distributions over such a $${\mathcal {Y}}$$, obtained by transforming familiar distributions over some isomorphic space $${\mathcal {Y}}'$$. Such transformations can be intuitive and thus interpretable; for instance we may transform Cartesian coordinates of points on the Earth’s surface to the more familiar latitude and longitude coordinates.

A transformation function $$\phi : {\mathcal {Y}}\rightarrow \phi ({\mathcal {Y}})$$ is a differentiable invertible mapping. Given a family $${\mathcal {F}}$$ over $${\mathcal {Y}}$$ parameterized by $$\varTheta$$, we define the $$\phi$$-transformed density

\begin{aligned} (\rho \circ \phi )(\cdot ; \theta ) = \bigl |{\mathcal {J}}\bigl (\phi (\cdot )\bigr ) \bigr |^{-1}\rho \bigl (\phi (\cdot ); \theta \bigr ), \end{aligned}
(11)

and the corresponding $$\phi$$-transformed family

\begin{aligned} {\mathcal {F}}\circ \phi = \{ (\rho \circ \phi )(\cdot ; \theta ): \ \theta \in \varTheta \}, \end{aligned}
(12)

where $$|{\mathcal {J}}(\phi (\cdot )) |$$ is the absolute determinant of the Jacobian of $$\phi$$. In CaDET we assume the existence of a bounded-dimensional sufficient statistic, which is particularly convenient with transformation functions through (5), as we may compute the base measure of $${\mathcal {F}}\circ \phi$$ as $${\mathsf {h}}(\cdot ; {\mathcal {F}}\circ \phi ) = |{\mathcal {J}}(\phi (\cdot )) |^{-1} {\mathsf {h}}(\phi (\cdot ))$$ and the sufficient statistic as $${\mathsf {w}}(\cdot ; {\mathcal {F}}\circ \phi ) = {\mathsf {w}}(\phi (\cdot ); {\mathcal {F}})$$.

For example, we can construct the inverse-gamma family from $$\phi (y) = y^{-1}$$ and the gamma family, and the log-normal family from $$\phi (y): {\mathbb {R}}_+^d \rightarrow {\mathbb {R}}^d = \ln (y)$$ and the Gaussian family. The logarithm induces a domain change, yielding families over $${\mathbb {R}}_+^d$$, which is useful for estimating positive quantities.
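A sketch of (11) for this log-normal construction (helper names are ours): with $$\phi (y) = \ln (y)$$ in one dimension, the absolute Jacobian factor works out to $$1/y$$, so the transformed density is the Gaussian density of $$\ln (y)$$ divided by $$y$$:

```python
import math

# Sketch of (11) for the log-normal construction (helper names ours):
# with phi(y) = ln(y) in one dimension, the Jacobian factor is 1/y, so
# the transformed density is rho(ln(y); theta) / y.
def gaussian_pdf(u, mean, var):
    return math.exp(-(u - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def lognormal_pdf(y, mean, var):
    return gaussian_pdf(math.log(y), mean, var) / y

# At y = e^mean, the Gaussian factor sits at its mode, scaled by 1/y.
assert abs(lognormal_pdf(math.e, 1.0, 1.0)
           - 1.0 / (math.e * math.sqrt(2 * math.pi))) < 1e-12
```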

Transformation functions, when paired with an appropriate coordinate system, can also be used to construct distributions over sets of Lebesgue measure zero, such as the unit simplex $$\Delta ^d = \smash {\{y \in {(0, 1)}^{d+1}: \ \left\Vert y\right\Vert _{1} = 1\}}$$, or the unit sphere $${\mathcal {S}}^d = \smash {\{y \in {\mathbb {R}}^{d+1}: \ \left\Vert y\right\Vert _{2} = 1\}}$$, which are of key importance in compositional statistics and directional statistics, respectively. E.g., for simplicial data we can apply the Additive Log-Ratio-Transform (ALRT) (Aitchison 1982)

\begin{aligned} \phi _{{\textsf {ALRT}}}(y_1,\cdots ,y_{d+1}): \Delta ^{d} \rightarrow {\mathbb {R}}^{d} = \left( \! \ln \!\left( \frac{y_1}{y_{d+1}}\right) \!, \ln \!\left( \frac{y_2}{y_{d+1}}\right) \!, \dots , \ln \!\left( \frac{y_d}{y_{d+1}} \! \right) \! \right) , \end{aligned}
(13)

and for spherical data, the stereographic projection transform

\begin{aligned} \phi _{{\textsf {Stg}}}(y_1,\cdots ,y_{d+1}): {\mathcal {S}}^{d} \rightarrow {\mathbb {R}}^{d} = \left( \frac{y_1}{1 - y_{d+1}}, \frac{y_2}{1 - y_{d+1}}, \dots , \frac{y_d}{1 - y_{d+1}} \right) . \end{aligned}
(14)
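The ALRT of (13) is only a few lines of code; the inverse map below is our addition, included to emphasize that the transform is a bijection onto $${\mathbb {R}}^d$$:

```python
import math

# The ALRT of (13) and its inverse (the inverse is our addition): points
# on the open simplex Delta^d map bijectively to R^d, so familiar
# families over R^d induce families over Delta^d.
def alrt(y):
    return [math.log(c / y[-1]) for c in y[:-1]]

def alrt_inv(u):
    expanded = [math.exp(c) for c in u] + [1.0]
    total = sum(expanded)
    return [c / total for c in expanded]

y = [0.2, 0.3, 0.5]
recovered = alrt_inv(alrt(y))
assert all(abs(a - b) < 1e-12 for a, b in zip(y, recovered))
assert abs(sum(recovered) - 1.0) < 1e-12
```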

In regression under the assumption of heteroskedastic noise, where the task is to predict $${\mathbb {E}}[y | x]$$, data transformation is unsatisfying: for some transformation $$\phi$$, learning $${\mathbb {E}}[\phi (y) | x]$$ is insufficient, as it does not in general hold that $${\mathbb {E}}[y | x] = \phi ^{-1}({\mathbb {E}}[\phi (y) | x])$$. In contrast, in CDE, we can convert conditional densities over $$\phi ({\mathcal {Y}})$$ to conditional densities over $${\mathcal {Y}}$$ through (11), so we retain the ability to interpret transformed variables in the untransformed space.

Hothorn and Zeileis (2017) also use transformation functions in their Transformation Forests (TFs), though they fix distributions and parameterize transformations, while CaDET does the opposite. For simple cases like affine transformations in location-scale families, they are equivalent, but we argue that simple distributions with complicated parameterized transformations are generally less interpretable than complicated parametric distributions with simple fixed transformations. TFs also only handle $${\mathcal {Y}}= {\mathbb {R}}$$, and operate on quantiles rather than densities. Generalizing TFs to $${\mathbb {R}}^{d}$$ is nontrivial, as working with multivariate quantiles or CDFs of transformations generally requires sophisticated integration, complicating interpretability and computation.

Transformation is thus intuitive, interpretable, and computationally convenient for parametric CDE. These beneficial properties put this use of transformation in stark contrast to its use in regression and quantile estimation, where it is in general difficult to interpret the output of transformed models in the original space.

Union families It is often hard to select a priori a parametric family to model conditional densities. One could select between models trained over multiple families, but to do so would be inefficient, and would perform poorly when the best family to fit conditional densities varies over $${\mathcal {X}}$$. It would be preferable to learn a model that is able to select distribution families in a data-dependent manner, fitting different distribution families to different regions of $${\mathcal {X}}$$.

One could select the MLE at each leaf among multiple families, but this approach favors complexity over simplicity, and tends to overfit. Given families $${\mathcal {F}}_1, {\mathcal {F}}_2$$ such that $${\mathcal {F}}_1 \subseteq {\mathcal {F}}_2$$ (e.g., the exponential and gamma families), for any i.i.d. sample $$Y \in {\mathcal {Y}}^{n}$$, with MLE parameter estimates $$\theta _1 = \smash {{\mathsf {p}}^{(n)}}\bigl (\smash {{\mathsf {w}}^{(n)}}(Y; {\mathcal {F}}_1); {\mathcal {F}}_1\bigr )$$ and $$\theta _2 = \smash {{\mathsf {p}}^{(n)}}\bigl (\smash {{\mathsf {w}}^{(n)}}(Y; {\mathcal {F}}_2); {\mathcal {F}}_2\bigr )$$, the MLE sample densities obey

\begin{aligned} \rho (Y; {\mathcal {F}}_1, \theta _1) = \prod _{y \in Y} \rho (y; {\mathcal {F}}_1, \theta _1) \le \prod _{y \in Y} \rho (y; {\mathcal {F}}_2, \theta _2) = \rho (Y; {\mathcal {F}}_2, \theta _2) . \end{aligned}

However, the estimate $$\rho (\cdot ; {\mathcal {F}}_1, \theta _1)$$ is often preferable to $$\rho (\cdot ; {\mathcal {F}}_2, \theta _2)$$, for instance when they fit similarly well or $$n$$ is small, as simpler distributions are more interpretable and generally less susceptible to overfitting.

We address these issues with a more nuanced approach, termed regularized union family selection. Given families $${\mathcal {F}}_1, \cdots , {\mathcal {F}}_m$$, with parameter spaces $$\varTheta _1, \dots , \varTheta _m$$, the union family $${\mathcal {F}}=\cup _{i=1}^{m} {\mathcal {F}}_i$$ has parameter space $$\cup _{i=1}^{m} (\{i\} \times \varTheta _i)$$, with

\begin{aligned} \rho \bigl (\cdot ; {\mathcal {F}}\!, (i, \theta )\bigr ) = \rho (\cdot ; {\mathcal {F}}_i, \theta ), \end{aligned}

thus $${\mathcal {F}}$$ can be used to select among distributions from several families. The sufficient statistics for each $${\mathcal {F}}_i$$ are enough to perform MLE within each subfamily, and for exponential-class families, we may perform MLE over the entire union family given these sufficient statistics and the sample log base measures $$\ln \bigl ({\mathsf {h}}(\cdot ; {\mathcal {F}}_i)\bigr )$$ associated with each $${\mathcal {F}}_i$$ (see Lemma 1). However, to control for overfitting, prioritize simpler distributions, and incorporate a priori domain knowledge, we take regularization hyperparameters $$\lambda = \langle \lambda _1, \cdots , \lambda _m \rangle$$, and select the distribution that maximizes the regularized sample log likelihood, defining $${\mathsf {p}}^{(n)}(\cdot ; {\mathcal {F}}\!, \lambda )$$ as

\begin{aligned} {\mathsf {p}}^{(n)}\bigl ({\mathsf {w}}^{(n)}(Y); {\mathcal {F}}\!, \lambda \bigr ) = {{\,\mathrm{argmin}\,}}_{i \in \{1, \dots , m\}, \theta \in \varTheta _i} \!\! \lambda _i - \frac{1}{n}\sum _{y \in Y} \ln \bigl (\rho (y; {\mathcal {F}}_i, \theta )\bigr ) , \end{aligned}

with corresponding regularized empirical cross entropy impurity criterion

\begin{aligned} {\mathsf {m}}_{\mathsf {ece}}(Y; {\mathcal {F}}\!, \lambda ) = \min _{i \in \{1, \dots , m\}} \lambda _i + {\mathsf {m}}_{\mathsf {ece}}(Y; {\mathcal {F}}_i) , \end{aligned}
(15)

where the notation explicitly references the family and regularization parameters.

There are many reasonable ways to select regularization parameters. For example, in our experiments we use the Akaike Information Criterion (AIC), setting

\begin{aligned} \lambda _i = \frac{\dim (\varTheta _i)}{n} , \ \text {for}\ i \in \{1,\cdots ,m\}, \end{aligned}
(16)

where $$\dim (\varTheta _i)$$ denotes the dimension of the parameter space of $${\mathcal {F}}_i$$.
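A toy sketch of regularized union-family selection (the two nested Gaussian candidate families and all names are our choice, not the paper's experimental setup): each candidate reports its MLE average log-likelihood, and we pick the family minimizing $$\lambda _i$$ minus that average, with AIC weights $$\lambda _i = \dim (\varTheta _i)/n$$ from (16):

```python
import math

# Toy regularized union-family selection (families and names are ours):
# candidates are the univariate Gaussian (dim(Theta) = 2) and the
# unit-variance Gaussian (dim(Theta) = 1). Following (15) and (16), we
# pick the family minimizing lambda_i minus the MLE average
# log-likelihood, with AIC weights lambda_i = dim(Theta_i) / n.
def avg_loglik(Y, fixed_var=None):
    n = len(Y)
    mean = sum(Y) / n
    var = fixed_var if fixed_var is not None else sum((y - mean) ** 2 for y in Y) / n
    return (-0.5 * math.log(2 * math.pi * var)
            - sum((y - mean) ** 2 for y in Y) / (2 * var * n))

def select_family(Y):
    n = len(Y)
    candidates = {
        "gaussian": (2, avg_loglik(Y)),
        "unit-variance gaussian": (1, avg_loglik(Y, fixed_var=1.0)),
    }
    # regularized ECE: lambda_i minus the average log-likelihood
    return min(candidates, key=lambda k: candidates[k][0] / n - candidates[k][1])

# Data with near-unit spread: AIC regularization favors the simpler family.
assert select_family([-1.0, 0.0, 1.0, 0.5, -0.5]) == "unit-variance gaussian"
```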

6 Experimental evaluation

Here we present the results of our experimental evaluation of CaDET, including the comparison with RFCDE (Pospisil and Lee 2018). We test the versatility of CaDET by instantiating it with many parametric families, including over multivariate codomains, probability simplices, and a cyclic codomain. We also evaluate CaDET with a union family and the regularized ECE impurity (see Sect. 5). The accuracy of a learned model M is measured with the Average Conditional Log Likelihood (ACLL) of the conditional density estimates produced by M on a test set $$\smash {Z' \in {({\mathcal {X}}\times {\mathcal {Y}})}^{n'}}$$, i.e.,

\begin{aligned} \frac{1}{n'} \sum _{(x, y) \in Z'} \ln \bigl ({\mathsf {q}}(y; x, M)\bigr ). \end{aligned}

The ACLL is a good accuracy measure, as it can be computed from the estimated conditional density $${\mathsf {q}}(\cdot ; x, M)$$, and is maximized in expectation by the true conditional density.
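A minimal sketch of the ACLL computation (the toy model is ours):

```python
import math

# ACLL on a test set Z' of (x, y) pairs (the toy model q is ours): the
# average log of the conditional density the model assigns to each
# observed label.
def acll(test_set, q):
    return sum(math.log(q(y, x)) for x, y in test_set) / len(test_set)

def toy_q(y, x):
    # a model that ignores x and always answers with a standard Gaussian
    return math.exp(-y * y / 2) / math.sqrt(2 * math.pi)

Z_test = [(0.0, 0.0), (0.0, 1.0)]   # (x, y) pairs
score = acll(Z_test, toy_q)
assert abs(score + 0.5 * math.log(2 * math.pi) + 0.25) < 1e-12
```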

Implementation Our implementation of CaDET extends scikit-learn (Pedregosa et al. 2011). It supports many distribution families, shown in Table 1. When building trees, it selects the split that minimizes impurity over all univariate threshold functions such that at least some user-specified number of training points is assigned to each child. We call this parameter the Minimum Samples per Leaf (MSL). Forests do not search all univariate threshold functions over all features; instead, at each node, they consider only univariate thresholds on a subset of the features, drawn uniformly without replacement.

Baseline We compare the accuracy and interpretability of various CaDET models to RFCDE models (Pospisil and Lee 2018) on many multivariate CDE tasks. RFCDE was experimentally shown to be superior to other tree-based techniques such as QRFs (Meinshausen 2006) and TFs (Hothorn and Zeileis 2017), both of which only operate over univariate $${\mathcal {Y}}$$. We use the RFCDE implementation provided by the authors, with Gaussian KDE, the normal reference dynamic kernel-width selection strategy, and a 7-term tensor-cosine basis. This implementation does not allow log-density queries, and thus can output conditional density 0 (due to limited floating-point precision), to which we assign log-density $$-1000$$. This choice does not artificially disadvantage RFCDE, as our CaDET implementation permits floating-point log-density queries, which can attain values far below $$-1000$$.

Datasets, tasks, and families We used datasets with different associated prediction tasks, requiring different choices of the parametric family $${\mathcal {F}}$$:

• The Air Quality dataset (De Vito et al. 2008) tasks us with estimating (multivariate) conditional probability densities of the concentrations (particle count or unit mass per unit volume) of four pollutants, given time, temperature, humidity, and air quality sensor readings. We randomly split the dataset into training and test sets of 3,698 samples each. Concentrations must be non-negative, so we use CaDET with the unconstrained, uncorrelated, and symmetric log-Gaussian families. We compare these models to logarithmically-transformed RFCDE.

• The Batting dataset (Lahman 2018) is our largest dataset, with 88,461 samples, each representing a professional baseball player, with height, weight, age, handedness, birthplace, league, and team features. Batting tasks us with estimating the probabilities of a player attaining each of five batting outcomes (base 1–3, home, or strikeout), thus outcome distributions are members of $${\mathcal {Y}}= \Delta ^4 = \{y \in {(0, 1)}^{5}: \ \left\Vert y\right\Vert _{1} = 1\}$$. We also define a 2-way variant, where the task is to estimate the probability of striking out, in which case $${\mathcal {Y}}= [0, 1]$$.

Dirichlet-CaDET and beta-CaDET produce densities w.r.t. the Lebesgue measure over $$\Delta ^4$$ and [0, 1], respectively, and thus are appropriate for the task. We compare these models to three Gaussian-CaDET models and to RFCDE, using the ALRT from (13) for the 5-way Gaussian-CaDET and RFCDE models to convert to a problem over $${\mathbb {R}}^4$$, letting the strikeout probability be the last (asymmetric) variable of the transformation. In the 2-way task, we compare beta-CaDET to Gaussian-CaDET and RFCDE without transformation, although a density estimate $$\rho$$ from the non-beta models has $$\int _{0}^{1} \rho (y) \; \mathrm {d}y < 1$$ (see Sect. 4.4). This disadvantage is intrinsic to these approaches and highlights the flexibility of CaDET with transformation functions. Interpretability is particularly difficult in the 5-way task. Here the Dirichlet estimates of Dirichlet-CaDET should be understandable by any analyst familiar with compositional statistics, making this model the most interpretable. The Gaussian models are also quite simple, but, although the ALRT (see (13)) is standard in compositional statistics, some effort is required to interpret it, as it makes the Gaussian models behave roughly like log-Gaussian models. With covariance matrices, Gaussian-CaDET can model correlations between (approximate) log-frequencies, e.g., between the probabilities of reaching first base and reaching second base, unlike the Dirichlet distribution, which has only 5 parameters. Thus, although inherently more complicated, Gaussian-CaDET remains interpretable, and can even yield insights that would be impossible for Dirichlet-CaDET.

• The task on the Sml2010 dataset (Zamora-Martínez et al. 2014) is to estimate the time of day, represented as a value in [0, 1), where 0.5 is noon. This task is interesting for its cyclic nature. Classical regression struggles around midnight, as training points immediately before and after midnight average to noon (maximally incorrect), and non-parametric methods fail to enforce the constraint that predicted times be on the interval [0, 1), nor do they leverage the cyclic nature of the label space. We use the Von-Mises distribution family, scaled to have support [0, 1), as well as Gaussian-CaDET and beta-CaDET, for our parametric models, and compare to RFCDE.

• We use many UCI datasets (Table 2) to evaluate the efficacy of the impurity criterion used by CaDET, the use of union families, and the competitiveness of CaDET against RFCDE on real-world learning tasks. Each task is a multivariate conditional density estimation task, with each label in $${\mathcal {Y}}= {\mathbb {R}}^{\dim ({\mathcal {Y}})}$$. Most of these datasets are intended for univariate classification or regression, but we instead predict the continuous variables from the categorical or integer-valued variables. In some cases, due to many missing values or lack of features, we leave some continuous values as features; the details are presented in the supplementary material. We compare several CaDET variants and RFCDE on these datasets.

6.1 Results

Impact of minimum samples per leaf on overfitting We first study how the MSL parameter, which controls the minimum number of training samples per leaf (enforced by the learning procedure), impacts overfitting in CDE trees. We plot MSL versus training and test ACLL on the Air Quality dataset in Fig. 1 (left). Here we consider only single trees, as diversity in random forests tends to obscure overfitting in individual trees.

We see classic bias-variance trade-off curves for all models, with training ACLL monotonically decreasing with the MSL, and test ACLL first increasing, then decreasing. The training-test ACLL difference is a measure of overfitting, and here it decreases as MSL increases, and also as the number of parameters in each parametric family (see Table 1) decreases. The CaDET trees all perform optimally at MSL $$\approx 25$$, whereas RFCDE reaches optimal performance with MSL $$\approx 100$$, illustrating the lower sample complexity of parametric methods (see Sect. 4.4).

Figure 2 shows ACLL as a function of MSL on the Sml2010 task. In Fig. 2 (left), we see that the Von-Mises-CaDET outperforms all competitors, which is unsurprising, as Von-Mises density estimates are best able to represent uncertainty across the midnight boundary. Indeed in Fig. 2 (center), we see that the Von-Mises-CaDET ACLL decreases least when considering only test samples on the 23:00–1:00 interval, while the Gaussian-CaDET (which is least able to split mass between late night and early morning) ACLL decreases the most. On the 11:00–13:00 interval, Fig. 2 (right), Gaussian-CaDET outperforms the remaining models for small MSL (i.e., large trees with many leaves), but for higher MSL values, this advantage disappears. These results support our claim that parametric models that leverage domain-specific knowledge (in this case the cyclic nature of time) are superior to generic models that do not.

Susceptibility to irrelevant features We now examine the response of the models to irrelevant features: a well-designed impurity criterion should be insensitive to such features and choose splits independently of them. We augment the Air Quality dataset by generating N additional noise features, where each feature value for each sample is drawn i.i.d. from the standard Gaussian distribution. We train models using MSL 50 on this augmented dataset. We plot training and test ACLL as a function of N in Fig. 1 (right). As N increases, CaDET ACLL decreases almost imperceptibly, whereas RFCDE ACLL drops sharply and significantly. RFCDE’s performance drop is not due to overfitting to the noise features, as both test and training ACLL rapidly decrease. Rather, we attribute it to the approximation error of its learning algorithm, which is a consequence of the chosen impurity criterion. The heuristic $${\mathsf {m}}_{\mathsf {mise}}$$ estimate used by RFCDE inadequately assesses the quality of splits, thus RFCDE often splits on noise features, degrading model accuracy. The $${\mathsf {m}}_{\mathsf {ece}}$$ used by $${\textsc {CaDET}}$$ strongly disincentivizes splits on noise features, resulting in similar models regardless of N.

Effectiveness of the chosen impurity criterion We now examine the importance of the impurity criterion via ablation. We train Gaussian-CaDET trees with the MSE impurity from (1) instead of the ECE from (10), and train “vanilla” CaDET trees as a control. The results are shown in the two leftmost columns of Table 2 (we discuss the other columns later). We report ACLL to measure model accuracy, and we quantify interpretability using model size, defined as the total number of continuous parameters required to represent the distributions at each leaf of the tree. Vanilla CaDET yields higher (better) ACLL scores on average and more often than not, with much smaller, thus more interpretable, models.

Dependence on training set size We now evaluate the behavior of the ACLL as we increase the training set size $$n$$ on the Batting dataset. The MSL is fixed to $$\lfloor {\sqrt{n} + \frac{1}{2}}\rfloor$$, and test ACLL is computed on all samples not in the training set. We plot tree and forest experiments in Fig. 3, though overfitting is clearer in the trees.

In the 2-way task, when using trees, beta-CaDET performs the best, though with sufficiently large samples, all models are comparable. In particular, each CaDET model uses a 2-parameter distribution, so we expect a similar amount of overfitting in each, and indeed we see similar rates of improvement as the training size increases. RFCDE, as expected, overfits more due to its KDE estimates: its rate of improvement levels off more slowly than that of the CaDET models.

In the 5-way tree task, $$\dim (\varTheta )$$, which varies between 5 for the Dirichlet and symmetric Gaussian families, and 14 for the Gaussian family, strongly influences model performance. Each model outperforms all others for a contiguous range of $$n$$, and these ranges occur in order of $$\dim (\varTheta )$$, with RFCDE beating the CaDET models only for the highest $$n$$ we examined. The fact that RFCDE beats all other models with sufficient data is unsurprising, as its KDE estimates are consistent and thus, with enough data, should outperform the parametric estimates of CaDET. The Batting dataset contains 88,461 samples, and with only 7 features, we would expect simple conditional density relationships between $${\mathcal {X}}$$ and $${\mathcal {Y}}$$, thus this task, relative to the others, should measure a model’s capacity to fit unconditional densities (at leaves) more than its ability to model conditional structure via splits.

These experiments highlight not only that CaDET is particularly well-suited to small-sample settings, but also that non-parametric methods overtake it only when an enormous amount of data is available, even on very simple datasets. The case for CaDET is even stronger when interpretability is considered: CaDET trees have $${\mathbf {O}}\bigl (\dim (\varTheta )\sqrt{n}\bigr )$$ total parameters, while RFCDE trees have $${\varvec{\Theta }}\bigl (\dim ({\mathcal {Y}})n\bigr )$$, as they must store all training labels at tree leaves.
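The gap in representation size is easy to make concrete with a back-of-the-envelope count, assuming MSL $$\approx \sqrt{n}$$ so that a CaDET tree has on the order of $$\sqrt{n}$$ leaves (the function names below are illustrative):

```python
import math

def cadet_tree_params(n, dim_theta):
    """Rough CaDET count: ~sqrt(n) leaves (with MSL ~ sqrt(n)), each
    storing one fitted distribution of dim_theta parameters."""
    n_leaves = math.isqrt(n)  # order-of-magnitude leaf count
    return n_leaves * dim_theta

def rfcde_tree_params(n, dim_y):
    """RFCDE stores every training label at its leaves: Theta(dim(Y)*n)."""
    return n * dim_y

n = 88_461  # size of the Batting dataset
print(cadet_tree_params(n, dim_theta=14))  # 4158, roughly two orders smaller
print(rfcde_tree_params(n, dim_y=4))       # 353844
```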

Forests improve over trees for every model examined in this experiment. The most significant improvement is in small-sample performance, which is unsurprising, as forests combine estimates across trees, thus predictions are based on larger numbers of training samples. The effect is most pronounced for RFCDE: although its small-sample performance remains worse than that of all CaDET forests, with enough data it eventually outperforms them. Again we conjecture that this is because the Batting tasks primarily assess unconditional density estimation (at leaves), and the bagging in forests reduces KDE overfitting in RFCDE.
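The pooling effect can be sketched minimally, assuming (as is standard for density-estimation forests; not a claim about either implementation) that the forest's conditional density estimate is the uniform average of the per-tree leaf densities, here univariate Gaussians fit on bootstrap resamples:

```python
import math
import random

def gaussian_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gaussian(ys):
    """MLE fit of a univariate Gaussian to leaf labels."""
    mu = sum(ys) / len(ys)
    var = sum((y - mu) ** 2 for y in ys) / len(ys)
    return mu, var

def forest_density(y, leaf_samples_per_tree):
    """Forest estimate: uniform average of per-tree leaf densities, so
    each prediction effectively pools many training samples."""
    fits = [fit_gaussian(ys) for ys in leaf_samples_per_tree]
    return sum(gaussian_pdf(y, mu, var) for mu, var in fits) / len(fits)

random.seed(0)
labels = [random.gauss(0.0, 1.0) for _ in range(200)]
# each "tree" sees a bootstrap resample reaching the queried leaf
boots = [[random.choice(labels) for _ in range(50)] for _ in range(10)]
print(forest_density(0.0, boots))  # pooled density estimate at y = 0
```

Averaging over trees smooths out the idiosyncrasies of any single leaf fit, which is the variance-reduction mechanism conjectured above.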

Effect of using union families We study CaDET with a union family containing the unconstrained, uncorrelated, and symmetric variants of the Gaussian and log-Gaussian families. Since the three variants of each family are nested, the uncorrelated and symmetric families never uniquely maximize sample likelihood; we therefore employ the AIC regularization from (16) to incentivize selecting the uncorrelated and symmetric families.
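How AIC breaks these ties can be sketched as follows, using the standard criterion $$\mathrm {AIC} = 2\dim (\varTheta ) - 2\ln L$$ with each variant fit by constrained maximum likelihood on the leaf sample; the variant with the lowest AIC is kept. The function names, and the restriction to plain (rather than log-) Gaussians, are illustrative assumptions:

```python
import numpy as np

def gaussian_loglik(Y, mean, cov):
    """Log-likelihood of the i.i.d. rows of Y under N(mean, cov)."""
    d = Y.shape[1]
    diff = Y - mean
    inv, logdet = np.linalg.inv(cov), np.linalg.slogdet(cov)[1]
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))

def fit_variants(Y):
    """MLE fits of the three nested Gaussian variants, with dim(Theta)."""
    d = Y.shape[1]
    mu = Y.mean(axis=0)
    full = np.cov(Y, rowvar=False, bias=True)  # unconstrained MLE covariance
    diag = np.diag(np.diag(full))              # uncorrelated (diagonal)
    sphere = np.eye(d) * np.trace(full) / d    # symmetric (spherical)
    return [("full", full, d + d * (d + 1) // 2),
            ("uncorrelated", diag, 2 * d),
            ("symmetric", sphere, d + 1)], mu

def aic_select(Y):
    """Pick the variant minimizing AIC = 2*dim(Theta) - 2*loglik."""
    variants, mu = fit_variants(Y)
    return min(variants, key=lambda v: 2 * v[2] - 2 * gaussian_loglik(Y, mu, v[1]))[0]

rng = np.random.default_rng(0)
# on spherical data the smaller families lose little likelihood, so the
# AIC penalty on dim(Theta) tends to favor them over the full variant
print(aic_select(rng.normal(size=(200, 2))))
```

Because the full family always attains at least the likelihood of its nested restrictions, only the $$2\dim (\varTheta )$$ penalty can make the simpler, more interpretable variants win.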

In the rightmost six columns of Table 2, we compare union-CaDET to Gaussian-CaDET and RFCDE. The union-CaDET trees significantly outperform the Gaussian-CaDET trees, as measured by ACLL, while maintaining significantly smaller model sizes. RFCDE, by contrast, averages 6458 distribution parameters per tree, yet produces models that are both larger and less accurate (as measured by ACLL) than CaDET's. The average training-test ACLL gap for the CaDET models is $$\approx 0.9$$, but for RFCDE it is $$\approx 0.1$$. We thus argue that RFCDE is underfitting, and that its low training-set ACLL is due to poor split selection, since if all else were equal, the KDE at RFCDE leaves should be able to overfit much more than CaDET's parametric estimates.

7 Conclusion

We present CaDET, a tree-based algorithm for parametric CDE. CaDET learns interpretable models that produce interpretable estimates. CaDET trees are built by minimizing the Empirical Cross-Entropy (ECE) impurity criterion. Because ECE is specific to the CDE task, it produces better splits, and thus better estimates, than generic regression impurity criteria. CaDET is a natural generalization of both MSE regression trees and information-gain classification trees, and, under mild conditions, attains the same training time and space complexities as both. Our experimental evaluation shows that CaDET is less prone to overfitting than existing tree-based CDE algorithms, and can outperform them in both accuracy and interpretability.