1 Introduction

In many practically relevant applications, a proper coverage of interactions between predictors is key for constructing strong predictive models. One particularly important example is the analysis of genetic or environmental risk factors in epidemiological and medical studies for, e.g., constructing genetic/polygenic risk scores (Che & Motsinger-Reif, 2013; Ho et al., 2019), which can be viewed as a function \(\varphi : {\mathcal{X}} \rightarrow {\mathcal{Y}}\) from the p-dimensional space \({\mathcal{X}} = \lbrace 0,1,2 \rbrace ^p\) of p SNPs (single nucleotide polymorphisms), i.e., single base-pair substitutions in the DNA, to the response space \({\mathcal{Y}}\), assigning a risk estimate. For example, for a binary outcome such as a binary disease status, a probability estimate \({\hat{P}}(Y = 1 \mid {\varvec{X}} = {\varvec{x}}) \in [0,1]\) of developing this disease might be a proper risk estimate. Since SNPs are variables with three possible outcomes counting the number of minor allele occurrences with respect to both chromosomes, i.e., how often the less frequent variant occurs in an individual, they can easily (and in a biologically meaningful way) be divided into two binary variables each, namely into \(\text{SNP}_D = \mathbbm{1}(\text{SNP} \ne 0)\) and \(\text{SNP}_R = \mathbbm{1}(\text{SNP} = 2)\), coding for a dominant and a recessive effect, respectively. It is well known that in the analysis of genetic features such as SNPs, interactions, e.g., gene-gene interactions (Che & Motsinger-Reif, 2013) and gene-environment interactions (Ottman, 1996), play a crucial role. Especially in this setting, not only a high predictive ability of the resulting models, but also a high interpretability for understanding which and how genetic variants influence the risk of disease is desirable.

Tree-based statistical learning methods such as decision trees, random forests, or logic regression are very popular and versatile in recognizing underlying data structures. These methods have already been applied to analyze SNP data (e.g., Bureau et al., 2005; Winham et al., 2012; Ruczinski et al., 2004). However, they typically fail at simultaneously achieving a reliable predictive strength and a high interpretability of how exactly the predictions are composed.

In this article, we propose the tree-based supervised learning procedure logicDT (logic decision trees), which is specifically tailored to properly incorporating interactions between binary predictors. Continuous relationships of additional covariates and interactions of these covariates with the binary variables can also be covered by this procedure. logicDT is designed to yield highly interpretable prediction models while maintaining a high predictive ability. For measuring the influence of predictors and their interactions, a novel variable importance measure framework is proposed which, in principle, can be used in conjunction with any other learning procedure.

We start by briefly discussing similar methods and efforts on enhancing existing algorithms in Sect. 2. Then, logicDT and its extensions are presented in detail in Sect. 3. We additionally prove that logicDT is consistent. In Sect. 4, the novel variable importance measuring framework for estimating the influence of input variables and their interactions is proposed. Empirical studies on simulated data as well as on real data follow in Sect. 5, illustrating logicDT's properties in practice and comparing logicDT to other procedures. Sections 6 and 7 contain discussions and concluding remarks.

2 Background and related work

In the following, we briefly discuss tree-based supervised learning procedures and their extensions.

2.1 Decision trees and random forests

Decision trees are a very popular and powerful class of statistical learning methods. Important implementations include classification and regression trees (CART) (Breiman et al., 1984) and C4.5 (Quinlan, 1993). Decision trees recursively partition the predictor space \({\mathcal{X}}\) into disjoint patches, considering one predictor per split, and assign an individual prediction value to each patch. For predicting new outcomes, one starts at the root node and follows the edges corresponding to the specific predictor setting until a leaf is reached. Figure 1a illustrates an exemplary decision tree consisting of three binary predictors in a binary classification scenario.

Fig. 1: Exemplary tree models for three binary input variables \(X_1\), \(X_2\) and \(X_3\) predicting two different classes 0/false and 1/true. In a, a classification tree is shown. b depicts a logic tree describing the Boolean expression \(X_1 \vee (X_2 \wedge X_3^c)\), where negations are denoted by \(^c\) in this article. For the logic tree, terminal nodes containing negated predictors are depicted as black squares containing white text. Vice versa, non-negated predictors are depicted as white squares containing black text. Both trees are equivalent, i.e., they perform the same predictions for each predictor setting. Adapted from Lau et al. (2022)

Algorithm 1: Decision tree fitting (pseudocode shown as an image in the original)

Similar to Louppe (2014), Algorithm 1 summarizes the fitting process of decision trees. In Lines 11 through 14, the locally best split, i.e., the predictor and the splitting point that maximize the node homogeneity after splitting, is identified and used for further splitting the tree into two subnodes. For measuring the homogeneity, an impurity measure i is used, which assigns to each node an estimate of its heterogeneity. For evaluating the strength of a split s partitioning the node t into two child nodes \(t_L\) and \(t_R\), the impurity reduction

$$\begin{aligned} \Delta i(s,t) \ :=\ i(t) - \frac{n_{t_L}}{n_t} i(t_L) - \frac{n_{t_R}}{n_t} i(t_R) \,\ge \, 0 \end{aligned}$$
(1)

for the number of training observations \(n_t\) falling into node t is maximized. For regression purposes, the impurity measure of the mean squared error

$$\begin{aligned} i_\text{Regression}(t) \ :=\ \frac{1}{n_t} \sum _{({\varvec{x}},y) \in {\mathcal{D}}_t} (y-{\hat{y}}_t)^2 \end{aligned}$$
(2)

is used as the impurity measure, where \({\mathcal{D}}_t \subseteq {\mathcal{D}}\) denotes the subset of the training data set falling into node t and \({\hat{y}}_t\) denotes the predicted outcome in node t. For classification or risk estimation, the Gini impurity

$$\begin{aligned} i_\text{Gini}(t) \ :=\ \sum _{c \in {\mathcal{Y}}} \frac{n_{c,t}}{n_t} \left( 1-\frac{n_{c,t}}{n_t} \right) \end{aligned}$$
(3)

is used for classes \(c \in {\mathcal{Y}}\) and their corresponding frequency \(n_{c,t}\) in node t. An alternative popular impurity measure for classification tasks is the Shannon entropy

$$\begin{aligned} i_\text{Entropy}(t) \ :=\ - \sum _{c \in {\mathcal{Y}}} \frac{n_{c,t}}{n_t} \log _2\left( \frac{n_{c,t}}{n_t} \right) \end{aligned}$$

whose corresponding impurity reduction from Eq. (1) is known as the information gain (e.g., Louppe, 2014).
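As an illustrative sketch of these impurity measures and of the impurity reduction in Eq. (1), consider the following Python functions (the function names are ours and chosen for readability):

```python
import numpy as np

def mse_impurity(y):
    """Regression impurity (Eq. (2)): mean squared error around the node mean."""
    return float(np.mean((y - np.mean(y)) ** 2))

def gini_impurity(y):
    """Classification impurity (Eq. (3)): Gini index of the class frequencies."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def entropy_impurity(y):
    """Shannon entropy of the class frequencies in the node."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def impurity_reduction(y_parent, mask_left, impurity=gini_impurity):
    """Impurity reduction Delta i(s, t) of a split s of node t (Eq. (1))."""
    y_left, y_right = y_parent[mask_left], y_parent[~mask_left]
    n, n_left, n_right = len(y_parent), len(y_left), len(y_right)
    return (impurity(y_parent)
            - n_left / n * impurity(y_left)
            - n_right / n * impurity(y_right))
```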

The partitioning of a tree branch locally stops when the training data cannot be divided any further, i.e., if for all \(({\varvec{x}},y), ({\varvec{x}}',y') \in {\mathcal{D}}_t\), it either holds that \({\varvec{x}} = {\varvec{x}}'\) or that \(y = y'\) (see Line 7 of Algorithm 1). Usually, to prevent overfitting, additional stopping criteria are used, such as a minimum node size, i.e., the minimum number of training observations falling into a leaf, or a minimum impurity reduction that has to be achieved in order to split a node. However, these additional stopping criteria yield hyperparameters which, thus, require proper tuning. The final important step is the assignment of a predicted value to each leaf (Line 8 of Algorithm 1), although, theoretically, this predicted value is already needed for evaluating the splits. The prediction values are obtained by empirical risk minimization, yielding the arithmetic mean for regression tasks. For binary risk estimation, the arithmetic mean of the outcome Y in the leaf is likewise used if Y is coded as 0 or 1. If pure classifications are considered, the class minimizing the empirical risk, i.e., the most frequent class in the leaf, is chosen.
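Since the pseudocode of Algorithm 1 is only available as a figure here, the following Python sketch outlines the recursive fitting procedure described above for binary predictors. It reuses the impurity functions from the previous sketch, and the stopping parameters `min_node_size` and `min_impurity_reduction` are our illustrative names for the criteria mentioned in the text.

```python
def fit_tree(X, y, impurity=gini_impurity, min_node_size=5, min_impurity_reduction=0.0):
    """Greedily grow a decision tree on a binary data matrix X (sketch of Algorithm 1)."""
    node = {"prediction": float(np.mean(y))}  # empirical risk minimizer: mean / class proportion
    best = None
    for j in range(X.shape[1]):
        mask_left = X[:, j] == 0  # a binary predictor admits exactly one split
        if mask_left.sum() < min_node_size or (~mask_left).sum() < min_node_size:
            continue
        delta = impurity_reduction(y, mask_left, impurity)
        if best is None or delta > best[1]:
            best = (j, delta, mask_left)
    if best is not None and best[1] > min_impurity_reduction:
        j, _, mask_left = best
        node["split_var"] = j
        node["left"] = fit_tree(X[mask_left], y[mask_left], impurity,
                                min_node_size, min_impurity_reduction)
        node["right"] = fit_tree(X[~mask_left], y[~mask_left], impurity,
                                 min_node_size, min_impurity_reduction)
    return node

def predict_tree(node, x):
    """Pass a single observation down the tree until a leaf is reached."""
    while "split_var" in node:
        node = node["left"] if x[node["split_var"]] == 0 else node["right"]
    return node["prediction"]
```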

A particularly popular and successful extension of decision trees are random forests, which build ensembles of randomized decision trees, yielding an even higher predictive performance at the cost of losing the interpretability of the fitted models (Breiman, 2001). The randomization is performed by employing bagging (Breiman, 1996), which is described in more detail in Sect. 3.9, and by considering random predictor subsets for splitting at each node. Random forests can substantially outperform single decision trees due to the instability of decision trees, i.e., the fact that small, noise-like changes of the training data set can lead to large modifications of the fitted model. This instability is mainly caused by the greedy fashion in which splits are chosen (Li & Belford, 2002; Murthy & Salzberg, 1995).

If deep trees are grown, both single decision trees and random forests can overfit (Hastie et al., 2009; Tang et al., 2018). For certain, not necessarily realistic scenarios (e.g., no subsampling combined with totally randomized trees, in which the splits are chosen independently of the outcome, or too extreme subsampling, in which the subsample size remains constant while the sample size approaches infinity), Tang et al. (2018) proved that random forests with deeply grown trees are inconsistent.

If shallow trees are grown, fruitful splits might be left out. Furthermore, decision trees and random forests struggle to uncover interaction effects if the interacting variables only exhibit negligible marginal effects (Wright et al., 2016). Moreover, since the leaves of conventional decision trees and random forests hold constant prediction values for the finitely many predictor scenarios, continuous functional relationships can only be approximated by step functions. However, for example, in the analysis of genetic and environmental risk factors of certain diseases, in which random forests are frequently used (Winham et al., 2012; Bellinger et al., 2017), a continuous influence of an environmental factor on the disease risk is reasonable.

There are a variety of modifications of decision trees and random forests that try to overcome the issues mentioned above. Each of these methods, however, only addresses individual issues. In the following section, we discuss some of these modifications.

2.2 Extensions of decision trees and random forests

For improving the ability to detect interactions, one well-known approach is the usage of multivariate splits, i.e., splits based on multiple variables at once, e.g., by using linear combinations of the predictors. Exemplary methods of this class are oblique decision trees (Murthy et al., 1994) and oblique random forests (Menze et al., 2011), where a particular implementation of the latter is, e.g., SPORF (Sparse Projection Oblique Randomer Forests; Tomita et al., 2020). For binary predictors as considered in this article, these multivariate linear splits can be used for creating Boolean conjunctions of predictors, thus potentially splitting on an interaction. However, methods that try to linearly separate the current feature space based on the (binary) class label in each splitting node (such as the method proposed by Menze et al., 2011) are only suited to classification tasks. Another recent modification is interaction forests (Hornung & Boulesteix, 2022), which directly search for interaction splits at each node. An overview of such interaction-focused modifications of decision trees and random forests is, e.g., given by Hornung and Boulesteix (2022).

The greedy search algorithm employed in classic decision tree fitting procedures (such as in CART) is fast and scales to high-dimensional problems. However, as the greedy search conducts local searches for splits, it requires detectable marginal effects to identify interaction effects. For example, if \(X_1\) and \(X_2\) interact with each other, \(X_1\) or \(X_2\) have to be individually identified first as splitting variables. Due to increasing computational capabilities, optimal decision trees have been proposed by Nijssen and Fromont (2010) and Bertsimas and Dunn (2017) to perform a global optimization. In the former method, namely DL8 (decision trees from lattices), dynamic programming is utilized to fit decision trees. In the latter method, namely OCT (optimal classification trees), the decision tree fitting problem is phrased as a mixed-integer optimization problem. More recently, alternative optimal decision tree algorithms that utilize dynamic programming such as DL8.5 (Aglin et al., 2020a) and MurTree (Demirović et al., 2022) and optimal decision tree fitting procedures that incorporate multivariate splits such as WODT (Yang et al., 2019) and SVM1-ODT (Zhu et al., 2020) have been proposed. A review of optimal decision tree fitting procedures is, e.g., given by Carrizosa et al. (2021).

Blockeel and De Raedt (1998) proposed combining decision trees with logic programming. Their method is called TILDE (top-down induction of logical decision trees). At each inner node, a Boolean conjunction is responsible for further partitioning the input data. Model fitting is performed in a greedy fashion very similar to that of C4.5 (Quinlan, 1993). However, the space of eligible splits, over which the greedy search is applied, has to be defined by the user by utilizing background knowledge and, e.g., specifying which variables may be part of the same conjunction. Another important difference between TILDE and other decision tree algorithms is that TILDE uses logic programs for specifying data examples. This is in contrast to the statistical learning setup considered in this article: we consider the standard setting, in which data are given in a tabular format and relevant background knowledge about the relationships of certain variables is not available.

Rule extraction methods aim at increasing the interpretability of tree ensemble methods while keeping their predictive strength. They start by fitting a tree ensemble such as a random forest and try to extract the most important prediction rules from the individual decision tree paths. These prediction rules are then gathered in rule lists yielding the final model, in which predictions are made according to which rules hold true. One of the first and most established rule extraction methods is RuleFit (Friedman & Popescu, 2008), which fits a boosted ensemble of decision trees and selects the most important rules using the lasso (Tibshirani, 1996). Alternative rule extraction methods include node harvest (Meinshausen, 2010) and SIRUS (Stable and Interpretable Rule Set, Bénard et al., 2021), which both fit random forests for generating the models from which the rules are to be extracted.

For fitting regression models in the leaves, typically, GLMs (generalized linear models) are employed, such as in MOB (model-based recursive partitioning, Zeileis et al., 2008). An overview of several GLM-based approaches is, e.g., given by Rusch and Zeileis (2013). However, the right parametric model might not be known prior to fitting, so that a more flexible non-linear regression model might be preferable. Moreover, these methods do not focus on properly handling interactions between the splitting variables.

2.3 Logic regression

Logic regression (Ruczinski et al., 2003) is another tree-based supervised learning method. It has been specifically developed for analyzing SNP data and is, therefore, frequently used in such analyses (e.g., Ruczinski et al., 2004; Zhi et al., 2015). Logic regression focuses on binary predictors and tries to identify Boolean combinations of the predictors that explain the variation in the outcome. These Boolean expressions can also be represented as logic trees, i.e., trees holding predictors (or their negations) in their leaves and recursively combining them with the Boolean AND-operator (denoted by \(\wedge\) in the following) or the Boolean OR-operator (denoted by \(\vee\) in the following) in their inner nodes. Figure 1b illustrates an exemplary logic tree corresponding to the Boolean expression \(X_1 \vee (X_2 \wedge X_3^c)\). If a true logic tree is identified with class 1 and a false logic tree is identified with class 0, this tree is equivalent to the classification tree from Fig. 1a.

To generalize the usage of logic regression to regression purposes, logic trees are embedded in GLMs, i.e., a model of the form

$$\begin{aligned} g(\mathbb{E}[Y \mid {\varvec{X}} = {\varvec{x}}]) = \beta _0 + \beta _1 L_1({\varvec{x}}) + \cdots + \beta _m L_m({\varvec{x}}) \end{aligned}$$

is considered for a link function g and logic trees \(L_1, \ldots , L_m\). In general, every possible logic regression model can be transformed into an equivalent decision tree, and vice versa (Ruczinski et al., 2003). However, logic trees tend to be more sparse, i.e., by using Boolean logic, logic trees can describe the same prediction model with fewer nodes than decision trees in certain scenarios. For example, even in the simple prediction model depicted in Fig. 1, the logic tree consists of five nodes, whereas seven nodes are required in the CART tree to represent the Boolean expression. Note that this tree sparsity property holds true for binary classification scenarios in which a hard classification task instead of a more general class probability estimation task is considered.

The fitting procedure in logic regression is performed by a global stochastic search over all possible models, i.e., logic trees \(L_1, \ldots , L_m\) and their GLM coefficients \(\beta _0, \ldots , \beta _m\), where these GLM coefficients are determined by fitting a GLM using the considered logic trees as predictors in each step of the global stochastic search. In particular, simulated annealing (Kirkpatrick et al., 1983) is employed using simple modifications of the current model/state, i.e., adding or removing branches, exchanging variables or operators, and splitting or removing variables. Alternatively, a greedy local search always moving to the best neighbor state can be employed. However, this faster search comes without any guarantees of finding a globally optimal state. For evaluating the current state, a score function such as the mean squared error for linear regression or the deviance for logistic regression is used. For a detailed description and discussion of logic regression, see Ruczinski et al. (2003).

Single logic regression models tend to be unstable, if the signal is weak or if many predictors are actually predictive. One approach to tackle this problem is to apply bagging to logic regression models (Schwender & Ickstadt, 2007). However, similar to random forests, these models are no longer easily interpretable.

Even single logic regression models can be hard to interpret due to possibly complex logic tree structures. Typically, one is interested in statistical interactions of predictors, which can be defined via the effect of the simultaneous presence of certain predictor settings, i.e., via Boolean conjunctions, since conjunctions of input variables directly reveal the specific type of interaction that is considered (Chen et al., 2011). By De Morgan's laws, a Boolean disjunction can be represented as the negation of the conjunction of the negated input terms, making disjunctions obsolete as long as all negations are available.
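For illustration, by De Morgan's laws the disjunction of two terms can, e.g., be rewritten purely in terms of a conjunction and negations:

$$\begin{aligned} X_1 \vee X_2 \ =\ \left( X_1^c \wedge X_2^c\right) ^c . \end{aligned}$$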

Logic regression can only take quantitative covariables additively into account by adding them to the linear predictor of the GLM containing the logic trees as single terms. Thus, no interactions between the binary predictors and quantitative predictors can be included. Similarly, interactions between the logic trees themselves cannot be captured either, so that the model relies on the additive structure of the individual terms. If, for example, the scale of an underlying linear predictor is unknown, being able to also model interactions between the terms can be beneficial. Consider, e.g., the regression function

$$\begin{aligned} \mathbb{E}[Y \mid {\varvec{X}}]&\ =\ \left[ \alpha \cdot \mathbbm{1}(X_1) + \beta \cdot \mathbbm{1}(X_2 \wedge X_3^c)\right] ^2 \\&\ =\ \alpha ^2 \cdot \mathbbm{1}(X_1) + 2 \cdot \alpha \cdot \beta \cdot \mathbbm{1}(X_1 \wedge X_2 \wedge X_3^c) + \beta ^2 \cdot \mathbbm{1}(X_2 \wedge X_3^c). \end{aligned}$$

On the squared scale, the terms \(X_1\) and \(X_2 \wedge X_3^c\) do not interact. However, on the original scale, if both terms are true at once, the linear predictor is adjusted by an additional \(2 \alpha \beta\).

3 Logic decision trees

To overcome the issues mentioned in the last section, we propose a novel method, called logicDT (logic decision trees), which combines decision trees and an improved version of the Boolean term search of logic regression.

We define logic decision trees to be decision trees that can use Boolean conjunctions of input variables as splitting variables, which is in contrast to standard decision tree procedures. Logic decision trees may be used for regression purposes, in which—similar to regression trees—each leaf holds a direct estimate of the outcome, or for classification purposes, in which—similar to probability estimation trees (Provost & Domingos, 2003; Malley et al., 2012)—each leaf holds an estimate of the class membership probability. As discussed in Sect. 3.5, logic decision trees may also contain regression models in their leaves for modeling continuous relationships.

Allowing Boolean conjunctions of input variables as splitting variables, firstly, simplifies the resulting decision tree. If we, e.g., consider an outcome that is only altered if \(X_1^c \wedge X_2\) holds, then creating a tree stump (i.e., a decision tree consisting of only one split) splitting on \(X_1^c \wedge X_2\) would be sufficient when using logicDT, whereas a common decision tree only using single input variables for splitting would require a split on \(X_1\) and another split on \(X_2\) in the branch in which \(X_1 = 0\) holds (see Fig. 2).

Fig. 2: Decision trees for splitting on \(X_1^c \wedge X_2\). In a, a standard decision tree splitting on single input variables is shown. In b, a Boolean conjunction is used for splitting

Secondly, this makes the prediction values in some leaves more robust. In our example, the common decision tree in Fig. 2a would further distinguish between \(X_1 = 1\) and \(X_1^c \wedge X_2^c = 1\), while the tree in Fig. 2b uses one shared prediction, thus utilizing more observations for creating the prediction value. Thirdly, due to the greedy search employed in standard decision tree splitting approaches, the interaction might not be found at all if, in our example, \(X_1\) or \(X_2\) only exhibit negligible marginal effects, leading to splits on other variables or to no split at all if a stopping criterion is triggered.

In the following subsections, logicDT is presented in detail.

3.1 Preliminaries

Let \({\varvec{X}} = (X_1, \ldots , X_p)\) be a p-dimensional random vector of binary input variables taking values in the p-dimensional space \({\mathcal{X}} = \lbrace 0,1 \rbrace ^p\) and let Y be a target random variable taking values in the space \({\mathcal{Y}}\). Let \({\mathcal{D}} = \lbrace ({\varvec{x}}_1, y_1), \ldots , ({\varvec{x}}_n, y_n) \rbrace\) be a training data set with independent and identically distributed observations from the joint probability distribution of \(({\varvec{X}}, Y)\). Then the corresponding statistical learning task can be formulated as estimating the true regressor \(\mathbb{E}_{({\varvec{X}},Y)}[Y \mid {\varvec{X}} = \cdot \, ]\) by a function \(\varphi : {\mathcal{X}} \rightarrow {\mathcal{Y}}\) using the training data set \({\mathcal{D}}\) (e.g., Hastie et al., 2009).

In this article, Boolean conjunctions between binary input variables are denoted using the Boolean \(\wedge\) (AND) and negations of binary input variables are denoted using a superscript \(^c\) (complement), i.e., \(X_j^c = 1-X_j\).

logicDT is aimed at identifying response-associated interactions, where two input variables \(X_i\) and \(X_j\) are defined to interact with each other with respect to the outcome Y, if the effect of one input variable (i.e., the partial derivative/finite differences of \(\mathbb{E}[Y \mid {\varvec{X}}]\) with respect to one input variable) depends on the other input variable (Sorokina et al., 2008). Therefore, if there is no interaction between \(X_i\) and \(X_j\), the regression function \(\mu ({\varvec{X}}) = \mathbb{E}[Y \mid {\varvec{X}}]\) can be decomposed into a sum \(\mu ({\varvec{X}}) = \mu _{\setminus i}({\varvec{X}}_{\setminus i}) + \mu _{\setminus j}({\varvec{X}}_{\setminus j})\), where \(_{\setminus i}\) denotes leaving out the ith entry of the vector of input variables (Friedman & Popescu, 2008). This definition can be directly generalized to (statistical) interactions of arbitrary order. If there is no interaction between \(X_{(1)}, \ldots , X_{(k)}\), \(\mu\) can be decomposed into a sum of functions, in which no summand is a function of all considered variables \(X_{(1)}, \ldots , X_{(k)}\) simultaneously.
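As a small illustrative example of this definition (our example), consider two binary input variables and the regression function

$$\begin{aligned} \mu ({\varvec{X}}) \ =\ \beta _0 + \beta _1 \cdot \mathbbm{1}(X_1) + \beta _2 \cdot \mathbbm{1}(X_2) + \beta _3 \cdot \mathbbm{1}(X_1 \wedge X_2). \end{aligned}$$

If \(\beta _3 = 0\), \(\mu\) decomposes into a sum of a function of \(X_1\) alone and a function of \(X_2\) alone, so that \(X_1\) and \(X_2\) do not interact with respect to Y. If \(\beta _3 \ne 0\), no such decomposition exists and the two variables interact.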

In this article, we mainly focus on binary input variables. In this case, every function \(\varphi : {\mathcal{X}} \rightarrow {\mathcal{Y}}\) mapping from the p-dimensional space of binary input variables to a real number can be expressed as a sum of the form

$$\begin{aligned} \varphi ({\varvec{X}}) = \beta _0 + \sum _{j=1}^m \beta _j \cdot \mathbbm{1}\left( X_{k_{j,1}}^{(c)} \wedge \cdots \wedge X_{k_{j,p_j}}^{(c)}\right) , \end{aligned}$$

where \(^{(c)}\) denotes potentially negating the considered variable and \(k_{j,i}\) is the index of the ith variable in the jth summand. Hence, binary input variables \(X_{(1)}, \ldots , X_{(k)}\) interact with each other (with respect to Y), if \(\mu\) cannot be decomposed without using a Boolean conjunction that simultaneously includes \(X_{(1)}, \ldots , X_{(k)}\). Boolean disjunctions are not considered in logicDT, since, by De Morgan’s laws, Boolean disjunctions can be expressed using Boolean conjunctions and negations.

3.2 Core methodology of logicDT

The aim of logicDT is to identify important input variables and Boolean conjunctions of input variables to perform accurate predictions of the outcome. An input variable or a Boolean conjunction of input variables will be in the following referred to as a term. A set of terms will be referred to as a state. Examples of possible states would be

$$\begin{aligned} \lbrace \lbrace X_{73} \rbrace \rbrace \quad \text{or} \quad \lbrace \lbrace X_1^c \wedge X_2 \rbrace , \lbrace X_5 \rbrace , \lbrace X_9 \wedge X_{14}^c \wedge X_{42}^c \rbrace \rbrace . \end{aligned}$$

In logicDT, states are obtained by a global stochastic search procedure that is introduced later in this section.

Logic decision trees are induced by identifying a state and exclusively using the terms contained in this state as input variables for fitting a conventional decision tree. For example, the three terms \(X_1^c \wedge X_2\), \(X_5\), and \(X_9 \wedge X_{14}^c \wedge X_{42}^c\) are used as input variables to induce a decision tree, if the corresponding state \(\lbrace \lbrace X_1^c \wedge X_2 \rbrace , \lbrace X_5 \rbrace , \lbrace X_9 \wedge X_{14}^c \wedge X_{42}^c \rbrace \rbrace\) is considered. Hence, creating a logic decision tree based upon a state is a two-stage procedure. First, the original training data set is transformed into a tree training data set using the terms of the considered state. Next, using this tree training data set, a decision tree is fitted.

For a set consisting of m terms

$$\begin{aligned} \left\{ \left\{ X_{k_{1,1}}^{(c)}, \ldots , X_{k_{1,p_1}}^{(c)} \right\} , \ldots , \left\{ X_{k_{m,1}}^{(c)}, \ldots , X_{k_{m,p_m}}^{(c)} \right\} \right\} , \end{aligned}$$

the original training data set is transformed into a tree training data set by constructing an \(n \times (m+1)\) data matrix containing the m different predictors or conjunctions and the outcome. For example, if a training data set is given (the exemplary data table is shown as an image in the original) and the state \(s = \lbrace \lbrace X_1 \rbrace , \lbrace X_2 \wedge X_3^c \rbrace \rbrace\) is identified by the global stochastic search, the tree training data set, which is directly used for fitting the decision tree, consists of the two term columns \(X_1\) and \(\mathbbm{1}(X_2 \wedge X_3^c)\) together with the outcome Y (shown as an image in the original).

(4)

Since each term is a binary variable itself, there is only one possible split of the data based on this term. Thus, the tree fitting procedure only needs to consider one split per input term, which makes the identification of the best local split particularly fast. For evaluating potential node splits and selecting the split, the conventional node impurity splitting criterion from Eq. (1) is used. For regression tasks, the MSE (mean squared error) impurity (see Eq. (2)) is used, and for classification tasks, the Gini impurity (see Eq. (3)) is used.
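To make this transformation concrete, the following Python sketch (our own encoding, in which a term is represented as a list of pairs of a 0-based variable index and a negation flag) evaluates the conjunctions of a state on a binary data matrix and assembles the tree training data set:

```python
import numpy as np

def evaluate_term(X, term):
    """Evaluate a conjunction term such as [(1, False), (2, True)] for X_2 AND X_3^c
    (0-based column indices) on a binary data matrix X."""
    columns = [(1 - X[:, j]) if negated else X[:, j] for j, negated in term]
    return np.prod(np.column_stack(columns), axis=1)

def tree_training_data(X, y, state):
    """Build the n x (m+1) tree training data set for a state, i.e., a list of terms."""
    Z = np.column_stack([evaluate_term(X, term) for term in state])
    return Z, y

# State corresponding to {{X_1}, {X_2 AND X_3^c}} from the example above:
example_state = [[(0, False)], [(1, False), (2, True)]]
```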

After the tree corresponding to the current state has been fitted, its performance on the training data is evaluated by passing all observations through the tree and calculating a score that measures the training data error, where the score is chosen so that a smaller value of the score corresponds to a better fit. For regression purposes, the MSE is employed. For risk estimation/classification purposes, probability estimation trees (Provost & Domingos, 2003; Malley et al., 2012) are grown that directly hold class probability estimates in their leaves by using empirical probabilities, i.e., using proportions of class occurrences. Thus, for scoring a state in the risk estimation/classification setting, the deviance is used, which is also known as the cross entropy or the negative binomial log-likelihood.

Alternatively, the negative area under the curve with respect to the receiver operating characteristic (AUC) might be used. However, the AUC does not capture the magnitude of the risk estimate in contrast to the deviance. Another alternative is the Brier score, which is the mean squared error between the risk estimate and the actual outcome.

For identifying an ideal state, logicDT performs a global search over all eligible states. The search is performed by using the current state to construct a decision tree, evaluating the performance of this tree, modifying the current state, and repeating this procedure. Modifications of logicDT states are called neighbors and are implicitly defined by slightly altering a given state. Figure 3 illustrates the possible state modifications/neighbor states using exemplary states. In the center of this figure, the current state is depicted. The possible state changes include

  • exchanging or negating single variables (see, e.g., the replacement of \(X_2\) by \(X_4\) in the top and the negation of \(X_2\) in the bottom of Fig. 3),

  • adding or removing single variables from a term (see, e.g., the addition of \(X_8\) in the top right and the removal of \(X_3^c\) in the bottom right of Fig. 3),

  • adding or removing logic terms consisting of exactly one variable (see, e.g., the addition of \(X_{10}\) in the top left and the removal of \(X_2\) in the bottom left of Fig. 3).

Fig. 3: Exemplary state modifications of the reference state \(\lbrace \lbrace X_1, X_3^c, X_5 \rbrace , \lbrace X_2 \rbrace \rbrace\) depicted in the center

To avoid tautologies and uninformative terms, some specific alterations are prohibited. More precisely, the same variable should not occur more than once in a single term and the same term should not occur more than once in the proposed state.
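A hedged Python sketch of one randomly drawn state modification, using the term encoding from the previous sketch (the parameters `p`, `max_vars`, and `max_conj` mirror the hyperparameters discussed in Sects. 3.4 and 3.6; the check against duplicate terms is omitted for brevity):

```python
import random

def random_neighbor(state, p, max_vars=10, max_conj=3):
    """Randomly propose a neighbor state via one of the modifications from Sect. 3.2.

    A state is a list of terms; a term is a list of (variable index, negated) pairs.
    """
    state = [list(term) for term in state]            # work on a copy
    n_vars = sum(len(term) for term in state)
    moves = ["exchange", "negate"]
    if n_vars < max_vars:
        moves.append("add_var")
    if any(len(term) > 1 for term in state):
        moves.append("remove_var")
    if len(state) < max_conj and n_vars < max_vars:
        moves.append("add_term")
    if len(state) > 1 and any(len(term) == 1 for term in state):
        moves.append("remove_term")
    move = random.choice(moves)
    if move in ("exchange", "negate", "add_var"):
        i = random.randrange(len(state))
        used = {var for var, _ in state[i]}
        if move == "negate":
            j = random.randrange(len(state[i]))
            var, neg = state[i][j]
            state[i][j] = (var, not neg)
        elif move == "exchange":
            j = random.randrange(len(state[i]))
            new_var = random.choice([v for v in range(p) if v not in used])
            state[i][j] = (new_var, random.random() < 0.5)
        else:  # add_var: extend the term by one additional variable
            new_var = random.choice([v for v in range(p) if v not in used])
            state[i].append((new_var, random.random() < 0.5))
    elif move == "remove_var":
        i = random.choice([k for k, t in enumerate(state) if len(t) > 1])
        state[i].pop(random.randrange(len(state[i])))
    elif move == "add_term":
        state.append([(random.randrange(p), random.random() < 0.5)])
    else:  # remove_term: only terms consisting of exactly one variable may be removed
        i = random.choice([k for k, t in enumerate(state) if len(t) == 1])
        state.pop(i)
    return state
```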

The search is initialized by finding the single input variable that minimizes the score function, e.g., \(\lbrace \lbrace X_{73} \rbrace \rbrace\). Using this initial state, a global optimization procedure employing simulated annealing (Kirkpatrick et al., 1983) is carried out for finding the state that minimizes the score function, i.e., now permitting all possible states potentially consisting of more than one term.

Simulated annealing is a stochastic optimization algorithm that, given a current state, randomly selects one of its neighbor states, evaluates its score, and uses the score difference between these two states for determining the probability of transitioning to the proposed neighbor state. For a state s and a proposed neighbor state \(s'\), the score function \(\epsilon\), and the current temperature t, this state acceptance probability is given by

$$\begin{aligned} \gamma (\epsilon (s), \epsilon (s'), t) \ :=\ \min \left\{ 1, \exp \left( \frac{\epsilon (s)-\epsilon (s')}{t} \right) \right\} . \end{aligned}$$
(5)

Thus, if a state with a better score is proposed, the transition is carried out with probability 1. However, worse states may also be accepted with the acceptance probability \(\in (0,1)\) to avoid getting stuck in local minima. The main idea of simulated annealing is slowly lowering the temperature t such that the acceptance probability of worse states tends to 0 and in the end, the globally optimal state is identified.
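The acceptance rule of Eq. (5) translates into a few lines of Python (a minimal sketch, assuming that smaller scores are better):

```python
import math
import random

def accept(score_current, score_proposed, temperature):
    """Acceptance probability of Eq. (5) for moving from the current to the proposed state."""
    if score_proposed <= score_current:
        return True  # better (or equally good) states are always accepted
    prob = math.exp((score_current - score_proposed) / temperature)
    return random.random() < prob
```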

In logicDT, a fully automatic simulated annealing schedule governing the temperature lowering is employed. If desired, the cooling schedule can be adjusted, e.g., by decreasing or increasing the parameter \(\lambda\), which controls the magnitude of the temperature decreases, for performing a finer or coarser stochastic search. The number of search iterations is, thus, (implicitly) controlled by \(\lambda\) and by the stopping criteria for terminating the search procedure. Alternatively, a fixed geometric cooling schedule can also be employed in logicDT. However, we recommend using the adaptive cooling schedule for fitting logicDT models. More details on the simulated-annealing-based search in logicDT are given in Appendix 1.

The proposed state modifications ensure that the modifications lead to a Markov chain that fulfills aperiodicity and irreducibility when performing a global search via simulated annealing. These properties ensure that simulated annealing asymptotically leads with probability 1 to a globally optimal state (Van Laarhoven & Aarts, 1987). More details on these Markov properties are given in Appendix 2.

3.3 The logicDT algorithm

In Algorithm 2, the logicDT procedure is presented.

Algorithm 2: The logicDT procedure (pseudocode shown as an image in the original)

In Line 2, the initial state is obtained by choosing the single input variable that minimizes the score. That is, for each input variable, a decision tree using only this input variable, i.e., a decision tree stump, is fitted and evaluated. The input variable \(X_j\) that leads to the minimum score is chosen as the initial state \(\lbrace \lbrace X_j \rbrace \rbrace\). Alternatively, a random state or an empty state could also be used as the initial state.

In Lines 3 and 8, the current state is used for transforming the original training data set \({\mathcal{D}}\) into a tree training data set that can be directly used by a learning procedure using the identified terms as input variables. See Eq. (4) for an example on how a tree training data set is obtained from the original data set consisting of the values of the input variables.

If no leaf regression models for continuous covariables are to be fitted, the decision trees are constructed using Algorithm 1 (see Lines 4 and 9 of Algorithm 2). If leaf regression models are to be fitted (see Sect. 3.5 for more details), the splitting criterion from Sect. 3.5.2 is used in place of the impurity reduction criterion and the corresponding regression models are fitted in each leaf instead of single prediction values.

In Lines 5 and 10 of Algorithm 2, the training data score is calculated by passing all training observations through the fitted decision tree, performing predictions using leaf regression models if these were fitted, and comparing the predictions with the true outcomes.

In Line 7, the current state is modified by randomly performing one of the state modifications proposed in Sect. 3.2, where the state modification is randomly drawn from a uniform distribution over all possible state modifications of the current state.

This proposed modified state is then evaluated in Line 11, i.e., it is randomly accepted with the acceptance probability from Eq. (5).

The global search is carried out until a stopping criterion is met. More details on the search algorithm itself are discussed in Appendix 1.
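Putting the previous sketches together, the overall search loop of Algorithm 2 can be outlined as follows (an illustrative sketch only: it reuses `mse_impurity`, `fit_tree`, `predict_tree`, `tree_training_data`, `random_neighbor`, and `accept` from above and employs a simple geometric cooling schedule, whereas logicDT uses an adaptive schedule by default):

```python
import numpy as np

def tree_score(tree, Z, y):
    """Training mean squared error of a fitted tree on the tree training data."""
    preds = np.array([predict_tree(tree, z) for z in Z])
    return float(np.mean((y - preds) ** 2))

def logic_dt_search(X, y, n_iter=5000, t0=1.0, cooling=0.999):
    """Simplified global stochastic search over states (cf. Algorithm 2)."""
    p = X.shape[1]

    def evaluate(state):
        Z, y_tree = tree_training_data(X, y, state)
        return tree_score(fit_tree(Z, y_tree, impurity=mse_impurity), Z, y_tree)

    # Initialization: the best state consisting of a single input variable
    state = min(([[(j, False)]] for j in range(p)), key=evaluate)
    current_score = evaluate(state)
    best_state, best_score = state, current_score
    temperature = t0
    for _ in range(n_iter):
        proposal = random_neighbor(state, p)
        proposal_score = evaluate(proposal)
        if accept(current_score, proposal_score, temperature):
            state, current_score = proposal, proposal_score
            if current_score < best_score:
                best_state, best_score = state, current_score
        temperature *= cooling  # simple geometric cooling for illustration
    Z, y_tree = tree_training_data(X, y, best_state)
    return best_state, fit_tree(Z, y_tree, impurity=mse_impurity)
```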

logicDT is implemented in the R package logicDT (Lau, 2023) available on CRAN.

3.4 Controlling the complexity of logicDT models

For restricting the complexity of logicDT models and regularizing them, the maximum number \(\texttt{max\_conj}\) of terms and the total maximum number \(\texttt{max\_vars}\) of variables contained in a state should in practice be properly tuned to avoid overfitting or underfitting. Since some (potentially very long) conjunctions might correspond to no or very few observations, similar to the stopping criterion in decision trees, a minimum conjunction size, defining the minimum number of observations falling into this conjunction and its negation, can be specified in logicDT to exclude practically useless terms. Furthermore, one may prohibit the removal (and the addition) of whole terms in order to guarantee a certain number of terms. This might, e.g., be useful if a pure variable selection should be performed so that the maximum number of total variables is set to the maximum number of terms. In this case, the initial state should be chosen such that it already includes the desired number of terms.

logicDT aims to identify the optimal set of predictors and conjunctions with regard to the predictive ability. Thus, post-pruning of the fitted decision trees is not necessary, since the model complexity is already covered by the model size hyperparameters and the ideal splitting terms are already identified by the global search, which is similar to logic regression and in contrast to standard decision trees. However, the following two stopping criteria for locally terminating the splitting of a branch are used to filter out completely unnecessary splits.

One of the stopping criteria is the minimum number of observations in the respective leaves. If a split would lead to child nodes from which at least one of the children contains less than the prespecified number of observations, this split is prohibited. This criterion is particularly useful for regression and risk estimation purposes, where a stable estimate needs a certain amount of observations.

As a second stopping criterion, the minimum (scaled) impurity reduction is considered. A split is discarded if it does not reach the required impurity reduction, i.e., if

$$\begin{aligned} \frac{n_t}{n} \cdot \Delta i(s,t) \,\le \, cp , \end{aligned}$$

holds for the impurity reduction \(\Delta i(s,t)\) defined in Eq. (1) and the complexity parameter \(cp \ge 0\). For continuous outcomes, \(cp\) is scaled by the empirical variance \(s_Y^2\) of the outcome Y to ensure the right scaling, i.e., \(cp \leftarrow cp \cdot s_Y^2\). Since the impurity measure for continuous outcomes is the mean squared error, this can be interpreted as controlling the minimum reduction of the normalized mean squared error (i.e., the squared NRMSE, the normalized root mean squared error).

The hyperparameter optimization in logicDT is discussed in more detail in Sect. 3.6.

3.5 Quantitative covariables

Decision trees are particularly suitable models for binary input data, since there is only a finite number of possible predictor scenarios in this case, i.e., every possible prediction function (including the true regression function \(\mathbb{E}[Y \mid {\varvec{X}}]\)) can be expressed using a decision tree. Quantitative predictors, in contrast, often induce a continuous relationship to the outcome that cannot be properly expressed with piecewise constant functions such as decision trees or random forests. In standard decision-tree-based methods, continuous variables are included as possible splitting candidates in the decision tree fitting process. This approach is very intuitive, as it simply considers all available data. However, as mentioned above, it cannot properly capture continuous relationships.

3.5.1 Leaf regression models

For properly including quantitative covariables in logicDT models, we propose, similar to MOB (model-based recursive partitioning, Zeileis et al., 2008), to fit regression models in the leaves that result from splits exclusively using the binary terms. This approach allows fitting individual curves for each binary term setting, thus also covering interactions between the binary predictors and the quantitative covariable.

In principle, any kind of regression model such as linear or non-linear regression models could be fitted in the leaves depending on the application. Moreover, multiple regression models could also be fitted, if multiple covariables need to be considered.

For properly evaluating logicDT states, regression models need to be fitted in each decision tree and used to generate the training data predictions for computing the score, i.e., the regression models should be fitted in each iteration of the search procedure of logicDT. If, however, the computational burden of, e.g., fitting non-linear regression models in each leaf of each decision tree is too high, we recommend using linear models during the search and non-linear regression models for the final fit. In this case, the functional relationship is still taken into account in the search process and the final model utilizes the desired type of regression model. For fast model fitting with a binary outcome, logistic regression curves might be fitted through LDA (linear discriminant analysis), which has a closed-form solution (Hastie et al., 2009) and, therefore, does not require an iterative optimization procedure such as standard logistic regression.

3.5.2 Splitting criterion

If regression models should be fitted in each leaf, functional trends have to be analyzed instead of simple leaf means. Therefore, we propose evaluating splits based on a likelihood-ratio test for comparing nested models as an alternative to the conventional node impurity splitting criterion specified in Eq. (1). More precisely, linear regression or LDA models, which can be determined particularly quickly, are fitted for each eligible split and resulting child node. Since we consider simple regression models, each model consists of two parameters (offset and slope) such that the difference in parameters of two submodels versus one joint model is given by \(2 \cdot 2 - 2 = 2\). Thus, the likelihood-ratio test statistic

$$\begin{aligned} -2 \log (\Lambda ) := -2 \log \left( \frac{L_\text{reduced}}{L_\text{full}} \right) \end{aligned}$$
(6)

is—under the null hypothesis of equal model parameters in both subnodes—asymptotically \(\chi ^2\)-distributed with 2 degrees of freedom following Wilks’ theorem (Wilks, 1938). Here, \(L_\text{reduced}\) denotes the maximized likelihood of the reduced model (i.e., the fitted joint regression model using one node) and \(L_\text{full}\) denotes the maximized likelihood of the full model (i.e., the model consisting of two individually fitted sub-regression models resulting in two nodes).

With the test statistic from Eq. (6), we, hence, test

$$\begin{aligned}&H_0:\ \mathbb{E}[Y \mid {\varvec{X}}_{(t)} = {\varvec{x}}_{(t)}, X_s, E] = \mathbb{E}[Y \mid {\varvec{X}}_{(t)} = {\varvec{x}}_{(t)}, E] \hspace{1.5em} \\ \text{vs.} \hspace{1.5em}&H_1:\ \mathbb{E}[Y \mid {\varvec{X}}_{(t)} = {\varvec{x}}_{(t)}, X_s, E] \ne \mathbb{E}[Y \mid {\varvec{X}}_{(t)} = {\varvec{x}}_{(t)}, E], \end{aligned}$$

where t is the node that is to be split, \({\varvec{X}}_{(t)}\) is the subvector of input variables that are used as splitting variables in ancestor nodes of t, \({\varvec{x}}_{(t)}\) is the corresponding binary vector containing the predictor setting at node t, \(X_s\) is the binary predictor that shall be evaluated for splitting the node, and E is (are) the continuous covariable(s). With this likelihood-ratio test, we, thus, test whether the split on \(X_s\) leads to different prediction models in the current tree branch. For example, for one continuous covariable, the model

$$\begin{aligned} g(\mathbb{E}[Y \mid {\varvec{X}}_{(t)} = {\varvec{x}}_{(t)}, X_s, E]) = \beta _0 + \beta _1 \cdot E + \gamma _0 \cdot \mathbbm{1}(X_s) + \gamma _1 \cdot \mathbbm{1}(X_s) \cdot E \end{aligned}$$

is used for testing the null hypothesis \(H_0:\ \gamma _0 = \gamma _1 = 0\), which is equivalent to the above null hypothesis, using the identity as link function g for a continuous outcome and the \(\text{logit}\) function as link function g for a binary outcome.

Using this new splitting criterion, likelihood-ratio tests for all eligible splits at a certain node are performed to appropriately rank the eligible splits and to quantify the strength of a split in an interpretable way. The split that achieves the lowest p-value is used if this p-value is below a prespecified significance threshold such as \(\alpha = 50\%\). Here, we propose to use a very liberal (high) threshold to avoid missing fruitful splits. If no split can provide such a p-value, the node in question is declared a terminal node, so that this splitting criterion also acts as a stopping criterion.
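A sketch of this likelihood-ratio splitting criterion for a continuous outcome and one covariable E is given below (our own implementation using numpy and scipy; `e`, `x_split`, and `y` contain the covariable, the candidate binary split term, and the outcome for the observations in the node to be split; the reduced model regresses Y on E only, while the full model additionally includes the split indicator and its interaction with E, as in the model equation above):

```python
import numpy as np
from scipy.stats import chi2

def gaussian_loglik(y, y_hat):
    """Maximized Gaussian log-likelihood with the variance profiled out."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)

def lrt_split_pvalue(e, x_split, y):
    """p-value of the likelihood-ratio test from Eq. (6) for one candidate split."""
    ones = np.ones_like(e)
    X_reduced = np.column_stack([ones, e])                      # beta_0 + beta_1 * E
    X_full = np.column_stack([ones, e, x_split, x_split * e])   # + gamma_0 * 1(X_s) + gamma_1 * 1(X_s) * E
    loglik = []
    for design in (X_reduced, X_full):
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        loglik.append(gaussian_loglik(y, design @ coef))
    stat = -2 * (loglik[0] - loglik[1])   # asymptotically chi^2 with 2 df under H_0
    return chi2.sf(stat, df=2)
```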

Figure 4 illustrates an exemplary logicDT model with two terms and three variables in total. The current set of terms on the left induces the decision tree on the right by fitting a decision tree using the terms as potential splitting variables. The quantitative covariable E is used for evaluating the splits in likelihood-ratio tests and for fitting the regression models in the leaves. Therefore, in the root node, the terms \(\text{SNP3D}^c \wedge \text{SNP2D}\) and \(\text{SNP1D}\) are both evaluated as splitting candidates by fitting regression models using E as the predictor. Since \(\text{SNP3D}^c \wedge \text{SNP2D}\) yields a lower p-value than \(\text{SNP1D}\) in the likelihood-ratio test splitting criterion, the term \(\text{SNP3D}^c \wedge \text{SNP2D}\) is used for splitting the root node. The fitted tree is then evaluated as a whole using a score function (see Sect. 3.2). Afterwards, the state is slightly modified using the modifications proposed in Sect. 3.2 and the procedure is repeated.

Fig. 4: An exemplary logicDT model/state. On the left hand side, the set of terms is depicted with an additional quantitative covariable, which is excluded from the search over the set of terms. On the right hand side, the resulting decision tree is shown, which uses the binary predictors and identified conjunctions as input/splitting variables. Since in this case a quantitative variable is also supplied, the leaves hold continuous functions instead of single point estimates

3.6 Hyperparameter optimization

For maximizing the performance of logicDT, it is necessary to optimize the model complexity parameters that act as regularization parameters. These parameters are

  • \(\texttt{max\_vars}\)—the total maximum number of variables contained in the model,

  • \(\texttt{max\_conj}\)—the maximum number of conjunctions/terms in the model,

  • \(\texttt{nodesize}\)—the minimum number of observations per leaf in the resulting decision tree,

  • \(\texttt{conjsize}\)—the minimum number of observations contained in a conjunction and its negation.

In general, \(\texttt{max\_vars} \ge \texttt{max\_conj}\) has to be fulfilled. Furthermore, we recommend imposing \(\texttt{max\_vars} \le 2 \cdot \texttt{max\_conj}\) in cases in which marginal effects still seem to be dominant and it is not justifiable that only high-order interaction terms compose the main influence on the outcome. This restriction is useful due to the standard learning issue that more complex models usually fit the training data better. Moreover, it reduces the set of eligible hyperparameter configurations to be evaluated, speeding up the hyperparameter tuning process.

Specifically for fitting single logicDT models (via simulated annealing), it is advisable to disable the removal of whole conjunctions in the search procedure. This ensures that the final model consists of exactly \(\texttt{max\_conj}\) terms and that no excessively complex conjunctions make up the model. This also allows for a simple variable selection of marginal effects by additionally imposing \(\texttt{max\_vars} = \texttt{max\_conj}\).

The purpose of \(\texttt{nodesize}\) is to ensure that each leaf contains enough observations for obtaining meaningful leaf models, i.e., stable means or, if a continuous covariable is included, stable regression models. A proper value for \(\texttt{conjsize}\) avoids evaluating models with uninformative conjunctions, i.e., conjunctions for which a split does not yield meaningful information due to a low number of observations. Note that for the observed values, it holds that \(\texttt{nodesize}_\text{obs} \le \texttt{conjsize}_\text{obs}\), since the decision tree can further split the space. Thus, in practice, \(\texttt{nodesize}\) and \(\texttt{conjsize}\) can be set to the same value. Similar to Malley et al. (2012), who considered probability estimation trees, we recommend a value between 1% and 10% of the total number of training observations for obtaining stable leaf estimates.

Using these parameter restrictions, a grid search evaluating all possible parameter combinations is then carried out (based on validation data) in order to identify the best setting. In Sect. 5, hyperparameter optimization following this scheme is performed.
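As an illustration, a constrained hyperparameter grid respecting the recommendations above could be constructed as follows (the candidate values are purely illustrative; `n_train` denotes the number of training observations):

```python
from itertools import product

def hyperparameter_grid(n_train, max_conj_values=(1, 2, 3), max_vars_values=(1, 2, 3, 4, 5, 6)):
    """Enumerate settings satisfying max_conj <= max_vars <= 2 * max_conj."""
    node_sizes = [max(1, int(frac * n_train)) for frac in (0.01, 0.05, 0.1)]  # 1%-10% of n
    grid = []
    for max_conj, max_vars, nodesize in product(max_conj_values, max_vars_values, node_sizes):
        if max_conj <= max_vars <= 2 * max_conj:
            grid.append({"max_conj": max_conj, "max_vars": max_vars,
                         "nodesize": nodesize, "conjsize": nodesize})
    return grid
```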

3.7 Consistency of logicDT

In this section, we now study theoretical properties of logicDT, more precisely, the consistency of logicDT. For this purpose, we consider the core logicDT methodology, i.e., only permitting binary predictors. Without loss of generality, we assume a continuous outcome. Binary risk estimation/binary classification can be viewed as a special case using the Brier score as score function in an empirical risk minimization framework. The following theorem states that logicDT is strongly consistent. The proof of this theorem is given in Appendix 2.

Theorem 1

(Consistency of logicDT) Suppose \(\mu : \lbrace 0,1 \rbrace ^p \rightarrow {\mathcal{Y}}\) is a p-dimensional regression function and that the outcome Y with

$$\begin{aligned} \mathbb{E}[Y \mid {\varvec{X}}] = \mu ({\varvec{X}}) \end{aligned}$$

is bounded. Then, logicDT fitted via simulated annealing is strongly consistent, i.e., almost sure convergence

$$\begin{aligned} \mathbb{E}_{({\varvec{X}}, Y)} \left[ (\mu ({\varvec{X}}) - T_n ({\varvec{X}}))^2 \right] \,\ \xrightarrow [n \rightarrow \infty ]{\text{a.s.}} \,\ 0 \end{aligned}$$

holds for logicDT models \(T_n\) fitted to training data sets \({\mathcal{D}}_n = \lbrace ({\varvec{x}}_1,y_1), \ldots , ({\varvec{x}}_n,y_n) \rbrace\).

The following remark provides an application of Theorem 1 to hard classifications, in which the misclassification rate is evaluated.

Remark 1

For the binary classification/risk estimation case, alternatively to considering the Brier score, the excess misclassification rate is bounded by

$$\begin{aligned} 0 \,\ {}&\le \,\ \mathbb{P}_{({\varvec{X}}, Y)}({\hat{\varphi }}_{T_n}({\varvec{X}}) \ne Y) \ - \ \mathbb{P}_{({\varvec{X}}, Y)}(\varphi ^*({\varvec{X}}) \ne Y) \\ \,\ {}&\le \,\ 2 \sqrt{ \mathbb{E}_{({\varvec{X}}, Y)} \left[ (\mu ({\varvec{X}}) - T_n({\varvec{X}}))^2 \right] } \,\ \xrightarrow [n \rightarrow \infty ]{\text{a.s.}} \,\ 0 \end{aligned}$$

for the classifiers \({\hat{\varphi }}_{T_n}({\varvec{x}}) = \mathbbm{1}(T_n({\varvec{x}}) \ge 0.5)\) and the Bayes classifier \(\varphi ^*\) (see, e.g., Theorem 1.1, Györfi et al., 2002).

Thus, the misclassification rate of the best possible classifier \(\varphi ^*\) will be asymptotically almost surely attained by logicDT.

Note that Theorem 1 holds as long as the hyperparameters are chosen such that the true underlying model is compatible with them. More precisely, \(\texttt{max\_vars}\) and \(\texttt{max\_conj}\) need to be sufficiently large, and \(\texttt{nodesize}\) and \(\texttt{conjsize}\) need to be sufficiently small.

3.8 Computational complexity of logicDT

In this section, we study the computational complexity of logicDT, which is mainly controlled by the complexities of conducting a simulated-annealing-based search and of fitting decision trees. A guarantee for obtaining a globally optimal model is only given if infinitely many iterations (or a number of iterations in the magnitude of the size of the complete search space) are carried out in the simulated-annealing-based search (Van Laarhoven & Aarts, 1987). In practice, this is typically infeasible because of the size of the search space. Therefore, this asymptotic search is approximated in practical applications using a finite number of iterations (for more details on the search process, see Appendix 1). Hence, we assume that the number of search steps is given by a finite number M.

Using the complexities of simulated annealing, decision tree fitting, and tree training data set transformation and using Algorithm 2, the computational complexity of logicDT is given in the following theorem. The proof of this theorem is given in Appendix 3.

Theorem 2

(Computational complexity of logicDT) Suppose M is the number of search steps performed, n training observations are given, and the hyperparameters \(\texttt{max\_vars}\), \(\texttt{max\_conj}\), \(\texttt{nodesize}\) are fixed. Then, the computational complexity of logicDT is given by

$$\begin{aligned} {\mathcal{O}}\left( M n\left[ \texttt{max\_vars} + \texttt{max\_conj} \frac{n}{\texttt{nodesize}}\right] \right) . \end{aligned}$$

Using Theorem 2, results about appropriate numbers M of search iterations based on the Markov chain length (i.e., the number of search iterations for a fixed temperature), and assumptions on the hyperparameter choices, the following corollary states that the computational complexity of logicDT is polynomial in p. The corresponding proof is, again, provided in Appendix 3.

Corollary 1

(Polynomial complexity of logicDT) Assume that the parameters \(\texttt{max\_vars}\) and \(\texttt{max\_conj}\) both scale linearly with p and that the parameter \(\texttt{nodesize}\) is constant with respect to n (the worst-case scenario, in which the logic decision tree may be arbitrarily deep). Further assume that the Markov chain length is fixed. Then, the computational complexity of logicDT is given by

$$\begin{aligned} {\mathcal{O}}\left( n^2 p^2 \log (p)\right) . \end{aligned}$$

If instead the Markov chain length is chosen in the magnitude of the number of neighbor states per state (as suggested by Aarts & Van Laarhoven, 1985), the computational complexity of logicDT is given by

$$\begin{aligned} {\mathcal{O}}\left( n^2 p^4 \log (p)\right) . \end{aligned}$$

3.9 Bagged logicDT

If a single model consisting of relatively few variables cannot explain the whole variation in the outcome from the whole set of predictors or if the predictive power is of higher interest than the interpretability of the model, ensemble models consisting of several simpler models might be a preferable choice.

A particularly simple, yet effective approach is bagging (Breiman, 1996), in which, for a given number of bagging iterations (e.g., 500), a single model is fitted on a random subset of the original training data set. The random subsets are typically generated via bootstrapping, i.e., by performing n random draws from the original training data with replacement. The resulting model is the ensemble of all individual models. Predictions are performed by averaging the predictions of the individual models. The number of iterations should, as in random forests, be chosen such that additional iterations cannot substantially reduce the generalization error any further.
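A minimal Python sketch of bagging around generic `fit` and `predict` functions (e.g., wrappers around the logicDT search sketched above); the bootstrap indices are stored so that the out-of-bag error discussed below can be computed:

```python
import numpy as np

def fit_bagged(X, y, fit, n_bags=500, rng=None):
    """Fit an ensemble on bootstrap samples and remember the out-of-bag indices."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(y)
    models, oob_sets = [], []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)                  # draw n times with replacement
        models.append(fit(X[idx], y[idx]))
        oob_sets.append(np.setdiff1d(np.arange(n), idx))  # observations not drawn
    return models, oob_sets

def predict_bagged(models, X, predict):
    """Average the predictions of the individual ensemble members."""
    return np.mean([predict(m, X) for m in models], axis=0)
```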

Since a sufficient number of bagging iterations is also desired in logicDT, simulated annealing with a proper number of iterations might simply be too slow for fitting each ensemble member. Moreover, the main issue of greedy search approaches, i.e., that a globally optimal state could be missed due to getting stuck in a local optimum, might be mitigated by considering different subsets of the training data set and stabilizing the model over them. In other words, the variance-stabilizing property of bagging might be sufficient to account for the drawbacks of a greedy search (Murthy & Salzberg, 1995).

For the usage of logicDT in an ensemble framework, we, therefore, propose a greedy search for fitting individual logicDT models. In this greedy search, the same state modifications as in the simulated-annealing-based search are used (see Sect. 3.2). In contrast to simulated annealing, the greedy search deterministically chooses the best neighbor in each iteration. Thus, for each current state, all its neighbors are evaluated and the neighbor with the lowest score amongst all neighbors is chosen as new state. Note that for increasing numbers of predictors and increasing numbers of allowed terms and total variables, the number of eligible neighbors per state increases quadratically, thus, slowing down the greedy search. For handling higher-dimensional data, a randomization of the greedy search might be a solution which we, however, did not consider in this article.

Another very useful property of bagging is that not all observations from the training data are employed in the fitting of an individual model. The unused observations, called oob (out-of-bag) observations, can, therefore, be used to estimate the generalization error, similar to using independent test data. This estimate is called the oob error and only uses models that were not built using the considered observation. More precisely, the oob error is calculated by averaging over the oob errors of the individual observations, where the oob error of an observation is computed by selecting only the models which did not use this observation for training and by temporarily constructing an ensemble from this subset of models for predicting the outcome of this observation. In particular, for the estimation of variable importance measures (VIMs), bagging and oob observations are very beneficial. As discussed in the following section, we, therefore, also use them in the construction of the VIM considered in logicDT.
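A minimal sketch of the oob error computation is given below, assuming an ensemble as returned by the bagging sketch above and using the mean squared error as error measure.

```r
## Sketch of the oob error for a bagged ensemble (see bagging sketch above);
## the mean squared error is used as error measure.
oob_error <- function(bag, X, y, predict_model) {
  n <- nrow(X)
  oob_pred <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    oob_models <- bag$models[!bag$in_bag[i, ]]    # models that did not use observation i
    if (length(oob_models) == 0) next
    preds <- vapply(oob_models,
                    function(m) predict_model(m, X[i, , drop = FALSE]),
                    numeric(1))
    oob_pred[i] <- mean(preds)                    # temporary ensemble of these models
  }
  mean((y - oob_pred)^2, na.rm = TRUE)
}
```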

4 Variable importance measures

In many applications, it is useful to measure the influence of the input variables or their interactions on the prediction of an outcome. Variable importance measures (VIMs) directly try to quantify this influence. Typically, this influence is estimated by comparing two models, namely

  • the original full model containing the term of interest and

  • a kind of informatively reduced model, in which the term of interest no longer plays an informative role.

Then, the difference between the prediction errors of these two models is computed and is taken as an estimate of how the prediction based on the model improves if the term is properly included, where the prediction errors are, e.g., given by the mean squared error in regression tasks or \(1-\text{AUC}\) in binary risk estimation tasks.

4.1 Computation of VIMs

Let \(\epsilon (\varvec{{\tilde{X}}})\) be a prediction error measure capturing the performance of a fitted model informatively using only the input variables in \(\varvec{{\tilde{X}}} \subseteq {\varvec{X}}\), interpreting the random vector of input variables \({\varvec{X}} = (X_1, \ldots , X_p)\) as a set \({\varvec{X}} = \lbrace X_1, \ldots , X_p \rbrace\). Then, the importance of an input variable \(X_i\) is given by

$$\begin{aligned} \text{VIM}(X_i) \ =\ \epsilon ({\varvec{X}} \setminus X_i) - \epsilon ({\varvec{X}}). \end{aligned}$$
(7)

Here, \(\epsilon ({\varvec{X}} \setminus X_i)\) describes the prediction error of the reduced model informatively excluding the variable \(X_i\) and \(\epsilon ({\varvec{X}})\) describes the prediction error of the original full model.

Bagging allows the unbiased estimation of VIMs on the full training data set by performing oob predictions. Moreover, bagging also has the advantage that multiple, potentially different models are explored, stabilizing the VIMs themselves. Thus, for estimating VIMs in logicDT, bagging is used and the discussed VIMs are computed on the oob observations.

4.1.1 Permutation VIM and removal VIM

One particularly popular approach for estimating the reduced model is the permutation VIM used in random forests (Breiman, 2001). In this approach, for estimating the importance of a certain input variable, its observed values are randomly permuted and predictions based on this random permutation are performed. Typically, the data set is permuted multiple times in the specific predictor, and the average prediction error over these permutations is compared with the original error.
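A minimal sketch of the permutation VIM is given below, assuming a fitted \(\texttt{model}\), the hypothetical \(\texttt{predict\_model}\) function introduced above, and the mean squared error as prediction error measure.

```r
## Sketch of the permutation VIM for the j-th predictor.
permutation_vim <- function(model, X, y, j, predict_model, n_perm = 100) {
  err_full <- mean((y - predict_model(model, X))^2)
  err_perm <- mean(vapply(seq_len(n_perm), function(r) {
    X_perm <- X
    X_perm[, j] <- sample(X_perm[, j])            # randomly permute predictor j
    mean((y - predict_model(model, X_perm))^2)
  }, numeric(1)))
  err_perm - err_full                             # Eq. (7) with a permuted reduced model
}
```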

As an alternative, the reduced model can also be directly fitted using a reduced training data set from which the predictor of interest was removed (Mentch & Hooker, 2016). In the following, we call this approach the removal VIM.

4.1.2 Logic VIM

For binary predictors, we additionally propose a specific third procedure for computing VIMs. The idea of this logic VIM is based on considering each possible predictor setting of the input variable of interest equally, i.e., for a binary predictor \(X_1 \in \lbrace 0,1 \rbrace\), the error of the reduced model is estimated by performing predictions fixing \(X_1 = 0\), performing predictions fixing \(X_1 = 1\) and averaging these predictions before computing the error. Thus, for each observation, the prediction of the reduced model considers both possible decision tree paths, one for \(X_1 = 0\) and one for \(X_1 = 1\), equally and is generated without knowledge about \(X_1\).
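A minimal sketch of the logic VIM for a single binary predictor is given below, again using the hypothetical \(\texttt{predict\_model}\) function and the mean squared error.

```r
## Sketch of the logic VIM for a single binary predictor j.
logic_vim <- function(model, X, y, j, predict_model) {
  err_full <- mean((y - predict_model(model, X))^2)
  X0 <- X; X0[, j] <- 0                           # predictions fixing X_j = 0
  X1 <- X; X1[, j] <- 1                           # predictions fixing X_j = 1
  pred_reduced <- (predict_model(model, X0) + predict_model(model, X1)) / 2
  err_reduced <- mean((y - pred_reduced)^2)
  err_reduced - err_full                          # Eq. (7)
}
```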

4.2 Adjustment for interactions

In standard VIM procedures such as the permutation VIM in random forests, only importances of single input variables are considered. In the context of logicDT, we measure the importance of terms, i.e., of identified single input variables or conjunctions of several input variables. For instance, if the resulting model consists of \(\lbrace \lbrace X_1 \rbrace , \lbrace X_2 \wedge X_3^c \rbrace \rbrace\), we are interested in specifying the importance of \(X_1\) as well as the importance of the term \(X_2 \wedge X_3^c\). This is achieved by considering terms such as \(X_2 \wedge X_3^c\) as single input variables, i.e., by directly considering a tree training data set as in Eq. (4).

Since decision trees can handle interactions themselves, it might be possible that, e.g., \(X_1\) as well as the interaction \(X_1 \wedge X_2^c\) exhibit strong effects on the outcome. However, due to the strong marginal effect, only the single predictors \(X_1\) and \(X_2\) might be included in the logicDT model, complicating the estimation of the importance of the interaction.

Hence, we propose a novel VIM adjustment procedure that quantifies the importance of interactions which were not identified by a supervised learner such as logicDT. The VIM adjustment approach presented in the following does not depend on logicDT, but enables logicDT to appropriately estimate interaction importances. Therefore, it could, in principle, be applied to any black-box model for estimating interaction importances.

The idea behind the VIM adjustment procedure is based on considering several predictors at once, i.e., the reduced model results from reducing multiple variables in one step. Comparing the performances of this reduced model and the original model yields a joint VIM of the set of predictors (Bureau et al., 2005). Analogously to Eq. (7), the joint VIM is obtained by

$$\begin{aligned} \text{VIM}(X_{i_1}, \ldots , X_{i_k}) \ =\ \epsilon ({\varvec{X}} \setminus \lbrace X_{i_1}, \ldots , X_{i_k} \rbrace ) - \epsilon ({\varvec{X}}). \end{aligned}$$
(8)

Since this joint VIM still includes the marginal effects of the individual predictors and their sub-interactions of an order lower than the order of the actual interaction influencing the outcome, we propose the interaction VIM that corrects for any effects contained in the regarded interaction. This interaction VIM of \(X_{i_1} \wedge \cdots \wedge X_{i_k}\) is given by

$$\begin{aligned} \text{VIM}(X_{i_1} \wedge \cdots \wedge X_{i_k}) \ =\ &\text{VIM}(X_{i_1}, \ldots , X_{i_k} \mid {\varvec{X}} \setminus {\varvec{Z}}) \\&- \sum _{\lbrace j_1, \ldots , j_l \rbrace {\subsetneq } \lbrace i_1, \ldots , i_k \rbrace } \text{VIM}(X_{j_1} \wedge \cdots \wedge X_{j_l} \mid {\varvec{X}} \setminus {\varvec{Z}}), \end{aligned}$$
(9)

where \({\varvec{Z}} := \lbrace X_{i_1}, \ldots , X_{i_k} \rbrace\) is the set of input variables in the considered interaction. In our notation, \(\wedge\) denotes the interaction importance, while commas represent the joint importance. \(\text{VIM}({\varvec{A}} \mid {\varvec{X}} \setminus {\varvec{Z}})\) denotes the VIM of \({\varvec{A}}\) considering the predictor set excluding the variables in \({\varvec{Z}}\), i.e., the improvement obtained by additionally considering \({\varvec{A}}\) while regarding only the predictors in \({\varvec{X}} \setminus {\varvec{Z}}\). The interaction importance captures the importance of interaction in a general sense, i.e., it considers whether some variables interact in any way and quantifies the effect of the joint presence of these variables adjusted for their single occurrences. For a predictor set \(\varvec{{\tilde{A}}} := \lbrace X_{j_1}, \ldots , X_{j_l} \rbrace \subseteq {\varvec{Z}}\), the restricted joint VIM, i.e., the VIM of \(\varvec{{\tilde{A}}}\) considering only the predictors \({\varvec{X}} \setminus {\varvec{Z}}\) in the reduced model, is, following Eq. (8), given by

$$\begin{aligned} \text{VIM}(\varvec{{\tilde{A}}} \mid {\varvec{X}} \setminus {\varvec{Z}}) \ =\ \epsilon ({\varvec{X}} \setminus {\varvec{Z}}) - \epsilon (\varvec{{\tilde{A}}} \cup ({\varvec{X}} \setminus {\varvec{Z}})). \end{aligned}$$
(10)

Excluding all variables in \({\varvec{Z}}\) composing the interaction in the respective reference models is crucial for isolating the effects that should be adjusted for. If, e.g., a two-way interaction \(X_{1} \wedge X_{2}\) is studied, its interaction VIM (9) is given by

$$\begin{aligned} \text{VIM}(X_{1} \wedge X_{2}) \ =\ {}&\text{VIM}(X_{1}, X_{2} \mid {\varvec{X}} \setminus \lbrace X_1, X_2 \rbrace ) \\&- \text{VIM}(X_{1} \mid {\varvec{X}} \setminus \lbrace X_1, X_2 \rbrace ) - \text{VIM}(X_{2} \mid {\varvec{X}} \setminus \lbrace X_1, X_2 \rbrace ). \end{aligned}$$
(11)

If, e.g., \(\text{VIM}(X_{1} \mid {\varvec{X}} \setminus X_{1}) \overset{(7), (10)}{=} \text{VIM}(X_{1})\) were used instead of \(\text{VIM}(X_{1} \mid {\varvec{X}} \setminus \lbrace X_1, X_2 \rbrace )\) in Eq. (11), the whole importance of \(X_{1}\), which also contains the interaction with \(X_{2}\), would be subtracted from the joint importance, thus not isolating the interaction importance that should be estimated.

Recursively applying Eq. (11) to the general case in Eq. (9) yields

$$\begin{aligned} \text{VIM}(X_{i_1} \wedge \cdots \wedge X_{i_k}) \ =\ \sum _{\lbrace j_1, \ldots , j_l \rbrace \subseteq \lbrace i_1, \ldots , i_k \rbrace } (-1)^{k-l} \cdot \text{VIM}(X_{j_1}, \ldots , X_{j_l} \mid {\varvec{X}} \setminus {\varvec{Z}}). \end{aligned}$$

Utilizing Eq. (10), this formula for the interaction VIM can also be written in terms of prediction errors \(\epsilon\), i.e., as

$$\begin{aligned} \text{VIM}(X_{i_1} \wedge \cdots \wedge X_{i_k}) \ =\ \sum _{\lbrace j_1, \ldots , j_l \rbrace {\subseteq } \lbrace i_1, \ldots , i_k \rbrace } (-1)^{l+1} \cdot \epsilon ({\varvec{X}} \setminus \lbrace X_{j_1}, \ldots , X_{j_l} \rbrace ) - \epsilon ({\varvec{X}}). \end{aligned}$$

This formula, in which the sum runs over the non-empty subsets of \(\lbrace i_1, \ldots , i_k \rbrace\), can be used for efficiently computing the interaction VIM by directly considering prediction errors.
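A minimal sketch of this computation is given below. The function \(\texttt{eps}\) is a hypothetical placeholder returning the (oob) prediction error of a model that informatively uses only the predictors named in its \(\texttt{keep}\) argument, and \(\texttt{Z}\) holds the names of the interacting predictors.

```r
## Sketch of the interaction VIM computed directly from prediction errors;
## eps() is a hypothetical error function (see lead-in text).
interaction_vim <- function(eps, X, y, Z) {
  all_vars <- colnames(X)
  vim <- -eps(X, y, keep = all_vars)              # the term -epsilon(X)
  for (l in seq_along(Z)) {                       # non-empty subsets of Z
    for (S in combn(Z, l, simplify = FALSE)) {
      vim <- vim + (-1)^(l + 1) * eps(X, y, keep = setdiff(all_vars, S))
    }
  }
  vim
}
```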

The interaction VIM (9) is similar to the interaction effect statistic proposed by Friedman and Popescu (2008), which utilizes the same effect decomposition and is based on the explained variance of partial dependence functions instead of VIMs. Friedman and Popescu (2008) theoretically justified this effect decomposition by showing that their statistic is zero if the null hypothesis of no interaction effect holds true. For example, for analyzing a two-way interaction \(X_1 \wedge X_2\), Friedman and Popescu (2008) evaluate \(F_{X_1,X_2}-F_{X_1}-F_{X_2}\), in which \(F_\cdot\) denotes partial dependence functions of the considered input variables. This term is analogous to the interaction VIM in Eq. (11) for \(X_1 \wedge X_2\) with the difference that VIMs, i.e., performance metrics, are used instead of partial dependence functions. Moreover, the input feature effect decomposition utilized by the proposed interaction VIM is also used by the Shapley interaction index (Lundberg et al., 2020; Fujimoto et al., 2006). However, in machine learning applications, Shapley values are based on direct predictions of the fitted model instead of performance metrics such as VIMs.

For all three procedures for constructing VIMs mentioned in Sect. 4.1, the reduced joint model can be intuitively constructed.

In the permutation VIM, the input variables of interest, i.e., the input variables participating in the interaction for which the interaction VIM should be computed, are simply permuted together by, e.g., permuting the values of each input variable separately.

For the removal VIM, the set of input variables of interest is removed as a whole from the total set of input variables.

The logic VIM proposed in Sect. 4.1.2 performs uninformative predictions for an input variable by considering both possible decision tree paths for an observation and averaging the predictions. To generalize the logic VIM to multiple input variables at once for computing the interaction VIM, all possible predictor settings \({\varvec{x}} \in \lbrace 0,1 \rbrace ^k\) for the k input variables that shall be informatively excluded are used to generate predictions. These \(2^k\) predictions are averaged to create the prediction of the reduced model.
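A minimal sketch of this reduced-model prediction is given below, again using the hypothetical \(\texttt{predict\_model}\) function; \(\texttt{vars}\) holds the names of the k binary predictors to be informatively excluded.

```r
## Sketch of the reduced-model prediction of the logic VIM for a set of
## k binary predictors given by vars.
logic_reduced_pred <- function(model, X, vars, predict_model) {
  k <- length(vars)
  settings <- expand.grid(rep(list(c(0, 1)), k))  # all 2^k value combinations
  preds <- sapply(seq_len(nrow(settings)), function(s) {
    X_mod <- X
    for (j in seq_len(k)) X_mod[, vars[j]] <- settings[s, j]   # fix the k predictors
    predict_model(model, X_mod)
  })
  rowMeans(preds)                                 # average over the 2^k settings
}
```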

In logicDT, the logic VIM is used in conjunction with the proposed adjustment for interaction effects. Quantifying the importance of specific conjunctions, that are, e.g., identified by logicDT, will be discussed in the following section. In Sect. 5, the permutation VIM, the removal VIM, and the logic VIM are evaluated in empirical studies.

4.3 Adjustment for conjunctions

The VIM adjustment approach introduced in Sect. 4.2 only captures the importance of a general meaning of interactions, i.e., it just considers the question whether some variables do interact in some way. Since logicDT is aimed at identifying specific conjunctions (and also determines the values of a VIM for them, if the conjunctions have been identified by logicDT), a further adjustment approach is proposed that tries to identify the specific conjunction leading to an interaction effect. For example, if the importance of the interaction between \(X_1\) and \(X_2\) was quantified using the interaction adjustment proposed in Sect. 4.2, the approach presented in the following assigns a Boolean conjunction to this importance, e.g., the Boolean conjunction \(X_1 \wedge X_2^c\). The proposed procedure is, again, applicable to any kind of supervised learning model. However, due to considering Boolean conjunctions, the input variables for which the importance should be quantified need to be binary.

This approach considers each possible conjunction of the identified interaction and chooses the conjunction that leads to the most severe deviation in the outcome, i.e., the conjunction with the strongest effect on the outcome. The VIM of this conjunction is the corresponding interaction VIM derived in Sect. 4.2.

The idea of this method is to consider the values of the outcome for each possible scenario of the interacting variables, e.g., for \(X_1 \wedge (X_2^c \wedge X_3)\), where we assume that the terms \(X_1\) and \(X_2^c \wedge X_3\) were identified by logicDT. In this example, thus, two interacting terms are regarded, i.e., the \(2^2 = 4\) possible scenarios \(X_1 = 0\) or \(X_1 = 1\) in combination with \(X_2^c \wedge X_3 = 0\) or \(X_2^c \wedge X_3 = 1\) are considered. For each setting, the corresponding outcome values are compared to the outcome values of the complementary set, i.e., the set in which the considered conjunction is equal to zero. This means that in the considered example the four statistical tests

$$\begin{aligned}&H_0:\ \mathbb{E}\left[ Y \mid C_i = 1\right] = \mathbb{E}\left[ Y \mid C_i = 0\right] \hspace{1.5em} \\\text{vs.} \hspace{1.5em} &H_1:\ \mathbb{E}\left[ Y \mid C_i = 1\right] \ne \mathbb{E}\left[ Y \mid C_i = 0\right] , \end{aligned}$$

with

$$\begin{aligned}&C_1 = X_1 \wedge (X_2^c \wedge X_3), \quad C_2 = X_1^c \wedge (X_2^c \wedge X_3), \\&C_3 = X_1 \wedge (X_2^c \wedge X_3)^c, \quad C_4 = X_1^c \wedge (X_2^c \wedge X_3)^c \end{aligned}$$

potentially negating the subterms, are performed for \(i \in \lbrace 1,2,3,4 \rbrace\). For continuous outcomes, Welch’s t-test is performed for comparing the means between these two groups, i.e., the group in which the considered conjunction is equal to one and the group in which the considered conjunction is equal to zero. For binary outcomes, Fisher’s exact test is performed for testing different underlying case probabilities. The combination with the lowest p-value is chosen as the explanatory term for the interaction effect. E.g., in the above example, if the smallest p-value results from considering \(X_1 = 0\) and \((X_2^c \wedge X_3) = 1\), the term \(X_1^c \wedge (X_2^c \wedge X_3)\) is chosen as the conjunction responsible for the interaction effect.
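A minimal sketch of this procedure for two identified terms and a continuous outcome is given below; the inputs \(\texttt{t1}\) and \(\texttt{t2}\) are binary 0/1 vectors representing the identified terms. For a binary outcome, \(\texttt{fisher.test}\) on the 2x2 table of the candidate conjunction and the outcome would be used instead of Welch's t-test.

```r
## Sketch of the conjunction adjustment for two identified terms t1 and t2
## (binary 0/1 vectors) and a continuous outcome y.
choose_conjunction <- function(t1, t2, y) {
  candidates <- list(
    "t1 AND t2"         = t1 * t2,
    "NOT t1 AND t2"     = (1 - t1) * t2,
    "t1 AND NOT t2"     = t1 * (1 - t2),
    "NOT t1 AND NOT t2" = (1 - t1) * (1 - t2)
  )
  p_values <- vapply(candidates, function(C) {
    t.test(y[C == 1], y[C == 0])$p.value          # Welch's t-test (R default)
  }, numeric(1))
  names(which.min(p_values))                      # conjunction with the smallest p-value
}
```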

5 Experiments

In the following, we evaluate the performance of logicDT on simulated and real data considering classification and regression problems and compare logicDT with other similar methods. More precisely, we compare logicDT and bagged logicDT with conventional decision trees (Breiman et al., 1984), DL8.5 (Aglin et al., 2020a), random forests (Breiman, 2001), gradient boosting (Friedman, 2001), logic regression (Ruczinski et al., 2003), logic regression with bagging (Schwender & Ickstadt, 2007), MOB (model-based recursive partitioning, Zeileis et al., 2008), interaction forests (Hornung & Boulesteix, 2022), and RuleFit (Friedman & Popescu, 2008). Since DL8.5 (like similar openly available optimal decision tree algorithms such as MurTree, proposed by Demirović et al., 2022) is currently only implemented for classification tasks, it is only applied to the considered classification tasks. All analyses are carried out using R (R Core Team, 2022), except for the application of DL8.5, which is performed using the Python implementation of Aglin et al. (2020b).

5.1 Simulation study

We, first, consider the situation of genetic association studies in which single genes/genetic pathways are analyzed and typically not more than a few tens of SNPs (single nucleotide polymorphisms) are considered. Afterwards, we consider a more complex setting with more SNPs to evaluate if logicDT is also applicable to high-dimensional problems.

5.1.1 First simulation setup

We analyze the performance of logicDT and the other supervised learning procedures first in four different simulation scenarios in which we consider binary predictors and

  • a binary outcome (such as a disease status) without an additional continuous covariable,

  • a binary outcome with a continuous covariable,

  • a continuous outcome (such as the blood pressure) without a continuous covariable, and

  • a continuous outcome with a continuous covariable.

Our simulations are based on the problem of analyzing risk factors in genetic epidemiology. Thus, the generated input variables can be interpreted as SNPs that count the number of minor alleles at a specific locus, i.e., the number of occurrences of a less frequent base-pair substitution at a specific location in the DNA. Due to humans being diploid organisms, i.e., carrying two complete sets of chromosomes, SNPs can take the values 0, 1, or 2. Similar to, e.g., logic regression, for the application of logicDT to SNP data, each SNP is divided into the binary input variables \(\text{SNP}_D = \mathbbm{1}(\text{SNP} \ne 0)\) and \(\text{SNP}_R = \mathbbm{1}(\text{SNP} = 2)\), coding for a dominant and a recessive effect, respectively, such that no information is lost. Conventional decision trees also implicitly divide SNPs into dominant and recessive effects by considering SNPs as numerical variables such that a split can occur on \((\lbrace 0 \rbrace , \lbrace 1,2 \rbrace )\) or on \((\lbrace 0,1 \rbrace , \lbrace 2 \rbrace )\). Combined with the greedy search of decision trees over all possible splits, this is equivalent to directly considering the binary variables \(\text{SNP}_D\) and \(\text{SNP}_R\) (Lau et al., 2022).

The genotypes of the SNPs are generated independently, resembling sets of SNPs from which, as often done in practice, highly correlated SNPs have been removed using linkage-disequilibrium-based pruning (see, e.g., Purcell et al., 2007). The distributions of the SNPs are defined via the MAF (minor allele frequency), i.e., the proportion of minor allele occurrences, yielding the binomial distribution \(\text{Bin}(2, \text{MAF})\) for each SNP. For all simulated SNPs, we consider a MAF of 0.25. For each data set, 50 SNPs are generated so that \({\varvec{X}} = (\text{SNP}_1, \ldots , \text{SNP}_{50})\). However, in the considered scenarios described below, only a small fraction influences the outcome such that most input variables act as noise regarding the outcome.

For the analysis of the influence of a continuous covariable, an environmental variable (e.g., an air pollution indicator) is generated following a truncated normal distribution (truncated at zero, since values below zero often do not occur in practice). In particular, the environmental term E is generated by considering a \({\mathcal{N}}(20, 100)\)-distributed random variable \(E'\) and setting values below zero to zero so that \(E = \max (0, E')\). The truncated values might, e.g., be interpreted as measurements below a detection limit.

Since DL8.5 can only incorporate binary input variables, E is dichotomized into a binary variable by considering \(E_\text{bin} = \mathbbm{1}(E > 20)\) for fitting and evaluating DL8.5 models, where the cutoff 20 is chosen due to \(\mathbb{E}[E] = \mathbb{P}(E'> 0) \mathbb{E}[E' \mid E' > 0] \approx 20\).
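A minimal sketch of this data simulation in R is given below; the seed is an arbitrary choice for illustration.

```r
## Sketch of the input data simulation: 50 independent SNPs with MAF 0.25,
## their dominant/recessive binary codings, and the truncated normal
## environmental covariable E.
set.seed(1)                                       # arbitrary seed for illustration
n <- 1000; p_snp <- 50; maf <- 0.25
snps <- matrix(rbinom(n * p_snp, size = 2, prob = maf), nrow = n,
               dimnames = list(NULL, paste0("SNP", seq_len(p_snp))))
snp_d <- (snps != 0) * 1                          # dominant coding: 1(SNP != 0)
snp_r <- (snps == 2) * 1                          # recessive coding: 1(SNP == 2)
e <- pmax(0, rnorm(n, mean = 20, sd = 10))        # E = max(0, E'), E' ~ N(20, 100)
e_bin <- (e > 20) * 1                             # dichotomized version used for DL8.5
```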

For the first simulation scenario considering a binary outcome without any continuous covariables, the outcome is generated following the model

$$\begin{aligned} \text{logit}(\mathbb{P}(Y = 1 \mid {\varvec{X}}))\ =\ -0.4 + \Big (&\sqrt{\log (1.5)} \cdot \mathbbm{1}(\text{SNP}_1> 0) \\&+ \sqrt{\log (2)} \cdot \mathbbm{1}(\text{SNP}_2 > 0 \wedge \text{SNP}_3 = 0) \Big )^2. \end{aligned}$$

Therefore, \(\text{SNP}_1\) exhibits a moderate marginal effect, and \(\text{SNP}_2\) and \(\text{SNP}_3\) interact with each other. The linear predictor on the right-hand side is squared, which means that, on the scale of the total linear predictor, the term \(\mathbbm{1}(\text{SNP}_1 > 0)\) interacts with the term \(\mathbbm{1}(\text{SNP}_2 > 0 \wedge \text{SNP}_3 = 0)\). Thus, this resembles a situation in which it might be useful to be able to model interactions between interactions, since the underlying scale of the linear predictor is unknown prior to the analyses, which is usually the case in practice. The intercept of \(-0.4\) ensures that the resulting data sets are approximately balanced, i.e., that the fraction of cases is approximately equal to 50%.
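Continuing the simulation sketch above, the outcome of this first scenario could be generated as follows; \(\texttt{plogis}\) is the inverse logit function in base R.

```r
## Sketch of the outcome generation for the first scenario.
lin_pred <- -0.4 + (sqrt(log(1.5)) * (snps[, "SNP1"] > 0) +
                    sqrt(log(2))   * (snps[, "SNP2"] > 0 & snps[, "SNP3"] == 0))^2
y_bin <- rbinom(n, size = 1, prob = plogis(lin_pred))   # plogis() inverts the logit
```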

In the second scenario, a gene-environment interaction is introduced such that the outcome in this case is modeled by

$$\begin{aligned} \text{logit}(\mathbb{P}(Y = 1 \mid {\varvec{X}}, E))\ =\ {}&-0.45 + \log (2) \cdot \mathbbm{1}(\text{SNP}_1> 0) \\&+ \log (3) \cdot \frac{E}{20} \cdot \mathbbm{1}(\text{SNP}_2 > 0 \wedge \text{SNP}_3 = 0). \end{aligned}$$

Thus, the environmental variable only influences the outcome if the term \(\mathbbm{1}(\text{SNP}_2 > 0 \wedge \text{SNP}_3 = 0)\) is equal to one. This kind of gene-environment interaction might be reasonable for substances that are usually harmless, but might cause, e.g., allergic reactions in individuals with a certain genetic makeup.

Analogously to the first scenario, the third scenario consists of data sets in which the outcome is modeled by

$$\begin{aligned} \mathbb{E}[Y \mid {\varvec{X}}]\ =\ -0.4 + \Big (&\sqrt{\log (1.5)} \cdot \mathbbm{1}(\text{SNP}_1> 0) \\&+ \sqrt{\log (2)} \cdot \mathbbm{1}(\text{SNP}_2 > 0 \wedge \text{SNP}_3 = 0) \Big )^2. \end{aligned}$$

Here and in the following scenario, random noise generated from \({\mathcal{N}}(0, 1)\) is added to the linear predictor.

As in the second scenario, the fourth scenario follows the underlying model

$$\begin{aligned} \mathbb{E}[Y \mid {\varvec{X}}, E]\ =\ {}&-0.75 + \log (2) \cdot \mathbbm{1}(\text{SNP}_1> 0) \\&+ \log (4) \cdot \frac{E}{20} \cdot \mathbbm{1}(\text{SNP}_2 > 0 \wedge \text{SNP}_3 = 0). \end{aligned}$$

For each simulation scenario, 100 independent data sets are generated, and each data set is treated as if it were the only data set available. Thus, for each replication, the data set is randomly divided into a training, a validation, and a test data set, so that the evaluation of logicDT and the comparable methods is based on 100 independent evaluations. In many practical applications, such as the construction of genetic risk scores, data are only available for a relatively small number of observations. Hence, in our simulations, the randomly generated data sets consist of 1000 observations each. From each of these data sets, \(0.7 \cdot 1000 = 700\) randomly chosen observations are used as the intermediary data set for training and validating the model, and the remaining 300 observations yield the test data set for the final evaluation. The intermediary data set is further randomly divided into \(0.25 \cdot 700 = 175\) observations for choosing the best set of hyperparameters and \(0.75 \cdot 700 = 525\) observations for training in the hyperparameter optimization. After the optimal hyperparameter setting has been identified, the final models are trained on the intermediary data set consisting of 700 observations.
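A minimal sketch of this data split for one replication is given below (row indices of a data set with 1000 observations).

```r
## Sketch of the data split used in each replication.
idx <- sample(1000)
test_idx  <- idx[1:300]          # 300 observations for the final evaluation
inter_idx <- idx[301:1000]       # 700 intermediary observations
val_idx   <- inter_idx[1:175]    # 175 observations for hyperparameter validation
train_idx <- inter_idx[176:700]  # 525 observations for training during tuning
```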

The predictive performance of logicDT and the comparable methods is assessed using the AUC for binary outcomes and using the complement of the NRMSE (normalized root mean squared error) for continuous outcomes on test data predictions.

5.1.2 Hyperparameter optimization

As described in Sect. 3.6, the model complexity parameters \(\texttt{max\_vars}\) (maximum number of total variables) and \(\texttt{max\_conj}\) (maximum number of conjunctions) of logicDT should be tuned. In this application, we prohibit removing complete conjunctions to ensure that the models consist of exactly \(\texttt{max\_conj}\) conjunctions. Furthermore, the minimum number \(\texttt{nodesize}\) of observations belonging to a leaf and the minimum number \(\texttt{conjsize}\) of observations belonging to a conjunction and its negation are tuned using the same value, respectively. This ensures that the trees are grown to the ideal depth and prevents models using uninformative conjunctions from being evaluated.

For bagged logicDT models, \(\texttt{max\_vars}\) and \(\texttt{max\_conj}\) are tuned using the same parameter settings but, in contrast to fitting single logicDT models, allowing the removal of complete conjunctions.

In Table 1, the considered hyperparameter settings for logicDT, bagged logicDT, and the comparable tree-based statistical learning methods are summarized. For logicDT, the hyperparameter settings proposed in Sect. 3.6 are considered. For the regarded comparable methods, common hyperparameter choices are considered and the best performing one is chosen. For all methods except for gradient boosting and RuleFit, a grid search among all proposed settings is performed, due to relatively few plausible settings. For gradient boosting and RuleFit, a sequential Bayesian hyperparameter search is carried out (Bergstra et al., 2011; Wilson, 2021), since a finetuning of the learning rate parameter (for a fixed number of boosting iterations) is required. Additionally, the subsample fraction and the minimum node size, which can also be considered as continuous hyperparameters, have to be configured jointly in gradient boosting and RuleFit. For this sequential search, 100 different settings are evaluated.

Table 1 Regarded hyperparameter settings with corresponding descriptions

For logicDT, Fig. 5 shows the validation data performances for the considered settings of \(\texttt{max\_vars}\) and \(\texttt{max\_conj}\) combined with the respective best setting for \(\texttt{nodesize}\)/\(\texttt{conjsize}\). For each scenario, the highest performance is yielded by \(\texttt{max\_vars} = 3\) and \(\texttt{max\_conj} = 2\), corresponding to the true underlying simulation models. Generally, the following pattern can be observed: for many \(\texttt{max\_conj}\) settings, the best-performing setting is given by \(\texttt{max\_vars} = \texttt{max\_conj} + 1\), which is due to the fact that, in this case, in addition to single variables as terms, a conjunction of two variables is contained in the model.

Fig. 5

Predictive performances of different hyperparameter settings for the parameters \(\texttt{max\_vars}\) (maximum number of variables) and \(\texttt{max\_conj}\) (maximum or exact number of terms) in logicDT in the simulation study considering four different scenarios. The performance for binary outcomes is measured by the AUC and the performance for continuous outcomes is measured by the complement of the NRMSE (normalized root mean squared error). Results on validation data sets for the best respective setting of the parameter \(\texttt{nodesize}\)/\(\texttt{conjsize}\) in the set \(\lbrace 1\%, 5\%, 10\% \rbrace\). The evaluated hyperparameter settings are listed in Table 1. Justifications for evaluating these settings are given in Sect. 3.6

For most considered hyperparameter settings, the validation performance does not seem to vary too severely between similar settings, which indicates that a slight hyperparameter misspecification might not substantially impair the predictive performance of the resulting logicDT model.

5.1.3 Predictive performance

Figure 6 depicts the performances of logicDT, the comparable methods, and the true underlying model in the simulation study, where the performance of the true model was assessed by performing predictions using the true regression functions presented in Sect. 5.1.1.

Fig. 6

Predictive performances of logicDT and the comparable methods in the simulation study considering four different scenarios. The performance for binary outcomes is measured by the AUC and the performance for continuous outcomes is measured by the complement of the NRMSE (normalized root mean squared error)

In the first simulation scenario considering a binary outcome without an environmental covariable, most notably, standard logicDT and bagged logicDT lead to the best performances, i.e., the largest AUC values, which almost coincide with the performance of the true model. Among the comparable methods, logic bagging seems to be the best method.

For the second scenario in which also a gene-environment interaction is considered, logicDT, bagged logicDT, gradient boosting, logic regression, and logic bagging induce similar results superior to the remaining methods. Here, logicDT and logic regression seem to produce slightly better results than the other procedures.

For the third and fourth simulation scenarios considering a continuous outcome without or with an environmental covariable, logicDT and bagged logicDT yield the highest predictive performances close to the true underlying models. When considering no environmental covariable, logic bagging seems to be the best method among the comparable methods. MOB yields the highest performance among the comparable methods when including an environmental covariable.

5.1.4 Variable importance

Using the VIMs and adjustment approaches for interactions and conjunctions proposed in Sect. 4, we computed variable importances in the four different simulation scenarios. For each scenario, we fitted bagged logicDT models on the 100 complete sub data sets, and the VIMs themselves were computed using out-of-bag data. For summarizing the results, means over the 100 repetitions were computed, where a term not occurring in a repetition received a VIM of zero in this repetition. Additionally, asymptotic 95% confidence intervals for these means \(\mu\) were calculated by \({\hat{\mu }} \pm 1.96 \cdot \widehat{\text{se}}\), where \(\widehat{\text{se}}\) is the estimated standard error. For binary outcomes, the AUC was used for determining VIMs, while for continuous outcomes, the MSE was employed.

Figure 7 depicts the determined VIMs. For all four scenarios and all three considered measures, the true influential input variables \(\text{SNP1D},\ \text{SNP2D},\ \text{SNP3D}\) receive the highest importance values. Theoretically non-influential terms comprised of variables not influencing the outcome were assigned importance values close to zero in all cases. In the first simulation scenario, the logic VIM and the removal VIM both assign the triplet \(\text{SNP1D}\ \wedge \ \text{SNP2D}\ \wedge \ \text{SNP3D}^c\) the highest importance among all interactions. The permutation VIM favors the sub-conjunction \(\text{SNP2D}\ \wedge \ \text{SNP3D}^c\) of this triplet. Both interpretations are correct regarding the true model in their own sense, since the term \(\text{SNP2D}\ \wedge \ \text{SNP3D}^c\) interacts with \(\text{SNP1D}\) due to squaring the linear predictor.

Fig. 7

Logic, removal, and permutation VIMs yielded by bagged logicDT models for the four scenarios in the simulation study. Adjustment for interactions and conjunctions was performed. Means and asymptotic 95% confidence intervals for the 100 repetitions are presented. Negations of input variables are denoted using a minus sign in the front

In the remaining three scenarios, all VIMs assign the term \(\text{SNP2D}\ \wedge \ \text{SNP3D}^c\) the highest importance among all interactions. However, in the third scenario considering, as in the first scenario, the square of the linear predictor, the conjunction \(\text{SNP1D}\ \wedge \ \text{SNP2D}\ \wedge \ \text{SNP3D}^c\) and additionally sub-conjunctions receive importance values greater than zero. In the last scenario considering a continuous outcome and an influential environmental covariable, the interaction \(\text{SNP2D}\ \wedge \ \text{SNP3D}^c\) received the highest importance overall for all three importance measures.

In the first three scenarios, the three single input variables yield the highest importances. This is due to the fact that the VIM of single input variables coincides with the standard definition of VIMs, i.e., the difference in error when informatively removing a single input variable. Thus, the VIM of a single input variable captures all of its effects, including effects of interactions in which this input variable participates. In the fourth scenario, the two-way interaction \(\text{SNP2D}\ \wedge \ \text{SNP3D}^c\) seems to be identified in almost every logicDT application so that the single input variables \(\text{SNP2D}\) and \(\text{SNP3D}\) receive lower importances due to being identified less often. Hence, the importances should be compared in groups corresponding to the interaction order, i.e., marginal importances should be compared to each other, two-way interactions should be compared to each other, and so forth.

In summary, all measures yield very similar and plausible results. The determination of the logic VIM is considerably faster than the determination of the removal VIM and the permutation VIM, since, for computing the logic VIM, the model does not have to be refitted and predictions do not have to be performed for a high number of permutations. Instead, for a term consisting of k variables, only \(2^k\) predictions have to be performed and compared to the original prediction.

5.1.5 Second simulation setup

To investigate whether logicDT is also suitable in scenarios in which a larger number of input variables is considered and more input variables influence the outcome, we evaluate logicDT and the comparable methods in additional simulations. Two scenarios are investigated, one considering a binary outcome and one considering a continuous outcome, which are both simulated according to the model

$$\begin{aligned} \begin{aligned} g(\mathbb{E}[Y \mid {\varvec{X}}, E])\ =\ {}&-0.25 + \log (2) \cdot \mathbbm{1}(\text{SNP}_1> 0) + \log (2.5) \cdot \frac{E}{20} \cdot \mathbbm{1}(\text{SNP}_2> 0) \\&- \log (1.5) \cdot \mathbbm{1}(\text{SNP}_3 = 2) - \log (1.5) \cdot \mathbbm{1}(\text{SNP}_4 = 0) \\&+ \log (3) \cdot \frac{E}{20} \cdot \mathbbm{1}(\text{SNP}_5> 0) \cdot \mathbbm{1}(\text{SNP}_6 = 2) \\&- \log (3) \cdot \mathbbm{1}(\text{SNP}_7 > 0) \cdot \mathbbm{1}(\text{SNP}_8 = 0) \cdot \mathbbm{1}(\text{SNP}_9 < 2), \end{aligned} \end{aligned}$$
(12)

where g is the logit function for the binary outcome and the identity function for the continuous outcome. This model was chosen since it exhibits a more complex structure, as nine SNPs influence the outcome through main effects, two-way interactions, three-way interactions, or gene-environment interactions. In total, 1000 SNPs (i.e., 2000 binary input variables coding for dominant and recessive modes of inheritance for these SNPs) and one continuous covariable were simulated for data sets with sample size \(n=1000\). The input variables are simulated analogously to the ones in Sect. 5.1.1. Both scenarios are, again, evaluated based on 100 independent replications, i.e., 100 random data sets, which are, analogously to Sect. 5.1.1, divided into training, validation, and test data sets.

5.1.6 Predictive performance

In Fig. 8, the predictive performances of logicDT and the comparable methods in the application to the two additional simulation scenarios are depicted. Both scenarios seem to be relatively complex, since the discrepancy between the predictive performance of the true model and that of the fitted models is larger than, e.g., in the previously conducted simulations.

Fig. 8

Predictive performances of logicDT and the comparable methods in the simulation study considering two more complex scenarios. The performance for the binary outcome is measured by the AUC and the performance for the continuous outcome is measured by the complement of the NRMSE (normalized root mean squared error)

For the binary outcome, the best performance is induced by gradient boosting, closely followed by logicDT, bagged logicDT, random forests, logic regression, and logic bagging. Out of these methods, logicDT and logic regression are the only methods that yield interpretable models. Conventional decision trees, DL8.5, MOB, and RuleFit lead to lower AUCs.

For the continuous outcome, the best results are induced by logicDT, bagged logicDT, gradient boosting, logic regression, logic bagging, and RuleFit. The other interpretability-focused methods, namely conventional decision trees and MOB, yield lower predictive performances.

Hence, logicDT also seems to be applicable and to yield comparatively high predictive performances in scenarios with larger numbers of input variables (here, 2000 binary input variables) and influential input variables.

5.1.7 Variable importance

In Fig. 9, the variable importances estimated by bagged logicDT in the application to the two additional simulation scenarios are displayed. Since relatively complex scenarios are considered, not every influential term is identified. Nonetheless, for the binary outcome and each considered VIM type, each term with a strongly positive variable importance is truly influential in the underlying data-generating model (12). Moreover, for both the binary and the continuous outcome and all VIM types, the two-way interaction \(\text{SNP8D}^c \wedge \text{SNP7D}\) is correctly identified.

Fig. 9

Logic, removal, and permutation VIMs yielded by bagged logicDT models for the two more complex scenarios in the simulation study. Adjustment for interactions and conjunctions was performed. Means and asymptotic 95% confidence intervals for the 100 repetitions are presented. Negations of input variables are denoted using a minus sign in the front

For the continuous outcome and the permutation VIM, the five top-ranking importances correspond to truly influential terms. However, the terms showing the next highest importances correspond to theoretically non-influential terms such as \((\text{SNP8D}^c \wedge \text{SNP7D})^c \wedge \text{SNP2D}\), and their confidence intervals lying entirely above zero falsely indicate that these terms are influential as well. This issue of falsely identified terms seems to be alleviated when employing the logic VIM or the removal VIM, since fewer non-influential terms yield VIM confidence intervals entirely above zero when using these VIMs. This indicates that the logic VIM and the removal VIM in conjunction with the adjustment for interactions can also be employed in more complex scenarios with a larger number of input variables.

5.2 Real data application

We have also applied logicDT and the comparable statistical learning methods to several real data sets, of which the data set of the SALIA study is of particular interest. Therefore, in the following subsections, we first consider this study and the performance of logicDT and the comparable methods in their application to the data from the SALIA study. Afterwards, we summarize the results of the analyses of the other data sets in Sect. 5.2.4. A more detailed discussion of these evaluations can be found in Appendix 4.

5.2.1 SALIA study

logicDT was applied to a real data set from a German cohort study called the SALIA study (Study on the Influence of Air Pollution on Lung, Inflammation and Aging, Schikowski et al., 2005). The results of logicDT were compared to the results of the methods also considered in the comparisons in Sect. 5.1. The data set consists of data from 517 women, of whom 123 had a rheumatic disease and 394 did not. For these women, data from 77 SNPs from the HLA-DRB1 gene, which presumably plays a major role in the heritability of rheumatoid arthritis (Clarke & Vyse, 2009), are available. For more details about the SALIA study itself and an analysis of rheumatic diseases in the SALIA study, see Krämer et al. (2010) and Lau et al. (2022), respectively.

The analysis was performed using a similar scheme as in the simulation study. For 100 independent repetitions, training, validation and test data sets were randomly drawn from the total data set. Hyperparameter optimization was performed using, again, the parameter values summarized in Table 1.

5.2.2 Predictive performance

In Fig. 10, the performances of logicDT and the comparable methods in their application to the SNP data from the SALIA study are shown. This figure reveals that all evaluated statistical learning procedures induce similarly high AUCs, except for conventional decision trees, DL8.5, and RuleFit, which show inferior predictive performances. RuleFit seems to have issues detecting a signal in the data set at all, despite its hyperparameters being optimized.

Fig. 10

Predictive performances of logicDT and the comparable methods in the evaluation of the SALIA data

We would like to point out that, besides conventional decision trees, DL8.5, logic regression, and RuleFit, logicDT is the only procedure that yields easily interpretable prediction models. In contrast to these methods, logicDT still leads to comparatively high predictive performances. Single logic regression models yield similar AUCs as logicDT. However, since logic regression models include complex terms consisting of mixtures of Boolean conjunctions and disjunctions, they tend to be harder to interpret than logicDT models.

Figure 11 shows the fitted logicDT model on the complete SALIA data. This tree is still relatively easy to interpret, i.e., it is easy to understand how predictions are made and which interactions are involved in the prediction. In comparison, the fitted logic regression model on the complete SALIA data is given by

$$\begin{aligned} \text{logit}(\mathbb{P}(Y = 1 \mid {\varvec{X}})) =&-1.14 \\&-19.63 \cdot \mathbbm{1}((\text{rs113608847D} \wedge (\text{rs113505515D}^c \vee \text{rs9270143R})) \\&\hphantom{-19.63 \cdot \mathbbm{1}(}\wedge (\text{rs1060176D} \vee (\text{rs28724138R}^c \wedge \text{rs17884945R}^c))) \\&-2.91 \cdot \mathbbm{1}((\text{rs34578704D}^c \wedge \text{rs34084957D}) \\&\hphantom{-2.91 \cdot \mathbbm{1}(}\vee ((\text{rs41288045R} \vee \text{rs9269814D}^c ) \vee \text{rs72844253R})) \\&+1.41 \cdot \mathbbm{1}((\text{rs113322920D} \vee \text{rs36101847R} ) \wedge \text{rs17879702D}^c ). \end{aligned}$$

For this model, it is not trivial to see which interactions are involved in the prediction and how predictions for \(\mathbb{P}(Y = 1 \mid {\varvec{X}})\) are constructed.

Fig. 11

Fitted logicDT model on the complete SALIA data

5.2.3 Variable importance

Figure 12 illustrates the measured variable importances in the application to the SALIA data for the three proposed VIM approaches using bagged logicDT models. In the top row, the importances for the top 5 single input variables are depicted. In the second and third row, the importances for the top 5 two-way and three-way interactions are shown, respectively.

Fig. 12

Logic, removal, and permutation VIMs yielded by bagged logicDT models in the evaluation of the SALIA data—divided into VIMs of single input variables, two-way interactions and three-way interactions. Adjustment for interactions and conjunctions was performed. Means and asymptotic 95% confidence intervals for the 100 repetitions are presented. Negations of input variables are denoted using a minus sign in the front

For verifying whether the terms identified by logicDT really have an influence on the outcome of interest, i.e., the rheumatic disease status, we considered for each identified term X in Fig. 12 a logistic regression model

$$\begin{aligned} \text{logit}\left( \mathbb{P}(Y = 1 \mid X = x)\right) = \beta _0 + \beta _1 x \end{aligned}$$
(13)

and performed statistical hypothesis tests testing whether the respective term has an influence on the outcome, i.e., testing \(H_0:\ \beta _1 = 0\) vs. \(H_1:\ \beta _1 \ne 0\) using a Wald test. For each set of five identified terms, we evaluated how many terms lead to significant coefficients in the model from Eq. (13) using a significance level of \(\alpha = 5\%\) and adjusting for multiple testing using the method by Benjamini and Hochberg (1995).
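A minimal sketch of this post-hoc analysis is given below; \(\texttt{term\_list}\) is a hypothetical list of binary 0/1 vectors representing the identified terms, and \(\texttt{y}\) is the binary disease status.

```r
## Sketch of the post-hoc significance check for a set of identified terms.
p_values <- vapply(term_list, function(term) {
  fit <- glm(y ~ term, family = binomial())
  summary(fit)$coefficients["term", "Pr(>|z|)"]   # Wald test of H_0: beta_1 = 0
}, numeric(1))
p_adj <- p.adjust(p_values, method = "BH")        # Benjamini-Hochberg adjustment
sum(p_adj < 0.05)                                 # number of significant terms
```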

Table 2 shows the results of this post-hoc analysis. None of the identified single input variables proves to be significant. However, for the logic VIM, four of the five identified two-way interactions and all five three-way interactions seem to have a significant influence on the outcome. For the more computationally intensive removal and permutation VIMs, the results seem to be inferior, since only two of the five two-way interactions and, respectively, three or four of the five three-way interactions are significant.

Table 2 Numbers of identified terms from Fig. 12 that were significant with respect to \(\alpha = 5\%\) using a false discovery rate adjustment

Note that the VIMs of the single input variables depicted in Fig. 12 are considerably higher than the VIMs of the interaction terms, yet the single input variables were not significant. As discussed in the simulation study in Sect. 5.1.4, this is due to the fact that the VIMs for single input variables also capture the importance of interactions that contain the input variable of interest. Thus, if a single input variable is part of many interactions, this inflates its importance value without leading to a significant main effect of the variable. For example, the most influential input variable across all three VIM calculation approaches, rs1060176D, is in every considered situation part of one identified interaction term.

5.2.4 Additional real data evaluations

logicDT and the comparable methods are also evaluated in additional experiments using 24 real data sets from various application fields. The main result is that logicDT induces high predictive performances among the single-model procedures in the application to these additional real data sets. Among the ensemble methods, bagged logicDT also induces relatively high predictive performances for most data sets. More details on the analyses of the additional real data sets can be found in Appendix 4.

6 Discussion

In this article, we have presented a statistical learning procedure called logicDT that is specifically tailored to finding interactions between binary input variables and that can also take continuous covariables into account by fitting regression models in the decision tree branches. In contrast to, e.g., logic regression, all possible interactions of the binary input data with this continuous covariable can be included in the prediction model as well as interactions between interactions of the binary input data. logicDT is aimed at maximizing both predictive power and interpretability motivated by applications in genetic epidemiology.

As a simulation study as well as real data applications show, logicDT is able to fulfill these objectives and yields comparable or better predictive performances than similar methods, while maintaining interpretability, which is lost when applying most other approaches. Moreover, through the theory of simulated annealing and the theory of decision trees, the theoretical success of logicDT, i.e., that the true underlying regression function is asymptotically attained, could be proven.

For maximizing the predictive performance regardless of being able to interpret how exactly predictions are made, bagging can be applied to logicDT, yielding performances comparable to those of state-of-the-art algorithms such as random forests or gradient boosting.

Through different VIMs and VIM adjustment approaches for measuring the importances of interactions and specific conjunctions, highly predictive bagged logicDT models are still very useful for deriving which variables influence the outcome in interaction with which other variables. In comparison to standard VIM approaches, the proposed interaction VIM is able to capture influences of interactions and is not restricted to single input variables. Note that the proposed VIM adjustment approaches can also be applied to other statistical learning procedures, e.g., black-box methods such as deep neural networks or random forests, since no restricting assumptions on the model fitting procedure itself are made in these approaches.

Fitting logicDT models is computationally intensive due to the global search via simulated annealing and, in particular, takes more time than fitting conventional decision trees that employ a greedy algorithm. However, as could be seen in the simulation study and the real data applications, logicDT consistently outperformed conventional decision trees in terms of predictive performance. Moreover, logicDT still does not seem to be slower than other interpretability-focused methods such as logic regression or RuleFit. A model fitting time evaluation of logicDT and other procedures in the simulation study and real data application can be found in Appendix 5.

logicDT was designed for interpretable modeling in low- to mid-dimensional problems, e.g., considering single genes, pathways, or selections of SNPs that significantly influenced the outcome in prior analyses. In theory, logicDT can be applied to problems with an arbitrarily large number p of input variables, as, under certain assumptions, its computational complexity is polynomial in p (see Sect. 3.8). In practice, however, only finitely many computational resources are available. In simulations considering 1000 SNPs (i.e., \(p=2000\) input variables due to splitting each SNP into two binary variables) and a more complex underlying model, logicDT still induced relatively high predictive performances (see Sect. 5.1.5). Hence, we recommend applying logicDT in situations with \(p \le 2000\). For comparison, in the software implementation of logic regression, in which a stochastic search algorithm is also employed, the authors allow a maximum of \(p=1000\) input variables (Kooperberg & Ruczinski, 2022).

The main issue of conventional decision trees is their instability, i.e., that small modifications of the training data set imply disproportionately severe alterations of the fitted model. This behavior is mainly induced by the greedy fitting algorithm (Li & Belford, 2002; Murthy & Salzberg, 1995). logicDT aims at identifying the globally optimal set of predictors and interactions responsible for the variation in the outcome. Thus, only important predictors are used for fitting the decision tree, and interactions are already covered by single splits. Therefore, the instability issue should be diminished by logicDT.

The search procedure in logicDT utilizes the training data both for fitting decision trees and for scoring them to guide the search, which might suggest that this leads to overfitting. However, both training trees based on states and evaluating states are part of the logicDT fitting procedure, and the balance between overfitting and underfitting is controlled by the hyperparameters tuned using independent validation data (see Sect. 3.6). Moreover, established statistical modeling approaches such as stepwise linear regression or logic regression also employ the full training data set for both fitting the models and guiding the search. Nonetheless, one idea might be to further split the available training data into training data for fitting the decision tree based on the considered state and inner validation data for scoring the state’s performance. However, due to the need for further splitting the available data, fewer observations are available for both the tree fitting step and the scoring step, leading to a decreased performance (on independent test data) compared to the original algorithm in empirical experiments (see Appendix 6). Moreover, the resulting model should not heavily rely on the data split used for this inner validation. Hence, ideally, multiple data splits should be used, fitting and scoring multiple trees for one state and averaging the results as in (inner) cross-validation, which leads to an increased computational burden.

Bagged logicDT was designed for situations in which a larger number of input variables influences the outcome or variable/interaction term importances shall be measured. In the simulation study conducted in Sect. 5.1.1, bagged logicDT performed similarly well compared to logicDT due to single logic decision trees being able to fully capture the considered underlying models. In additional simulations considering scenarios with larger numbers of influential input variables (see Sect. 5.1.5) and real data evaluations (see Appendix 4), bagged logicDT was able to achieve higher predictive performances in comparison to logicDT. Nevertheless, in these additional analyses, logicDT induced strong performances compared to other single-model methods.

For bagged logicDT, one idea to further increase its performance might be to further randomize the search, similar to random forests. This could be realized by selecting a random sample of the neighbor states to be evaluated in each iteration of the greedy search, which is similar to randomly sampling potential splitting variables in random forests. However, this would create another hyperparameter, namely the number of randomly drawn candidate neighbor states, that potentially should be tuned and could depend on the total number of neighbor states, which can differ between the considered states.

logicDT is motivated by applications in genetic epidemiology in which mainly binary input data is analyzed. Although not considered in this article, it is possible to generalize logicDT to numerical input data by considering numerical interactions \(\prod _j X_j\) instead of Boolean conjunctions \(\bigwedge _j X_j\), where in the case of binary input data, these two definitions coincide.

The development of logicDT was, more precisely, motivated by the problem of constructing genetic risk scores that are usually built based on linkage-disequilibrium-based pruned SNPs, i.e., SNPs that can be interpreted as independent variables (So & Sham, 2017; Dudbridge & Newcombe, 2015). Therefore, throughout this manuscript, the assumption was made that there are no strong correlations between the considered input variables. In future research, logicDT and the interaction VIM could be analyzed and potentially adjusted for settings in which strong correlations between input variables exist so that, ideally, input variables (highly) correlated with truly predictive input variables do not diminish the importance of these truly predictive input variables.

If, additionally, a quantitative variable such as a quantitative environmental variable is considered, logicDT uses this covariable to fit regression models in the leaves of the decision tree. Since logicDT splits, in the context of genetic epidemiology, on genetic variants, a gene-environment interaction is present if and only if the leaf regression models differ by more than fixed offsets describing marginal effects of the genetic variants. Thus, in future research, logicDT could be expanded for statistically testing the presence of a gene-environment interaction in the considered subregion of the DNA.

Moreover, the proposed interaction importance measuring methodology could also be expanded for statistically testing if certain single input variables or interaction terms significantly influence the outcome. This can, e.g., be used in the context of genetic epidemiology, testing the presence of gene-gene interactions. For implementing this testing procedure, the variable importance testing framework proposed by Watson and Wright (2021) might be applied to the importance measures proposed in this manuscript.

7 Conclusion

logicDT yields highly interpretable decision trees with superior predictive performances compared to other single-model procedures such as standard decision trees by being able to detect interaction effects between binary predictors at the split level. Fitting ensembles of logicDT models through bagging can further increase the predictive performance if many predictors have effects on the outcome. The novel VIM adjustment procedure can be applied to these logicDT ensembles to derive which input variables influence the outcome, in which interplay, and to which extent, thereby also measuring the importance of interaction effects between input variables.