1 Introduction

Mixed-effect or multilevel models (Snijders and Bosker 2012; Pinheiro and Bates 2006) are a valuable class of models for hierarchical/clustered data. Typical hierarchical data consist of statistical units (level 1 units) nested into clusters (level 2 units). Classic examples are students clustered within schools (individual cross-sectional data) or children's growth evaluated at several time points (repeated measures). Mixed-effect models account for unit clustering by including both fixed and random effects. The standard linearity assumption for the fixed effects is too stringent in many situations, and a more flexible model specification, including non-linear effects and interactions, might be required. A worthwhile approach exploits regression trees (CART) (Breiman et al. 1984) to capture non-linear fixed effects and interactions via a piece-wise constant regression function. Several tree-based algorithms have been proposed in the literature to improve CART performance in different ways; see, for instance, Loh (2002), Hothorn et al. (2006), Dusseldorp et al. (2010) and the subsequent literature on ensemble learning.

Tree-based models typically assume iid data and therefore ignore a clustered data structure when present. As it is well known that ignoring the clustering structure can lead to biased inference (see, e.g. Bryk and Raudenbush 2001, for linear models), several proposals have been made in the literature to adapt tree-based models to clustered, multilevel and longitudinal data. In the framework of longitudinal data, Segal (1992) was, to our knowledge, the first to deal with this topic, generalizing the CART algorithm and its loss function to the case of correlated multiple binary responses. Zhang (1998) discussed the case of multiple binary response variables using the generalized entropy criterion as impurity measure, which is linked to the log-likelihood of a specific exponential family distribution. Other contributions are due to Abdolell et al. (2002), Loh and Zheng (2013) and Eo and Cho (2014), among many others. Being specific to longitudinal data, these solutions cannot be adopted when the clustered data do not have a regular structure.

In the general framework of multilevel data, regression trees have been extended to clustered data by Hajjem et al. (2011) and Sela and Simonoff (2012), who model the fixed effects with a decision tree while automatically accounting for the random effects with a linear mixed model in a separate step. In particular, Hajjem et al. (2011) proposed mixed-effect regression trees (MERT), where CART selects the tree that models the fixed-effect part, and a node-invariant linear structure is used to model the random effects. The algorithm is implemented within the framework of the expectation-maximization (EM) algorithm. Sela and Simonoff (2012) independently proposed a similar solution, called Random Effects/EM trees (RE-EM tree). They show that random-effect regression trees are less sensitive to parametric assumptions and provide substantially improved predictive power compared to linear models with random effects and to regression trees without random effects. The literature has grown with variants and extensions (e.g. Hajjem et al. 2014; Fu and Simonoff 2015; Hajjem et al. 2017; Miller et al. 2017; Pellagatti et al. 2021). Other solutions combining trees and clustered data have been proposed in the framework of subgroup analysis; see, for instance, Fokkema et al. (2018) and Seibold et al. (2019).

This paper proposes a semi-parametric mixed-effect model where trees are added to a linear component to capture interactions and non-linear effects. The idea of adding a linear component to a tree is not new in the literature. While regression trees can easily capture complex dependencies by reconstructing the entire regression function, they need many fortuitous splits to recreate a linear or quasi-linear function (Friedman et al. 2001). For this reason, tree-based models are said to struggle to fit linear or additive dependencies. To overcome this issue, Dusseldorp and Meulman (2004) were the first to propose combining a linear part and a regression tree in a single model, called the regression trunk model, which can be efficiently estimated by the Simultaneous Threshold Interaction Modelling Algorithm (STIMA) proposed by Dusseldorp et al. (2010). Here we extend regression trunk models to the case of mixed-effect models. Moreover, we include multiple trees in the tree component instead of one. For the particular application we have in mind, we consider the case of three trees. The first tree captures non-linear relationships and interactions among level 1 predictors, the second tree captures non-linear relationships and interactions among level 2 predictors, while the third is devoted to detecting cross-level interactions. This last kind of interaction is of particular interest in multilevel analysis (see, for instance, Bauer and Curran 2005).

Following the reasoning of Efron (2020), the proposed approach lies in between a pure predictive algorithm and a more traditional regression method, providing an easy-to-interpret mixed-effect model with good predictive properties. The linear component helps keep the three trees as shallow as possible, while the three trees act as weak learners from the point of view of prediction. The resulting model is more easily interpretable than a single tree or a random forest. Moreover, it has better predictive performance than a linear mixed-effect model. We call this class Three-tree mixed-effect (3Trees) models.

For the estimation of 3Trees models, we propose two different algorithms. The first algorithm relies on a backfitting-like procedure for selecting the three-tree component, and then it jointly estimates the linear and tree-based parts. It provides good predictions of the response Y for new units in already observed or new clusters. Therefore, this algorithm is useful when a study has predictive purposes only. The second algorithm modifies the first one by introducing a post-selection inference procedure, which provides a pruning method based on hypothesis tests and valid confidence intervals for the predicted values. Specifically, we apply a split sample procedure (Cox 1975) on the clusters. We evaluate the proposed algorithms through simulations; both turn out to be computationally efficient, with satisfactory performance.

The remainder of the paper is organized as follows. Section 2 describes the proposed Three-tree mixed-effect model and its interpretation. Section 3 presents the two estimating algorithms. Section 4 reports a simulation study to evaluate the performance of the two algorithms in the case of a random intercept model. Section 5 illustrates an application of the 3Trees model to analyze the Maths scores achieved by Italian children in a national standardized test. The last section offers concluding remarks and outlines directions for future work.

2 The Three-tree mixed-effect model

In this section, we present the mixed-effect model called Three-tree mixed-effect model (3Trees model). For simplicity, we focus on the case of random intercept mixed models, but the extension to random slopes is straightforward. See Appendix B for an example.

Suppose that \(Y_{ij}\) is a quantitative response variable measured on unit i, \(\,i = 1, \ldots , n_j\), belonging to cluster/group j, with \(j = 1, \ldots , J\) and \(n_{tot} = \sum _{j=1}^J n_j\). We assume that the true data generating process has the form

$$\begin{aligned} Y_{ij}={\mathbf {x}}_{ij}' \varvec{\beta } + {\mathbf {z}}_{j}'\varvec{\gamma } + g({\mathbf {x}}_{ij}, {\mathbf {z}}_{j}) + u_j + \varepsilon _{ij} \end{aligned}$$
(1)

where \({\mathbf {x}}_{ij}\) is the vector of the \(p_1\) individual-level predictors and \(\varvec{\beta }\) the associated main fixed-effect coefficients, including the intercept \(\beta _0\), while \({\mathbf {z}}_{j}\) is the vector of the \(p_2\) cluster-level predictors and \(\varvec{\gamma }\) the associated main fixed-effect coefficients. Let \(p = p_1 + p_2\). Here \(g(\cdot )\) is an unknown function \({\mathbb {R}}^p \rightarrow {\mathbb {R}}\) governing the non-linear fixed effects, assumed not to vary across clusters. The level 1 errors \(\varepsilon _{ij}\) are assumed to be iid \(N(0, \sigma _{\varepsilon }^2)\). The level 2 errors (random effects) \(u_j\) are assumed to be iid \(N(0, \sigma _u^2 )\). Moreover, the level 1 and level 2 errors are assumed to be independent. Thus, the model assumes that responses of units belonging to the same cluster are positively correlated, with correlation equal to

$$\begin{aligned} ICC=\frac{\sigma _u^2}{\sigma _u^2+\sigma _{\varepsilon }^2} , \end{aligned}$$
(2)

while responses of units belonging to different clusters are uncorrelated. In model (1), for each cluster j, the random quantity \( b_j = \beta _0 + u_j\) is the so-called random intercept, varying across clusters, with \(b_j {\perp \!\!\!\perp \,}b_{j'}\) for all \(j \ne j' \in \{1, \ldots , J\}\).
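As a concrete illustration, the ICC in (2) can be recovered from a fitted random-intercept model in lme4 (the R package used later in Sect. 3); the following is a minimal sketch, and all object names (dat, y, x1, x2, z1, z2, cluster) are illustrative.

library(lme4)
fit <- lmer(y ~ x1 + x2 + z1 + z2 + (1 | cluster), data = dat)
vc  <- as.data.frame(VarCorr(fit))        # estimated variance components
s2_u   <- vc$vcov[vc$grp == "cluster"]    # level 2 variance sigma_u^2
s2_eps <- vc$vcov[vc$grp == "Residual"]   # level 1 variance sigma_eps^2
icc <- s2_u / (s2_u + s2_eps)             # Eq. (2)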

As the non-linear function \(g(\cdot )\) is unknown, we propose a nonparametric approximation based on regression trees. In particular, considering a two-level structure, we propose to use three trees. The first tree accounts for non-linearities at level 1, the second tree accounts for non-linearities at level 2, and the third one accounts for non-linearities among all predictors. The primary purpose of the third tree is to account for cross-level interactions, namely interactions among level 1 and level 2 predictors. We call this model the Three-tree mixed-effect (3Trees) model

$$\begin{aligned} Y_{ij} = {\mathbf {x}}_{ij}' \varvec{\beta }+ {\mathbf {z}}_{j}'\varvec{\gamma } + \sum _{t=1}^{3} T_{t}({\mathbf {h}}_{tij}) + u_j + \varepsilon _{ij}. \end{aligned}$$
(3)

In the 3Trees model, the function \(g(\cdot )\) of Eq. (1) is approximated by the sum of three regression trees, \(T_{t}\) (\(t = 1,2,3\)), each constructed over a subset of predictors. Specifically, \({\mathbf {h}}_{\varvec{1} ij} \subseteq {\mathbf {x}}_{ij}\,\) (level 1 predictors), \(\,{\mathbf {h}}_{\varvec{2} ij} \subseteq {\mathbf {z}}_{j}\) (level 2 predictors) and \({\mathbf {h}}_{\varvec{3} ij} \subseteq ({\mathbf {x}}_{ij} \cup \, {\mathbf {z}}_{j})\) (level 1 and level 2 predictors).

The model is additive in its components, and the tree component acts as a region-specific intercept. Model (3) can be written equivalently as

$$\begin{aligned} Y_{ij}={\mathbf {x}}_{ij}' \varvec{\beta }+ {\mathbf {z}}_{j}'\varvec{\gamma } + \sum _{t=1}^{3} \sum _{m=1}^{M_t} \mu _{t m} {\mathbb {I}}\{{\mathbf {h}}_{\varvec{t} ij} \in R_{t m} \} + u_j + \varepsilon _{ij}, \end{aligned}$$
(4)

where \(R_{t 1}, \ldots , R_{t M_t}\) is the partition of the predictor space corresponding to the tree \(T_{t}\). From this point of view, each tree acts as a factor with \(M_t\) categories. Each category corresponds to a tree leaf, representing a region of the selected partition of the predictor space. The region \(R_{t m}\) can be identified by multiplying all the dummy variables defined by the binary splits along the path from the root node to each leaf in the tree \(T_{t}\). Therefore, some classical identifiability constraints, such as sum-to-zero or corner constraints, must be included for each of these three factors.
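For instance, with rpart the leaf membership of each observation can be extracted and treated as a factor; the sketch below (illustrative names, with R's default treatment contrasts acting as corner constraints) builds the factor for one tree.

library(rpart)
t1 <- rpart(y ~ x1 + x2, data = dat,
            control = rpart.control(maxdepth = 2, cp = 1e-4))
dat$leaf1 <- factor(t1$where)   # one level per region R_{1m} of the partition
# With treatment (corner) contrasts the first level is the reference region,
# so leaf1 enters the linear predictor as M_1 - 1 dummy variables.
contrasts(dat$leaf1)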

When the unknown regression function can be assumed to be quasi-linear (Wermuth and Cox 1998), the number of leaf nodes \(M_t\) \((t=1,2,3)\) can be kept small. In this situation, the proposed class of models is most convenient, providing a good balance between performance and interpretability. The presence of three trees instead of one, e.g. as used for iid data in STIMA (Dusseldorp et al. 2010), allows us to achieve the same goodness of fit with trees having smaller depth. In addition, devoting the trees to specific sets of predictors improves the ability of the algorithm to detect different kinds of non-linear effects.

The choice of the subsets of variables \({\mathbf {h}}_{\varvec{1} ij}\), \({\mathbf {h}}_{\varvec{2} ij}\) and \({\mathbf {h}}_{\varvec{3} ij}\) to be included in each tree is driven by subject-matter considerations. As a general practice, we recommend including all the potentially relevant variables and leaving it to the algorithm to select the most pertinent ones. Notice that the presence of the linear component attenuates the typical issue of tree-based algorithms with correlated predictors.

It is worth noting that if the regression function has strong non-linearities, with abrupt inversions in the sign of dependence, model (3) may require deep trees to achieve good performance. To overcome this issue, one can include flexible functions, such as polynomials or splines, in the linear part. In this case, the so-called linear component captures such strong non-linearities, while the tree component mostly picks up the interaction effects.

The main difference between our procedure and previous proposals (Hajjem et al. 2011; Sela and Simonoff 2012) is the inclusion of the linear component \({\mathbf {x}}_{ij}' \varvec{\beta }+{\mathbf {z}}_j' \varvec{\gamma }\) in the mixed-effect model (3). This inclusion helps avoid overfitting and aids interpretation.

3 Two estimating algorithms

We propose two iterative procedures to select the trees and estimate the model parameters. While previous proposals for trees with clustered data were based on a kind of EM algorithm (see, for instance, Hajjem et al. 2011), we propose a procedure similar to backfitting for the selection of the trees.

Recalling that we denoted the total number of sample units as \(n_{tot}=\sum _{j=1}^J n_j\), let \(\varvec{X}\) be the \(n_{tot} \times p_1\) matrix of the individual-level predictors, with rows \(\mathbf{x} _{ij}'\). Let \(\varvec{Z}\) be the \(n_{tot} \times p_2\) matrix of the cluster-level predictors, whose rows equal \(\mathbf{z }_j'\) for all the units \(i=1, \ldots ,n_j\) of the same cluster j. Denote by \(\varvec{H}_L = (\varvec{X}, \varvec{Z})\) the \({n_{tot} \times p}\) matrix, with \(p=p_1 + p_2\), of all the predictors included in the linear component; by \(\varvec{H}_1\) the \({n_{tot} \times h_1}\) matrix, \(\,h_1 \le p_1\), whose columns are selected from \(\varvec{X}\) to be considered for the tree \(T_1\); by \(\varvec{H}_2\) the \({n_{tot} \times h_2}\) matrix, \(\,h_2 \le p_2\), whose columns are selected from \(\varvec{Z}\) to be considered for the tree \(T_2\); and by \(\varvec{H}_3\) the \({n_{tot} \times h_3}\) matrix, \(\,h_3 \le p\), whose columns are selected from \(\varvec{H}_L\) for \(T_3\). Finally, call \(\varvec{y}\) the \(n_{tot} \times 1\) vector of the responses.

Algorithm 1 Pseudo-code of the 3Trees-Alg procedure

The first algorithm we propose, called 3Trees-Alg and whose pseudo-code can be found in Algorithm 1, has predictive purposes. It consists of two main steps: the Selection step and the Estimation step. The Selection step aims to find the best trees conditionally on the linear component. The algorithm initialises the sum of the three trees at the sample mean and the linear component at the standard least squares estimates. This choice provides a neutral starting point that, in our experience, facilitates algorithm convergence. Then the procedure iteratively runs the CART algorithm (Breiman et al. 1984) on the partial residuals with respect to the other trees and the mixed-effect linear component. At the end of each iteration, the algorithm computes the Mean Squared Error (MSE) based on the predictions of the three trees and of the mixed-effect model. The iterative procedure stops when the convergence criterion is met, namely when the difference between the MSEs of two successive iterations falls below a given threshold, or when the maximum number of iterations is reached. Notice that, as the regression function is not smooth, the algorithm can fall into a periodic loop; Hiabu et al. (2021), in a different framework, proved the validity of backfitting in the presence of non-smoothness. Therefore, the Selection step of 3Trees-Alg returns the trees corresponding to the minimum MSE. For the Selection step, one also has to set the tuning parameters of the CART algorithm, such as the maximum depth of the trees, the minimum number of units in a leaf node and the CART complexity parameter. If this latter parameter is set to a value different from zero, then the CART algorithm within 3Trees-Alg performs tree pruning at each iteration via cross-validation.
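To fix ideas, the following is a stylized R sketch of the Selection step under the random intercept model; it is our illustration, not the authors' implementation, and the objects dat, max_iter, tol and the predictor names are assumptions.

library(rpart); library(lme4)
ctrl <- rpart.control(maxdepth = 2, cp = 1e-4)   # CART tuning parameters
max_iter <- 50; tol <- 1e-6                      # stopping rule
dat$f1 <- dat$f2 <- dat$f3 <- mean(dat$y) / 3    # trees initialised so their sum is the sample mean
dat$lin <- predict(lm(y ~ x1 + x2 + z1 + z2, data = dat))  # least squares start
mse <- rep(NA_real_, max_iter)
for (it in seq_len(max_iter)) {
  t1 <- rpart(y - lin - f2 - f3 ~ x1 + x2, data = dat, control = ctrl)  # T1 on partial residuals
  dat$f1 <- predict(t1)
  t2 <- rpart(y - lin - f1 - f3 ~ z1 + z2, data = dat, control = ctrl)  # T2
  dat$f2 <- predict(t2)
  t3 <- rpart(y - lin - f1 - f2 ~ x1 + x2 + z1 + z2, data = dat, control = ctrl)  # T3
  dat$f3 <- predict(t3)
  fit <- lmer(y - f1 - f2 - f3 ~ x1 + x2 + z1 + z2 + (1 | cluster), data = dat)
  dat$lin <- predict(fit, re.form = NA)          # update the fixed linear part
  mse[it] <- mean((dat$y - dat$f1 - dat$f2 - dat$f3 - predict(fit))^2)
  if (it > 1 && abs(mse[it] - mse[it - 1]) < tol) break
}
# return the trees from the iteration attaining the minimum MSE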

In the second step of the 3Trees-Alg algorithm, the Estimation step, each tree selected in the Selection step is specified as a factor, and a mixed-effect model with the linear component and the three new factors as in (4) is fitted via maximum likelihood. This step returns the estimates of all the parameters \(\mu _{tm}\), with \(m=1, \ldots , M_{t}\), \(t=1,2,3\), for the tree component, the vectors of coefficients \(\varvec{\beta }\,\) and \(\,\varvec{\gamma }\) for the linear component, and the variances \(\,\sigma _u^2\) and \(\sigma ^2_{\varepsilon }\).
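Continuing the sketch above, the Estimation step reduces to a single maximum likelihood fit with the selected leaves entered as factors (again an illustration under the same assumed names):

dat$leaf1 <- factor(t1$where)   # leaf membership of T1 as a factor
dat$leaf2 <- factor(t2$where)   # leaf membership of T2
dat$leaf3 <- factor(t3$where)   # leaf membership of T3
fit3t <- lmer(y ~ x1 + x2 + z1 + z2 + leaf1 + leaf2 + leaf3 + (1 | cluster),
              data = dat, REML = FALSE)          # model (4), fitted by ML
fixef(fit3t)    # beta, gamma and the mu_{tm} under corner constraints
VarCorr(fit3t)  # sigma_u^2 and sigma_epsilon^2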

It is worth pointing out that the tree component is selected non-parametrically. The CART procedure used to choose the trees is greedy, and it therefore selects the predictor splits that provide the largest local decrease in MSE. This implies a lack of order invariance in the tree component: if we permute the order of the trees in the Selection step, the resulting selected trees can differ. Order invariance can be guaranteed when the subsets of variables included in the trees are disjoint, i.e. each variable participates in the construction of only one tree, and the variables are independent between trees. However, in multilevel settings, we are often interested in detecting cross-level interactions, which requires defining a third tree that includes all predictors; thus the trees cannot be kept disjoint. In principle, one could try all six possible orderings of the trees and choose the best one by comparing predictive performance. However, we suggest the order dictated by the standard strategy of model building in multilevel analysis (Snijders and Bosker 2012), namely first select level 1 predictors, then level 2 predictors, and finally cross-level interactions.

The 3Trees-Alg algorithm, described in Algorithm 1, is implemented in user-written R code (available from the authors), relying on the package lme4 (Bates et al. 2015a) for the estimation of the mixed models and on rpart (Therneau and Atkinson 2019) for selecting the trees via the CART algorithm.

3.1 Algorithm based on post-selection inference

The proposed tree-based mixed-effect model inherits the theoretical properties of parametric models if the model is well specified. If the selected regression function is not assumed to be the true one but rather a good approximation of it, inference refers to the projection parameters (see, for instance, Buja et al. 2019), where the regression function is optimally projected onto the space of the semilinear functions in (4). The identification of the tree component uses the CART algorithm, which is proven to provide a locally optimal solution (Breiman et al. 1984).

In a context where the model is selected using the data, as in our procedure, classical tools for inference, such as p-values and confidence intervals, are invalid (see, for instance, Benjamini 2010; Berk et al. 2013). In particular, when confidence intervals are computed after model selection, the classical procedures yield intervals that are narrower than warranted, with actual coverage below the nominal level. An interesting literature on post-selection inference proposes methods mainly in the context of ordinary least squares and Lasso-type estimators for iid data. A simple post-selection inference procedure, appropriate when the interest is in discovering dependencies and influential predictors in a data-driven model, is the split sampling procedure (Cox 1975). It involves the random division of the sample into two parts, one used to select the model and one used for inference. This procedure allows deriving conditional (post-selection) tests or confidence intervals. In addition, when required, inference based on sample splitting followed by the bootstrap provides assumption-lean, robust confidence intervals for the projection parameters (Rinaldo et al. 2019).

To obtain adequate inference for the parameters of a 3Trees model, we propose adopting a split sampling procedure. To this aim, we need to assume that the maximum size of the selected model is under control, which corresponds to fixing the maximum depth of each tree.

For multilevel data, the standard assumption of iid data applies at the cluster level. Therefore, sample splitting is obtained by randomly partitioning the J clusters into two subsets of approximately equal size. This partition induces a partition of the original data set \( {\mathcal {D}}\) into two subsets, say \( {\mathcal {D}}_S\) and \( {\mathcal {D}}_E\). The first subset, \( {\mathcal {D}}_S\), is used to select the three trees using the Selection step of Algorithm 1. The second subset, \( {\mathcal {D}}_E\), is used for inference on the model parameters using standard linear mixed-effect procedures and software. The proposed algorithm, called 3Trees-Alg-PSI, is summarized in Algorithm 2.

Algorithm 2 Pseudo-code of the 3Trees-Alg-PSI procedure
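In code, the cluster-level split underlying 3Trees-Alg-PSI can be obtained as follows (a minimal sketch with illustrative names):

set.seed(2022)                                    # illustrative seed
ids <- unique(dat$cluster)
sel <- sample(ids, size = floor(length(ids) / 2))
D_S <- dat[dat$cluster %in% sel, ]     # selection half: run the Selection step of Algorithm 1
D_E <- dat[!(dat$cluster %in% sel), ]  # estimation half: fit model (4) and make inference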

Applying 3Trees-Alg-PSI, we obtain an interpretable predictive model lying in between exploratory and confirmatory data analysis. The confidence intervals computed using 3Trees-Alg-PSI can also be used to prune a tree when the complexity parameter in the CART sub-step is set to zero or kept very small. For this purpose, consider the two regions \(R_{t m}\) and \(R_{t m'}\) corresponding to the left and right terminal leaves produced by the same split of the tree \(T_{t}\). Let \(\mu _{t m}\) and \(\mu _{t m'}\) be the corresponding coefficients in the 3Trees model as in (4). Pruning the tree corresponds to joining \(R_{t m}\) and \(R_{t m'}\). Therefore, the tree can be safely pruned whenever the null hypothesis

$$\begin{aligned} H_0: \mu _{t m} = \mu _{t m'} \quad \text { vs} \quad H_A: \mu _{t m} \ne \mu _{t m'} \end{aligned}$$

cannot be rejected or, equivalently, when the confidence intervals for the two parameters overlap. To perform a deeper pruning, one has to check the equality of all the coefficients involved. The pruned 3Trees model is nested in the larger 3Trees model, being obtained by imposing equality constraints on the parameters. Therefore, standard techniques for nested models, such as the likelihood ratio test, can be safely used.
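In R, this pruning can be carried out by collapsing the two leaf levels into one and comparing the nested maximum likelihood fits with anova(); the sketch below continues the previous ones, and the level labels are illustrative.

dat$leaf2_pruned <- dat$leaf2
# merge the two sibling regions (labels "12" and "13" are illustrative)
levels(dat$leaf2_pruned)[levels(dat$leaf2_pruned) %in% c("12", "13")] <- "12+13"
fit_full   <- lmer(y ~ x1 + x2 + z1 + z2 + leaf1 + leaf2 + leaf3 +
                     (1 | cluster), data = dat, REML = FALSE)
fit_pruned <- update(fit_full, . ~ . - leaf2 + leaf2_pruned)
anova(fit_pruned, fit_full)   # LR test: a large p-value supports pruning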

4 Simulation studies

In this section, we propose two simulation studies to evaluate the algorithms presented in Sect. 3 for fitting 3Trees models.

The first simulation study evaluates the predictive performance of Algorithm 1 (3Trees-Alg), while the second is designed to assess the inferential accuracy, in terms of confidence intervals, of Algorithm 2 (3Trees-Alg-PSI).

We consider three scenarios with different data generating processes. Each scenario includes two individual-level predictors, \(X_1\) and \(X_2\), having bivariate Normal distribution, \(N_2(\varvec{0}, \varvec{\varSigma })\), and two cluster-level predictors, \(Z_1\) and \(Z_2\), with distribution \(N_2(\varvec{0},\varvec{\varGamma })\). We set the following values for the covariance matrices

$$\begin{aligned} \varvec{\varSigma }= \varvec{\varGamma }= \begin{pmatrix} 1.0 &{} 0.4 \\ 0.4 &{} 1.0 \end{pmatrix} \end{aligned}$$

The random components are \(u_j \sim N(0, \sigma _u^2)\), with \(\sigma _u^2=3\), and \(\varepsilon _{ij} \sim N(0, \sigma _{\varepsilon }^2)\), with \( \sigma _{\varepsilon }^2=1\). Those values imply a high Intraclass Correlation Coefficient (\(ICC=0.75\)), which allows us to clearly show the relevance of using the random effects in predicting the response. The data generating processes of the three scenarios differ in the shape of the regression function, as described in the following.

Scenario 1: we assume the following linear model

$$\begin{aligned} Y_{ij}= \beta _0 + \beta _1 X_{1ij} + \beta _2 X_{2ij} + \gamma _1 Z_{1j} + \gamma _2 Z_{2j} + u_j + \varepsilon _{ij} \end{aligned}$$

where \(\beta _0 = 5\,\), \(\,\beta _1 = 1\,\), \(\,\beta _2 = 0\,\), \(\,\gamma _1 = 1\) and \(\gamma _2 = 0\). In this scenario, the true model is a linear mixed effect model.

Scenario 2: we assume a quasi-linear model that includes linear terms, threshold non-linear terms and interaction terms, as follows

$$\begin{aligned} Y_{ij} = \beta _0&+ \beta _1 X_{1ij} + \beta _2 X_{2ij} + \gamma _1 Z_{1j} + \gamma _2 Z_{2j} + \mu _1 {\mathbb {I}}(X_{1ij}\ge 0) \\&+ \mu _2 {\mathbb {I}}(Z_{2j}<0) + \mu _3 {\mathbb {I}}(X_{1ij}< 0)\,{\mathbb {I}}(Z_{2j}\ge 0) + u_j + \varepsilon _{ij}, \end{aligned}$$

where \(\beta _0 = 5\,\), \(\,\beta _1 = 1\,\), \(\,\beta _2 = 0\,\), \(\,\gamma _1 = 1\) and \(\gamma _2 = 0\,\), \(\,\mu _1 = 2\,\), \(\,\mu _2 = 3\) and \(\mu _3 = -3\).
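For concreteness, one replication of Scenario 2 can be generated in R as follows; this is a sketch of our reading of the design, with MASS::mvrnorm as one possible choice for the bivariate Normals.

library(MASS)
J <- 500; nj <- 100
S <- matrix(c(1, 0.4, 0.4, 1), 2, 2)            # Sigma = Gamma
X  <- mvrnorm(J * nj, c(0, 0), S)               # level 1 predictors X1, X2
Zj <- mvrnorm(J, c(0, 0), S)                    # level 2 predictors Z1, Z2
g  <- rep(seq_len(J), each = nj)                # cluster labels
u  <- rep(rnorm(J, sd = sqrt(3)), each = nj)    # sigma_u^2 = 3
eps <- rnorm(J * nj)                            # sigma_eps^2 = 1
y <- 5 + X[, 1] + Zj[g, 1] + 2 * (X[, 1] >= 0) + 3 * (Zj[g, 2] < 0) -
  3 * (X[, 1] < 0) * (Zj[g, 2] >= 0) + u + eps  # beta_2 = gamma_2 = 0
dat <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2],
                  z1 = Zj[g, 1], z2 = Zj[g, 2], cluster = factor(g))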

Scenario 3: we use the non-linear model of Hajjem et al. (2014), except that we replace the third and fourth level 1 predictors (\(X_3\) and \(X_4\)) with level 2 predictors, denoted as \(Z_1\) and \(Z_2\).

$$\begin{aligned} Y_{ij} = \beta _0 + \beta _1 X_{1ij} + \beta _2 X_{2ij} + \gamma _1 Z_{1j} + \gamma _2 Z_{2j} + \mu _1 X_{2ij}^2 + \mu _2 Z_{1j} \ln |X_{1ij}| + u_j + \varepsilon _{ij}, \end{aligned}$$

with \(\beta _0 = 0\,\), \(\,\beta _1 = 2\,\), \(\,\beta _2 = 0\,\), \(\,\gamma _1 = 4\) and \(\gamma _2 = 0\,\), \(\,\mu _1 = 2\,\), \(\,\mu _2 = 2\).

In each scenario, some parameters of the linear part are set to zero to provide a glimpse of the ability of the proposed methods to correctly identify predictors with no linear effect in this low-dimensional setting.

4.1 Predictive performance

For each scenario described in the previous section, we generate 500 data sets with \(J=500\) clusters of \(n_j=100\) units, for a total sample size of \(50\,000\) units. We randomly split each cluster into two equal parts to form the train and the test sets, each composed of \(J=500\) clusters of \(n_j=50\) units.

We compare the predictive performance of Algorithm 1 (3Trees-Alg) for fitting the three-tree models with the following commonly used methods: RE-EM trees (Sela and Simonoff 2012), regression trees (CART) (Breiman et al. 1984), and a linear mixed-effect model with main effects only (Linear-ME) fitted with maximum likelihood. In addition, we include the mixed-effect random forest proposed by Hajjem et al. (2014). As a benchmark, we also report the results for a mixed-effect model specified as the data generating model (True-ME). Notice that in Scenario 1 the True-ME model and the Linear-ME model coincide.

We implement 3Trees-Alg in R (R Core Team 2020), using the packages rpart (Therneau and Atkinson 2019) for the tree component and lme4 (Bates et al. 2015b) for the mixed-effect linear component. For the tree component, we set the rpart maxdepth parameter to 2 for the first two scenarios and to 3 for the third one. The rpart complexity parameter cp is set to 0.0001. The rpart package has also been used for CART, while lme4 has also been used to fit the linear mixed-effect models Linear-ME and True-ME. The RE-EM algorithm has been implemented using the R package REEMtree (Sela and Simonoff 2021), with the complexity parameter for pruning set to 0.01 and the number of standard errors used in pruning set to 1. For the mixed-effect random forest, we used two different R functions. The first, MixRF, used with its default settings, is implemented in the package MixRF (Wang et al. 2016). The second, MERF, is implemented in the package LongituRF (Capitaine et al. 2021). We used the default settings for this function as well, but excluded the stochastic longitudinal component, which is not adequate for the scenarios considered in this section.
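In code, these tuning choices correspond to rpart control objects like the following; the REEMtree call sketches the interface used for the RE-EM fits, with the remaining options left at the package defaults (a sketch, not the exact simulation script).

ctrl12 <- rpart.control(maxdepth = 2, cp = 1e-4)   # Scenarios 1 and 2
ctrl3  <- rpart.control(maxdepth = 3, cp = 1e-4)   # Scenario 3
library(REEMtree)
fit_reem <- REEMtree(y ~ x1 + x2 + z1 + z2, data = dat,
                     random = ~1 | cluster)        # RE-EM tree on the same data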

In clustered data, we can distinguish two types of predicted values for the response variable: (i) the value of the response for a statistical unit of a hypothetical (new) cluster and (ii) the value of the response for a statistical unit belonging to a cluster already included in the sample. The two types of prediction differ in the value assigned to the random effect. In the first type, the random effect is set at its expectation, namely zero, thus predicting the response with the fixed part only. On the other hand, in the second type, the random effect is set at the BLUP prediction for the observed cluster, thus predicting the response with the fixed part plus the BLUP prediction (Robinson 1991; Skrondal and Rabe-Hesketh 2009).

To evaluate the predictive performance of the methods under comparison, we compute the Mean Squared Error on the train data and the Predictive Mean Squared Error on the test data (new data). We denote with MSE and PMSE, respectively, the measures referring to predictions with the fixed part only. In addition, we denote with clusMSE and clusPMSE the measures when the prediction refers to a new unit in an observed cluster, i.e. including the random effect.
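With lme4, the two types of prediction differ only in the re.form argument of predict(); a sketch, assuming the fitted model fit3t of Sect. 3 and a test data frame carrying the same leaf factors:

# (i) unit in a new (hypothetical) cluster: fixed part only
pred_new  <- predict(fit3t, newdata = test, re.form = NA,
                     allow.new.levels = TRUE)
# (ii) new unit in an observed cluster: fixed part + BLUP of u_j
pred_clus <- predict(fit3t, newdata = test, re.form = NULL)
PMSE     <- mean((test$y - pred_new)^2)
clusPMSE <- mean((test$y - pred_clus)^2)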

Table 1 Monte Carlo averages and standard deviations of the Mean Squared Error and Predictive Mean Squared Error, computed with fixed effects only (MSE, PMSE), or including BLUP predictions of the random effects (\(clus\hbox {MSE}\) and \(clus\hbox {PMSE}\)). Scenarios 1, 2 and 3 (\(J=500\,\), \(\,n_j=50\) for both the training set and the test set, number of Monte Carlo runs: 500, except for MixRF and MERF with 250 runs)

Table 1 reports the Monte Carlo averages and standard deviations of the mean squared errors for the methods under comparison. The predictive performance of the proposed 3Trees-Alg algorithm, as measured by \(clus\mathrm{PMSE}\) and \(\mathrm{PMSE}\), is very close to the benchmark of the correctly specified mixed-effect model (True-ME) in Scenarios 1 and 2. As expected, in the highly non-linear model of Scenario 3, the performance of 3Trees-Alg does not reach the benchmark. However, it is still better than that of the other algorithms, apart from the two mixed-effect random forest algorithms. It is worth noting that the performance of 3Trees-Alg improves when the depth of the trees increases. As an example, for Scenario 3 we report the performance of 3Trees-Alg with trees of depth 6: the performance is substantially improved, but at the cost of reduced interpretability. An interesting aspect is the comparison of MSE and PMSE, which are quite similar for the 3Trees algorithm, suggesting that it does not overfit.

As expected, the error measures for predicting a unit in a new cluster (MSE and PMSE) are higher than the corresponding measures relative to the prediction for a unit of an observed cluster (\(clus\hbox {MSE}\) and \(clus\hbox {PMSE}\)). The magnitude is much larger (about three times) due to the high proportion of the between-cluster variation in the simulated data (ICC=0.75). It is worth noting that \(clus\hbox {MSE}\) and \(clus\hbox {PMSE}\) are obtained using BLUP predictions of level 2 errors, which are crucial for obtaining an accurate prediction for a unit of an observed cluster.

The standard deviation of the random effects \(\sigma _u\) is interesting per se and, in addition, it plays a key role in the prediction for an observed cluster since it affects the BLUP of the random effect. We compare the performance of the competing methods in estimating the parameter \(\sigma _u\) by plotting the Monte Carlo distributions (except for CART, which does not provide an estimate of \(\sigma _u\)). In Scenario 1 (Fig. 1), there are no relevant differences, except for MERF, which seems to underestimate the random effect variance.

Fig. 1

Monte Carlo distribution of the estimates of the random-effects standard deviation \(\sigma _u\) for the 3Trees and RE-EM algorithms, compared with the mixed-effect model with the true (linear) regression function, under Scenario 1

On the other hand, Scenarios 2 and 3 are noteworthy, and the results are reported in Fig. 2. In all cases, the proposed method (3Trees-Alg) yields estimates of \(\sigma _u\) similar to the benchmark (True-ME), whereas the performance of RE-EM is not satisfactory. Indeed, RE-EM and MixRF tend to overestimate \(\sigma _u\), with a large bias in Scenario 3, while MERF underestimates it in Scenario 2. The 3Trees algorithm performs better than RE-EM and Linear-ME also in estimating the level 1 standard deviation \(\sigma _{\varepsilon }\), as shown by the plots of the Monte Carlo distributions reported in Appendix A (Figs. 6 and 7).

Fig. 2

Monte Carlo distributions of the estimates of the random-effects standard deviation \(\sigma _u\) for the 3Trees and RE-EM algorithms, compared with the mixed-effect models with the true and the linear regression functions, respectively, under Scenarios 2 and 3

Another aspect to consider is the estimation of the parameters of the linear component, \(\beta _1\,\), \(\,\beta _2\,\), \(\,\gamma _1\) and \(\gamma _2\). Notice that the proposed algorithm is not meant to recover the true data generating process but to predict the response by selecting the quasi-linear model closest to the true data generating process. When the tree component adequately approximates interactions and non-linearities, the parameters of the linear part should be close to the true ones. Table 2 reports the Monte Carlo averages and standard deviations of the estimates of the linear component. In both Scenarios 1 and 2, the 3Trees-Alg algorithm provides estimates as good as the benchmark for both null and non-null parameters. In Scenario 3, the parameters for the main effects of the level 1 predictors are correctly captured by the 3Trees-Alg algorithm, together with the null coefficient of the cluster-level predictor \(Z_2\). Conversely, the coefficient of \(Z_1\) turns out to be underestimated. This behaviour is due to the particular shape of the non-linear component, i.e. the interaction term \(Z_{1j}\ln |X_{1ij}|\), which the tree component probably accounts for only inaccurately, confirming that this scenario would require a deeper tree to capture such a non-linear shape.

The simulation results are similar when we consider a smaller number of clusters and smaller clusters (\(J=50\) and \(n_j=15\)), with a test set for prediction evaluation of equal size. For these simulations, we also consider the predictive performance of a single regression trunk model for iid data (Dusseldorp et al. 2010) estimated via the R package stima with maxsplit set to 2. Table 7 of Appendix A reports the results for the three scenarios in this case. Simulations indicate that the behaviour of the proposed algorithm with respect to the benchmark is confirmed in terms of average clusPMSE, with a slight increase in Monte Carlo variability.

Further settings are considered in Table 8 of Appendix A. Specifically, we report simulation results for five variations of Scenario 2. In Scenario 2A, we set the predictors' covariance to 0.75 instead of 0.4 to check the algorithm's behaviour in the case of highly correlated predictors. Scenario 2B considers the case of independent predictors instead. The 3Trees-Alg algorithm seems unaffected by the correlation between predictors, reporting similar performance in the two scenarios. Scenario 2C considers the case of a lower value of the intraclass correlation coefficient. In this case, the performance remains unchanged in terms of clusPMSE, but it improves in terms of PMSE due to the reduced impact of the random component. Finally, in Scenarios 2D and 2E, the coefficients of the linear part are increased to 5 or decreased to 0.5. Overall, the performance of the 3Trees algorithm is essentially unchanged. To check the behaviour of the procedures with a larger number of predictors, Table 9 in Appendix A presents the results of the same study as Table 2, where eight additional non-significant explanatory variables were added to Scenario 2. The procedure is still able to capture which variables have no linear effect in this quasi-linear scenario.

Table 2 Monte Carlo averages and standard deviations of the estimates of the parameters in the linear component in Scenarios 1, 2 and 3 (\(J=500\,\), \(\,n_j=50\) for both the training set and the test set, number of Monte Carlo runs: 500)

4.2 Inferential accuracy

Algorithm 3Trees-Alg-PSI (Algorithm 2 in Sect. 3.1) is devised to provide valid confidence intervals. To evaluate its performance, we run a further simulation study applying the 3Trees-Alg-PSI algorithm to 500 data sets of \(J=500\) clusters of size \(n_j=100\).

For comparison, we compute the confidence intervals for the parameters of interest using a correctly specified mixed-effect model (true model) fitted both on the entire data set (True-ME) and on the same split sample used in the Estimation step of Algorithm 3Trees-Alg-PSI, labelled Split-True-ME. We compare the Monte Carlo averages of the interval length and the Monte Carlo actual coverage. We report here the results for Scenario 2 (Table 3), while those for Scenarios 1 and 3 are in Appendix A (Table 10).

Table 3 Monte Carlo average length and Monte Carlo coverage of the 95% confidence intervals for the parameters of the linear component and for the random-effects standard deviation (Scenario 2, \(J=500\), \(n_j=100\), number of Monte Carlo runs: 500)

Note that 3Trees-Alg-PSI is directly comparable with Split-True-ME, as both are fitted on half of the data set. In summary, the length of the confidence intervals obtained with 3Trees-Alg-PSI is in line with a mixed-effect model fitted on the halved sample (Split-True-ME). Instead, the confidence intervals obtained by True-ME are shorter since they are obtained using the whole data set. Therefore, the loss of efficiency is attributable to the reduced sample size and not to the tree selection procedure. Moreover, 3Trees-Alg-PSI provides confidence intervals with the expected coverage, in line with correctly specified mixed-effect models, as if the regression function were known. The results for Scenario 1 (Table 10) are in line with those of Scenario 2. As expected, the performance is worse for Scenario 3: the complexity of this data generating process requires deeper trees, as highlighted by the simulations in Table 1.

5 An illustrative example on Invalsi tests in Italian schools

We apply a 3Trees model to data from the Italian Invalsi test (see, e.g. Cardone et al. 2019). These data concern students who took the Invalsi Math tests while attending the \(5^{th}\) grade in 2013–2014 and then the \(8^{th}\) grade in 2016–2017. The Invalsi data have a multilevel structure where the individual-level units are the pupils and the cluster-level units are the schools grouping them. The data set contains \(409\,528\) pupils in \(5\,777\) Italian schools, with an average of 70.94 tested students per school. The aim here is to predict Math achievement at the \(8^{th}\) grade, while gaining some insight into why results differ. The student- and school-level predictors are described in Table 4.

Table 4 Student- and school-level variables

We exploit the 3Trees model (4) with pupils as level 1 units and schools as level 2 units, where the random-effect part aims to capture unobserved factors at the school level. We first use 3Trees-Alg (Algorithm 1) to predict the Math score at the \(8^{th}\) grade. Table 5 compares the performance of the 3Trees model using Algorithm 1 (with maximum tree depth set to 3) with the same algorithms considered in the simulation study. 3Trees-Alg has the best performance both for predicting the Math score of a student in an observed school (clusPMSE) and for predicting the response of a new student in a new school (PMSE). RE-EM provides the second-best prediction when cpmin\(=0.0001\), yielding a tree with 55 terminal nodes. Its performance when cpmin\(=0.001\) is worse, but the resulting tree, with 15 terminal nodes, is more interpretable. The CART algorithm, ignoring the two-level structure of the data, performs worse than 3Trees and Linear-ME but avoids overfitting, thanks to the pruning procedure based on cross-validation.

The three methods accounting for clustering show a reduction of the PMSE when using the random effects to predict the response for a new unit in an observed cluster (\(clus\hbox {PMSE}\)). For the 3Trees model, this reduction is about \(12\%\), which is much smaller than what was observed in the simulations (Table 1) due to the lower value of the ICC (0.13 versus 0.75).

Table 5 Comparison of the predictive performance for Math score in Invalsi data (\(clus\hbox {MSE}\) and MSE on train data, \(clus\hbox {PMSE}\) and PMSE on test data)

In order to make inference on the coefficients of the predictors, we apply the 3Trees-Alg-PSI (Algorithm 2), which randomly splits the schools into two independent sub-samples for post-selection inference. Estimating the contextual effects of socio-economic status (SES) and mathematical background (MATH5) of pupils belonging to the same school is relevant for inferential purposes. To this end, we add to the predictors the school averages of these two variables, named gr.M(SES) and gr.M(MATH5). To facilitate the interpretation of the results and keep control of the maximum size of the selected model, we set the maximum tree depth at three.
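In R, such school-level averages can be attached with one line per variable; a sketch assuming a data frame invalsi with columns SES, MATH5 and a school identifier:

invalsi$gr.M.SES   <- ave(invalsi$SES,   invalsi$school)   # gr.M(SES): school mean of SES
invalsi$gr.M.MATH5 <- ave(invalsi$MATH5, invalsi$school)   # gr.M(MATH5): school mean of MATH5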

We first describe the tree component of the 3Trees model selected by 3Trees-Alg-PSI. The algorithm represents the predictors' partition of each tree as a factor with \(M_t\) labels corresponding to the partition regions. The factor is then parameterized with \((M_t -1)\) dummy variables to satisfy the usual identifiability constraints, the first region serving as the reference. For simplicity, in the factor description, T stands for Tree and R for Region. The regions selected on the student-level predictors by the first tree are depicted in Fig. 3 and summarized as follows.

$$ \begin{aligned} \begin{array}{ll} \hbox {T1 R1.8 (reference):} & \, {\mathbb {I}} \left( 23 \le \text {MATH5}< 67 \,\, \& \,\, \text {SES} \ge -1.21\right) \\ \hbox {T1 R1.9 :} & \, {\mathbb {I}} \left( 23 \le \text {MATH5}< 67 \,\, \& \,\, \text {SES}< -1.21\right) \\ \hbox {T1 R1.10 :} & \, {\mathbb {I}} \left( 17 \le \text {MATH5}< 23 \right) \\ \hbox {T1 R1.11 :} & \, {\mathbb {I}} \left( \text {MATH5}< 17\right) \\ \hbox {T1 R1.12 :} & \, {\mathbb {I}} \left( \text {MATH5} \ge 91 \,\, \& \,\, \text {SES}< -0.62\right) \\ \hbox {T1 R1.13 :} & \, {\mathbb {I}} \left( 23 \le \text {MATH5}< 67 \,\, \& \,\, \text {SES}< -0.62\right) \\ \hbox {T1 R1.14 :} & \, {\mathbb {I}} \left( \text {MATH5} \ge 67 \,\, \& \,\, -0.62 \le \text {SES} < -0.16\right) \\ \hbox {T1 R1.15 :} & \, {\mathbb {I}} \left( \text {MATH5} \ge 67 \,\, \& \,\, \text {SES} \ge -0.16\right) \\ \end{array} \end{aligned}$$
Fig. 3

First tree of the 3Trees model on Invalsi data using Algorithm 2

The second tree, on the school-level predictors, is depicted in Fig. 4 and its selected regions are listed below.

$$ \begin{aligned} \begin{array}{ll} \hbox {T2 R2.4 (reference) :} & \, {\mathbb {I}} \left( \text {gr.M(MATH5)} \ge 82 \,\, \& \,\, \text {SCSIZE}< 3.5 \right) \\ \hbox {T2 R2.5 :} & \, {\mathbb {I}} \left( \text {gr.M(MATH5)} \ge 82 \,\, \& \,\, \text {SCSIZE} \ge 3.5 \right) \\ \hbox {T2 R2.12 :} & \, {\mathbb {I}} \left( \text {gr.M(MATH5)}< 54 \,\, \& \,\, \text {gr.M(SES)} \ge -0.89 \right) \\ \hbox {T2 R2.13 :} & \, {\mathbb {I}} \left( \text {gr.M(MATH5)}< 54 \,\, \& \,\, \text {gr.M(SES)}< -0.89 \right) \\ \hbox {T2 R2.14 :} & \, {\mathbb {I}} \left( 54 \le \text {gr.M(MATH5)}< 82 \,\, \& \,\, \text {gr.M(SES)} \ge -0.88 \right) \\ \hbox {T2 R2.15 :} & \, {\mathbb {I}} \left( 54 \le \text {gr.M(MATH5)}< 82 \,\, \& \,\, \text {gr.M(SES)} < -0.88 \right) \\ \end{array} \end{aligned}$$
Fig. 4

Second tree of the 3Trees model on Invalsi data using Algorithm 2

Finally, the third tree jointly considers all predictors. The selected regions are depicted in Fig. 5 and listed below.

$$ \begin{aligned} \begin{array}{ll} \hbox {T3 R3.8 (reference):} & \, {\mathbb {I}} \left( 35 \le \text {MATH5}< 77 \,\, \& \,\, \text {AREA} \ne \text {South, Islands} \right) \\ \hbox {T3 R3.9 :} & \, {\mathbb {I}} \left( 35 \le \text {MATH5}< 77 \,\, \& \,\, \text {AREA} = \text {South, Islands} \right) \\ \hbox {T3 R3.10 :} & \, {\mathbb {I}} \left( \text {MATH5} \ge 77 \,\, \& \,\, \text {AREA} = \text {South, Islands} \right) \\ \hbox {T3 R3.11 :} & \, {\mathbb {I}} \left( \text {MATH5} \ge 77 \,\, \& \,\, \text {AREA} \ne \text {South, Islands} \right) \\ \hbox {T3 R3.12 :} & \, {\mathbb {I}} \left( \text {MATH5}< 35 \,\, \& \,\, \text {AREA} = \text {Center} \right) \\ \hbox {T3 R3.13 :} & \, {\mathbb {I}} \left( \text {MATH5}< 35 \,\, \& \,\, \text {AREA} = \text {North--West, North--East} \right) \\ \hbox {T3 R3.14 :} & \, {\mathbb {I}} \left( \text {MATH5}< 35 \,\, \& \,\, \text {AREA} = \text {South, Islands} \,\, \& \,\, \text {gr.M(MATH5)}< 64 \right) \\ \hbox {T3 R3.15 :} & \, {\mathbb {I}} \left( \text {MATH5} < 35 \,\, \& \,\, \text {AREA} = \text {South, Islands} \,\, \& \,\, \text {gr.M(MATH5)} \ge 64 \right) \\ \end{array} \end{aligned}$$
Fig. 5

Third tree of the 3Trees model on Invalsi data using Algorithm 2

The parameter estimates and their confidence intervals are reported in Table 6, in the column labelled 3Trees model. As expected, the Math score at the \(5^{th}\) grade (MATH5) is a key predictor, with a significant main effect in the linear component, in addition to the non-linear and interaction effects captured by the first and third trees. Concerning the other student-level predictors, it is worth noting that one of the most relevant is socio-economic status, confirming an enduring issue and the need for additional support for children with low socio-economic status in order to overcome social inequalities.

Regarding the school-level predictors, the coefficient of the type of school (SCTYPE) is negative. Thus, conditionally on the other predictors, students of public schools have a lower performance.

Looking at the confidence intervals, we note that two school-level predictors, namely the school size (SCSIZE) and the school's location (TOWN), have confidence intervals crossing zero. In addition, the tree for the school-level predictors has two regions (R2.12 and R2.13) whose confidence intervals for the coefficients largely overlap; the same holds for regions R3.10 and R3.11 of the third tree. Consequently, these two trees can be pruned. Table 6 also reports the parameter estimates obtained from a reduced model where the trees are pruned and the non-significant predictors SCSIZE and TOWN are omitted. This reduced model has four parameters fewer than the full model and is nested in it. The pruned 3Trees model is preferred based on the likelihood ratio test comparing the two models (\(\hbox {statistic} = 4.67\), \(\hbox {df} = 4\), p-\(\hbox {value}=0.322\)).

Table 6 Estimates and 95% confidence intervals for the 3Trees model using Algorithm 2 and the corresponding pruned tree

6 Conclusions

Tree-based regression models are a class of predictive algorithms that are conceptually simple and able to take into account both interactions and non-linearities. However, they struggle with linear dependencies and assume iid data.

This paper proposes a multilevel extension of regression trunk models in which the tree component consists of multiple trees. In particular, we call 3Trees model a mixed-effect model with a linear part and three trees. The trees are meant to capture non-linearities and interactions among individual-level predictors, among cluster-level predictors, and across levels. The model can be easily extended to a larger number of trees.

The key idea behind our proposal is to develop a flexible class of models capable of accurately approximating the true regression function while preserving interpretability. Indeed, the limitation of a standard regression tree is that, even if it is able to reconstruct a regression function in the presence of non-linear effects and interactions, it does not directly reveal which effect is linear and which is not, or which predictor is involved in an interaction and which is not. As explained by Gottard et al. (2020), in most tree-based algorithms, including CART and random forests, the ordering of the splitting variables and their importance can differ from the ordering and importance of the direct effects on the response. The interpretation of a tree is in terms of predictive power rather than attribution (Efron 2020) and direct effects, as in traditional regression. From this point of view, 3Trees models constitute an interesting bridge between a proper statistical model and a machine learning algorithm.

To estimate the 3Trees model parameters, we introduce two iterative algorithms. The first algorithm is specific to predictive purposes and showed good predictive performance in the simulation study and in the applied study on Invalsi data. It produces transparent and interpretable predictions. However, this algorithm does not provide valid confidence intervals, as the shape of the regression function is learned during the estimation process. To overcome this limitation, we propose a post-selection inference procedure that provides valid confidence intervals for the coefficients and the predicted values. This procedure is based on sample splitting, which is easy to apply and shows good performance if the number of clusters in the data is not too small. Notice that, in likelihood-based inference for multilevel models, the number of clusters is crucial for estimating the level 2 variance \(\sigma ^2_u\), which in turn affects the standard errors of the regression coefficients, especially those related to level 2 predictors (Elff et al. 2021). In the case of few clusters, say fewer than 30 in each split sample, the level 2 variance is appreciably underestimated, and the confidence intervals are thus shorter than they should be. This issue can be addressed by REML estimation, though at the cost of some efficiency, and it prevents the use of LR tests for pruning.

The proposed algorithms for 3Trees models can be easily extended to allow for predictors with random slopes (see Appendix B). Indeed, there is no need to modify the algorithms; one only needs to specify the random component properly. It is worth noting that the role of a random slope is to account for a large unexplained between-cluster variation in the regression coefficient of a level 1 predictor. This scenario is less likely in a 3Trees model due to the flexibility of the tree component, notably with regard to cross-level interactions. The increase in complexity induced by random slopes may thus not be adequately rewarded in terms of predictive power. It will be up to the researcher to decide whether to adopt a random slope or to leave cross-level interactions to be explained by the trees. In general, random slopes should be considered only for predictors playing a key role in the research aims.

Finally, further work is needed to apply the proposed 3Trees model and related algorithms to high-dimensional settings. When the number of predictors increases, the proposed algorithms can be modified to deal with the high-dimensionality issue. For instance, one could replace the likelihood-based estimation procedure with the lasso-type estimator of Groll and Tutz (2014). In addition, Rügamer et al. (2022) recently proposed a procedure for post-selection inference for linear mixed-effect models and additive models, implementing a multi-stage selection procedure. Further research is needed to evaluate whether their proposal can be adapted to the case of 3Trees models.