1 Introduction

Censoring occurs when, beyond some threshold value, the observed outcome equals the threshold rather than the true latent outcome. For example, scientific equipment can often only make accurate measurements within a known range of outcome values, and observations outside this range are recorded at the limits of that range. Often the estimand of interest is the conditional expectation of, or the conditional average treatment effect on, the outcome before censoring. Estimating a standard regression model on only the uncensored observations, or with censored observations set equal to the threshold values, results in biased estimates. Tobit models directly model the latent outcome and the censoring process (Tobin 1958).

In this paper, we combine the Bayesian Type I Tobit model (Chib 1992) with Bayesian Additive Regression Trees (Chipman et al. 2010). The latent outcome (before censoring) is modeled as a sum-of-trees, which allows for nonlinear functions of covariates. The error term is modeled as a Dirichlet process mixture of normal distributions, as in fully nonparametric BART (George et al. 2019). Smooth data generating processes with sparsity are modelled by soft trees with a Dirichlet prior on splitting variable probabilities, as introduced by Linero and Yang (2018).

In simulations and applications to real data, TOBART-1 outperforms a Tobit gradient boosted tree method, Grabit (Sigrist and Hirnschall 2019), a Tobit Gaussian Process model (Groot and Lucas 2012), standard linear Tobit, and simple hurdle models based on standard machine learning methods. Unlike other methods, TOBART-1 accounts for model uncertainty and can non-parametrically model the error term. Posterior intervals are available for censored outcomes, uncensored outcomes, conditional expectations, and probabilities of censoring. Grabit, Gaussian Processes, and other methods rely on cross-validation for parameter tuning and are sensitive to the tuned variance of the error term, whereas TOBART-1 performs well without parameter tuning and accounts for uncertainty in the variance of the error term.

TOBART-1 with a Dirichlet process mixture of normal distributions for the error term (TOBART-1-NP) removes the restrictive normality assumption often imposed in censored outcome models. We observe that this can lead to more accurate outcome predictions in simulations with non-normally distributed errors, and in real data applications, which may involve non-normally distributed outcomes.

A variety of methods have been proposed for nonparametric and semiparametric censored outcome models. Lewbel and Linton (2002) describe a local linear kernel estimator for the setting in which both the mean function of the uncensored outcome and the error distribution are unknown. Fan and Gijbels (1994) describe a quantile-based local linear approximation method. Huang (2021) introduces a semiparametric method involving B-splines. Chen et al. (2005) use a local polynomial method. Other papers on semiparametric and nonparametric censored outcome regression include Cheng and Small (2021), Heuchenne and Van Keilegom (2007, 2010), Huang et al. (2019), and Oganisian et al. (2021). Gaussian Process censored outcome regression methods are applied by Groot and Lucas (2012), Cao et al. (2018), Gammelli et al. (2020, 2022), and Basson et al. (2023). Zhang et al. (2021) and Wu et al. (2018) implement censored outcome neural network methods.

A number of recent papers have considered Tobit model selection and regularization. Zhang et al. (2012) describe Focused Information Criterion based Tobit model selection and averaging. Jacobson and Zou (2024) provide theoretical and empirical results for Tobit with a Lasso penalty and a folded concave penalty (SCAD). Müller and van de Geer (2016) and Soret et al. (2018) describe LASSO-penalized censored outcome models. Bradic and Guo (2016) study robust penalized estimators for censored outcome regression.

The Bayesian Tobit literature includes quantile regression methods (Ji et al. 2012; Yu and Stander 2007; Alhamzawi 2016), and Bayesian elastic net Tobit (Alhamzawi 2020). Ji et al. (2012) account for model uncertainty by implementing Tobit quantile regression with Stochastic Search Variable Selection. However, the outcome and latent variable are modeled as linear functions of covariates. TOBART-1 provides a competing approach to the methods referenced above that does not impose linearity.

The remainder of the paper is structured as follows: In Sect. 2 we describe the TOBART-1 model and Markov chain Monte Carlo (MCMC) implementation, Sect. 3 contains simulation studies for prediction and treatment effect estimation with censored data, Sect. 4 contains applications to real world data, and Sect. 5 concludes the paper.

2 Methods

2.1 Review of Bayesian Additive Regression Trees (BART)

Suppose there are n observations, and the \(n \times p\) matrix of explanatory variables, X, has \(i^{th}\) row \(x_i=[x_{i1},...,x_{ip}]\). Following the notation of Chipman et al. (2010), let T be a binary tree consisting of a set of interior node decision rules and a set of terminal nodes, and let \(M = \{ \mu _1,..., \mu _b \}\) denote a set of parameter values associated with each of the b terminal nodes of T. The interior node decision rules are binary splits of the predictor space into the sets \(\{ x_{is} \le c \}\) and \(\{ x_{is} > c \}\) for continuous \(x_{s}\). Each observation’s \(x_i\) vector is associated with a single terminal node of T, and is assigned the \(\mu \) value associated with this terminal node. For a given T and M, the function \(g(x_i;T,M)\) assigns a \(\mu \in M\) to \(x_i\).

For the standard BART model, the outcome is determined by a sum of trees,

$$\begin{aligned} Y_i = \sum _{j=1}^m g(x_i; T_j, M_j)+\varepsilon _i \end{aligned}$$

where \(g(x_i;T_j,M_j)\) is the output of a decision tree. \(T_j\) refers to a decision tree indexed by \(j=1,...,m\), where m is the total number of trees in the model. \(M_j\) is the set of terminal node parameters of \(T_j\), and \(\varepsilon _i \overset{i.i.d}{\sim } N(0, \sigma ^2)\).

Prior independence is assumed across trees \(T_j\) and across terminal node means \(M_j = (\mu _{1j},...,\mu _{b_j j})\) (where \(1,...,b_j\) indexes the terminal nodes of tree j). The form of the prior used by Chipman et al. (2010) is \(p(M_1,...,M_m,T_1,...,T_m,\sigma ) \propto \left[ \prod _j \left[ \prod _k p(\mu _{kj}|T_j) \right] p(T_j)\right] p(\sigma ) \), where \(\mu _{kj} | T_j \overset{i.i.d}{\sim } N(0,\sigma _{\mu }^2)\) with \(\sigma _{\mu } = \frac{0.5}{\kappa \sqrt{m}}\), and \(\kappa \) is a user-specified hyperparameter.

Chipman et al. (2010) set a regularization prior on the tree size and shape \(p(T_j)\). The probability that a given node within a tree \(T_j\) is split into two child nodes is \(\alpha (1+d_h)^{-\beta }\), where \(d_h\) is the depth of (internal) node h, and the parameters \(\alpha \) and \(\beta \) determine the size and shape of \(T_j\) respectively. Chipman et al. (2010) use uniform priors on available splitting variables and splitting points. The model precision \(\sigma ^{-2}\) has a conjugate prior distribution \(\sigma ^{-2} \sim Ga(\frac{v}{2}, \frac{v \lambda }{2})\) with degrees of freedom v and scale \(\lambda \).

Samples from \(p((T_1, M_1),...,(T_m,M_m), \sigma | y)\) can be drawn using a Bayesian backfitting MCMC algorithm. This algorithm involves m successive draws from \((T_j, M_j )| T_{(j)}, M_{(j)}, \sigma , y \) for \(j=1,...,m\), where \(T_{(j)}, M_{(j)} \) denote the trees and terminal node parameters of all trees except the \(j^{th}\) tree, followed by a draw of \(\sigma \) from the full conditional \(\sigma | T_1,...,T_m,M_1,...,M_m,y\). After burn-in, the sequence of \(f^*\) draws, \(f_1^*,...,f_Q^*\), where \(f^*(\cdot )= \sum _{j=1}^m g(\cdot \,; T_j^*, M_j^*)\), is an approximate sample of size Q from p(f|y).
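The structure of one backfitting sweep can be sketched as follows. This is a minimal illustration assuming numpy; `update_tree` is a placeholder for the Metropolis-Hastings tree proposal and conjugate leaf-mean draws of Chipman et al. (2010), and only the partial-residual bookkeeping and the draw of \(\sigma \) from its inverse-gamma full conditional are spelled out.

```python
import numpy as np

def sigma_full_conditional_draw(resid, nu, lam, rng):
    """Draw sigma from its inverse-gamma full conditional given the current residuals.

    Prior: sigma^{-2} ~ Gamma(shape=nu/2, rate=nu*lam/2); residuals ~ N(0, sigma^2).
    """
    n = resid.shape[0]
    shape = 0.5 * (nu + n)
    rate = 0.5 * (nu * lam + np.sum(resid ** 2))
    precision = rng.gamma(shape, 1.0 / rate)  # numpy parameterises the gamma by scale = 1/rate
    return 1.0 / np.sqrt(precision)

def backfitting_sweep(y, tree_fits, update_tree, sigma, nu, lam, rng):
    """One sweep of Bayesian backfitting.

    tree_fits is an (m, n) array holding the current g(x_i; T_j, M_j) for each tree j;
    update_tree(j, partial_resid, sigma) stands in for the tree move and leaf-mean draws
    and must return the updated fitted values of tree j.
    """
    m = tree_fits.shape[0]
    for j in range(m):
        # residual of the outcome against all trees except tree j
        partial_resid = y - tree_fits.sum(axis=0) + tree_fits[j]
        tree_fits[j] = update_tree(j, partial_resid, sigma)
    sigma = sigma_full_conditional_draw(y - tree_fits.sum(axis=0), nu, lam, rng)
    return tree_fits, sigma
```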

2.2 Soft trees and sparse splitting rules

In addition to the standard Bayesian tree model for \(f(\varvec{x}_i)\) described in Sect. 2.1, we also implement TOBART and TOBART-NP with soft trees and sparse splitting rules as described by Linero and Yang (2018). Predictions from soft trees are weighted linear combinations of all terminal node parameter values, with the weights being functions of distances between covariates and splitting points. The prediction from a single tree function is

$$\begin{aligned} g(\varvec{x}_i; T_j, M_j)&= \sum _{\ell = 1}^{L_j} \mu _{j,\ell } \, \xi (\varvec{x}_i, T_j, \ell ) \\ \xi (\varvec{x}_i, T_j, \ell )&= \prod _{b \in \mathcal {A}(\ell )} \zeta \left( \frac{ x_{{j_b}} - C_b}{\tau _b} \right) ^{ R_b(\ell ) } \Big \{ 1 - \zeta \left( \frac{ x_{{j_b}} - C_b}{\tau _b} \right) \Big \}^{ 1 - R_b(\ell ) } \end{aligned}$$

where \(L_j\) is the number of leaves in the \(j^{th}\) tree, \(\mu _{j,\ell }\) is the \(\ell ^{th}\) terminal node parameter of the \(j^{th}\) tree, \(\mathcal {A}(\ell )\) denotes the set of ancestor (internal) nodes of terminal node \(\ell \), and \(R_b(\ell ) = 1\) if the path from the root to terminal node \(\ell \) follows the right branch (the \(x_{{j_b}} > C_b\) branch) at ancestor node b, with \(R_b(\ell ) = 0\) otherwise. The splitting variable, splitting point, and bandwidth parameter at internal node b are denoted by \(x_{{j_b}}\), \(C_b\), and \(\tau _b\) respectively. The gating function \(\zeta \) is the logistic function \( \zeta (x) = (1+\exp (-x))^{-1}\), so every observation receives a positive weight at every terminal node, and the weights approach a hard decision-tree assignment as \(\tau _b \rightarrow 0\).
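The following minimal sketch (assuming numpy; the node and leaf encodings are illustrative rather than the notation of any particular implementation) shows how a single soft-tree prediction weights every terminal node parameter by the product of gating probabilities along the leaf's path.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_predict(x, internal_nodes, leaves):
    """Prediction from one soft tree.

    internal_nodes maps node id -> (splitting variable, split point C, bandwidth tau);
    leaves is a list of (mu, path) pairs, where path lists (node id, went_right)
    for every ancestor of the leaf.
    """
    total = 0.0
    for mu, path in leaves:
        weight = 1.0
        for node_id, went_right in path:
            var, c, tau = internal_nodes[node_id]
            p_right = logistic((x[var] - c) / tau)      # gating probability at this ancestor
            weight *= p_right if went_right else (1.0 - p_right)
        total += mu * weight                             # leaf weights sum to one over leaves
    return total

# A depth-two example: the root splits on x[0] at 0.5, its right child splits on x[1] at 0.3.
internal_nodes = {0: (0, 0.5, 0.1), 1: (1, 0.3, 0.1)}
leaves = [(-1.0, [(0, False)]),                # left of the root
          (0.5, [(0, True), (1, False)]),      # right of the root, left of node 1
          (2.0, [(0, True), (1, True)])]       # right of the root, right of node 1
print(soft_tree_predict(np.array([0.6, 0.7]), internal_nodes, leaves))
```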

Sparse splitting rules are introduced by placing a Dirichlet prior on the splitting probabilities \((s_1,\dots , s_p) \sim \mathcal {D} (\frac{a}{p},\dots , \frac{a}{p})\). The parameter a controls the level of sparsity and has the prior \(\text {Beta}(0.5,1)\). Linero and Yang (2018) demonstrate that soft trees allow BART to model smooth functions, and the Dirichlet prior on splitting probabilities adapts to unknown levels of sparsity to provide improved predictions on high dimensional data sets.

2.3 Type I Tobit and TOBART

2.3.1 Type I Tobit model

The Type I Tobit model with censoring from below at a and censoring from above at b is:

$$\begin{aligned} Y_i^*&= \varvec{x}_i \varvec{\beta } + \varepsilon _i, \quad \varepsilon _i \overset{i.i.d.}{\sim } N(0, \sigma ^2)\\ Y_i&= {\left\{ \begin{array}{ll} a & \text {if } Y_i^* \le a \\ Y_i^* & \text {if } a< Y_i^* < b \\ b & \text {if } b \le Y_i^* \end{array}\right. } \end{aligned}$$

where a normal prior is placed on \(\beta \), and an inverse gamma prior is placed on \(\sigma ^2\) (Chib 1992).

2.3.2 Type I TOBART model

The Type I TOBART model replaces the linear combination \(\varvec{x}_i \varvec{\beta }\) with the sum-of-trees function \(f(\varvec{x}_i)\):

$$\begin{aligned} Y_i^*&= f(\varvec{x}_i) + \varepsilon _i, \quad \varepsilon _i \overset{i.i.d.}{\sim } N(0, \sigma ^2)\\ Y_i&= {\left\{ \begin{array}{ll} a & \text {if } Y_i^* \le a \\ Y_i^* & \text {if } a< Y_i^* < b \\ b & \text {if } b \le Y_i^* \end{array}\right. } \end{aligned}$$

where a BART prior is placed on \( f(\varvec{x}_i)\) and an inverse gamma prior is placed on \(\sigma ^2\).

2.3.3 Type I TOBART Gibbs sampler

Tobit can be implemented by MCMC with data augmentation (Chib 1992). The realization, \(y_i^*\), of the variable \(Y_i^*\) is observed for uncensored outcomes, and is sampled from its full conditional for censored outcomes.

$$\begin{aligned} y_i^*&= y_i \quad \text {if } y_i \in (a,b), \quad \text {and}\\ y_i^*&\sim {\left\{ \begin{array}{ll} \mathcal{T}\mathcal{N}_{(-\infty ,a]}(f(\varvec{x}_i), \sigma ^2) & \text {if } y_i = a \\ \mathcal{T}\mathcal{N}_{[b,\infty )}(f(\varvec{x}_i), \sigma ^2) & \text {if } y_i = b \end{array}\right. } \end{aligned}$$

where \(\mathcal{T}\mathcal{N}_{[l,u]}\) denotes a normal distribution truncated to the interval \([l,u]\). The full conditionals for \(f(\varvec{x}_i)\) and \(\sigma ^2\) are the standard full conditionals for BART with \(y_i^*\) as the dependent variable and \(\varvec{x}_i\) as the potential splitting variables. Appendix A contains a description of a sampler that produces draws \(f^{(1)}(\varvec{x}_i), \ldots ,f^{(D)}(\varvec{x}_i)\) and \(\sigma ^{(1)},\ldots ,\sigma ^{(D)}\).
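The data-augmentation step can be sketched in a few lines. This is a minimal illustration assuming numpy and scipy; the function and array names (`augment_latent_outcomes`, `y`, `f_x`) are illustrative and not from the paper.

```python
import numpy as np
from scipy.stats import truncnorm

def augment_latent_outcomes(y, f_x, sigma, a, b, rng=None):
    """One data-augmentation draw of y* given the current fit f(x_i) and sigma.

    Uncensored outcomes are kept as observed; outcomes recorded at the lower (upper)
    threshold are drawn from a normal with mean f(x_i) truncated above at a (below at b).
    """
    y = np.asarray(y, dtype=float)
    f_x = np.asarray(f_x, dtype=float)
    y_star = y.copy()
    lower = y == a   # recorded at the lower censoring threshold
    upper = y == b   # recorded at the upper censoring threshold
    if lower.any():
        # scipy's truncnorm takes bounds standardised by (loc, scale)
        y_star[lower] = truncnorm.rvs(-np.inf, (a - f_x[lower]) / sigma,
                                      loc=f_x[lower], scale=sigma, random_state=rng)
    if upper.any():
        y_star[upper] = truncnorm.rvs((b - f_x[upper]) / sigma, np.inf,
                                      loc=f_x[upper], scale=sigma, random_state=rng)
    return y_star
```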

2.3.4 Predicting outcomes with TOBART

The conditional mean of the latent variable is \(f(\varvec{x}_i)\). If censoring is also applied to the test data, then the outcomes are predicted by averaging the standard Tobit expectation formula across MCMC iterations:

For all MCMC iterations \(d=1,...,D\) calculate

$$\begin{aligned} E[Y_i|X_i =\varvec{x}_i, f^{(d)}, \sigma ^{(d)}]&= a \, \Phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \\&\quad +f^{(d)}(\varvec{x}_i) \Bigg [ \Phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \Phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \Bigg ]\\&\quad + \sigma ^{(d)} \Bigg ( \phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \Bigg ) \\&\quad +b \Bigg [ 1 - \Phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \Bigg ] \end{aligned}$$

The predicted outcome is \(\frac{1}{D} \sum _{d=1}^D E[Y_i|X_i=\varvec{x}_i, f^{(d)}, \sigma ^{(d)}]\). The expectation conditional on the outcome not being in the censored range is:

$$\begin{aligned} E[Y_i| a< Y_i < b, X_i=\varvec{x}_i, f^{(d)}, \sigma ^{(d)}] = f^{(d)}(\varvec{x}_i) + \sigma ^{(d)} \frac{ \phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) }{ \Phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \Phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) } \end{aligned}$$
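Both expectations are straightforward to compute from the MCMC output. The sketch below assumes numpy and scipy and uses hypothetical names for the arrays of posterior draws; it illustrates the formulas above rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def tobit_expectation(f, sigma, a, b):
    """E[Y | x] under the Type I Tobit model, censored below at a and above at b."""
    alpha, beta = (a - f) / sigma, (b - f) / sigma
    return (a * norm.cdf(alpha)
            + f * (norm.cdf(beta) - norm.cdf(alpha))
            + sigma * (norm.pdf(alpha) - norm.pdf(beta))
            + b * (1.0 - norm.cdf(beta)))

def tobit_truncated_expectation(f, sigma, a, b):
    """E[Y | a < Y < b, x]: the latent mean plus an inverse-Mills-ratio style correction."""
    alpha, beta = (a - f) / sigma, (b - f) / sigma
    return f + sigma * (norm.pdf(alpha) - norm.pdf(beta)) / (norm.cdf(beta) - norm.cdf(alpha))

# Posterior-mean prediction, averaging over MCMC draws (hypothetical array names):
# f_draws has shape (D, n) and sigma_draws has shape (D,).
# y_hat = np.mean([tobit_expectation(f_d, s_d, a, b)
#                  for f_d, s_d in zip(f_draws, sigma_draws)], axis=0)
```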

2.4 Nonparametric Type I TOBART

2.4.1 Nonparametric Type I TOBART model

The accuracy of the TOBART conditional expectation depends on the validity of the normality assumption for the errors. More general error distributions can be accommodated by assuming a Dirichlet process mixture of normal distributions for the error terms.

$$\begin{aligned} y_i^*&= f(\varvec{x}_i) + \varepsilon _i, \quad y_i = {\left\{ \begin{array}{ll} a & \text {if } y_i^* \le a \\ y_i^* & \text {if } a< y_i^* < b \\ b & \text {if } b \le y_i^* \end{array}\right. } \\ \varepsilon _i \mid \vartheta _i&\sim N(\gamma _i, \sigma _i^2), \quad \vartheta _i = (\gamma _i, \sigma _i) \overset{i.i.d.}{\sim } G\\ G&\sim \mathcal{D}\mathcal{P}(G_0, \alpha ) \end{aligned}$$

The distribution of the error term is specified similarly to George et al. (2019). The base distribution \(G_0\) is defined as follows:

$$\begin{aligned} p(\gamma , \sigma \mid \nu , \lambda , \gamma _0, k_0)&= p(\sigma \mid \nu , \lambda ) \, p(\gamma \mid \sigma , \gamma _0, k_0)\\ \sigma ^2&\sim \frac{\nu \lambda }{\chi _{\nu }^2}, \quad \gamma \mid \sigma \sim \mathcal {N} \Big (\gamma _0, \frac{\sigma ^2}{k_0} \Big ) \end{aligned}$$

where, in contrast to the standard BART prior of Chipman et al. (2010), \(\nu \) is set to 10 instead of 3. The parameter \(\lambda \) is set such that the \(q^{th}\) quantile of the prior distribution of \(\sigma \) is the sample standard deviation of the outcome, or of the residuals from a linear model. For TOBART-NP, \(q=0.9\) instead of 0.95. The prior on \(\alpha \) is the \(\alpha \sim \Gamma (2,2)\) prior introduced by Escobar and West (1995) and applied by Van Hasselt (2011).

The outcome is centered by subtracting the sample mean before applying the Gibbs sampler; therefore, following George et al. (2019), \(\gamma _0 = 0\). The parameter \(k_0\) is set via the marginal prior distribution of \(\gamma \) (\(\gamma \sim \frac{\sqrt{\lambda } }{ \sqrt{k_0 } } t_{\nu } \)). Given \(k_s\) (set to 10 by default), \(k_0\) is chosen such that \( \max _{i=1,...,n} |e_i| = k_s \frac{\sqrt{\lambda } }{ \sqrt{k_0 } } \), where \(e_1,...,e_n\) are the residuals from a linear model. The Gibbs sampler for TOBART-NP is described in Appendix A.
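To illustrate the prior on the error distribution, the following sketch draws error-term parameters \((\gamma _i, \sigma _i)\) from \(G \sim \mathcal{D}\mathcal{P}(G_0, \alpha )\) using a truncated stick-breaking representation. It assumes numpy, uses placeholder hyperparameter values, and is a prior simulation only; the posterior Gibbs sampler of Appendix A is more involved.

```python
import numpy as np

def draw_from_G0(nu, lam, gamma0, k0, rng):
    """One draw (gamma, sigma) from the base distribution G_0:
    sigma^2 ~ nu*lam / chi^2_nu, then gamma | sigma ~ N(gamma0, sigma^2 / k0)."""
    sigma2 = nu * lam / rng.chisquare(nu)
    gamma = rng.normal(gamma0, np.sqrt(sigma2 / k0))
    return gamma, np.sqrt(sigma2)

def draw_error_params_dp(n, alpha, nu, lam, gamma0, k0, rng, trunc=200):
    """Draw (gamma_i, sigma_i), i = 1..n, from G ~ DP(G_0, alpha), with G approximated
    by a truncated stick-breaking construction with `trunc` atoms."""
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # stick-breaking weights
    atoms = [draw_from_G0(nu, lam, gamma0, k0, rng) for _ in range(trunc)]
    idx = rng.choice(trunc, size=n, p=w / w.sum())              # ties induce clustering
    return np.array([atoms[i] for i in idx])                    # shape (n, 2): gamma_i, sigma_i

rng = np.random.default_rng(1)
params = draw_error_params_dp(n=5, alpha=1.0, nu=10, lam=0.5, gamma0=0.0, k0=2.0, rng=rng)
```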

For each MCMC iteration, d, and observation i, we obtain \(\vartheta _i^{(d)} = (\gamma _i^{(d)}, \sigma _i^{(d)})\). The conditional expectation, \(E[y_i|\varvec{x}_i, f^{(d)}, \gamma _i^{(d)}, \sigma _i^{(d)}] \), is calculated as outlined in Sect. 2.3.4, with \(f^{(d)}(\varvec{x}_i) + \gamma _i^{(d)}\) in place of \(f^{(d)}(\varvec{x}_i)\) and \(\sigma _i^{(d)}\) in place of \(\sigma ^{(d)}\).

2.5 Treatment effect estimation for censored outcomes

Let a binary variable \(T_i\) equal 1 if unit i is assigned to treatment and 0 if i is assigned to the control group. The potential outcomes under treatment and control are denoted by \(Y_i(1)\) and \(Y_i(0)\) respectively, and the corresponding potential values of the latent outcome by \(Y_i^*(1)\) and \(Y_i^*(0)\). Assume the data generating process is as follows:

$$\begin{aligned} Y_i^*&= \mu (\varvec{x}_i) + \tau (\varvec{x}_i) T_i + \varepsilon _i, \quad \varepsilon _i \sim \mathcal {N}(0,\sigma ^2)\\ Y_i&= {\left\{ \begin{array}{ll} a & \text {if } Y_i^* \le a \\ Y_i^* & \text {if } a< Y_i^* < b \\ b & \text {if } b \le Y_i^* \end{array}\right. } \end{aligned}$$

where \(\mu (\varvec{x}_i)\) and \(\tau (\varvec{x}_i)\) are possibly nonlinear functions of covariates. Assume conditional unconfoundedness, i.e. \(Y_i^*(1),Y_i^*(0) \perp T_i | X_i\). The estimand is the conditional average treatment effect on \(Y_i^*\), i.e., \( E[Y_i^*(1) - Y_i^*(0) | X_i = \varvec{x}_i] = \tau (\varvec{x}_i) \). However, a model naively trained only on uncensored outcomes estimates the following quantity:

$$\begin{aligned}&E[Y_i(1) \mid a< y_i< b, X_i = \varvec{x}_i] - E[Y_i(0) \mid a< y_i < b, X_i = \varvec{x}_i] \\&\quad = \tau (\varvec{x}_i) + \sigma \Bigg ( \frac{ \phi \Big (\frac{a - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i))}{\sigma } \Big ) - \phi \Big (\frac{b - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i)) }{\sigma } \Big ) }{ \Phi \Big (\frac{b - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i)) }{\sigma } \Big ) - \Phi \Big (\frac{a - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i)) }{\sigma } \Big ) } - \frac{ \phi \Big (\frac{a - \mu (\varvec{x}_i)}{\sigma } \Big ) - \phi \Big (\frac{b - \mu (\varvec{x}_i)}{\sigma } \Big ) }{ \Phi \Big (\frac{b - \mu (\varvec{x}_i)}{\sigma } \Big ) - \Phi \Big (\frac{a - \mu (\varvec{x}_i)}{\sigma } \Big ) } \Bigg ) . \end{aligned}$$

A sufficiently flexible nonparametric method, without restrictive assumptions on the error term, will produce estimates that approximate the expression above. A model naively trained on the full data set with censoring similarly gives biased estimates (see Appendix B). By directly modelling \(Y_i^*\), censored outcome models avoid the bias described above. Similar biases occur if the error term is not normally distributed.
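The size of this bias is easy to evaluate numerically. The snippet below (numpy and scipy assumed; the parameter values are illustrative, not taken from the paper) evaluates the truncated-mean correction terms in the display above and shows that the naive contrast differs from \(\tau (\varvec{x})\).

```python
import numpy as np
from scipy.stats import norm

def truncated_mean_shift(m, sigma, a, b):
    """sigma * [phi(alpha) - phi(beta)] / [Phi(beta) - Phi(alpha)], alpha = (a-m)/sigma, beta = (b-m)/sigma."""
    alpha, beta = (a - m) / sigma, (b - m) / sigma
    return sigma * (norm.pdf(alpha) - norm.pdf(beta)) / (norm.cdf(beta) - norm.cdf(alpha))

# Illustrative values (not from the paper): mu(x) = 0, tau(x) = 1, sigma = 1,
# censoring from below at a = -1 and from above at b = 2.
mu_x, tau_x, sigma, a, b = 0.0, 1.0, 1.0, -1.0, 2.0
bias = truncated_mean_shift(mu_x + tau_x, sigma, a, b) - truncated_mean_shift(mu_x, sigma, a, b)
print(bias)   # nonzero, so the naive contrast does not equal tau(x) = 1
```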

3 Simulation studies

3.1 Description of prediction simulations

We adapt the data generating process (DGP) introduced by Friedman (1991) to a censored regression setting. This DGP has often been applied in comparisons of semiparametric regression methods. We also make use of the censored outcome simulations described by Groot and Lucas (2012), Sigrist and Hirnschall (2019), and Jacobson and Zou (2024), allowing a fair comparison against competing methods on existing synthetic censored data.

The covariates \(x_1,...,x_p\) are independently sampled from the uniform distribution on the unit interval. The outcome before censoring is generated from one of the following functions:

  • \( y^* = 10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(15^{th}\) percentile of the training data \(y^*\) values (Friedman 1991).

  • \( y^* = 10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(15^{th}\) percentile of the training data \(y^*\) values, and from above at the \(85^{th}\) percentile of the training data \(y^*\) values (Friedman 1991).

  • \( y^* = (6x_1 - 2)^2 \sin (2(6x_1 - 2) ) + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(40^{th}\) percentile of the training data \(y^*\) values (Groot and Lucas 2012).

  • \( y^* = \sum _{k=1}^5 0.3 \max (x_k,0) + \sum _{k=1}^3 \sum _{j=k+1}^4 \max (x_k x_j,0) + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from above at the \(95^{th}\) percentile of the training data \(y^*\) values (Sigrist and Hirnschall 2019). For this simulation, \(x_1,...,x_p\) are uniformly distributed on \([-1,1]\) instead of [0, 1].

  • \( y^* = 3 + 5 x_1 + x_2 + \frac{x_3}{2} - 2 x_4 + \frac{x_5}{10} + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(25^{th}\) percentile of the training data \(y^*\) values (Jacobson and Zou 2024).

The variance of the error, \(\sigma ^2\), is set to 1. See the Supplementary Appendix (Online Resource 1) for results from simulations with \(\sigma \in \{0.1, 2\}\). We also consider deviations from the assumption of normally distributed errors; in particular, we include results for simulations in which \(\varepsilon \) is generated from Skew-t and \(\text {Weibull}(1/2, 1/5)\) distributions. The number of covariates, p, is set to 30. We generate 500 training and 500 test observations.
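As an illustration, the first (Friedman) DGP with censoring can be generated as follows. This is a sketch assuming numpy, not the authors' simulation code; in the study the censoring thresholds are percentiles of the training \(y^*\) values, whereas here they are computed from the single generated sample.

```python
import numpy as np

def friedman_censored(n=500, p=30, sigma=1.0, lower_q=15, upper_q=None, rng=None):
    """Generate one censored Friedman (1991) data set.

    Covariates are U[0,1]; y* depends only on x_1,...,x_5; censoring thresholds are
    percentiles of the generated y* values.
    """
    rng = rng or np.random.default_rng()
    X = rng.uniform(size=(n, p))
    y_star = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
              + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, sigma, n))
    a = np.percentile(y_star, lower_q)
    b = np.percentile(y_star, upper_q) if upper_q is not None else np.inf
    y = np.clip(y_star, a, b)        # observed, censored outcome
    return X, y, y_star, a, b

X, y, y_star, a, b = friedman_censored(lower_q=15, upper_q=85, rng=np.random.default_rng(0))
```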

3.2 Prediction simulation results

We compare the performance of TOBART-1, TOBART-1-NP, Soft TOBART-1, and Soft TOBART-1-NP against Grabit (Sigrist and Hirnschall 2019), linear Tobit (Tobin 1958), BART (Chipman et al. 2010), Random Forests (RF) (Breiman 2001), Gaussian Processes, and a Tobit Gaussian Process model (Groot and Lucas 2012). The results for a Gaussian Process (GP) with only 5 variables (always including all informative variables) are included because GPs were observed to produce inaccurate predictions when applied to data with 30 variables. Censored outcome predictions are evaluated using Mean Squared Error (MSE), and predicted probabilities of censoring are evaluated using the Brier Score. All results are averaged over 5 repetitions.
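For reference, the predicted probability of censoring under the Tobit model and the Brier score used to evaluate it can be computed as below (numpy and scipy assumed; a sketch, not the evaluation code used for the tables). For TOBART, these probabilities would be averaged over MCMC draws of f and \(\sigma \).

```python
import numpy as np
from scipy.stats import norm

def prob_censored(f, sigma, a, b):
    """P(Y = a or Y = b | x) under the Tobit model: latent mass below a plus mass above b."""
    return norm.cdf((a - f) / sigma) + 1.0 - norm.cdf((b - f) / sigma)

def brier_score(censored, prob):
    """Mean squared difference between the 0/1 censoring indicator and the predicted probability."""
    return np.mean((np.asarray(censored, dtype=float) - np.asarray(prob)) ** 2)
```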

The results for simulations with normally distributed errors are presented in Tables 1 and 2. The TOBART algorithms generally outperform competing methods across all DGPs; the exception, unsurprisingly, is the linear Jacobson and Zou (2024) DGP, for which linear Tobit is outperformed only by Soft TOBART. TOBART-NP can slightly improve on TOBART in some cases, but the results are generally similar when errors are normally distributed. The differences in criteria across methods are small for the more linear DGPs from Sigrist and Hirnschall (2019) and Jacobson and Zou (2024), as linear Tobit is designed for a linear DGP, and the nonlinear methods BART and RF can model the relatively simple response surface well. It is worth noting that TOBART outperforms Grabit even though the true standard deviation, \(\sigma =1\), is included as one of five possible Grabit hyperparameter values in cross-validation. The same pattern of results can be observed for simulations with \(\sigma =0.1\) and \(\sigma = 2\) in the Supplementary Appendix. The Supplementary Appendix also contains comparisons of Area Under the Curve for all methods and DGPs, from which similar conclusions can be drawn.

The results for Skew-t and Weibull distributed errors are also presented in Tables 1 and 2. The TOBART models outperform all other methods for almost all DGPs and criteria. The results for the Weibull distribution generally favour TOBART-NP and Soft TOBART-NP, indicating that there is some improvement from the Dirichlet Process model when the errors are sufficiently non-Gaussian.

The average coverage and length of \(95\%\) prediction intervals for the latent outcomes and the observed outcomes are given in the Supplementary Appendix (Online Resource 1). For most DGPs and error distributions, TOBART and Soft TOBART provide coverage closest to the nominal 95% for prediction intervals for both latent and observed outcomes. For some DGPs with non-normal errors, the more conservative intervals produced by TOBART-NP and Soft TOBART-NP provide better coverage.

Table 1 Simulation study, mean squared error
Table 2 Simulation study, Brier score

3.3 Description of treatment effect simulations

A number of recent simulation studies have demonstrated that BART is among the most accurate treatment effect estimation methods (Wendling et al. 2018; McConnell and Lindner 2019; Dorie et al. 2019; Hahn et al. 2019). However, in practice many data sets, including randomized trial data sets, contain censored outcomes. For example, antibody concentrations or environmental levels of chemicals can only be measured accurately within a certain range as a result of limitations of measuring equipment. Economic data is often censored due to privacy considerations; for example, income might be censored above a certain threshold. TOBART provides a machine learning treatment effect estimation method with uncertainty quantification that can be applied to such data while still making use of the information provided by censored observations. We demonstrate the effectiveness of TOBART by censoring the outcomes of DGPs from published studies of machine learning methods for treatment effect estimation. The chosen data generating processes contain linear and nonlinear functions of covariates, constant and heterogeneous effects, and various degrees of confounding.

3.3.1 Censored Caron et al. (2022) simulations

\(P=10\) covariates are generated from a multivariate Gaussian distribution, \(X_1,\ldots , X_{10} \sim \mathcal {MVN}(\varvec{0}, \Sigma )\), with \(\Sigma _{jk} = 0.6^{|j-k|} + 0.1 \mathbb {I}(j \ne k) \). The binary treatment variable is Bernoulli distributed, \(Z_i \sim \text {Bern}(\pi (\varvec{x}_i)) \), where

$$\begin{aligned} \pi (\varvec{x}_i) = \Phi (-0.4 + 0.3 X_{i,1} + 0.2 X_{i,2} ) \end{aligned}$$

and \(\Phi (\cdot )\) is the cumulative distribution function of the standard normal distribution.

The prognostic score function, \(\mu (\varvec{x}_i)\), and CATE function, \(\tau (\varvec{x}_i)\), are defined as

$$\begin{aligned} \mu (\varvec{x}_i)&= 3 + X_{i,1} + 0.8 \sin ( X_{i,2} ) + 0.7 X_{i,3} X_{i,4} - X_{i,5}\\ \tau (\varvec{x}_i)&= 2 + 0.8 X_{i,1} - 0.3 X_{i,2}^2 \end{aligned}$$

The outcome before censoring is generated as:

$$\begin{aligned} Y_i^* = \mu (\varvec{x}_i) + \tau (\varvec{x}_i) Z_i + \varepsilon _i \, \ \text {where} \ \varepsilon _i \sim \mathcal {N}(0,1) \end{aligned}$$

The number of sampled observations is 200. The observed outcome \(Y_i\) is censored from below at the \(15^{th}\) percentile of the generated \(Y_i^*\) values, and from above at the \(85^{th}\) percentile.
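A sketch of this DGP (assuming numpy and scipy; function and variable names are illustrative) is given below.

```python
import numpy as np
from scipy.stats import norm

def caron_censored(n=200, rng=None):
    """Generate one censored Caron et al. (2022) data set as described above."""
    rng = rng or np.random.default_rng()
    p = 10
    # Sigma_jk = 0.6^|j-k| + 0.1 * 1(j != k)
    Sigma = np.fromfunction(lambda j, k: 0.6 ** np.abs(j - k) + 0.1 * (j != k), (p, p))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    pi = norm.cdf(-0.4 + 0.3 * X[:, 0] + 0.2 * X[:, 1])          # propensity score
    Z = rng.binomial(1, pi)                                      # treatment indicator
    mu = 3 + X[:, 0] + 0.8 * np.sin(X[:, 1]) + 0.7 * X[:, 2] * X[:, 3] - X[:, 4]
    tau = 2 + 0.8 * X[:, 0] - 0.3 * X[:, 1] ** 2                 # CATE
    y_star = mu + tau * Z + rng.normal(size=n)
    a, b = np.percentile(y_star, [15, 85])                       # censoring thresholds
    y = np.clip(y_star, a, b)
    return X, Z, y, tau
```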

Table 3 Treatment effect simulation results

3.3.2 Censored Friedberg et al. (2020) simulations

\(P=20\) covariates are generated from independent standard uniform distributions \(X_1,...,X_{20} \sim \mathcal {U}[0,1]\). There is no confounding as \(\pi (\varvec{x}_i)=0.5\) and \(Z_i \sim \text {Bern}(\pi (\varvec{x}_i)) \). The prognostic score function, \(\mu (\varvec{x}_i)\), and CATE function, \(\tau (\varvec{x}_i)\), are defined as \( \mu (\varvec{x}_i) = 0 \) and

$$\begin{aligned} \tau (\varvec{x}_i)&= \left( 1 + \frac{1}{1 + \exp \left( -20\left( X_{i,1} - \frac{1}{3}\right) \right) } \right) \left( 1 + \frac{1}{1 + \exp \left( -20\left( X_{i,2} - \frac{1}{3}\right) \right) } \right) . \end{aligned}$$

The outcome before censoring is generated as:

$$\begin{aligned} Y_i^* = \mu (\varvec{x}_i) + \tau (\varvec{x}_i) Z_i + \varepsilon _i \, \ \text {where} \ \varepsilon _i \sim \mathcal {N}(0,1) \end{aligned}$$

The number of sampled observations is 200. The observed outcome \(Y_i\) is censored from below at the \(15^{th}\) percentile of the generated \(Y_i^*\) values, and from above at the \(85^{th}\) percentile.

3.3.3 Censored Nie and Wager (2021) simulations

The covariates are generated as follows across scenarios A to D. In simulation A, \(X_1,...,X_{12} \sim \mathcal {U}[0,1]\). In simulations B to D, \(X_1,...,X_{12} \sim \mathcal {N}(0,1)\).

\(\pi (\varvec{x}_i)\) is defined as follows across scenarios A to D: (A) \( \text {trim}_{0.1} \{ \sin (\pi X_{i,1} X_{i,2} ) \} \), (B) constant equal to 0.5, (C) \(1/\{1 + \exp (X_{i,2} + X_{i,3} )\}\), (D) \(1/\{1 + \exp (-X_{i,1}) + \exp ( - X_{i,2} )\}\).

\(\mu (\varvec{x}_i)\) is defined as follows across scenarios A to D: (A) \(\sin (\pi X_{i,1} X_{i,2}) + 2 (X_{i,3}-0.5)^2 + X_{i,4} + 0.5 X_{i,5} \), (B) \(\max \{X_{i,1} + X_{i,2}, X_{i,3},0 \} \), (C) \(2 \log \{ 1 + \exp ( X_{i,1} + X_{i,2} + X_{i,3} ) \} \), (D) \(\frac{1}{2} [ \max \{ X_{i,1} + X_{i,2} + X_{i,3},0 \} + \max \{ X_{i,4} + X_{i,5},0 \} ] \).

\(\tau (\varvec{x}_i)\) is defined as follows across scenarios A to D: (A) \( ( X_{i,1} + X_{i,2})/2\), (B) \( X_{i,1} + \log \{1 + \exp ( X_{i,2}) \}\), (C) constant equal to 1, (D) \(\max \{ X_{i,1} + X_{i,2} + X_{i,3},0 \} - \max \{ X_{i,4} + X_{i,5},0 \}\).

The outcome before censoring is generated as:

$$\begin{aligned} Y_i^* = \mu (\varvec{x}_i) + \tau (\varvec{x}_i) (Z_i-0.5) + \varepsilon _i \, \ \text {where} \ \varepsilon _i \sim \mathcal {N}(0,1) \end{aligned}$$

The number of sampled observations is 200. The observed outcome \(Y_i\) is censored from below at the \(15^{th}\) percentile of the generated \(Y_i^*\) values, and from above at the \(85^{th}\) percentile.

3.4 Treatment effect simulation results

All methods are evaluated in terms of Precision in Estimation of Heterogeneous Effects (PEHE), defined here as \(\frac{1}{N}\sum _{i=1}^N (\hat{\tau }(\varvec{x}_i) - \tau (\varvec{x}_i) )^2\). Interval estimates are evaluated in terms of average coverage and average length of \(95\%\) intervals.
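Both criteria are simple functions of the estimated and true CATEs; a sketch (assuming numpy, with hypothetical array names) is given below.

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Precision in Estimation of Heterogeneous Effects, as defined above."""
    return np.mean((tau_hat - tau_true) ** 2)

def interval_coverage_and_length(tau_draws, tau_true, level=0.95):
    """Average coverage and length of equal-tailed credible intervals formed from
    posterior draws of tau(x_i); tau_draws has shape (D, n)."""
    lo, hi = np.percentile(tau_draws, [100 * (1 - level) / 2, 100 * (1 + level) / 2], axis=0)
    coverage = np.mean((tau_true >= lo) & (tau_true <= hi))
    length = np.mean(hi - lo)
    return coverage, length
```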

The results are presented in Table 3. For all DGPs, at least one TOBART method attains lower PEHE than all other methods, often by a large margin. Local Linear Forests (Friedberg et al. 2020) attain similar PEHE to TOBART and TOBART-NP for Nie and Wager (2021) DGP D, which involves partly linear prognostic and treatment effect functions, although Soft TOBART is notably more accurate. The average coverage of TOBART and Soft TOBART credible intervals for \(\tau (\varvec{x}_i)\) is generally much closer to \(95\%\) than the coverage of intervals produced by competing methods. TOBART-NP produces very wide credible intervals relative to TOBART, but attains better coverage than TOBART for four DGPs.

Table 4 Data application: number of observations (n), number of covariates (p), and proportions censored from below and above
Table 5 Data application results: mean squared error of outcome predictions relative to TOBART; Brier score and AUROC for predicted probabilities of censoring; 95% posterior predictive interval average coverage and length. Average over 10 random splits into 70% training data 30% test data. Minimum MSE, minimum Brier score, maximum AUROC, and coverage values closest to 0.95 are in bold
Table 6 Fake censoring data application results: MSE of outcome and latent outcome predictions relative to TOBART; latent outcome 95% posterior predictive interval average coverage and length. Average over 5 random splits into 70% training data 30% test data. Minimum MSE, and coverage values closest to 0.95 are in bold

4 Data application

For the data application, we consider the same methods as in Sect. 3.2, excluding Gaussian Processes and adding a hurdle model that combines linear regression and probit. For each data set, we average results over 10 training-test splits. Each split is formed by taking a random sample of \(\text {floor}(0.7n)\) training observations, stratified by censoring status. Categorical variables are encoded as sets of dummy variables. The numbers of observations, covariates, and proportions of censored observations are given in Table 4. Appendix C contains data descriptions with references to original sources.
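A sketch of one such stratified split (assuming numpy; the per-stratum rounding is an implementation detail that may differ from the authors' code) is given below.

```python
import numpy as np

def stratified_split(y, a, b, train_frac=0.7, rng=None):
    """Indices for one training/test split, stratified by censoring status."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y)
    censored = (y <= a) | (y >= b)
    train_idx = []
    for flag in (True, False):
        group = np.where(censored == flag)[0]
        n_train = int(np.floor(train_frac * len(group)))
        train_idx.append(rng.choice(group, size=n_train, replace=False))
    train_idx = np.concatenate(train_idx)
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx
```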

4.1 Data application results

The data application results are presented in Table 5. For most data sets the results are similar across methods, particularly when methods are evaluated in terms of Brier score for predicted probabilities of censoring. Soft TOBART-NP performs best in terms of area under the receiver operating characteristics curve (AUROC). TOBART can give notably lower MSE of outcome predictions relative to other methods for some data sets. Prediction interval coverage is generally similar across methods, although prediction interval length for TOBART-based methods can be notably smaller than for standard BART.

In contrast to the simulation studies above, there is not a clear winning method in Table 5. Although censored outcome models have been applied to these data sets in previous work, other models may be more suitable for some data sets: for many data sets the combination of probit and a linear model outperforms Tobit, suggesting that zero-inflated, hurdle, or sample selection models might be more appropriate. For the data sets on which Tobit outperforms the probit-plus-linear-model combination in terms of MSE, namely Recon and Atrazine, the best method is Soft TOBART. The TOBART models also notably outperform other methods when applied to the BostonHousing and Missouri data sets.

A lesson from this study is that it is important to select the appropriate model for the data set. The TOBART and Grabit methods are designed for the same form of DGPs, so it is arguably fairer to compare these two methods. Soft TOBART produces lower MSE predictions than Grabit across almost all data sets. Nonetheless, the results are less impressive than those observed in the simulation study. Possible explanations for this include slow mixing of the TOBART Markov chain, small sample sizes for some data sets, and very small or very large proportions of censored outcomes.

Censored outcome models are intended for prediction of latent outcomes, and it is not possible to evaluate these predictions using censored outcome data alone. In order to demonstrate the usefulness of TOBART for modelling latent outcomes using real data, we artificially censor outcomes from some real data sets. The data sets summarized in Table 4 were previously studied by Kapelner and Bleich (2016) (Ozone and Ankara) and Linero and Yang (2018) (all other data sets). We introduce fake censoring from below and above at the \(15^{th}\) and \(85^{th}\) percentiles respectively, so the true values of the "censored" outcomes are known.

The results in Table 6 suggest that TOBART-based methods can produce much more accurate predictions of latent outcomes than competing methods even if differences in MSE of observed outcome predictions are relatively small. Latent outcome posterior predictive interval coverage is generally much better for TOBART than for BART. This is unsurprising, as the BART posterior predictive intervals are not designed for latent outcomes.

5 Conclusion

Type I TOBART produces accurate predictive probabilities of censoring, predictions of outcomes, and treatment effect estimates. TOBART-NP gives better uncertainty quantification for some simulated DGPs. Advantages of TOBART over competing methods include that no hyperparameter tuning is required and that the method combines straightforwardly with other variations on BART to allow for smooth DGPs and sparsity (Linero and Yang 2018).

6 Supplementary information

The online supplementary appendix contains (A) additional simulation study results, (B) additional data application results, and (C) implementation details and parameter settings.