1 Introduction

Censoring occurs when, beyond some threshold value, the observed outcome equals the threshold rather than the true latent outcome. For example, scientific equipment can often only make accurate measurements within a known range of outcome values, and observations outside this range are recorded at the limits of that range. Often the estimand of interest is the conditional expectation of, or the conditional average treatment effect on, the outcome before censoring. Estimating a standard regression model on only the uncensored observations, or with censored observations set equal to the threshold values, results in biased estimates. Tobit models directly model the latent outcome and the censoring process (Tobin 1958).

In this paper, we combine the Bayesian Type I Tobit model (Chib 1992) with Bayesian Additive Regression Trees (Chipman et al. 2010). The latent outcome (before censoring) is modeled as a sum-of-trees, which allows for nonlinear functions of covariates. The error term is modeled as a Dirichlet process mixture of normal distributions, as in fully nonparametric BART (George et al. 2019). Smooth data generating processes with sparsity are modelled by soft trees with a Dirichlet prior on splitting variable probabilities, as introduced by Linero and Yang (2018).

In simulations and applications to real data, TOBART-1 outperforms a Tobit gradient boosted tree method, Grabit (Sigrist and Hirnschall 2019), a Tobit Gaussian Process model (Groot and Lucas 2012), standard linear Tobit, and simple hurdle models based on standard machine learning methods. Unlike other methods, TOBART-1 accounts for model uncertainty and can non-parametrically model the error term. Posterior intervals are available for censored outcomes, uncensored outcomes, conditional expectations, and probabilities of censoring. Grabit, Gaussian Processes, and other methods rely on cross-validation for parameter tuning and are sensitive to the tuned variance of the error term, whereas TOBART-1 performs well without parameter tuning and accounts for uncertainty in the variance of the error term.

TOBART-1 with a Dirichlet process mixture of normal distributions for the error term (TOBART-1-NP) removes the restrictive normality assumption often imposed in censored outcome models. We observe that this can lead to more accurate outcome predictions in simulations with non-normally distributed errors, and in real data applications, which may involve non-normally distributed outcomes.

A variety of methods have been proposed for nonparametric and semiparametric censored outcome models. Lewbel and Linton (2002) describe a local linear kernel estimator for the setting in which both the mean function of the uncensored outcome and the error distribution are unknown. Fan and Gijbels (1994) describe a quantile-based local linear approximation method. Huang (2021) introduces a semiparametric method involving B-splines. Chen et al. (2005) use a local polynomial method. Other papers on semiparametric and nonparametric censored outcome regression include Cheng and Small (2021), Heuchenne and Van Keilegom (2007, 2010), Huang et al. (2019), and Oganisian et al. (2021). Gaussian Process censored outcome regression methods are applied by Groot and Lucas (2012), Cao et al. (2018), Gammelli et al. (2020, 2022), and Basson et al. (2023). Zhang et al. (2021) and Wu et al. (2018) implement censored outcome neural network methods.

A number of recent papers have considered Tobit model selection and regularization. Zhang et al. (2012) describe Focused Information Criterion based Tobit model selection and averaging. Jacobson and Zou (2024) provide theoretical and empirical results for Tobit with a Lasso penalty and a folded concave penalty (SCAD). Müller and van de Geer (2016) and Soret et al. (2018) describe LASSO-penalized censored outcome models. Bradic and Guo (2016) study robust penalized estimators for censored outcome regression.

The Bayesian Tobit literature includes quantile regression methods (Ji et al. 2012; Yu and Stander 2007; Alhamzawi 2016), and Bayesian elastic net Tobit (Alhamzawi 2020). Ji et al. (2012) account for model uncertainty by implementing Tobit quantile regression with Stochastic Search Variable Selection. However, the outcome and latent variable are modeled as linear functions of covariates. TOBART-1 provides a competing approach to the methods referenced above that does not impose linearity.

The remainder of the paper is structured as follows: In Sect. 2 we describe the TOBART-1 model and Markov chain Monte Carlo (MCMC) implementation, Sect. 3 contains simulation studies for prediction and treatment effect estimation with censored data, Sect. 4 contains applications to real world data, and Sect. 5 concludes the paper.

2 Methods

2.1 Review of Bayesian Additive Regression Trees (BART)

Suppose there are n observations, and the \(n \times p\) matrix of explanatory variables, X, has \(i^{th}\) row \(x_i=[x_{i1},...,x_{ip}]\). Following the notation of Chipman et al. (2010), let T be a binary tree consisting of a set of interior node decision rules and a set of terminal nodes, and let \(M = \{ \mu _1,..., \mu _b \}\) denote a set of parameter values associated with each of the b terminal nodes of T. The interior node decision rules are binary splits of the predictor space into the sets \(\{ x_{is} \le c \}\) and \(\{ x_{is} > c \}\) for continuous \(x_{s}\). Each observation’s \(x_i\) vector is associated with a single terminal node of T, and is assigned the \(\mu \) value associated with this terminal node. For a given T and M, the function \(g(x_i;T,M)\) assigns a \(\mu \in M\) to \(x_i\).

For the standard BART model, the outcome is determined by a sum of trees,

$$\begin{aligned} Y_i = \sum _{j=1}^m g(x_i; T_j, M_j)+\varepsilon _i \end{aligned}$$

where \(g(x_i;T_j,M_j)\) is the output of a decision tree. \(T_j\) refers to a decision tree indexed by \(j=1,...,m\), where m is the total number of trees in the model. \(M_j\) is the set of terminal node parameters of \(T_j\), and \(\varepsilon _i \overset{i.i.d}{\sim } N(0, \sigma ^2)\).

Prior independence is assumed across trees \(T_j\) and across terminal node means \(M_j = (\mu _{1j},...,\mu _{b_j j})\) (where \(1,...,b_j\) indexes the terminal nodes of tree j). The form of the prior used by Chipman et al. (2010) is \(p(M_1,...,M_m,T_1,...,T_m,\sigma ) \propto \left[ \prod _j \left[ \prod _k p(\mu _{kj}|T_j) \right] p(T_j)\right] p(\sigma ) \), where \(\mu _{kj} | T_j \overset{i.i.d}{\sim } N(0,\sigma _{\mu }^2)\) with \(\sigma _{\mu } = \frac{0.5}{\kappa \sqrt{m}}\), and \(\kappa \) is a user-specified hyperparameter.

Chipman et al. (2010) set a regularization prior on the tree size and shape \(p(T_j)\). The probability that a given node within a tree \(T_j\) is split into two child nodes is \(\alpha (1+d_h)^{-\beta }\), where \(d_h\) is the depth of (internal) node h, and the parameters \(\alpha \) and \(\beta \) determine the size and shape of \(T_j\) respectively. Chipman et al. (2010) use uniform priors on available splitting variables and splitting points. The model precision \(\sigma ^{-2}\) has a conjugate prior distribution \(\sigma ^{-2} \sim Ga(\frac{v}{2}, \frac{v \lambda }{2})\) with degrees of freedom v and scale \(\lambda \).

Samples from \(p((T_1, M_1),...,(T_m,M_m), \sigma | y)\) can be drawn using a Bayesian backfitting MCMC algorithm. This algorithm involves m successive draws from \((T_j, M_j )| T_{(j)}, M_{(j)}, \sigma , y \) for \(j=1,...,m\), where \(T_{(j)}, M_{(j)} \) denote the trees and terminal node parameters of all trees except the \(j^{th}\) tree, followed by a draw of \(\sigma \) from the full conditional \(\sigma | T_1,...,T_m,M_1,...,M_m,y\). After burn-in, the sequence of \(f^*\) draws, \(f_1^*,...,f_Q^*\), where \(f^*(\cdot )= \sum _{j=1}^m g(\cdot \,; T_j^*, M_j^*)\), is an approximate sample of size Q from p(f|y).
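The structure of one backfitting sweep can be sketched as follows. This is a minimal illustration assuming numpy; `update_tree` is a placeholder for the Metropolis-Hastings tree proposal and conjugate leaf-mean draws of Chipman et al. (2010), and only the partial-residual bookkeeping and the draw of \(\sigma \) from its inverse-gamma full conditional are spelled out.

```python
import numpy as np

def sigma_full_conditional_draw(resid, nu, lam, rng):
    """Draw sigma from its inverse-gamma full conditional given the current residuals.

    Prior: sigma^{-2} ~ Gamma(shape=nu/2, rate=nu*lam/2); residuals ~ N(0, sigma^2).
    """
    n = resid.shape[0]
    shape = 0.5 * (nu + n)
    rate = 0.5 * (nu * lam + np.sum(resid ** 2))
    precision = rng.gamma(shape, 1.0 / rate)  # numpy parameterises the gamma by scale = 1/rate
    return 1.0 / np.sqrt(precision)

def backfitting_sweep(y, tree_fits, update_tree, sigma, nu, lam, rng):
    """One sweep of Bayesian backfitting.

    tree_fits is an (m, n) array holding the current g(x_i; T_j, M_j) for each tree j;
    update_tree(j, partial_resid, sigma) stands in for the tree move and leaf-mean draws
    and must return the updated fitted values of tree j.
    """
    m = tree_fits.shape[0]
    for j in range(m):
        # residual of the outcome against all trees except tree j
        partial_resid = y - tree_fits.sum(axis=0) + tree_fits[j]
        tree_fits[j] = update_tree(j, partial_resid, sigma)
    sigma = sigma_full_conditional_draw(y - tree_fits.sum(axis=0), nu, lam, rng)
    return tree_fits, sigma
```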

2.2 Soft trees and sparse splitting rules

In addition to the standard Bayesian tree model for \(f(\varvec{x}_i)\) described in Sect. 2.1, we also implement TOBART and TOBART-NP with soft trees and sparse splitting rules as described by Linero and Yang (2018). Predictions from soft trees are weighted linear combinations of all terminal node parameter values, with the weights being functions of distances between covariates and splitting points. The prediction from a single tree function is

$$\begin{aligned} g(\varvec{x}_i; T_j, M_j)&= \sum _{\ell = 1}^{L_j} \mu _{j,\ell } \, \xi (\varvec{x}_i, T_j, \ell ) \\ \xi (\varvec{x}_i, T_j, \ell )&= \prod _{b \in \mathcal {A}(\ell )} \zeta \left( \frac{ x_{{j_b}} - C_b}{\tau _b} \right) ^{ R_b(\ell ) } \Big \{ 1 - \zeta \left( \frac{ x_{{j_b}} - C_b}{\tau _b} \right) \Big \}^{ 1 - R_b(\ell ) } \end{aligned}$$

where \(L_j\) is the number of leaves in the \(j^{th}\) tree, \(\mu _{j,\ell }\) is the \(\ell ^{th}\) terminal node parameter of the \(j^{th}\) tree, \(\mathcal {A}(\ell )\) denotes the set of ancestor (internal) nodes of terminal node \(\ell \), and \(R_b(\ell ) = 1\) if the path from the root to terminal node \(\ell \) follows the right branch (the \(x_{{j_b}} > C_b\) branch) at ancestor node b, with \(R_b(\ell ) = 0\) otherwise. The splitting variable, splitting point, and bandwidth parameter at internal node b are denoted by \(x_{{j_b}}\), \(C_b\), and \(\tau _b\) respectively. The gating function \(\zeta \) is the logistic function \( \zeta (x) = (1+\exp (-x))^{-1}\), so every observation receives a positive weight at every terminal node, and the weights approach a hard decision-tree assignment as \(\tau _b \rightarrow 0\).
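The following minimal sketch (assuming numpy; the node and leaf encodings are illustrative rather than the notation of any particular implementation) shows how a single soft-tree prediction weights every terminal node parameter by the product of gating probabilities along the leaf's path.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def soft_tree_predict(x, internal_nodes, leaves):
    """Prediction from one soft tree.

    internal_nodes maps node id -> (splitting variable, split point C, bandwidth tau);
    leaves is a list of (mu, path) pairs, where path lists (node id, went_right)
    for every ancestor of the leaf.
    """
    total = 0.0
    for mu, path in leaves:
        weight = 1.0
        for node_id, went_right in path:
            var, c, tau = internal_nodes[node_id]
            p_right = logistic((x[var] - c) / tau)      # gating probability at this ancestor
            weight *= p_right if went_right else (1.0 - p_right)
        total += mu * weight                             # leaf weights sum to one over leaves
    return total

# A depth-two example: the root splits on x[0] at 0.5, its right child splits on x[1] at 0.3.
internal_nodes = {0: (0, 0.5, 0.1), 1: (1, 0.3, 0.1)}
leaves = [(-1.0, [(0, False)]),                # left of the root
          (0.5, [(0, True), (1, False)]),      # right of the root, left of node 1
          (2.0, [(0, True), (1, True)])]       # right of the root, right of node 1
print(soft_tree_predict(np.array([0.6, 0.7]), internal_nodes, leaves))
```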

Sparse splitting rules are introduced by placing a Dirichlet prior on the splitting probabilities \((s_1,\dots , s_p) \sim \mathcal {D} (\frac{a}{p},\dots , \frac{a}{p})\). The parameter a controls the level of sparsity and has the prior \(\text {Beta}(0.5,1)\). Linero and Yang (2018) demonstrate that soft trees allow BART to model smooth functions, and the Dirichlet prior on splitting probabilities adapts to unknown levels of sparsity to provide improved predictions on high dimensional data sets.

2.3 Type I Tobit and TOBART

2.3.1 Type I Tobit model

The Type I Tobit model with censoring from below at a and censoring from above at b is:

$$\begin{aligned} Y_i^*&= \varvec{x}_i \varvec{\beta } + \varepsilon _i, \quad \varepsilon _i \overset{i.i.d.}{\sim } N(0, \sigma ^2)\\ Y_i&= {\left\{ \begin{array}{ll} a & \text {if } Y_i^* \le a \\ Y_i^* & \text {if } a< Y_i^* < b \\ b & \text {if } b \le Y_i^* \end{array}\right. } \end{aligned}$$

where a normal prior is placed on \(\beta \), and an inverse gamma prior is placed on \(\sigma ^2\) (Chib 1992).

2.3.2 Type I TOBART model

The Type I TOBART model replaces the linear combination \(\varvec{x}_i \varvec{\beta }\) with the sum-of-trees function \(f(\varvec{x}_i)\):

$$\begin{aligned} Y_i^*&= f(\varvec{x}_i) + \varepsilon _i, \quad \varepsilon _i \overset{i.i.d.}{\sim } N(0, \sigma ^2)\\ Y_i&= {\left\{ \begin{array}{ll} a & \text {if } Y_i^* \le a \\ Y_i^* & \text {if } a< Y_i^* < b \\ b & \text {if } b \le Y_i^* \end{array}\right. } \end{aligned}$$

where a BART prior is placed on \( f(\varvec{x}_i)\) and an inverse gamma prior is placed on \(\sigma ^2\).

2.3.3 Type I TOBART Gibbs sampler

Tobit can be implemented by MCMC with data augmentation (Chib 1992). The realization, \(y_i^*\), of the variable \(Y_i^*\) is observed for uncensored outcomes, and is sampled from its full conditional for censored outcomes.

$$\begin{aligned} y_i^*&= y_i \quad \text {if } y_i \in (a,b), \quad \text {and}\\ y_i^*&\sim {\left\{ \begin{array}{ll} \mathcal{T}\mathcal{N}_{(-\infty ,a]}(f(\varvec{x}_i), \sigma ^2) & \text {if } y_i = a \\ \mathcal{T}\mathcal{N}_{[b,\infty )}(f(\varvec{x}_i), \sigma ^2) & \text {if } y_i = b \end{array}\right. } \end{aligned}$$

where \(\mathcal{T}\mathcal{N}_{[l,u]}\) denotes a normal distribution truncated to the interval \([l,u]\). The full conditionals for \(f(\varvec{x}_i)\) and \(\sigma ^2\) are the standard full conditionals for BART with \(y_i^*\) as the dependent variable and \(\varvec{x}_i\) as the potential splitting variables. Appendix A contains a description of a sampler that produces draws \(f^{(1)}(\varvec{x}_i), \ldots ,f^{(D)}(\varvec{x}_i)\) and \(\sigma ^{(1)},\ldots ,\sigma ^{(D)}\).
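The data-augmentation step can be sketched in a few lines. This is a minimal illustration assuming numpy and scipy; the function and array names (`augment_latent_outcomes`, `y`, `f_x`) are illustrative and not from the paper.

```python
import numpy as np
from scipy.stats import truncnorm

def augment_latent_outcomes(y, f_x, sigma, a, b, rng=None):
    """One data-augmentation draw of y* given the current fit f(x_i) and sigma.

    Uncensored outcomes are kept as observed; outcomes recorded at the lower (upper)
    threshold are drawn from a normal with mean f(x_i) truncated above at a (below at b).
    """
    y = np.asarray(y, dtype=float)
    f_x = np.asarray(f_x, dtype=float)
    y_star = y.copy()
    lower = y == a   # recorded at the lower censoring threshold
    upper = y == b   # recorded at the upper censoring threshold
    if lower.any():
        # scipy's truncnorm takes bounds standardised by (loc, scale)
        y_star[lower] = truncnorm.rvs(-np.inf, (a - f_x[lower]) / sigma,
                                      loc=f_x[lower], scale=sigma, random_state=rng)
    if upper.any():
        y_star[upper] = truncnorm.rvs((b - f_x[upper]) / sigma, np.inf,
                                      loc=f_x[upper], scale=sigma, random_state=rng)
    return y_star
```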

2.3.4 Predicting outcomes with TOBART

The conditional mean of the latent variable is \(f(\varvec{x}_i)\). If censoring is also applied to the test data, then the outcomes are predicted by averaging the standard Tobit expectation formula across MCMC iterations:

For all MCMC iterations \(d=1,...,D\) calculate

$$\begin{aligned} E[Y_i|X_i =\varvec{x}_i, f^{(d)}, \sigma ^{(d)}]&= a \, \Phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \\&\quad +f^{(d)}(\varvec{x}_i) \Bigg [ \Phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \Phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \Bigg ]\\&\quad + \sigma ^{(d)} \Bigg ( \phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \Bigg ) \\&\quad +b \Bigg [ 1 - \Phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) \Bigg ] \end{aligned}$$

The predicted outcome is \(\frac{1}{D} \sum _{d=1}^D E[Y_i|X_i=\varvec{x}_i, f^{(d)}, \sigma ^{(d)}]\). The expectation conditional on the outcome not being in the censored range is:

$$\begin{aligned} E[Y_i| a< Y_i < b, X_i=\varvec{x}_i, f^{(d)}, \sigma ^{(d)}] = f^{(d)}(\varvec{x}_i) + \sigma ^{(d)} \frac{ \phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) }{ \Phi \Big (\frac{b - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) - \Phi \Big (\frac{a - f^{(d)}(\varvec{x}_i)}{\sigma ^{(d)}} \Big ) } \end{aligned}$$
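Both expectations are straightforward to compute from the MCMC output. The sketch below assumes numpy and scipy and uses hypothetical names for the arrays of posterior draws; it illustrates the formulas above rather than the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def tobit_expectation(f, sigma, a, b):
    """E[Y | x] under the Type I Tobit model, censored below at a and above at b."""
    alpha, beta = (a - f) / sigma, (b - f) / sigma
    return (a * norm.cdf(alpha)
            + f * (norm.cdf(beta) - norm.cdf(alpha))
            + sigma * (norm.pdf(alpha) - norm.pdf(beta))
            + b * (1.0 - norm.cdf(beta)))

def tobit_truncated_expectation(f, sigma, a, b):
    """E[Y | a < Y < b, x]: the latent mean plus an inverse-Mills-ratio style correction."""
    alpha, beta = (a - f) / sigma, (b - f) / sigma
    return f + sigma * (norm.pdf(alpha) - norm.pdf(beta)) / (norm.cdf(beta) - norm.cdf(alpha))

# Posterior-mean prediction, averaging over MCMC draws (hypothetical array names):
# f_draws has shape (D, n) and sigma_draws has shape (D,).
# y_hat = np.mean([tobit_expectation(f_d, s_d, a, b)
#                  for f_d, s_d in zip(f_draws, sigma_draws)], axis=0)
```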

2.4 Nonparametric Type I TOBART

2.4.1 Nonparametric Type I TOBART model

The accuracy of the TOBART conditional expectation depends on the validity of the normality assumption for the errors. More general error distributions can be accommodated by assuming a Dirichlet process mixture of normal distributions for the error terms.

$$\begin{aligned} y_i^*&= f(\varvec{x}_i) + \varepsilon _i, \quad y_i = {\left\{ \begin{array}{ll} a & \text {if } y_i^* \le a \\ y_i^* & \text {if } a< y_i^* < b \\ b & \text {if } b \le y_i^* \end{array}\right. } \\ \varepsilon _i \mid \vartheta _i&\sim N(\gamma _i, \sigma _i^2), \quad \vartheta _i = (\gamma _i, \sigma _i) \overset{i.i.d.}{\sim } G\\ G&\sim \mathcal{D}\mathcal{P}(G_0, \alpha ) \end{aligned}$$

The distribution of the error term is specified similarly to George et al. (2019). The base distribution \(G_0\) is defined as follows:

$$\begin{aligned} p(\gamma , \sigma \mid \nu , \lambda , \gamma _0, k_0)&= p(\sigma \mid \nu , \lambda ) \, p(\gamma \mid \sigma , \gamma _0, k_0)\\ \sigma ^2&\sim \frac{\nu \lambda }{\chi _{\nu }^2}, \quad \gamma \mid \sigma \sim \mathcal {N} \Big (\gamma _0, \frac{\sigma ^2}{k_0} \Big ) \end{aligned}$$

where, in contrast to the standard BART prior of Chipman et al. (2010), \(\nu \) is set to 10 instead of 3. The parameter \(\lambda \) is set such that the \(q^{th}\) quantile of the prior distribution of \(\sigma \) is the sample standard deviation of the outcome, or of the residuals from a linear model. For TOBART-NP, \(q=0.9\) instead of 0.95. The prior on \(\alpha \) is the \(\alpha \sim \Gamma (2,2)\) prior introduced by Escobar and West (1995) and applied by Van Hasselt (2011).

The outcome is centered by subtracting the sample mean before applying the Gibbs sampler; therefore, following George et al. (2019), \(\gamma _0 = 0\). The parameter \(k_0\) is set via the marginal prior distribution of \(\gamma \) (\(\gamma \sim \frac{\sqrt{\lambda } }{ \sqrt{k_0 } } t_{\nu } \)). Given \(k_s\) (set to 10 by default), \(k_0\) is chosen such that \( \max _{i=1,...,n} |e_i| = k_s \frac{\sqrt{\lambda } }{ \sqrt{k_0 } } \), where \(e_1,...,e_n\) are the residuals from a linear model. The Gibbs sampler for TOBART-NP is described in Appendix A.
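To illustrate the prior on the error distribution, the following sketch draws error-term parameters \((\gamma _i, \sigma _i)\) from \(G \sim \mathcal{D}\mathcal{P}(G_0, \alpha )\) using a truncated stick-breaking representation. It assumes numpy, uses placeholder hyperparameter values, and is a prior simulation only; the posterior Gibbs sampler of Appendix A is more involved.

```python
import numpy as np

def draw_from_G0(nu, lam, gamma0, k0, rng):
    """One draw (gamma, sigma) from the base distribution G_0:
    sigma^2 ~ nu*lam / chi^2_nu, then gamma | sigma ~ N(gamma0, sigma^2 / k0)."""
    sigma2 = nu * lam / rng.chisquare(nu)
    gamma = rng.normal(gamma0, np.sqrt(sigma2 / k0))
    return gamma, np.sqrt(sigma2)

def draw_error_params_dp(n, alpha, nu, lam, gamma0, k0, rng, trunc=200):
    """Draw (gamma_i, sigma_i), i = 1..n, from G ~ DP(G_0, alpha), with G approximated
    by a truncated stick-breaking construction with `trunc` atoms."""
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # stick-breaking weights
    atoms = [draw_from_G0(nu, lam, gamma0, k0, rng) for _ in range(trunc)]
    idx = rng.choice(trunc, size=n, p=w / w.sum())              # ties induce clustering
    return np.array([atoms[i] for i in idx])                    # shape (n, 2): gamma_i, sigma_i

rng = np.random.default_rng(1)
params = draw_error_params_dp(n=5, alpha=1.0, nu=10, lam=0.5, gamma0=0.0, k0=2.0, rng=rng)
```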

For each MCMC iteration, d, and observation i, we obtain \(\vartheta _i^{(d)} = (\gamma _i^{(d)}, \sigma _i^{(d)})\). The conditional expectation, \(E[y_i|\varvec{x}_i, f^{(d)}, \gamma _i^{(d)}, \sigma _i^{(d)}] \), is calculated as outlined in Sect. 2.3.4, with \(f^{(d)}(\varvec{x}_i) + \gamma _i^{(d)}\) in place of \(f^{(d)}(\varvec{x}_i)\) and \(\sigma _i^{(d)}\) in place of \(\sigma ^{(d)}\).

2.5 Treatment effect estimation for censored outcomes

Let a binary variable \(T_i\) equal 1 if unit i is assigned to treatment and 0 if i is assigned to the control group. The potential outcomes under treatment and control are denoted by \(Y_i(1)\) and \(Y_i(0)\) respectively, and the corresponding potential values of the latent outcome by \(Y_i^*(1)\) and \(Y_i^*(0)\). Assume the data generating process is as follows:

$$\begin{aligned} Y_i^*&= \mu (\varvec{x}_i) + \tau (\varvec{x}_i) T_i + \varepsilon _i, \quad \varepsilon _i \sim \mathcal {N}(0,\sigma ^2)\\ Y_i&= {\left\{ \begin{array}{ll} a & \text {if } Y_i^* \le a \\ Y_i^* & \text {if } a< Y_i^* < b \\ b & \text {if } b \le Y_i^* \end{array}\right. } \end{aligned}$$

where \(\mu (\varvec{x}_i)\) and \(\tau (\varvec{x}_i)\) are possibly nonlinear functions of covariates. Assume conditional unconfoundedness, i.e. \(Y_i^*(1),Y_i^*(0) \perp T_i | X_i\). The estimand is the conditional average treatment effect on \(Y_i^*\), i.e., \( E[Y_i^*(1) - Y_i^*(0) | X_i = \varvec{x}_i] = \tau (\varvec{x}_i) \). However, a model naively trained only on uncensored outcomes estimates the following quantity:

$$\begin{aligned}&E[Y_i(1) \mid a< y_i< b, X_i = \varvec{x}_i] - E[Y_i(0) \mid a< y_i < b, X_i = \varvec{x}_i] \\&\quad = \tau (\varvec{x}_i) + \sigma \Bigg ( \frac{ \phi \Big (\frac{a - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i))}{\sigma } \Big ) - \phi \Big (\frac{b - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i)) }{\sigma } \Big ) }{ \Phi \Big (\frac{b - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i)) }{\sigma } \Big ) - \Phi \Big (\frac{a - ( \mu (\varvec{x}_i) + \tau (\varvec{x}_i)) }{\sigma } \Big ) } - \frac{ \phi \Big (\frac{a - \mu (\varvec{x}_i)}{\sigma } \Big ) - \phi \Big (\frac{b - \mu (\varvec{x}_i)}{\sigma } \Big ) }{ \Phi \Big (\frac{b - \mu (\varvec{x}_i)}{\sigma } \Big ) - \Phi \Big (\frac{a - \mu (\varvec{x}_i)}{\sigma } \Big ) } \Bigg ) . \end{aligned}$$

A sufficiently flexible nonparametric method, without restrictive assumptions on the error term, will produce estimates that approximate the expression above. A model naively trained on the full data set with censoring similarly gives biased estimates (see Appendix B). By directly modelling \(Y_i^*\), censored outcome models avoid the bias described above. Similar biases occur if the error term is not normally distributed.
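The size of this bias is easy to evaluate numerically. The snippet below (numpy and scipy assumed; the parameter values are illustrative, not taken from the paper) evaluates the truncated-mean correction terms in the display above and shows that the naive contrast differs from \(\tau (\varvec{x})\).

```python
import numpy as np
from scipy.stats import norm

def truncated_mean_shift(m, sigma, a, b):
    """sigma * [phi(alpha) - phi(beta)] / [Phi(beta) - Phi(alpha)], alpha = (a-m)/sigma, beta = (b-m)/sigma."""
    alpha, beta = (a - m) / sigma, (b - m) / sigma
    return sigma * (norm.pdf(alpha) - norm.pdf(beta)) / (norm.cdf(beta) - norm.cdf(alpha))

# Illustrative values (not from the paper): mu(x) = 0, tau(x) = 1, sigma = 1,
# censoring from below at a = -1 and from above at b = 2.
mu_x, tau_x, sigma, a, b = 0.0, 1.0, 1.0, -1.0, 2.0
bias = truncated_mean_shift(mu_x + tau_x, sigma, a, b) - truncated_mean_shift(mu_x, sigma, a, b)
print(bias)   # nonzero, so the naive contrast does not equal tau(x) = 1
```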

3 Simulation studies

3.1 Description of prediction simulations

We adapt the data generating process (DGP) introduced by Friedman (1991) to a censored regression setting. This DGP has often been applied in comparisons of semiparametric regression methods. We also make use of the censored outcome simulations described by Groot and Lucas (2012), Sigrist and Hirnschall (2019), and Jacobson and Zou (2024), allowing a fair comparison against competing methods on existing synthetic censored data.

The covariates \(x_1,...,x_p\) are independently sampled from the uniform distribution on the unit interval. The outcome before censoring is generated from one of the following functions:

  • \( y^* = 10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(15^{th}\) percentile of the training data \(y^*\) values (Friedman 1991).

  • \( y^* = 10 \sin (\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(15^{th}\) percentile of the training data \(y^*\) values, and from above at the \(85^{th}\) percentile of the training data \(y^*\) values (Friedman 1991).

  • \( y^* = (6x_1 - 2)^2 \sin (2(6x_1 - 2) ) + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(40^{th}\) percentile of the training data \(y^*\) values (Groot and Lucas 2012).

  • \( y^* = \sum _{k=1}^5 0.3 \max (x_k,0) + \sum _{k=1}^3 \sum _{j=k+1}^4 \max (x_k x_j,0) + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from above at the \(95^{th}\) percentile of the training data \(y^*\) values (Sigrist and Hirnschall 2019). For this simulation, \(x_1,...,x_p\) are uniformly distributed on \([-1,1]\) instead of [0, 1].

  • \( y^* = 3 + 5 x_1 + x_2 + \frac{x_3}{2} - 2 x_4 + \frac{x_5}{10} + \varepsilon , \ \varepsilon \sim \mathcal {N}(0,\sigma ^2) \) with censoring from below at the \(25^{th}\) percentile of the training data \(y^*\) values (Jacobson and Zou 2024).

The variance of the error, \(\sigma ^2\), is set to 1. See the Supplementary Appendix (Online Resource 1) for results from simulations with \(\sigma \in \{0.1, 2\}\). We also consider deviations from the assumption of normally distributed errors; in particular, we include results for simulations in which \(\varepsilon \) is generated from Skew-t and \(\text {Weibull}(1/2, 1/5)\) distributions. The number of covariates, p, is set to 30. We generate 500 training and 500 test observations.
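As an illustration, the first (Friedman) DGP with censoring can be generated as follows. This is a sketch assuming numpy, not the authors' simulation code; in the study the censoring thresholds are percentiles of the training \(y^*\) values, whereas here they are computed from the single generated sample.

```python
import numpy as np

def friedman_censored(n=500, p=30, sigma=1.0, lower_q=15, upper_q=None, rng=None):
    """Generate one censored Friedman (1991) data set.

    Covariates are U[0,1]; y* depends only on x_1,...,x_5; censoring thresholds are
    percentiles of the generated y* values.
    """
    rng = rng or np.random.default_rng()
    X = rng.uniform(size=(n, p))
    y_star = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
              + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, sigma, n))
    a = np.percentile(y_star, lower_q)
    b = np.percentile(y_star, upper_q) if upper_q is not None else np.inf
    y = np.clip(y_star, a, b)        # observed, censored outcome
    return X, y, y_star, a, b

X, y, y_star, a, b = friedman_censored(lower_q=15, upper_q=85, rng=np.random.default_rng(0))
```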

3.2 Prediction simulation results

We compare the performance of TOBART-1, TOBART-1-NP, Soft TOBART-1, and Soft TOBART-1-NP against Grabit (Sigrist and Hirnschall 2019), linear Tobit (Tobin 1958), BART (Chipman et al. 2010), Random Forests (RF) (Breiman 2001), Gaussian Processes, and a Tobit Gaussian Process model (Groot and Lucas 2012). The results for a Gaussian Process (GP) with only 5 variables (always including all informative variables) are included because GPs were observed to produce inaccurate predictions when applied to data with 30 variables. Censored outcome predictions are evaluated using Mean Squared Error (MSE), and predicted probabilities of censoring are evaluated using the Brier Score. All results are averaged over 5 repetitions.
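For reference, the predicted probability of censoring under the Tobit model and the Brier score used to evaluate it can be computed as below (numpy and scipy assumed; a sketch, not the evaluation code used for the tables). For TOBART, these probabilities would be averaged over MCMC draws of f and \(\sigma \).

```python
import numpy as np
from scipy.stats import norm

def prob_censored(f, sigma, a, b):
    """P(Y = a or Y = b | x) under the Tobit model: latent mass below a plus mass above b."""
    return norm.cdf((a - f) / sigma) + 1.0 - norm.cdf((b - f) / sigma)

def brier_score(censored, prob):
    """Mean squared difference between the 0/1 censoring indicator and the predicted probability."""
    return np.mean((np.asarray(censored, dtype=float) - np.asarray(prob)) ** 2)
```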

The results for simulations with normally distributed errors are presented in Tables 1 and 2. The TOBART algorithms generally outperform competing methods across all DGPs; the exception, unsurprisingly, is the linear Jacobson and Zou (2024) DGP, for which linear Tobit is outperformed only by Soft TOBART. TOBART-NP can slightly improve on TOBART in some cases, but the results are generally similar when errors are normally distributed. The differences in criteria across methods are small for the more linear DGPs from Sigrist and Hirnschall (2019) and Jacobson and Zou (2024), as linear Tobit is designed for a linear DGP, and the nonlinear methods BART and RF can model the relatively simple response surface well. It is worth noting that TOBART outperforms Grabit even though the true standard deviation, \(\sigma =1\), is included as one of five possible Grabit hyperparameter values in cross-validation. The same pattern of results can be observed for simulations with \(\sigma =0.1\) and \(\sigma = 2\) in the Supplementary Appendix. The Supplementary Appendix also contains comparisons of Area Under the Curve for all methods and DGPs, from which similar conclusions can be drawn.

The results for Skew-t and Weibull distributed errors are also presented in Tables 1 and 2. The TOBART models outperform all other methods for almost all DGPs and criteria. The results for the Weibull distribution generally favour TOBART-NP and Soft TOBART-NP, indicating that there is some improvement from the Dirichlet Process model when the errors are sufficiently non-Gaussian.

The average coverage and length of \(95\%\) prediction intervals for the latent outcomes and the observed outcomes are given in the Supplementary Appendix (Online Resource 1). For most DGPs and error distributions, TOBART and Soft TOBART provide coverage closest to the nominal 95% for prediction intervals for both latent and observed outcomes. For some DGPs with non-normal errors, the more conservative intervals produced by TOBART-NP and Soft TOBART-NP provide better coverage.

Table 1 Simulation study, mean squared error
Table 2 Simulation study, Brier score

3.3 Description of treatment effect simulations

A number of recent simulation studies have demonstrated that BART is among the most accurate treatment effect estimation methods (Wendling et al. 2018; McConnell and Lindner 2019; Dorie et al. 2019; Hahn et al. 2019). However, in practice many data sets, including randomized trial data sets, contain censored outcomes. For example, antibody concentrations or environmental levels of chemicals can only be measured accurately within a certain range as a result of limitations of measuring equipment. Economic data is often censored due to privacy considerations; for example, income might be censored above a certain threshold. TOBART provides a machine learning treatment effect estimation method with uncertainty quantification that can be applied to such data while still making use of the information provided by censored observations. We demonstrate the effectiveness of TOBART by censoring the outcomes of DGPs from published studies of machine learning methods for treatment effect estimation. The chosen data generating processes contain linear and nonlinear functions of covariates, constant and heterogeneous effects, and various degrees of confounding.

3.3.1 Censored Caron et al. (2022) simulations

\(P=10\) covariates are generated from a multivariate Gaussian distribution, \(X_1,\ldots , X_{10} \sim \mathcal {MVN}(\varvec{0}, \Sigma )\), with \(\Sigma _{jk} = 0.6^{|j-k|} + 0.1 \mathbb {I}(j \ne k) \). The binary treatment variable is Bernoulli distributed, \(Z_i \sim \text {Bern}(\pi (\varvec{x}_i)) \), where

$$\begin{aligned} \pi (\varvec{x}_i) = \Phi (-0.4 + 0.3 X_{i,1} + 0.2 X_{i,2} ) \end{aligned}$$

and \(\Phi (\cdot )\) is the cumulative distribution function of the standard normal distribution.

The prognostic score function, \(\mu (\varvec{x}_i)\), and CATE function, \(\tau (\varvec{x}_i)\), are defined as

$$\begin{aligned} \mu (\varvec{x}_i)&= 3 + X_{i,1} + 0.8 \sin ( X_{i,2} ) + 0.7 X_{i,3} X_{i,4} - X_{i,5}\\ \tau (\varvec{x}_i)&= 2 + 0.8 X_{i,1} - 0.3 X_{i,2}^2 \end{aligned}$$

The outcome before censoring is generated as:

$$\begin{aligned} Y_i^* = \mu (\varvec{x}_i) + \tau (\varvec{x}_i) Z_i + \varepsilon _i \, \ \text {where} \ \varepsilon _i \sim \mathcal {N}(0,1) \end{aligned}$$

The number of sampled observations is 200. The observed outcome \(Y_i\) is censored from below at the \(15^{th}\) percentile of the generated \(Y_i^*\) values, and from above at the \(85^{th}\) percentile.
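A sketch of this DGP (assuming numpy and scipy; function and variable names are illustrative) is given below.

```python
import numpy as np
from scipy.stats import norm

def caron_censored(n=200, rng=None):
    """Generate one censored Caron et al. (2022) data set as described above."""
    rng = rng or np.random.default_rng()
    p = 10
    # Sigma_jk = 0.6^|j-k| + 0.1 * 1(j != k)
    Sigma = np.fromfunction(lambda j, k: 0.6 ** np.abs(j - k) + 0.1 * (j != k), (p, p))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    pi = norm.cdf(-0.4 + 0.3 * X[:, 0] + 0.2 * X[:, 1])          # propensity score
    Z = rng.binomial(1, pi)                                      # treatment indicator
    mu = 3 + X[:, 0] + 0.8 * np.sin(X[:, 1]) + 0.7 * X[:, 2] * X[:, 3] - X[:, 4]
    tau = 2 + 0.8 * X[:, 0] - 0.3 * X[:, 1] ** 2                 # CATE
    y_star = mu + tau * Z + rng.normal(size=n)
    a, b = np.percentile(y_star, [15, 85])                       # censoring thresholds
    y = np.clip(y_star, a, b)
    return X, Z, y, tau
```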

Table 3 Treatment effect simulation results

3.3.2 Censored Friedberg et al. (2020) simulations

\(P=20\) covariates are generated from independent standard uniform distributions \(X_1,...,X_{20} \sim \mathcal {U}[0,1]\). There is no confounding as \(\pi (\varvec{x}_i)=0.5\) and \(Z_i \sim \text {Bern}(\pi (\varvec{x}_i)) \). The prognostic score function, \(\mu (\varvec{x}_i)\), and CATE function, \(\tau (\varvec{x}_i)\), are defined as \( \mu (\varvec{x}_i) = 0 \) and

$$\begin{aligned} \tau (\varvec{x}_i)&= \left( 1 + \frac{1}{1 + \exp \left( -20\left( X_{i,1} - \frac{1}{3}\right) \right) } \right) \left( 1 + \frac{1}{1 + \exp \left( -20\left( X_{i,2} - \frac{1}{3}\right) \right) } \right) . \end{aligned}$$

The outcome before censoring is generated as:

$$\begin{aligned} Y_i^* = \mu (\varvec{x}_i) + \tau (\varvec{x}_i) Z_i + \varepsilon _i \, \ \text {where} \ \varepsilon _i \sim \mathcal {N}(0,1) \end{aligned}$$

The number of sampled observations is 200. The observed outcome \(Y_i\) is censored from below at the \(15^{th}\) percentile of the generated \(Y_i^*\) values, and from above at the \(85^{th}\) percentile.

3.3.3 Censored Nie and Wager (2021) simulations

The covariates are generated as follows across scenarios A to D. In simulation A, \(X_1,...,X_{12} \sim \mathcal {U}[0,1]\). In simulations B to D, \(X_1,...,X_{12} \sim \mathcal {N}(0,1)\).

\(\pi (\varvec{x}_i)\) is defined as follows across scenarios A to D: (A) \( \text {trim}_{0.1} \{ \sin (\pi X_{i,1} X_{i,2} ) \} \), (B) constant equal to 0.5, (C) \(1/\{1 + \exp (X_{i,2} + X_{i,3} )\}\), (D) \(1/\{1 + \exp (-X_{i,1}) + \exp ( - X_{i,2} )\}\).

\(\mu (\varvec{x}_i)\) is defined as follows across scenarios A to D: (A) \(\sin (\pi X_{i,1} X_{i,2}) + 2 (X_{i,3}-0.5)^2 + X_{i,4} + 0.5 X_{i,5} \), (B) \(\max \{X_{i,1} + X_{i,2}, X_{i,3},0 \} \), (C) \(2 \log \{ 1 + \exp ( X_{i,1} + X_{i,2} + X_{i,3} ) \} \), (D) \(\frac{1}{2} [ \max \{ X_{i,1} + X_{i,2} + X_{i,3},0 \} + \max \{ X_{i,4} + X_{i,5},0 \} ] \).

\(\tau (\varvec{x}_i)\) is defined as follows across scenarios A to D: (A) \( ( X_{i,1} + X_{i,2})/2\), (B) \( X_{i,1} + \log \{1 + \exp ( X_{i,2}) \}\), (C) constant equal to 1, (D) \(\max \{ X_{i,1} + X_{i,2} + X_{i,3},0 \} - \max \{ X_{i,4} + X_{i,5},0 \}\).

The outcome before censoring is generated as:

$$\begin{aligned} Y_i^* = \mu (\varvec{x}_i) + \tau (\varvec{x}_i) (Z_i-0.5) + \varepsilon _i \, \ \text {where} \ \varepsilon _i \sim \mathcal {N}(0,1) \end{aligned}$$

The number of sampled observations is 200. The observed outcome \(Y_i\) is censored from below at the \(15^{th}\) percentile of the generated \(Y_i^*\) values, and from above at the \(85^{th}\) percentile.

3.4 Treatment effect simulation results

All methods are evaluated in terms of Precision in Estimation of Heterogeneous Effects (PEHE), defined here as \(\frac{1}{N}\sum _{i=1}^N (\hat{\tau }(\varvec{x}_i) - \tau (\varvec{x}_i) )^2\). Interval estimates are evaluated in terms of average coverage and average length of \(95\%\) intervals.
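Both criteria are simple functions of the estimated and true CATEs; a sketch (assuming numpy, with hypothetical array names) is given below.

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Precision in Estimation of Heterogeneous Effects, as defined above."""
    return np.mean((tau_hat - tau_true) ** 2)

def interval_coverage_and_length(tau_draws, tau_true, level=0.95):
    """Average coverage and length of equal-tailed credible intervals formed from
    posterior draws of tau(x_i); tau_draws has shape (D, n)."""
    lo, hi = np.percentile(tau_draws, [100 * (1 - level) / 2, 100 * (1 + level) / 2], axis=0)
    coverage = np.mean((tau_true >= lo) & (tau_true <= hi))
    length = np.mean(hi - lo)
    return coverage, length
```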

The results are presented in Table 3. For all DGPs, at least one TOBART method attains lower PEHE than all other methods, often by a large margin. Local Linear Forests (Friedberg et al. 2020) attain similar PEHE to TOBART and TOBART-NP for Nie and Wager (2021) DGP D, which involves partly linear prognostic and treatment effect functions, although Soft TOBART is notably more accurate. The average coverage of TOBART and Soft TOBART credible intervals for \(\tau (\varvec{x}_i)\) is generally much closer to \(95\%\) than the coverage of intervals produced by competing methods. TOBART-NP produces very wide credible intervals relative to TOBART, but attains better coverage than TOBART for four DGPs.

Table 4 Data application: number of observations (n), number of covariates (p), and proportions censored from below and above
Table 5 Data application results: mean squared error of outcome predictions relative to TOBART; Brier score and AUROC for predicted probabilities of censoring; 95% posterior predictive interval average coverage and length. Average over 10 random splits into 70% training data 30% test data. Minimum MSE, minimum Brier score, maximum AUROC, and coverage values closest to 0.95 are in bold
Table 6 Fake censoring data application results: MSE of outcome and latent outcome predictions relative to TOBART; latent outcome 95% posterior predictive interval average coverage and length. Average over 5 random splits into 70% training data 30% test data. Minimum MSE, and coverage values closest to 0.95 are in bold

4 Data application

For the data application, we consider the same methods as in Sect. 3.2, excluding Gaussian Processes and adding a hurdle model that combines linear regression and probit. For each data set, we average results over 10 training-test splits. Each split is formed by taking a random sample of \(\text {floor}(0.7n)\) training observations, stratified by censoring status. Categorical variables are encoded as sets of dummy variables. The numbers of observations, covariates, and proportions of censored observations are given in Table 4. Appendix C contains data descriptions with references to original sources.
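A sketch of one such stratified split (assuming numpy; the per-stratum rounding is an implementation detail that may differ from the authors' code) is given below.

```python
import numpy as np

def stratified_split(y, a, b, train_frac=0.7, rng=None):
    """Indices for one training/test split, stratified by censoring status."""
    rng = rng or np.random.default_rng()
    y = np.asarray(y)
    censored = (y <= a) | (y >= b)
    train_idx = []
    for flag in (True, False):
        group = np.where(censored == flag)[0]
        n_train = int(np.floor(train_frac * len(group)))
        train_idx.append(rng.choice(group, size=n_train, replace=False))
    train_idx = np.concatenate(train_idx)
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
    return train_idx, test_idx
```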

4.1 Data application results

The data application results are presented in Table 5. For most data sets the results are similar across methods, particularly when methods are evaluated in terms of Brier score for predicted probabilities of censoring. Soft TOBART-NP performs best in terms of area under the receiver operating characteristics curve (AUROC). TOBART can give notably lower MSE of outcome predictions relative to other methods for some data sets. Prediction interval coverage is generally similar across methods, although prediction interval length for TOBART-based methods can be notably smaller than for standard BART.

In contrast to the simulation studies above, there is not a clear winning method in Table 5. Although censored outcome models have been applied to these data sets in previous work, other models may be more suitable for some data sets: for many data sets the combination of probit and a linear model outperforms Tobit, suggesting that zero-inflated, hurdle, or sample selection models might be more appropriate. For the data sets on which Tobit outperforms the probit-plus-linear-model combination in terms of MSE, namely Recon and Atrazine, the best method is Soft TOBART. The TOBART models also notably outperform other methods when applied to the BostonHousing and Missouri data sets.

A lesson from this study is that it is important to select the appropriate model for the data set. The TOBART and Grabit methods are designed for the same form of DGPs, so it is arguably fairer to compare these two methods. Soft TOBART produces lower MSE predictions than Grabit across almost all data sets. Nonetheless, the results are less impressive than those observed in the simulation study. Possible explanations for this include slow mixing of the TOBART Markov chain, small sample sizes for some data sets, and very small or very large proportions of censored outcomes.

Censored outcome models are intended for prediction of latent outcomes, and it is not possible to evaluate these predictions using censored outcome data alone. In order to demonstrate the usefulness of TOBART for modelling latent outcomes using real data, we artificially censor outcomes from some real data sets. The data sets summarized in Table 4 were previously studied by Kapelner and Bleich (2016) (Ozone and Ankara) and Linero and Yang (2018) (all other data sets). We introduce fake censoring from below and above at the \(15^{th}\) and \(85^{th}\) percentiles respectively, so the true values of the "censored" outcomes are known.

The results in Table 6 suggest that TOBART-based methods can produce much more accurate predictions of latent outcomes than competing methods even if differences in MSE of observed outcome predictions are relatively small. Latent outcome posterior predictive interval coverage is generally much better for TOBART than for BART. This is unsurprising, as the BART posterior predictive intervals are not designed for latent outcomes.

5 Conclusion

Type I TOBART produces accurate predictive probabilities of censoring, predictions of outcomes, and treatment effect estimates. TOBART-NP gives better uncertainty quantification for some simulated DGPs. Advantages of TOBART over competing methods include that no hyperparameter tuning is required and that the method combines straightforwardly with other variations on BART to allow for smooth DGPs and sparsity (Linero and Yang 2018).

6 Supplementary information

The online supplementary appendix contains (A) additional simulation study results, (B) additional data application results, and (C) implementation details and parameter settings.