1 Introduction

Central to any data analysis procedure is data gathering. A practical problem that typically arises during the data gathering process is censoring, which occurs when we partially observe a measurement. An example of data censoring occurs when the measured value falls outside the sensitivity range of the measurement device (e.g. a temperature sensor). Specialized inference techniques are required to address the problems that arise from censored data.

Tobit models are a popular class of censored regression models, tracing back to the work of Tobin (1958). Subsequently, Amemiya (1984) provided a detailed survey and taxonomy of the different parametric variations of Tobit approaches. These models have been adapted and applied to numerous settings. For example, Allik et al. (2016) use a parametric type I Tobit model to develop a formulation of the Kalman filter suitable for censored observations. Recent censored regression frameworks have focused more on combining censored models with flexible architectures that can capture the underlying nonlinear relationships in data. These, for example, include deep neural networks (Wu et al. 2018), random forests (Hutter et al. 2013; Li and Bradic 2020) and Gaussian process models (Ertin 2007; Groot and Lucas 2012; Chen et al. 2013; Gammelli et al. 2020a, b).

Gaussian processes (GPs) provide a fully Bayesian nonparametric approach for performing inference for nonlinear functions and have become increasingly popular in the machine learning community (MacKay 2004; Rasmussen and Williams 2006; Bishop 2009; Titsias and Lawrence 2010). Using the GP regression framework, we can derive the full Bayesian predictive density for such functions, allowing us to estimate a mean function and quantify uncertainty around the mean estimate (Snelson et al. 2004; Groot and Lucas 2012). In GP regression, point estimates for the unknown kernel function parameters are often obtained by maximizing the log marginal likelihood of the observed data or, in variational methods, a lower bound on the log marginal likelihood. However, the presence of the censored observations means that this marginal likelihood cannot be computed in closed form.

Ertin (2007) proposed a censored GP regression framework, within the context of censored wireless sensor readings, by treating the censored variable as a mixture of a binary and a Gaussian random variable and defining a GP prior over the latent function values. Ertin (2007) circumvents the analytical intractability of the posterior density and the marginal likelihood of this model by approximating the posterior density with a Laplace approximation (Bishop 2009).

Groot and Lucas (2012) then extended the censored GP regression framework to include the type I Tobit model (see Amemiya 1984). They circumvent the analytically intractable posterior density by applying expectation propagation (Minka 2001a, b) with the goal of approximating the type I Tobit likelihood terms by local likelihood factors based on unnormalized Gaussian density functions. This work has been applied to wind power forecasting (Chen et al. 2013), predicting clinical scores from neuro-imaging data (Rao et al. 2016), and modeling the demand for shared transport services while allowing for time-varying detection limits (Gammelli et al. 2020a).

Gammelli et al. (2020b) propose an extension of the work of Groot and Lucas (2012) by {1} incorporating a non-constant heteroskedastic observation model, {2} using a multi-output GP prior to exploit information from potentially correlated outputs to enable better modeling of the censored data, and {3} circumventing the analytical intractability that arises from the proposed framework by developing a variational lower bound on the log marginal likelihood which they optimize using stochastic variational inference (Hoffman et al. 2013; Blei et al. 2017).

In this article, we provide a mathematical tool that allows us to derive a closed-form variational lower bound on the log marginal likelihood of the original probabilistic model by applying variational sparse GP regression in conjunction with local variational methods. Our proposed methodology is closely related to the work of Ertin (2007) and Groot and Lucas (2012) and, similar to Gammelli et al. (2020b), relies on variational methods to perform approximate inference.

A key development in our approach is that we maximize a secondary variational lower bound on the Tobit model which relies on {1} the variational sparse GP regression framework developed by Titsias (2008, 2009) and {2} local variational methods which aim to lower bound the Tobit likelihood factors instead of approximating these factors (see Jordan et al. 1999; Nickisch and Rasmussen 2008; Bishop 2009). The use of the variational sparse GP framework results in a reduction in time complexity (Titsias 2009), thereby enabling us to perform inference on larger censored data sets that were previously intractable for GP regression models. To the best of our knowledge, such an implementation does not yet exist in the current censored Gaussian process regression literature. We demonstrate that our variational inference-based framework computationally outperforms the competing benchmarks while maintaining comparable prediction accuracy.

The remainder of the article is structured as follows. Section 2 focuses on the theoretical development of the Tobit GP regression model and Section 3 introduces the variational approximations that allow us to derive a closed-form variational lower bound that can be used for Bayesian model training and inference. In Section 4 we derive the required equations for the latent function predictive posterior density, while Section 5 demonstrates the ability of the proposed framework to learn a latent function representation from observational data subject to artificial censoring. In Section 6 we conclude with a discussion and make explicit some of the limitations associated with the proposed framework.

2 The Tobit Gaussian Process Regression Model

In this section, we briefly review the standard GP regression model and then introduce the theoretical framework for Tobit GP regression.

Suppose we have a data set consisting of pairs \( \{(x_i,y_i )\}_{i=1}^N \). We assume that each observation \( y_i \) is a noisy, independent realization of an unknown latent function \( f_i=f(x_i) \) at scalar input \( x_i \), with additive noise from a zero mean Gaussian density with unknown variance \( {\sigma _y^2} \):

$$\begin{aligned} y_i = f_i + \epsilon _i; \quad \epsilon _i \sim {\mathcal {N}}(\epsilon _i \vert 0,\sigma _y^2) \end{aligned}$$
(1)

This induces a joint Gaussian likelihood function of the form

$$\begin{aligned} p({\varvec{y}} \vert {\varvec{f}},\sigma _y) = {\mathcal {N}}({\varvec{y}} \vert {\varvec{f}},\sigma _y^2{\varvec{I}}_{N N}) \end{aligned}$$
(2)

We denote with \({\mathcal {N}}(\cdot )\) the Gaussian density function, \({\varvec{y}} \in {\mathbb {R}}^{N \times 1} \) the vector of observed data, and \({\varvec{f}} \in {\mathbb {R}}^{N \times 1} \) the vector of latent function values at the training input locations \({\varvec{x}} \in {\mathbb {R}}^{N \times 1} \). The matrix \( {\varvec{I}}_{NN} \) denotes the \( N \times N \) identity matrix. Next, we specify a zero mean GP prior with kernel function \( k(x_i,x_j) \) such that

$$\begin{aligned} f \sim {\mathcal {G}}{\mathcal {P}}(0,k(x_i,x_j)) \end{aligned}$$
(3)

For the finite set of training input locations \( {\varvec{x}}\) associated with \( {\varvec{f}}\), the GP prior reduces to a multivariate Gaussian density with the \( N \times N \) covariance matrix \({\varvec{K}}_{NN}\), constructed by evaluating the user-specified kernel function \( k(x_i,x_j) \) at the training input locations:

$$\begin{aligned} p({\varvec{f}} \vert {\varvec{\theta }}_k) = {\mathcal {N}}({\varvec{f}} \vert {\varvec{0}},{\varvec{K}}_{NN}) \end{aligned}$$
(4)

where \( {\varvec{\theta }}_k \) collectively denotes the typically unknown kernel function parameters.

Point estimates for the unknown kernel parameters \( {\varvec{\theta }}_k \) and unknown noise variance \( \sigma _y^2 \), which we collectively denote by the parameter vector \( {\varvec{\theta }}\), can be obtained by using gradient-based optimization to maximize the log marginal likelihood of the model which is given by

$$\begin{aligned} \ln {p({\varvec{y}})} = \ln {\left[ \int \limits _{{\varvec{f}}} p({\varvec{y}} \vert {\varvec{f}},\sigma _y^2)p({\varvec{f}} \vert {\varvec{\theta }}_k)d{\varvec{f}} \right] } \end{aligned}$$

For the Gaussian likelihood function in Eq. (2), the marginal likelihood of the model can be computed analytically as

$$\begin{aligned} p({\varvec{y}} \vert {\varvec{\theta }})&= {\mathcal {N}}({\varvec{y}} \vert {\varvec{0}},{\varvec{C}}_{NN}) \nonumber \\ {\varvec{C}}_{NN}&= {\varvec{K}}_{NN} + \sigma _y^2{\varvec{I}}_{NN} \end{aligned}$$
(5)

Refer to Rasmussen and Williams (2006) and Bishop (2009) for a detailed overview of the Gaussian process regression framework.
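As a concrete illustration of Eq. (5), the following minimal NumPy sketch evaluates the log marginal likelihood for a one-dimensional data set under an exponentiated quadratic kernel; the function names, the Cholesky-based implementation, and the toy data are our own choices and are not part of the original model specification.

```python
import numpy as np

def sq_exp_kernel(x1, x2, sigma_f, length):
    # exponentiated quadratic kernel evaluated between two 1-D input vectors
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * d**2 / length**2)

def gp_log_marginal_likelihood(x, y, sigma_f, length, sigma_y):
    # ln p(y | theta) = ln N(y | 0, C_NN), with C_NN = K_NN + sigma_y^2 I, Eq. (5)
    N = x.shape[0]
    C = sq_exp_kernel(x, x, sigma_f, length) + sigma_y**2 * np.eye(N)
    L = np.linalg.cholesky(C)                              # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # C^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 * ln|C|
            - 0.5 * N * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(6.0 * x) + 0.3 * rng.standard_normal(30)
print(gp_log_marginal_likelihood(x, y, sigma_f=1.0, length=0.2, sigma_y=0.3))
```

In a standard GP workflow this quantity (or its gradient) would be maximized with respect to the kernel and noise parameters.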

The Tobit Gaussian process regression model can be thought of as an extended version of the standard GP regression model as applied to censored observational data. For censored data, the standard GP regression likelihood function (see Eq. (2)) is no longer valid due to limitations that arise from our measurement sensitivity range.

Suppose that the detection limits for the measurement of interest are known in advance and constant with respect to time. When we observe that \( y_i = l_b \), where \( l_b \) corresponds to the lower detection limit, we only know an upper bound on the corresponding observation for \( y_i \), i.e., \( y_i \in (-\infty ,l_b]\), rendering the Gaussian assumption inappropriate (Groot and Lucas 2012).

To account for the limitation associated with the sensitivity range, we alter the way we construct our likelihood function. In latent function regions where we observe data, we retain the base GP architecture as outlined by Eqs. (1) to (2). However, in latent function regions where, for example, the measurement instrument/analysis procedure transforms (or reports) the data as the corresponding censored detection limit, we ask ourselves the following additional question:

What is the probability that the data, i.e., the random variable \( Y_i \) that is associated with marginal density \( p(y_i \vert f_i) \) , falls either (scenario 1) above the upper detection limit \( u_b \) or (scenario 2) below the lower detection limit \( l_b \) ?

In other words, when we consider the marginal density associated with the random variable \( Y_i \), we want to answer one of the following two questions (depending on which censoring scenario applies)

$$\begin{aligned} {\mathbb {P}}(Y_i\ge u_b) = 1 - \int \limits _{-\infty }^{u_b} p(y_i \vert f_i)d y_i \end{aligned}$$
(6)
$$\begin{aligned} {\mathbb {P}}(Y_i\le l_b) = \int \limits _{-\infty }^{l_b} p(y_i \vert f_i)d y_i \end{aligned}$$
(7)

Note that \( {\mathbb {P}}(\cdot ) \) denotes the probability value whereas \( p(\cdot ) \) denotes the probability density function, associated with the random variable \( Y_i \), which we derive from Eq. (1) as

$$\begin{aligned} p(y_i \vert f_i) = {\mathcal {N}}(y_i \vert f_i,\sigma _y^2) \end{aligned}$$
(8)

From Eqs. (6), (7) and (8) we can construct a piece-wise defined likelihood, i.e., a mixed-likelihood, which we will denote with the symbol \( p_o(\cdot ) \), that accounts for data censoring as follows

$$\begin{aligned} p_o(y_i \vert f_i) = {\left\{ \begin{array}{ll} \Phi (l_b \vert f_i,\sigma _y^2)&{} {\text { if }}\; y_i = l_b \\ {\mathcal {N}}(y_i \vert f_i,\sigma _y^2)&{} {\text { if }}\; l_b< y_i < u_b \\ 1 - \Phi (u_b \vert f_i,\sigma _y^2) &{}{\text { if }}\; y_i = u_b \end{array}\right. } \end{aligned}$$
(9)

We denote with \( \Phi (\cdot ) \) the Gaussian cumulative distribution function (cdf). Furthermore, note that we implicitly assumed that the latent function is corrupted by noise and that the noise-corrupted data value is then censored and reported (Groot and Lucas 2012). For notational convenience we use \( \Phi (u_b \vert f_i,\sigma _y^2) \) to imply \( \Phi (\frac{u_b-f_i}{\sigma _y}) \).
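For illustration only, the following small NumPy/SciPy sketch evaluates the per-observation log of the mixed-likelihood \( p_o(y_i \vert f_i) \) in Eq. (9); the function name and the convention of flagging censored entries by comparing the reported value with the detection limits are assumptions on our part.

```python
import numpy as np
from scipy.stats import norm

def tobit_log_likelihood(y, f, sigma_y, l_b, u_b):
    # per-observation log of the mixed likelihood p_o(y_i | f_i), Eq. (9)
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    out = np.empty_like(y)
    low = y <= l_b                      # reported at the lower detection limit
    high = y >= u_b                     # reported at the upper detection limit
    mid = ~(low | high)                 # ordinary noisy observations
    out[low] = norm.logcdf((l_b - f[low]) / sigma_y)      # log Phi(l_b | f_i, sigma_y^2)
    out[high] = norm.logsf((u_b - f[high]) / sigma_y)     # log[1 - Phi(u_b | f_i, sigma_y^2)]
    out[mid] = norm.logpdf(y[mid], loc=f[mid], scale=sigma_y)
    return out
```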

Gammelli et al. (2020b) draw an interesting connection between heteroskedastic regression and censored observation models. The authors provide a qualitative motivation for the use of input-dependent noise models and show, using both simulations and real-world data sets, how heteroskedasticity can allow one to more accurately model the censored observations associated with Tobit-based likelihood functions.

As noted by Gammelli et al. (2020b), the likelihood variance parameter \( \sigma _y^2 \) directly controls the slope of the Gaussian cdf factors (see Eq. (9)) and would enforce the same amount of overestimation for all of the censored observations (refer to Appendix A, Section A.1). However, the amount of overestimation can be regulated with a heteroskedastic parameterization for the variance. This allows the Tobit model to automatically tune the amount of overestimation, resulting in improved predictive performance. Consequently, we augment each Gaussian cdf factor in Eq. (9) with an additional variance parameter and construct an adjusted mixed-likelihood, which we will denote with the symbol \( p_m(\cdot ) \), that assigns the following probability or density contribution to each observation, conditioned on the training input location

$$\begin{aligned} p_m(y_i \vert f_i) = {\left\{ \begin{array}{ll} \Phi (l_b \vert f_i,\sigma _y^2 + \sigma _{l_b}^2)&{} {\text { if }} y_i = l_b \\ {\mathcal {N}}(y_i \vert f_i,\sigma _y^2)&{} {\text { if }}\; l_b< y_i < u_b \\ 1 - \Phi (u_b \vert f_i,\sigma _y^2 + \sigma _{u_b}^2)&{} {\text { if }} y_i = u_b \end{array}\right. } \end{aligned}$$
(10)

Note that for training input locations associated with the lower detection limit \( l_b \) we assume a constant (with respect to the input \( x_i \)) heteroskedastic noise model with a total variance contribution which is the sum of the original mixed-likelihood variance in Eq. (9) and a regulating variance parameter. A similar argument holds for the upper detection limit \( u_b \) (refer to Appendix A Section A.2 for more details). Note that the variance parameter for the uncensored observations remains the same as in Eq. (9). Given a censored data set with a total of N entries, and assuming independence, we can construct our mixed-likelihood function as follows

$$\begin{aligned} \prod _{i=1}^{N} p_m(y_i \vert f_i)&= \prod _{y_i=l_b}[1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)] \nonumber \\&\quad \times \prod _{l_b<y_i<u_b} {\mathcal {N}}(y_i \vert f_i,\sigma _y^2)\prod _{y_i=u_b}\Phi (f_i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2) \end{aligned}$$
(11)

We arrived at Eq. (11) by using Eq. (10) and the Gaussian cdf property \( \Phi (y \vert x,\sigma ^2)=1-\Phi (x \vert y,\sigma ^2) \) (see Pishro-Nik 2014). Note that Eq. (11) is known as the Tobit likelihood function, or the type I Tobit model, and comprises a mixture of Gaussian density and Gaussian cdf likelihood terms (see Amemiya 1984; Groot and Lucas 2012). From here on we drop the explicit dependence on the model parameters and, with a slight abuse of notation, adopt the following definition for notational convenience

$$\begin{aligned} p_m({\varvec{y}} \vert {\varvec{f}}) = \prod _{i=1}^{N} p_m(y_i \vert f_i) \end{aligned}$$
(12)

Inference about the latent function proceeds along the same line as for the standard GP regression model. We start with our mixed-likelihood function, as given by Eq. (12), and define a zero mean GP prior over \( {\varvec{f}} \) with kernel function \( k(x_i,x_j) \) (see Eqs. (3) and (4)). Using Bayes’ rule, we can compute a posterior density as follows

$$\begin{aligned} p({\varvec{f}} \vert {\varvec{y}}) = \frac{p_m({\varvec{y}} \vert {\varvec{f}})p({\varvec{f}})}{p({\varvec{y}})} \end{aligned}$$
(13)

Regardless of whether the factor \( p_m(y_i \vert f_i) \) or \( p_o(y_i \vert f_i) \) is used in the mixed-likelihood function, we refer to Eq. (13) as the Tobit Gaussian process regression (T-GPR) model. The marginal likelihood of the censored data set is given by

$$\begin{aligned} p({\varvec{y}}) = \int \limits _{{\varvec{f}}} p_m({\varvec{y}} \vert {\varvec{f}})p({\varvec{f}})d{\varvec{f}} \end{aligned}$$
(14)

However, unlike the standard GP regression model, the marginal likelihood given by Eq. (14) is analytically intractable due to the mixture of Gaussian density and Gaussian cdf likelihood terms (Ertin 2007; Groot and Lucas 2012).

3 Lower Bounding the Log Marginal Likelihood

Next, we propose circumventing this analytical intractability by adopting a variational inference-based framework, which will allow us to compute an analytically tractable lower bound on the true log marginal likelihood of the original probabilistic model given by Eq. (13). This new lower bound can then be used to perform Bayesian model training and inference.

3.1 Applying the Variational Sparse Gaussian Process Framework

The application of the standard GP regression model to large data sets has always been challenging due to the need to invert the \( N \times N \) covariance matrix \( {\varvec{C}}_{NN} \) (see Eq. (5)), an operation whose time complexity scales as \( {\mathcal {O}}(N^3) \), where N is the number of data entries. For large data sets, the (numerical) inversion becomes prohibitively slow, rendering the standard GP regression model computationally intractable.

Consequently, practitioners have resorted to using approximate or sparse methodologies to address the limitations associated with the (numerical) inversion process. Research has primarily focused on sparse GP methodologies in which a smaller set of M function points is used as support/inducing variables. For example, see the work of Csató and Opper (2002), Seeger et al. (2003) and Snelson and Ghahramani (2005). For a detailed and unifying view of sparse approximate GP regression, refer to the work of Quiñonero-Candela and Rasmussen (2005).

The variational sparse GP regression framework proposed by Titsias (2008, 2009) has sparked significant interest. The proposed methodology, with time complexity that scales as \( {\mathcal {O}}(NM^2) \), allows practitioners to circumvent the computational demands associated with inverting the required covariance matrix while also offering a formulation whereby practitioners can maximize a variational lower bound to select the inducing variable input locations and the model hyperparameters. Although the variational sparse GP regression framework was originally proposed for computational speedups, the framework has also been used as a mathematical tool to induce an analytically tractable lower bound for various state-of-the-art probabilistic models such as {1} the (B)GP-LVM (Titsias and Lawrence 2010; Damianou et al. 2016) and {2} deep Gaussian processes (Damianou and Lawrence 2013).

We adopt the variational sparse GP regression framework developed by Titsias (2008, 2009) for the following reasons: {1} we exploit the sparse framework for its original purpose, namely computational speedups on large data sets, and {2} the sparse framework allows us to derive a variational lower bound on the log marginal likelihood of the T-GPR model in Eq. (13); this bound is still analytically intractable, and we induce tractability by exploiting local variational methods, which results in a framework that can be used for model training and inference. Note that the variational lower bound of our proposed framework can also be used as a stepping stone to gain access to the (B)GP-LVM (see Titsias and Lawrence 2010; Damianou et al. 2016) as applied to censored observational data. It is worth pointing out that the standard GP latent variable model (Lawrence 2004), as well as its (B)GP-LVM counterpart (Titsias and Lawrence 2010), are typically applied in the context of uncensored observational data. See, for example, the applications in Urtasun et al. (2006), Lawrence (2007), Wang et al. (2008), Titsias and Lawrence (2010), Campbell and Yau (2015) and Zhang et al. (2017). However, we have yet to find any sparse GP inducing variable-based or (B)GP-LVM frameworks that explicitly incorporate the type I Tobit likelihood function to account for censoring in regression settings. The closest related literature we could find stems from the survival analysis branch of statistics and includes the work of Barrett and Coolen (2016), Saul et al. (2016), and Alaa and van der Schaar (2017). Another related approach is the work of Lázaro-Gredilla (2012), who applied the Bayesian warped GP framework to censored data without explicitly accounting for the censoring mechanism in the likelihood function.

Our proposed approach draws inspiration from the work of Saul et al. (2016), which itself builds on the ideas of Hensman et al. (2013) and Hensman et al. (2015). However, instead of resorting to numerical integration to address the intractability which arises from the non-Gaussian likelihood function (which is the type I Tobit likelihood function in our case), we exploit local variational methods (see Section 3.2).

In principle, the variational sparse GP regression framework developed by Titsias (2008, 2009) aims to minimize, in the Kullback-Leibler (\( {\mathcal {K}}{\mathcal {L}} \)) divergence sense, the dissimilarity between the approximate posterior and exact posterior density. Within the context of the Tobit GP regression model in Eq. (13), we start by augmenting the prior density with inducing variables \( {\varvec{u}} \) such that

$$\begin{aligned} p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}}) = \frac{p_{m}({\varvec{y}} \vert {\varvec{f}})p({\varvec{f}} \vert {\varvec{u}})p({\varvec{u}})}{p({\varvec{y}})} \end{aligned}$$
(15)

Note that Eq. (15) is equivalent to the original T-GPR model since we can recover Eq. (13) by marginalizing out the inducing variables \( {\varvec{u}} \). However, the reason we allow for the augmented inducing variables \( {\varvec{u}} \) stems from the fact that these variables allow us to produce analytically tractable (and computationally efficient) approximations. Our goal is to minimize the \( {\mathcal {K}}{\mathcal {L}} \)-divergence given by

$$\begin{aligned} {\mathcal {K}}{\mathcal {L}}[q({\varvec{f}},{\varvec{u}}) \vert \vert p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}})] = \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}}q({\varvec{f}},{\varvec{u}})\ln \frac{q({\varvec{f}},{\varvec{u}})}{p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}})}d{\varvec{f}}d{\varvec{u}} \end{aligned}$$
(16)

We expand Eq. (16) by using Eq. (15) to obtain

$$\begin{aligned} \ln p({\varvec{y}})&= {\mathcal {K}}{\mathcal {L}}[q({\varvec{f}},{\varvec{u}}) \vert \vert p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}})] \\&\quad + \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}}q({\varvec{f}},{\varvec{u}})\ln \frac{p_{m}({\varvec{y}} \vert {\varvec{f}})p({\varvec{f}} \vert {\varvec{u}})p({\varvec{u}})}{q({\varvec{f}},{\varvec{u}})}d{\varvec{f}}d{\varvec{u}} \end{aligned}$$

Next, we recall that the \( {\mathcal {K}}{\mathcal {L}} \)-divergence satisfies Gibbs' inequality (MacKay 2004), i.e.,

$$\begin{aligned} {\mathcal {K}}{\mathcal {L}}[q({\varvec{f}},{\varvec{u}}) \vert \vert p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}})] \ge 0 \end{aligned}$$

Therefore, we conclude that

$$\begin{aligned} \ln p({\varvec{y}}) \ge {\mathcal {F}}[q({\varvec{f}},{\varvec{u}})] \end{aligned}$$
(17)

The quantity \( {\mathcal {F}}[q({\varvec{f}},{\varvec{u}})] \) is given by

$$\begin{aligned} {\mathcal {F}}[q({\varvec{f}},{\varvec{u}})] = \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}}q({\varvec{f}},{\varvec{u}})\ln \frac{p_{m}({\varvec{y}} \vert {\varvec{f}})p({\varvec{f}} \vert {\varvec{u}})p({\varvec{u}})}{q({\varvec{f}},{\varvec{u}})}d{\varvec{f}}d{\varvec{u}} \end{aligned}$$
(18)

We refer to the quantity in Eq. (18) as the variational lower bound. Other common names for this bound include the evidence lower bound (ELBO; see Blei et al. 2017) and the variational free energy (MacKay 2004). Next, we note that maximizing the variational lower bound given by Eq. (18) is equivalent to minimizing the \( {\mathcal {K}}{\mathcal {L}} \)-divergence in Eq. (16). Following Titsias (2009), we select the following approximating variational posterior density

$$\begin{aligned} q({\varvec{f}},{\varvec{u}}) = p({\varvec{f}} \vert {\varvec{u}})q({\varvec{u}}) \end{aligned}$$
(19)

From Eq. (19) we see that under the selected variational approximation, the only free-form density we can optimize for is \( q({\varvec{u}}) \) since \( p({\varvec{f}} \vert {\varvec{u}}) \) corresponds to the conditional GP prior density under the augmented probability model (for further details, see Titsias 2009). Furthermore, since \( p({\varvec{f}} \vert {\varvec{u}}) \) does not have an explicit dependence on the data \( {\varvec{y}} \), the only way for the data \( {\varvec{y}} \) to affect \( {\varvec{f}} \) is through the inducing variables \( {\varvec{u}} \), i.e., \( {\varvec{u}} \) acts as a summary statistic, which is how we build sparsity into the model since \( M \ll N \) (see Bui and Turner 2014). The symbol M denotes the number of user-specified inducing variables.

With Eq. (19) we can simplify Eq. (18) to obtain the following variational lower bound

$$\begin{aligned} {\mathcal {F}}[q({\varvec{u}})] = \int \limits _{{\varvec{u}}}q({\varvec{u}})\left[ \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_m({\varvec{y}} \vert {\varvec{f}})d{\varvec{f}} + \ln \frac{p({\varvec{u}})}{q({\varvec{u}})}\right] d{\varvec{u}} \end{aligned}$$
(20)

However, we note that Eq. (20) contains the following analytically intractable expectation

$$\begin{aligned} {\mathbb {E}}_{p({\varvec{f}} \vert {\varvec{u}})}[\ln p_m({\varvec{y}} \vert {\varvec{f}})] = \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_m({\varvec{y}} \vert {\varvec{f}})d{\varvec{f}} \end{aligned}$$
(21)

The analytical intractability (see Eqs. (22) and (23) below) arises from the presence of the Gaussian cdf factors in the likelihood function. We note that

$$\begin{aligned} {\mathbb {E}}_{p({\varvec{f}} \vert {\varvec{u}})}[\ln p_m({\varvec{y}} \vert {\varvec{f}})] = \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln \prod _{i = 1}^{N}p_m(y_i \vert f_i)d{\varvec{f}} \end{aligned}$$
(22)

From Eqs. (11) and (22) we have that

$$\begin{aligned} {\mathbb {E}}_{p({\varvec{f}} \vert {\varvec{u}})}[\ln p_m({\varvec{y}} \vert {\varvec{f}})]&= \int \limits _{{{\varvec{f}}}_{l_b}} p({{\varvec{f}}}_{l_b} \vert {\varvec{u}})\ln {\left\{ \prod _{y_i=l_b}[1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)]\right\} }d{{\varvec{f}}}_{l_b}\nonumber \\&\quad + \int \limits _{{{\varvec{f}}}_{u_n}}p({{\varvec{f}}}_{u_n} \vert {\varvec{u}})\ln {\left\{ \prod _{l_b<y_i<u_b} {\mathcal {N}}(y_i \vert f_i,\sigma _y^2)\right\} }d{{\varvec{f}}}_{u_n}\nonumber \\&\quad + \int \limits _{{{\varvec{f}}}_{u_b}}p({{\varvec{f}}}_{u_b} \vert {\varvec{u}})\ln {\left\{ \prod _{y_i=u_b}\Phi (f_i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2)\right\} }d{{\varvec{f}}}_{u_b} \end{aligned}$$
(23)

Note that we used the marginalization property of the multivariate Gaussian density to arrive at Eq. (23). We denote with symbol \( {{\varvec{f}}}_{l_b} \) the vector of latent function values associated with the lower bound \( l_b \) censored observations. A similar argument holds for the latent function vector \( {{\varvec{f}}}_{u_b} \). Symbol \( {{\varvec{f}}}_{u_n} \) denotes the vector associated with the uncensored observations.

3.2 Local Variational Methods: Lower Bounding the Censored Variables

We circumvent the analytical intractability in Eq. (23) by implementing an alternative ‘local’ lower bounding strategy that shares similarities with the variational framework we have been working with. The variational inference framework we have been considering, within the context of the work of Titsias (2008, 2009), and in general, can be interpreted as a ‘global’ method in the sense that we directly seek an approximation to the entire posterior density over all the model random variables of interest. ‘Local’ variational methods provide an alternative approach and involve finding local bounds (either upper or lower) on functions over individual or groups of variables within the model (Gibbs and MacKay 2000; Bishop 2009).

From Eq. (23) we see that the functions of interest, i.e., the functions that result in the expectation being analytically intractable, correspond to the Gaussian cdf likelihood factors. If we can construct local lower bounds for each Gaussian cdf factor present in Eq. (23), we can use the corresponding local lower bounds, in conjunction with Eq. (20), to develop a secondary variational lower bound on the log marginal likelihood, which we can use for Bayesian model training and inference about the latent function of interest. Following the approach outlined in Nickisch and Rasmussen (2008), we propose the following quadratic local lower bound on each Gaussian cdf likelihood factor in the logarithmic domain. Here we provide an example for the censored variables associated with \( l_b \).

$$\begin{aligned} \ln {[1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)]} \ge \frac{1}{\sigma _y^2 + \sigma _{l_b}^2}\left[ -\frac{1}{2}f_i^2 + b_i(f_i - l_b) + c_i\right] \end{aligned}$$
(24)

We compute the required local likelihood lower bound parameters \( b_i \) and \( c_i \) by requiring that, at some arbitrary (and freely optimizable variational) point \( \zeta _i \), the following conditions must hold

$$\begin{aligned} \left. \ln {[1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)]}\right| _{f_i = \zeta _i}&= \left. \frac{1}{\sigma _y^2 + \sigma _{l_b}^2}\left[ -\frac{1}{2}f_i^2 + b_i(f_i - l_b) + c_i\right] \right| _{f_i = \zeta _i} \nonumber \\ \left. \frac{d}{df_i}\left( \ln {[1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)]}\right) \right| _{f_i = \zeta _i}&= \left. \frac{d}{df_i}\left( \frac{1}{\sigma _y^2 + \sigma _{l_b}^2}\left[ -\frac{1}{2}f_i^2 + b_i(f_i - l_b) + c_i\right] \right) \right| _{f_i = \zeta _i} \end{aligned}$$
(25)

Using Eqs. (24) to (25) we can show that

Fig. 1

The black curve depicts the Gaussian cdf factor associated with the upper detection limit \( u_b \), viewed as a function of f, together with the local likelihood lower bound (depicted in blue, see Eqs. (29) to (30)) for various values of the freely optimizable variational parameter \( \zeta \) (blue cross). The black dot represents the Gaussian cdf factor output at the latent function test point (\( f_t = 2 \)). We see that by adjusting the parameter \( \zeta \) we can control the quality of the local likelihood lower bound output (blue dot). We also note that the local lower bound output becomes tight, i.e., exact, when \( \zeta = f_t \). (Color figure online)

$$\begin{aligned} 1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2) \ge \exp {\left\{ \frac{1}{\sigma _y^2 + \sigma _{l_b}^2}\left[ -\frac{1}{2}f_i^2 + b_i(f_i - l_b) + c_i\right] \right\} } \end{aligned}$$
(26)
$$\begin{aligned} b_i = \zeta _i - (\sigma _y^2 + \sigma _{l_b}^2)\frac{{\mathcal {N}}(\zeta _i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)}{1 - \Phi (\zeta _i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)} \end{aligned}$$
(27)
$$\begin{aligned} c_i = \frac{1}{2}\zeta _i^2 - b_i(\zeta _i - l_b) + (\sigma _y^2 + \sigma _{l_b}^2)\ln {[1 - \Phi (\zeta _i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)]} \end{aligned}$$
(28)
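To make the construction concrete, the sketch below (with our own function names and arbitrary test values) evaluates \( b_i \) and \( c_i \) from Eqs. (27) and (28) for the lower-detection-limit factor and numerically verifies that the quadratic bound in Eq. (24) is tight at \( \zeta _i \) and never exceeds the exact log factor.

```python
import numpy as np
from scipy.stats import norm

def lower_bound_params(zeta, l_b, s2):
    # b_i and c_i of Eqs. (27)-(28); s2 = sigma_y^2 + sigma_lb^2
    s = np.sqrt(s2)
    z = (zeta - l_b) / s
    dens = norm.pdf(z) / s                 # N(zeta | l_b, s2)
    tail = norm.sf(z)                      # 1 - Phi(zeta | l_b, s2)
    b = zeta - s2 * dens / tail
    c = 0.5 * zeta**2 - b * (zeta - l_b) + s2 * norm.logsf(z)
    return b, c

def local_lower_bound(f, zeta, l_b, s2):
    # right-hand side of Eq. (24), evaluated at latent value(s) f
    b, c = lower_bound_params(zeta, l_b, s2)
    return (-0.5 * f**2 + b * (f - l_b) + c) / s2

# sanity check: the bound is tight at f = zeta and never exceeds the exact term
f_grid = np.linspace(-3.0, 3.0, 61)
exact = norm.logsf((f_grid - 0.0) / 1.0)   # ln[1 - Phi(f | l_b = 0, s2 = 1)]
bound = local_lower_bound(f_grid, zeta=0.5, l_b=0.0, s2=1.0)
assert np.all(bound <= exact + 1e-9)
assert np.isclose(local_lower_bound(0.5, 0.5, 0.0, 1.0), norm.logsf(0.5))
```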

Note that a similar argument holds for the Gaussian cdf factor associated with \( \Phi (f_i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2) \). We can show that

$$\begin{aligned} \Phi (f_i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2) \ge \exp {\left\{ \frac{1}{\sigma _y^2 + \sigma _{u_b}^2}\left[ -\frac{1}{2}f_i^2 + b_i(f_i - u_b) + c_i\right] \right\} } \end{aligned}$$
(29)
$$\begin{aligned} b_i&= \zeta _i + (\sigma _y^2 + \sigma _{u_b}^2)\frac{{\mathcal {N}}(\zeta _i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2)}{\Phi (\zeta _i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2)} \nonumber \\ c_i&= \frac{1}{2}\zeta _i^2 - b_i(\zeta _i - u_b) + (\sigma _y^2 + \sigma _{u_b}^2)\ln {\Phi (\zeta _i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2)} \end{aligned}$$
(30)

Refer to Fig. 1 for an illustration of the proposed local likelihood lower bound approach as applied to the Gaussian cdf factor associated with the upper detection limit \( u_b \) (see Eqs. (29) to (30)). Observe that the local lower bound parameters \( b_i \) and \( c_i \) only depend on the freely optimizable parameter \( \zeta _i \). In other words, we can merely adjust the parameter \( \zeta _i \) to improve the quality of the local lower bound. Next, we observe from Eqs. (11) and (26) to (30) that

$$\begin{aligned} \prod _{y_i=l_b}[1 - \Phi (f_i \vert l_b,\sigma _y^2 + \sigma _{l_b}^2)] \ge g({{\varvec{f}}}_{l_b} \vert {\varvec{\zeta }}_{l_b},l_b,\sigma _y^2,\sigma _{l_b}^2) \end{aligned}$$
(31)
$$\begin{aligned} \prod _{y_i=u_b}\Phi (f_i \vert u_b,\sigma _y^2 + \sigma _{u_b}^2) \ge g({{\varvec{f}}}_{u_b} \vert {\varvec{\zeta }}_{u_b},u_b,\sigma _y^2,\sigma _{u_b}^2) \end{aligned}$$
(32)

We note that

$$\begin{aligned} g({{\varvec{f}}}_{l_b} \vert {\varvec{\zeta }}_{l_b},l_b,\sigma _y^2,\sigma _{l_b}^2) = \exp \left\{ \frac{1}{\sigma _y^2 + \sigma _{l_b}^2} \left[ -\frac{1}{2}{{{\varvec{f}}}_{l_b}^T}{{\varvec{f}}}_{l_b} + {{{\varvec{b}}}_{l_b}^T}({{\varvec{f}}}_{l_b} - l_b{\varvec{1}}_{l_b}) + {{{\varvec{c}}}_{l_b}^T}{\varvec{1}}_{l_b} \right] \right\} \end{aligned}$$
(33)
$$\begin{aligned} g({{\varvec{f}}}_{u_b} \vert {\varvec{\zeta }}_{u_b},u_b,\sigma _y^2,\sigma _{u_b}^2) = \exp \left\{ \frac{1}{\sigma _y^2 + \sigma _{u_b}^2} \left[ -\frac{1}{2}{{{\varvec{f}}}_{u_b}^T}{{\varvec{f}}}_{u_b} + {{{\varvec{b}}}_{u_b}^T}({{\varvec{f}}}_{u_b} - u_b{\varvec{1}}_{u_b}) + {{{\varvec{c}}}_{u_b}^T}{\varvec{1}}_{u_b} \right] \right\} \end{aligned}$$
(34)

We denote with \( N_{l_b} \) the number of censored lower bound observations. The \( N_{l_b}\times 1 \) vectors \( {\varvec{b}}_{l_b} \) and \( {\varvec{c}}_{l_b} \) collect the element-wise entries, as calculated using Eqs. (27) and (28), for each element of the vector \( {\varvec{f}}_{l_b} \) (each of which is associated with a freely optimizable variational parameter \( \zeta _i \), which we collectively denote by the \( N_{l_b}\times 1 \) vector \( \varvec{\zeta }_{l_b} \)). The symbol \( {\varvec{1}}_{l_b} \) denotes the \( N_{l_b}\times 1 \) vector of ones. A similar argument holds for \( {\varvec{f}}_{u_b} \). Next, from Eqs. (21), (31) and (32) we can show that

$$\begin{aligned} \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_m({\varvec{y}} \vert {\varvec{f}})d{\varvec{f}} \ge \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_l({\varvec{y}}\vert {\varvec{f}})d{\varvec{f}} \end{aligned}$$
(35)

We denote with \( p_l({\varvec{y}} \vert {\varvec{f}}) \) the following

$$\begin{aligned} p_l({\varvec{y}} \vert {\varvec{f}})&= g({{\varvec{f}}}_{l_b} \vert {\varvec{\zeta }}_{l_b},l_b,\sigma _y^2,\sigma _{l_b}^2) \nonumber \\&\quad \times \left[ \prod _{l_b<y_i<u_b} {\mathcal {N}}(y_i \vert f_i,\sigma _y^2)\right] g({{\varvec{f}}}_{u_b} \vert {\varvec{\zeta }}_{u_b},u_b,\sigma _y^2,\sigma _{u_b}^2) \end{aligned}$$
(36)
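To see why this restores tractability, consider a single lower-limit factor and write the marginal of the conditional GP prior as \( p(f_i \vert {\varvec{u}}) = {\mathcal {N}}(f_i \vert \mu _i, s_i) \) (notation introduced here for illustration only). The required expectation of the quadratic exponent is then available in closed form, since \( {\mathbb {E}}[f_i^2] = \mu _i^2 + s_i \):

$$\begin{aligned} {\mathbb {E}}_{{\mathcal {N}}(f_i \vert \mu _i,s_i)}\left[ -\frac{1}{2}f_i^2 + b_i(f_i - l_b) + c_i\right] = -\frac{1}{2}\left( \mu _i^2 + s_i\right) + b_i\left( \mu _i - l_b\right) + c_i \end{aligned}$$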

Observe that, by the design of our local likelihood lower bounds, \( \ln {g({{\varvec{f}}}_{l_b} \vert {\varvec{\zeta }}_{l_b},l_b,\sigma _y^2,\sigma _{l_b}^2)} \) and \( \ln {g({{\varvec{f}}}_{u_b} \vert {\varvec{\zeta }}_{u_b},u_b,\sigma _y^2, \sigma _{u_b}^2)} \) are quadratic in the latent function values. Consequently, we can analytically evaluate each Gaussian expectation on the right-hand side of the inequality in Eq. (35), circumventing the original analytical intractability that arose in Eq. (21) as a result of the presence of the Gaussian cdf likelihood factors. Using Eqs. (17) to (20) and (35) we also observe that

$$\begin{aligned} \ln {p({\varvec{y}})} \ge {\mathcal {F}}[q({\varvec{u}})] \ge \int \limits _{{\varvec{u}}}q({\varvec{u}})\left[ \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_l({\varvec{y}} \vert {\varvec{f}})d{\varvec{f}} + \ln \frac{p({\varvec{u}})}{q({\varvec{u}})}\right] d{\varvec{u}} \end{aligned}$$
(37)

From Eq. (37) we see that by lower bounding each Gaussian cdf factor we have implicitly developed a secondary variational lower bound to the original lower bound \( {\mathcal {F}}[q({\varvec{u}})] \) (see Eq. (20)) stemming from the Kullback–Leibler divergence framework (which is itself a lower bound to the log marginal likelihood of the original probabilistic model). We denote our secondary variational lower bound as follows

$$\begin{aligned} {\mathcal {F}}^*[q({\varvec{u}})] = \int \limits _{{\varvec{u}}}q({\varvec{u}})\left[ \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_l({\varvec{y}} \vert {\varvec{f}})d{\varvec{f}} + \ln \frac{p({\varvec{u}})}{q({\varvec{u}})}\right] d{\varvec{u}} \end{aligned}$$
(38)

3.3 Deriving the Optimal q(u) Density and the ‘Collapsed’ Lower Bound

Next, we analytically maximize our secondary lower bound in Eq. (38) and note that we have the following integral constraint

$$\begin{aligned} \int \limits _{{\varvec{u}}}q({\varvec{u}})d{\varvec{u}} = 1 \end{aligned}$$
(39)

We construct our Lagrangian, subject to the integral constraint in Eq. (39), as follows (for more details, see Logan 2006)

$$\begin{aligned} {\mathcal {L}}[q({\varvec{u}}),\lambda ] = q({\varvec{u}})\left[ \Psi ({{\varvec{u}}}) + \ln \frac{p({\varvec{u}})}{q({\varvec{u}})} \right] + \lambda q({\varvec{u}}) \end{aligned}$$
(40)

We denote with symbol \( \lambda \) the Lagrange multiplier. Furthermore, we define \( \Psi ({{\varvec{u}}}) \) as follows

$$\begin{aligned} \Psi ({{\varvec{u}}}) = \int \limits _{{\varvec{f}}}p({\varvec{f}} \vert {\varvec{u}})\ln p_l({\varvec{y}} \vert {\varvec{f}})d{\varvec{f}} \end{aligned}$$

According to the Euler–Lagrange equation, the optimal density \( q({\varvec{u}}) \) satisfies the stationarity condition

$$\begin{aligned} \frac{\partial {{\mathcal {L}}[q({\varvec{u}}),\lambda ]}}{\partial {q({\varvec{u}})}} = 0 \end{aligned}$$
(41)

From Eqs. (40) and (41) we can show that the optimal \( q({\varvec{u}}) \) corresponds to

$$\begin{aligned} q({\varvec{u}}) = \frac{p({\varvec{u}})\exp {\{\Psi ({{\varvec{u}})}\}}}{\int \limits _{{\varvec{u}}}p({\varvec{u}})\exp {\{\Psi ({{\varvec{u}})}\}}d{\varvec{u}}} \end{aligned}$$
(42)

We back-substitute Eq. (42) into Eq. (38) to derive the corresponding optimal ‘collapsed’ secondary lower bound as

$$\begin{aligned} {\mathcal {F}}^*(\varvec{\theta }) = \ln {\int \limits _{{\varvec{u}}}p({\varvec{u}})\exp {\{\Psi ({{\varvec{u}})}\}}d{\varvec{u}}} \end{aligned}$$
(43)

Note that after marginalizing over the inducing variables \( {\varvec{u}} \), the resulting ‘collapsed’ secondary lower bound depends on the remaining model parameters, which we collectively denote by the parameter vector \( \varvec{\theta } \). From Eq. (42) we can analytically derive the optimal \( q({\varvec{u}}) \) and show that the density corresponds to a multivariate Gaussian parameterized by

$$\begin{aligned} q({\varvec{u}})&= {\mathcal {N}}({\varvec{u}} \vert {\varvec{\mu }}_u,{{\varvec{S}}}_u) \nonumber \\ {\varvec{\mu }}_u&= {\varvec{K}}_{MM}{\varvec{Q}}^{-1}{\varvec{K}}_{MN}^{l}\varvec{\Sigma }_{y_l}^{-1}{\varvec{y}}_l \nonumber \\ {{\varvec{S}}}_u&= {\varvec{K}}_{MM}{\varvec{Q}}^{-1}{\varvec{K}}_{MM} \nonumber \\ {\varvec{Q}}&= {\varvec{K}}_{MM} + {\varvec{K}}_{MN}^{l}\varvec{\Sigma }_{y_l}^{-1}{\varvec{K}}_{NM}^{l} \end{aligned}$$
(44)

The matrix \( {\varvec{K}}_{MM} \), which stems from the augmented probability model in Eq. (15), requires evaluating the user-specified kernel function between the freely optimizable inducing input locations. Furthermore, we note that

$$\begin{aligned} {\varvec{y}}_l&= \begin{bmatrix} {\varvec{b}}_{l_b}\\ {\varvec{y}}_{o}\\ {\varvec{b}}_{u_b}\end{bmatrix}; \quad {\varvec{K}}_{NM}^{l} = \begin{bmatrix} {\varvec{K}}_{{N_{l_b}}M}\\ {\varvec{K}}_{{N_{y_o}}M}\\ {\varvec{K}}_{{N_{u_b}}M}\end{bmatrix} \nonumber \\ \varvec{\Sigma }_{y_l}&= \textrm{diag} \begin{bmatrix} (\sigma _y^2 + \sigma _{l_b}^2){\varvec{I}}_{N_{l_b}N_{l_b}} \\ \sigma _y^2{\varvec{I}}_{N_{y_o}N_{y_o}} \\ (\sigma _y^2 + \sigma _{u_b}^2){\varvec{I}}_{N_{u_b}N_{u_b}} \end{bmatrix} \end{aligned}$$
(45)

Equation (45) defines the vector \( {\varvec{y}}_l \), the matrix \( {\varvec{K}}_{NM}^{l} \) and the block diagonal matrix \( \varvec{\Sigma }_{y_l} \). The symbol \( N_{u_b} \) refers to the number of censored upper bound observations. The symbol \( N_{y_o} \) denotes the number of noise-corrupted observations, collectively denoted by the vector \( {\varvec{y}}_{o} \in {\mathbb {R}}^{N_{y_o}\times 1} \), that are not censored. The matrix \( {\varvec{K}}_{{N_{l_b}}M} \) requires evaluating the user-specified kernel function between the training input locations associated with the vector \( {\varvec{f}}_{l_b} \) and the freely optimizable inducing input locations. A similar argument holds for matrix \( {\varvec{K}}_{{N_{u_b}}M} \). The matrix \( {\varvec{K}}_{{N_{y_o}}M} \) requires evaluating the kernel function between the training input locations associated with the vector \( {\varvec{f}}_{u_n} \) and the inducing input locations. We also note that \( {\varvec{K}}_{MN}^{l} = ({\varvec{K}}_{NM}^{l})^T \). After some algebraic manipulation of Eq. (43), we arrive at the following secondary variational lower bound

$$\begin{aligned} {\mathcal {F}}^*({\varvec{\theta }})&= \ln {\left\{ \frac{ \vert {\varvec{K}}_{MM} \vert ^{\frac{1}{2}}}{(2\pi )^{\frac{N_{y_o}}{2}}(\sigma _y^2)^{\frac{N_{y_o}}{2}}\vert {\varvec{Q}} \vert ^{\frac{1}{2}}} \exp {\left\{ {\mathcal {A}}_{{\mathcal {F}}^*}\right\} } \right\} } \nonumber \\&\quad - \frac{1}{2}\textrm{tr}\left\{ \varvec{\Sigma }_{y_l}^{-1}\left[ {\varvec{K}}_{NN}^{l} - {\varvec{K}}_{NM}^{l}{\varvec{K}}_{MM}^{-1}{\varvec{K}}_{MN}^{l}\right] \right\} \nonumber \\ {\mathcal {A}}_{{\mathcal {F}}^*}&= -\frac{1}{2}{{\varvec{y}}_l}^T{\varvec{A}}{\varvec{y}}_l+\frac{1}{2}{{\varvec{b}}}^{T}\varvec{\Sigma }_{c}^{-1}{\varvec{b}} + {{\varvec{c}}}^{T}\varvec{\Sigma }_{c}^{-1}{\varvec{1}}^{*} -{{\varvec{b}}}^{T}\varvec{\Sigma }_{c}^{-1}{\varvec{d}} \end{aligned}$$
(46)

Refer to Section A.3 in Appendix A for details on the derivation of the secondary variational lower bound. Recall that \( \varvec{\theta } \) collectively denotes the model parameters, which include the kernel function parameters, the variance parameters from the adjusted mixed-likelihood function, the inducing variable input locations, and the local likelihood lower bound parameters. Matrices \( {\varvec{A}} \), \( \varvec{\Sigma }_{c} \) and \( {\varvec{K}}_{NN}^{l} \) can be computed as follows

$$\begin{aligned} {\varvec{A}}&= \varvec{\Sigma }_{y_l}^{-1} - \varvec{\Sigma }_{y_l}^{-1}{\varvec{K}}_{NM}^{l}{\varvec{Q}}^{-1}{\varvec{K}}_{MN}^{l}\varvec{\Sigma }_{y_l}^{-1} \nonumber \\ \varvec{\Sigma }_{c}&= \textrm{diag} \begin{bmatrix} (\sigma _y^2 + \sigma _{l_b}^2){\varvec{I}}_{N_{l_b}N_{l_b}} \\ (\sigma _y^2 + \sigma _{u_b}^2){\varvec{I}}_{N_{u_b}N_{u_b}} \end{bmatrix} \end{aligned}$$
(47)
$$\begin{aligned} {\varvec{K}}_{NN}^{l} = \textrm{diag} \begin{bmatrix} {\varvec{K}}_{{N_{l_b}}{N_{l_b}}} \\ {\varvec{K}}_{{N_{y_o}}{N_{y_o}}} \\ {\varvec{K}}_{{N_{u_b}}{N_{u_b}}} \end{bmatrix} \end{aligned}$$
(48)

We denote with Eqs. (47) and (48) the block diagonal matrices \( \varvec{\Sigma }_{c} \) and \( {\varvec{K}}_{NN}^{l} \), respectively. Vectors \( {\varvec{b}} \), \( {\varvec{c}} \), \( {\varvec{1}}^{*} \) and \( {\varvec{d}} \) are defined as follows

$$\begin{aligned} {\varvec{b}}&= \begin{bmatrix} {\varvec{b}}_{l_b} \\ {\varvec{b}}_{u_b} \end{bmatrix}; \quad {\varvec{c}} = \begin{bmatrix} {\varvec{c}}_{l_b} \\ {\varvec{c}}_{u_b} \end{bmatrix} \nonumber \\ {\varvec{1}}^{*}&= \begin{bmatrix} {\varvec{1}}_{l_b} \\ {\varvec{1}}_{u_b} \end{bmatrix}; \quad {\varvec{d}} = \begin{bmatrix} (l_b)\times {\varvec{1}}_{l_b} \\ (u_b)\times {\varvec{1}}_{u_b} \end{bmatrix} \end{aligned}$$

Furthermore, we note that Eq. (46) is a valid secondary variational lower bound on the log marginal likelihood of the probabilistic model (see Eq. (15)) which can be maximized, using gradient-based optimization, to find point estimates for \( \varvec{\theta } \). This allows us to perform variational Bayesian model training and inference. We, therefore, refer to our proposed methodology as the Variational Tobit Gaussian process regression (VT-GPR) framework.
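As a rough illustration of how the quantities in Eq. (44) might be assembled in practice, the following sketch computes \( {\varvec{\mu }}_u \), \( {{\varvec{S}}}_u \) and \( {\varvec{Q}} \) from precomputed kernel blocks; the function name, argument layout and jitter value are our own choices and do not reflect the authors' implementation.

```python
import numpy as np

def optimal_q_u(K_MM, K_NM_l, Sigma_yl_diag, y_l, jitter=1e-6):
    # Optimal variational posterior q(u) = N(u | mu_u, S_u), Eq. (44).
    # K_MM          : (M, M) kernel matrix between inducing inputs
    # K_NM_l        : (N, M) cross-kernel matrix with rows ordered as in Eq. (45)
    # Sigma_yl_diag : (N,)   diagonal of the block diagonal matrix Sigma_yl
    # y_l           : (N,)   pseudo-target vector [b_lb, y_o, b_ub] of Eq. (45)
    M = K_MM.shape[0]
    K_MM = K_MM + jitter * np.eye(M)          # "jitter" for numerical stability
    K_MN_Sinv = K_NM_l.T / Sigma_yl_diag      # K_MN^l Sigma_yl^{-1}
    Q = K_MM + K_MN_Sinv @ K_NM_l
    mu_u = K_MM @ np.linalg.solve(Q, K_MN_Sinv @ y_l)
    S_u = K_MM @ np.linalg.solve(Q, K_MM)
    return mu_u, S_u, Q
```

The same intermediate quantities also appear in the collapsed bound of Eq. (46), so in a practical implementation they would typically be computed once per evaluation of the objective during gradient-based optimization.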

It is worth pointing out that one common criticism of \( {\mathcal {K}}{\mathcal {L}} \)-divergence-based variational inference (see Eq. (16)) is its tendency to underestimate the posterior density variance (Blei et al. 2017). However, simulation-based studies performed by Titsias (2009, see Figures 1 and 2) indicate that with enough inducing/support variables, the variational sparse GP framework is able to match the standard GP model prediction results. In this regard, \( {\mathcal {K}}{\mathcal {L}} \)-divergence-based variational inference does not necessarily underestimate the posterior density variance. Furthermore, when we set \( M = N \) inducing variables and place them at the training input locations, i.e., \( {\varvec{u}} = {\varvec{f}} \), the variational sparse GP framework of Titsias (2008, 2009) reduces to the standard GP regression framework (Hensman et al. 2013). However, due to the presence of censored observations, we do expect that the VT-GPR framework will display deteriorated prediction performance in censored latent function regions as a result of our proposed local variational method providing limited domain support for each Gaussian cdf factor (see Fig. 1).

Note that for a numerically stable implementation of the secondary variational lower bound, we propose following the idea outlined in Titsias (2008) which relies on the addition of “jitter” to the main diagonal elements of matrix \( {\varvec{K}}_{MM} \) to stabilize the optimization routine. Furthermore, it is also worth pointing out that in the absence of any censored observations, our proposed secondary variational lower bound reduces to the variational sparse GP lower bound derived in Titsias (2008, 2009).

4 VT-GPR Model Predictions

Model predictions for the latent function f at unsampled locations \( {\varvec{x}}^{*} \), which we collect in the prediction vector \( {\varvec{f}}^{*} \), follow the framework proposed by Titsias (2008, 2009). Starting from the joint density we have that

$$\begin{aligned} p({\varvec{f}}^{*} \vert {\varvec{y}})&= \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}} p({\varvec{f}}^{*},{\varvec{f}},{\varvec{u}} \vert {\varvec{y}})d{\varvec{f}}d{\varvec{u}} \\&= \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}}p({\varvec{f}}^{*} \vert {\varvec{f}},{\varvec{u}},{\varvec{y}})p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}})d{\varvec{f}}d{\varvec{u}} \end{aligned}$$

Given that \( {\varvec{f}}^{*} \) is conditionally independent of \( {\varvec{f}} \) and \( {\varvec{y}} \) given \( {\varvec{u}} \) we have that

$$\begin{aligned} p({\varvec{f}}^{*} \vert {\varvec{y}}) = \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}}p({\varvec{f}}^{*} \vert {\varvec{u}})p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}})d{\varvec{f}}d{\varvec{u}} \end{aligned}$$

From our variational approximation in Eq. (19), we know that

$$\begin{aligned} p({\varvec{f}},{\varvec{u}} \vert {\varvec{y}}) \approx p({\varvec{f}} \vert {\varvec{u}})q({\varvec{u}}) \end{aligned}$$

Therefore, we have that

$$\begin{aligned} p({\varvec{f}}^{*} \vert {\varvec{y}})&\approx \iint \limits _{{{\varvec{u}}}{{\varvec{f}}}}p({\varvec{f}}^{*} \vert {\varvec{u}})p({\varvec{f}} \vert {\varvec{u}})q({\varvec{u}})d{\varvec{f}}d{\varvec{u}} \nonumber \\ p({\varvec{f}}^{*} \vert {\varvec{y}})&\approx q({\varvec{f}}^{*}) = \int \limits _{{\varvec{u}}} p({\varvec{f}}^{*} \vert {\varvec{u}})q({\varvec{u}})d{\varvec{u}} \end{aligned}$$
(49)

We note that

$$\begin{aligned} p({\varvec{f}}^{*} \vert {\varvec{u}})&= {\mathcal {N}}({\varvec{f}}^{*} \vert {\varvec{K}}_{{N}^{*}M}{\varvec{K}}_{MM}^{-1}{\varvec{u}},\varvec{\Sigma }) \nonumber \\ \varvec{\Sigma }&= {\varvec{K}}_{{N}^{*}{N}^{*}} - {\varvec{K}}_{{N}^{*}M}{\varvec{K}}_{MM}^{-1}{\varvec{K}}_{M{N}^{*}} \end{aligned}$$
(50)

From Eqs. (44), (49) and (50) we can show that the latent function predictive density \( q({\varvec{f}}^{*}) \) takes the form of a multivariate Gaussian density parameterized by

$$\begin{aligned} q({\varvec{f}}^{*})&= {\mathcal {N}}({\varvec{f}}^{*} \vert \varvec{\mu }_{{{\varvec{f}}}^{*}},\varvec{\Sigma }_{{{\varvec{f}}}^{*}}) \\ \varvec{\mu }_{{{\varvec{f}}}^{*}}&= {\varvec{K}}_{{N^{*}}M}{\varvec{Q}}^{-1}{\varvec{K}}_{MN}^{l}\varvec{\Sigma }_{y_l}^{-1}{\varvec{y}}_l \\ \varvec{\Sigma }_{{{\varvec{f}}}^{*}}&= {\varvec{K}}_{{N^{*}}{N^{*}}} - {\varvec{K}}_{{N}^{*}M}{\varvec{K}}_{MM}^{-1}{\varvec{K}}_{M{N}^{*}} + {\varvec{K}}_{{N}^{*}M}{\varvec{Q}}^{-1}{\varvec{K}}_{M{N}^{*}} \end{aligned}$$
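A minimal sketch of these predictive equations follows, reusing the (assumed) argument names from the earlier q(u) sketch; it is an illustration under our own conventions rather than the authors' implementation.

```python
import numpy as np

def vtgpr_predict(K_starM, K_starstar, K_MM, K_NM_l, Sigma_yl_diag, y_l, jitter=1e-6):
    # Predictive density q(f*) = N(f* | mu_fstar, Sigma_fstar)
    M = K_MM.shape[0]
    K_MM = K_MM + jitter * np.eye(M)
    K_MN_Sinv = K_NM_l.T / Sigma_yl_diag                     # K_MN^l Sigma_yl^{-1}
    Q = K_MM + K_MN_Sinv @ K_NM_l
    mu_fstar = K_starM @ np.linalg.solve(Q, K_MN_Sinv @ y_l)
    Sigma_fstar = (K_starstar
                   - K_starM @ np.linalg.solve(K_MM, K_starM.T)
                   + K_starM @ np.linalg.solve(Q, K_starM.T))
    return mu_fstar, Sigma_fstar
```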

5 Experiments

To demonstrate the VT-GPR framework, we now consider its application to two synthetic examples and a real-world data set. For each synthetic example, we generate noise-corrupted observational data, which is then subjected to artificial censoring. Furthermore, throughout all our experiments we use the exponentiated quadratic kernel function (see Eq. (51)) with signal variance \( \sigma _f^2 \) and length scale l.

$$\begin{aligned} k(x_i,x_j) = \sigma _f^2\exp {\left\{ \frac{-(x_j - x_i)^2}{2l^2}\right\} } \end{aligned}$$
(51)

5.1 Synthetic Data: Example 1

In our first experiment, we reproduce the artificial example outlined in the work of Groot and Lucas (2012). They created a data set consisting of 30 equally spaced inputs on the interval [0, 1] and generated latent function outputs from

$$\begin{aligned} f(x) = (6x - 2)^{2}\sin {\left( 2(6x - 2)\right) } \end{aligned}$$
(52)

The data is then artificially contaminated by adding zero mean Gaussian distributed noise with variance \( \sigma _y^2=0.1 \). Groot and Lucas (2012) censor \( 40\% \) of the observations by calculating the 40th percentile of the data and use the corresponding value as the lower detection limit \( l_b \). We repeat this procedure for a randomly generated data set, using the available implementation of Groot and Lucas (2012).
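A short sketch of this data-generating and censoring procedure, as we understand it from the description above (the random seed and variable names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 30)                            # 30 equally spaced inputs on [0, 1]
f = (6.0 * x - 2.0)**2 * np.sin(2.0 * (6.0 * x - 2.0))   # latent function, Eq. (52)
y = f + np.sqrt(0.1) * rng.standard_normal(x.shape)      # additive Gaussian noise, variance 0.1
l_b = np.percentile(y, 40)                               # 40th percentile as the lower detection limit
y_censored = np.maximum(y, l_b)                          # censored values are reported at l_b
is_censored = y <= l_b                                   # roughly 40% of the observations
```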

We then train the T-GPR model on the censored data set using {1} expectation propagation (EP) (Groot and Lucas 2012), {2} the Laplace approximation (LA) (Ertin 2007) and {3} our proposed VT-GPR framework with gradient-based optimization. For {1} we use an in-house implementation, while for {2} we use the implementation in the publicly available GPstuff MATLAB toolbox (Vanhatalo et al. 2013). For the VT-GPR training procedure we fix the number of inducing variables at 15 (an arbitrary choice; see Section B.1 in Appendix B for more details).

To compare the T-GPR frameworks, we simulated 1000 additional independent data sets and trained the LA and EP-based T-GPR frameworks, as well as our VT-GPR framework, on each data set. For all data sets, we calculate the 40th percentile and use the corresponding value as the lower detection limit \(l_b \). We report {1} the root mean squared error (RMSE, Eq. (53)), {2} the mean absolute error (MAE, Eq. (54)), and {3} the mean negative log-loss (MNLL, Eq. (55)) to compare the model predictions from the various T-GPR frameworks to the true latent function. For all three criteria, smaller values imply better performance. We define the error measures as follows (see Rasmussen and Williams 2006; Lázaro-Gredilla et al. 2010; Groot and Lucas 2012):

$$\begin{aligned} \text {RMSE}(f,\varvec{\mu }_{{\varvec{f}}^{*}}) = \sqrt{\frac{1}{N^{*}}{\sum _{i=1}^{N^{*}}}\left( f_i - \left( \varvec{\mu }_{{\varvec{f}}^{*}}\right) _i\right) ^2} \end{aligned}$$
(53)
$$\begin{aligned} \text {MAE}(f,\varvec{\mu }_{{\varvec{f}}^{*}}) = \frac{1}{N^{*}}{\sum _{i=1}^{N^{*}}}\left| f_i - \left( \varvec{\mu }_{{\varvec{f}}^{*}}\right) _i \right| \end{aligned}$$
(54)
$$\begin{aligned} \text {MNLL}(f,\varvec{\mu }_{{\varvec{f}}^{*}}) = \frac{1}{N^{*}}{\sum _{i=1}^{N^{*}}} \left[ \frac{1}{2}\ln {(2\pi \sigma _i^2)} + \frac{\left( f_i - \left( \varvec{\mu }_{{\varvec{f}}^{*}}\right) _i\right) ^2}{2\sigma _i^2} \right] \end{aligned}$$
(55)

The symbol \( N^{*} \) denotes the total number of predicted latent function values. We denote with symbol \( f_i=f(x_i) \) the true underlying latent function value (see Eq. (52)), as indexed by \( x_i \), whereas the vector \( \varvec{\mu }_{{\varvec{f}}^{*}} \) denotes the mean model prediction. The symbol \( \sigma _i^2 \) denotes the marginal predictive variance associated with \( \left( \varvec{\mu }_{{\varvec{f}}^{*}}\right) _i \).
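For completeness, a direct NumPy transcription of the three error measures (the function and argument names are our own):

```python
import numpy as np

def rmse(f_true, mu_pred):
    # root mean squared error, Eq. (53)
    return np.sqrt(np.mean((f_true - mu_pred)**2))

def mae(f_true, mu_pred):
    # mean absolute error, Eq. (54)
    return np.mean(np.abs(f_true - mu_pred))

def mnll(f_true, mu_pred, var_pred):
    # mean negative log-loss, Eq. (55); var_pred holds the marginal predictive variances
    return np.mean(0.5 * np.log(2.0 * np.pi * var_pred)
                   + (f_true - mu_pred)**2 / (2.0 * var_pred))
```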

To illustrate the scalability of the proposed VT-GPR framework, we carry out a further experiment by training {1} the standard GP, {2} the LA-based, {3} the EP-based, and {4} the VT-GPR frameworks on increasingly large data sets, holding fixed the other aspects of the example described above. For each data set and framework, we generate 10 random parameter starting points and calculate the average run time per starting point. This procedure is repeated 10 times for each data set size, and we then average the resulting run times.

Fig. 2

Tobit Gaussian process regression results with \( l_b \) set to the 40th percentile of the uncensored observational data. Left Panel: T-GPR latent function predictive results using the Laplace approximation. Middle Panel: T-GPR latent function predictive results using the expectation propagation framework. Right Panel: VT-GPR latent function predictive results. Additional Information: The black ‘\(\times \)’-sign denotes the observational data (noisy and/or censored), the red line denotes the underlying latent function (see Eq. (52)) while the blue curve denotes the mean model prediction (model MAP estimate). The corresponding grey shaded area depicts the 99% point-wise credibility interval. Furthermore, the blue ‘+’-sign at the bottom of the right panel depicts the initial inducing input locations, which we initialized to evenly spaced input points across the function domain, while the optimized inducing input locations are depicted at the top of the right panel. We arbitrarily selected 15 as the number of inducing variables for our VT-GPR implementation (see Section B.1 in Appendix B for more details). (Color figure online)

Figure 2 shows the results from the reproduced example in Groot and Lucas (2012), where we censored based on a lower bound of \(l_b=-0.1185\). Qualitatively, from Fig. 2, we observe that all three T-GPR frameworks have the ability to learn an underlying representation that is consistent with the original latent function given by Eq. (52) from the censored data set. The interested reader is referred to the work of Groot and Lucas (2012, Figure 2) for comparisons of the LA and EP-based T-GPR frameworks against the standard GPR model (see Section 2) when the censored data are either treated as missing values or as uncensored observations exactly equal to the detection limit.

In Fig. 3 we depict and compare the distribution of the generated RMSE, MAE and MNLL results across the 1000 data sets using box plots. The additional dashed red line in Fig. 3 depicts the mean value of the generated results. Qualitatively, for the RMSE (left panel) and MAE (middle panel) results in Fig. 3, we observe that there is no significant difference in the predictive performance results across the T-GPR frameworks. Arguably, we can state that the Laplace-based T-GPR framework marginally outperforms the EP-based and VT-GPR frameworks. However, when we consider the MNLL performance measure results (right panel) we observe that the proposed VT-GPR framework performs worse when compared to the LA and EP-based frameworks.

Fig. 3 Box plot visualization for the generated RMSE (left panel), MAE (middle panel) and MNLL (right panel) results, respectively, for each T-GPR framework. The dashed red line depicts the mean value for each quantitative performance measure across the 1000 additional independently generated data sets. The interquartile range is denoted at the bottom whisker of each box plot. (Color figure online)

To understand the discrepancy in the predictive performance behaviour we observe in Fig. 3, we stratify the error measures, relative to the underlying latent function values calculated from Eq. (52), by the lower detection limit \( l_b \) such that

$$\begin{aligned} \text {MSE}(f,\varvec{\mu }_{{\varvec{f}}^{*}}) =&\, \frac{1}{N^{*}} \sum _{i:\, f(x_i) \le l_b} \left( f_i - \left( \varvec{\mu }_{{\varvec{f}}^{*}}\right) _i\right) ^2 \\&+ \frac{1}{N^{*}} \sum _{i:\, f(x_i) > l_b} \left( f_i - \left( \varvec{\mu }_{{\varvec{f}}^{*}}\right) _i\right) ^2 \end{aligned}$$

Note that instead of the RMSE we opted for the MSE (mean squared error) for convenience. This stratification procedure partitions each error measure into two contributing error components. The first stratified error component considers the predictive performance in the lower bound censored latent function regions, whereas the second error component considers the predictive performance in the uncensored latent function regions. The MAE and MNLL stratified error measures are constructed in a similar fashion. The bold values in Tables 1, 2, and 3 indicate the best-performing framework, for each performance measure, for the example under consideration.
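
A minimal sketch of this stratification, assuming the true latent values, the mean predictions and the detection limit are available as NumPy arrays and a scalar (names illustrative):

```python
import numpy as np

def stratified_mse(f_true, mu_pred, l_b):
    """Split the MSE into a censored-region and an uncensored-region
    contribution, both normalized by the full test size N*."""
    n_star = f_true.size
    censored = f_true <= l_b                  # latent values at or below l_b
    sq_err = (f_true - mu_pred) ** 2
    mse_censored = sq_err[censored].sum() / n_star
    mse_uncensored = sq_err[~censored].sum() / n_star
    return mse_censored, mse_uncensored       # their sum is the ordinary MSE
```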

Table 1 summarises the mean error component contributions for each of the stratified error measures across the 1000 additional independently generated data sets. We observe that, in the lower bound censored latent function regions, the quantitative performance measure contributions for the VT-GPR framework are, on average, larger when compared to the LA and EP-based frameworks, indicating worse predictive performance. The deteriorated performance is especially noticeable from the MNLL performance measure. We suspect that the worse performance is a result of the single regulating variance parameter \( (\sigma _{l_b}^2) \) that we introduced in the adjusted mixed-likelihood (refer to Appendix B Section B.2 for more details).

However, in the uncensored latent function regions, the VT-GPR framework provides predictive performance results that are, on average, comparable to the LA and EP-based frameworks. Table 1 also highlights that, on average, for the example under consideration, the Laplace-based T-GPR framework outperforms the EP-based and VT-GPR frameworks.

Table 1 Summary of the mean contributions to each quantitative performance measure for the various T-GPR frameworks for Example 1
Fig. 4 Average optimization run time for {1} the standard GP model (black), {2} the LA-based T-GPR framework (red), {3} the EP-based approach (magenta), and {4} the VT-GPR framework (blue). The various curves depict and compare the average log (base 10) computational run time for the different frameworks, across the 10 repeated experiments, together with three standard deviation error bars. (Color figure online)

Turning our attention to the scalability results, in Fig. 4 we depict and compare the average computational run time for the various frameworks, across the 10 repeated experiments, together with error bars corresponding to three standard deviations. From Fig. 4 we see a clear separation, i.e., no overlapping error bars, between the computational run times of the standard GP and the VT-GPR framework at a data set size of approximately 3500 points; beyond this size, the VT-GPR framework becomes computationally less demanding than the standard GP model. We also observe that beyond approximately 1500 data points, the VT-GPR framework starts to computationally outperform the LA and EP-based approaches.

5.2 Synthetic Data: Example 2

For our second experiment, we expand the example outlined in the work of Groot and Lucas (2012) by introducing an upper detection limit based on the 90th percentile of the uncensored observational data, thereby increasing the percentage of censored observations from 40% (Example 1) to 50%.

We create a data set consisting of 30 equally spaced inputs on the interval [0, 1.15] and generate latent function outputs from Eq. (52). Following this, we artificially contaminate the latent function outputs by adding zero mean Gaussian distributed noise with variance \(\sigma _y^2 = 0.1\). We censor the observations using a lower detection limit \(l_b\) equal to the 40th percentile and an upper detection limit \( u_b \) equal to the 90th percentile of the uncensored observational data.
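
A minimal sketch of this data-generating procedure follows. Since Eq. (52) is defined earlier in the paper and not repeated here, `latent_function` is a placeholder for it; the remaining names are illustrative.

```python
import numpy as np

def make_example2_data(latent_function, n=30, noise_var=0.1, seed=0):
    """Generate the Example 2 data set: noisy observations of the latent
    function on [0, 1.15], censored at the 40th and 90th percentiles."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.15, n)
    f = latent_function(x)                          # placeholder for Eq. (52)
    y = f + rng.normal(0.0, np.sqrt(noise_var), n)  # add zero-mean Gaussian noise
    l_b = np.percentile(y, 40)                      # lower detection limit
    u_b = np.percentile(y, 90)                      # upper detection limit
    y_censored = np.clip(y, l_b, u_b)               # censor outside [l_b, u_b]
    return x, y_censored, l_b, u_b
```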

Plots of the latent function predictive results from the single simulation, as well as box plots comparing the performance metrics across 1000 independently generated data sets, for the various T-GPR frameworks, are shown in Figs. 11 and 12, respectively, in Section B.3 of Appendix B. Note that, similar to Example 1, we fixed the number of inducing variables to 15 for all the VT-GPR training procedures. As in Example 1, all the T-GPR frameworks are able to learn a good underlying representation of the latent function from the censored data. However, the LA and EP-based T-GPR frameworks produce more conservative, i.e., larger, credibility intervals compared to our proposed VT-GPR framework. In contrast to the results from Example 1, the VT-GPR framework marginally outperforms the LA and EP-based frameworks in terms of RMSE and MAE; however, we again see that the VT-GPR framework performs worse on the MNLL.

As before, we stratify the error measures, relative to the underlying latent function values calculated from Eq. (52), by the lower detection limit \( l_b \) and the upper detection limit \( u_b \). Refer to Table 2 for a summary of the mean error component contributions for each of the stratified error measures across the 1000 additional independently generated data sets.

Table 2 Summary of the mean contributions to each quantitative performance measure for the various T-GPR frameworks for Example 2

From Table 2 we observe that, on average, for the example under consideration, the VT-GPR framework seems to provide slightly better mean model prediction results (see the mean MSE and MAE) in the censored latent function regions and comparable results in the uncensored latent function regions. However, similar to Example 1, we observe that the MNLL error measure contributions for the VT-GPR framework are larger when compared to the LA and EP-based frameworks, indicating worse predictive performance. This arises due to the less conservative credibility intervals produced by the proposed VT-GPR framework relative to the competing benchmarks.

5.3 Real-World Data: Example 3

Our third experiment focuses on the application of the T-GPR frameworks on a real-world data set. We sourced an electrical conductivity (EC) data set, for the Vaal River at Groot Vadersbosch/Buffelsfontein, from the Department of Water and Sanitation, South Africa (DWS 2019). The electrical conductivity of water is a measure of its ability to conduct electrical current and is affected by the presence of positively and negatively charged ions from dissolved salts and other chemicals.

Water bodies, like the Vaal River, tend to have a constant EC range. Once the EC range has been established, it can be used as a baseline for comparison with future EC measurements. If we observe significant changes in the electrical conductivity, relative to the baseline, it can be an indicator that some source of pollution has entered the water body. Thus, we can think of EC as a useful measure of water quality where, generally speaking, lower EC values indicate better water quality (EPA 2022).

The Vaal River EC data, measured in milli-siemens per meter (mS/m), was collected between 03 January 1984 and 11 July 1997, with a manual EC sample taken from Monday to Friday (excluding holidays). To perform our regression analysis, we convert the EC measurement dates into serial numbers, where, by default, 01 January 1900 corresponds to serial number 1. We then subtract the serial number associated with 03 January 1984 such that the first EC entry in the Vaal River data set corresponds to taking an EC measurement on day 0 (our reference time stamp).
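
A possible implementation of this date conversion is shown below, assuming the measurement dates are available as Python `date` objects. The offset introduced by the 1900 serial-number convention cancels in the subtraction.

```python
from datetime import date

REFERENCE = date(1900, 1, 1)      # corresponds to serial number 1 by convention
FIRST_SAMPLE = date(1984, 1, 3)   # first EC measurement (our day 0)

def day_index(measurement_date):
    """Convert a measurement date into the number of days elapsed
    since the first EC sample on 03 January 1984."""
    serial = (measurement_date - REFERENCE).days + 1        # 01 Jan 1900 -> 1
    first_serial = (FIRST_SAMPLE - REFERENCE).days + 1
    return serial - first_serial                            # 03 Jan 1984 -> 0

print(day_index(date(1984, 1, 3)))   # 0, the reference time stamp
print(day_index(date(1997, 7, 11)))  # days elapsed for the last EC sample
```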

We create our data set by extracting the last 150 EC measurements and the corresponding time stamps. Next, we calculate the 10th and 90th percentile of the 150 data points and use the calculated values as the artificial lower detection limit \( l_b \) and upper detection limit \( u_b \), respectively. For the 150 EC data points, we find that \( l_b=26.5\) mS/m and \( u_b=43.5 \) mS/m. From an implementation perspective, recall that we used the publicly available GPstuff MATLAB toolbox to train the LA-based T-GPR framework. We selected the 150 EC measurements as this resulted in a numerically stable implementation for each of the T-GPR frameworks. Furthermore, the 150 measurements also capture enough interesting latent function behaviour to provide a fair predictive performance comparison.
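
The extraction of the last 150 measurements and the artificial censoring can be sketched as follows, assuming the EC values are stored, in chronological order, in a NumPy array (names illustrative):

```python
import numpy as np

def censor_ec(ec, lower_pct=10, upper_pct=90, n_last=150):
    """Take the last n_last EC measurements (mS/m) and censor them at the
    chosen lower and upper percentiles."""
    ec_subset = ec[-n_last:]
    l_b = np.percentile(ec_subset, lower_pct)   # 26.5 mS/m for the Vaal River subset
    u_b = np.percentile(ec_subset, upper_pct)   # 43.5 mS/m for the Vaal River subset
    return np.clip(ec_subset, l_b, u_b), l_b, u_b
```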

Next, we train the following regression frameworks on the artificially censored EC data: {1} the standard Gaussian process regression (GPR) model, {2} the standard GPR model with the censored observations treated as missing (i.e., removing the censored data), {3} the LA-based T-GPR framework, {4} the EP-based approach, and {5} the VT-GPR framework. The latent function predictive results are shown in Fig. 5.

From Fig. 5 we qualitatively observe that the T-GPR frameworks (c, d, and e) produce fairly similar prediction results. Interestingly enough, the standard GPR model (a) produces prediction results that are in line with the results obtained from the various T-GPR frameworks in the uncensored latent function regions.

Fig. 5 Gaussian process regression results with \( l_b \) (grey dashed line) and \( u_b \) (grey dotted line) set to the 10th and 90th percentile of the EC observational data, respectively. For panels (a) to (e) the blue curve denotes the mean model prediction (model MAP estimate) whereas the grey shaded area depicts the 99% point-wise credibility interval. The black ‘\(\times \)’-sign denotes the noisy uncensored EC data, whereas the red ‘\(\times \)’-sign denotes the noisy EC data that are artificially censored during the model training procedures. Note that for panel (b) the artificially censored observations are treated as missing values. Panel (f) depicts the mean model prediction for the various regression frameworks shown in panels (a) to (e). We arbitrarily selected 35 as the number of inducing variables for our VT-GPR implementation. (Color figure online)

Table 3 Summary of the mean quantitative performance measures for the various Gaussian process regression frameworks for Example 3

However, in the censored latent function regions, the standard model appears to directly interpolate the censored observations (note how the MAP estimate peaks at the detection limits; see also panel (f)). This behaviour arises because the censored observational data form part of the observation vector \( {\varvec{y}} \) (see Section 2), a direct consequence of the standard GPR model's inability to account for the data censoring mechanism.

The standard missing data GP regression model (b) also produces prediction results that are quite consistent with the various T-GPR frameworks but, in contrast to the GP model depicted in (a), interpolates smoothly across the missing data regions. We also observe that this interpolating behaviour is accompanied by more conservative, i.e., larger, point-wise credibility intervals, indicating that the model is less confident about the behaviour of the underlying latent function in the missing data regions. For a visual comparison of the mean model estimates across the various GPR frameworks, refer to panel (f) in Fig. 5.

Table 3 reports the mean quantitative performance measures for each of the 5 regression frameworks, together with three standard deviation error bars, obtained from 10 independent training runs, where each training run was optimized across 1000 randomly generated parameter starting points. From an error measure perspective, we observe that the T-GPR frameworks outperform the standard GPR models. We also observe that the VT-GPR MNLL is higher than that of the LA and EP-based frameworks.

This, again, emphasizes that the proposed VT-GPR framework produces less conservative, i.e., smaller, credibility intervals relative to the competing LA and EP-based benchmarks. However, when we consider the mean model prediction results (see panel (f), the RMSE, and the MAE), we observe that the T-GPR frameworks produce quite comparable performance results with the LA-based framework slightly outperforming the EP-based and VT-GPR frameworks.

6 Discussion and Limitations

In this article, we introduced a variational inference-based framework for training a GP regression model subject to censored data. Our proposed framework relies on the variational sparse GP inducing variable framework and local variational methods, which allow us to variationally integrate out the latent function values associated with the Gaussian cdf factors (which would otherwise be analytically intractable). We demonstrated the proposed VT-GPR framework on synthetically produced data sets, as well as on a real-world data set, both subject to artificial censoring, and found that the framework can produce results comparable to other methodologies presented in the literature. However, the proposed VT-GPR framework computationally outperforms the standard GPR model, as well as the competing benchmarks, for larger data sets (refer to Fig. 4).

Furthermore, the proposed framework can also be used as a mathematical tool to gain access to the Tobit-based (B)GP-LVM with uncertain model inputs, i.e., x in our framework, if we restrict the uncertain inputs to have Gaussian prior densities (Titsias and Lawrence 2010; Damianou et al. 2016). This would allow us to extend the (B)GP-LVM to the censored data regime which can prove very useful in many real-world applications where practitioners typically collect noise-corrupted observations for the model inputs and outputs (where the output data can be subjected to some censoring mechanism).

Note that the VT-GPR framework’s biggest limitation arises from the proposed ‘local’ lower bounding strategy we introduced in Section 3.2. Recall that the local likelihood lower bound parameters \( b_i \) and \( c_i \) can be expressed in terms of a single freely optimizable parameter \( \zeta _i \). However, due to constraints imposed by asymptotics, the parameter \( a_i \) is restricted to the value \( (-\frac{1}{2}) \). Consequently, the local lower bound is unable to adjust its width (i.e., the function domain support) since the parameter \( a_i \) is fixed and not a function of \( \zeta _i \). This directly influences the VT-GPR framework’s approximation performance (Nickisch and Rasmussen 2008). Despite the introduction of the additional variance parameters in an attempt to regulate the local lower bound support, the VT-GPR framework still tends to underestimate the predictive variance, relative to the competing frameworks, in the censored latent function regions (see the MNLL error measure scores for the various illustrative examples).
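
Schematically, writing \( z_i \) for the argument of the \( i \)-th Gaussian cdf factor and suppressing the exact parameterization given in Section 3.2 (not reproduced here), each local bound has the quadratic form

$$\begin{aligned} \ln \Phi (z_i) \ge a_i z_i^2 + b_i(\zeta _i)\, z_i + c_i(\zeta _i), \qquad a_i = -\tfrac{1}{2}, \end{aligned}$$

so only the linear and constant terms can adapt through \( \zeta _i \); with the curvature fixed, the bound cannot change its width.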

Furthermore, due to the imposed local lower bounding strategy and the limitation associated with the parameter \( a_i \), the optimal, and only free-form, variational density \( q({\varvec{u}}) \) must obey the restrictions associated with each local lower bound to ensure that we have a valid secondary variational lower bound, which can result in worse approximation/prediction performance (Nickisch and Rasmussen 2008).

Another limiting feature worth pointing out is the tightness of the secondary lower bound. Recall that we are free to choose the parameters \( \zeta _i \), which we do by finding the values of \( \zeta _i \) that maximize our lower bound. The resulting secondary variational lower bound value then represents the tightest bound within the entire family of bounds that can be used as an approximation to \( \ln {p({\varvec{y}})} \). However, the optimized bound will in general not be exact. Even though we can exactly optimize the local lower bound for each Gaussian cdf factor, the required value of \( \zeta _i \) depends on the value of \( f_i \), so the local lower bound is tight for only one value of \( f_i \) (refer to Fig. 1). However, the quantity \( {\mathcal {F}}^{*}(\varvec{\theta }) \) is obtained by integrating over all possible values of the latent vector \( {\varvec{f}} \), followed by integrating over all possible values of the inducing variable vector \( {\varvec{u}} \). Consequently, the values of \( \zeta _i \) that maximize our secondary variational lower bound represent a compromise, as weighted by the variational posterior densities \( p({\varvec{f}} \vert {\varvec{u}}) \) and \( q({\varvec{u}}) \), which directly influences the predictive performance of the proposed VT-GPR framework (Bishop 2009).

For future work, we would like to explore the idea of allowing each censored observation to have its own unique regulating variance parameter which, from a theoretical standpoint, should increase the VT-GPR’s regulating capacity, resulting in improved approximation and prediction performance. This would be in line with the ideas initially proposed by Gammelli et al. (2020b).

Furthermore, since the values of \(\zeta _i\) represent a compromise, as weighted by the posterior densities, we can consider using some form of regularizer to encourage better point estimates for \(\zeta _i\) in an attempt to improve the approximation performance. Alternatively, we could dispense with the local lower bound approach introduced in Section 3.2 and follow the ideas outlined in Hensman et al. (2015, Section 4) where we use numerical integration to circumvent the intractabilities that arise from the presence of the Gaussian cdf terms in the likelihood function.