1 Introduction

Short-term forecasting of environmental processes has many applications, including solar and wind power generation, ambient air pollution, and extreme weather events. In this paper, we combine numerical model output with statistical methods to forecast hurricane wind intensity. Rather than providing a single value as the point prediction, we model the entire uncertainty distribution of the response given the numerical model forecast. This conditional distribution regression provides a comprehensive assessment of uncertainty, including the forecast distribution’s spread, skewness, and tail probabilities.

Conditional distribution estimation can be applied to many ecological and environmental datasets where the response is distributed in a non-Gaussian manner. For example, when forecasting exposure to air or water pollution, we may be interested in both the average exposure and the probability of exposure exceeding a critical threshold known to have adverse health effects. Similarly, when forecasting precipitation, modeling the entire predictive distribution might be critical for quantifying the likelihood of a severe event, such as a flood, as computed by applying a rainfall-runoff model to samples from the precipitation forecast distribution. In this paper, we adopt this framework for our short-term tropical cyclone intensity forecasting problem.

To provide a flexible prediction model, we incorporate supervised machine learning methods, which have become a popular tool for statistical analysis in the last few decades. Methods such as random forest regression, neural networks, and linear regression can be employed using state-of-the-art statistical software to capture complicated relationships between covariates and target variables. Generally, machine learning predictive modeling has been developed for making point predictions such as the conditional mean or median, with accompanying prediction interval techniques providing uncertainty quantification. This differs from conditional density estimation, a technique that estimates the full distribution of the target variable given the covariates. In some applications, conditional density estimation is preferred. For instance, an estimate of the distribution of a tropical cyclone’s maximum wind speed conditional on the sea surface temperature can provide information not available from a conditional mean estimate. A certain sea surface temperature might result in a strongly positively skewed maximum wind speed distribution, giving a better idea of the worst-case scenario under these conditions.

Various approaches have been developed to estimate the distribution of the target variable conditional on the covariates. One technique is to estimate the joint distribution of the target variable and covariates as well as the joint distribution of the covariates and divide the former by the latter. Kernel density estimation of these two densities is a common approach, first proposed by Rosenblatt (1969). Hyndman et al. (1996) modify the standard kernel density estimator to obtain a smoother with better bias properties. Hall et al. (1999) propose to use an adjusted Nadaraya–Watson estimator for the kernel estimation. These methods become intractable as the covariate dimension increases. Proposed remedies include modifications that reduce the covariate space and density estimators designed for high-dimensional data (Hall et al. 2004; Hall and Yao 2005; Fan et al. 2009).

Bayesian nonparametric mixture modeling is another common conditional density estimation approach. Finite mixture models (FMMs) are a subset of mixture modeling techniques that consider the conditional target distribution to be a mixture of several parametric (often Gaussian) distributions (Escobar and West 1995; Gilardi et al. 2002; Song et al. 2004; Rojas et al. 2005; Fahey et al. 2007). Covariate effects can be introduced through the mixing proportions, the component densities, or both. Markov chain Monte Carlo (MCMC) methods are often used to fit these models (Peng et al. 1996; Wood et al. 2002; Geweke and Keane 2007). FMMs require specifications such as the number of components or the values of the mixing proportions, and these choices can affect their overall inference capabilities.

Infinite mixture models are another common Bayesian nonparametric mixture modeling approach. One class of infinite mixture model techniques attempts to directly estimate the conditional density via an infinite set of mixture weights and a mixing process prior that depends on the covariates. Dunson et al. (2007) develop a Bayesian density regression model using a local, covariate-weighted mixture of Dirichlet process (DP) priors. Trippa et al. (2011) and Jara and Hanson (2011) propose the use of a Polya tree (PT) prior and induce dependence through different definitions of the splitting probabilities. Tokdar et al. (2010) forego these priors and develop a model using logistic Gaussian processes and subspace projection. Still, Bayesian nonparametric density estimation can be computationally burdensome as data complexity increases, which has led to the proposal of variable selection techniques (Chung and Dunson 2009; Kundu and Dunson 2014). Infinite mixture models for estimating the joint distribution of the response and covariates have also been proposed (Müller et al. 1996; Shahbaba and Neal 2009; Park and Dunson 2010; Taddy and Kottas 2010; Hannah et al. 2011). A disadvantage of this class of techniques is that it does not directly estimate the conditional density, and it can become computationally slow as the dimension of the problem increases.

Machine learning algorithms are another useful and arguably more accessible class of conditional density estimation methods. One approach is to use an orthogonal series density estimator that adapts to the geometric features of the data and reduces the dimension of the problem, with later improvements incorporating regression and deep learning algorithms (Efromovich 2010; Izbicki and Lee 2016, 2017; Dalmasso et al. 2020). Meinshausen (2006) proposes the foundational quantile regression forest (QRF) method. By retaining all observations in each leaf, a random forest can be used to estimate the full conditional distribution as a weighted combination of the observed responses across trees. Multiple conditional density estimation methods using random forests to improve on QRF accuracy and/or speed have been developed (Tung et al. 2014; Hothorn and Zeileis 2017; Pospisil and Lee 2018). Recently, Li et al. (2019) proposed deep distribution regression (DDR) as a deep learning-based conditional distribution technique. DDR uses cutpoints to discretize the response space and applies a multi-class classification method (such as a neural network) to the resulting bins. Li et al. (2019) also give an approach that accounts for bin ordering by applying a binary classification model at each cutpoint and jointly estimating the conditional cumulative distribution function. Payne et al. (2020) also develop a partition-based method with flexible logistic Gaussian processes fit within each partition, using a Laplace approximation to overcome the analytical challenges of logistic Gaussian process evaluation.

Similar to DDR, we consider a conditional density estimation approach that incorporates machine learning algorithms for our short-term tropical cyclone intensity forecasting problem. A logistic transformation is applied to the model output to obtain an expression for the conditional density function. The flexibility of the model specification allows algorithms such as polynomial regression or deep learning models to be used. Unlike the partition-based methods, our method evaluates only a single set of model parameters and simultaneously estimates the full conditional distribution. This information-sharing allows our method to forecast well when minimal data are available, and the relatively limited number of parameters to be estimated ensures computational speed for the polynomial regression model choice. The gradient calculation can quickly become intractable for complex model choices, so we incorporate theory from ecological and epidemiological statistics. Fithian (2013) reviews models that can be used to evaluate presence-only survey data, including the inhomogeneous Poisson process (IPP) model. We adapt the IPP framework to our data setting to justify a discrete approximation of our forecasting method for computational purposes. We also justify a special case of this method in a matched case–control context to further increase computational efficiency (Jarner et al. 2002).

After a review of the method and some potential model choices, we discuss the computational considerations for its implementation. Following this, the methodological strengths and weaknesses of our method are explored with a simulation and a short-term forecasting application, with the takeaways and next steps summarized in a discussion section.

2 Methods

We are interested in approximating the conditional distribution of response variable \(\textit{Y}\in {\mathbb {R}}\) given the covariate information \({\varvec{X}}\in {\mathbb {R}}^{p}\), denoted \(h(\textit{y}|{\varvec{X}})\). Our method requires a lower and upper bound for the target variable, which we address through a transformation of the response variable onto the unit interval. Suppose we transform \(\textit{Y}\) through a cumulative distribution function G as \(\textit{Z}=G(\textit{Y}|{\varvec{X}})\in [0,1]\). Note that the transformation of \(\textit{Y}\) into \(\textit{Z}\) on the unit interval is not unique; we could instead specify lower and upper bounds for \(\textit{Y}\) on its original scale.

In this section, we outline our method for approximating the conditional distribution of the transformed response, \(f(\textit{z}|{\varvec{X}})\); however, the conditional density of the original response, \(h(\textit{y}|{\varvec{X}})\), can be recovered by applying the change-of-variables formula as

$$\begin{aligned} h(\textit{y}|{\varvec{X}})=f(G(\textit{y})|{\varvec{X}})\bigg | \frac{\partial }{\partial \textit{y}} G(\textit{y}) \bigg |. \end{aligned}$$
(1)

If \(f(\textit{z}|{\varvec{X}})\) is the uniform density, the resulting distribution \(h(\textit{y}|{\varvec{X}})\) is governed by G. In other words, G serves as the base predicted distribution family; if no transformation of \(\textit{Y}\) is made, the base distribution is uniform.

2.1 Logistic transformation

Let \(q(\textit{z},{\varvec{X}})\) be a smooth function over \(\textit{z}\) and \({\varvec{X}}\). The logistic transformation (e.g. Lenk (1988)) relates \(q(\textit{z},{\varvec{X}})\) to \(f(\textit{z}|{\varvec{X}})\) as

$$\begin{aligned} f(\textit{z}|{\varvec{X}})=\frac{e^{q(\textit{z},{\varvec{X}})} }{ \int _{0}^{1} e^{q(\textit{u},{\varvec{X}})}d\textit{u}}. \end{aligned}$$
(2)

Since \(q(\textit{z},{\varvec{X}})=A(\textit{z},{\varvec{X}})+B({\varvec{X}})\) gives the same density as \(q(\textit{z},{\varvec{X}})=A(\textit{z},{\varvec{X}})\), the main-effect terms for \({\varvec{X}}\) are removed. Because the form of \(q(\textit{z},{\varvec{X}})\) is arbitrarily flexible, any smooth conditional probability density function \(f(\textit{z}|{\varvec{X}})\) can be modeled with this transformation. In practice, the integral in the denominator may be intractable. Discrete approximation techniques are discussed in Sect. 3 after introducing potential model choices.
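To make the transformation concrete, the sketch below evaluates Eq. (2) by numerical integration over a fine grid and recovers the original-scale density via Eq. (1). It is an illustration only, not the fitting procedure: the quadratic form of q_example and the Gaussian choice of G (with fixed mean and scale) are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

def q_example(z, x):
    # Hypothetical smooth q(z, x) used only for illustration; main-effect terms
    # in x alone are omitted because they cancel in Eq. (2).
    return 3.0 * (z - 0.5) * x - 4.0 * (z - 0.5) ** 2

def f_z_given_x(z, x, n_grid=1000):
    # Eq. (2): logistic transformation, with the integral over [0, 1]
    # approximated by the trapezoidal rule on a fine grid.
    u = np.linspace(0.0, 1.0, n_grid)
    denom = np.trapz(np.exp(q_example(u, x)), u)
    return np.exp(q_example(z, x)) / denom

def h_y_given_x(y, x, mu=0.0, sigma=1.0):
    # Eq. (1): change of variables with G taken to be a Normal(mu, sigma) CDF,
    # so the base predicted distribution family is Gaussian.
    z = norm.cdf(y, loc=mu, scale=sigma)
    return f_z_given_x(z, x) * norm.pdf(y, loc=mu, scale=sigma)

# Density of Y at a few points, given a covariate value of x = 0.8.
print(h_y_given_x(np.array([-1.0, 0.0, 1.0]), x=0.8))
```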

A smooth underlying \(q\) function allows for the simultaneous estimation of a single set of model parameters. A similar logistic transformation on an underlying model was used in Tokdar and Kadane (2012) to develop a simultaneous quantile regression estimation method. The information-sharing inherent in this approach enabled the estimation of multiple quantiles concurrently, improving on previous quantile regression estimation methods.

Another advantage of this method is its flexibility. The only requirement on the \(q\) function is smoothness, which allows for many nonparametric model possibilities. We consider two such models in this paper, both drawing on machine learning ideas: a polynomial regression model and a deep learning model. However, our method can easily be applied to other smooth model choices, such as an additive model with splines.

2.2 Polynomial regression model

The Weierstrass Approximation Theorem states that for any continuous real-valued function on a closed interval, there exists a polynomial function that can approximate it arbitrarily well (Weierstrass 1885). The polynomial function is therefore a logical candidate for the smooth function in our method. Let \(B\) be an integer representing the largest polynomial power applied to the centered \(\textit{Z}\) values, with \(b=1,\ldots ,B\) indexing the polynomial power. Recall that \(j=1,\ldots ,p\) indexes the covariates. Also, let \(o=1,\ldots ,O\) index the polynomial degree associated with the covariate terms. We set \(O=2\) and give the second-order model as

$$\begin{aligned} q(\textit{z},{\varvec{X}})=\sum \limits _{b=1}^{B}\bigg [(\textit{z}-0.5)^b\xi _{b0}+\sum \limits _{j=1}^{p}\sum \limits _{o=1}^{2}(\textit{z}-0.5)^bX_{j}^{o}\xi _{bjo} + \sum \limits _{j\ne k}(\textit{z}-0.5)^bX_{j}X_{k}\upsilon _{bl}\bigg ] \end{aligned}$$
(3)

where \(\xi _{b0}\) represent the intercept coefficients, \(\xi _{bjo}\) represent the covariate coefficients, and \(\upsilon _{bl}\) represent the coefficients for the \(l\)th covariate interaction term. A higher-order model follows this structure in the obvious way. The \(\textit{z}\) terms are centered by subtracting 0.5 to reduce collinearity, and the main effects of \({\varvec{X}}\) are omitted because they do not affect the conditional distribution.
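As an illustrative sketch (not the authors' code), the helper below assembles the basis vector multiplying the coefficients in Eq. (3), so that \(q(\textit{z},{\varvec{X}})\) is a single dot product with one parameter vector; the function names and feature ordering are assumptions made for the example.

```python
import numpy as np
from itertools import combinations

def polynomial_features(z, x, B=3):
    """Basis vector for the second-order polynomial model in Eq. (3).

    Main-effect terms in x alone are excluded because they cancel in Eq. (2).
    """
    zc = z - 0.5                                      # centered response value
    feats = []
    for b in range(1, B + 1):
        zb = zc ** b
        feats.append(zb)                              # intercept term (z - 0.5)^b
        for j in range(len(x)):
            feats.append(zb * x[j])                   # linear covariate terms
            feats.append(zb * x[j] ** 2)              # quadratic covariate terms
        for j, k in combinations(range(len(x)), 2):
            feats.append(zb * x[j] * x[k])            # pairwise interaction terms
    return np.array(feats)

def q_polynomial(z, x, theta, B=3):
    # q(z, X) from Eq. (3), parameterized by a single coefficient vector theta.
    return polynomial_features(z, x, B) @ theta
```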

2.3 Deep learning model

A deep learning model is another natural choice for the underlying smooth function. The universal approximation theorem states that a feed-forward artificial neural network with at least one hidden layer can approximate a continuous function on a compact space arbitrarily well (Hornik et al. 1989). We propose a deep learning model with an input layer, at least one hidden layer, and an output layer. One hidden layer is given here for notational simplicity, but additional layers could be added if desired. Let \(\delta , \gamma \), and \(\beta \) represent the output layer, hidden layer, and input layer parameters, respectively. Let \(\textit{H}\) and \(\textit{I}\) represent the output and hidden layer nodes, respectively. Lastly, let \(r=1,\ldots ,R\) and \(t=1,\ldots ,T\) index the neurons in the hidden and output layers, respectively. The model is

$$\begin{aligned} q(\textit{z},{\varvec{X}})=\;&\sum \limits _{t=1}^{T} \delta _{t}f_A(\textit{H}_{t}), \end{aligned}$$
(4)
$$\begin{aligned} \textit{H}_{t}=\;&\gamma _{0t}+\sum \limits _{r=1}^{R} \gamma _{tr}f_A(\textit{I}_{r}), \end{aligned}$$
(5)
$$\begin{aligned} \textit{I}_{r}=\;&\beta _{0r}+\beta _{1r}(\textit{z}-0.5)+\sum \limits _{j=2}^{p+1}\beta _{jr}X_{j}, \end{aligned}$$
(6)

where \(f_A\) is an activation function. The exponential linear unit (ELU) and the rectified linear unit (ReLU) are two possible choices of activation function.
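A minimal forward-pass sketch of Eqs. (4)–(6) is given below, assuming parameter shapes and names (beta, gamma, delta, etc.) chosen for illustration; it is not the fitted model used later in the paper.

```python
import numpy as np

def elu(v, alpha=1.0):
    # Exponential linear unit activation function.
    return np.where(v > 0, v, alpha * (np.exp(v) - 1.0))

def q_network(z, x, params, f_act=elu):
    """Forward pass for q(z, X) in Eqs. (4)-(6).

    Assumed parameter shapes: beta0 (R,), beta1 (R,), beta (R, p);
    gamma0 (T,), gamma (T, R); delta (T,).
    """
    # Eq. (6): nodes I_r built from (z - 0.5) and the covariates.
    I_nodes = params["beta0"] + params["beta1"] * (z - 0.5) + params["beta"] @ x
    # Eq. (5): nodes H_t built from the activated I_r.
    H_nodes = params["gamma0"] + params["gamma"] @ f_act(I_nodes)
    # Eq. (4): scalar output q(z, X).
    return params["delta"] @ f_act(H_nodes)

# Example with R = T = 30 nodes and p = 5 covariates, randomly initialized.
rng = np.random.default_rng(0)
R, T, p = 30, 30, 5
params = {"beta0": rng.normal(size=R), "beta1": rng.normal(size=R),
          "beta": rng.normal(size=(R, p)), "gamma0": rng.normal(size=T),
          "gamma": rng.normal(size=(T, R)), "delta": rng.normal(size=T)}
print(q_network(0.3, rng.normal(size=p), params))
```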

3 Computing

3.1 Inhomogeneous Poisson process (IPP) approximation

The only restriction on \(q(\textit{z},{\varvec{X}})\) is that it is smooth, potentially allowing the model to be highly complex. This model specification flexibility is an appealing feature, but it can make the integral in the logistic transformation intractable. We can view our method in an inhomogeneous Poisson process (IPP) model framework to justify a discrete logistic transformation that is more computationally feasible. The conditional density in Eq. (2) has the form of an IPP model with domain the unit interval [0, 1] and log-intensity \(q(\textit{z},{\varvec{X}})\).

Fithian (2013) describes a discrete approximation of the IPP model, which we can apply to our context. Suppose we have a dataset with \(i=1,\ldots ,n\) observations, and let \(\textit{z}_i\) denote the transformed response value for observation \(i\). We can view the univariate random variable \(\textit{Z}\) conditioned on \({\varvec{X}}\) as a location on the unit interval, so the observed data can be considered realizations of a point process over the unit interval. Following the IPP approximation literature, we propose to approximate the likelihood contribution of observation \(i\) as

$$\begin{aligned} f(\textit{z}_i|{\varvec{X}})\approx \frac{e^{q(\textit{z}_i,{\varvec{X}})} }{ e^{q(\textit{z}_i,{\varvec{X}})}+\sum \nolimits _{k=1}^{K} e^{q(\textit{z}^*_{ik},{\varvec{X}})} } \end{aligned}$$
(7)

where the controls \(\textit{z}^*_{ik}\sim \text {Uniform}(0,1)\), \(k=1,\ldots ,K\), are drawn independently for each observation. Fithian (2013) argues that this Monte Carlo approximation to the denominator of Eq. (2) is accurate for sufficiently large \(K\) in terms of approximating continuous conditional densities. The main effects of \({\varvec{X}}\) are removed for this discrete logistic transformation just as they were for the continuous transformation in Eq. (2). The \(\textit{z}^*_{ik}\) could instead be selected on a fixed grid across the unit interval, but we expect this choice would require a larger \(K\) unless the data are evenly spread across the response space. This even spread is the motivation for our transformation of \(\textit{Y}\) by a CDF, as a well-chosen CDF can render the transformed data roughly uniform across the unit interval.
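A short sketch of the per-observation approximation in Eq. (7) follows; the function name and the generic q_fn argument (for example, q_polynomial or q_network from the earlier sketches) are assumptions for illustration.

```python
import numpy as np

def approx_density_contribution(z_i, x_i, q_fn, K=10, rng=None):
    """Monte Carlo approximation of Eq. (7) for a single observation.

    q_fn(z, x) is any smooth model for q; K uniform controls are drawn
    independently for this observation.
    """
    rng = np.random.default_rng() if rng is None else rng
    z_controls = rng.uniform(0.0, 1.0, size=K)              # z*_{i1}, ..., z*_{iK}
    numerator = np.exp(q_fn(z_i, x_i))
    denominator = numerator + sum(np.exp(q_fn(zs, x_i)) for zs in z_controls)
    return numerator / denominator
```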

Another view of Eq. (7) is that \(\textit{z}_i\) represents a sample from the location distribution of cases and the \(\textit{z}^*_{ik}\) represent \(K\) matched samples from the uniform control distribution (Jarner et al. 2002). As mentioned in Appendix A, even a small \(K\) provides valid information about the \(q\) function. Thus, we can consider either the IPP approximation, with \(K\) large enough to approximate the IPP integrated intensity, or the matched case–control approximation with \(K=1\). We expect that a larger \(K\) value will induce more accurate parameter estimation, but at an additional computational cost that may not always be feasible.

Standard optimization methods can be employed with this approximation by minimizing the negative log likelihood objective function. Let \(\varvec{\theta }\in {\mathbb {R}}^{m}\) represent the parameter vector for the chosen \(q\) model, which we can write as \(q(\textit{z},{\varvec{X}};\varvec{\theta })\). The penalized negative log likelihood for our model is

$$\begin{aligned} \ell (\varvec{\theta })=\sum \limits _{i=1}^{n}\bigg \{-q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })+\log \bigg [e^{q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}+\sum \limits _{k=1}^{K} e^{q(\textit{z}^{*}_{ik},{\varvec{X}}_i;\varvec{\theta })} \bigg ]\bigg \}+\omega ||\varvec{\theta }||^2 \end{aligned}$$
(8)

where \(\omega \ge 0\) is a ridge penalty included to avoid model overfitting. For \(K=1\), the method effectively reduces to logistic regression and the polynomial model can be evaluated using penalized logistic regression analysis techniques (Friedman et al. 2010). This technique arrives at a solution extremely quickly, making the polynomial method very accessible for large datasets. For deep learning methods, we perform stochastic gradient descent. Details for these implementation choices can be found in Appendix B.
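As a sketch of this optimization (under the assumption that a basis representation such as the polynomial_features helper from the Sect. 2.2 sketch is available, so that \(q=\varvec{\phi }(\textit{z},{\varvec{X}})^{T}\varvec{\theta }\)), the penalized negative log likelihood in Eq. (8) can be written directly and handed to a general-purpose optimizer; this is not the penalized logistic regression or stochastic gradient descent implementation described in Appendix B.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_nll(theta, z, X, z_controls, omega, basis_fn):
    """Eq. (8): penalized negative log likelihood under the K-control approximation.

    z: (n,) transformed responses; X: (n, p) covariates;
    z_controls: (n, K) uniform controls drawn once before optimization;
    basis_fn(z_i, x_i) returns the feature vector so that q = basis_fn(.) @ theta.
    """
    total = 0.0
    for i in range(len(z)):
        q_case = basis_fn(z[i], X[i]) @ theta
        q_ctrl = np.array([basis_fn(zs, X[i]) @ theta for zs in z_controls[i]])
        all_q = np.concatenate(([q_case], q_ctrl))
        m = all_q.max()                       # log-sum-exp for numerical stability
        total += -q_case + m + np.log(np.exp(all_q - m).sum())
    return total + omega * np.sum(theta ** 2)

# Illustrative usage, assuming z, X, and polynomial_features are available:
#   z_controls = np.random.default_rng(1).uniform(size=(len(z), 1))   # K = 1
#   theta0 = np.zeros(len(polynomial_features(0.5, X[0])))
#   fit = minimize(penalized_nll, theta0,
#                  args=(z, X, z_controls, 1e-4, polynomial_features),
#                  method="L-BFGS-B")
#   theta_hat = fit.x
```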

Let \(l=1,\ldots ,L\) index a set of transformed response values. We can predict the conditional distribution at these transformed response values given covariate vector \({\varvec{X}}\) and estimated parameter vector \(\varvec{\theta }={\hat{\varvec{\theta }}}\) as

$$\begin{aligned} f(\textit{z}_l|{\varvec{X}};{\hat{\varvec{\theta }}})\approx \frac{e^{q(\textit{z}_l,{\varvec{X}};{\hat{\varvec{\theta }}})} }{ \sum \nolimits _{l'=1}^{L} e^{q(\textit{z}_{l'},{\varvec{X}};{\hat{\varvec{\theta }}})} } . \end{aligned}$$
(9)

This can be transformed back to the original scale via Eq. (1). A key advantage of our method is its simultaneous estimation of the model parameters. This structure ensures that we implicitly share information across all of our quantile estimates. For a method like DDR with a multinomial logistic regression classification model, each bin has its own set of parameters to be evaluated (excluding one bin, which serves as a reference for the others). If a bin contains few or no observations, then that bin’s parameter estimates may be volatile and unreliable. A large number of cut points may be desired to approximate a continuous distribution estimate, which makes it more likely that some bins are empty or sparsely filled. Our method avoids this issue by estimating parameters for only a single model, implicitly ensuring that information is shared across all quantile estimates. For certain model choices, another benefit of this single set of model parameters is that our method becomes computationally quicker than DDR and even QRF.
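A sketch of this prediction step is below; it assumes the basis representation and fitted coefficient vector from the earlier sketches, and a Gaussian G (with OLS-style mean mu and scale sigma) for mapping the grid back to the original scale, all of which are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def predict_distribution(x_new, theta_hat, basis_fn, mu, sigma, L=100):
    """Eq. (9): relative conditional probabilities on a grid of L response values.

    The grid is placed at equally spaced points on the transformed scale and
    mapped back to the original scale through the inverse of the Gaussian G.
    """
    z_grid = np.linspace(0.005, 0.995, L)                  # avoid the exact endpoints
    q_vals = np.array([basis_fn(zl, x_new) @ theta_hat for zl in z_grid])
    probs = np.exp(q_vals - q_vals.max())
    probs /= probs.sum()                                   # normalization in Eq. (9)
    y_grid = norm.ppf(z_grid, loc=mu, scale=sigma)         # back to the original scale
    return y_grid, probs

# Illustrative usage, assuming the pieces from the earlier sketches:
#   y_grid, probs = predict_distribution(X_test[0], theta_hat, polynomial_features, mu, sigma)
```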

4 Simulation study

We conduct a simulation study to evaluate our method against the aforementioned DDR and QRF methods (Meinshausen 2006; Li et al. 2019). We compare these three machine learning-based methods in terms of effectiveness in predicting the conditional distribution of the target variable, explained below.

We simulate data from four distributions, first used by Li et al. (2019) and chosen for their complicated structures. Model 1 has a linear mean function, but also an error term that varies with the covariates. The other three models have a nonlinear mean function. Models 2 and 3 are mixture distributions, while Model 4 uses a skew-normal distribution for the errors. Formally, the models are specified as

  • Model 1: \( Y={\varvec{X}}^{T} \varvec{\beta }_{1} +\exp \left( {\varvec{X}}^{T} \varvec{\beta }_{2}\right) * \epsilon \),

    • \({\varvec{X}}\sim \text {MVN}(\varvec{0},\varvec{I_5})\),

    • \(\varvec{\beta }_{1} \sim N\left( {\mathbf {0}}, \varvec{I_{5}}\right) \), \(\varvec{\beta }_{2} \sim N\left( {\mathbf {0}}, 0.45 \varvec{I_{5}}\right) \), \(\epsilon \sim N(0,1)\).

  • Model 2: \(Y=\left[ 10 \sin \left( 2 \pi X_{1} X_{2}\right) +10 X_{4}+\epsilon _{1}\right] \pi _{1}+\left[ 20\left( X_{3}-0.5\right) ^{2}+5 X_{5}+\epsilon _{2}\right] \left( 1-\pi _{1}\right) \),

    • \(X_{1}, \ldots , X_{10} {\mathop {\sim }\limits ^{iid}} \text {Uniform}(0,1)\),

    • \(\pi _{1} \sim \text{ Bernoulli } (0.5)\), \(\epsilon _{1} \sim N(0,2.25)\), \(\epsilon _{2} \sim N(0,1)\).

  • Model 3: \(Y=\left[ \sin \left( X_{1}\right) +\epsilon _{1}\right] \pi _{1}+\left[ 2 \sin \left( 1.5 X_{1}+1\right) +\epsilon _{2}\right] \left( 1-\pi _{1}\right) \),

    • \(X_{1} \sim \text {Uniform}(0,10)\),

    • \(\pi _{1} \sim \text{ Bernoulli }(0.5)\), \(\epsilon _{1} \sim N(0,0.09)\), \(\epsilon _{2} \sim N(0,0.64)\).

  • Model 4: \(Y=10 \sin \left( 2 \pi X_{1} X_{2}\right) +20\left( X_{3}-0.5\right) ^{2}+10 X_{4}+5 X_{5}+\epsilon \),

    • \(X_{1}, \ldots , X_{10} {\mathop {\sim }\limits ^{iid}} \text {Uniform}(0,1)\),

    • \(\epsilon \sim \text{ SkewNormal }(0,1,-5)\).

For each data model, we simulate 100 datasets of size 200, 1000, or 4000 observations to explore the relative efficacy of our method across sample sizes. Each dataset is randomly divided into training and testing data using a 75%/25% split. The models are fit using the training data, and then the predicted distribution for each testing observation is computed. For all models, the covariate data were normalized.

To evaluate the accuracy of a distribution estimate, we first calculate the range of the training response data and further extend it by 10%. We then calculate 100 evenly-spaced cut points between the extended range boundaries. For each model, we calculate the empirical CDF value associated with every cut point to get the conditional distribution estimate for every observation. We use the divergence function associated with the continuous ranked probability score (CRPS) to evaluate method performance (Gneiting and Raftery 2007; Krüger et al. 2016). The CRPS divergence is defined as

$$\begin{aligned} d_{CRPS}=\frac{1}{N} \sum _{n=1}^{N} \int _{l}^{u}\left\{ {\hat{F}}\left( y | {\varvec{X}}_{n}\right) -F\left( y | {\varvec{X}}_{n}\right) \right\} ^{2} d y. \end{aligned}$$

This integral is approximated using 1000 evenly spaced grid points and the resulting approximation is normalized by the range of the data. For the simulation study, N denotes the number of testing set observations for the given data model.
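For concreteness, a sketch of this grid-based approximation of the CRPS divergence is given below, assuming the estimated and true conditional CDFs have already been evaluated on a common grid; the trapezoidal rule and the normalization by the grid range follow the description in the text.

```python
import numpy as np

def crps_divergence(F_hat, F_true, y_grid):
    """Grid approximation of the CRPS divergence between conditional CDFs.

    F_hat, F_true: (N, G) arrays of estimated and true CDF values for N test
    observations evaluated on the common grid y_grid of G points.
    """
    sq_diff = (F_hat - F_true) ** 2
    integrals = np.trapz(sq_diff, y_grid, axis=1)   # trapezoidal rule along the grid
    return integrals.mean() / (y_grid[-1] - y_grid[0])
```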

We apply a matched case–control (MCC) justified approximation with \(K=1\) randomly selected controls to both the polynomial and deep learning models. Additionally, we apply an inhomogeneous Poisson process (IPP) justified approximation with \(K=10\) randomly selected controls to the deep learning model in the simulation data models with 200 observations. For the polynomial MCC approximation method, the first-order interaction terms between covariates and squared covariate terms were included in the covariate pool for Models 1, 2, and 4. For Model 3, there was only one covariate variable, so no interaction terms were possible. The highest polynomial power used in the model was \(B=3\).

Both deep learning approximations used a model structure with one hidden layer, thirty nodes feeding into each of the hidden and output layers, and the exponential linear unit (ELU) activation function. For the polynomial and deep learning methods, we select the normal cumulative distribution function (CDF) \(\Phi \) to transform \(\textit{Y}\) as \(\textit{Z}=G(\textit{Y}|{\varvec{X}})=\Phi \bigg (\frac{\textit{Y}-{\varvec{X}}\varvec{\beta }}{\sigma }\bigg )\) and estimate the mean coefficients \(\varvec{\beta }\) and standard deviation \(\sigma \) using ordinary least squares (OLS) regression. This choice ensures that the base distribution prediction for each observation in the testing dataset is Gaussian and centered at the OLS conditional mean. A larger ridge penalty (which shrinks the parameters toward zero) pulls the predicted distribution toward this base distribution.
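A sketch of this transformation step follows, assuming an intercept is included in the OLS fit and that the residual standard deviation is used for \(\sigma \); function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_transform(y_train, X_train):
    # OLS fit (with intercept) for the conditional mean; residual SD for the scale.
    Xd = np.column_stack([np.ones(len(y_train)), X_train])
    beta, *_ = np.linalg.lstsq(Xd, y_train, rcond=None)
    sigma = np.std(y_train - Xd @ beta, ddof=Xd.shape[1])
    return beta, sigma

def gaussian_transform(y, X, beta, sigma):
    # Z = G(Y | X) = Phi((Y - X beta) / sigma), mapping Y onto (0, 1).
    Xd = np.column_stack([np.ones(len(y)), X])
    return norm.cdf((y - Xd @ beta) / sigma)
```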

The polynomial MCC approximation is evaluated using a penalized logistic regression method, while the deep learning approximations are evaluated using stochastic gradient descent. For more details on the implementation and evaluation of the models in these two methods, see Appendix B.

The classification models for the DDR method were constructed using the deep-conditional-distribution-regression Python package found at https://github.com/RLstat/deep-conditional-distribution-regression. The joint binary cross entropy loss objective function was selected due to its superior performance over the multinomial objective function in Li et al. (2019). Models were built with a single hidden layer and a 0% dropout rate. The ELU activation function was selected for the hidden layer, with a softmax activation function applied to the output layer.

The QRF method was fit with 500 trees using the quantregForest package in R. This package predicts the conditional response values associated with inputted quantiles, so 100 evenly spaced quantile levels from 0.00001 to 0.99999 were generated and the QRF models estimated the cut points associated with these quantiles.

Fig. 1

A boxplot of the distribution of CRPS divergences for each model and dataset size across 100 datasets for the QRF, DDR, Deep Learning \(K=1\), Polynomial \(K=1\), and Deep Learning \(K=10\) conditional distribution estimation methods. The y-axis scale is not synchronized across data models and dataset sizes

Figure 1 gives the simulation results. In general, both deep learning approximation methods performed well compared to DDR. The deep learning MCC approximation model outperformed DDR in terms of median CRPS divergence in 8 of the 12 data models. The polynomial MCC approximation model performed worse against DDR by comparison, only producing a lower median CRPS divergence in 4 of the 12 data models.

The deep learning IPP approximation method noticeably improved the CRPS divergence results compared to the deep learning MCC method in all four models with 200 observations. In Models 1, 3, and 4 with 200 observations, this approximation beat both QRF and DDR in terms of median CRPS divergence, suggesting that this deep learning approximation is more useful than the MCC approximation in situations with a small sample size. In a large dataset, each observation would likely have similar information as multiple other observations, so that the combined controls for these similar observations sufficiently represent the underlying model regardless of the choice of \(K\). For a smaller sample size, a larger choice of \(K\) is needed to ensure each observation is properly predicted as there may not be other observations with similar information. For more detailed analysis of the effect of increasing the number of controls for the deep learning method for 200 observations, see Appendix D.

Our deep learning MCC approximation method also outperformed QRF in terms of median CRPS divergence in 7 of the 12 data models, although the relative CRPS divergence ranges in Model 1 with 4000 observations suggest our method may not have produced better results in that data model. QRF fared better than the polynomial MCC approximation model in the majority of data models, although the polynomial MCC approximation produced lower median CRPS divergence values across dataset sizes in Model 3.

Our deep learning method performed relatively better in terms of CRPS divergence in Models 1 and 3 compared to Models 2 and 4. Model 1 had a normal distribution structure, which may have been advantageous for our method since we used the normal quantile function to transform our data. Model 3 was a mixture distribution like Model 2, but had only a single covariate compared to the 10 covariates in Models 2 and 4. For a high-dimensional dataset, each observation is less likely to contain information similar to other observations. As a result, either more data or a larger number of controls is needed to identify the underlying model structure. This is a reason why the deep learning MCC approximation performs relatively poorly compared to DDR in the higher-dimensional Models 2 and 4 with 200 observations, whereas the corresponding deep learning IPP approximation performs relatively well.

Table 1 gives the average computation times for the polynomial and deep learning approximation methods for Model 1. The deep learning IPP approximation computation times for 1000 and 4000 observations were calculated on only 5 datasets, whereas the computation times for the other data models were calculated on all 100 datasets. The deep learning MCC and IPP approximations were significantly more computationally burdensome than the polynomial MCC approximation. The deep learning MCC approximation average computation time was over an hour for 4000 observations. On average, the deep learning IPP approximation with 10 controls took roughly five to six times as long to evaluate as the deep learning MCC approximation. Figure 1 suggests the deep learning MCC approximation, and especially the deep learning IPP approximation, are preferable to the polynomial MCC approximation for conditional distribution estimation in many data models; however, they may not be as readily scalable to larger datasets. In contrast, the increase in computation time from 200 to 4000 observations for the polynomial MCC approximation was negligible. The polynomial MCC approximation is thus easily applicable to large datasets in settings where the deep learning approximations are computationally infeasible.

Table 1 Average computation times (in minutes) and associated standard errors for evaluating the data from Model 1 across all dataset sizes

5 Application to tropical cyclone intensity forecasting

We apply our method to calibrate short-term tropical cyclone wind intensity forecasts. A conditional distribution estimation approach to this problem could provide additional context on response distribution features to better inform policy decisions compared to a point estimate approach (Cloud et al. 2019). Our data come from the Hurricane Weather Research and Forecasting (HWRF) Model, developed and maintained by the U.S. Environmental Modeling Center (EMC) (Biswas et al. 2017). HWRF is a deterministic atmosphere-ocean model used for hurricane research and forecasting. The HWRF model output includes a forecasted maximum 10-m wind speed value, which is designated as the covariate of interest. The actual maximum 10-m wind speed value is the response variable. Covariate and response information are recorded in 6-h increments, up to four times a day, for each day a tropical cyclone is active. At each time point, forecasted covariate data and response data are given for up to 96 h into the future in 3-h increments.

The full dataset contains information from 65 tropical cyclones located around the Atlantic Seaboard between 2013 and 2017. For this application, we focus on lag 3 and lag 6 forecast predictions and subset the overall dataset of 45,639 observations to obtain two smaller datasets of 1383 observations each for only these lag times. Observations with missing response values were removed. The final lag 3 and lag 6 datasets each had 1267 observations.

The polynomial regression method was implemented using the MCC approximation with a single control, \(K=1\). The highest polynomial power used in the polynomial model was \(B=3\), and the quadratic covariate term was included in the covariate matrix. The deep learning method was implemented using an IPP approximation with \(K=20\). The deep learning model was built with a single hidden layer, 15 nodes feeding into each of the hidden and output layers, and an ELU activation function. The polynomial model was evaluated using penalized logistic regression and the deep learning model was evaluated using mini-batch stochastic gradient descent. For both methods, a variety of ridge penalties were considered. A ridge penalty of 0.000001 was selected for the deep learning method for both lags and for the polynomial method for lag 3, and a ridge penalty of 0.0005 was selected for the polynomial method for lag 6. For the deep learning method, a variety of initial learning rates were also considered, with the optimally tuned models using an initial learning rate of 1 for both lags. Further details on how these models were fit are given in Appendix B.

The QRF model was built using 500 trees and evaluated using the quantregForest R package. The DDR method was run using a deep learning classification model and evaluated in Python using the deep-conditional-distribution-regression package. The model had one hidden layer with 15 nodes and a 0% dropout rate to mimic the deep learning approximation model specifications. The joint binary cross entropy loss objective function was selected. The ELU activation function was applied to the hidden layer, with the softmax activation function used for the output layer. As in the simulation study, we select a normal CDF to transform \(\textit{Y}\) to \(\textit{Z}\) and estimate the CDF parameters using OLS regression.

The tropical cyclones were randomly assigned to one of five folds, and 5-fold cross validation was performed. For each fold, we calculate the CRPS of the testing set to evaluate method performance, as the CRPS divergence is unavailable without knowledge of the true distribution (Matheson and Winkler 1976; Hersbach 2000). CRPS is defined as

$$\begin{aligned} CRPS=\frac{1}{N} \sum _{n=1}^{N} \int _{l}^{u}\left\{ {\hat{F}}\left( y | {\varvec{X}}_{n}\right) -I\left( y \ge Y_{n}\right) \right\} ^{2} d y. \end{aligned}$$

As with the CRPS divergence evaluation, the integral is approximated using 1000 evenly spaced grid points and the resulting approximation is normalized by the range of the data. For this application, N refers to the number of observations in the given testing fold. A sensitivity analysis for the number of CRPS grid points used for each method can be found in Appendix E.
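Relative to the divergence used in the simulation study, the only change is that the unknown true CDF is replaced by the indicator step function at the observed response; a corresponding sketch, under the same grid-approximation assumptions as before, is:

```python
import numpy as np

def crps(F_hat, y_obs, y_grid):
    """Grid approximation of the average CRPS with indicator I(y >= Y_n).

    F_hat: (N, G) estimated CDF values on y_grid; y_obs: (N,) observed responses.
    """
    indicator = (y_grid[None, :] >= y_obs[:, None]).astype(float)
    sq_diff = (F_hat - indicator) ** 2
    integrals = np.trapz(sq_diff, y_grid, axis=1)
    return integrals.mean() / (y_grid[-1] - y_grid[0])
```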

Table 2 gives the average CRPS across folds and the accompanying standard error for each method. Our polynomial and deep learning approximation methods outperform QRF and DDR by these metrics. Additionally, the deep learning IPP approximation slightly outperforms the polynomial MCC approximation in terms of average CRPS. The lag 6 predictions result in a higher average CRPS for each method than the lag 3 predictions, due to the increased difficulty of forecasting further into the future.

Table 2 The 5-fold mean CRPS values for each lag time for the QRF, DDR, polynomial MCC approximation (\(K=1\)), and deep learning IPP approximation (\(K=20\)) conditional distribution estimation methods
Fig. 2

Deep learning IPP approximation (\(K=20\)) conditional maximum 10-m wind speed distribution predictions for lag 3 and lag 6, using a model constructed with all available data. Estimated conditional response probabilities at 100 equally spaced percentiles between 0.5 and 99.5 are displayed with linear interpolation between them. The density values at the 0.5th and 99.5th percentiles are rounded to 0. \(\Pr (\textit{Y}=\textit{y}|X)\) refers to the relative probability that the maximum 10-m wind speed \(\textit{Y}=\textit{y}\) occurs given the HWRF-forecasted maximum 10-m wind speed value \(X\). \(\textit{Y}|X\) refers to the conditional response value \(\textit{Y}\) given the covariate \(X\)

Fig. 3

Polynomial MCC approximation (\(K=1\)) conditional maximum 10-m wind speed distribution predictions for lag 3 and lag 6, using a model constructed with all available data. Estimated conditional response probabilities at 100 equally spaced percentiles between 0.5 and 99.5 are displayed with linear interpolation between them. The density values at the 0.5th and 99.5th percentiles are rounded to 0. \(\Pr (\textit{Y}=\textit{y}|X)\) refers to the relative probability that the maximum 10-m wind speed \(\textit{Y}=\textit{y}\) occurs given the HWRF-forecasted maximum 10-m wind speed value \(X\). \(\textit{Y}|X\) refers to the conditional response value \(\textit{Y}\) given the covariate \(X\)

As a comparison, the 5-fold mean CRPS for the conditional Gaussian distribution estimated via OLS was calculated. The OLS-predicted mean CRPS values were 0.0220 (standard error 0.0020) and 0.0269 (standard error 0.0026) for lag 3 and lag 6, respectively. The deep learning IPP approximation and polynomial MCC approximation both outperform these average CRPS values, although the improvement made by the polynomial MCC approximation is very slight. A difference between the polynomial MCC approximation and the deep learning IPP approximation is present; however, the single predictor and adequate sample size likely lessened the benefit of including additional controls, for reasons similar to those offered in Sect. 4.

Figure 2 displays the deep learning IPP approximation conditional response distribution predictions for lag 3 and lag 6 when using all of the training data. The lag 3 predicted quantiles for this model look generally unimodal and Gaussian, with some slight left skewness for smaller covariate values. The lag 6 predicted quantiles are also generally unimodal and Gaussian for the larger covariate values, but exhibit clear non-normality and skewness for the lower covariate values. Both models used for Fig. 2 were fit using the same ridge penalties and initial learning rates as the models used to calculate the CRPS values for each fold in Table 2. The predicted distributions for each individual fold that were used to calculate the mean CRPS results in Table 2 are not necessarily shaped like these plotted predicted distributions. For instance, the deep learning IPP approximation method for lag 6 in the third fold predicts a somewhat bimodal distribution of maximum 10-m wind speed for larger covariate values. For an example plot of the predicted distributions using both methods and lag times for an individual fold, see Appendix C. Overall, these somewhat Gaussian-shaped predicted distributions are consistent with the CRPS results, which suggest the OLS-predicted distribution method performs only slightly worse than the deep learning IPP approximation with \(K=20\) in this application.

Figure 3 displays the polynomial MCC approximation conditional response distribution predictions for lag 3 and lag 6 when using all of the training data. Again, the models used to predict these distribution quantiles maintained the same ridge penalties and initial learning rates as the corresponding models used to obtain the polynomial MCC approximation average CRPS in Table 2. The larger and smaller covariate values are associated with distributions that have sharper peaks, whereas the predicted distributions for the middle covariate values have broader, less symmetric peaks. The lag 3 and lag 6 predicted distributions look more similar here than in Fig. 2.

6 Discussion

We propose a flexible conditional distribution estimation method that can incorporate machine learning techniques such as deep learning models or polynomial regression. We examined the performance of some of these model types for different data distributions in a simulation study, finding that our method implemented with a deep learning model outperformed other conditional distribution estimation methods across multiple data models. In a real-world application of our method, we found that both the deep learning and polynomial model-based methods provided useful insight on tropical cyclone maximum wind speed forecasting compared to other methods, with the deep learning method performing best in terms of the 5-fold mean CRPS performance metric.

Further approximation and/or computational techniques for this method could fully unlock its utility for conditional distribution estimation. We introduced an IPP-based discrete approximation with \(K\) controls to make model evaluation feasible, but were limited to selecting a small \(K\) and a relatively basic deep learning model structure with one hidden layer and 30 nodes. We expect that our method could substantially improve its predictive accuracy if a more complex deep learning model structure were tractable. Integration of our method with TensorFlow or another deep learning framework could be helpful in this regard. Perhaps an approximation that reduces the number of observations could also be incorporated to improve methodological accuracy.

Another potential methodological improvement is in the selection of the control values for our case–control based approximation. Fithian and Hastie (2014) describe a local case–control sampling technique meant to address conditional imbalance in addition to the marginal imbalance addressed by standard case–control sampling. Perhaps this approach or another weighted control selection technique could be adapted to our framework to improve conditional distribution estimation with a smaller number of controls.

For complicated models with many parameters (from multiple covariates, layers, and/or nodes), the ridge penalty in Eq. (8) shrinks the parameters towards 0 so that they deviate less from one another. As a result, \(f(\textit{z}|{\varvec{X}})\) tends toward a uniform distribution, and \(h(\textit{y}|{\varvec{X}})\) consequently tends toward the distribution implied by the transformation CDF G. In both the simulation and the application, a Gaussian cumulative distribution function was selected to transform \(\textit{Y}\) to \(\textit{Z}\). A conditionally normal response distribution is often assumed in statistics, so this specification is reasonable for many applications. Still, a more sophisticated optimization algorithm might allow for larger deviations between parameter estimates and be less influenced by the choice of CDF.

Additionally, the CDF parameters were estimated via OLS, which requires more observations than covariates to obtain a unique parameter solution. This restriction might preclude the inclusion of higher-order interactions in the polynomial approximation model when the sample size is small, because they would result in more covariates than training observations. In this scenario, the CDF transformation should not be used; instead, boundaries should be chosen for \(\textit{Y}\) on its original scale and our method applied analogously.