1 Introduction

Short-term forecasting of environmental processes has many applications, including solar and wind power generation, ambient air pollution, and extreme weather events. In this paper, we combine numerical model output with statistical methods to forecast hurricane wind intensity. Rather than providing a single value as the point prediction, we model the entire uncertainty distribution of the response given the numerical model forecast. This conditional distribution regression provides a comprehensive assessment of uncertainty, including the forecast distribution’s spread, skewness, and tail probabilities.

Conditional distribution estimation can be applied to many ecological and environmental datasets where the response is distributed in a non-Gaussian manner. For example, when forecasting exposure to air or water pollution, we may be interested in both the average exposure and the probability of exposure exceeding a critical threshold known to have adverse health effects. Similarly, when forecasting precipitation, modeling the entire predictive distribution might be critical for quantifying the likelihood of a severe event, such as a flood, as computed by applying a rainfall-runoff model to samples from the precipitation forecast distribution. In this paper, we adopt this framework for our short-term tropical cyclone intensity forecasting problem.

To provide a flexible prediction model, we incorporate supervised machine learning methods, which have become a popular tool for statistical analysis in the last few decades. Methods such as random forest regression, neural networks, and linear regression can be employed using state-of-the-art statistical software to capture complicated relationships between covariates and target variables. Generally, machine learning predictive modeling has been developed for making point predictions such as the conditional mean or median, with accompanying prediction interval techniques providing uncertainty quantification. This differs from conditional density estimation, a technique that estimates the full distribution of the target variable given the covariates. In some applications, conditional density estimation is preferred. For instance, an estimate of the distribution of a tropical cyclone’s maximum wind speed conditional on the sea surface temperature can provide information not available from a conditional mean estimate. A certain sea surface temperature might result in a strongly positively skewed maximum wind speed distribution, giving a better idea of the worst-case scenario under these conditions.

Various approaches have been developed to estimate the distribution of the target variable conditional on the covariates. One technique is to estimate the joint distribution of the target variable and covariates as well as the joint distribution of the covariates and divide the former by the latter. Kernel density estimation of these two densities is a common approach, first proposed by Rosenblatt (1969). Hyndman et al. (1996) modify the standard kernel density estimator to obtain a smoother with better bias properties. Hall et al. (1999) propose to use an adjusted Nadaraya–Watson estimator for the kernel estimation. These methods become intractable as the covariate dimension increases. Proposed remedies include modifications that reduce the covariate space and density estimators designed for high-dimensional data (Hall et al. 2004; Hall and Yao 2005; Fan et al. 2009).

Bayesian nonparametric mixture modeling is another common conditional density estimation approach. Finite mixture models (FMMs) are a subset of mixture modeling techniques that consider the conditional target distribution to be a mixture of several parametric (often Gaussian) distributions (Escobar and West 1995; Gilardi et al. 2002; Song et al. 2004; Rojas et al. 2005; Fahey et al. 2007). Covariate effects can be introduced through the mixing proportions, the component densities, or both. Markov chain Monte Carlo (MCMC) methods are often used to fit these models (Peng et al. 1996; Wood et al. 2002; Geweke and Keane 2007). FMMs require specifications such as the number of components or the values of the mixing proportions, and these choices can affect their overall inference capabilities.

Infinite mixture models are another common Bayesian nonparametric mixture modeling approach. One class of infinite mixture model techniques attempts to directly estimate the conditional density via an infinite set of mixture weights and a mixing process prior that depends on the covariates. Dunson et al. (2007) develop a Bayesian density regression model using a local, covariate-weighted mixture of Dirichlet process (DP) priors. Trippa et al. (2011) and Jara and Hanson (2011) propose the use of a Polya tree (PT) prior and induce dependence through different definitions of the splitting probabilities. Tokdar et al. (2010) forego these priors and develop a model using logistic Gaussian processes and subspace projection. Still, Bayesian nonparametric density estimation can be computationally burdensome as data complexity increases, which has led to the proposal of variable selection techniques (Chung and Dunson 2009; Kundu and Dunson 2014). Infinite mixture models for estimating the joint distribution of the response and covariates have also been proposed (Müller et al. 1996; Shahbaba and Neal 2009; Park and Dunson 2010; Taddy and Kottas 2010; Hannah et al. 2011). A disadvantage of this class of techniques is that it does not directly estimate the conditional density, and it can become computationally slow as the dimension of the problem increases.

Machine learning algorithms are another useful and arguably more accessible class of conditional density estimation methods. One approach is to use an orthogonal series density estimator that adapts to the geometric features of the data and reduces the dimension of the problem, with later improvements incorporating regression and deep learning algorithms (Efromovich 2010; Izbicki and Lee 2016, 2017; Dalmasso et al. 2020). Meinshausen (2006) proposes the foundational quantile regression forest (QRF) method. By retaining all observations in each leaf, a random forest can be used to estimate the full conditional distribution as a weighted combination of the observed responses across trees. Multiple conditional density estimation methods using random forests to improve on QRF accuracy and/or speed have been developed (Tung et al. 2014; Hothorn and Zeileis 2017; Pospisil and Lee 2018). Recently, Li et al. (2019) proposed deep distribution regression (DDR) as a deep learning-based conditional distribution technique. DDR uses cutpoints to discretize the response space and applies a multi-class classification method (such as a neural network) to the resulting bins. Li et al. (2019) also give an approach that accounts for bin ordering by applying a binary classification model at each cutpoint and jointly estimating the conditional cumulative distribution function. Payne et al. (2020) also develop a partition-based method with flexible logistic Gaussian processes fit within each partition, using a Laplace approximation to overcome the analytical challenges of logistic Gaussian process evaluation.

Similar to DDR, we consider a conditional density estimation approach that incorporates machine learning algorithms for our short-term tropical cyclone intensity forecasting problem. A logistic transformation is applied to the model output to obtain an expression for the conditional density function. The flexibility of the model specification allows algorithms such as polynomial regression or deep learning models to be used. Unlike the partition-based methods, our method evaluates only a single set of model parameters and simultaneously estimates the full conditional distribution. This information-sharing allows our method to forecast well when minimal data are available, and the relatively limited number of parameters to be estimated ensures computational speed for the polynomial regression model choice. The gradient calculation can quickly become intractable for complex model choices, so we incorporate theory from ecological and epidemiological statistics. Fithian (2013) reviews models that can be used to evaluate presence-only survey data, including the inhomogeneous Poisson process (IPP) model. We adapt the IPP framework to our data setting to justify a discrete approximation of our forecasting method for computational purposes. We also justify a special case of this method in a matched case–control context to further increase computational efficiency (Jarner et al. 2002).

After a review of the method and some potential model choices, we discuss the computational considerations for its implementation. Following this, the methodological strengths and weaknesses of our method are explored with a simulation and a short-term forecasting application, with the takeaways and next steps summarized in a discussion section.

2 Methods

We are interested in approximating the conditional distribution of response variable \(\textit{Y}\in {\mathbb {R}}\) given the covariate information \({\varvec{X}}\in {\mathbb {R}}^{p}\), denoted \(h(\textit{y}|{\varvec{X}})\). Our method requires a lower and upper bound for the target variable, which we address through a transformation of the response variable onto the unit interval. Suppose we transform \(\textit{Y}\) through a cumulative distribution function G as \(\textit{Z}=G(\textit{Y}|{\varvec{X}})\in [0,1]\). Note that the transformation of \(\textit{Y}\) into \(\textit{Z}\) on the unit interval is not unique; we could instead specify lower and upper bounds for \(\textit{Y}\) on its original scale.

In this section, we outline our method for approximating the conditional distribution of the transformed response, \(f(\textit{z}|{\varvec{X}})\); however, the conditional density of the original response, \(h(\textit{y}|{\varvec{X}})\), can be recovered by applying the change-of-variables formula as

$$\begin{aligned} h(\textit{y}|{\varvec{X}})=f(G(\textit{y})|{\varvec{X}})\bigg | \frac{\partial }{\partial \textit{y}} G(\textit{y}) \bigg |. \end{aligned}$$
(1)

If \(f(\textit{z}|{\varvec{X}})\) is the uniform density, the resulting distribution \(h(\textit{y}|{\varvec{X}})\) is governed by G. In other words, G serves as the base predicted distribution family; if no transformation of \(\textit{Y}\) is made, the base distribution is uniform.

2.1 Logistic transformation

Let \(q(\textit{z},{\varvec{X}})\) be a smooth function over \(\textit{z}\) and \({\varvec{X}}\). The logistic transformation (e.g. Lenk (1988)) relates \(q(\textit{z},{\varvec{X}})\) to \(f(\textit{z}|{\varvec{X}})\) as

$$\begin{aligned} f(\textit{z}|{\varvec{X}})=\frac{e^{q(\textit{z},{\varvec{X}})} }{ \int _{0}^{1} e^{q(\textit{u},{\varvec{X}})}d\textit{u}}. \end{aligned}$$
(2)

Since \(q(\textit{z},{\varvec{X}})=A(\textit{z},{\varvec{X}})+B({\varvec{X}})\) gives the same density as \(q(\textit{z},{\varvec{X}})=A(\textit{z},{\varvec{X}})\), the main-effect terms for \({\varvec{X}}\) are removed. Because the form of \(q(\textit{z},{\varvec{X}})\) is arbitrarily flexible, any smooth conditional probability density function \(f(\textit{z}|{\varvec{X}})\) can be modeled with this transformation. In practice, the integral in the denominator may be intractable. Discrete approximation techniques are discussed in Sect. 3 after introducing potential model choices.
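To make the transformation concrete, the sketch below evaluates Eq. (2) by numerical integration over a fine grid and recovers the original-scale density via Eq. (1). It is an illustration only, not the fitting procedure: the quadratic form of q_example and the Gaussian choice of G (with fixed mean and scale) are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm

def q_example(z, x):
    # Hypothetical smooth q(z, x) used only for illustration; main-effect terms
    # in x alone are omitted because they cancel in Eq. (2).
    return 3.0 * (z - 0.5) * x - 4.0 * (z - 0.5) ** 2

def f_z_given_x(z, x, n_grid=1000):
    # Eq. (2): logistic transformation, with the integral over [0, 1]
    # approximated by the trapezoidal rule on a fine grid.
    u = np.linspace(0.0, 1.0, n_grid)
    denom = np.trapz(np.exp(q_example(u, x)), u)
    return np.exp(q_example(z, x)) / denom

def h_y_given_x(y, x, mu=0.0, sigma=1.0):
    # Eq. (1): change of variables with G taken to be a Normal(mu, sigma) CDF,
    # so the base predicted distribution family is Gaussian.
    z = norm.cdf(y, loc=mu, scale=sigma)
    return f_z_given_x(z, x) * norm.pdf(y, loc=mu, scale=sigma)

# Density of Y at a few points, given a covariate value of x = 0.8.
print(h_y_given_x(np.array([-1.0, 0.0, 1.0]), x=0.8))
```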

A smooth underlying \(q\) function allows for the simultaneous estimation of a single set of model parameters. A similar logistic transformation on an underlying model was used in Tokdar and Kadane (2012) to develop a simultaneous quantile regression estimation method. The information-sharing inherent in this approach enabled the estimation of multiple quantiles concurrently, improving on previous quantile regression estimation methods.

Another advantage of this method is its flexibility. The only requirement on the \(q\) function is smoothness, which allows for many nonparametric model possibilities. We consider two such models in this paper, both drawing on machine learning ideas: a polynomial regression model and a deep learning model. However, our method can easily be applied to other smooth model choices, such as an additive model with splines.

2.2 Polynomial regression model

The Weierstrass Approximation Theorem states that for any continuous real-valued function on a closed interval, there exists a polynomial function that can approximate it arbitrarily well (Weierstrass 1885). The polynomial function is therefore a logical candidate for the smooth function in our method. Let \(B\) be an integer representing the largest polynomial power applied to the centered \(\textit{Z}\) values, with \(b=1,\ldots ,B\) indexing the polynomial power. Recall that \(j=1,\ldots ,p\) indexes the covariates. Also, let \(o=1,\ldots ,O\) index the polynomial degree associated with the covariate terms. We set \(O=2\) and give the second-order model as

$$\begin{aligned} q(\textit{z},{\varvec{X}})=\sum \limits _{b=1}^{B}\bigg [(\textit{z}-0.5)^b\xi _{b0}+\sum \limits _{j=1}^{p}\sum \limits _{o=1}^{2}(\textit{z}-0.5)^bX_{j}^{o}\xi _{bjo} + \sum \limits _{j\ne k}(\textit{z}-0.5)^bX_{j}X_{k}\upsilon _{bl}\bigg ] \end{aligned}$$
(3)

where \(\xi _{b0}\) represent the intercept coefficients, \(\xi _{bjo}\) represent the covariate coefficients, and \(\upsilon _{bl}\) represent the coefficients for the \(l\)th covariate interaction term. A higher-order model follows this structure in the obvious way. The \(\textit{z}\) terms are centered by subtracting 0.5 to reduce collinearity, and the main effects of \({\varvec{X}}\) are omitted because they do not affect the conditional distribution.
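As an illustrative sketch (not the authors' code), the helper below assembles the basis vector multiplying the coefficients in Eq. (3), so that \(q(\textit{z},{\varvec{X}})\) is a single dot product with one parameter vector; the function names and feature ordering are assumptions made for the example.

```python
import numpy as np
from itertools import combinations

def polynomial_features(z, x, B=3):
    """Basis vector for the second-order polynomial model in Eq. (3).

    Main-effect terms in x alone are excluded because they cancel in Eq. (2).
    """
    zc = z - 0.5                                      # centered response value
    feats = []
    for b in range(1, B + 1):
        zb = zc ** b
        feats.append(zb)                              # intercept term (z - 0.5)^b
        for j in range(len(x)):
            feats.append(zb * x[j])                   # linear covariate terms
            feats.append(zb * x[j] ** 2)              # quadratic covariate terms
        for j, k in combinations(range(len(x)), 2):
            feats.append(zb * x[j] * x[k])            # pairwise interaction terms
    return np.array(feats)

def q_polynomial(z, x, theta, B=3):
    # q(z, X) from Eq. (3), parameterized by a single coefficient vector theta.
    return polynomial_features(z, x, B) @ theta
```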

2.3 Deep learning model

A deep learning model is another natural choice for the underlying smooth function. The universal approximation theorem states that a feed-forward artificial neural network with at least one hidden layer can approximate a continuous function on a compact space arbitrarily well (Hornik et al. 1989). We propose a deep learning model with an input layer, at least one hidden layer, and an output layer. One hidden layer is given here for notational simplicity, but additional layers could be added if desired. Let \(\delta , \gamma \), and \(\beta \) represent the output layer, hidden layer, and input layer parameters, respectively. Let \(\textit{H}\) and \(\textit{I}\) represent the output and hidden layer nodes, respectively. Lastly, let \(r=1,\ldots ,R\) and \(t=1,\ldots ,T\) index the neurons in the hidden and output layers, respectively. The model is

$$\begin{aligned} q(\textit{z},{\varvec{X}})=\;&\sum \limits _{t=1}^{T} \delta _{t}f_A(\textit{H}_{t}), \end{aligned}$$
(4)
$$\begin{aligned} \textit{H}_{t}=\;&\gamma _{0t}+\sum \limits _{r=1}^{R} \gamma _{tr}f_A(\textit{I}_{r}), \end{aligned}$$
(5)
$$\begin{aligned} \textit{I}_{r}=\;&\beta _{0r}+\beta _{1r}(\textit{z}-0.5)+\sum \limits _{j=2}^{p+1}\beta _{jr}X_{j}, \end{aligned}$$
(6)

where \(f_A\) is an activation function. The exponential linear unit (ELU) and the rectified linear unit (ReLU) are two possible choices of activation function.
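A minimal forward-pass sketch of Eqs. (4)–(6) is given below, assuming parameter shapes and names (beta, gamma, delta, etc.) chosen for illustration; it is not the fitted model used later in the paper.

```python
import numpy as np

def elu(v, alpha=1.0):
    # Exponential linear unit activation function.
    return np.where(v > 0, v, alpha * (np.exp(v) - 1.0))

def q_network(z, x, params, f_act=elu):
    """Forward pass for q(z, X) in Eqs. (4)-(6).

    Assumed parameter shapes: beta0 (R,), beta1 (R,), beta (R, p);
    gamma0 (T,), gamma (T, R); delta (T,).
    """
    # Eq. (6): nodes I_r built from (z - 0.5) and the covariates.
    I_nodes = params["beta0"] + params["beta1"] * (z - 0.5) + params["beta"] @ x
    # Eq. (5): nodes H_t built from the activated I_r.
    H_nodes = params["gamma0"] + params["gamma"] @ f_act(I_nodes)
    # Eq. (4): scalar output q(z, X).
    return params["delta"] @ f_act(H_nodes)

# Example with R = T = 30 nodes and p = 5 covariates, randomly initialized.
rng = np.random.default_rng(0)
R, T, p = 30, 30, 5
params = {"beta0": rng.normal(size=R), "beta1": rng.normal(size=R),
          "beta": rng.normal(size=(R, p)), "gamma0": rng.normal(size=T),
          "gamma": rng.normal(size=(T, R)), "delta": rng.normal(size=T)}
print(q_network(0.3, rng.normal(size=p), params))
```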

3 Computing

3.1 Inhomogeneous Poisson process (IPP) approximation

The only restriction on \(q(\textit{z},{\varvec{X}})\) is that it is smooth, potentially allowing the model to be highly complex. This model specification flexibility is an appealing feature, but it can make the integral in the logistic transformation intractable. We can view our method in an inhomogeneous Poisson process (IPP) model framework to justify a discrete logistic transformation that is more computationally feasible. The conditional density in Eq. (2) has the form of an IPP model with domain the unit interval [0, 1] and log-intensity \(q(\textit{z},{\varvec{X}})\).

Fithian (2013) describes a discrete approximation of the IPP model, which we can apply to our context. Suppose we have a dataset with \(i=1,\ldots ,n\) observations, and let \(\textit{z}_i\) denote the transformed response value for observation \(i\). We can view the univariate random variable \(\textit{Z}\) conditioned on \({\varvec{X}}\) as a location on the unit interval, so the observed data can be considered realizations of a point process over the unit interval. Following the IPP approximation literature, we propose to approximate the likelihood contribution of observation \(i\) as

$$\begin{aligned} f(\textit{z}_i|{\varvec{X}})\approx \frac{e^{q(\textit{z}_i,{\varvec{X}})} }{ e^{q(\textit{z}_i,{\varvec{X}})}+\sum \nolimits _{k=1}^{K} e^{q(\textit{z}^*_{ik},{\varvec{X}})} } \end{aligned}$$
(7)

where the controls \(\textit{z}^*_{ik}\sim \text {Uniform}(0,1)\), \(k=1,\ldots ,K\), are drawn independently for each observation. Fithian (2013) argues that this Monte Carlo approximation to the denominator of Eq. (2) is accurate for sufficiently large \(K\) in terms of approximating continuous conditional densities. The main effects of \({\varvec{X}}\) are removed for this discrete logistic transformation just as they were for the continuous transformation in Eq. (2). The \(\textit{z}^*_{ik}\) could instead be selected on a fixed grid across the unit interval, but we expect this choice would require a larger \(K\) unless the data are evenly spread across the response space. This even spread is the motivation for our transformation of \(\textit{Y}\) by a CDF, as a well-chosen CDF can render the transformed data roughly uniform across the unit interval.
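A short sketch of the per-observation approximation in Eq. (7) follows; the function name and the generic q_fn argument (for example, q_polynomial or q_network from the earlier sketches) are assumptions for illustration.

```python
import numpy as np

def approx_density_contribution(z_i, x_i, q_fn, K=10, rng=None):
    """Monte Carlo approximation of Eq. (7) for a single observation.

    q_fn(z, x) is any smooth model for q; K uniform controls are drawn
    independently for this observation.
    """
    rng = np.random.default_rng() if rng is None else rng
    z_controls = rng.uniform(0.0, 1.0, size=K)              # z*_{i1}, ..., z*_{iK}
    numerator = np.exp(q_fn(z_i, x_i))
    denominator = numerator + sum(np.exp(q_fn(zs, x_i)) for zs in z_controls)
    return numerator / denominator
```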

Another view of Eq. (7) is that \(\textit{z}_i\) represents a sample from the location distribution of cases and the \(\textit{z}^*_{ik}\) represent \(K\) matched samples from the uniform control distribution (Jarner et al. 2002). As mentioned in Appendix A, even a small \(K\) provides valid information about the \(q\) function. Thus, we can consider either the IPP approximation, with \(K\) large enough to approximate the IPP integrated intensity, or the matched case–control approximation with \(K=1\). We expect that a larger \(K\) value will induce more accurate parameter estimation, but at an additional computational cost that may not always be feasible.

Standard optimization methods can be employed with this approximation by minimizing the negative log likelihood objective function. Let \(\varvec{\theta }\in {\mathbb {R}}^{m}\) represent the parameter vector for the chosen \(q\) model, which we can write as \(q(\textit{z},{\varvec{X}};\varvec{\theta })\). The penalized negative log likelihood for our model is

$$\begin{aligned} \ell (\varvec{\theta })=\sum \limits _{i=1}^{n}\bigg \{-q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })+\log \bigg [e^{q(\textit{z}_i,{\varvec{X}}_i;\varvec{\theta })}+\sum \limits _{k=1}^{K} e^{q(\textit{z}^{*}_{ik},{\varvec{X}}_i;\varvec{\theta })} \bigg ]\bigg \}+\omega ||\varvec{\theta }||^2 \end{aligned}$$
(8)

where \(\omega \ge 0\) is a ridge penalty included to avoid model overfitting. For \(K=1\), the method effectively reduces to logistic regression and the polynomial model can be evaluated using penalized logistic regression analysis techniques (Friedman et al. 2010). This technique arrives at a solution extremely quickly, making the polynomial method very accessible for large datasets. For deep learning methods, we perform stochastic gradient descent. Details for these implementation choices can be found in Appendix B.
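As a sketch of this optimization (under the assumption that a basis representation such as the polynomial_features helper from the Sect. 2.2 sketch is available, so that \(q=\varvec{\phi }(\textit{z},{\varvec{X}})^{T}\varvec{\theta }\)), the penalized negative log likelihood in Eq. (8) can be written directly and handed to a general-purpose optimizer; this is not the penalized logistic regression or stochastic gradient descent implementation described in Appendix B.

```python
import numpy as np
from scipy.optimize import minimize

def penalized_nll(theta, z, X, z_controls, omega, basis_fn):
    """Eq. (8): penalized negative log likelihood under the K-control approximation.

    z: (n,) transformed responses; X: (n, p) covariates;
    z_controls: (n, K) uniform controls drawn once before optimization;
    basis_fn(z_i, x_i) returns the feature vector so that q = basis_fn(.) @ theta.
    """
    total = 0.0
    for i in range(len(z)):
        q_case = basis_fn(z[i], X[i]) @ theta
        q_ctrl = np.array([basis_fn(zs, X[i]) @ theta for zs in z_controls[i]])
        all_q = np.concatenate(([q_case], q_ctrl))
        m = all_q.max()                       # log-sum-exp for numerical stability
        total += -q_case + m + np.log(np.exp(all_q - m).sum())
    return total + omega * np.sum(theta ** 2)

# Illustrative usage, assuming z, X, and polynomial_features are available:
#   z_controls = np.random.default_rng(1).uniform(size=(len(z), 1))   # K = 1
#   theta0 = np.zeros(len(polynomial_features(0.5, X[0])))
#   fit = minimize(penalized_nll, theta0,
#                  args=(z, X, z_controls, 1e-4, polynomial_features),
#                  method="L-BFGS-B")
#   theta_hat = fit.x
```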

Let \(l=1,\ldots ,L\) index a set of transformed response values. We can predict the conditional distribution at these transformed response values given covariate vector \({\varvec{X}}\) and estimated parameter vector \(\varvec{\theta }={\hat{\varvec{\theta }}}\) as

$$\begin{aligned} f(\textit{z}_l|{\varvec{X}};{\hat{\varvec{\theta }}})\approx \frac{e^{q(\textit{z}_l,{\varvec{X}};{\hat{\varvec{\theta }}})} }{ \sum \nolimits _{l'=1}^{L} e^{q(\textit{z}_{l'},{\varvec{X}};{\hat{\varvec{\theta }}})} } . \end{aligned}$$
(9)

This can be transformed back to the original scale via Eq. (1). A key advantage of our method is its simultaneous estimation of the model parameters. This structure ensures that we implicitly share information across all of our quantile estimates. For a method like DDR with a multinomial logistic regression classification model, each bin has its own set of parameters to be evaluated (excluding one bin, which serves as a reference for the others). If a bin contains few or no observations, then that bin’s parameter estimates may be volatile and unreliable. A large number of cut points may be desired to approximate a continuous distribution estimate, which makes it more likely that some bins are empty or sparsely filled. Our method avoids this issue by estimating parameters for only a single model, implicitly ensuring that information is shared across all quantile estimates. For certain model choices, another benefit of this single set of model parameters is that our method becomes computationally quicker than DDR and even QRF.
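A sketch of this prediction step is below; it assumes the basis representation and fitted coefficient vector from the earlier sketches, and a Gaussian G (with OLS-style mean mu and scale sigma) for mapping the grid back to the original scale, all of which are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def predict_distribution(x_new, theta_hat, basis_fn, mu, sigma, L=100):
    """Eq. (9): relative conditional probabilities on a grid of L response values.

    The grid is placed at equally spaced points on the transformed scale and
    mapped back to the original scale through the inverse of the Gaussian G.
    """
    z_grid = np.linspace(0.005, 0.995, L)                  # avoid the exact endpoints
    q_vals = np.array([basis_fn(zl, x_new) @ theta_hat for zl in z_grid])
    probs = np.exp(q_vals - q_vals.max())
    probs /= probs.sum()                                   # normalization in Eq. (9)
    y_grid = norm.ppf(z_grid, loc=mu, scale=sigma)         # back to the original scale
    return y_grid, probs

# Illustrative usage, assuming the pieces from the earlier sketches:
#   y_grid, probs = predict_distribution(X_test[0], theta_hat, polynomial_features, mu, sigma)
```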

4 Simulation study

We conduct a simulation study to evaluate our method against the aforementioned DDR and QRF methods (Meinshausen 2006; Li et al. 2019). We compare these three machine learning-based methods in terms of effectiveness in predicting the conditional distribution of the target variable, explained below.

We simulate data from four distributions, first used by Li et al. (2019) and chosen for their complicated structures. Model 1 has a linear mean function, but also an error term that varies with the covariates. The other three models have a nonlinear mean function. Models 2 and 3 are mixture distributions, while Model 4 uses a skew-normal distribution for the errors. Formally, the models are specified as

  • Model 1: \( Y={\varvec{X}}^{T} \varvec{\beta }_{1} +\exp \left( {\varvec{X}}^{T} \varvec{\beta }_{2}\right) * \epsilon \),

    • \({\varvec{X}}\sim \text {MVN}(\varvec{0},\varvec{I_5})\),

    • \(\varvec{\beta }_{1} \sim N\left( {\mathbf {0}}, \varvec{I_{5}}\right) \), \(\varvec{\beta }_{2} \sim N\left( {\mathbf {0}}, 0.45 \varvec{I_{5}}\right) \), \(\epsilon \sim N(0,1)\).

  • Model 2: \(Y=\left[ 10 \sin \left( 2 \pi X_{1} X_{2}\right) +10 X_{4}+\epsilon _{1}\right] \pi _{1}+\left[ 20\left( X_{3}-0.5\right) ^{2}+5 X_{5}+\epsilon _{2}\right] \left( 1-\pi _{1}\right) \),

    • \(X_{1}, \ldots , X_{10} {\mathop {\sim }\limits ^{iid}} \text {Uniform}(0,1)\),

    • \(\pi _{1} \sim \text{ Bernoulli } (0.5)\), \(\epsilon _{1} \sim N(0,2.25)\), \(\epsilon _{2} \sim N(0,1)\).

  • Model 3: \(Y=\left[ \sin \left( X_{1}\right) +\epsilon _{1}\right] \pi _{1}+\left[ 2 \sin \left( 1.5 X_{1}+1\right) +\epsilon _{2}\right] \left( 1-\pi _{1}\right) \),

    • \(X_{1} \sim \text {Uniform}(0,10)\),

    • \(\pi _{1} \sim \text{ Bernoulli }(0.5)\), \(\epsilon _{1} \sim N(0,0.09)\), \(\epsilon _{2} \sim N(0,0.64)\).

  • Model 4: \(Y=10 \sin \left( 2 \pi X_{1} X_{2}\right) +20\left( X_{3}-0.5\right) ^{2}+10 X_{4}+5 X_{5}+\epsilon \),

    • \(X_{1}, \ldots , X_{10} {\mathop {\sim }\limits ^{iid}} \text {Uniform}(0,1)\),

    • \(\epsilon \sim \text{ SkewNormal }(0,1,-5)\).

For each data model, we simulate 100 datasets of size 200, 1000, or 4000 observations to explore the relative efficacy of our method across sample sizes. Each dataset is randomly divided into training and testing data using a 75%/25% split. The models are fit using the training data, and then the predicted distribution for each testing observation is computed. For all models, the covariate data were normalized.

To evaluate the accuracy of a distribution estimate, we first calculate the range of the training response data and further extend it by 10%. We then calculate 100 evenly-spaced cut points between the extended range boundaries. For each model, we calculate the empirical CDF value associated with every cut point to get the conditional distribution estimate for every observation. We use the divergence function associated with the continuous ranked probability score (CRPS) to evaluate method performance (Gneiting and Raftery 2007; Krüger et al. 2016). The CRPS divergence is defined as

$$\begin{aligned} d_{CRPS}=\frac{1}{N} \sum _{n=1}^{N} \int _{l}^{u}\left\{ {\hat{F}}\left( y | {\varvec{X}}_{n}\right) -F\left( y | {\varvec{X}}_{n}\right) \right\} ^{2} d y. \end{aligned}$$

This integral is approximated using 1000 evenly spaced grid points and the resulting approximation is normalized by the range of the data. For the simulation study, N denotes the number of testing set observations for the given data model.
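For concreteness, a sketch of this grid-based approximation of the CRPS divergence is given below, assuming the estimated and true conditional CDFs have already been evaluated on a common grid; the trapezoidal rule and the normalization by the grid range follow the description in the text.

```python
import numpy as np

def crps_divergence(F_hat, F_true, y_grid):
    """Grid approximation of the CRPS divergence between conditional CDFs.

    F_hat, F_true: (N, G) arrays of estimated and true CDF values for N test
    observations evaluated on the common grid y_grid of G points.
    """
    sq_diff = (F_hat - F_true) ** 2
    integrals = np.trapz(sq_diff, y_grid, axis=1)   # trapezoidal rule along the grid
    return integrals.mean() / (y_grid[-1] - y_grid[0])
```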

We apply a matched case–control (MCC) justified approximation with \(K=1\) randomly selected controls to both the polynomial and deep learning models. Additionally, we apply an inhomogeneous Poisson process (IPP) justified approximation with \(K=10\) randomly selected controls to the deep learning model in the simulation data models with 200 observations. For the polynomial MCC approximation method, the first-order interaction terms between covariates and squared covariate terms were included in the covariate pool for Models 1, 2, and 4. For Model 3, there was only one covariate variable, so no interaction terms were possible. The highest polynomial power used in the model was \(B=3\).

Both deep learning approximations used a model structure with one hidden layer, thirty nodes feeding into each of the hidden and output layers, and the exponential linear unit (ELU) activation function. For the polynomial and deep learning methods, we select the normal cumulative distribution function (CDF) \(\Phi \) to transform \(\textit{Y}\) as \(\textit{Z}=G(\textit{Y}|{\varvec{X}})=\Phi \bigg (\frac{\textit{Y}-{\varvec{X}}\varvec{\beta }}{\sigma }\bigg )\) and estimate the mean coefficients \(\varvec{\beta }\) and standard deviation \(\sigma \) using ordinary least squares (OLS) regression. This choice ensures that the base distribution prediction for each observation in the testing dataset is Gaussian and centered at the OLS conditional mean. A larger ridge penalty (which shrinks the parameters toward zero) pulls the predicted distribution toward this base distribution.
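A sketch of this transformation step follows, assuming an intercept is included in the OLS fit and that the residual standard deviation is used for \(\sigma \); function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian_transform(y_train, X_train):
    # OLS fit (with intercept) for the conditional mean; residual SD for the scale.
    Xd = np.column_stack([np.ones(len(y_train)), X_train])
    beta, *_ = np.linalg.lstsq(Xd, y_train, rcond=None)
    sigma = np.std(y_train - Xd @ beta, ddof=Xd.shape[1])
    return beta, sigma

def gaussian_transform(y, X, beta, sigma):
    # Z = G(Y | X) = Phi((Y - X beta) / sigma), mapping Y onto (0, 1).
    Xd = np.column_stack([np.ones(len(y)), X])
    return norm.cdf((y - Xd @ beta) / sigma)
```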

The polynomial MCC approximation is evaluated using a penalized logistic regression method, while the deep learning approximations are evaluated using stochastic gradient descent. For more details on the implementation and evaluation of the models in these two methods, see Appendix B.

The classification models for the DDR method were constructed using the deep-conditional-distribution-regression Python package found at https://github.com/RLstat/deep-conditional-distribution-regression. The joint binary cross entropy loss objective function was selected due to its superior performance over the multinomial objective function in Li et al. (2019). Models were built with a single hidden layer and a 0% dropout rate. The ELU activation function was selected for the hidden layer, with a softmax activation function applied to the output layer.

The QRF method was fit with 500 trees using the quantregForest package in R. This package predicts the conditional response values associated with inputted quantiles, so 100 evenly spaced quantile levels from 0.00001 to 0.99999 were generated and the QRF models estimated the cut points associated with these quantiles.

Fig. 1

A boxplot of the distribution of CRPS divergences for each model and dataset size across 100 datasets for the QRF, DDR, Deep Learning \(K=1\), Polynomial \(K=1\), and Deep Learning \(K=10\) conditional distribution estimation methods. The y-axis scale is not synchronized across data models and dataset sizes

Figure 1 gives the simulation results. In general, both deep learning approximation methods performed well compared to DDR. The deep learning MCC approximation model outperformed DDR in terms of median CRPS divergence in 8 of the 12 data models. The polynomial MCC approximation model performed worse against DDR by comparison, only producing a lower median CRPS divergence in 4 of the 12 data models.

The deep learning IPP approximation method noticeably improved the CRPS divergence results compared to the deep learning MCC method in all four models with 200 observations. In Models 1, 3, and 4 with 200 observations, this approximation beat both QRF and DDR in terms of median CRPS divergence, suggesting that this deep learning approximation is more useful than the MCC approximation in situations with a small sample size. In a large dataset, each observation would likely have similar information as multiple other observations, so that the combined controls for these similar observations sufficiently represent the underlying model regardless of the choice of \(K\). For a smaller sample size, a larger choice of \(K\) is needed to ensure each observation is properly predicted as there may not be other observations with similar information. For more detailed analysis of the effect of increasing the number of controls for the deep learning method for 200 observations, see Appendix D.

Our deep learning MCC approximation method also outperformed QRF in terms of median CRPS divergence in 7 of the 12 data models, although the relative CRPS divergence ranges in Model 1 with 4000 observations suggest our method may not have produced better results in that data model. QRF fared better than the polynomial MCC approximation model in the majority of data models, although the polynomial MCC approximation produced lower median CRPS divergence values across dataset sizes in Model 3.

Our deep learning method performed relatively better in terms of CRPS divergence in Models 1 and 3 compared to Models 2 and 4. Model 1 had a normal distribution structure, which may have been advantageous for our method since we used the normal quantile function to transform our data. Model 3 was a mixture distribution like Model 2, but had only a single covariate compared to the 10 covariates in Models 2 and 4. For a high-dimensional dataset, each observation is less likely to contain information similar to other observations. As a result, either more data or a larger number of controls is needed to identify the underlying model structure. This is a reason why the deep learning MCC approximation performs relatively poorly compared to DDR in the higher-dimensional Models 2 and 4 with 200 observations, whereas the corresponding deep learning IPP approximation performs relatively well.

Table 1 gives the average computation times for the polynomial and deep learning approximation methods for Model 1. The deep learning IPP approximation computation times for 1000 and 4000 observations were calculated on only 5 datasets, whereas the computation times for the other data models were calculated on all 100 datasets. The deep learning MCC and IPP approximations were significantly more computationally burdensome than the polynomial MCC approximation. The deep learning MCC approximation average computation time was over an hour for 4000 observations. On average, the deep learning IPP approximation with 10 controls took roughly five to six times as long to evaluate as the deep learning MCC approximation. Figure 1 suggests the deep learning MCC approximation, and especially the deep learning IPP approximation, are preferable to the polynomial MCC approximation for conditional distribution estimation in many data models; however, they may not be as readily scalable to larger datasets. In contrast, the increase in computation time from 200 to 4000 observations for the polynomial MCC approximation was negligible. The polynomial MCC approximation is thus easily applicable to large datasets in settings where the deep learning approximations are computationally infeasible.

Table 1 Average computation times (in minutes) and associated standard errors for evaluating the data from Model 1 across all dataset sizes

5 Application to tropical cyclone intensity forecasting

We apply our method to calibrate short-term tropical cyclone wind intensity forecasts. A conditional distribution estimation approach to this problem could provide additional context on response distribution features to better inform policy decisions compared to a point estimate approach (Cloud et al. 2019). Our data come from the Hurricane Weather Research and Forecasting (HWRF) Model, developed and maintained by the U.S. Environmental Modeling Center (EMC) (Biswas et al. 2017). HWRF is a deterministic atmosphere-ocean model used for hurricane research and forecasting. The HWRF model output includes a forecasted maximum 10-m wind speed value, which is designated as the covariate of interest. The actual maximum 10-m wind speed value is the response variable. Covariate and response information are recorded in 6-h increments, up to four times a day, for each day a tropical cyclone is active. At each time point, forecasted covariate data and response data are given for up to 96 h into the future in 3-h increments.

The full dataset contains information from 65 tropical cyclones located around the Atlantic Seaboard between 2013 and 2017. For this application, we focus on lag 3 and lag 6 forecast predictions and subset the overall dataset of 45,639 observations to obtain two smaller datasets of 1383 observations each for only these lag times. Observations with missing response values were removed. The final lag 3 and lag 6 datasets each had 1267 observations.

The polynomial regression method was implemented using the MCC approximation with a single control, \(K=1\). The highest polynomial power used in the polynomial model was \(B=3\), and the quadratic covariate term was included in the covariate matrix. The deep learning method was implemented using an IPP approximation with \(K=20\). The deep learning model was built with a single hidden layer, 15 nodes feeding into each of the hidden and output layers, and an ELU activation function. The polynomial model was evaluated using penalized logistic regression and the deep learning model was evaluated using mini-batch stochastic gradient descent. For both methods, a variety of ridge penalties were considered. A ridge penalty of 0.000001 was selected for the deep learning method for both lags and for the polynomial method for lag 3, and a ridge penalty of 0.0005 was selected for the polynomial method for lag 6. For the deep learning method, a variety of initial learning rates were also considered, with the optimally tuned models using an initial learning rate of 1 for both lags. Further details on how these models were fit are given in Appendix B.

The QRF model was built using 500 trees and evaluated using the quantregForest R package. The DDR method was run using a deep learning classification model and evaluated in Python using the deep-conditional-distribution-regression package. The model had one hidden layer with 15 nodes and a 0% dropout rate to mimic the deep learning approximation model specifications. The joint binary cross entropy loss objective function was selected. The ELU activation function was applied to the hidden layer, with the softmax activation function used for the output layer. As in the simulation study, we select a normal CDF to transform \(\textit{Y}\) to \(\textit{Z}\) and estimate the CDF parameters using OLS regression.

The tropical cyclones were randomly assigned to one of five folds, and 5-fold cross validation was performed. For each fold, we calculate the CRPS of the testing set to evaluate method performance, as the CRPS divergence is unavailable without knowledge of the true distribution (Matheson and Winkler 1976; Hersbach 2000). CRPS is defined as

$$\begin{aligned} CRPS=\frac{1}{N} \sum _{n=1}^{N} \int _{l}^{u}\left\{ {\hat{F}}\left( y | {\varvec{X}}_{n}\right) -I\left( y \ge Y_{n}\right) \right\} ^{2} d y. \end{aligned}$$

As with the CRPS divergence evaluation, the integral is approximated using 1000 evenly spaced grid points and the resulting approximation is normalized by the range of the data. For this application, N refers to the number of observations in the given testing fold. A sensitivity analysis for the number of CRPS grid points used for each method can be found in Appendix E.
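Relative to the divergence used in the simulation study, the only change is that the unknown true CDF is replaced by the indicator step function at the observed response; a corresponding sketch, under the same grid-approximation assumptions as before, is:

```python
import numpy as np

def crps(F_hat, y_obs, y_grid):
    """Grid approximation of the average CRPS with indicator I(y >= Y_n).

    F_hat: (N, G) estimated CDF values on y_grid; y_obs: (N,) observed responses.
    """
    indicator = (y_grid[None, :] >= y_obs[:, None]).astype(float)
    sq_diff = (F_hat - indicator) ** 2
    integrals = np.trapz(sq_diff, y_grid, axis=1)
    return integrals.mean() / (y_grid[-1] - y_grid[0])
```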

Table 2 gives the average CRPS across folds and the accompanying standard error for each method. Our polynomial and deep learning approximation methods outperform QRF and DDR by these metrics. Additionally, the deep learning IPP approximation slightly outperforms the polynomial MCC approximation in terms of average CRPS. The lag 6 predictions result in a higher average CRPS for each method than the lag 3 predictions, due to the increased difficulty of forecasting further into the future.

Table 2 The 5-fold mean CRPS values for each lag time for the QRF, DDR, polynomial MCC approximation (\(K=1\)), and deep learning IPP approximation (\(K=20\)) conditional distribution estimation methods
Fig. 2

Deep learning IPP approximation (\(K=20\)) conditional maximum 10-m wind speed distribution predictions for lag 3 and lag 6, using a model constructed with all available data. Estimated conditional response probabilities at 100 equally spaced percentiles between 0.5 and 99.5 are displayed with linear interpolation between them. The density values at the 0.5th and 99.5th percentiles are rounded to 0. \(\Pr (\textit{Y}=\textit{y}|X)\) refers to the relative probability that the maximum 10-m wind speed \(\textit{Y}=\textit{y}\) occurs given the HWRF-forecasted maximum 10-m wind speed value \(X\). \(\textit{Y}|X\) refers to the conditional response value \(\textit{Y}\) given the covariate \(X\)

Fig. 3

Polynomial MCC approximation (\(K=1\)) conditional maximum 10-m wind speed distribution predictions for lag 3 and lag 6, using a model constructed with all available data. Estimated conditional response probabilities at 100 equally spaced percentiles between 0.5 and 99.5 are displayed with linear interpolation between them. The density values at the 0.5th and 99.5th percentiles are rounded to 0. \(\Pr (\textit{Y}=\textit{y}|X)\) refers to the relative probability that the maximum 10-m wind speed \(\textit{Y}=\textit{y}\) occurs given the HWRF-forecasted maximum 10-m wind speed value \(X\). \(\textit{Y}|X\) refers to the conditional response value \(\textit{Y}\) given the covariate \(X\)

As a comparison, the 5-fold mean CRPS for the conditional Gaussian distribution estimated via OLS was calculated. The OLS-predicted mean CRPS values were 0.0220 (standard error 0.0020) and 0.0269 (standard error 0.0026) for lag 3 and lag 6, respectively. The deep learning IPP approximation and polynomial MCC approximation both outperform these average CRPS values, although the improvement made by the polynomial MCC approximation is very slight. A difference between the polynomial MCC approximation and the deep learning IPP approximation is present; however, the single predictor and adequate sample size likely lessened the benefit of including additional controls, for reasons similar to those offered in Sect. 4.

Figure 2 displays the deep learning IPP approximation conditional response distribution predictions for lag 3 and lag 6 when using all of the training data. The lag 3 predicted quantiles for this model look generally unimodal and Gaussian, with some slight left skewness for smaller covariate values. The lag 6 predicted quantiles are also generally unimodal and Gaussian for the larger covariate values, but exhibit clear non-normality and skewness for the lower covariate values. Both models used for Fig. 2 were fit using the same ridge penalties and initial learning rates as the models used to calculate the CRPS values for each fold in Table 2. The predicted distributions for each individual fold that were used to calculate the mean CRPS results in Table 2 are not necessarily shaped like these plotted predicted distributions. For instance, the deep learning IPP approximation method for lag 6 in the third fold predicts a somewhat bimodal distribution of maximum 10-m wind speed for larger covariate values. For an example plot of the predicted distributions using both methods and lag times for an individual fold, see Appendix C. Overall, these somewhat Gaussian-shaped predicted distributions are consistent with the CRPS results, which suggest the OLS-predicted distribution method performs only slightly worse than the deep learning IPP approximation with \(K=20\) in this application.

Figure 3 displays the polynomial MCC approximation conditional response distribution predictions for lag 3 and lag 6 when using all of the training data. Again, the models used to predict these distribution quantiles maintained the same ridge penalties and initial learning rates as the corresponding models used to obtain the polynomial MCC approximation average CRPS in Table 2. The larger and smaller covariate values are associated with distributions that have sharper peaks, whereas the predicted distributions for the middle covariate values have broader, less symmetric peaks. The lag 3 and lag 6 predicted distributions look more similar here than in Fig. 2.

6 Discussion

We propose a flexible conditional distribution estimation method that can incorporate machine learning techniques such as deep learning models or polynomial regression. We examined the performance of some of these model types for different data distributions in a simulation study, finding that our method implemented with a deep learning model outperformed other conditional distribution estimation methods across multiple data models. In a real-world application of our method, we found that both the deep learning and polynomial model-based methods provided useful insight on tropical cyclone maximum wind speed forecasting compared to other methods, with the deep learning method performing best in terms of the 5-fold mean CRPS performance metric.

Further approximation and/or computational techniques for this method could fully unlock its utility for conditional distribution estimation. We introduced an IPP-based discrete approximation with \(K\) controls to make model evaluation feasible, but were limited to selecting a small \(K\) and a relatively basic deep learning model structure with one hidden layer and 30 nodes. We expect that our method could substantially improve its predictive accuracy if a more complex deep learning model structure were tractable. Integration of our method with TensorFlow or another deep learning framework could be helpful in this regard. Perhaps an approximation that reduces the number of observations could also be incorporated to improve methodological accuracy.

Another potential methodological improvement is in the selection of the control values for our case–control based approximation. Fithian and Hastie (2014) describe a local case–control sampling technique meant to address conditional imbalance in addition to the marginal imbalance addressed by standard case–control sampling. Perhaps this approach or another weighted control selection technique could be adapted to our framework to improve conditional distribution estimation with a smaller number of controls.

For complicated models with many parameters (from multiple covariates, layers, and/or nodes), the ridge penalty in Eq. (8) shrinks the parameters towards 0 so that they deviate less from one another. As a result, \(f(\textit{z}|{\varvec{X}})\) tends toward a uniform distribution, and \(h(\textit{y}|{\varvec{X}})\) consequently tends toward the distribution implied by the transformation CDF G. In both the simulation and the application, a Gaussian cumulative distribution function was selected to transform \(\textit{Y}\) to \(\textit{Z}\). A conditionally normal response distribution is often assumed in statistics, so this specification is reasonable for many applications. Still, a more sophisticated optimization algorithm might allow for larger deviations between parameter estimates and be less influenced by the choice of CDF.

Additionally, the CDF parameters were estimated via OLS, which requires more observations than covariates to obtain a unique parameter solution. This restriction might preclude the inclusion of higher-order interactions in the polynomial approximation model when the sample size is small, because they would result in more covariates than training observations. In this scenario, the CDF transformation should not be used; instead, boundaries should be chosen for \(\textit{Y}\) on its original scale and our method applied analogously.