1 Introduction

Estimating regression quantiles—conditional quantiles of a response variable that depend on covariates in some form of regression equation—is a fundamental task in data-driven science. Focusing on the environmental sciences, quantile regression methods have been used to provide estimates of predictive uncertainty in forecast applications (Cawley et al. 2007); construct growth curves for organisms (Muggeo et al. 2013); relate soil moisture deficit with summer hot extremes (Hirschi et al. 2010); provide flood frequency estimates (Ouali et al. 2016); estimate rainfall Intensity–Duration–Frequency (IDF) curves (Ouali and Cannon 2018); determine the relation between rainfall intensity and duration and landslide occurrence (Saito et al. 2010); estimate trends in climate, streamflow, and sea level data (Koenker and Schorfheide 1994; Barbosa 2008; Allamano et al. 2009; Roth et al. 2015); downscale atmospheric model outputs (Friederichs and Hense 2007; Cannon 2011; Ben Alaya et al. 2016); and determine scaling relationships between temperature and extreme precipitation (Wasko and Sharma 2014), among other applications.

Quantile regression equations can be linear or nonlinear. In most variants, including the original linear model (Koenker and Bassett 1978), conditional quantiles for specified quantile probabilities are estimated separately by different regression equations; together, these different equations can be used to build up a piecewise estimate of the conditional response distribution. However, given finite samples, this flexibility can lead to “quantile crossing” where, for some values of the covariates, quantile regression predictions do not increase with the specified quantile probability \(\tau \). For instance, the \(\tau _{1}=0.1\)-quantile (10th-percentile) estimate may be greater in magnitude than the \(\tau _{2}=0.2\)-quantile (20th-percentile) estimate, which violates the property that the conditional quantile function be strictly monotonic. As Ouali et al. (2016) state, “crossing quantile regression is a serious modeling problem that may lead to an invalid response distribution”.

Three main approaches have been used to solve the quantile crossing problem: post-processing, stepwise estimation, and simultaneous estimation. In post-processing, non-crossing quantiles are enforced following model estimation by rearranging predictions so that they increase with increasing \(\tau \) (Chernozhukov et al. 2010). In stepwise estimation, regression equations are constructed iteratively, with constraints added so that each subsequent quantile regression function does not cross the one estimated previously (Liu and Wu 2009; Muggeo et al. 2013). Finally, in simultaneous estimation, quantile regression equations for all desired values of \(\tau \) are estimated at the same time, with additional constraints added to parameter optimization to ensure non-crossing (Takeuchi et al. 2006; Bondell et al. 2010; Liu and Wu 2011; Bang et al. 2016). Unlike stepwise estimation, simultaneous estimation is attractive because it does not depend on the order in which quantiles are estimated. Furthermore, fitting for multiple values of \(\tau \) simultaneously allows one to “borrow strength” across regression quantiles and improve overall model performance (Bang et al. 2016). This property is especially useful for nonlinear quantile regression models, which are more prone to overfitting and quantile crossing in the face of small to moderate sample sizes (Muggeo et al. 2013).

Baldwin (2006), paraphrasing Persson (2001), states “...while there is only one way to be linear, there are an uncountable infinity of ways to be nonlinear. One cannot check them all”. For a flexible nonlinear model like a neural network, imposing extra constraints, for example as informed by process knowledge, can be useful for narrowing the overall search space of potential nonlinearities. As a simple example, growth curves should increase monotonically with the age of the organism, which led Muggeo et al. (2013) to introduce a monotonicity constraint in addition to the non-crossing constraint. Similarly, Roth et al. (2015) applied nonlinear monotone quantile regression to describe non-decreasing trends in rainfall extremes. Takeuchi et al. (2006) developed a nonparametric, kernelized version of quantile regression with similarities to support vector machines; both non-crossing and monotonicity constraints are considered, and guidance is provided on incorporating other constraints, such as positivity and additivity. However, standard implementations of the kernel quantile regression model (e.g., Karatzoglou et al. 2004; Hofmeister 2017) are computationally costly, with complexity that is cubic in the number of samples, and do not explicitly implement the proposed constraints.

As an alternative, this study introduces an efficient, flexible nonlinear quantile regression model, the monotone composite quantile regression neural network (MCQRNN), that: (1) simultaneously estimates multiple non-crossing quantile functions; (2) allows for optional monotonicity, positivity/non-negativity, and additivity constraints, as well as fine-grained control on the degree of non-additivity; and (3) can be modified to estimate standard least-squares regression and non-crossing expectile regression functions. These features, which are combined into a single, unified framework, are made possible through a novel combination of elements drawn from the standard QRNN model (White 1992; Taylor 2000; Cannon 2011), the monotone multi-layer perceptron (MMLP) (Zhang and Zhang 1999; Lang 2005; Minin et al. 2010), the composite QRNN (CQRNN) (Xu et al. 2017), the expectile regression neural network (Jiang et al. 2017), and the generalized additive neural network (Potts 1999). To the best of the author’s knowledge, the MCQRNN model is the first neural network-based implementation of quantile regression that guarantees non-crossing of regression quantiles.

The MCQRNN model is developed in Sect. 2, starting from the MMLP model, leading to the MQRNN model, and then finally to the full MCQRNN. Approaches to enforce monotonicity, positivity/non-negativity, and generalized additive model constraints, as well as to estimate uncertainty in the conditional \(\tau \)-quantile functions, are also provided. In Sect. 3, the MCQRNN model is compared via Monte Carlo simulation to standard MLP, QRNN, and CQRNN models using combinations of three functions and error distributions from Xu et al. (2017). In Sect. 4, the MCQRNN model is applied to real-world climate data by estimating IDF curves at ungauged locations in Canada based on annual maximum rainfall series at neighbouring gauging stations. IDF curves, which are used in the design of civil infrastructure such as culverts, storm sewers, dams, and bridges, summarize the relationship between the intensity and occurrence frequency of extreme rainfall over averaging durations ranging from minutes to a day (Canadian Standards Association 2012). The intensity of extreme rainfall, a non-negative quantity, should increase monotonically as the annual probability of occurrence decreases (e.g., from \(1-\tau =0.5\) to 0.01 or, equivalently, a 2–100-year return period) and as the storm duration decreases (e.g., from 24-h to 5-min). Monotonicity and positivity/non-negativity constraints are thus key features of an IDF curve. MCQRNN IDF curve estimates are compared with those obtained by fitting separate QRNN models for each return period and duration, as done previously by Ouali and Cannon (2018). Finally, Sect. 5 provides closing remarks and suggestions for future research.

2 Modelling framework

2.1 Monotone multi-layer perceptron (MMLP)

The monotone composite quantile regression neural network (MCQRNN) model starts with the multi-layer perceptron (MLP) neural network with partial monotonicity constraints (Zhang and Zhang 1999) as its basis. For a data point with index t, the prediction \(\hat{y}(t)\) from a monotone MLP (MMLP) is obtained as follows. First, the V covariates, each assumed to be standardized to zero mean and unit standard deviation, are separated into two groups: \(x_{m\in M}(t)\) and \(x_{i\in I}(t)\) with combined indices \(\{M\cup I\,|\,1,\ldots ,V,\,V=(\#M+\#I)\}\), where M is the set of indices for covariates with a monotone increasing relationship with the prediction, I is the corresponding set of indices for covariates without monotonicity constraints, and \(\#\) denotes the number of set elements. Covariates are transformed into \(j=1,\ldots ,J\) hidden layer outputs

$$\begin{aligned} h_{j}(t)=f\left( \sum _{m\in M}x_{m}(t)\,\exp \left( W_{mj}^{(h)}\right) +\sum _{i\in I}x_{i}(t)\,W_{ij}^{(h)}+b_{j}^{(h)}\right) \end{aligned}$$
(1)

where \({\mathbf {W}}^{(h)}\) is a \(V\times J\) parameter matrix, \({\mathbf {b}}^{(h)}\) is a vector of J intercept parameters, and f is a smooth non-decreasing function, usually taken to be the hyperbolic tangent function. Finally, the model prediction is given as a weighted combination of the J hidden layer outputs

$$\begin{aligned} \hat{y}(t)=g\left( \sum _{j=1}^{J}h_{j}(t)\,\exp \left( w_{j}\right) +b\right) \end{aligned}$$
(2)

where \({\mathbf {w}}\) is a vector of J parameters, b is an intercept term, and g is a smooth non-decreasing inverse-link function.

Because both f and g are non-decreasing, partial monotonicity constraints (i.e., \(\frac{\partial \hat{y}}{\partial x_{m}}\ge 0\) everywhere) can be imposed by ensuring that all parameters leading from each monotone-constrained covariate \(x_{m}\) are positive (Zhang and Zhang 1999), in this case by applying the exponential function to the corresponding elements of \({\mathbf {W}}^{(h)}\) and all elements of \({\mathbf {w}}\). Decreasing relationships can be imposed by multiplying covariates by \(-1\). Also, extra hidden layers of positive parameters can be added to the model. As pointed out by Lang (2005) and Minin et al. (2010), an additional hidden layer is required for the MMLP to maintain its universal function approximation capabilities. While multiple hidden layers are included in the software implementation by Cannon (2017), for the sake of simplicity, this study only considers the single hidden layer architecture of Zhang and Zhang (1999). In practice, simple functional relationships can still be represented by a single hidden layer model.

If M is the empty set and the positivity constraint on the \({\mathbf {w}}\) parameters is removed, this leads to the standard MLP model. If f and g are the identity function, the MMLP reduces to a linear model. If f is nonlinear, then the model can represent nonlinear relationships, including those involving interactions between covariates; the number of hidden layer outputs J further controls the potential complexity of the MLP mapping. All models in this study set f to be the hyperbolic tangent function.
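To make the architecture concrete, the forward pass defined by Eqs. (1) and (2) can be written in a few lines of base R. This is a minimal sketch, with all object names illustrative rather than taken from any package:

```r
# Minimal sketch of the MMLP forward pass (Eqs. 1-2); names are illustrative
# x.mon: N x #M monotone covariates; x.free: N x #I unconstrained covariates
# W.mon (#M x J), W.free (#I x J), b.h (length J): hidden-layer parameters
# w (length J), b (scalar): output-layer parameters
mmlp.predict <- function(x.mon, x.free, W.mon, W.free, b.h, w, b,
                         f = tanh, g = identity) {
  # exp() forces the weights on monotone covariates to be positive (Eq. 1)
  h <- f(sweep(x.mon %*% exp(W.mon) + x.free %*% W.free, 2, b.h, "+"))
  # exp() forces the hidden-to-output weights to be positive (Eq. 2)
  as.vector(g(h %*% exp(w) + b))
}
```

Because every path from a monotone covariate to the output passes only through positive weights and the non-decreasing functions f and g, \(\partial \hat{y}/\partial x_{m}\ge 0\) holds for any parameter values.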

Adjustable parameters (\({\mathbf {W}}^{(h)},\,{\mathbf {b}}^{(h)},\,{\mathbf {w}},\,b\)) in the MMLP are set by minimizing the least squares (LS) error function

$$\begin{aligned} E_{\mathrm{LS}}=\frac{1}{N}\sum _{t=1}^{N}\left( y(t)-\hat{y}(t)\right) ^{2} \end{aligned}$$
(3)

over a training dataset with N data points \(\{\left( {\mathbf {x}}(t),\,y(t)\right) |t=1,\ldots ,N\}\), where y(t) is the target value of the response variable. While LS regression is most common, different error functions are appropriate for different prediction tasks. Minimizing the LS error function is equivalent to maximum likelihood estimation for the conditional mean assuming a Gaussian error distribution with constant variance (i.e., a traditional regression task), while minimizing the least absolute error (LAE) function

$$\begin{aligned} E_{\mathrm{LAE}}=\frac{1}{N}\sum _{t=1}^{N}\left| y(t)-\hat{y}(t)\right| \end{aligned}$$
(4)

leads to a regression estimate for the conditional median (i.e., the \(\tau =0.5\)-quantile) (Koenker and Bassett 1978).
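Both error functions translate directly into code; a one-line base R sketch of each:

```r
# Least squares (Eq. 3) and least absolute error (Eq. 4) functions
E.ls  <- function(y, y.hat) mean((y - y.hat)^2)
E.lae <- function(y, y.hat) mean(abs(y - y.hat))
```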

2.2 Monotone quantile regression neural network (MQRNN)

The fundamental quantity of interest here is not just the median, but rather predictions \(\hat{y}_{\tau }(t)\) of the conditional quantile associated with the quantile probability \(\tau \) (\(0<\tau <1\)). In this context, combining the MMLP architecture from Sect. 2.1, as given by Eqs. (1) and (2),

$$\begin{aligned} \hat{y}_{\tau }(t)=g\left[ \sum _{j=1}^{J}f\left( \sum _{m\in M}x_{m}(t)\,\exp \left( W_{mj}^{(h)}\right) +\sum _{i\in I}x_{i}(t)\,W_{ij}^{(h)}+b_{j}^{(h)}\right)\,\exp \left( w_{j}\right) +b\right] , \end{aligned}$$
(5)

with the quantile regression error function

$$\begin{aligned} E_{{\tau }}=\frac{1}{N}\sum _{t=1}^{N}\rho _{\tau }\left( y(t)-\hat{y}_{\tau }(t)\right) \end{aligned}$$
(6)

where

$$\begin{aligned} \rho _{\tau }(\varepsilon )=\left\{ \begin{array}{ll} \tau \,\varepsilon &{}\quad \varepsilon \ge 0\\ (\tau -1)\,\varepsilon &{}\quad \varepsilon <0 \end{array}\right. \end{aligned}$$
(7)

leads to estimates \(\hat{y}_{\tau }\) of the conditional \(\tau \)-quantile function (Koenker and Bassett 1978). The resulting model is referred to as the MQRNN. When \(\tau =0.5\), Eq. (6) is, up to a constant scaling factor, the same as the LAE function (Eq. 4) that yields the conditional median; for \(\tau \ne 0.5\), the asymmetric absolute value function gives different weight to positive/negative deviations. For example, fitting a model with \(\tau =0.95\) provides an estimate for the conditional 95th-percentile, i.e., a covariate-dependent probability of exceedance of 5%. Relaxing the monotonicity constraints gives the standard QRNN model as presented by Cannon (2011).

Parameters can be estimated by a gradient-based nonlinear optimization algorithm, with calculation of the gradient using backpropagation; given the simple relationship between Eqs. 4 and 6, the analytical expression for the gradient of the quantile regression error function follows from that of the LAE function (Hanson and Burr 1988). In both cases, however, the derivative is undefined at the origin, so a smooth approximation is substituted for the exact quantile regression error function. Following Chen (2007) and Cannon (2011), a Huber-norm version of Eq. 7 replaces \(\rho _{\tau }(\varepsilon )\) in the quantile regression error function. This approximation, denoted by (A), is given by

$$\begin{aligned} \rho _{\tau }^{(A)}(\varepsilon )=\left\{ \begin{array}{ll} \tau \,\varphi (\varepsilon ) &{}\quad \varepsilon \ge 0\\ (1-\tau )\,\varphi (\varepsilon ) &{}\quad \varepsilon <0 \end{array}\right. \end{aligned}$$
(8)

where the Huber function

$$\begin{aligned} \varphi (\varepsilon )=\left\{ \begin{array}{ll} \frac{\varepsilon ^{2}}{2\alpha } &{}\quad 0\le \left| \varepsilon \right| \le \alpha \\ \left| \varepsilon \right| -\frac{\alpha }{2} &{}\quad \left| \varepsilon \right| >\alpha \end{array}\right. \end{aligned}$$
(9)

is a hybrid of the absolute value and squared error functions (Huber 1964).

The Huber function transitions smoothly from the squared error, which is applied around the origin (\(\pm \alpha \)) to ensure differentiability, to the absolute error. As \(\alpha \rightarrow 0\), the approximate error function converges to the exact quantile regression error function. It should be noted that a slightly different approximation is used by Muggeo et al. (2012). Based on experimental results (not shown), both approximations ultimately provide models that are indistinguishable. However, the Huber function approximation is used here for its added ability to emulate the LS cost function. For sufficiently large \(\alpha \), all model deviations are squared and the approximate error function instead becomes an asymmetric version of the LS error function (Eq. 3). For \(\tau =0.5\) and large \(\alpha \), the error function is symmetric and is, up to a constant scaling factor, equal to the LS error function. For \(\tau \ne 0.5\), the asymmetric LS error function results in an estimate of the conditional expectile function (Newey and Powell 1987; Yao and Tong 1996; Waltrup et al. 2015). Hence, depending on values of \(\alpha \) and \(\tau \), minimizing the approximate quantile regression error function can provide regression estimates for the conditional mean (\(\alpha \gg 0\), \(\tau =0.5\)), median (\(\alpha \rightarrow 0\), \(\tau =0.5\)), quantiles (\(\alpha \rightarrow 0\), \(0<\tau <1\)), and expectiles (\(\alpha \gg 0\), \(0<\tau <1\)) (Jiang et al. 2017). Unless noted otherwise, all subsequent references to \(\rho _{\tau }^{(A)}\) and \(E_{{\tau }}^{(A)}\) will refer to the conditional quantile form of the Huber function approximation.
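The approximation is easily expressed in code. The base R sketch below (function names illustrative) mirrors Eqs. (8) and (9) and illustrates the two limiting behaviours of \(\alpha \):

```r
# Huber function (Eq. 9) and Huber-norm tilted loss (Eq. 8); names illustrative
huber <- function(eps, alpha) {
  ifelse(abs(eps) > alpha, abs(eps) - alpha / 2, eps^2 / (2 * alpha))
}
rho.approx <- function(eps, tau, alpha) {
  ifelse(eps >= 0, tau * huber(eps, alpha), (1 - tau) * huber(eps, alpha))
}
eps <- seq(-2, 2, by = 0.01)
q.loss <- rho.approx(eps, tau = 0.9, alpha = 1e-6)  # ~ exact quantile check loss
e.loss <- rho.approx(eps, tau = 0.9, alpha = 10)    # ~ asymmetric LS (expectile) loss
```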

Unlike linear regression, where the total number of model parameters is limited by the number of covariates V, the complexity of the MQRNN model also depends on the number of hidden layer outputs J. Model complexity, and hence J, should be set such that the model can generalize to new data, which, in practice, usually means avoiding overfitting to noise in the training dataset. Additionally, regularization terms that penalize the magnitude of the parameters, thereby limiting the model's capacity for nonlinearity, can be added to the error function

$$\begin{aligned} \tilde{E}_{{\tau }}^{(A)}=E_{\tau }^{(A)}+\lambda ^{(h)}\frac{1}{VJ}\sum _{i=1}^{V}\sum _{j=1}^{J}\left( W_{ij}^{(h)}\right) ^{2}+\lambda \frac{1}{J}\sum _{j=1}^{J}\left( w_{j}\right) ^{2} \end{aligned}$$
(10)

where \(\lambda ^{(h)}\ge 0\) and \(\lambda \ge 0\) are hyperparameters that control the size of the penalty applied to the elements of \({\mathbf {W}}^{(h)}\) and \({\mathbf {w}}\) respectively. Values of J and, optionally, the \(\lambda ^{(h)}\) and \(\lambda \) hyperparameters are typically set by minimizing out-of-sample generalization error, for example as estimated via cross-validation, or by a quantile regression adaptation of the Akaike information criterion (QAIC) (Koenker and Schorfheide 1994; Doksum and Koo 2000)

$$\begin{aligned} \hbox {QAIC}=2\,N\,\log \left( E_{{\tau }}\right) +2\,p \end{aligned}$$
(11)

where p is an estimate of the effective number of model parameters.

2.3 Monotone composite quantile regression neural network (MCQRNN)

The MQRNN model in Sect. 2.2 is specified for a single \(\tau \)-quantile; no efforts are made to avoid quantile crossing for multiple estimates. To date, the simultaneous estimation of multiple \(\tau \)-quantiles with guaranteed non-crossing has not been possible for QRNN models. However, simultaneous estimates for multiple values of \(\tau \) are used in the composite QRNN (CQRNN) model proposed by Xu et al. (2017). CQRNN shares the same goal as the linear composite quantile regression (CQR) model (Zou and Yuan 2008), namely to borrow strength across multiple regression quantiles to improve the estimate of the true, unknown relationship between the covariates and the response. This is especially valuable in situations where the error follows a heavy-tailed distribution. In CQR, the regression coefficients are shared across the different quantile regression models. Similarly, in CQRNN, the \({\mathbf {W}}^{(h)},\,{\mathbf {b}}^{(h)},\,{\mathbf {w}},\,b\) parameters are shared across the different QRNN models. Hence, the models are not explicitly trying to describe the full conditional response distribution, but rather a single \(\tau \)-independent function that best describes the true covariate-response relationship. Structurally, the CQRNN model is the same as the QRNN model. The only difference is the quantile regression error function, which is now summed over K (usually equally spaced) values of \(\tau \)

$$\begin{aligned} E_{\mathrm{C}\tau }^{(A)}=\frac{1}{KN}\sum _{k=1}^{K} \sum _{t=1}^{N}\rho _{\tau _{k}}^{(A)}\left( y(t)-\hat{y}_{\tau _{k}}(t)\right) \end{aligned}$$
(12)

where, for example, \(\tau _{k}=\frac{k}{K+1}\) for \(k=1,2,\ldots ,K\). Penalty terms can be added as in Eq. 10.

The MCQRNN model combines the MQRNN model architecture given by Eq. 5 with the composite quantile regression error function (Eq. 12) to simultaneously estimate non-crossing regression quantiles. To show how this is achieved, consider an \(N\times \#I\) matrix of covariates \({\mathbf {X}}\), a corresponding response vector \({\mathbf {y}}\) of length N, and the goal of estimating non-crossing quantile functions for \(\tau _{1}<\tau _{2}<\cdots <\tau _{K}\). First, create a new \(\#M=1\) monotone covariate vector \({\mathbf {x}}_{m}^{(S)}\) of length \(S=K\,N\), where (S) denotes stacked data, by repeating each of the K specified \(\tau \) values N times and stacking. Next, stack K copies of \({\mathbf {X}}\) and concatenate with \({\mathbf {x}}_{m}^{(S)}\) to form a stacked covariate matrix \({\mathbf {X}}^{(S)}\) of dimension \(S\times (1+\#I)\). Finally, stack K copies of \({\mathbf {y}}\) to form \({\mathbf {y}}^{(S)}\). Taken together, this gives the stacked dataset

$$\begin{aligned} {\mathbf{X}}^{(S)}=\left[ \begin{array}{cccc} \tau _{1} &{}\quad x_{1}(1) &{}\quad \cdots &{}\quad x_{\#I}(1)\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \tau _{1} &{}\quad x_{1}(N) &{}\quad \cdots &{}\quad x_{\#I}(N)\\ \tau _{2} &{}\quad x_{1}(1) &{}\quad \cdots &{}\quad x_{\#I}(1)\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \tau _{2} &{}\quad x_{1}(N) &{}\quad \cdots &{}\quad x_{\#I}(N)\\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ \tau _{K} &{}\quad x_{1}(1) &{}\quad \cdots &{}\quad x_{\#I}(1)\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots \\ \tau _{K} &{}\quad x_{1}(N) &{}\quad \cdots &{}\quad x_{\#I}(N) \end{array}\right] ,\,{\mathbf{y}}^{(S)}=\left[ \begin{array}{c} y(1)\\ \vdots \\ y(N)\\ y(1)\\ \vdots \\ y(N)\\ \vdots \\ y(1)\\ \vdots \\ y(N) \end{array}\right] \end{aligned}$$
(13)

which is used to fit the MQRNN model. By treating the \(\tau \) values as a monotone covariate, predictions \(\hat{y}_{\tau }^{(S)}\) from Eq. 5 for fixed values of the non-monotone covariates are guaranteed to increase with \(\tau \). Non-crossing is imposed by construction. Defining \(\tau (s)=x_{1}^{(S)}(s)\), the composite quantile regression error function for the stacked data can be written as

$$\begin{aligned} E_{\mathrm{C}\tau }^{(A,S)}=\sum _{s=1}^{S}\omega _{\tau (s)}\rho _{\tau (s)}^{(A)}\left( y^{(S)}(s)-\hat{y}_{\tau (s)}^{(S)}(s)\right) \end{aligned}$$
(14)

where \(\omega _{\tau (s)}\) are weights that can be used to allow regression quantiles for each \(\tau _{k}\) to contribute different amounts to the total error (Jiang et al. 2012; Sun et al. 2013); constant weights \(\omega _{\tau (s)}=1/S\) lead to the standard composite quantile regression error function. Minimization of Eq. 14 results in the fitted MCQRNN model. (Note: non-crossing expectile regression models can be obtained by adjusting \(\alpha \gg 0\) in \(\rho _{\tau }^{(A)}\).) Following model estimation, conditional \(\tau \)-quantile functions can be predicted for any value of \(\tau _{1}\le \tau \le \tau _{K}\) by entering the desired value of \(\tau \) into the monotone covariate.
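In code, the stacking in Eq. (13) amounts to a few replications. A base R sketch, using the Eq. (15) data-generating process introduced below as an example:

```r
# Sketch of the stacked dataset of Eq. 13; base R, names illustrative
set.seed(1)
N <- 500
X <- matrix(runif(N), ncol = 1)                               # N x #I covariates
y <- 0.5 + 2 * X[, 1] + sin(2 * pi * X[, 1] - 0.5) + rnorm(N) # Eq. 15 response
taus <- seq(0.1, 0.9, by = 0.1)                               # K = 9 probabilities
K <- length(taus)
X.s <- cbind(tau = rep(taus, each = N),                       # monotone covariate
             X[rep(seq_len(N), times = K), , drop = FALSE])   # S x (1 + #I)
y.s <- rep(y, times = K)                                      # length S = K * N
```

Fitting the MQRNN of Eq. (5) to (X.s, y.s), with the tau column as the single monotone covariate, minimizes Eq. (14) and yields non-crossing quantile estimates by construction.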

To illustrate, Fig. 1 shows results from a MCQRNN model (\(J=4\), \(\lambda ^{(h)}=0.00001\), \(\lambda =0\), \(K=9\), \(\tau =0.1,\,0.2,\,\ldots ,\,0.9\)) fit to 500 samples of synthetic data for the two functions from Bondell et al. (2010)

$$\begin{aligned} y_{1}=0.5+2x+\sin (2\pi x-0.5)+\varepsilon \end{aligned}$$
(15)

and

$$\begin{aligned} y_{2}=3x+\left[ 0.5+2x+\sin (2\pi x-0.5)\right] \varepsilon \end{aligned}$$
(16)

where x is drawn from the standard uniform distribution \(x\sim U(0,\,1)\) and \(\varepsilon \) from the standard normal distribution \(\varepsilon \sim N(0,\,1)\). All \(\tau \) are weighted equally in Eq. 14 (i.e., values of \(\omega _{\tau (s)}\) are constant). Results are compared with those from separate QRNN models (\(J=4\) and \(\lambda ^{(h)}=0.00001\)) for each \(\tau \)-quantile. Quantile curves cross for QRNN, especially at the boundaries of the training data, whereas the MCQRNN model is able to simultaneously estimate multiple non-crossing quantile functions that correspond more closely to the true conditional quantile functions. While quantile crossing in QRNN models can be minimized by selecting and applying a suitable weight penalty (Cannon 2011), non-crossing cannot be guaranteed, whereas MCQRNN models impose this constraint by construction.
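The qrnn R package noted in Sect. 5 exposes this workflow directly. Assuming its documented interface (the argument names below should be verified against the package manual), the MCQRNN fit of Fig. 1b can be sketched as:

```r
# Hedged sketch of an MCQRNN fit via the qrnn package; verify argument names
# against the package documentation before use
library(qrnn)
set.seed(1)
x <- as.matrix(runif(500))
y <- as.matrix(0.5 + 2 * x + sin(2 * pi * x - 0.5) + rnorm(500))  # Eq. 15
fit <- mcqrnn.fit(x, y, tau = seq(0.1, 0.9, by = 0.1), n.hidden = 4)
pred <- mcqrnn.predict(as.matrix(seq(0, 1, by = 0.01)), fit)  # one column per tau
```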

Fig. 1

Predictions from QRNN (a, c) and MCQRNN (b, d) models fit to synthetic data (black points) generated by Eq. 15 (a, b) and Eq. 16 (c, d) are shown in rainbow colours. Plots of the true conditional quantile functions are shown by solid grey lines. The nine curves from bottom to top represent \(\tau =0.1,0.2,\ldots ,0.9\)

2.4 Additional constraints and uncertainty estimates

As mentioned above, constraints in addition to non-crossing of quantile functions may be useful for some MCQRNN modelling tasks. Partial monotonicity constraints for specified covariates can be imposed as described in Sect. 2.1; positivity or non-negativity constraints can be added by setting g in Eq. 2 to the exponential or smooth ramp function (Cannon 2011), respectively; and covariate interactions can be restricted by the approach described in Appendix 1.

A form of the parametric bootstrap can be used to estimate uncertainty in the conditional \(\tau \)-quantile functions. While the MCQRNN model is explicitly optimized for K specified values of \(\tau \), the use of the quantile probability as a monotone covariate means that conditional \(\tau \)-quantile functions can be interpolated for any value of \(\tau _{1}\le \tau \le \tau _{K}\). Proper distribution, probability density, and quantile functions can then be constructed by assuming a parametric form for the tails of the distribution (Quiñonero Candela et al. 2006; Cannon 2011). The parametric bootstrap proceeds by drawing random samples from the resulting conditional distribution, refitting the MCQRNN model, making estimates of the conditional \(\tau \)-quantiles, and repeating many times. Confidence intervals are estimated from the bootstrapped conditional \(\tau \)-quantiles.
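In outline, and continuing the fitted model from the sketch above, the bootstrap can be coded as follows. For brevity this sketch inverts the fitted conditional distribution only within \([\tau _{1},\,\tau _{K}]\) and omits the parametric tails, so it understates tail uncertainty; it also assumes that the mcqrnn.predict tau argument accepts interpolated quantile probabilities:

```r
# Simplified parametric bootstrap sketch (tails omitted for brevity)
B <- 100                                     # number of bootstrap replicates
x.grid <- as.matrix(seq(0, 1, by = 0.01))
boot.q <- array(NA, c(nrow(x.grid), 3, B))   # tau = 0.1, 0.5, 0.9
for (b in seq_len(B)) {
  # 1. draw synthetic responses by inverse-CDF sampling of the fitted model
  u <- runif(nrow(x), 0.1, 0.9)
  y.b <- sapply(seq_along(u), function(t)
    mcqrnn.predict(x[t, , drop = FALSE], fit, tau = u[t]))
  # 2. refit the MCQRNN model to the synthetic data
  fit.b <- mcqrnn.fit(x, as.matrix(y.b), tau = seq(0.1, 0.9, by = 0.1),
                      n.hidden = 4)
  # 3. store the refitted conditional tau-quantile estimates
  boot.q[, , b] <- mcqrnn.predict(x.grid, fit.b, tau = c(0.1, 0.5, 0.9))
}
# pointwise 95% confidence intervals from the bootstrap distribution
ci <- apply(boot.q, c(1, 2), quantile, probs = c(0.025, 0.975))
```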

For illustration, examples of MCQRNN model outputs with positivity and monotonicity constraints, as well as confidence intervals obtained by the parametric bootstrap, are shown in Fig. 2 for the two Bondell et al. (2010) functions.

Fig. 2

As in Fig. 1b, d, but for MCQRNN models with additional a positivity constraints and b positivity and monotonicity constraints, respectively. c, d Estimates of 95% confidence intervals, based on 500 parametric bootstrap datasets, for the \(\tau =0.1,\,0.5,\,0.9\)-quantile regression curves shown in Fig. 1b, d

3 Monte Carlo simulation

Given the close relationship between the MCQRNN and CQRNN models, performance is first assessed via Monte Carlo simulation using the experimental setup adopted by Xu et al. (2017) for CQRNN. The MCQRNN model is compared with standard MLP, QRNN, and CQRNN models on datasets generated for three example functions:

$$\begin{aligned} (\hbox {example}~1)\quad y=\sin (2x_{1})+2\exp \left( -16x_{2}^{2}\right) +0.5\varepsilon \end{aligned}$$
(17)

where \(x_{1}\sim N(0,\,1)\) and \(x_{2}\sim N(0,\,1)\);

$$\begin{aligned} (\hbox {example}~2)\quad y=\left( 1-x+2x^{2}\right) \exp \left( -0.5x^{2}\right) +\frac{(1+0.2x)}{5}\varepsilon \end{aligned}$$
(18)

where \(x\sim U(-4,\,4)\); and

$$\begin{aligned} (\hbox {example}~3)\quad y=\frac{40\exp \left\{ 8\left[ (x_{1}-0.5)^{2}+(x_{2}-0.5)^{2}\right] \right\} }{\exp \left\{ 8\left[ (x_{1}-0.2)^{2}+(x_{2}-0.7)^{2}\right] \right\} +\exp \left\{ 8\left[ (x_{1}-0.7)^{2}+(x_{2}-0.2)^{2}\right] \right\} }+\varepsilon \end{aligned}$$
(19)

where \(x_{1}\sim U(0,\,1)\) and \(x_{2}\sim U(0,\,1)\). For each of the three functions, random errors are generated from three different distributions: the normal distribution \(\varepsilon \sim N(0,\,0.25)\), Student’s t distribution with three degrees of freedom \(\varepsilon \sim t(3)\), and the chi-squared distribution with three degrees of freedom \(\varepsilon \sim \chi ^{2}(3)\). Monte Carlo simulations are performed for the nine resulting datasets.
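The nine data-generating processes are straightforward to reproduce; a base R sketch (here taking \(N(0,\,0.25)\) to specify a variance of 0.25) follows:

```r
# Sketch: data generators for examples 1-3 (Eqs. 17-19); 0.25 is taken to be
# the variance of the normal error distribution
make.eps <- list(norm   = function(n) rnorm(n, sd = sqrt(0.25)),
                 t3     = function(n) rt(n, df = 3),
                 chisq3 = function(n) rchisq(n, df = 3))
example1 <- function(n, eps) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  data.frame(x1, x2, y = sin(2 * x1) + 2 * exp(-16 * x2^2) + 0.5 * eps(n))
}
example2 <- function(n, eps) {
  x <- runif(n, -4, 4)
  data.frame(x, y = (1 - x + 2 * x^2) * exp(-0.5 * x^2) +
               (1 + 0.2 * x) / 5 * eps(n))
}
example3 <- function(n, eps) {
  x1 <- runif(n); x2 <- runif(n)
  num <- 40 * exp(8 * ((x1 - 0.5)^2 + (x2 - 0.5)^2))
  den <- exp(8 * ((x1 - 0.2)^2 + (x2 - 0.7)^2)) +
         exp(8 * ((x1 - 0.7)^2 + (x2 - 0.2)^2))
  data.frame(x1, x2, y = num / den + eps(n))
}
dat <- example2(400, make.eps$chisq3)  # e.g., the dataset illustrated in Fig. 3
```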

To evaluate the benefit of adding MCQRNN’s non-crossing constraint to the simultaneous estimation of multiple regression quantiles, a second variant of CQRNN, referred to as CQRNN*, is included in the comparison. The CQRNN* model takes the same structure as MCQRNN, i.e., with \(\tau \) values included as an extra input variable (Eq. 13). However, partial monotonicity constraints are removed from the \(\tau \)-covariate; the exponential function is no longer applied to the relevant elements in \({\mathbf {W}}^{(h)}\) and all elements of \({\mathbf {w}}\). The resulting model provides estimates of multiple regression quantiles, but crossing can now occur. This differs from the CQRNN model of Xu et al. (2017), which estimates a single regression equation using the composite QR cost function, and MCQRNN, which additionally guarantees non-crossing of the multiple regression quantiles. Differences between the three models are illustrated in Fig. 3 on the example 2 dataset with \(\varepsilon \sim \chi ^{2}(3)\) distributed noise.

Fig. 3

Predictions from a CQRNN, CQRNN*, and b MCQRNN models on the example 2 dataset (Eq. 18) with \(\varepsilon \sim \chi ^{2}(3)\) distributed noise. Black dots show the synthetic training data and the thick black line indicates the true underlying function. Predictions of the conditional mean by CQRNN, CQRNN*, and MCQRNN are shown by the blue line in (a), the red line in (a), and the red line in (b), respectively. For the CQRNN* and MCQRNN models, these values are obtained by taking the mean over predictions of the \(K=19~\tau \)-quantiles shown in grey. Places where CQRNN* quantiles cross are indicated by vertical grey dashed lines

For each example and error distribution in the Monte Carlo simulations, 400 samples are generated and split randomly into 200 training and 200 testing samples. Results for QRNN, MLP, CQRNN, CQRNN*, and MCQRNN models are compared by fitting to the training samples and evaluating on the testing samples. Simulations are repeated 1000 times. Following Xu et al. (2017), the number of hidden layer outputs in all models is set to \(J=4\) for example 1 and \(J=5\) for examples 2 and 3; for the sake of simplicity, no weight penalty terms are added when fitting any of the models. (When comparing results with those reported by Xu et al. (2017), note that omitting weight penalty regularization here leads to smaller inter-model differences in performance within both the training and testing samples, which suggests potential instability in hyperparameter selection in the previous study.) The goal is to estimate the true functional relationship specified by Eqs. 17–19. The QRNN model is fit for \(\tau =0.5\), whereas CQRNN, CQRNN*, and MCQRNN models use \(K=19\) equally spaced values of \(\tau \). In the case of CQRNN* and MCQRNN, evaluations are based on an estimate of the conditional mean function obtained by taking the mean over predictions for the \(K=19\) \(\tau \)-quantiles. Performance is measured by the root mean squared error (RMSE) between model predictions for the test samples and the actual values of y. For reference, training RMSE is also reported. Results are shown in Fig. 4.

Fig. 4

Distribution of RMSE values over the 1000 Monte Carlo simulations for MLP (black), QRNN (green), CQRNN (blue), CQRNN* (orange) and MCQRNN (red) models in the a training and b testing datasets for examples 1, 2, and 3 from Xu et al. (2017) with \(N(0,\,0.25)\) (rnorm25), t(3) (rt3), and \(\chi ^{2}(3)\) (rchisq3) distributed noise. The central dot indicates the median RMSE and the lower and upper bars the 5th and 95th percentiles, respectively

As expected, the MLP model, which is fit using the LS error function and hence is optimal for normally distributed errors with constant variance, tends to perform best for the three examples when \(\varepsilon \sim N(0,\,0.25)\). Differences are, however, small for both training and testing datasets. Median RMSE values for each of the models fall within 10% of MLP in all cases and the 90% inter-percentile ranges are typically comparable. For the two non-normal error distributions, \(\varepsilon \sim t(3)\) and \(\varepsilon \sim \chi ^{2}(3)\), CQRNN* and MCQRNN models tend to outperform the other models on the testing datasets. Again, differences in median testing RMSE are small, especially among the QRNN-based models. In general, however, MLP performs worst, followed by QRNN and CQRNN, with CQRNN* and MCQRNN offering slight improvements. In terms of robustness, as measured by the 5th and 95th percentiles of testing RMSE, MLP is clearly least robust, while MCQRNN tends to perform best, especially for example 3. For this example and the two non-normal error distributions, MCQRNN also outperforms CQRNN*, which points to added value of the non-crossing constraint. Overall, the MCQRNN model performs well on the synthetic data from Xu et al. (2017). In the next section, the modelling framework is applied to real-world climate data. As a proof of concept, rainfall IDF curves are estimated by MCQRNN at ungauged locations in Canada and, following Ouali and Cannon (2018), results are compared against those obtained from QRNN models.

4 Rainfall IDF curves

4.1 Data

The design of some civil infrastructure—hydraulic, hydrological, and water resource structures—is based on the design flood, which is the flood hydrograph associated with a specified frequency of occurrence or return period. In the absence of gauged discharge data, rainfall data are instead used to generate a design storm, which can then be transformed into synthetic peak streamflows for the return period of interest. The design storm provides the temporal distribution of rainfall intensities associated with a specified return period and duration. The necessary information on the frequency of occurrence, duration, and intensity of rainstorms is compactly summarized in an IDF curve, and hence IDF curves are key sources of information for engineering design applications. IDF curves provided by Environment and Climate Change Canada (ECCC) summarize the relationship between annual maximum rainfall intensity for specified frequencies of occurrence (2-, 5-, 10-, 25-, 50- and 100-year return periods, i.e., \(\tau =0.5,\,0.8,\,0.9,\,0.96,\,0.98,\,0.99\)-quantiles) and durations (\(D = 5\)-, 10-, 15-, 30- and 60-min, 2-, 6-, 12- and 24-h) at locations in Canada with long records of short-duration rainfall rate observations. Annual maximum rainfall rate data for durations from 5-min to 24-h are archived by ECCC as part of the Engineering Climate Datasets (Environment and Climate Change Canada 2014). The rainfall rate dataset is based on tipping bucket rain gauge observations at 565 stations across Canada (Fig. 5). Record lengths range from 10 to 81 years, with a median length of 25 years. Information on the observing program, quality control, and quality assurance methods is provided in detail by Shephard et al. (2014).

Fig. 5

Points (black circle) show locations of ECCC IDF curve stations; point size is proportional to station elevation. Shading indicates the climatological summer total precipitation (1971–2000)

Official ECCC IDF curves are constructed by first fitting the parametric Gumbel distribution to annual maximum rainfall rate series at each site for each duration. At the majority of stations, the actual curves are then based on best fit linear interpolation equations between log-transformed duration and log-transformed Gumbel quantiles for each of the specified return periods. For reference, IDF curves for Victoria Intl A, a station on the southwest coast of British Columbia, Canada, are shown in Fig. 6. Points indicate return values of rainfall intensity obtained from the fitted Gumbel distribution for each combination of return period and duration; the IDF curves for each return period are based on log–log interpolating equations through these points, and hence plot as straight lines.

Fig. 6

Example ECCC IDF data for Victoria Intl A (station 1018621) in British Columbia, Canada. Points (\(\varvec{{\times }}\)) show quantiles associated with 2-, 5-, 10-, 25-, 50- and 100-year (from bottom to top) return period intensities estimated by fitting the Gumbel distribution by the method of moments to annual maximum rainfall rate data for 5-, 10-, 15-, 30- and 60-min, 2-, 6-, 12- and 24-h durations (left to right). Lines are from best fit linear interpolation equations between log-transformed duration and log-transformed Gumbel quantiles for each return period

Naturally, the ECCC approach cannot provide quantile estimates for locations where short-duration rainfall observations are not recorded or available. Parametric extreme value distributions, fit in conjunction with regionalization or regional regression models, have been used to estimate IDF curves at ungauged locations in Canada by Alila (1999, 2000), Kuo et al. (2012) and Mailhot et al. (2013). As a non-parametric alternative to standard parametric approaches, Ouali and Cannon (2018) recently evaluated regional QRNN models for IDF curves at ungauged locations. While results suggest that the QRNN model can outperform standard parametric methods, further improvements are still possible. In particular, Ouali and Cannon (2018) fit separate QRNN models for each \(\tau \)-quantile and duration, which means that quantile crossing is possible; further, rainfall intensities may not increase as storm duration decreases. Instead, use of the MCQRNN is proposed to ensure non-crossing quantiles and a monotone decreasing relationship with increasing storm duration. Estimation at ungauged sites typically relies on pooling gauged data from a homogeneous region around the site of interest, whether in geographic space or some derived hydroclimatological space (Ouarda et al. 2001), and then fitting a regression model linking spatial covariates with the short-duration rainfall rate response. As the focus of this study is on methods for conditional quantile estimation, and not the delineation of homogeneous regions, regionalizations here are based on a simple geographic region-of-influence (Burn 1990) in which data from the 80 nearest gauged sites are pooled together to form the training dataset for the site of interest. Following Aziz et al. (2014), this emphasizes the use of data from a large number of sites rather than the most homogeneous sites; it is then up to the regression model to infer relevant covariate-response relationships from within this larger pool of data. In areas with low station density, however, it is questionable whether any statistical regional frequency analysis technique can be used to reliably estimate rainfall extremes. Performance in sparsely monitored regions will be explored as part of the subsequent model evaluation.

Based on this experimental design, observed short-duration rainfall rate data \(i_{D}\) for multiple durations D are used as the response variable in the MCQRNN model and spatial variables available over the domain—including at the ungauged location—are used as covariates in the regression equations. In this study, five covariates (\(\#I=5\)), including latitude (lat), longitude (lon), elevation (elev), and climatological total winter (DJF) and summer precipitation (JJA) (Fig. 5) (McKenney et al. 2011), are used alongside the two (\(\#M=2\)) monotone covariates [\(\tau \) and \(-\log (D)\)]. As an abbreviated example, stacked data matrices for a single site (\(s_{1}\)), two quantiles (\(\tau _{1}\) and \(\tau _{2}\)), and two durations (\(D_{1}\) and \(D_{2}\)), for N years of short-duration rainfall observations would take the form:

$$\begin{aligned} {\mathbf{y}}_{s_{1}}^{(S)}&=\left[ \begin{array}{c} i_{D_{1}}(1)\\ \vdots \\ i_{D_{1}}(N)\\ i_{D_{2}}(1)\\ \vdots \\ i_{D_{2}}(N)\\ i_{D_{1}}(1)\\ \vdots \\ i_{D_{1}}(N)\\ i_{D_{2}}(1)\\ \vdots \\ i_{D_{2}}(N) \end{array}\right] , \\ {\mathbf{X}}_{s_{1}}^{(S)}&=\left[ \begin{array}{ccccccc} \tau _{1} &{} -\hbox {log}(D_{1}) &{} \hbox {lat}(s_{1}) &{} \hbox {lon}(s_{1}) &{} \hbox {elev}(s_{1}) &{} \hbox {DJF}(s_{1}) &{} \hbox {JJA}(s_{1}) \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} -\hbox {log}(D_{1}) &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} -\hbox {log}(D_{2}) &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \tau _{1} &{} -\hbox {log}(D_{2}) &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \tau _{2} &{} -\hbox {log}(D_{1}) &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} -\hbox {log}(D_{1}) &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} -\hbox {log}(D_{2}) &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ \tau _{2} &{} -\hbox {log}(D_{2}) &{} \hbox {lat}(s_{1}) &{} \hbox {lon}(s_{1}) &{} \hbox {elev}(s_{1}) &{} \hbox {DJF}(s_{1}) &{} \hbox {JJA}(s_{1}) \end{array}\right] . \end{aligned}$$
(20)

For a given site of interest, the full stacked training dataset is expanded to include data from the 80 nearest gauged sites, 6 values of \(\tau \) (\(0.5,\,0.8,\,0.9,\,0.96,\,0.98,\,0.99\)), and 9 durations (5-, 10-, 15-, 30- and 60-min, 2-, 6-, 12- and 24-h).
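A base R sketch of this expanded stacking for a single site follows; the record length, site coordinates, and climatological covariate values are illustrative placeholders:

```r
# Sketch of the stacked IDF design of Eq. 20 for one site; values are placeholders
N <- 30                                             # years of record (illustrative)
D.min <- c(5, 10, 15, 30, 60, 120, 360, 720, 1440)  # the 9 durations, in minutes
taus <- c(0.5, 0.8, 0.9, 0.96, 0.98, 0.99)
i.obs <- matrix(rexp(N * length(D.min)), N)         # placeholder annual maxima (N x 9)
site <- c(lat = 48.6, lon = -123.4, elev = 19, DJF = 600, JJA = 100)
grid <- expand.grid(year = seq_len(N), D = D.min, tau = taus)
X.s <- cbind(tau = grid$tau,                        # monotone: increasing with tau
             neg.log.D = -log(grid$D),              # monotone: decreasing with D
             matrix(site, nrow(grid), length(site), byrow = TRUE,
                    dimnames = list(NULL, names(site))))
y.s <- i.obs[cbind(grid$year, match(grid$D, D.min))]
```

At an ungauged target location, the same five spatial covariates are available, so predictions can be generated there for any \(\tau \) and D within the fitted ranges.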

4.2 Cross-validation results

Regional MCQRNN and QRNN models for IDF curves are evaluated via leave-one-out cross-validation. Each of the 565 observing sites is treated, in turn, as being “ungauged”, i.e., data from the 80 nearest sites to each left-out site are used to fit the models, model predictions are made at the left-out site, and model performance statistics are calculated based on the left-out data. Following Ouali and Cannon (2018), 54 separate QRNN models are fit for each site, one for each combination of the 9 durations (\(D = 5\)-min to 24-h) and 6 \(\tau \)-quantiles (\(\tau = 0.5\)–0.99) reported in ECCC IDF curves. Each MCQRNN model combines data for all 9 values of D and fits non-crossing quantile curves for the 6 \(\tau \)-quantiles simultaneously.

Non-negativity constraints are imposed in both QRNN and MCQRNN models by setting g to the smooth ramp function (Cannon 2011). Monotonicity constraints—increasing with \(\tau \) and decreasing with D—are imposed in the MCQRNN model by adopting the MMLP architecture with additional monotone covariates [\(\tau \) and \(-\log (D)\)]. The optimum level of complexity for each kind of model is selected by minimizing QAIC (Koenker and Schorfheide 1994; Doksum and Koo 2000), here computed from the composite QR error function (e.g., Xu et al. 2017) and averaged over all sites, among candidates with \(J=1,\,2,\ldots ,\,5\). The number of hidden nodes J is fixed to the same value for all sites in the study domain. QAIC is minimized for QRNN models with \(J=1\) and MCQRNN models with \(J=3\).

Cross-validation results comparing the MCQRNN \((J=3)\) and QRNN \((J=1)\) models are reported in terms of relative differences in leave-one-out estimates of the quantile regression error function

$$\begin{aligned} \hbox {RD}_{\tau }=100\,\left( \frac{E_{{\tau }}^{(\mathrm{MCQRNN})}-E_{{\tau }}^{(\mathrm{QRNN})}}{E_{{\tau }}^{(\mathrm{QRNN})}}\right) \end{aligned}$$
(21)

summed over all stations for each return period and duration. Values are shown in Table 1a. Because the underlying model architecture is, aside from different values of J and inclusion of monotonicity constraints, fundamentally the same for the QRNN and MCQRNN models, it is not surprising that the two perform similarly well. MCQRNN and QRNN errors fall within 5% of one another for nearly all combinations of return period and duration, although MCQRNN tends to perform slightly better for short durations (\(D=5\)-min to 2-h) and QRNN for longer durations (\(D = 6\)–24-h). Poorer performance of the MCQRNN model in these cases is partly attributable to the smaller rainfall intensities that are associated with long duration storms being weighted less in the CQR cost function (Eq. 14) than the larger intensities that accompany short duration storms. This can be remedied by setting \(\omega _{\tau (s)}\propto \log (D)\) in Eq. 14. Results for the MCQRNN model with weighting are shown in Table 1b. Weighting improves performance for longer durations, while having minimal impact on shorter durations. Further results will be reported for the weighted MCQRNN model.
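Under the stacking of Eq. (20), such weights are a single column matched to the stacked rows; continuing the Sect. 4.1 sketch (whether, and how, the fitting routine accepts case weights should be checked against its documentation):

```r
# Case weights proportional to log duration, aligned with the stacked rows
w.s <- log(grid$D)
w.s <- w.s / sum(w.s)  # normalize so the weights sum to one
```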

Table 1 Summary of cross-validated relative differences \(\hbox {RD}_{\tau }\) (%) in quantile regression error stratified by duration D, for all stations, for MCQRNN models (a) without weighting and (b) with weighting proportional to \(\log (D)\)

Despite the similar levels of quantile error, the additional MCQRNN monotonicity constraints on \(\tau \) and D lead to IDF curves that are guaranteed to increase as occurrence frequency and storm duration decrease, properties that need not be present for QRNN predictions. This is evident for Victoria Intl A (Fig. 7), where quantile crossing and non-monotone behaviour with decreasing storm duration are noted for the 100-year QRNN model predictions (cf. Fig. 6).

Fig. 7

Leave-one-out predictions of IDF curves for 2-, 5-, 10-, 25-, 50- and 100-year (in rainbow colours from bottom to top) return period intensities for Victoria Intl A (station 1018621) from a QRNN models and b MCQRNN model (cf. Fig. 6). Points (black square) show observed annual maximum rainfall rate data for 5-, 10-, 15-, 30- and 60-min, 2-, 6-, 12- and 24-h durations

Each of the QRNN (\(J=1\)) models for the 54 combinations of \(\tau \) and D contain \(J\,(\#I+1)+J+1=1\,(5+1)+1+1=8\) parameters or 432 parameters in total. Because it borrows strength over \(\tau \) and D (\(\#M=2\)), the MCQRNN (\(J=3\)) model requires just \(J\,(\#I+\#M+1)+J+1=3\,(5+2+1)+3+1=28\) shared parameters for the same task. Given that the two models show similar levels of performance, parameters in the separate QRNN equations must be largely redundant. If model complexity is increased, for example to \(J=5\), the total number of estimated parameters is 1944 for QRNN (36 for each combination of \(\tau \) and D) versus 46 for MCQRNN. By way of comparison, the at-site (rather than ungauged) ECCC IDF curves require estimation of 30 parameters (18 Gumbel distribution and 12 interpolation equation parameters).

Do the non-crossing/monotonicity constraints and ability to borrow strength provide a guard against overfitting if MCQRNN model complexity is misspecified? Fig. 8 shows relative differences \(\hbox {RD}_{\tau }\) in cross-validated quantile regression error for MCQRNN and QRNN models with \(J=1,2,\ldots ,5\); in both cases, the optimal QRNN \((J=1)\) model serves as the reference. Consistent with results from QAIC model selection, cross-validated QRNN errors increase when \(J>1\). When using more than the recommended number of hidden nodes, the QRNN performs poorly, especially for long return period estimates. However, for MCQRNN, in the absence of underfitting (i.e., \(J=1\)), there is little penalty for specifying an overly complex model. Performance of the optimal MCQRNN \((J=3)\) model recommended by QAIC model selection is nearly identical to that of the misspecified \(J=5\) model. The non-crossing constraint provides strong regularization and resistance to overfitting.

Fig. 8

Cross-validated relative differences \(\hbox {RD}_{\tau }\) (%) in quantile regression error between MCQRNN and QRNN IDF curve predictions for \(J=1,2,\ldots ,5\) using QRNN \((J=1)\) as the reference model. Results are shown for 2-, 5-, 10-, 25-, 50- and 100-year return periods

Results reported so far have compared leave-one-out cross-validation performance of the MCQRNN and QRNN models. This does not provide any indication of how well the ungauged predictions compare with those estimated by the at-site ECCC IDF curve procedure, i.e., by fitting the Gumbel distribution and log linear interpolating equations to observed annual maxima at each station. Following Ouali and Cannon (2018), the ability of the MCQRNN to replicate the at-site ECCC IDF curves is measured by the quantile regression error ratio

$$\begin{aligned} \hbox {R}_{\tau }=\frac{E_{{\tau }}^{\prime (\mathrm{ECCC})}}{E_{{\tau }}^{(\mathrm{MCQRNN})}} \end{aligned}$$
(22)

where \(E_{{\tau }}^{\prime (\mathrm{ECCC})}\) is the in-sample, at-site quantile regression error of the ECCC IDF curve interpolating equations. A value of 1 means that ungauged MCQRNN predictions reach the same level of error as the at-site ECCC IDF curves. Note: even though the ECCC IDF curves are calculated from observations at each station, it is possible for \(R_{\tau }\) to exceed 1 as the annual maximum rainfall data may deviate from the assumed Gumbel distribution and log linear form of the interpolating equations. Results are summarized in Table 2. Values of \(R_{\tau }\) greater than 0.9—based on the 10% relative error threshold recommended by Mishra et al. (2012) for acceptable model simulations of urban rainfall extremes—are found for 41 of the 54 combinations of D and \(\tau \), including all return periods from 2- to 10-year. More broadly, values exceed 0.7 for all combinations of D and \(\tau \).

Table 2 Summary of quantile regression error ratio \(\hbox {R}_{\tau }\) stratified by duration D between at-site ECCC IDF curves and ungauged MCQRNN predictions for all stations

As shown in Fig. 5, stations are not evenly distributed across Canada; northern latitudes, in particular, are very sparsely gauged. Does MCQRNN performance depend on station density? Values of \(R_{\tau }\), stratified by the median distance of each ungauged station to its 80 neighbours, are shown in Fig. 9. As expected, errors are nearly equivalent (\(R_{\tau }>0.975\)) to the at-site estimates in areas of high station density (median distances \(<100\) km). Modest performance declines are noted (\(R_{\tau }>0.875\)) with increasing median distance up to 500 km, beyond which performance degrades more substantially, especially for the longest return periods (\(R_{\tau =0.99}<0.8\)). The viability of ungauged estimation should be evaluated carefully in areas of low station density.

Fig. 9

Mean quantile regression error ratio \(R_{\tau }\) between at-site ECCC IDF curves and leave-one-out cross-validated MCQRNN predictions; values of \(R_{\tau }\) are stratified according to the median distance between the left-out station and its 80 neighbouring stations. Each of the 10 distance groupings contains an approximately equal number of stations (56 or 57)

5 Conclusion

This study introduces a novel form of quantile regression that can be used to simultaneously estimate multiple non-crossing, nonlinear quantile regression functions. MCQRNN is the first neural network-based quantile regression model that guarantees non-crossing of regression quantiles. The model architecture, which is based on the standard MLP neural network, also allows optional monotonicity, positivity/non-negativity, and generalized additive model constraints to be imposed in a straightforward manner. As an extension, a simple way to control the strength of non-additive relationships is also provided. The Huber function approximation to the QR error function means that standard least-squares regression and non-crossing expectile regression functions can be estimated using the same model architecture.

Given its close relationship to composite QR models, MCQRNN is first evaluated using the Monte Carlo simulation experiments adopted by Xu et al. (2017) to demonstrate the CQRNN model. In comparison to the MLP, QRNN, and CQRNN benchmarks, MCQRNN is more robust, especially in the presence of non-normal error distributions. Next, the MCQRNN model is evaluated on real-world climate data by estimating rainfall IDF curves in Canada. Cross-validation results suggest that the MCQRNN effectively borrows strength across different storm durations and return periods, which results in a model that is robust against overfitting. In comparison to standard QRNN, the ability of the MCQRNN model to incorporate monotonicity constraints—rainfall intensity should increase monotonically as the occurrence frequency and storm duration decrease—leads to more realistic estimates of extreme rainfall at ungauged sites. While promising, use of the MCQRNN for IDF curve estimation is presented here as a proof of concept. Other avenues of research include a more principled consideration of regionalization (Ouarda et al. 2001), other covariates (Madsen et al. 2017), and comparison against a wider range of nonlinear methods (Ouali et al. 2017). The MCQRNN model architecture is extremely flexible and many of its features are also not explored in this study. For example, the use of different weights for each \(\tau \) in the composite QR error function (Jiang et al. 2012; Sun et al. 2013), multiple hidden layers, and the ability to estimate non-crossing, nonlinear expectile regression functions (Jiang et al. 2017) are left for future research.

Finally, code implementing the MCQRNN model is freely available from the Comprehensive R Archive Network as part of the qrnn package.