An Extreme Value Bayesian Lasso for the Conditional Left and Right Tails

We introduce a novel regression model for the conditional left and right tail of a possibly heavy-tailed response. The proposed model can be used to learn the effect of covariates on an extreme value setting via a Lasso-type specification based on a Lagrangian restriction. Our model can be used to track if some covariates are significant for the lower values, but not for the (right) tail---and vice-versa; in addition to this, the proposed model bypasses the need for conditional threshold selection in an extreme value theory framework. We assess the finite-sample performance of the proposed methods through a simulation study that reveals that our method recovers the true conditional distribution over a variety of simulation scenarios, along with being accurate on variable selection. Rainfall data are used to showcase how the proposed method can learn to distinguish between key drivers of moderate rainfall, against those of extreme rainfall.

The main contribution of this paper rests on a Bayesian regression model for the conditional lower values (i.e., left tail) and conditional (right) tail of a possibly heavy-tailed response. Our model pioneers the development of regression methods in an extreme value framework that allow for some covariates to be significant for the lower values but not for the tail (and viceversa). Some further comments on the proposed model are in order. First, the proposed model bypasses the need for (conditional) threshold selection; such selection is particularly challenging in a regression framework as it entails selecting a function of the covariate (say, u x ) rather than a single scalar. Second, our method models both the conditional lower values and the conditional tail, offering a full portrait of the conditional distribution, while still being able to extrapolate beyond observed data into the conditional tail. Finally, our method is directly tailored for variable selection in an extreme value framework; in particular, it can be used to track what covariates are significant for the lower values and tails.
In an extreme value framework, interest focuses on modelling the most extreme observations, disregarding the central part of the distribution. Usually, efforts center on modelling extreme values using the generalized extreme value distribution or the generalized Pareto distribution (Embrechts et al., 1997;Coles, 2001;Beirlant et al., 2004). Many observations are disregarded using the latter approaches, and the choice of the block size or threshold are far from straightforward. Moreover, in many situations of applied interest, it would be desirable to model both the lower values of the data along with the extreme values in a regression framework. Our model builds over Papastathopoulos and Tawn (2013) and Naveau et al. (2016) who proposed an extended generalized Pareto distribution (EGPD) to jointly model low, moderate and extreme observations-without the need of threshold selection; other interesting options for modeling both the bulk and the tail of a distribution, include extreme value mixture models (e.g., Frigessi et al., 2002;Behrens et al., 2004;Carreau and Bengio, 2009;Cabras and Castellanos, 2011;MacDonald et al., 2011;do Nascimento et al., 2012) as well as composition-based approaches (Stein, 2020(Stein, , 2021).
The proposed model can be regarded as a Bayesian Lasso-type model for the lower values and tail of a possibly heavy-tailed response supported on the positive real line. The Bayesian Lasso was introduced by Park and Casella (2008) as a Bayesian version of Tibshirani's Lasso (Tibshirani, 1996). Roughly speaking, the frequentist Lasso is a regularization method, that shrinks some regression coefficients, and sets others to zero; as such it is naturally tailored for variable selection. The frequentist Lasso solves a least squares type of optimization problem, where an L 1 -penalization is incorporated via a constraint. The starting point for the Bayesian Lasso is the fact that the posterior mode generated from a Laplace prior coincides with the solution to the Lasso least squares optimization problem with L 1 -penalization (Reich and Ghosh, 2019, Section 4.2.3). It should be noted however that the Bayesian Lasso does not set coefficients to zero.
Shrinkage in a Bayesian context has been traditionally achieved via priors with peaks at zero such as the Laplace. Other examples of continuous shrinkage priors include the Horseshoe (Carvalho et al., 2010), Dirichlet-Laplace (Bhattacharya et al., 2015), and R2D2 (Zhang et al., 2020). There are also discrete shrinkage priors (George and McCulloch, 1993) which are arguably more interpretable. Despite all these considerable advances in Bayesian variable selection, the Bayesian Lasso is simply one way to go, one that is associated L 1 -penalization, but other priors could be used to meet a similar target, leading to other geometries for the corresponding penalization. For a recent review of Bayesian regularization methods, see Polson and Sokolov (2019).
As we clarify below (Section 2.2), our model has also links with quantile regression. Indeed, by modeling both the conditional lower values and the conditional tail, our model bridges quantile regression (Koenker and Bassett, 1978) with extremal quantile regression (Chernozhukov, 2005). Finally, as a byproduct, this paper contributes to the literature on Bayesian inference for Pareto distributions (Arnold and Press, 1989;de Zea Bermudez and Turkman, 2003;Castellanos and Cabras, 2007;de Zea Bermudez and Kotz, 2010;Villa, 2017).
The rest of this paper unfolds as follows. In Section 2 we introduce the proposed methods.
We assess the performance of the proposed methods on simulated examples in Section 3, and report the main findings of our numerical studies. A data illustration is included in Section 4.

EXTREME VALUE BAYESIAN LASSO
To streamline the presentation, the proposed methods are introduced in a step-by-step fashion, with the most flexible version of our model being introduced in Section 2.4.

THE EXTENDED GENERALIZED PARETO FAMILY
Our starting point for modeling is the so-called extended generalized Pareto distribution (EGPD), as proposed in Papastathopoulos and Tawn (2013) and Naveau et al. (2016); the EGPD is a distribution over the positive real line, whose cumulative distribution function is F (y) = G{H(y)}, where G : [0, 1] → [0, 1] is a carrier function, which tilts the generalized Pareto distribution (GPD) function Here, σ > 0 is a scale parameter, ξ ∈ R is the shape parameter; the case ξ = 0 should be understood by taking the limit ξ → 0. The shape parameter ξ is termed extreme value index and it controls the rate of decay of the tail. Following Naveau et al. (2016), we assume that the carrier function G obeys the following conditions: and thus it can be understood as a tail-equivalence condition (Embrechts et al., 1997, Section 3.3), where y * = inf{y : F (y) < 1} is the so-called right endpoint, here assumed to be positive. Since tail-equivalence implies that both F and H are in the same domain of attraction (Resnick, 1971), it follows from Assumption A that ξ can be literally interpreted as the extreme value index of F (y) = G{H(y)}. We note further that Assumption C implies that small values follow a Weibull type GPD, and that the role and need of Assumption B is in fact questionable (Tencaliec et al., 2019); actually, the latter paper makes no reference to it.
For parsimony reasons, we focus on modeling G using a parametric class, so that G(v) ≡ G κ (v), with κ ∈ R q . The canonical example of a parametric carrier is G κ (v) = v κ , with κ > 0 controlling the shape of the lower tail, with a larger value of κ leading to less mass close to zero; we refer to the EGPD distribution with the latter carrier as the canonical EGPD.

TAIL OF A POSSIBLY HEAVY-TAILED RESPONSE
The first version of our model specifies the conditional distribution function where x = (x 1 , . . . , x p ) ∈ R p is a vector of covariates and κ(x) is a vector-valued function with components given by inverse link functions and G κ(x) ∈ G , for every x ∈ X ⊆ R p ; here and below, β j = (β 1,j , . . . , β p,j ) ∈ R p , for j = 1, . . . , q. For reasons that will become evident later, we do not allow for now (σ, ξ) to depend on the covariates x, but we will extend the specification in (1) (1) is obtained by specifying Example 2 (Beta carrier, exponential link). Another variant of (1) is obtained by specifying where is the distribution function of a Beta distribution with parameters (1/κ(x), 2).
A consequence of (1) is that (5) warrants some remarks on links with quantile regression. The first version of our model in (1) can be regarded as a model that bridges quantile regression (Koenker and Bassett, 1978) with extremal quantile regression (Chernozhukov, 2005), in the sense that it offers a way to model both moderate and high quantiles. Quantile regression (Koenker, 2005) allows for each τ th conditional quantile to have its own slope β τ , according to the following linear specification In the same way that high empirical quantiles fail to extrapolate into the tail of a distribution, the standard version of quantile regression in (6) is unable to extrapolate into the tail of the conditional response.
We describe next Bayesian Lasso modeling and tackle inference for the first version of our model. For parsimony, below we will focus on the version of the model that sets q = 1 in (2), so We underscore that what is studied in the next section applies with some minor modifications to the full model presented above; the focus on q = 1 (to which Examples 1 2 apply) in the next section eases however notation, and it obviates the need for discussing again the identifiability issues that we have alluded to already in Example 3.

REGULARIZATION AND BAYESIAN INFERENCE
be a random sample from F (x, y). We propose a Bayesian Lasso-type specification for the model in (1) so as to regularize the log likelihood. Specifically, let G κ(x) ∈ G , with κ(x) being a function of x, so that the likelihood becomes is the density of the generalized Pareto distribution, g κ(x) = dG κ(x) /dy, and (a) + = max(0, a) is the positive part function. In a Bayesian context, the constrained optimization problem underlying the Lasso can be equivalently rewritten as the posterior mode of regression parameters, provided that one assumes that those are (a priori) independent with identical Laplace priors (i.e., double exponential), that is, π(β) ∝ q j=1 e −λ/2 |β j | ; see Tibshirani (1996) When the goal is to set a prior on the space of heavy-tailed distributions, then it may be sensible to set π ξ = Gamma(a ξ , b ξ ), whereas π ξ = N(µ ξ , σ ξ ) may be sensible if all domains attraction are equally likely a priori. Since the posterior has no closed-form expression, we resort to Markov Chain Monte Carlo (MCMC) methods to sample from the posterior.
A shortcoming of the first version of the model is that it only allows for the effect of covariates on the lower values, but not on the tail; this is a consequence of the fact that only the part of the model that drives the lower values, κ(x), is indexed by a covariate. This issue is addressed in the next section.

EXTENSIONS FOR COVARIATE-ADJUSTED TAIL
In practice some covariates can be significant for the lower values but not for the tail-or the other way around. Thus, we extend the specification from Sections 2.2-2.3 by also allowing parameters underlying the tail to depend on covariates. Specifically, we consider the following specification: where H(y | x) is a reparametrized conditional generalized Pareto distribution, with parameters Here, κ(x) is a function as in (7) whereas ν(x) = φ(x T α) and ξ(x) = µ(x T γ), with φ and µ being inverse-link functions. The canonical embodiment of the full version of our model is assuming ξ(x) > 0. The schematic representation below summarizes our model: Extreme Value Bayesian Lasso for the Conditional Lower Values and Tail   (2nd version) 1. Likelihood The specification in (8) warrants some remarks: 1. The need for a reparametrization: While it would seem natural to simply index (σ, ξ) with a covariate and to proceed as in Sections 2.2-2.3, similarly to Chavez-Demoulin and Davison (2005) and Cabras and Castellanos (2011), who work in a GPD setting we found that approach to be computationally unstable; in particular, in our case the latter parametrization leads to poor mixing and to small effective sample sizes.
3. Fisher information: Parameter orthogonality (Young and Smith, 2005, Section 9.2) requires the computation of the Fisher information matrix. The canonical EGPD is a particular case of the so-called generalized Feller-Pareto distribution (Kleiber and Kotz, 2003), with (a, b, p, q, r) = (1, σ/ξ, κ, 1, 1/ξ), as where B(p, q) = 1 0 t p−1 (1 − t) q−1 dt, the entries of the Fisher information matrix follow from Mahmoud and Abd El-Ghafour (2015, Section 3); yet the matrix is rather intricate and thus we decided to look for a more sensible way to proceed than attempting to achieve parameter orthogonality.

SIMULATION SCENARIOS AND ONE-SHOT EXPERIMENTS
In this section, we describe the simulation scenarios and present one-shot experiments so as to illustrate the proposed method; a Monte Carlo study is presented in Section 3.2. For now, we focus on describing the setting over which we simulate the data and on discussing the numerical experiments. Code for implementing our methods is available from the Supplementary Material; we used jags (Plummer, 2019) as we have more experience with it, but stan (Gelman et al., 2015) would look like another natural option that could potentially yield faster mixing and more effective sample sizes at little extra programming cost. Throughout we use uninformative Gamma priors, Gamma(0.1, 0.1), for the hyperparameters λ k , λ ν , and λ ξ .
Data are simulated according to (8), with power carrier G κ(x) (v) = v κ(x) and with link functions κ x = exp(x T β), ν x = exp(x T α), ξ x = exp(x T γ). We consider four scenarios, described below, based on the following vectors for the regression coefficients: • Scenario 1-Light effects for lower values, light effects for tail: β = a; α = b; γ = c.
• Scenario 2-Light effects for lower values, large effects for tail: β = a; α = 2b; γ = 2c For each of the scenarios above, we simulated n = 500 observations, {(x i , y i )} n i=1 , from a conditional EGPD with a canonical carrier function; this is about the same number of observations available in the data application of Section 4. The covariates are simulated from independent standard uniform distributions. Figure 1 shows cross sections of the true and conditional densities for Scenarios 1-4 estimated using our methods.
As can be seen from these figures, the method recovers satisfactorily well the true conditional density-especially keeping in mind the sample size.

MONTE CARLO SIMULATION STUDY
To assess the finite-sample performance of the proposed methods, we now present the results of a Monte Carlo simulation study, where we repeat 250 times the one-shot experiments from Section 3.1. We consider the different sample sizes of n = 100, n = 250, and n = 500. We now move to the conditional density. To compare the fitted conditional density against the true as the sample size increases, we resort to the mean integrated squared error (MISE): where the expectation is taken so to summarize the average behavior of the randomness on the double integral that stems from the posterior mean estimate f (Wand and Jones, 1995, Section 2.3). Figure 4 shows a boxplot of the MISE of each run of the simulation experiment for Scenario 1. As can be seen from Figure 4, MISE tends to decrease as the sample size increases, thus indicating that a better performance of the proposed methods is to be expected on larger samples; the same holds for the remainder scenarios as can be seen from the boxplots of MISE available in the Supplementary Material. To examine the quality of each fit corresponding to a simulated dataset, {(x i , y i )} n i=1 , we use randomized quantile residuals (Dunn and Smyth, 1996) adapted to our model, that is, , where Φ −1 is the quantile function of the standard Normal distribution and F is the posterior mean estimate of the conditional distribution function. Figure 6 depicts the corresponding posterior mean QQplot of randomized quantile residuals against the theoretical standard normal quantiles, and it evidences an acceptably good fit of the model for Scenarios 1-4. Finally, to supplement the analysis, we also report in the Supplementary Material numerical experiments that tend to support the claim that the performance of the proposed methods is relatively satisfactory at separating covariate effects for the lower values and tail (Section 2.1 of Supplementary Material), and that this also holds in a misspecified setting. Despite the overall good performance, in all fairness not everything is perfect; specifically, our supporting numerical experiments reveal that: i) the correspondence between "non-constant" effects on lower quantiles and coefficients of κ(x) is weaker for Scenario 3; ii) bias appears on the lower conditional quantile function estimates, under misspecification.  To analyze such rainfall events in Funchal, Madeira, we have gathered data from the National Oceanic and Atmospheric Administration (www.noaa.gov). Specifically, we have gathered total monthly precipitation (.01 inches), as well as the following potential drivers for extreme rainfall:  1979, 1993, 2001(Baioni et al., 2011. After eliminating the dry events (i.e., zero precipitation) and the missing precipitation data (two observations), we are left with a total of 532 observations.
The potential drivers for extreme rainfall above have been widely examined in the climatological literature, mainly on large landmasses. In particular, it has been suggested that in North America ENSO, PDO, and NAO play a key role governing the occurrence of extreme rainfall events (Kenyon and Hegerl, 2010;Zhang et al., 2010;Whan and Zwiers, 2017); yet for the UK, while NAO is believed to impact the occurrence of extreme rainfall events, no influence of PDO nor AMO has been detected (Brown, 2018  ii) whether such drivers are also relevant for moderate rainfall events.

TRACKING DRIVERS OF MODERATE AND EXTREME RAIN-FALL
One of the goals of the analysis is to use the lenses of our model so to learn what are the drivers of moderate and extreme rainfall in Funchal. To conduct such inquiry, we use the full model from Section 2.4 (see Eq. (9)), with power carrier G κ(x) (v) = v κ(x) and with link functions Covariates have been standardized and we used a flat Normal prior, N(0, 100 2 ), for the intercepts, and uninformative Gamma priors, Gamma(0.1, 0.1), for the hyperparameters λ k , λ ν , λ ξ ; here an identity link was used for ξ(x) as fitting a GPD to the response via likelihood methods suggested no compelling evidence in favor of an heavy-tailed response. After a burn-in period of 10 000 iterations, we collected a total of 20 000 posterior samples. The results of MCMC convergence diagnostics were satisfactory; in particular, the effective sample sizes were acceptably high, ranging from about 1000 to 5 000, and all values of Geweke's diagnostic were satisfactory, ranging from about −3.2 to 3.0. Figure 5 depicts the credible intervals for each regression coefficient. To assess the fit of the proposed model we resort once more to randomized quantile residuals (Dunn and Smyth, 1996). Figure 6 evidences an acceptably good fit of the model; while the pointwise bands in Figure 6 are narrow, they result from acceptably high effective sample sizes as mentioned above. Zwiers (2017) for North America. Yet, similarly to Brown (2018), we find no relevant impact of PDO, given the rest, on extreme rainfall spells. Interestingly, of all these potencial drivers for extreme rainfall, only ENSO seems to be statistically associated with moderate rainfall; see Figure 5 (left). Also, NP is the most relevant driver of moderate rainfall, and yet the analysis suggests it plays no role on extreme rainfall. Additionally, Figure 5 (left) suggests that an increase in NP leads to a heavier left tail (i.e., dryer months), whereas an increase in ENSO leads to less mass close to zero (i.e., rainy months). Finally, to supplement the analysis, we also present conditional quantiles in the Supplementary Material so to directly assess how rainfall itself is impacted by all these potential drivers.

CLOSING REMARKS
In this paper we have introduced a Lasso-type model for the lower values and tail of a possibly heavy-tailed response. The proposed model: i) bypasses the need for threshold selection (u x ); ii) models both the conditional lower values and conditional tails; iii) it is naturally tailored for variable selection. In addition, contrarily to GPD-based approach of Davison and Smith (1990) the proposed model does not suffer from lack of threshold stability, as indeed our model does not depend on a threshold; the fact that the Davison-Smith approach does not retain the threshold stability property is known since (Eastoe and Tawn, 2009, Section 2.2), and it implies that selection of covariates in the parameters of the GPD is not invariant to threshold choice. Yet, other approaches that circumvent the threshold stability issue are available beyond our model, including an inhomogeneous point process formulation for extremes (Coles, 2001, Chapter 7).
Interestingly, the proposed model can be regarded as a Lasso-type model for quantile regression that is tailored for both the lower values and for extrapolating into the tails. As a byproduct, our paper has links with the Bayesian literature on Bayesian distributional regression (Umlauf and Kneib, 2018), and with the recent paper of Groll et al. (2019).
Despite the considerations above on threshold selection there is no 'free lunch'; indeed, here the parameters of the model and the choice of the carrier dictate the behavior of the quantiles in the whole range of the distribution, while the impact of some parameters is smaller in the lower or upper tail. Yet, competing EGPD models that stem from different choices for the carrier function can be formally compared and ranked via standard model selection criteria, such as Log Pseudo Marginal Likelihood (e.g. Geisser and Eddy, 1979;Gelfand and Dey, 1994).
Some final comments on future research are in order. A version of our model that includes additionally a regression model for point masses at zero would be natural for a variety of contexts, such as for modeling long periods without rain or droughts. Semicontinuous responses that consist of a mixture of zeros and a continuous positive distribution are indeed common in practice (e.g., Olsen and Schafer, 2001). Another natural avenue for future research would endow the model with further flexibility by resorting to a generalized additive model, where the smooth function of each covariate is modeled using B-spline basis functions. The latter extension would however require a group lasso (Yuan and Lin, 2006), shrinking groups of regression coefficients (per smooth function) towards zero. While we focus here on modeling positive random variables, another interesting extension of the proposed model would consider the left endpoint itself as a parameter rather than fixing it at zero. Finally, from an inference perspective it would also seem natural defining a semiparametric Bayesian version of our model that would set a prior directly on the covariate-specific carrier function G x rather than specifying a power carrier function in advance, and to model the conditional density as F (y | x) = G x {H(y | x)}. This approach would however require setting a prior on the space G of all carrier functions so as to ensure that G x obeys Assumptions A-C, for all x. While priors on spaces of functions are an active field of research (Ghosal and Van der Vaart, 2015), definition of priors over G is to our knowledge an open problem; a natural line of attack to construct such a prior would be via random Bernstein polynomials (Petrone and Wasserman, 2002), with weight constraints derived from Lemma 1 of Tencaliec et al. (2019); indeed, by modeling the carrier function with a Bernstein polynomial basis a higher level of flexibility could be achieved in comparison to that offered by Examples 1-3.