Truncated pair-wise likelihood for the Brown-Resnick process with applications to maximum temperature data

Max-stable processes are a natural extension of multivariate extreme value theory and are important for modeling the spatial dependence of environmental extremes. Inference for max-stable processes observed at several spatial locations is challenging due to the intractability of the full likelihood function. Composite likelihood methods avoid these difficulties by combining a number of low-dimensional likelihood objects, typically defined on pairs or triplets of spatial locations. This work develops a new truncation procedure based on an ℓ1-penalty to reduce the complexity and computational cost associated with the composite likelihood function. The new method is shown to achieve a favorable trade-off between computational burden and statistical efficiency by dropping a number of noisy sub-likelihoods. The properties of the new method are illustrated through numerical simulations and an application to real extreme temperature data.


Introduction
Weather and climate extremes are well known for their environmental, social and economic impact, with heat waves, droughts, floods, and hurricanes being common examples. The widespread use of geo-referenced data, together with the need to monitor extreme events, has motivated a growing interest in statistical methods for spatial extremes. At the same time, the availability of inference methods able to estimate accurately the severity of heat extremes is important to understand, prepare for and adapt to future environmental changes. This work is motivated by the analysis of extreme temperature data recorded in the state of Victoria, Australia, by the Bureau of Meteorology (BoM) (http://www.bom.gov.au/climate/data), the national meteorological service of Australia, which monitors the local climate, including extreme weather events.
An important class of statistical models for spatial extremes are the so-called max-stable processes, which provide a theoretically justified description of extreme events measured at several spatial locations. Smith (1990) proposed an easy-to-interpret max-stable model based on storm profiles. Despite its widespread use, the Smith model is often criticized for its lack of realism due to excessive smoothness. A more useful max-stable process is the Brown-Resnick process, a generalization of the Smith model able to describe a wider range of extremal dependence regimes (Brown and Resnick 1977; Kabluchko et al. 2009). Reviews of max-stable models and inference are given by Davison et al. (2012) and Davison and Huser (2015).
Inference for max-stable models is generally difficult due to the computational intractability of the full likelihood function. These challenges have motivated the use of composite likelihood (CL) methods, which avoid dealing with intractable full likelihoods by taking a linear combination of low-dimensional likelihood score functions (Lindsay 1988; Varin et al. 2011). Various composite likelihood designs have been studied for max-stable models. Davison et al. (2012) and Huser and Davison (2013) consider pair-wise likelihood estimation based on marginal likelihoods defined on pairs of sites. In the context of the Smith model, Genton et al. (2011) show that estimation based on triple-wise likelihoods (i.e. combining sub-likelihoods defined on three sites) is more efficient than pair-wise likelihood. For the more realistic Brown-Resnick model, however, Huser and Davison (2013) show that the efficiency gains from using triple-wise likelihood are modest.
The choice of the linear coefficients combining the partial log-likelihood score objects has important repercussions on both the efficiency and the computational cost of the final estimator. Cox and Reid (2004) discuss the substantial loss of efficiency for pair-wise likelihood estimators when a large number of correlated scores is included. In the context of max-stable processes, various works have aimed at improving efficiency and computation based on the idea that sub-likelihoods defined on nearby locations are generally more informative about dependence parameters than those for distant locations. Sang and Genton (2014) consider a weighting strategy for sub-likelihoods based on tapering, which excludes distant pairs or triples to improve statistical efficiency. Their method improves efficiency compared to uniform weights, but tuning of the tapering function is computationally intensive. Castruccio et al. (2016) consider combining sub-likelihoods defined on more than three locations at a time, and show the benefits of likelihood truncation obtained by retaining partial likelihood pairs for nearby locations. In a different direction, other studies have focused on direct approximation of the full likelihood. Huser et al. (2019) consider full-likelihood-based inference through a stochastic Expectation-Maximisation algorithm. Thibaud et al. (2016) consider a Bayesian approach where the full likelihood is constructed by considering a partition of the data based on occurrence times of maxima within blocks. Although the current full likelihood approaches do not directly require the computation of the full likelihood as a sum over all partitions, their application is still hindered by issues of computational efficiency when the number of measuring sites is large.
On the other hand, composite likelihood methods offer considerable computational advantages compared to full likelihood approaches, although they may lack statistical efficiency when too many correlated sub-likelihood objects are combined.
The main contribution of this work is the application of the general composite likelihood truncation methodology of Huang and Ferrari (2017) to max-stable models and pair-wise likelihood for the analysis of extreme temperature data; the new method is referred to as truncated pair-wise likelihood (TPL) hereafter. In the proposed TPL estimation, a data-driven combination of pair-wise log-likelihood objects is obtained by optimizing statistical efficiency, subject to an ℓ1-penalty discouraging the inclusion of too many terms in the final estimating equations. Whilst the basic method of Huang and Ferrari (2017) had a single linear coefficient for each sub-likelihood object, here we extend that approach by allowing parameter-specific coefficients within each pair-wise likelihood score. This generalization is shown to improve the stability of the truncated estimating equations and the statistical accuracy of the final estimator. The proposed ℓ1-penalty enables us to retain only informative sub-likelihood objects corresponding to nearby pairs. This reduces the final computational cost and yields estimators with considerable efficiency gains compared to the pair-wise likelihood estimator with equal coefficients commonly adopted in the spatial extremes literature.
The rest of the paper is organized as follows. In Section 2, we review max-stable processes and the Brown-Resnick model. In Section 3, we describe the main methodology for likelihood truncation and parameter estimation within the pair-wise likelihood estimation framework. In Section 4, we carry out Monte Carlo simulations to illustrate the properties of the method and compare it with other pair-wise likelihood strategies in terms of computational burden and statistical efficiency. In Section 5, we apply the method to analyze extreme temperature data recorded in the state of Victoria, Australia. In Section 6, we conclude and give final remarks.

The Brown-Resnick process
Following Huser and Davison (2013), the Brown-Resnick process (Brown and Resnick 1977; Kabluchko et al. 2009) can be defined as the stationary max-stable process with spectral representation Z(x) = sup_{i∈N} W_i(x)/T_i, x ∈ X ⊆ R², where 0 < T_1 < T_2 < ⋯ are the points of a Poisson process on R_+, and W_1(x), W_2(x), … are independent replicates of the random process W(x) = exp{ε(x) − γ(x)}, x ∈ X. Here ε(x) is a Gaussian process with stationary increments such that ε(0) = 0 almost surely, and γ(h) is the semi-variogram of ε(x). The process Z(x) may be interpreted as the maximum of random storms W_i(x) of size 1/T_i.
Let s be the total number of locations under consideration. The s-dimensional distribution function of the process {Z(x), x ∈ X} measured at the set of locations S = {x_1, …, x_s} ⊂ X can be written as

Pr{Z(x_1) ≤ z_1, …, Z(x_s) ≤ z_s} = exp{−V(z_1, …, z_s; θ)},

where V(·) is the so-called exponent measure function. Different max-stable models are obtained by specifying the exponent measure V(·) through the choice of the semi-variogram γ(·). For example, the Brown-Resnick model can be specified by the parametric variogram model γ(h; θ) = (‖h‖/ρ)^α with θ = (α, ρ)⊤, where ρ > 0 and 0 < α ≤ 2 are the range and smoothness parameters, respectively. When α = 2 the Brown-Resnick process has maximum smoothness, with semi-variogram γ(h) = h⊤Ωh for some covariance matrix Ω; in this case the Brown-Resnick process coincides with the Smith process (Kabluchko et al. 2009; Padoan et al. 2010). Figure 1 shows semi-variograms of the Brown-Resnick process for different specifications of the smoothness (α) and range (ρ) parameters (top row; solid, dashed and dotted curves in each plot correspond to ρ = 0.5, 1.0 and 1.5, respectively), together with individual realizations of the corresponding Brown-Resnick processes at a single site (bottom row). The variogram increases as ρ decreases, whilst its shape can be convex (α > 1), linear (α = 1), or concave (α < 1).

Marginal pair-wise density functions
Let z = (z_1, …, z_s)⊤ be a sample with z_j = Z(x_j) denoting the realization at site j, for the set of sites S = {x_1, …, x_s} ⊂ X. The density of {Z(x_1), …, Z(x_s)} can be written as

f(z; θ) = exp{−V(z; θ)} Σ_{ξ∈P_s} Π_{j=1}^{|ξ|} { −∂^{|ξ_j|} V(z; θ)/∂z_{ξ_j} },

where P_s denotes the set of all possible partitions of the set {x_1, …, x_s}, ξ = (ξ_1, …, ξ_l), |ξ| = l is the size of the partition ξ, and ∂^{|ξ_j|}/∂z_{ξ_j} denotes the mixed partial derivative with respect to the elements of z indexed by ξ_j (Padoan et al. 2010). Since the cardinality of P_s (the Bell number of order s) increases more than exponentially with the number of sites s, the density and the full likelihood function are unavailable for an arbitrary number of sites due to the storage and computation of an exploding number of derivatives.
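The partition count |P_s| is the Bell number B_s, which makes the combinatorial explosion above concrete. The following sketch (our illustration, not part of the original analysis) computes B_s via the standard Bell-triangle recurrence:

```python
def bell_numbers(n_max):
    """Return [B_0, B_1, ..., B_{n_max}] via the Bell-triangle recurrence."""
    bells = [1]              # B_0 = 1: the empty set has one partition
    row = [1]
    for _ in range(n_max):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for entry in row:
            new_row.append(new_row[-1] + entry)
        bells.append(new_row[0])
        row = new_row
    return bells

print(bell_numbers(15)[10])  # B_10 = 115975: already over 10^5 terms for s = 10 sites
```

Even for a modest network of s = 10 sites the full density is a sum of 115,975 partition terms, which explains why only low-dimensional margins are used in practice.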
Although the full density cannot be computed unless s is trivially small, low-dimensional densities are readily available. The bivariate exponent measure for the Brown-Resnick process is

V(z_j, z_k; θ) = (1/z_j) Φ{a(θ)/2 + log(z_k/z_j)/a(θ)} + (1/z_k) Φ{a(θ)/2 + log(z_j/z_k)/a(θ)},

where a(θ) = {2γ(x_j − x_k; θ)}^{1/2} and Φ(·) is the standard normal distribution function. Let m = s(s − 1)/2 be the total number of pairs (z_j, z_k), 1 ≤ j < k ≤ s, obtained from the elements of z, and let r = 1, …, m index the site pairs {(j, k) : 1 ≤ j < k ≤ s}. The bivariate density function f_r(z_j, z_k; θ) is obtained by direct differentiation as

f_r(z_j, z_k; θ) = exp(−V_r) (V̇_j V̇_k − V̈_jk),

where V_r = V(z_j, z_k; θ) and V̇_j = ∂V_r/∂z_j, V̇_k = ∂V_r/∂z_k, V̈_jk = ∂²V_r/(∂z_j ∂z_k) are the corresponding partial derivatives given in the Appendix.
For the case of two sites (s = 2), the bivariate extremal coefficient η_2(h; θ), defined through Pr{Z(x_j) ≤ z, Z(x_k) ≤ z} = exp{−η_2(h; θ)/z} with h = x_j − x_k, equals η_2(h; θ) = 2Φ{a(θ)/2} for the Brown-Resnick model. It takes values in [1, 2], with η_2 = 1 corresponding to complete dependence and η_2 = 2 to complete independence; see Cooley et al. (2006).
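The bivariate quantities above are simple to evaluate numerically. The sketch below is our illustration (function names are ours); it assumes the power variogram γ(h; θ) = (‖h‖/ρ)^α and a(θ) = {2γ(h; θ)}^{1/2} as defined above, and uses only the error function to obtain Φ:

```python
import math

def std_normal_cdf(x):
    # Phi(x) via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def variogram(h, rho, alpha):
    # Power variogram gamma(h) = (||h|| / rho)^alpha; h is a scalar distance
    return (h / rho) ** alpha

def exponent_measure(z1, z2, h, rho, alpha):
    # Bivariate exponent measure V(z1, z2) of the Brown-Resnick process,
    # with a = sqrt(2 * gamma(h))
    a = math.sqrt(2.0 * variogram(h, rho, alpha))
    return (std_normal_cdf(a / 2.0 + math.log(z2 / z1) / a) / z1 +
            std_normal_cdf(a / 2.0 + math.log(z1 / z2) / a) / z2)

def extremal_coefficient(h, rho, alpha):
    # eta_2(h) = 2 * Phi(a / 2): 1 = complete dependence, 2 = independence
    a = math.sqrt(2.0 * variogram(h, rho, alpha))
    return 2.0 * std_normal_cdf(a / 2.0)
```

As a sanity check, η_2(h) → 1 as ‖h‖ → 0 and η_2(h) → 2 as ‖h‖ → ∞, and on the diagonal V(z, z) = η_2(h)/z.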

Truncated pair-wise likelihood by ℓ1-penalization
In this section we describe our likelihood truncation approach and the related pair-wise inference. For concreteness, we focus on pair-wise inference and the Brown-Resnick model with variogram γ(h) = (‖h‖/ρ)^α. In principle, the proposed approach may also be applied to other composite likelihood designs besides pair-wise likelihood (e.g. triple-wise likelihood) and to other max-stable models.
For inference, we consider weighted pair-wise likelihood estimators (PLEs), θ̂_w, found by solving the estimating equations

0 = Σ_{i=1}^n u^{(i)}(θ, w),    (7)

where u^{(i)}(θ, w) is the ith realization of u(θ, w) = W⊤ u(θ), u(θ) is the 2m × 1 vector stacking the pair-wise score components for the two parameters, and W is the 2m × 2 block-diagonal matrix with columns (w_α⊤, 0⊤)⊤ and (0⊤, w_ρ⊤)⊤, where w_α = (w_{α,1}, …, w_{α,m})⊤ and w_ρ = (w_{ρ,1}, …, w_{ρ,m})⊤ are the vectors containing parameter-specific coefficients for the score components and 0 is an m × 1 vector of zeros. Here w = (w_α⊤, w_ρ⊤)⊤ is the 2m × 1 vector containing all the coefficients, which we refer to as the composition rule in the rest of the paper. A popular choice for w in applications is the vector with uniform elements w_α = w_ρ = (1, …, 1)⊤, corresponding to the uniform pair-wise likelihood estimator (UPLE).
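The assembly of the weighted score is a small linear-algebra step. The sketch below is our reading of the block structure of W (the stacking order of u, with the m α-scores first, is our assumption):

```python
import numpy as np

def combined_score(u, w_alpha, w_rho):
    """Combine the 2m stacked pair-wise score components into the
    2-dimensional weighted score u(theta, w) = W' u(theta).

    u       : (2m,) vector; first m entries are the alpha-scores u_{alpha,r},
              last m entries the rho-scores u_{rho,r}
    w_alpha : (m,) coefficients for the alpha-scores
    w_rho   : (m,) coefficients for the rho-scores
    """
    m = len(w_alpha)
    W = np.zeros((2 * m, 2))
    W[:m, 0] = w_alpha   # block-diagonal composition matrix
    W[m:, 1] = w_rho
    return W.T @ u       # 2 x 1 combined score
```

With uniform weights this reduces to summing the m pair-wise scores for each parameter, i.e. the UPLE case.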
The PLE is a popular estimator for max-stable models due to its flexibility and well-known asymptotic properties. In particular, √n(θ̂_w − θ) converges to a bivariate normal distribution with zero mean vector and asymptotic covariance matrix G_w(θ)^{−1}, where

G_w(θ) = H_w(θ)⊤ J_w(θ)^{−1} H_w(θ)    (9)

is the so-called Godambe information matrix, and H_w(θ) = E[∂u(θ, w)/∂θ⊤] and J_w(θ) = Var[u(θ, w)] are the 2 × 2 sensitivity and variability matrices. Although the PLE is consistent, its variance can be much larger than that of the maximum likelihood estimator, depending on the choice of the composition rule w. If the composition rule w has all nonzero elements, the matrices H_w and J_w involve O(s²) and O(s⁴) terms, respectively. Thus, when the number of sites s is moderate or large, the presence of many correlated pair-wise scores can inflate J_w and the implied asymptotic variance G_w(θ)^{−1}. From a computational viewpoint, another drawback is that finding the standard errors of the PLE is computationally expensive for large s due to the presence of many terms in J_w. In the following sections, we describe a likelihood truncation methodology able to reduce the computational burden while avoiding issues related to variance inflation.

Truncation by ℓ1-norm penalization
To increase the statistical performance of pair-wise likelihood estimation for max-stable models while reducing the computing cost, we adopt a new truncation strategy for the estimating equations. The resulting composition rule contains a number of zero elements, which implies simplified pair-wise likelihood equations with fewer terms. We propose to find such a composition rule by minimizing the distance from the unknown full likelihood score, subject to an ℓ1-norm penalty representing the likelihood complexity. The procedure may be regarded as maximizing statistical accuracy for a certain level of afforded computing. Specifically, we aim to solve the PL estimating equations 0 = Σ_{i=1}^n u^{(i)}(θ, w) in (7) with respect to θ, with w = w(θ) found by minimizing with respect to w the ideal criterion

Q_λ(θ, w) = Σ_{l∈{α,ρ}} [ E{u_l^{ML}(θ) − w_l⊤ u_l(θ)}² + λ_l Σ_{r=1}^m |w_{l,r}| ],    (10)

where λ = (λ_α, λ_ρ)⊤ is a 2 × 1 vector of tuning parameters with non-negative elements, u_l(θ) = (u_{l,1}(θ), …, u_{l,m}(θ))⊤ collects the pair-wise score components for parameter l, and u_l^{ML}(θ) = ∂ log f(z_1, …, z_s; θ)/∂l, l ∈ {α, ρ}, denotes the corresponding element of the unknown maximum likelihood score function. It is worth keeping in mind that the minimizer w(θ) also depends on the tuning value λ.
The term E{u_l^{ML}(θ) − w_l⊤ u_l(θ)}² in (10) represents the distance between the pair-wise score and the maximum likelihood score. Thus, the particular case λ_l = 0 gives fixed-sample optimality (O_F-optimality), defined as the projection of the ML score onto the linear space spanned by the partial scores (Heyde 2008). Without additional constraints, however, we have no way to reduce the likelihood complexity, since all the pair-wise score terms are in principle included in the final estimating equation. On the other hand, for sufficiently large λ_l > 0, the penalty Σ_{r=1}^m λ_l |w_{l,r}|, l ∈ {α, ρ}, implies truncated estimating equations by preventing the inclusion of noisy terms in the pair-wise likelihood score u(θ, w). This is analogous to ℓ1-penalized least-squares approaches in regression (e.g. see Efron et al. (2004)). However, while in regression the penalty involves the regression coefficients directly, our penalty does not involve the statistical parameter θ but only the composition rule w.
Due to the geometry of the 1 -norm penalty, the composition rule w(θ) minimizing (10) contains an increasing number of zero elements as λ l grows. Therefore, such a penalty is effectively a constraint on the computing cost (or, equivalently, on the likelihood complexity). This means that the truncated solution w(θ) can be interpreted as one that maximizes statistical efficiency for a given level of computing. Alternatively, it may be interpreted as one maximizing computational efficiency for a given level of efficiency.
Direct minimization of Q_λ(θ, w) is not possible in practice due to the presence of the intractable likelihood score u_l^{ML} and of the expectations in (10). To eliminate the explicit dependence on the ML score, note that the expectation in (10) can be written as

E{u_l^{ML}(θ) − w_l⊤ u_l(θ)}² = c − 2 w_l⊤ E[u_l^{ML}(θ) u_l(θ)] + w_l⊤ E[u_l(θ) u_l(θ)⊤] w_l,

where c is a term not depending on w_l. Dependence on the ML score is avoided by replacing the term E[u_l^{ML}(θ) u_l(θ)] by the diagonal vector of the score covariance matrix. To see this, note that each partial score u_{l,r}(θ) defines an unbiased estimating equation, i.e. it satisfies E u_{l,r}(θ) = 0. This implies the important relationship

E[u_l^{ML}(θ) u_{l,r}(θ)] = −E[∂u_{l,r}(θ)/∂l] = E[u_{l,r}(θ)²], r = 1, …, m,    (12)

where the first equality in (12) is obtained by differentiating E u_{l,r}(θ) = 0 under the integral sign, whilst the second equality is Bartlett's identity applied to the rth pair-wise likelihood. In vector form, unbiasedness implies the important relationship

E[u_l^{ML}(θ) u_l(θ)] = (E[u_{l,1}(θ)²], …, E[u_{l,m}(θ)²])⊤.    (13)

The last expression in (13) can be written as diag{E[u_l(θ) u_l(θ)⊤]}, with diag(A) denoting the vector collecting the diagonal elements of the square matrix A. Finally, replacing the covariance matrix E[u_l(θ) u_l(θ)⊤] by its empirical counterpart Ŝ_l(θ) = n^{−1} Σ_{i=1}^n u_l^{(i)}(θ) u_l^{(i)}(θ)⊤, and dropping the constant c, yields the empirical objective

Q̂_λ(θ, w) = Σ_{l∈{α,ρ}} [ w_l⊤ Ŝ_l(θ) w_l − 2 w_l⊤ diag{Ŝ_l(θ)} + λ_l Σ_{r=1}^m |w_{l,r}| ].    (14)

For a given θ, we minimize Q̂_λ(θ, w) to obtain the empirical composition rule ŵ(θ). Further insight into the solution of the above minimization program may be helpful. The truncated composition rule solving the empirical objective (14) contains elements that are exactly zero when the corresponding sub-likelihood scores are weakly correlated with the maximum likelihood score. To see this, let ŵ = ŵ(θ) be the minimizer of (14) with θ fixed and, for simplicity, equal to the true parameter value. Then the truncated composition rule ŵ = (ŵ_α⊤, ŵ_ρ⊤)⊤ minimizing the empirical objective (14) has the form

ŵ_{l,A} = Ŝ_{l,A}(θ)^{−1} [ diag{Ŝ_l(θ)}_A − (λ_l/2) sign(ŵ_{l,A}) ],  ŵ_{l,r} = 0 for r ∈ {1, …, m}\A,    (15)

where A ⊆ {1, …, m} is the index set of the selected scores, i.e. those satisfying

| n^{−1} Σ_{i=1}^n u_{l,r}^{(i)}(θ) [ u_{l,r}^{(i)}(θ) − ŵ_l⊤ u_l^{(i)}(θ) ] | ≥ λ_l/2,    (16)

and sign(w) denotes the vector sign function with rth element taking values −1, 0 and 1 if w_r < 0, w_r = 0 and w_r > 0, respectively. The details of the derivation of the solution (15) of the optimization problem (14) are found in Theorem 3.2 of Huang and Ferrari (2017).
Here u_{l,r}^{(i)}(θ) − ŵ_l⊤ u_l^{(i)}(θ) is the residual difference between the rth score component and the combined pair-wise likelihood score, and Ŝ_{l,A} is the covariance sub-matrix for the selected scores. One can show that the empirical average on the left-hand side of (16) approximates Cov(u_{l,r}, u_l^{ML} − Σ_j ŵ_{l,j} u_{l,j}), i.e. the covariance between the score for the rth pair and the residual difference between the maximum likelihood and pair-wise likelihood scores. This means that our truncation approach retains only the pair-wise score terms u_{l,r} able to explain the gap between the full likelihood score and the pair-wise score, while dropping the remaining scores.
One should note that our methodology relies on certain desirable asymptotic properties, including unbiasedness of the truncated composite likelihood estimating equations. These are guaranteed only under certain regularity conditions which, unfortunately, are not straightforward to verify for the Brown-Resnick model. For unbiasedness of the selected equations with fixed weights, one important condition is differentiability in quadratic mean of each pair-wise log-likelihood function. Following Corollary 4.6 in Dombry et al. (2016), the Brown-Resnick model on a pair of sites automatically satisfies their conditions A1-A3. This also implies that the overall pair-wise log-likelihood with fixed coefficients is differentiable in quadratic mean. One complication is that, differently from the usual composite likelihood setting (e.g. see Padoan et al. (2010)), in our method the weights for the pair-wise likelihoods depend on the parameter θ, and in practice such weights are estimated from the data by plugging in a root-n consistent estimator. As a consequence, additional regularity conditions concerning convergence in probability of such weights are needed. Using arguments analogous to Huang and Ferrari (2017), one main condition is that the matrix of pair-wise scores is dominated by an integrable function not depending on the parameters. Finally, following Dombry et al. (2016), identifiability for pair-wise estimation holds if the Euclidean distances among any three sites are not all equal.

Implementation and computational aspects
The analysis in Huang and Ferrari (2017) shows that Q̂_λ(θ, w) is a consistent estimate of the population criterion Q_λ(θ, w) (up to an irrelevant additive term not depending on w) as long as θ is in a root-n neighborhood of the true parameter value. Thus, we start by taking a computationally cheap and consistent preliminary estimate and then use the truncation method described in Section 3.2 to improve upon this initial estimate. In practice, our truncation procedure is applied through the following steps: Step 0) Initialization: find a root-n consistent estimate θ̃. This can be achieved by solving the estimating equation (7) with w_k ∼ Bernoulli(π), 1 ≤ k ≤ 2m, where π is a desired fraction of initial nonzero coefficients.
Step 1) Truncation: compute the truncated composition rule ŵ given in (15) by minimizing the empirical criterion Q̂_λ(θ̃, w). For sufficiently large λ_l, this step results in a pair-wise likelihood function with a number of coefficients set exactly equal to zero.
The criterion Q̂_λ(θ, w) in Step 1 is a quadratic function of w with an ℓ1 constraint term. To solve the minimization problem in Step 1, we implement a step-up algorithm which essentially coincides with the least angle regression (LARS) algorithm of Efron et al. (2004). LARS starts with a large initial value of λ_l (l ∈ {α, ρ}), which yields an initial solution ŵ_l with all elements equal to zero. In each subsequent step, the algorithm includes exactly one score component at a time, say u_{l,r}(θ), in the current composite score u(θ, w), by decreasing λ_l in such a way that the corresponding coefficient in ŵ_l becomes different from zero. The included score components u_{l,r} have covariance with the residuals, n^{−1} Σ_{i=1}^n u_{l,r}^{(i)}(θ)[u_{l,r}^{(i)}(θ) − ŵ_l⊤ u_l^{(i)}(θ)], larger than that of the components not yet included, as discussed in relation to (16). In the last step, corresponding to λ_l = 0, the algorithm yields all m coefficients in ŵ_l different from zero.
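As a simple stand-in for the LARS path described above, the quadratic-plus-ℓ1 objective in (14) for a single parameter l can also be minimized by coordinate descent with soft thresholding. The sketch below is our illustration (not the paper's implementation), for a fixed value of λ_l:

```python
import numpy as np

def soft_threshold(x, t):
    # Soft-thresholding operator: shrink x toward zero by t
    return np.sign(x) * max(abs(x) - t, 0.0)

def truncated_rule(S, lam, n_iter=200):
    """Minimize w' S w - 2 w' diag(S) + lam * ||w||_1 by coordinate descent.

    S   : (m, m) empirical covariance of the pair-wise scores for one parameter
    lam : non-negative tuning constant lambda_l
    The returned w has exact zeros for the truncated pairs.
    """
    m = S.shape[0]
    d = np.diag(S).copy()
    w = np.zeros(m)
    for _ in range(n_iter):
        for r in range(m):
            # Part of diag(S)_r not yet explained by the other coordinates
            resid = d[r] - S[r] @ w + S[r, r] * w[r]
            w[r] = soft_threshold(resid, lam / 2.0) / S[r, r]
    return w
```

With lam = 0 this reproduces the unpenalized projection weights S^{-1} diag(S); as lam grows, coordinates are set exactly to zero, mirroring the truncation along the LARS path.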

Standard errors
For a given composition rule w, the matrices H_w(θ) and J_w(θ) forming the Godambe information matrix given in (9) are estimated by their empirical counterparts

Ĵ_w(θ) = W⊤ Ŝ(θ) W,  Ĥ_w(θ) = −W⊤ D̂(θ),    (17)

where Ŝ(θ) = n^{−1} Σ_{i=1}^n u^{(i)}(θ) u^{(i)}(θ)⊤ is the empirical score covariance matrix, and D̂(θ) is the 2m × 2 matrix whose first m rows are (Ŝ(θ)_{j,j}, Ŝ(θ)_{j,j+m}) and whose last m rows are (Ŝ(θ)_{j+m,j}, Ŝ(θ)_{j+m,j+m}), j = 1, …, m. A plug-in estimate of the variance of the final estimator θ̂_λ is found by evaluating (17) at θ̂_λ and its composition rule ŵ, giving

var̂(θ̂_λ) = n^{−1} { Ĥ_ŵ(θ̂_λ)⊤ Ĵ_ŵ(θ̂_λ)^{−1} Ĥ_ŵ(θ̂_λ) }^{−1}.    (18)

Estimating the asymptotic variance of composite likelihood estimators is notoriously difficult. When the composition rule w contains all nonzero coefficients, J_w(θ) may involve a very large number of noisy score covariance terms. When the number of sites s (and the corresponding number of sub-likelihoods 2m) is moderate or large, this increases the computational cost and leads to inaccurate estimates of the PLE's variance. The proposed plug-in estimate (18), on the other hand, represents a computationally efficient and stable alternative. For an appropriate choice of λ = (λ_α, λ_ρ)⊤, the truncated composition rule ŵ does not include the elements corresponding to the noisiest pair-wise scores. As a result, the plug-in variance estimator var̂(θ̂_λ) is expected to be more accurate and computationally stable compared to the variance estimator using all nonzero elements in w.
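The plug-in sandwich calculation itself is a few lines of linear algebra. The sketch below is our illustration, assuming the per-observation combined scores u^{(i)}(θ̂_λ, ŵ) and an estimate of the sensitivity matrix are already available:

```python
import numpy as np

def godambe_variance(U, H):
    """Plug-in sandwich variance for an estimator defined by unbiased
    estimating equations.

    U : (n, p) array of per-observation estimating-function values
        u^(i)(theta, w), evaluated at the final estimate
    H : (p, p) sensitivity matrix, an estimate of E[d u(theta, w) / d theta']
    Returns the (p, p) estimated covariance of the estimator, G^{-1} / n.
    """
    n = U.shape[0]
    J = (U.T @ U) / n                  # variability matrix J_w (scores have mean ~0)
    G = H.T @ np.linalg.solve(J, H)    # Godambe information H' J^{-1} H
    return np.linalg.inv(G) / n
```

For the full likelihood, H = −J and the sandwich collapses to the usual inverse-information variance; with a truncated composition rule, U and H involve only the retained pairs, which is what keeps the computation cheap and stable.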

Selection of λ
Let k̂_l, l ∈ {α, ρ}, be the number of non-zero elements in the selected composition rule ŵ_l found by minimizing the empirical objective (14). Recall that, for the LARS-type algorithm described in Section 3.3, selecting the number k̂_l of non-zero components is equivalent to setting the corresponding tuning constant λ_l. We choose k̂_l such that at least a given fraction of the total information available on parameter l is attained.
Let Ŝ_l^{(t)} be the t × t empirical covariance matrix of the sub-scores selected for parameter l after t steps of the LARS algorithm (i.e. after including t terms in the pair-wise likelihood equation), and let ℓ_l^{(t)} be the smallest eigenvalue of Ŝ_l^{(t)}; we regard the increase of tr(Ŝ_l^{(t)}) as the reduction in variability (information gain) on l at step t. The Min-Max Theorem of linear algebra implies that including the remaining non-selected sub-likelihood components will increase the information on l by a factor of at most 1 + (m − t) ℓ_l^{(t)}/tr(Ŝ_l^{(t)}). We propose to find k̂_l using the empirical rule

k̂_l = min{ t : φ_l(t) ≥ τ },  φ_l(t) = tr(Ŝ_l^{(t)}) / { tr(Ŝ_l^{(t)}) + (m − t) ℓ_l^{(t)} },    (19)

so that φ_l(t), a lower bound on the proportion of information obtained up to step t, must exceed the threshold τ. In practice, we choose values of τ close to 1. Particularly, the value τ = 0.9 is found to select a number of pair-wise likelihood components that balances well computing and statistical efficiency in most of our numerical examples.
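A minimal sketch of this stopping rule follows; it reflects our reading of (19) and assumes the rows/columns of the score covariance matrix are already ordered by the LARS selection sequence:

```python
import numpy as np

def select_n_terms(S, tau=0.9):
    """Eigenvalue-based stopping rule: return the first step t at which
    phi(t) = tr(S_t) / (tr(S_t) + (m - t) * lambda_min(S_t)) >= tau,
    where S_t is the covariance of the first t selected pair-wise scores.

    S   : (m, m) empirical score covariance, columns in selection order
    tau : minimum proportion of total information to retain (close to 1)
    """
    m = S.shape[0]
    for t in range(1, m + 1):
        St = S[:t, :t]
        lam_min = np.linalg.eigvalsh(St)[0]   # smallest eigenvalue of S_t
        phi = np.trace(St) / (np.trace(St) + (m - t) * lam_min)
        if phi >= tau:
            return t
    return m
```

When the scores are uncorrelated with equal variances (S proportional to the identity), φ(t) = t/m, so the rule simply retains a fraction τ of all pairs; correlation among scores makes φ(t) reach τ much earlier.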
An advantage of our application of the LARS algorithm is that it does not require re-estimation of θ and of the Godambe information for each value of λ; in particular, estimates of the asymptotic variance need not be computed for each λ. Moreover, the pair-wise scores are estimated only once, at the beginning of the algorithm, and can be used to guide the selection of λ through the above criterion.
While in principle one may select λ by optimizing a criterion based on the estimated Godambe information, this would require additional computations: at each step of the algorithm w is updated entirely, so re-estimation of θ and re-computation of the matrices in the Godambe information would also be necessary for each value of λ. While this is feasible in small problems, it may be challenging for data sets containing a large number of observations.

Missing data
In our numerical applications there are no missing data. In practice, however, not all sites have data for all years, so some insight on how to proceed in such a setting may be helpful. Suppose that at time i only k sites are observed. Without loss of generality, let Z_obs^{(i)} = (Z_1^{(i)}, …, Z_k^{(i)}), k < s, be the vector of observed data at time i, where s is the total number of available sites, and denote the missing data by Z_mis^{(i)} = (Z_{k+1}^{(i)}, …, Z_s^{(i)}). Let T^{(i)} = (T_1^{(i)}, …, T_s^{(i)}) be a random vector with binary entries indicating the missing data (T_j^{(i)} = 0 if the observation at time i and location j is missing, and T_j^{(i)} = 1 otherwise). Assume that T^{(i)} is an independent draw from a distribution depending on an unknown parameter ζ. As before, θ denotes the max-stable parameter of interest (θ = (α, ρ)⊤ for the Brown-Resnick model).

The appropriate treatment of missing data depends on the specific model for the missing-data mechanism. For simplicity, here we limit our discussion to the case of data missing completely at random (MCAR). The observed-data likelihood function for the rth pair {(j, k), 1 ≤ j < k ≤ s} evaluated at the ith observation can be written as

f_obs^{(i)}(z_j, z_k, t_j, t_k; θ, ζ) = g(t_j, t_k; ζ) {f_r(z_j, z_k; θ)}^{t_j t_k} {f_F(z_j)}^{t_j(1−t_k)} {f_F(z_k)}^{(1−t_j)t_k},

where g(·; ζ) is the bivariate probability mass function of (T_j, T_k), f_r(·; θ) is the bivariate max-stable density defined in (5), and f_F(z) = z^{−2} exp(−1/z) is the unit Fréchet density. Note that when the observation is missing at either site j or k (i.e. t_j t_k = 0), f_obs^{(i)}(·; θ, ζ) does not depend on the parameter θ, because marginalization of the bivariate max-stable model leads to unit Fréchet univariate distributions. This means that the truncated pair-wise likelihood estimator (TPLE) of Section 3.1 can be computed as usual, with the rth pair-wise likelihood score term in the estimating equation multiplied by the indicator t_j^{(i)} t_k^{(i)}.
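Under MCAR the practical consequence is simply that each pair-wise score contribution is multiplied by the indicator of both endpoints being observed. A minimal sketch (ours; the array shapes are assumptions):

```python
import numpy as np

def pairwise_score_mcar(scores, T, pairs):
    """Zero out pair-wise score contributions with a missing endpoint (MCAR).

    scores : (n, m, p) per-observation scores for each of the m pairs
             (p parameters, here p = 2 for (alpha, rho))
    T      : (n, s) binary matrix, T[i, j] = 1 if site j is observed at time i
    pairs  : list of m tuples (j, k) of site indices defining the pairs
    Returns the masked scores: pair r at time i contributes only when both
    of its sites are observed, i.e. it is multiplied by T[i, j] * T[i, k].
    """
    mask = np.stack([T[:, j] * T[:, k] for j, k in pairs], axis=1)  # (n, m)
    return scores * mask[:, :, None]
```

The truncation and estimation steps are then unchanged, operating on the masked score array.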

Monte Carlo simulations
We simulate from the Brown-Resnick model described in Section 2 for various settings of the parameter θ = (ρ, α)⊤ using the R package SpatialExtremes (Ribatet 2015). We implement the two-step approach described in Section 3.3 to find the truncated PLE (TPLE) θ̂_λ. The preliminary estimate θ̃ is found by setting π = 0.3. We investigate the statistical efficiency and computational cost of the TPLE. For comparison, we consider the PLE with uniform coefficients w_unif1 = (1, …, 1)⊤ (UPLE), due to its widespread use, and the PLE with coefficients w_unif2 set to 1 if the corresponding pair of locations has distance less than one third of the radius of the study region, and 0 otherwise (UPLE_2). We also consider the random PLE (RPLE) with coefficients w_rand containing 0.3 × 2m elements equal to 1 at random positions, where m is the total number of pair-wise likelihoods.
The performance of our method is measured by Monte Carlo estimates of the relative mean squared error of θ̂_λ = (ρ̂_λ, α̂_λ)⊤ and of the required CPU time, compared to the other composition rules. Particularly, we estimate the relative mean squared errors RE^(1) = RE(w_unif1), RE^(2) = RE(w_rand) and RE^(3) = RE(w_unif2), together with the corresponding relative computing costs, where θ̂(w) denotes the pair-wise likelihood estimator obtained using the composition rule w and RE(w) is the mean squared error of θ̂(w) relative to that of θ̂_λ.
Simulation 1 In our first simulation, we illustrate the impact of the tuning constants (λ_α, λ_ρ), or, equivalently, of the number of selected pair-wise likelihood terms, on statistical accuracy and computational efficiency. Figure 3 (top row) shows the number of pairs of sites selected, i.e. the number of nonzero elements in the estimated coefficients ŵ_α = (ŵ_{α,1}, …, ŵ_{α,m})⊤ and ŵ_ρ = (ŵ_{ρ,1}, …, ŵ_{ρ,m})⊤, against the criterion φ_l(t), l ∈ {α, ρ}, defined in (19). Recall that φ_l(t) represents a lower bound on the explained variability of the selected pair-wise scores after t terms are included in the pair-wise likelihood equations. The curves are obtained from a single realization of the Brown-Resnick process with (ρ, α) = (2.8, 1.5) at 30 randomly selected locations on [0, 100]². Figure 3 (bottom row) shows Monte Carlo estimates of the relative efficiency of the TPLE compared to the UPLE, separately for the parameters α and ρ, against φ_α(t) and φ_ρ(t) (RE^(1)_α and RE^(1)_ρ, respectively). Estimates are based on 1000 Monte Carlo samples of size 50 from a Brown-Resnick process at the same 30 randomly selected locations on [0, 100]², which are not varied throughout the simulations. Remarkably, selecting just 20 to 30 pair-wise score terms (i.e. 5 to 7% of the entire set of feasible terms) already gives dramatic improvements in relative efficiency compared to the UPLE. The computational complexity (Fig. 3, top row) increases as the number of pair-wise scores with coefficients different from zero increases (equivalently, as λ_l decreases).
Thus, the computing cost is largest when λ_α = λ_ρ = 0, since all the pair-wise scores are included. The relative error (Fig. 3, bottom row) follows a U-shape, which can be explained as follows. The optimal theoretical weights are obtained with λ_α = λ_ρ = 0, corresponding to the optimal estimating equations described in Heyde (2008). However, such optimal weights are not achievable in practice due to the substantial correlation between pair-wise scores and the presence of estimation error. This means that, for λ_l close to zero, the estimated composite likelihood coefficients ŵ become increasingly unstable, yielding parameter estimates θ̂(ŵ) with large variance. Specifically, the optimal weights depend on the inverse of the estimated pair-wise score covariance matrix, which is nearly singular in the presence of pronounced spatial correlation; this behavior is exacerbated as the number of sites increases. On the other hand, including too few pair-wise scores in the likelihood equation (i.e. setting λ_α, λ_ρ too large) may miss important information on the parameters, resulting in poor accuracy of the estimator.
Simulation 2
In our second Monte Carlo experiment, we carry out a systematic assessment of the performance of the TPLE compared to the UPLE and RPLE. For the TPLE, we consider various choices for the minimum proportion of explained score variability (τ = 0.9, 0.95 and 0.99). Tables 1 and 2 show results based on 1000 Monte Carlo samples of size 50 from a Brown-Resnick process with different smoothness and range parameters observed, respectively, at 20 and 30 random locations on [0, 100]², which are held fixed throughout the simulations. We report Monte Carlo estimates of the following quantities: mean number of pair-wise score terms included (#Terms), E(αλ) and E(ρλ), sd(αλ) and sd(ρλ), and the relative mean squared error and relative computing cost of the TPLE compared to the UPLE and RPLE. Whilst the TPLE generally outperforms the UPLE in terms of relative efficiency, it performs comparably to the random PLE in terms of computational cost. Both the accuracy and the computational advantages of the TPLE become more pronounced as the number of sites increases. Finally, note that when α and ρ decrease, the TPLE tends to perform similarly to the UPLE in terms of efficiency. This is not surprising, since in this situation observations between sites become increasingly independent and all sub-likelihoods contain roughly the same information on the parameters.

Simulation 3
In our third Monte Carlo experiment, we examine the estimator of the extremal coefficient, a useful quantity in the spatial analysis of extremes. The accuracy of our method is compared with the UPLE and RPLE. We also assess the accuracy of the estimated extremal coefficients η̂2(h), obtained by plugging the parameter estimates θ̂ into the formula η2(h; θ) given in Section 2.3. Figure 4 (top row) shows the fitted extremal coefficient curves η̂2(h) based on the estimated and the true parameters. Figure 4 (bottom row) shows the corresponding mean squared errors of the estimates obtained by plugging in the TPL, UPL and RPL estimates. The lighter circles in the plots correspond to empirical estimates of the pairwise coefficients. Results are based on 1000 Monte Carlo samples of size 50, generated from 20 and 30 randomly selected sites on [0, 100]² with true parameters (α, ρ) = (1.5, 28). Whilst all the estimators tend to underestimate the extremal coefficient for relatively large h, our truncation approach clearly outperforms the other two methods.
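For reference, the plug-in computation can be sketched as follows, assuming the common Hüsler-Reiss form of the Brown-Resnick pairwise extremal coefficient, η2(h) = 2Φ(√γ(h)/2) with semivariogram γ(h) = (h/ρ)^α; the exact parametrization used in the paper is the one in Section 2.3, so treat this as illustrative:

```python
# Illustrative plug-in extremal coefficient for a Brown-Resnick-type model,
# under the assumed parametrization gamma(h) = (h / rho)^alpha and
# eta_2(h) = 2 * Phi(sqrt(gamma(h)) / 2)  (Husler-Reiss form).
import numpy as np
from scipy.stats import norm

def extremal_coefficient(h, alpha, rho):
    gamma = (np.asarray(h, dtype=float) / rho) ** alpha
    return 2.0 * norm.cdf(np.sqrt(gamma) / 2.0)

# Values between 1 (complete dependence) and 2 (independence); here we use
# the Simulation 3 parameters (alpha, rho) = (1.5, 28).
h = np.array([0.0, 1.0, 10.0, 100.0])
eta = extremal_coefficient(h, alpha=1.5, rho=28.0)
print(eta)  # increases with distance, from 1 towards 2
```

Plugging in estimated rather than true (α, ρ) gives the fitted curves η̂2(h) compared in Figure 4.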

Analysis of Victoria extreme temperature data
In this section, we apply the new estimation method to maximum temperature data recorded in the state of Victoria, Australia. Daily temperature maxima from 1971 to 2017 are provided by the national Australian meteorological service, the Bureau of Meteorology (data are available at http://www.bom.gov.au/climate/data). The final dataset contains the highest annual temperatures measured at 26 stations over the 47 years from 1971 to 2017. The distances between these stations range from 13 to 1100 kilometers. The locations of the measuring stations are shown in Figure 6 (left). Sites colored in blue (red) correspond to average maximum temperatures below (above) the average maximum temperature across all sites.
The main objective of our analysis is to estimate the correlation structure of extreme temperatures. As a pre-processing step, we transform the data at each location to a unit Fréchet distribution, with marginal parameters obtained by fitting Generalized Extreme Value models at each location. Extreme dependence parameters under the Brown-Resnick model are obtained using the truncated, random and uniform PLEs. Standard deviations and covariances of the pairwise likelihood estimators are calculated by the sandwich approximation of the inverse Godambe information matrix described in Section 3.4.
[Tables 1 and 2, caption fragment: relative computing cost of the TPLE compared to the UPLE (RC(1)), RPLE (RC(2)) and UPLE2 (RC(3)).]
Figure 5 (top) depicts the entire trajectories of the range and smoothness parameters fitted using the TPLE for increasing explained score variability φρ(t) and φα(t), respectively, along with 95% confidence bands. For comparison, the horizontal dot-dashed lines represent the UPLE estimates. Figure 5 (bottom) gives the number of nonzero elements of the truncated composition rules ŵρ and ŵα, with the number of selected pair-wise score terms reported on the top axes. Note that, by including just a small fraction of the pair-wise likelihoods, the TPLE gets very close to the UPLE involving all 325 pair-wise likelihood terms. However, we find that the TPLE has much smaller standard errors than the UPLE; for example, when τ = 0.95, the 95% confidence intervals obtained from the TPLE are considerably narrower than those from the UPLE. In Figure 6 (left), edges joining site locations represent the pair-wise scores selected by our truncation method when τ = 0.9. Note that the selected scores represent a small fraction of the available 325 pair-wise terms, and generally correspond to pairs of sites close to each other. This is not surprising, since pairs of neighboring sites are expected to provide more information on extreme correlation than those far apart. Figure 6 (right) shows extremal coefficient estimates obtained by plugging in the truncated, uniform and random pair-wise likelihood estimators of the smoothness and range parameters.
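The marginal pre-processing step described above can be sketched as follows. This is a simplified stand-in for the actual GEV fits, using synthetic data and scipy, whose shape parameter c equals −ξ in the usual GEV convention:

```python
# Hedged sketch: fit a GEV to annual maxima at one site, then transform them
# to the unit Frechet scale via the probability integral transform,
# Z = -1 / log(F_hat(X)).  Data here are synthetic "annual maxima".
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(2)
x = genextreme.rvs(c=-0.1, loc=38.0, scale=2.0, size=47, random_state=rng)

c, loc, scale = genextreme.fit(x)                 # ML fit of the marginal GEV
u = genextreme.cdf(x, c, loc=loc, scale=scale)    # approximately Uniform(0, 1)
z = -1.0 / np.log(u)                              # unit Frechet margins
print(bool(np.all(z > 0)))                        # Frechet support is (0, inf)
```

Repeating this site by site puts all margins on a common scale, so that the pair-wise likelihoods only have to capture the extremal dependence.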
Whilst the TPL and UPL estimates are generally very close, the TPL extremal coefficient estimate becomes smaller than the UPL estimate as the distance h increases. Finally, in Fig. 7 we show fitted extremal dependence coefficients and realizations of temperature maxima in degrees Celsius simulated from the max-stable models fitted by the TPLE (with τ = 0.9) and the UPLE on the map of Victoria.
Fig. 6 Left: dashed lines connecting the sites represent pair-wise likelihood scores selected by our truncation methodology described in Section 3.2. Right: light gray points represent empirical extremal coefficient estimates against distances between the corresponding locations. The smooth curves represent fitted extremal coefficients obtained by plugging in the truncated (solid), uniform (dashed) and random (dot-dashed) pair-wise likelihood estimators of the smoothness and range parameters (α and ρ) for the Brown-Resnick model.
Fig. 7 Maps of the state of Victoria, Australia, with fitted extremal dependence coefficients (top row) and temperature maxima in degrees Celsius simulated from fitted max-stable models (bottom row) using the proposed truncated pair-wise likelihood (TPLE, left column) and the classic uniform pair-wise likelihood (UPLE, right column) estimators.

Conclusion and final remarks
Building on the general methodology introduced in Huang and Ferrari (2017), we have developed and applied a new truncation strategy for pair-wise likelihood estimation in the Brown-Resnick model, a popular dependence model for spatial extremes. Our method represents a statistically efficient and computationally parsimonious approach to estimating the parameters of spatial max-stable models for extreme values. The pair-wise likelihood constructed by our new method is obtained by minimizing an estimate of the ℓ2-distance between the composite likelihood score and the full likelihood score, subject to an ℓ1-norm penalty representing the computational cost (or composite likelihood complexity). When the number of pair-wise likelihood terms considered for estimation is relatively large compared to the sample size, traditional CL estimators with uniform weights may be inaccurate due to potentially many large correlations between the sub-likelihood scores (Cox and Reid 2004; Lindsay et al. 2011). This issue is crucial for pairwise likelihood estimation in the context of spatial max-stable models (Sang and Genton 2014). The proposed ℓ1-penalization strategy carries out a principled selection of pair-wise likelihood terms: only the pair-wise scores with the largest correlation with the maximum likelihood score are kept in the estimating equations (see Section 3.2). This mechanism is shown to improve statistical efficiency whilst reducing the computational burden associated with extensive combinations of equally weighted CL objects. These features make our truncation method particularly effective for analyzing large datasets with measurements taken at many sites, where the number of noisy or redundant likelihood terms often grows very quickly with the number of sites. Huang and Ferrari (2017) show that ℓ1-truncation yields CL estimators whose asymptotic variance approaches that of optimal CL estimators (with no penalty).
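Schematically, and only as an illustration of its structure (the exact expressions are given in (10) and Section 3.2), the truncated composition rule solves a lasso-type quadratic program:

```latex
\hat{w}_{\lambda} \;=\; \operatorname*{arg\,min}_{w \in \mathbb{R}^{m}}
\Bigl\{\, w^{\top} \widehat{S}\, w \;-\; 2\, w^{\top} \widehat{h}
\;+\; \lambda \lVert w \rVert_{1} \Bigr\},
```

where Ŝ is an empirical estimate of the covariance matrix of the pair-wise scores, ĥ collects estimates of the covariances between each pair-wise score and the full likelihood score (estimable via the information identity), and λ ≥ 0 trades statistical efficiency for sparsity, i.e. computational cost.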
The results in this paper support their theoretical findings and suggest that CL truncation by ℓ1-penalization is a valid option when dealing with complex likelihood estimation problems.
Various research directions based on modifications of the current approach may be pursued in the future. One direction concerns the choice of the penalty function. Inspired by the literature on variable selection in regression (see, e.g., Efron et al. 2004; Fan and Lv 2010), we achieve sparsity of the estimating equations via the ℓ1-penalty described in (10). Note, however, that in our context the penalty involves the composition rule w but not the model parameter θ, which is low-dimensional and treated as fixed. This means that, differently from penalized regression, our estimating equations (and the resulting parameter estimates) remain approximately unbiased even when λ is not zero. In the future, exploring other suitable sparsity-promoting penalty strategies may be valuable to deal with cases where both the CL complexity and the size of the parameter space are large. Another research direction concerns the choice of the CL design. In this paper we have focused on pair-wise likelihood inference, but our approach may be used for higher-order CL designs with sub-likelihoods constructed on site pairs, triplets, quadruples, etc., similarly to Castruccio et al. (2016). Finally, since settings beyond the Brown-Resnick model were not directly pursued in the present paper, numerical studies and applications of the TCLE in the context of other max-stable processes would also be valuable.