Large Deviations of Bivariate Gaussian Extrema

We establish sharp tail asymptotics for component-wise extreme values of bivariate Gaussian random vectors with arbitrary correlation between the coordinates. We consider two scaling regimes for the tail event and, in each, demonstrate the existence of a restricted large deviations principle and identify the associated unique rate function. Our results identify when the maxima of both coordinates are typically attained by two different indices versus the same index, and how this depends on the correlation between the coordinates of the bivariate Gaussian random vectors. Our results complement a growing body of work on the extremes of Gaussian processes, and are also relevant for steady-state performance and simulation analysis of networks of infinite server queues.


Introduction
Motivated by applications to the analysis of queueing networks, we study the large deviations of extreme values of multivariate Gaussians. We focus on the bivariate case for simplicity, but our analysis carries over to the more general case with some effort. Let $\{X_1, X_2, \ldots\}$ be an ensemble of independent and identically distributed (i.i.d.) bivariate Gaussians with covariance matrix $\Sigma$, and let $\bar X_n := (\max_{1\le i\le n} X_i^{(1)}, \max_{1\le i\le n} X_i^{(2)})$ be the component-wise maximum, or extreme value, random vector. For simplicity we assume that $\mathbb{E}[X_1] = 0$. In the context of a queueing network, $\bar X_n$ is an approximation to the maximum congestion experienced over a typical interval in a network of infinite server queues, for instance. We characterize the likelihood of the tail event $\{\bar X_n > a_n u\}$, where $u \in (0,\infty)^2$, as the typical interval $n$ tends to infinity, under the assumption that $a_n \to \infty$ as $n \to \infty$. We consider two cases.
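The setup above is easy to simulate. The following sketch (all parameter values are illustrative assumptions, not taken from the paper) draws i.i.d. bivariate standard Gaussians with correlation $\rho$ via the usual one-factor construction and estimates $\mathbb{P}(\bar X_n > a_n u)$ by Monte Carlo at the right scale $a_n = \sqrt{\log n}$:

```python
import math
import random

random.seed(0)
rho, n, trials = 0.5, 200, 2000          # illustrative choices
a_n = math.sqrt(math.log(n))             # the "right scale" a_n = sqrt(log n)
u = (1.0, 1.0)

def bar_X(n, rho):
    """Component-wise maximum of n i.i.d. bivariate standard Gaussians."""
    m1 = m2 = -float("inf")
    for _ in range(n):
        z1, w = random.gauss(0, 1), random.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * w   # Corr(z1, z2) = rho
        m1, m2 = max(m1, z1), max(m2, z2)
    return m1, m2

hits = 0
for _ in range(trials):
    m1, m2 = bar_X(n, rho)
    if m1 > a_n * u[0] and m2 > a_n * u[1]:
        hits += 1
p_hat = hits / trials                    # estimate of P(bar_X_n > a_n u)
```

At these moderate values of $n$ the event is not rare, so plain Monte Carlo suffices; the large deviations regime corresponds to letting $n$ (and hence $a_n$) grow.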
The different cases in (2) originate from the different scenarios in which the bivariate distribution can attain its maximum. In all the cases where a term $+1$ is present, the maximum is attained by one index of $X_i$ which simultaneously attains the maximum of both coordinates. In all the cases where a term $+2$ is present, the maximum is attained by two different indices of $X_i$, one which attains the maximum of the first coordinate, and one which attains the maximum of the second coordinate. The latter case has the most distinct possibilities ("larger entropy"), while the first may have a larger probability for appropriate correlation coefficients $\rho$. The optimal strategy is characterized by the 'largest probability wins' principle.
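An illustrative numerical reading of the 'largest probability wins' principle (under the assumption of unit variances, $u = (t,t)$, and the right scale $a_n^2 = \log n$): the same-index strategy contributes roughly $n\, e^{-a_n^2 t^2/(1+\rho)}$ and the two-index strategy roughly $n^2\, e^{-a_n^2 t^2}$, so the exponents to compare are $1 - t^2/(1+\rho)$ versus $2 - t^2$. The parameter values below are assumptions for illustration only.

```python
def same_index_exponent(t, rho):
    # joint cost of one sample exceeding (t, t), plus entropy term +1
    return 1.0 - t * t / (1.0 + rho)

def two_index_exponent(t, rho):
    # marginal costs of two distinct samples each exceeding t, plus entropy +2
    return 2.0 - t * t

rho = 0.5
# crossover: the same-index strategy wins once t^2 > (1 + rho) / rho
small_t, large_t = 1.0, 2.0
winner_small = ("same" if same_index_exponent(small_t, rho) > two_index_exponent(small_t, rho)
                else "two")
winner_large = ("same" if same_index_exponent(large_t, rho) > two_index_exponent(large_t, rho)
                else "two")
```

For small $t$ the entropy of two distinct indices dominates; for large $t$ the positive correlation makes a single doubly-extreme index cheaper.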
Case 2: Larger scales. On a much larger scale, where $a_n \gg \sqrt{\log n}$, we establish two main results. First, we prove a leading-order asymptotic for the extreme value that aligns with the result in Case 1; precisely, in Theorem 3 we prove the RLDP. In Theorem 4, on the other hand, we establish a sharp asymptotic for the likelihood and show that there exist a continuous function $I\colon \mathbb{R}^d \to \mathbb{R}$ and constants $K$, $b$ and $c$ such that \[\lim_{n\to\infty} a_n^b\, n^c\, e^{a_n^2 I(u)}\, \mathbb{P}(\bar X_n > a_n u) = K.\]
The proofs of these theorems use the inclusion-exclusion principle to bound the likelihood from above and below.
Related Literature. Multivariate Gaussians emerge as stationary limits of networks of $G/G/\infty$ infinite server queues when the arrival rate is high (i.e., in heavy traffic in the sense of [9]); see [1,22] as well. This is straightforward to observe in the case of a single $M/G/\infty$ station, where the number in system in steady state is Poisson distributed. When the arrival rate is high, the steady-state distribution is well approximated by a Gaussian.
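The $M/G/\infty$ remark can be checked numerically: the sketch below (with an illustrative offered load) compares the Poisson$(\lambda)$ steady-state tail with its Gaussian approximation $N(\lambda, \lambda)$, using only the standard library.

```python
import math

lam = 400.0  # illustrative offered load; Poisson(lam) is the M/G/infinity steady state

def poisson_sf(k, lam):
    """P(N > k) for N ~ Poisson(lam), by summing the pmf up to k."""
    pmf, cdf = math.exp(-lam), 0.0
    for j in range(k + 1):
        cdf += pmf
        pmf *= lam / (j + 1)     # recursion pmf(j+1) = pmf(j) * lam / (j+1)
    return 1.0 - cdf

def normal_sf(x):
    """P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))

k = int(lam + 2 * math.sqrt(lam))            # two standard deviations above the mean
exact = poisson_sf(k, lam)
approx = normal_sf((k - lam) / math.sqrt(lam))
rel_err = abs(exact - approx) / exact
```

At $\lambda = 400$ the two tail probabilities already agree to within a few percent (a continuity correction would tighten this further).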
Next, there is an explicit connection with extreme value theory (EVT). The logarithmic asymptotics established here complement the uniform convergence results for EVT; see [21,Chapter 4] and [3, Section I, Chapter C]. There are also clear connections with recent work on extremes of multidimensional Gaussian processes in [4,5,6,17,15] and other related work, where logarithmic asymptotics are derived for the "at least one in the set" extremum (not the component-wise extremum considered here) for Gaussian processes. We note, in particular, [17] where logarithmic asymptotics are derived for the "at least one in set" extremum of a sequence of (non-i.i.d.) generally distributed random vectors. The authors present a general theory closely aligned with the RLDP for univariate random variables introduced in [8], whereby the Gärtner-Ellis condition need not be satisfied. Of course, our results are more restrictive in the sense that we only study i.i.d. Gaussian random vectors, but we also consider large-scale asymptotics that are not under consideration there.
Our results are also closely related to the important series of papers by Hashorva and Hüsler [11,12,13] generalizing the classic Mills ratio Gaussian tail bound [20]. We observe that the quadratic-program logarithmic asymptote derived in Lemma 1 is also implied by the tail bound derived in [11,13]. In [13], the authors derive exact asymptotics for integrals of Gaussian random vectors, and in particular focus on the "at least one in the set" extremum for half-space extreme value sets. Our proof does not rely on the bound in [11,13].
It would be interesting to strengthen Theorem 2 to sharp asymptotics, as is done for large scales in Theorem 4. This is hard: various error terms that are easily dealt with in the proof of Theorem 4, because they are much smaller than the leading order there, become only marginally smaller at the right scale. It would also be interesting to extend our analysis to other multivariate random vectors with non-trivial dependence.
Notations and Setting. All vector relations should be understood component-wise. Thus, $x > y$ means that $x^{(j)} > y^{(j)}$ for every component $j$. Following [17], we say that a sequence of multivariate $\mathbb{R}^d$-valued random variables $\{W_n\}_{n\ge 1}$ satisfies a restricted large deviation principle (RLDP) with rate function $I$ if, for any $q \in (0,\infty)^d$, \[\lim_{n\to\infty} \frac{1}{v_n} \log \mathbb{P}(W_n > a_n q) = -I(q),\] where $v_n, a_n \to \infty$ as $n \to \infty$. This asymptotic is not a full-fledged large deviations principle (LDP), since it does not provide any insight into what happens for negative $q$; i.e., it only deals with attaining large positive values. Furthermore, as noted in [17] and [8], if $W_n$ satisfies an LDP with continuous rate function, then it automatically satisfies the RLDP. On the other hand, if the rate function is discontinuous, then it might not satisfy the RLDP.

Right Scale Asymptote
We start by analyzing extreme events for bivariate Gaussian random variables.

Lemma 1 (Extreme events for single normal random variables). Let $\{a_n\}_{n\ge 1}$ be any unbounded increasing sequence in $n \in \mathbb{N}$. Then \[\lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(X_1 > a_n \varepsilon) = -\inf_{x \ge \varepsilon} \frac{1}{2}\, x^\top \Sigma^{-1} x.\]

Proof. By definition, and with $C$ an explicit constant, \[\frac{1}{a_n^2} \log \mathbb{P}(X_1 > a_n \varepsilon) = \frac{1}{a_n^2} \log \Big( C \int_{y \ge a_n \varepsilon} e^{-y^\top \Sigma^{-1} y / 2}\, dy \Big) = \frac{1}{a_n^2} \log \Big( C\, a_n^2 \int_{x \ge \varepsilon} e^{-a_n^2\, x^\top \Sigma^{-1} x / 2}\, dx \Big),\] where the second equality follows by the substitution of variables $x = y/a_n$. Laplace's principle [?, Chapter 4] implies the claim.
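For a single standard normal coordinate the limit in Lemma 1 reduces to $-\varepsilon^2/2$, which can be sanity-checked with `math.erfc`; the growing sequence $a_n$ below is an illustrative assumption.

```python
import math

def log_gauss_tail(x):
    """log P(Z > x) for a standard normal Z, via erfc."""
    return math.log(0.5 * math.erfc(x / math.sqrt(2)))

eps = 1.5
# (1/a^2) log P(Z > a*eps) should approach -eps^2/2 = -1.125 as a grows
ratios = [log_gauss_tail(a * eps) / a**2 for a in (8.0, 16.0, 24.0)]
gap = abs(ratios[-1] + eps**2 / 2)
```

The convergence is slow (the correction is of order $\log a_n / a_n^2$, coming from the polynomial Mills-ratio prefactor), which is exactly why the logarithmic scaling by $a_n^2$ is the right normalization.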
Next, we consider the asymptotics of the logarithmic likelihood of the event {∃i ≤ n∶ X i > a n u}. Note that this is not the component-wise maximum.
Proof. Observe that \[\mathbb{P}(\exists i \le n\colon X_i > a_n u) = 1 - b_n^n = (1 - b_n) \sum_{i=0}^{n-1} b_n^i, \tag{11}\] where $b_n := \mathbb{P}(\{X_1 > a_n u\}^c)$.
From (11), it follows that \[\log \mathbb{P}(\exists i \le n\colon X_i > a_n u) \le \log \mathbb{P}(X_1 > a_n u) + \log n, \tag{12}\] using the fact that $b_n < 1$ for all finite $n$; Lemma 1 then yields the matching logarithmic upper bound. Next, for the lower bound, we work with the term $\log \sum_{i=0}^{n-1} b_n^i$ to obtain a finer analysis. In particular, suppose we demonstrate that, since $b_n \in [0,1]$, $\log(n\, b_n^{n-1}) \ge \log n + o(\log n)$ as $n \to \infty$; then it follows that the geometric sum contributes $\log n$ to leading order. Consequently, Lemma 1, combined with this result, implies the matching lower bound, thereby completing the proof of the proposition.
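The identity $1 - b_n^n = (1-b_n)\sum_{i=0}^{n-1} b_n^i$ behind the proof, together with the elementary bounds $n p\, b_n^{n-1} \le \mathbb{P}(\exists i \le n) \le n p$ (with $p = 1 - b_n$), can be verified numerically; the values of $n$ and $p$ below are illustrative.

```python
n, p = 1000, 1e-3          # illustrative: per-sample exceedance probability p
b = 1.0 - p                # b_n = P(a single sample fails to exceed)
exact = 1.0 - b**n         # P(at least one of n i.i.d. samples exceeds)
geometric = (1.0 - p) and (1.0 - b) * sum(b**i for i in range(n))  # via the geometric sum
lower = n * p * b**(n - 1) # each term of the geometric sum is at least b^(n-1)
upper = n * p              # union bound
```

The upper bound is the union bound used for (12); the lower bound is what drives the $\log n + o(\log n)$ contribution of the geometric sum.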
Next, the Gaussian upper tail bound implies the desired estimate for large $n$, thereby completing the proof.
Lemma 2 (Analysis of variational problem). By a straightforward calculation, \[\inf_{x \ge u} \frac{1}{2}\, x^\top \Sigma^{-1} x = \begin{cases} \dfrac{(u^{(1)})^2}{2(\sigma^{(1)})^2} & \text{if } u^{(2)}/\sigma^{(2)} \le \rho\, u^{(1)}/\sigma^{(1)},\\[6pt] \dfrac{(u^{(2)})^2}{2(\sigma^{(2)})^2} & \text{if } u^{(1)}/\sigma^{(1)} \le \rho\, u^{(2)}/\sigma^{(2)},\\[6pt] \dfrac{1}{2}\, u^\top \Sigma^{-1} u & \text{otherwise.} \end{cases}\]
Proof. Fix $\rho \in (0,1]$ and, without loss of generality, assume that $\sigma^{(1)} = \sigma^{(2)} = 1$. We can divide the positive quadrant into three regions, as shown in Figure 1 for $\rho = 0.5$. Suppose that $u$ is such that $u^{(2)} \le \rho u^{(1)}$ (see the red region in Figure 1). Then \[x^\top \Sigma^{-1} x = (x^{(1)})^2 + \frac{(x^{(2)} - \rho x^{(1)})^2}{1-\rho^2} \ge (x^{(1)})^2 \ge (u^{(1)})^2,\] where the final inequality holds for any $x \ge u$. It follows that the infimum equals $\frac{1}{2}(u^{(1)})^2$, attained at $x = (u^{(1)}, \rho u^{(1)})$, which is feasible in this region. By symmetry, in the region where $u^{(1)} \le \rho u^{(2)}$, the infimum equals $\frac{1}{2}(u^{(2)})^2$. Finally, in the region where neither of these conditions holds (the blank region in Figure 1), it is straightforward to verify the Karush-Kuhn-Tucker (KKT) conditions for $u$ and, since $\Sigma^{-1}$ is positive definite, $u$ is the unique optimizer.
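The three regions in the proof translate directly into a closed-form solver for $\inf_{x \ge u} \frac12 x^\top \Sigma^{-1} x$ with unit variances; the sketch below transcribes that case analysis and checks it against a brute-force grid search (grid parameters are illustrative).

```python
def rate(u1, u2, rho):
    """inf over x >= (u1, u2) of x^T Sigma^{-1} x / 2; unit variances, 0 < rho < 1."""
    if u2 <= rho * u1:                       # red region: minimizer (u1, rho*u1)
        return 0.5 * u1 * u1
    if u1 <= rho * u2:                       # blue region: minimizer (rho*u2, u2)
        return 0.5 * u2 * u2
    # blank region: both constraints bind, KKT holds at x = u
    return (u1 * u1 - 2 * rho * u1 * u2 + u2 * u2) / (2 * (1 - rho * rho))

def brute(u1, u2, rho, span=3.0, steps=300):
    """Crude grid search over x >= u, as an independent check."""
    best = float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1):
            x1 = u1 + span * i / steps
            x2 = u2 + span * j / steps
            q = (x1 * x1 - 2 * rho * x1 * x2 + x2 * x2) / (2 * (1 - rho * rho))
            best = min(best, q)
    return best

rho = 0.5
err = max(abs(rate(a, b, rho) - brute(a, b, rho))
          for a, b in [(1.0, 0.3), (0.3, 1.0), (1.0, 0.9)])
```

The three test points sit one in each region, so all three branches of the case analysis are exercised.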
As a consequence, we obtain the main result of this section:

Theorem 2 (Extreme value asymptotics for bivariate Gaussians). Under the conditions of Proposition 1, \[\lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(\bar X_n > a_n u) = J(u \mid \sigma),\] where $J(u \mid \sigma)$ is identified in (23) as the maximum of $J_1(u \mid \sigma)$ and $J_2(u \mid \sigma)$, case-wise over the regions of Lemma 2.
Further, with $(I^\star, J^\star)$ the indices that attain $\bar X_n$ (i.e., $\bar X_n = (X_{I^\star}^{(1)}, X_{J^\star}^{(2)})$), (24) identifies whether $I^\star = J^\star$ or $I^\star \ne J^\star$ with high probability. It is an interesting problem to extend (24) to the case where $J_1(u \mid \sigma) = J_2(u \mid \sigma)$, but this seems quite difficult, as it requires a finer analysis beyond the principle of the largest term [7, Lemma 1.2.15].
Proof. Note that \[\mathbb{P}(\bar X_n > a_n u) = \mathbb{P}(\exists i \le n\colon X_i > a_n u) + \mathbb{P}(\exists i \ne j \le n\colon X_i^{(1)} > a_n u^{(1)},\, X_j^{(2)} > a_n u^{(2)},\, \nexists\, i \le n\colon X_i > a_n u). \tag{26}\] Depending on $u$, the first or the second term is dominant. Ignoring the event that $\nexists\, i \le n\colon X_i > a_n u$ (which leads to an upper bound, but this event can also be incorporated in more detail), we have (using that $X_i^{(1)}, X_j^{(2)}$ are independent Gaussians for $i \ne j$) \[\mathbb{P}(\exists i \ne j \le n\colon X_i^{(1)} > a_n u^{(1)},\, X_j^{(2)} > a_n u^{(2)}) \approx n(n-1)\, \mathbb{P}(X_1^{(1)} > a_n u^{(1)})\, \mathbb{P}(X_1^{(2)} > a_n u^{(2)}). \tag{27}\] Taking logarithms and dividing by $\log n = a_n^2$ gives \[\lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(\exists i \ne j \le n\colon X_i^{(1)} > a_n u^{(1)},\, X_j^{(2)} > a_n u^{(2)}) = J_2(u \mid \sigma). \tag{28}\] Then, by the principle of the largest term [7, Lemma 1.2.15], it follows that \[\lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(\bar X_n > a_n u) = \max\Big\{ \lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(\exists i \le n\colon X_i > a_n u),\ \lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(\exists i \ne j \le n\colon X_i^{(1)} > a_n u^{(1)},\, X_j^{(2)} > a_n u^{(2)}) \Big\}. \tag{29}\] Further, \[\lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(I^\star \ne J^\star \mid \bar X_n > a_n u) = \lim_{n\to\infty} \frac{1}{a_n^2} \big[\log \mathbb{P}(I^\star \ne J^\star,\, \bar X_n > a_n u) - \log \mathbb{P}(\bar X_n > a_n u)\big] \le J_2(u \mid \sigma) - J(u \mid \sigma) < 0, \tag{30}\] when $J_2(u \mid \sigma) < J(u \mid \sigma)$, showing that $\mathbb{P}(I^\star \ne J^\star \mid \bar X_n > a_n u) = o(1)$ when $J_2(u \mid \sigma) < J(u \mid \sigma)$. Similarly, \[\lim_{n\to\infty} \frac{1}{a_n^2} \log \mathbb{P}(I^\star = J^\star \mid \bar X_n > a_n u) = \lim_{n\to\infty} \frac{1}{a_n^2} \big[\log \mathbb{P}(I^\star = J^\star,\, \bar X_n > a_n u) - \log \mathbb{P}(\bar X_n > a_n u)\big] \le J_1(u \mid \sigma) - J(u \mid \sigma) < 0, \tag{31}\] when $J_1(u \mid \sigma) < J(u \mid \sigma)$, showing that $\mathbb{P}(I^\star = J^\star \mid \bar X_n > a_n u) = o(1)$ when $J_1(u \mid \sigma) < J(u \mid \sigma)$. This proves (24) subject to (23).
We are left to prove the claim in (23), for which we consider several cases:

Case (2): $u^{(2)}/\sigma^{(2)} \le \rho\, u^{(1)}/\sigma^{(1)}$. The proof follows Case (1) and is omitted.

Case (3): $u^{(1)}/\sigma^{(1)} > \rho\, u^{(2)}/\sigma^{(2)}$ and $u^{(2)}/\sigma^{(2)} > \rho\, u^{(1)}/\sigma^{(1)}$. Lemma 2 implies that (29) is simply the maximum of $J_1(u \mid \sigma)$ and $J_2(u \mid \sigma)$. Which of the two is the maximizer depends sensitively on the relation between $\rho$ and $u$.
From this formula, we see the importance of symmetry, as $A^{(i,j)} = A^{(j,i)}$ in the symmetric case where $u^{(1)} = u^{(2)}$, but not otherwise. In the symmetric case we may therefore write, instead, \[\mathbb{P}(\bar Z_n > a_n u) = \mathbb{P}\Big( \bigcup_{(i,j)\colon i \le j} \{Z_i^{(1)} > a_n u^{(1)},\, Z_j^{(2)} > a_n u^{(2)}\} \Big). \tag{42}\]
Using inclusion-exclusion. We use inclusion-exclusion to obtain that \[\sum_{(i,j)} \mathbb{P}(A^{(i,j)}) - e_n(u) \le \mathbb{P}(\bar Z_n > a_n u) \le \sum_{(i,j)} \mathbb{P}(A^{(i,j)}), \tag{43}\] where \[e_n(u) := \sum_{(i,j) \ne (k,l)} \mathbb{P}(A^{(i,j)} \cap A^{(k,l)}), \tag{44}\] while for the symmetric case, we sum over ordered pairs $(i,j)$ with $i \le j$ instead.
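The two-sided Bonferroni bound underlying this step can be illustrated on a small finite example (the events below are assumptions chosen purely for illustration):

```python
import itertools

# Three overlapping events over the uniform sample space {0, ..., 7} (bit patterns):
# A_k = "bit k is set"; each has probability 1/2.
space = range(8)
events = [set(x for x in space if x & (1 << k)) for k in range(3)]

p_union = len(set.union(*events)) / 8
sum_single = sum(len(A) / 8 for A in events)
sum_pairs = sum(len(A & B) / 8 for A, B in itertools.combinations(events, 2))
# Bonferroni: sum_single - sum_pairs <= P(union) <= sum_single
```

In the proof, the upper bound gives the sum-of-probabilities term and the pairwise-intersection sum is exactly the error term $e_n(u)$, which is then shown to be of smaller order.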
Below, we analyse each of these terms. We separate between the cases where (i) the indices are different; (ii) they are equal and the probability simplifies; (iii) they are equal and we need to perform the integral over the joint density using the Laplace method. We start with the asymmetric case where $u^{(1)} \ne u^{(2)}$, remarking on the extension to the symmetric case at the end of the proof. Without loss of generality, we may assume that $u^{(1)} > u^{(2)}$.
Sum of probabilities: unequal indices. First consider the case where $i \ne j$. Then, since $Z_i^{(1)}$ and $Z_j^{(2)}$ are independent standard normal random variables, \[\sum_{(i,j)\colon i \ne j} \mathbb{P}(A^{(i,j)}) = n(n-1)\, \big[1 - \Phi(a_n u^{(1)})\big]\, \big[1 - \Phi(a_n u^{(2)})\big], \tag{45}\] where $\Phi(x) = \mathbb{P}(Z \le x)$ is the distribution function of a standard normal. By the Mills ratio asymptotics $1 - \Phi(x) = (1+o(1))\, e^{-x^2/2}/(x\sqrt{2\pi})$ for $x$ large, we thus obtain the asymptotics of this sum.

Sum of probabilities: simple cases of equal indices. We next proceed with the case where $i = j$, for which we get $\mathbb{P}(Z^{(1)} > a_n u^{(1)},\, Z^{(2)} > a_n u^{(2)})$.
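The Mills-ratio approximation used in (45) is easy to check numerically; the evaluation points below are illustrative.

```python
import math

def gauss_sf(x):
    """Exact standard normal tail 1 - Phi(x), via erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def mills_approx(x):
    """Leading-order tail: (1 - Phi(x)) ~ exp(-x^2/2) / (x * sqrt(2*pi))."""
    return math.exp(-x * x / 2) / (x * math.sqrt(2 * math.pi))

# the ratio exact/approx is 1 - 1/x^2 + O(1/x^4), increasing toward 1
ratios = [gauss_sf(x) / mills_approx(x) for x in (3.0, 6.0, 12.0)]
```

The relative error decays like $1/x^2$, which is why the prefactor only affects the polynomial corrections $a_n^b n^c$ and not the exponential rate.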
Sum of probabilities: Laplace integral for equal indices. When the above simple cases do not apply, we write $\mathbb{P}(Z^{(1)} > a_n u^{(1)},\, Z^{(2)} > a_n u^{(2)})$ explicitly as a two-dimensional integral, \[\mathbb{P}(Z^{(1)} > a_n u^{(1)},\, Z^{(2)} > a_n u^{(2)}) = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_{a_n u^{(1)}}^{\infty}\!\!\int_{a_n u^{(2)}}^{\infty} \exp\Big( -\frac{y_1^2 - 2\rho y_1 y_2 + y_2^2}{2(1-\rho^2)} \Big)\, dy_2\, dy_1.\] We rescale the integration variables by $a_n$ to obtain \[\mathbb{P}(Z^{(1)} > a_n u^{(1)},\, Z^{(2)} > a_n u^{(2)}) = \frac{a_n^2}{2\pi\sqrt{1-\rho^2}} \int_{u^{(1)}}^{\infty}\!\!\int_{u^{(2)}}^{\infty} \exp\Big( -a_n^2\, \frac{x_1^2 - 2\rho x_1 x_2 + x_2^2}{2(1-\rho^2)} \Big)\, dx_2\, dx_1.\] This is a classic example of a Laplace integral. Thus, the integral is dominated by the minimum of $(x_1^2 - 2\rho x_1 x_2 + x_2^2)/(2(1-\rho^2))$ over all $(x_1, x_2)$ for which $x_1 \ge u^{(1)}$, $x_2 \ge u^{(2)}$. Since $(x_1^2 - 2\rho x_1 x_2 + x_2^2)/(2(1-\rho^2))$ is convex, this minimum is attained on the boundary. Since $\rho u^{(1)} < u^{(2)}$, the minimum is attained at $x_1 = u^{(1)}$, $x_2 = u^{(2)}$ (see also the analysis in Lemma 1). Shifting both integration variables by $u^{(1)}$ and $u^{(2)}$, respectively, and then rescaling by $a_n^{-2}$, leads to the convergence of \[2\pi a_n^2\, \exp\Big( a_n^2\, \frac{(u^{(1)})^2 - 2\rho u^{(1)} u^{(2)} + (u^{(2)})^2}{2(1-\rho^2)} \Big)\, \mathbb{P}(Z^{(1)} > a_n u^{(1)},\, Z^{(2)} > a_n u^{(2)})\] to an explicit constant. Again we see the significance of the assumption that $\rho u^{(1)} < u^{(2)}$, which implies that both linear terms arising after the shift have a negative coefficient, and thus the exponential functions are integrable. As a result, the first exponential in the integral only leads to an error term. Combining (51)-(53) with (60) yields the asymptotics of the sum of probabilities. Note that the final outcome yields (35) in the asymmetric case, so what is left is to show that the error term $e_n(u)$ is of smaller order.
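The Laplace-method conclusion, that the exponential rate of $\mathbb{P}(Z^{(1)} > a u^{(1)}, Z^{(2)} > a u^{(2)})$ is the quadratic form evaluated at the corner $(u^{(1)}, u^{(2)})$ when $\rho u^{(1)} < u^{(2)}$, can be checked with a crude midpoint Riemann sum over the bivariate density; the grid and scale parameters below are illustrative assumptions.

```python
import math

def bivar_tail(a, u1, u2, rho, span=4.0, steps=400):
    """Midpoint Riemann-sum approximation of P(Z1 > a*u1, Z2 > a*u2), Corr = rho."""
    h = span / steps
    norm = 1.0 / (2 * math.pi * math.sqrt(1 - rho * rho))
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            x = a * u1 + (i + 0.5) * h
            y = a * u2 + (j + 0.5) * h
            q = (x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho * rho))
            total += norm * math.exp(-q) * h * h
    return total

rho, u1, u2 = 0.4, 1.0, 0.8       # note rho*u1 < u2, so the corner dominates
corner_rate = (u1**2 - 2*rho*u1*u2 + u2**2) / (2 * (1 - rho**2))
r3 = -math.log(bivar_tail(3.0, u1, u2, rho)) / 3.0**2
r5 = -math.log(bivar_tail(5.0, u1, u2, rho)) / 5.0**2
```

The empirical rates $-a^{-2}\log \mathbb{P}$ decrease toward the corner value as $a$ grows; the remaining gap is the polynomial prefactor, of order $\log a^2 / a^2$.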
The symmetric case: sum of probabilities. We now look at the sum of probabilities for the symmetric case, and analyse $\mathbb{P}(A^{(i,j)})$ there. The analysis for the case where $i \ne j$ is identical to the one above, except that the prefactor (due to the number of pairs $(i,j)$) changes from $n(n-1)$ to $n(n-1)/2$. The contribution for the case where $i = j$ is also the same as above. In fact, it is easy to see that for $u^{(1)} = u^{(2)} = u$, we have $I((u,u)) = u^2$ when $\rho \le 0$, while $I((u,u)) = u^2/(1+\rho)$ for $\rho > 0$, since $u^2/(1+\rho) < u^2$ precisely when $\rho > 0$.
The error term $e_n(u)$: asymmetric case. In dealing with error terms, we will make essential use of the fact that $a_n \gg \sqrt{\log n}$. This condition implies that if a certain event $A$ satisfies $\mathbb{P}(A) \le e^{-a_n^2 J}$ for some $J > I(u)$, then $\mathbb{P}(A)$ constitutes an error term in evaluating $\mathbb{P}(\bar Z_n > a_n u)$, irrespective of the precise powers of $a_n$ and $n$. Recall (44). We investigate the different ways in which $(i,j) \ne (k,l)$ can occur, depending on the cardinality of $\{i,j,k,l\}$, which ranges from 2 to 4. Below, we assume throughout the analysis that the indices $i, j, k, l$ used are distinct.
Case (4): $(i,j), (k,l)$: By independence, this probability equals $\mathbb{P}(A^{(i,j)})\,\mathbb{P}(A^{(k,l)})$, the rate at speed $a_n^2$ being at least $2I(u)$, which is again strictly larger than $I(u)$. Together, these cases show that $e_n(u)$ is of smaller order than the sum of probabilities in the asymmetric case.
The error term $e_n(u)$: symmetric case. This analysis is similar to the asymmetric case, except that some cases do not arise. We again go through the distinct possibilities, writing $u = u^{(1)} = u^{(2)}$:

Case (2): $(i,i), (i,j)$ or $(i,i), (j,i)$: By independence, this probability equals $\mathbb{P}(Z^{(1)} > a_n u,\, Z^{(2)} > a_n u)\, \mathbb{P}(Z > a_n u)$. Obviously, the rate at speed $a_n^2$ is strictly larger than $I((u,u))$.

Case (3a): $(i,i), (j,k)$: By independence, this probability equals $\mathbb{P}(A^{(i,i)})\,\mathbb{P}(A^{(j,k)})$, the rate at speed $a_n^2$ again being strictly larger than $I((u,u))$.

Case (3b): $(i,j), (j,k)$: By independence, this probability equals $\mathbb{P}(A^{(j,j)})\,\mathbb{P}(Z > a_n u)\,\mathbb{P}(Z > a_n u)$, the rate at speed $a_n^2$ again being strictly larger than $I((u,u))$.

Case (4): $(i,j), (k,l)$: By independence, this probability equals $\mathbb{P}(A^{(i,j)})\,\mathbb{P}(A^{(k,l)})$, the rate at speed $a_n^2$ being at least $2I((u,u))$, which is again strictly larger than $I((u,u))$.
Together, these cases show that e n (u) is also of smaller order than the sum of probabilities in the symmetric case.