Convergence of U-Processes in Hölder Spaces with Application to Robust Detection of a Changed Segment

To detect a changed segment (a so-called epidemic change) in a time series, variants of the CUSUM statistic are frequently used. However, they are sensitive to outliers in the data and do not perform well for heavy-tailed data, especially when short segments get a high weight in the test statistic. We present a robust test statistic for epidemic changes based on the Wilcoxon statistic. To study its asymptotic behavior, we prove functional limit theorems for U-processes in Hölder spaces. We also study the finite sample behavior via simulations and apply the statistic to a real data example.


Introduction
In change point detection, the hypothesis is typically stationarity, but there are different types of alternatives, such as at most one change point or multiple change points. In this article, we are interested in testing stationarity against the so-called epidemic change or changed segment alternative: We have a random sample $X_1, X_2, \dots, X_n$ (with values in a sample space $(S, \mathcal{S})$ and distributions $P_{X_1}, P_{X_2}, \dots, P_{X_n}$) and we wish to test the null hypothesis
$$H_0: P_{X_1} = P_{X_2} = \cdots = P_{X_n}$$
versus the alternative
$H_1$: there is a segment $I^* := \{k^* + 1, \dots, m^*\} \subset I_n := \{1, 2, \dots, n\}$ such that $P_{X_i} = Q$ for $i \in I^*$, $P_{X_i} = P$ for $i \in I_n \setminus I^*$, and $P \neq Q$.
Under $H_1$ the sample $(X_i, i \in I^*)$ constitutes a changed segment starting at $k^*$ and having length $\ell^* = m^* - k^*$, and $Q$ is the corresponding distribution in the changed segment. This type of alternative is of special relevance in epidemiology and was first studied by Levin and Kline [16] in the case of a change in mean. Their test statistic is a generalization of the CUSUM (cumulative sum) statistic. Simultaneously, epidemic-type models were introduced by Commenges, Seal and Pinatel [3] in connection with experimental neurophysiology.
where $I(k, m) = \{k + 1, \dots, m\}$ and $I^c(k, m) = I_n \setminus I(k, m)$. As neither the beginning $k^*$ nor the end $m^*$ of the changed segment is known, the statistic
$$T_n := \max_{0 \le k < m \le n} \frac{1}{\rho_n(m - k)} \Delta_n(k, m)$$
may be used to test for the presence of a changed segment in the sample $(X_i)$, where $\rho_n(m - k)$ is a factor smoothing over the influence of either too short or too long data windows. In this paper we consider a class of U-statistic type measures of heterogeneity $\Delta_n(k, m)$ defined via a measurable function $h : S \times S \to \mathbb{R}$ by
$$\Delta_n(k, m) = \Delta_{h,n}(k, m) := \sum_{i \in I(k,m)} \sum_{j \in I_n \setminus I(k,m)} h(X_i, X_j),$$
and the corresponding test statistics $T_n(\gamma, h)$, weighted by $\rho_\gamma((m - k)/n)$, where $0 \le \gamma < 1/2$ and $\rho_\gamma(t) = [t(1 - t)]^\gamma$, $0 < t < 1$.
Although other weighting functions are possible, our choice is limited by the application of a functional central limit theorem in Hölder spaces.
Recall that the kernel $h$ is symmetric if $h(x, y) = h(y, x)$ and antisymmetric if $h(x, y) = -h(y, x)$ for all $x, y \in S$. Any nonsymmetric kernel $h$ can be antisymmetrized by considering $\tilde{h}(x, y) = h(x, y) - h(y, x)$, $x, y \in S$.
Note that the kernel $h$ is antisymmetric if and only if $E[h(X, Y)] = 0$ for any independent random variables $X$ and $Y$ with the same distribution such that the expectation exists. The "only if" part follows from Fubini's theorem and antisymmetry. To see the "if" part, first consider the one-point distribution $X = x$ and $Y = x$ almost surely to conclude that $h(x, x) = 0$ for all $x$. Next, consider the two-point distribution $P(X = x) = P(X = y) = 1/2$ (and likewise for $Y$) and conclude that $0 = E[h(X, Y)] = (h(x, x) + h(y, y) + h(x, y) + h(y, x))/4$ and thus $h(x, y) = -h(y, x)$. So a U-statistic with an antisymmetric kernel has expectation 0 if the observations are independent and identically distributed, which makes such statistics good candidates for change point tests. We only consider antisymmetric kernels in this paper.
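The step that uses Fubini's theorem and antisymmetry can be written out as a short worked equation. This is only a sketch; the symbol $P$ for the common distribution of $X$ and $Y$ is notation introduced here for illustration.

```latex
\begin{aligned}
E[h(X, Y)]
  &= \int_S \int_S h(x, y)\, P(dx)\, P(dy)
   = \int_S \int_S h(y, x)\, P(dx)\, P(dy)   && \text{(rename variables, Fubini)} \\
  &= -\int_S \int_S h(x, y)\, P(dx)\, P(dy)
   = -E[h(X, Y)]                              && \text{(antisymmetry)},
\end{aligned}
\qquad\text{hence } E[h(X, Y)] = 0 .
```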
In the case of a real-valued sample, examples of antisymmetric kernels include the CUSUM kernel $h_C(x, y) = x - y$ and the Wilcoxon kernel $h_W(x, y) = 1_{\{x<y\}} - 1_{\{y<x\}}$. The kernel $h_W$ leads to the Wilcoxon type statistics $T_n(\gamma, h_W)$, built from the double sums $\sum_{i \in I(k,m)} \sum_{j \in I_n \setminus I(k,m)} (1_{\{X_i<X_j\}} - 1_{\{X_j<X_i\}})$, weighted by $1/\rho_\gamma((m - k)/n)$ and maximized over $0 \le k < m \le n$, whereas with the kernel $h_C$ we get CUSUM type statistics, in which the double sum reduces to $n \sum_{i=k+1}^{m} (X_i - \bar{X}_n)$, where $\bar{X}_n := n^{-1} \sum_{i=1}^{n} X_i$. As more general classes of kernels and corresponding statistics we can consider the CUSUM test of transformed data ($h(x, y) := \psi(x) - \psi(y)$) or a test based on two-sample M-estimators ($h(x, y) = \psi(x - y)$ for some monotone function $\psi$, see Dehling et al. [8]).
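To make the construction concrete, the following Python sketch computes the weighted maximum of $\Delta_{h,n}(k, m)/\rho_\gamma((m-k)/n)$ for the two kernels above. The function names are hypothetical, and taking the absolute value of $\Delta_{h,n}(k, m)$ (the `two_sided` flag) is an assumption made here, not something confirmed by the text. The sketch uses the fact that, for an antisymmetric kernel, the sum of $h(X_i, X_j)$ over pairs inside the segment vanishes, so $\Delta_{h,n}(k, m)$ equals a difference of cumulated row sums.

```python
import numpy as np

def wilcoxon_kernel(x, y):
    # h_W(x, y) = 1{x < y} - 1{y < x}
    return np.sign(y - x)

def cusum_kernel(x, y):
    # h_C(x, y) = x - y
    return x - y

def rho(t, gamma):
    # Weight function rho_gamma(t) = [t(1 - t)]^gamma
    return (t * (1.0 - t)) ** gamma

def delta_matrix(x, kernel):
    """Return D with D[k, m] = Delta_{h,n}(k, m) for 0 <= k < m <= n."""
    x = np.asarray(x, dtype=float)
    H = kernel(x[:, None], x[None, :])            # H[i, j] = h(X_i, X_j)
    # Antisymmetry: pairs inside I(k, m) cancel, so Delta(k, m) is the
    # sum of the full row sums over i in I(k, m).
    R = np.concatenate(([0.0], np.cumsum(H.sum(axis=1))))
    return R[None, :] - R[:, None]                # D[k, m] = R[m] - R[k]

def changed_segment_statistic(x, kernel, gamma, two_sided=True):
    """Weighted maximum over segments and the maximizing pair (k, m)."""
    n = len(x)
    D = delta_matrix(x, kernel)
    k, m = np.triu_indices(n + 1, k=1)            # all pairs with k < m
    keep = (m - k) < n                            # full sample: rho = 0, Delta = 0
    k, m = k[keep], m[keep]
    vals = D[k, m]
    if two_sided:                                 # assumption: absolute value
        vals = np.abs(vals)
    vals = vals / rho((m - k) / n, gamma)
    best = int(np.argmax(vals))
    return float(vals[best]), (int(k[best]), int(m[best]))
```

For example, `changed_segment_statistic(x, wilcoxon_kernel, gamma=0.1)` returns the value of the statistic together with the maximizing pair `(k, m)`, which can serve as an estimate of the changed segment $\{k+1, \dots, m\}$. The computation is quadratic in $n$, which is acceptable for a few hundred observations.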
Based on invariance principles in Hölder spaces discussed in the next section, we derive the limit distribution of the test statistics $T_n(\gamma, h)$. Theorems 1 and 2 provide examples of our results. Let $W = (W(t), t \ge 0)$ be a standard Wiener process and $B = (B(t), 0 \le t \le 1)$ be a corresponding Brownian bridge. Define for $0 \le \gamma < 1/2$,

Theorem 1. If $(X_i)_{i \in \mathbb{N}}$ are independent and identically distributed random elements and $h$ is an antisymmetric kernel with $E[|h(X_1, X_2)|^p] < \infty$ for some $p > 2$, then for any $\gamma < (p - 2)/(2p)$, we have

where the variance parameter $\sigma_h$ is defined by

Note that in practice, the random variables $X_i$ might not have high moments, but if we use a bounded kernel like $h_W$, we know that the condition of the theorem holds for any $p \in (0, \infty)$, so we have the convergence for any $\gamma < 1/2$. Also, in practical applications, the variance parameter has to be estimated. This can be done by

with $\hat{h}_1(x) = n^{-1} \sum_{j=1}^{n} h(x, X_j)$.

For the case of a dependent sample, we consider absolutely regular sequences of random elements (also called β-mixing). Recall that the coefficients of absolute regularity $(\beta_m)_{m \in \mathbb{N}}$ are defined by

where $\mathcal{F}_a^b := \sigma(X_a, X_{a+1}, \dots, X_b)$ is the σ-field generated by $X_a, X_{a+1}, \dots, X_b$.
Theorem 2. Let $(X_i)_{i \in \mathbb{N}}$ be a stationary, absolutely regular sequence and $h$ be an antisymmetric kernel, and assume that the following conditions are satisfied:

Then for any $0 \le \gamma < 1/2 - 1/r$, we have

where the long run variance parameter $\sigma_\infty$ is given by

For a bounded kernel $h$, condition (ii) on the decay of the coefficients of absolute regularity reduces to (ii') $\sum_k \max\{k, k^{r/2-1}\} \beta_k < \infty$ for some $r > 2$. Following Vogel and Wendler [26], $\sigma_\infty^2$ can be estimated using a kernel variance estimator. For this, define autocovariance estimators $\hat{\rho}(k)$ by

Then, for some Lipschitz continuous function $K$ with $K(0) = 1$ and finite integral, we set $\hat{\sigma}$

where $b_n$ is a bandwidth such that $b_n \to \infty$ and $b_n/\sqrt{n} \to 0$ as $n \to \infty$. With the help of the limit distribution and the variance estimators, we obtain critical values for our test statistic. Simulated quantiles for the limit distribution can be found in Section 8.
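A kernel variance estimator of the kind described above can be sketched in Python as follows. Everything specific here is an assumption for illustration: the function names are hypothetical, the Bartlett weight $K(u) = \max(0, 1 - |u|)$ is just one Lipschitz function with $K(0) = 1$ and finite integral, the autocovariances are centered at the sample mean, and any constant factor in the definition of $\sigma_\infty^2$ is not reproduced.

```python
import numpy as np

def h1_hat(x, kernel):
    # Empirical first-order term: h1_hat(X_i) = n^{-1} * sum_j h(X_i, X_j)
    x = np.asarray(x, dtype=float)
    return kernel(x[:, None], x[None, :]).mean(axis=1)

def bartlett(u):
    # Assumed weight function K: Lipschitz, K(0) = 1, finite integral
    return np.maximum(0.0, 1.0 - np.abs(u))

def long_run_variance(v, bandwidth, weight=bartlett):
    """Kernel estimate of sum_k Cov(v_0, v_k) from the series v."""
    v = np.asarray(v, dtype=float)
    n = len(v)
    c = v - v.mean()
    total = np.dot(c, c) / n                      # lag-0 autocovariance
    for k in range(1, n):
        w = float(weight(k / bandwidth))
        if w == 0.0:
            continue
        # Lag-k autocovariance, counted twice (lags k and -k)
        total += 2.0 * w * np.dot(c[:-k], c[k:]) / n
    return total
```

A possible use is `long_run_variance(h1_hat(x, kernel), bandwidth=4)`, where `kernel` is a vectorized antisymmetric kernel such as `lambda a, b: np.sign(b - a)` for the Wilcoxon case.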
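To discuss the behavior of the test statistics $T_n(\gamma, h)$ under the alternative, we assume that for each $n \ge 1$ we have two probability measures $P_n$ and $Q_n$ on $(S, \mathcal{S})$ and a random sample $(X_{ni})_{1 \le i \le n}$ such that for $k_n^*, \ell_n^* \in \{1, \dots, n\}$,
$$P_{X_{ni}} = \begin{cases} Q_n, & \text{for } i \in I^* := \{k_n^* + 1, \dots, k_n^* + \ell_n^*\}, \\ P_n, & \text{for } i \in I_n \setminus I^*. \end{cases}$$
Set
$$\delta_n = \int_S \int_S h(x, y)\, Q_n(dx)\, P_n(dy), \qquad \nu_n = \int_S \int_S (h(x, y) - \delta_n)^2\, Q_n(dx)\, P_n(dy).$$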
Theorem 3. Let $0 \le \gamma < 1$. Assume that for all $n \in \mathbb{N}$, the random variables $X_{n1}, \dots, X_{nn}$ are independent and let $h$ be an antisymmetric kernel. If

For dependent random variables, we get a similar theorem:

Theorem 4. Assume that for all $n \in \mathbb{N}$, the random variables $X_{n1}, \dots, X_{nn}$ are absolutely regular with mixing coefficients $(\beta_k)_{k \in \mathbb{N}}$ not depending on $n$, such that $\sum_{k=1}^{\infty} k^{q/2} \beta_k^{1/2 - 1/q} < \infty$ for some $q > 2$. Let $h$ be an antisymmetric kernel such that there exists $C_q < \infty$ with $E[|h(X_{ni}, X_{nj})|^q] \le C_q$ for all $n \in \mathbb{N}$, $i, j \le n$. Furthermore, let $0 \le \gamma < 1$ and assume that lim

Then (4) holds.
This implies that a test based on the statistic $T_n(\gamma, h)$ is consistent. For more on consistency, see Section 7. The proofs of Theorems 1 and 2 are given in Section 6.

Simulation results
We compare the CUSUM type and the Wilcoxon type test statistics in a Monte Carlo simulation study. The model is an autoregressive process $(Y_n)_{n \in \mathbb{N}}$ of order 1 with $Y_i = aY_{i-1} + \varepsilon_i$, where $(\varepsilon_i)_{i \in \mathbb{N}}$ are either normally distributed, exponentially distributed or $t_5$ distributed. We assume that the first $L$ observations are shifted, so that we observe

Under independence, the distribution of the change-point statistics does not depend on the beginning of the changed segment, only on its length. In Table 1, we show some simulation results comparing the power for a changed segment in the beginning of the data and in the middle for a dependent sequence (autoregressive parameter $a = 0.5$). The rejection frequencies do not differ much, so we restrict further simulations to segments of the form $I = \{1, \dots, L\}$.

Table 1: Empirical rejection frequency under the alternative for an AR(1) process of length $N = 480$ with AR parameter 0.5 and $t_5$ distributed innovations, changed segment from 1 to 160 or from 161 to 320, change height $\delta_n = 0.58$, level $\alpha = 5\%$.

In Figure 1, the results for $n = 240$ independent observations ($a = 0$) are shown. In this case, we use the known variance of our observations and do not estimate the variance. The relative rejection frequency of 3,000 simulation runs under the alternative is plotted against the relative rejection frequency under the hypothesis for theoretical significance levels of 1%, 2.5%, 5% and 10%. As expected, the CUSUM test performs better than the Wilcoxon test for normally distributed data. For the exponential and the $t_5$ distribution, the Wilcoxon type test has higher power. For the long changed segment ($L = 80$), the weighted tests with $\gamma = 0.1$ outperform the tests with $\gamma = 0.3$. For the short changed segment ($L = 30$), the Wilcoxon type test has more power with weight $\gamma = 0.3$. The same holds for the CUSUM type test under normality. For the other two distributions, however, the empirical size is also higher for $\gamma = 0.3$, so that the size-corrected power is not improved.
In Figure 2, we show the results for n = 480 dependent observations (AR(1) with a = 0.5). In this case, we estimated the long run variance with a kernel estimator, using the quartic spectral kernel and the fixed bandwidth b = 4. Both tests now become too liberal, with typical rejection rates of 13% to 15% for a theoretical level of 10%. For the long changed segment (L = 160) it is better to use the weight γ = 0.1, and for the short segment (L = 60) the weight γ = 0.3. Under normality, the CUSUM type test performs better, though the difference in power is not very large. For the other two distributions, the Wilcoxon type test has better power.
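The simulation model can be sketched in Python as below. The function name and parameters are hypothetical, and several details are assumptions because they are not stated in the text: the shift is taken to be additive with height `delta` on the first `L` observations, a burn-in period is used to approximate stationarity, and the innovations are used with their standard parametrization without centering or rescaling.

```python
import numpy as np

def simulate_changed_ar1(n, a, L, delta, innovations="normal",
                         burn_in=200, rng=None):
    """AR(1) series Y_i = a * Y_{i-1} + eps_i with the first L values shifted."""
    rng = np.random.default_rng() if rng is None else rng
    total = n + burn_in
    if innovations == "normal":
        eps = rng.standard_normal(total)
    elif innovations == "exponential":
        eps = rng.exponential(1.0, total)         # assumption: rate 1, not centered
    elif innovations == "t5":
        eps = rng.standard_t(5, total)
    else:
        raise ValueError(f"unknown innovation type: {innovations}")
    y = np.empty(total)
    prev = 0.0
    for i in range(total):
        prev = a * prev + eps[i]                  # AR(1) recursion
        y[i] = prev
    x = y[burn_in:].copy()                        # drop the burn-in period
    x[:L] += delta                                # assumption: additive shift
    return x
```

For instance, `simulate_changed_ar1(480, 0.5, 160, 0.58, "t5")` produces one series resembling the dependent setting, with the changed segment at the beginning.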
In practice, the strength of dependence is usually not known beforehand, so it would make sense to use a data-adaptive bandwidth for the variance estimation. However, the bias of the variance estimator under the alternative might get worse for data-adaptive bandwidths, and this might lead to a nonmonotonic power of change-point tests, see e.g. Vogelsang [27] or Shao and Zhang [23]. For this reason, we propose to estimate the variance in the following way: Split the data set into five shorter parts of equal length and use a variance estimator with data-adaptive bandwidth separately for each of the parts. Then take the median of the five estimators for standardizing the test statistic. The beginning and the end of the changed segment will affect at most two of the parts, so we have at least three estimates not affected. In the simulations in Figure 3, we study again an AR(1) process and use the standard setting of the R function lvar for the data-adaptive choice of the bandwidths in the five parts. With this method, we do not observe a loss of power compared to the fixed bandwidth. Under the hypothesis, all tests become strongly oversized. The Wilcoxon type test statistic clearly outperforms the CUSUM type statistic for nonnormal innovations.
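The median-of-parts idea can be sketched in a few lines of Python. The function name is hypothetical, and the per-part estimator is passed in as a callable, since the data-adaptive estimator used in the simulations (the R function lvar) is not reimplemented here; when the sample size is not divisible by five, the parts differ in length by at most one observation.

```python
import numpy as np

def median_of_blocks_variance(x, estimator, n_blocks=5):
    """Apply `estimator` to each of n_blocks consecutive parts, return the median."""
    blocks = np.array_split(np.asarray(x, dtype=float), n_blocks)
    # A changed segment has two end points, so at most two parts contain
    # a jump; the median of five ignores those two estimates.
    return float(np.median([estimator(b) for b in blocks]))
```

Any estimator that maps a one-dimensional array to a number can be plugged in, for example `np.var` for independent data or a long run variance estimator for dependent data.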
Another problem in many practical applications is the unknown length of the changed segment, so that it is difficult to choose the value γ ∈ [0, 1/2) to achieve the optimal power. If there is no a priori knowledge of the typical length of an epidemic change, it would also be possible to use the maximum of (suitably standardized) test statistics for different values of γ. Another straightforward application of Theorem 15 leads to the asymptotic distribution of this combined test statistic and critical values could be obtained via simulations, but this goes beyond the scope of this paper.

Data example
We investigate the frequency of searches for the term 'Harry Potter' from January 2004 until February 2019, obtained from Google Trends. The time series is plotted in Figure 4. We apply the CUSUM type and the Wilcoxon type change-point test with weight parameters γ ∈ {0, 0.1, ..., 0.4}. The lag one autocovariance is estimated as 0.457, so that we have to allow for dependence in our testing procedure. We estimate the long run variance with a kernel estimator, using the quartic spectral kernel and the fixed bandwidth b = 4.
The CUSUM type test does not reject the hypothesis of stationarity for a significance level of 5%, regardless of the choice of γ. In contrast, the Wilcoxon type test detects a changed segment for any γ ∈ {0, 0. By visual inspection of the time series, we come to the conclusion that the estimated changed segment for values γ ≥ 0.1 fits the data better, because this segment coincides with a period with only low frequencies of search. Furthermore, the spikes of this time series can be explained by the release of movies, and the estimated changed segment is between the release of the last Harry Potter movie in July 2011 and the release of 'Fantastic Beasts and Where to Find Them' in November 2016.

Double partial sum process
Throughout this section we assume that the sequence (X i ) is stationary and P X := P X i is the distribution of each X i . Consider for a kernel h : S × S → R the double partial sums h(X i , X j ), 1 ≤ k < n and the corresponding polygonal line process U h,n = (U h,n (t), t ∈ [0, 1]) defined by
In order to make use of results for partial sum processes, we decompose the U-statistics into a linear part and a so-called degenerate part. Hoeffding's decomposition of the kernel $h$ reads
$$h(x, y) = h_1(x) - h_1(y) + g(x, y), \quad\text{where}\quad h_1(x) = \int_S h(x, y)\, P_X(dy) \quad\text{and}\quad g(x, y) = h(x, y) - h_1(x) + h_1(y), \quad x, y \in S,$$
and leads to the splitting

where $W_{h_1,n}$ is the polygonal line process defined by partial sums of the random variables $(h_1(X_i))$. Decomposition (7) reduces the $(h, \gamma)$-FCLT to a Hölderian invariance principle for the random variables $(h_1(X_i))$ via the following lemma.
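To see why the linear part behaves like a bridge-type partial sum process, one can insert $h_1(x) - h_1(y)$ into the double sum over a segment $I(k, m) = \{k+1, \dots, m\}$ and its complement in $I_n$. The following worked computation is a sketch; the notation $S_k := \sum_{i=1}^{k} h_1(X_i)$ (with $S_0 = 0$) and $\ell := m - k$ is introduced here only for illustration.

```latex
\begin{aligned}
\sum_{i \in I(k,m)} \; \sum_{j \in I_n \setminus I(k,m)} \bigl( h_1(X_i) - h_1(X_j) \bigr)
  &= (n - \ell)\,(S_m - S_k) \;-\; \ell\,\bigl( S_n - (S_m - S_k) \bigr) \\
  &= n\,(S_m - S_k) \;-\; (m - k)\, S_n \\
  &= n \Bigl[ \bigl( S_m - \tfrac{m}{n} S_n \bigr) - \bigl( S_k - \tfrac{k}{n} S_n \bigr) \Bigr].
\end{aligned}
```

After multiplication by $n^{-3/2}$, the right-hand side is an increment of the tied-down partial sum process $n^{-1/2}(S_k - (k/n) S_n)$, which is the discrete analogue of an increment $B(t) - B(s)$ of a Brownian bridge.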

Lemma 6. If there exists a constant C > 0 such that for any integers
for any 0 ≤ γ < 1/2.

Remark 7.
For an antisymmetric kernel $h$ the condition (8) follows from the following one: there exists a constant $C > 0$ such that for any $0 \le m_1 < n_1 \le m_2 < n_2$,

Indeed, by antisymmetry

so that (9) yields

Before we proceed with the proof of Lemma 6 we need some preparation. Let $D_j$ be the set of dyadic numbers of level $j$ in $[0, 1]$, that is $D_0 := \{0, 1\}$ and for $j \ge 1$, $D_j := \{(2l - 1)2^{-j};\ 1 \le l \le 2^{j-1}\}$. For $r \in D_j$ set $r^- := r - 2^{-j}$, $r^+ := r + 2^{-j}$, $j \ge 0$. For $f : [0, 1] \to \mathbb{R}$ and $r \in D_j$ define

The following sequential norm on $H^o_\gamma[0, 1]$ defined by

is equivalent to the norm $\|f\|_\gamma$, see [2]: there is a positive constant $c_\gamma$ such that

In what follows, we denote by log the logarithm to base 2 (log 2 = 1).
Lemma 8. For any 0 ≤ γ ≤ 1 there is a constant c γ > 0 such that, if V n is a polygonal line function with vertexes (0, 0), (k/n, V n (k/n)), k = 1, . . . , n, then Proof. First we remark that for any j ≥ 1, As r + and r − belong to D j , this gives, and it follows by (10), If s and t > s belong to the same interval, say, [(k − 1)/n, k/n], then, observing that the slope of V n in this interval is precisely n[V n (k/n) − V n ((k − 1)/n)], we have If s ∈ [(k − 1)/n, k/n), t ∈ [(j − 1)/n, j/n) and j > k + 1, then We apply these three configurations to s = r and t = r + 2 −j . If j ≥ log n then only the first two configurations are possible and we deduce max j≥log n If j < log n then we apply the third configuration to obtain max j<log n To complete the proof just observe that nr + 2 −j = nr + 1 if j = log n and so ∆ n ≤ max j≤log n 2 γj max r∈D j |V n ( nr + n2 −j /n) − V n ( nr /n)|.
Proof of Lemma 6. By Lemma 8 we have with some constant C > 0, Condition (8) gives This yields, taking into account that nr + n2 −j − nr ≤ n2 −j for r ∈ D j , This completes the proof due to the restriction 0 ≤ γ < 1/2.
The next lemma gives general conditions for the tightness of the sequence (n −1/2 W h 1 ,n ) in Hölder spaces.
Lemma 9. Assume that the sequence (X i ) i∈N is stationary and that for some q > 2, there is a constant c q > 0 such that for any Then for any By Lemma 8, where I n (a) = P max 0≤j≤log n 2 βj max with some constant c β > 0. Since nr + n2 −j − nr ≤ n2 −j we have by condition (11), with some constant c > 0. Since q/2 − qβ − 1 > 0, we obtain I n (a) ≤ ca −q and complete the proof of (12) and that of the lemma.
Summing up we have the following functional limit theorem for the process U h,n .
Theorem 10. Assume that (X i ) is a stationary sequence of S-valued random elements. Let h be an antisymmetric kernel and E|h(X 1 , X 2 )| p < ∞ for some p > 2. If (i) there is a constant C > 0 such that for any 0 ≤ m 1 < n 1 ≤ m 2 < n 2 the inequality (9) is satisfied; (ii) for some 2 < q ≤ p the inequality (11) is satisfied; (iii) there is a Gaussian process U h such that

iid sample
In this subsection we establish the (h, γ)-FCLT for independent identically distributed sequences (X i ) i∈N .
Theorem 11. Assume that (X i ) are independent and identically distributed random elements in S and the measurable function h : Particularly, if the kernel h is antisymmetric and bounded, then (X i ) satisfies (h, γ)-FCLT for any 0 ≤ γ < 1/2.

Condition (ii) is obtained via Rosenthal's inequality. Since the moment assumption gives
As the convergence n −1/2 W h 1 ,n fdd −−−→ n→∞ σ h 1 W is well known, the proof is completed.

Mixing sample
In this subsection we establish the (h, γ)-FCLT for β-mixing sequences (X i ) i∈N . For A ⊂ Z we will denote by P A the joint distribution of {X i , i ∈ A}. We write P X for the distribution of X i . We need some auxiliary lemmas:

Lemma 12. Let i 1 < i 2 < · · · < i k be arbitrary integers. Let f : S k → R be a measurable function such that for any j, for some δ > 0. Then

Proof. The proof goes along the lines of the proof of Lemma 1 in Yoshihara [30].
Lemma 13. Assume that for a δ > 0 there is a constant M such that Then for any 0 ≤ m 1 < n 1 ≤ m 2 < n 2 ,

Proof. We have First consider the case where i 1 < i 2 and j 1 < j 2 . If j 2 − j 1 > i 2 − i 1 then by Lemma 12 we have Note that for any y ∈ S, Treating the other cases in the same way, we deduce that for any If k := min{|i 2 − i 1 |, |j 2 − j 1 |} = |i 2 − i 1 |, then there are fewer than n 1 − m 1 choices for i 1 , at most 2 choices for i 2 , as i 2 ∈ {i 1 − k, i 1 + k}. Furthermore, there are fewer than n 2 − m 2 choices for j 1 , and, because |j 2 − j 1 | ≤ k, at most 2k + 1 choices for j 2 . In the case k := min{|i 2 − i 1 |, |j 2 − j 1 |} = |j 2 − j 1 |, we can use similar reasoning. In total, there are fewer than 12(n 1 − m 1 )(n 2 − m 2 )k ways to choose the indices for given k. We arrive at

then there is a constant c r,δ > 0 such that for any 0 ≤ k < m ≤ n,

Proof. This lemma is proved in Yokoyama [31] for real-valued strongly mixing random variables. We need to note that if (X i ) is β-mixing then (h 1 (X i )) is β-mixing as well for any measurable h 1 : S → R. As such, this sequence is also strongly mixing.
Theorem 15. Assume that (X i ) is a strictly stationary β-mixing sequence of random elements in S and the measurable function h : S × S → R is antisymmetric. If E|h(X 1 , X 2 )| q < ∞ and for some q > 2 and 2 < r < q, then (X i ) satisfies the (h, γ)-FCLT for any 0 ≤ γ < 1/2 − 1/r with the limit process U h = σ ∞ B, where B = (B(t), t ∈ [0, 1]) is a standard Brownian bridge and

In particular, if the kernel h is antisymmetric and bounded, then condition (13) becomes $\sum_k k^{r/2-1} \beta_k < \infty$, and in this case (X i ) satisfies the (h, γ)-FCLT for any 0 ≤ γ < 1/2 − 1/r.

Proof. We need to check conditions (i)-(iii) of Theorem 10. First we check (i) using Lemma 13 with δ = (q−2)/2. Condition (ii) follows immediately from Lemma 14. Finally, convergence of finite dimensional distributions can be deduced from invariance principles for α-mixing sequences proved by a number of authors (see, e.g., [14] and references therein).

Asymptotic distribution under null
In the following, we show how the asymptotic behaviour of the statistic T n (γ, h) follows from the functional limit results for U -processes: Theorem 16. Let 0 ≤ γ < 1/2 and let the kernel h : S × S → R be antisymmetric. Assume that (X i ) is a stationary sequence and satisfies (h, γ)-FCLT with the limit process U h . Then Proof. Set for f ∈ H o γ [0, 1], and 0 ≤ s < t ≤ 1, Since U h (0) = U h (1) we see that F (U h ) = T γ . We have due to anti-symmetry of h, for any 0 ≤ k < m ≤ n,
To this aim we apply the following simple lemma (the proof is given in [21], see Lemma 13 therein).
Lemma 17. Let (η n ) n≥1 be a tight sequence of random elements in the separable Banach space B and g n , g be continuous functionals B → R. Assume that g n converges pointwise to g on B and that (g n ) n≥1 is equicontinuous. Then g n (η n ) = g(η n ) + o P (1).
We check the continuity of the function F first. We have if t − s ≤ 1/2, ρ γ (t − s) ≥ 2 −γ (t − s) γ and this yields Hence, F (f ) ≤ 6||f || γ and this yields the continuity since the inequality |F (f ) − F (g)| ≤ F (f − g) can be easily checked. Similarly we have |F n (f ) − F n (g)| ≤ F n (f − g) ≤ 32 γ ||f − g|| γ , therefore the sequence ( The pointwise convergence of (F n ) being now established, and observing that by the (h, γ)-FCLT, the sequence n −3/2 U n is tight, Lemma 17 gives (14). Since F is continuous, the continuous mapping theorem together with the (h, γ)-FCLT yields By (14) we get This completes the proof.
Combination of this general result with Theorem 11 and Theorem 15 gives the proofs of Theorem 1 and Theorem 2 respectively.

Behavior under the alternative
To discuss the behaviour of the test statistics $T_n(\gamma, h)$ under the alternative, we assume that for each $n \ge 1$ we have two probability measures $P_n$ and $Q_n$ on $(S, \mathcal{S})$ and a random sample $(X_{ni})_{1 \le i \le n}$ such that for $k_n^*, m_n^* \in \{1, \dots, n\}$,
$$P_{X_{ni}} = \begin{cases} Q_n, & \text{for } i \in I^* := \{k_n^* + 1, \dots, m_n^*\}, \\ P_n, & \text{for } i \in I_n \setminus I^*. \end{cases}$$
We will write $k^* = k_n^*$, $m^* = m_n^*$ and $\ell^* = m^* - k^*$ for short. Set
$$\delta_n = \int_S \int_S h(x, y)\, Q_n(dx)\, P_n(dy).$$
Note that $\delta_n$ measures in a sense the difference between the probability distributions $P_n$ and $Q_n$. If $P_n = Q_n$, then $\delta_n = 0$. If $h = h_C$, then $\delta_n = \int x\, Q_n(dx) - \int x\, P_n(dx)$. If $h = h_W$, then $\delta_n = \int Q_n(x)\, P_n(dx) - \int P_n(x)\, Q_n(dx)$, where $P_n(x)$ and $Q_n(x)$ denote the corresponding distribution functions. The general consistency result is given in the following elementary lemma.
and √ n δ n * Proof of Theorem 3. Set for i ∈ I * , j ∈ I n \ I * , Noting that EZ ij = 0 and EZ 2 ij = ν n for any i ∈ I * , j ∈ I n \ I * , we obtain This yields (15) by (3) and completes the proof.
Proof of Theorem 4. We will use a Hoeffding decomposition adjusted to the changing distribution. To this aim we define Next we show that the following estimates hold with an absolute constant C > 0: g n (X i,n , X j,n ) + n i=k * + * +1 k * + * j=k * +1 g n (X i,n , X j,n ) 2   ≤ C * (n − * ). (19) These estimates yield with an absolute constant C > 0 and (15) follows by (5). Hence, it remains to prove (17)- (19).

Critical Values
In Table 2 below, we give the upper quantiles of the limit distribution of the one-sided and two-sided test statistics, that is where B is a standard Brownian bridge. The distribution was evaluated on a grid of size 10,000 and we ran a Monte Carlo simulation with 30,000 runs.
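A Monte Carlo approximation of such quantiles can be sketched in Python as follows. The form of the limit variable is an assumption made for this illustration: the two-sided version is taken to be the supremum of $|B(t) - B(s)|/\rho_\gamma(t - s)$ over $0 \le s < t \le 1$, and the one-sided version drops the absolute value. The function name is hypothetical, and the grid and the number of runs in the usage example are far smaller than those reported above.

```python
import numpy as np

def simulate_T_gamma(gamma, grid_size, n_runs, rng, two_sided=True):
    """Simulate sup_{s<t} |B(t) - B(s)| / rho_gamma(t - s) on an equidistant grid."""
    lags = np.arange(1, grid_size)                # lag grid_size: B(1) - B(0) = 0
    weights = ((lags / grid_size) * (1.0 - lags / grid_size)) ** gamma
    t = np.arange(grid_size + 1) / grid_size
    out = np.empty(n_runs)
    for r in range(n_runs):
        steps = rng.standard_normal(grid_size) / np.sqrt(grid_size)
        W = np.concatenate(([0.0], np.cumsum(steps)))   # Wiener process on the grid
        B = W - t * W[-1]                               # Brownian bridge
        best = -np.inf
        for lag, weight in zip(lags, weights):
            inc = B[lag:] - B[:-lag]                    # all increments with this lag
            m = np.abs(inc).max() if two_sided else inc.max()
            best = max(best, m / weight)
        out[r] = best
    return out

# Usage with a small grid, for illustration only
sample = simulate_T_gamma(0.1, grid_size=100, n_runs=200,
                          rng=np.random.default_rng(0))
upper_quantiles = np.quantile(sample, [0.90, 0.95, 0.99])
```

The cost per run is quadratic in the grid size, so a grid of size 10,000 with 30,000 runs requires a considerably more efficient or parallel implementation than this sketch.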