On variance estimation under shifts in the mean

In many situations, it is crucial to estimate the variance properly.
Ordinary variance estimators perform poorly in the presence of shifts in the mean. We investigate an approach based on non-overlapping blocks, which yields good results in change-point scenarios.
We show the strong consistency and the asymptotic normality of such blocks-estimators of the variance under independence. Weak consistency is shown for short-range dependent strictly stationary data. We provide recommendations on the appropriate choice of the block size and compare this blocks-approach with difference-based estimators. If level shifts occur frequently and are rather large, the best results can be obtained by adaptive trimming of the blocks.


Introduction
We consider a sequence of random variables $Y_1, \dots, Y_N$ generated by model (1) below. Most of the time we assume that $X_1, \dots, X_N$ are i.i.d. random variables with $E X_t = \mu$ and $\operatorname{Var} X_t = \sigma^2$, but this will be relaxed occasionally to allow for a short-range dependent strictly stationary sequence. The observed data $y_1, \dots, y_N$ are affected by an unknown number K of level shifts of possibly different heights $h_1, \dots, h_K$ at different time points $t_1, \dots, t_K$. Our goal is the estimation of the variance $\sigma^2$. Without loss of generality, we will set $\mu = 0$ in the following.
(1) $Y_t = X_t + \sum_{k=1}^{K} h_k \, I_{\{t \ge t_k\}}$.
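For concreteness, model (1) can be simulated in a few lines; this is a minimal sketch with our own helper name `simulate_shifted_series`, assuming i.i.d. standard normal noise:

```python
import numpy as np

def simulate_shifted_series(N, jump_times, jump_heights, rng=None):
    """Generate Y_t = X_t + sum_k h_k * 1{t >= t_k} with X_t i.i.d. N(0, 1).

    `jump_times` uses 1-based indexing as in the text.
    """
    rng = np.random.default_rng(rng)
    x = rng.standard_normal(N)
    y = x.copy()
    for t_k, h_k in zip(jump_times, jump_heights):
        y[t_k - 1:] += h_k  # level shift of height h_k from time t_k onwards
    return y

# two level shifts: +2 at t = 301, -1.5 at t = 701
y = simulate_shifted_series(1000, jump_times=[301, 701], jump_heights=[2.0, -1.5], rng=0)
```

Jump times and heights here are illustrative only; the noise distribution can be swapped for any of the distributions considered below.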
In Sect. 2 we analyse estimators of $\sigma^2$ from the sequence of observations $(Y_t)_{t\ge1}$ obtained by combining estimates from splitting the data into several blocks. Without the need of explicit distributional assumptions, the mean of the blockwise estimates turns out to be consistent if the size and the number of blocks increase and the number of jumps increases more slowly than the number of blocks. If many jumps in the mean are expected to occur, an adaptively trimmed mean of the blockwise estimates can be used, see Sect. 3. In Sect. 4 a simulation study is conducted to assess the performance of the proposed approaches. In Sect. 5 the estimation procedures are applied to real data sets, while Sect. 6 summarizes the results of this paper.

Estimation of the variance by averaging
When dealing with independent identically distributed data the sample variance is the common choice for estimation of $\sigma^2$. However, if we are aware of a possible presence of level shifts at unknown locations, it is reasonable to divide the sample $Y_1, \dots, Y_N$ into m non-overlapping blocks of size $n = \lfloor N/m \rfloor$ and to calculate the average of the m sample variances derived from the different blocks. A similar approach has been used in Dai et al. (2015) in the context of repeated measurements data and in Rooch et al. (2019) for estimation of the Hurst parameter.
The blocks-estimator $\hat\sigma^2_{\text{Mean}}$ of the variance investigated here is defined as

$\hat\sigma^2_{\text{Mean}} = \frac{1}{m}\sum_{j=1}^{m} S_j^2,$

where $S_j^2 = \frac{1}{n-1}\sum_{t=1}^{n}(Y_{j,t}-\bar Y_j)^2$, $\bar Y_j = \frac{1}{n}\sum_{t=1}^{n} Y_{j,t}$ and $Y_{j,1}, \dots, Y_{j,n}$ are the observations in the jth block. We are interested in finding the block size n which yields a low mean squared error (MSE) under certain assumptions.
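A minimal numpy sketch of this estimator (the helper name `blocks_variance` is ours; observations beyond the last complete block are discarded):

```python
import numpy as np

def blocks_variance(y, n):
    """Average of blockwise sample variances over non-overlapping blocks of size n."""
    y = np.asarray(y, dtype=float)
    m = len(y) // n                     # number of complete blocks
    blocks = y[:m * n].reshape(m, n)    # one block per row
    s2 = blocks.var(axis=1, ddof=1)     # unbiased sample variance per block
    return s2.mean()
```

On data with a single level shift, only the block containing the shift is biased upwards, so the average of the blockwise variances stays much closer to $\sigma^2$ than the ordinary sample variance of the whole series.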
In what follows, we will concentrate on the situation where all jump heights are positive. This is a worse scenario than having both positive and negative jumps, since the data are more spread in the former case, resulting in a larger positive bias of most scale estimators.

Asymptotic properties
We will use some algebraic rules for the expectation and the variance of quadratic forms in order to calculate the MSE of $\hat\sigma^2_{\text{Mean}}$, see Seber and Lee (2012). Let B be the number of blocks with jumps in the mean and K ≥ B the total number of jumps. The expected value and the variance of $\hat\sigma^2_{\text{Mean}}$ are given as follows:

(2) $E(\hat\sigma^2_{\text{Mean}}) = \sigma^2 + \frac{1}{m(n-1)}\sum_{j=1}^{B} \mu_j^T A \mu_j, \qquad \operatorname{Var}(\hat\sigma^2_{\text{Mean}}) = \frac{1}{m^2}\sum_{j=1}^{m} \operatorname{Var}(S_j^2),$

where the fourth moment $\nu_4 = E X_1^4$ enters $\operatorname{Var}(S_j^2)$, $A = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T$, $I_n$ is the unit matrix, $\mathbf{1}_n = (1,\dots,1)^T$ and $\mu_j$ contains the expected values of the random variables in the perturbed block $j = 1,\dots,B$, i.e., $\mu_j = (\mu_{j,1},\dots,\mu_{j,n})^T = (E(Y_{j,1}),\dots,E(Y_{j,n}))^T$. The term $\mu_j^T A \mu_j/(n-1)$ is the empirical variance of the expected values $E(Y_{j,1}),\dots,E(Y_{j,n})$ in block j. In a jump-free block, we have $\mu_j^T A \mu_j = 0$, since all expected values and therefore the elements of $\mu_j$ are equal.
The blocks-estimator (2) estimates the variance consistently if the number of blocks grows sufficiently fast, as is shown in Theorem 1.

Theorem 1 Let the data $Y_1,\dots,Y_N$ generated by model (1) be segregated into m blocks of size n, where $t_1,\dots,t_K$ are the time points of the jumps of heights $h_1,\dots,h_K$, respectively. Let B out of m blocks be contaminated by $K_1,\dots,K_B$ jumps, respectively, where the number and heights of the jumps grow slowly enough relative to the number of blocks (cf. Remark 2), and let $m \to \infty$, whereas the block size n can be fixed or increasing as $N \to \infty$. Then $\hat\sigma^2_{\text{Mean}} = \frac{1}{m}\sum_{j=1}^{m} S_j^2 \to \sigma^2$ almost surely.
Proof Without loss of generality assume that the first B out of m blocks are contaminated by $K_1,\dots,K_B$ jumps, respectively. Let $S^2_{j,0}$ denote the empirical variance of the uncontaminated data in block j, while $S^2_{j,h}$ is the empirical variance when $K_j$ level shifts are present. Moreover, $Y_{j,1},\dots,Y_{j,n}$ are the observations in the jth block, $\mu_{j,t} = E(Y_{j,t})$ and $\bar\mu_j = \frac{1}{n}\sum_{t=1}^{n} E(Y_{j,t})$. Then we have

(3) $\hat\sigma^2_{\text{Mean}} = \frac{1}{m}\sum_{j=1}^{m} S^2_{j,0} + \frac{2}{m(n-1)}\sum_{j=1}^{B}\sum_{t=1}^{n}(X_{j,t}-\bar X_j)(\mu_{j,t}-\bar\mu_j) + \frac{1}{m(n-1)}\sum_{j=1}^{B}\sum_{t=1}^{n}(\mu_{j,t}-\bar\mu_j)^2.$

For the second term in the last Eq. (3), we have almost surely

(4) $\left|\frac{2}{m(n-1)}\sum_{j=1}^{B}\sum_{t=1}^{n}(X_{j,t}-\bar X_j)(\mu_{j,t}-\bar\mu_j)\right| \le \frac{2Bn}{m(n-1)} \max_{j,t}|\mu_{j,t}-\bar\mu_j| \cdot \frac{1}{Bn}\sum_{j=1}^{B}\sum_{t=1}^{n}|X_{j,t}-\bar X_j|.$

The average on the right-hand side of (4) is a random variable with finite moments if n and B are fixed. In the case of $B \to \infty$ and $n \to \infty$ this term converges to $E|X_1|$ almost surely due to Theorem 2 of Hu et al. (1989) and the condition $E(|X_1|^4) < \infty$. Since the factor $\frac{Bn}{m(n-1)}\max_{j,t}|\mu_{j,t}-\bar\mu_j|$ vanishes under the assumed restriction on the number and heights of the jumps, the second term of (3) converges to zero almost surely. The third term in (3) is deterministic, being $\frac{1}{m}$ times the sum of the empirical variances of the expected values in the contaminated blocks, and converges to zero for the same reason. The first term of (3) converges almost surely to $\sigma^2$ due to the results on triangular arrays in Theorem 2 of Hu et al. (1989), assuming that the condition $E(|X_1|^4) < \infty$ holds, since the $S^2_{j,0} - E S^2_{j,0}$ satisfy $P(|S^2_{j,0} - E S^2_{j,0}| > t) \to 0$ for all $t > 0$ due to Chebyshev's inequality and $\operatorname{Var}(S^2_{j,0}) \to 0$. Application of Slutsky's Theorem proves the result. ◻

Remark 2
1. If the jump heights are bounded by a constant $h \ge h_k$, $k = 1,\dots,K$, the strongest restriction arises if all heights equal this upper bound, resulting in the constraint $K = o(m^{1/3})$.

2. By the Central Limit Theorem, the estimator $\hat\sigma^2_{\text{Mean}}$ is asymptotically normal if no level shifts are present and the block size n is fixed. Its asymptotic efficiency relative to the ordinary sample variance in case of i.i.d. data with finite fourth moments is $(\nu_4 - \sigma^4)\big/\big(\nu_4 - \frac{n-3}{n-1}\sigma^4\big)$, where $\operatorname{Var}(S_1^2) = \frac{1}{n}\big(\nu_4 - \frac{n-3}{n-1}\sigma^4\big)$ [see Angelova (2012) for the variance $\operatorname{Var}(S_1^2)$ of the sample variance]. E.g., under normality the efficiency of the blocks-estimator with fixed n is $(n-1)n^{-1} < 1$.
3. The asymptotic efficiency of the blocks-estimator relative to the sample variance is 1 if $n \to \infty$.
The next theorem shows that $\hat\sigma^2_{\text{Mean}}$ is asymptotically not only normal but even fully efficient in case of a growing block size.

Theorem 3 Let the assumptions of Theorem 1 hold with $n \to \infty$, $m \to \infty$ and $m = o(n)$. Then $\sqrt{N}(\hat\sigma^2_{\text{Mean}} - \sigma^2)$ is asymptotically normal with variance $\nu_4 - \sigma^4$.

Proof Decompose $\sqrt{N}(\hat\sigma^2_{\text{Mean}} - \sigma^2)$ as in the proof of Theorem 1 into a main term and a remainder term stemming from the contaminated blocks, denoted (5) and (6), respectively. Convergence of the term (6) in mean implies convergence in probability to zero. Application of the Central Limit Theorem to the remaining terms of (5) yields the desired result. ◻

Remark 4
In the proof of Theorem 3, we have assumed that $m = o(n)$, i.e., the block size grows faster than the number of blocks. This condition can be dropped using the Lyapunov condition under the assumption of finite eighth moments, as will be shown in the following. We consider the triangular array whose ith row consists of the blockwise sample variances $S_1^2,\dots,S_{m_i}^2$, and let $s_i^2$ denote the variance of the ith row sum. The Lyapunov condition [see Corollary 1.9.3 in Serfling (1980)] is the following:

$\lim_{i\to\infty} \frac{1}{s_i^{2+\delta}} \sum_{j=1}^{m_i} E\left|S_j^2 - E S_j^2\right|^{2+\delta} = 0.$

With $\delta = 2$ and existing moments $\nu_1,\dots,\nu_8$ of $X_1$ we get $E(S_j^2 - E S_j^2)^4 = g(\nu_1,\dots,\nu_8)/n^2$, where g is a function of the existing moments $\nu_1,\dots,\nu_8$ with $g(\nu_1,\dots,\nu_8) = O(1)$. See Angelova (2012) for the fourth central moment of the sample variance. Therefore, the condition $m = o(n)$ can be dropped.

Choice of the block size
When choosing blocks of length n = 2, the estimator $\hat\sigma^2_{\text{Mean}}$ results in a difference-based estimator which considers $\lfloor N/2 \rfloor$ consecutive non-overlapping differences:

$\hat\sigma^2_{\text{Mean}} = \frac{1}{2\lfloor N/2\rfloor}\sum_{i=1}^{\lfloor N/2\rfloor}\left(Y_{2i} - Y_{2i-1}\right)^2.$

Difference-based estimators have been considered in many papers, see Von Neumann et al. (1941), Rice (1984), Gasser et al. (1986), Hall et al. (1990), Dette et al. (1998), Munk et al. (2005), Tong et al. (2013), among many others. Dai and Tong (2014) discussed estimation approaches based on differences in the nonparametric regression context, Wang et al. (2017) considered an estimation technique which involves differences of second order, while Tecuapetla-Gómez and Munk (2017) proposed a difference-based estimator for m-dependent data. An ordinary difference-based estimator of first order, which considers all N − 1 consecutive differences, is [see, e.g., Von Neumann et al. (1941)]

(7) $\hat\sigma^2_{\text{Diff}} = \frac{1}{2(N-1)}\sum_{t=2}^{N}\left(Y_t - Y_{t-1}\right)^2.$

Theorems 1.5 and 1.6 in Seber and Lee (2012) can be used to calculate the expectation and the variance of the estimator $\hat\sigma^2_{\text{Diff}}$ as functions of $\mu = E(Y)$, the moments $\nu_i = E(X_1^i)$ and the vector a of the diagonal elements of A. When no changes in the mean are present, both the difference-based and the averaging estimator are unbiased. For example, for N = 100 and a block size n = 10, we get $\operatorname{Var}(\hat\sigma^2_{\text{Mean}}) = 0.0222$ under the standard normal distribution, while $\operatorname{Var}(\hat\sigma^2_{\text{Diff}}) = 0.0302$. Therefore, when no changes in the mean are present the blocks-estimator can have smaller variance than the difference-based estimator. In Sect. 4 we compare the performance of the proposed estimation procedures with that of the difference-based estimator (7) in different change-point scenarios.
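The first-order difference-based estimator (7) can be sketched analogously (helper name ours). Note that a single level shift of height h only enters one difference and thus adds about $h^2/(2(N-1))$ to the estimate in expectation:

```python
import numpy as np

def diff_variance(y):
    """First-order difference-based estimator using all N-1 consecutive differences."""
    y = np.asarray(y, dtype=float)
    d = np.diff(y)                      # Y_t - Y_{t-1}, t = 2, ..., N
    return np.sum(d**2) / (2 * (len(y) - 1))
```

For i.i.d. data the squared differences have expectation $2\sigma^2$, so the estimator is unbiased; a level shift of height 10 in a series of length 100,000 perturbs it by only about 0.0005.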
In the following, we investigate the proper choice of the block size for the estimator $\hat\sigma^2_{\text{Mean}}$. All calculations in this paper have been performed with the statistical software R, version 3.5.2, R Core Team (2018).
For known jump positions the MSE of the blocks-variance estimator $\hat\sigma^2_{\text{Mean}}$ can be determined analytically. The position of a jump is relevant for the performance of this approach. Therefore, it is reasonable to consider different positions of the K jumps to get an overall assessment of the performance of the blocks-estimator. For every K ∈ {1, 3, 5}, we generate K jumps of equal heights $h = \lambda \cdot \sigma$, with $\lambda \in \{0, 0.1, 0.2, \dots, 4.9, 5\}$, at positions sampled randomly without replacement from a uniform distribution on the values $\max_n(N - \lfloor N/n\rfloor n) + 1, \dots, N - \max_n(N - \lfloor N/n\rfloor n)$, and calculate the MSE for every reasonable block size $n \in \{2, 3, 4, \dots, \lfloor N/2\rfloor\}$. This is repeated 1000 times, leading to 1000 MSE values for every h and n based on different jump positions. The average of these MSE values is taken for each h and n. Data are generated from the standard normal or the $t_5$-distribution. Panel (a) of Fig. 1 shows the block size $n_{\text{opt}}$ which yields the least theoretical MSE value of the estimator $\hat\sigma^2_{\text{Mean}}$ depending on the jump height $h = \lambda\cdot\sigma$ with K ∈ {1, 3, 5} jumps for N = 1000 observations and normal distribution. We observe that $n_{\text{opt}}$ decreases as the jump height grows. Blocks of size 2 (resulting in a non-overlapping difference-based estimator) are preferred when h ≈ 4 and K = 5, while larger blocks lead to better results in case of smaller or fewer jumps. Panel (b) of Fig. 1 depicts the MSE of $\hat\sigma^2_{\text{Mean}}$ for the respective MSE-optimal block size $n_{\text{opt}}$. In all three scenarios the MSE increases if the jump height gets larger. Different values of the optimal block size $n_{\text{opt}}$ are obtained in different scenarios, i.e., for different K and $h = \lambda\cdot\sigma$. As the true number and height of the jumps in the mean are usually not known in practice, we wish to choose a block size which yields good results in many scenarios. We do not consider very high jumps any further, since they can be detected easily and are thus not very interesting.
The square root of the sample size N has proven to be a good choice for the block size in many applications, see e.g. Rooch et al. (2019). Moreover, in the upper panel of Fig. 1, we observe that smaller block sizes n are preferred when the number of change-points is high. If the estimation of the variance is the focus of the application, we suggest choosing the block size depending on K as

(8) $n = \left\lfloor \sqrt{N}/(K+1) \right\rfloor,$

which gets smaller if the number of jumps increases, resulting in many jump-free and only few contaminated blocks. Otherwise, if testing is of interest in view of Theorem 3, we suggest a block size which grows slightly faster than $\sqrt{N}$.
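The block-size rule is straightforward to implement; in this sketch (helper name ours) the result is clipped at the smallest admissible block size 2:

```python
import math

def block_size(N, K):
    """Block size n = floor(sqrt(N) / (K + 1)), clipped at the smallest
    admissible size 2 (the clipping is our own addition)."""
    return max(2, math.floor(math.sqrt(N) / (K + 1)))
```

For N = 114 and K = 2 this gives n = 3, the value used in the Nile data application in Sect. 5.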

Remark 5
For large N we get $m = N/n = \sqrt{N}(K+1)$. In this case the number of jumps needs to satisfy $K = o(m^{1/3}) = o(N^{1/6}K^{1/3})$, i.e., $K = o(N^{1/4})$ can be tolerated, see Remark 2. An even larger rate of shifts can be tolerated by choosing $m = N/c$ for some constant c, i.e., a fixed block length n. However, this reduces the efficiency of the resulting estimator in the presence of a small rate of shifts, see Fig. 1. For example, for the standard normal distribution, the asymptotic relative efficiency of the blocks-estimator $\hat\sigma^2_{\text{Mean}}$ with fixed n is $(n-1)n^{-1} < 1$ and that of the difference-based estimator [as defined in (7)] is 2/3 [see e.g. Seber and Lee (2012)].
Panel (c) of Fig. 1 shows the MSE of the estimator $\hat\sigma^2_{\text{Mean}}$ with the block size n chosen according to (8). For K ∈ {3, 5} there is only a moderate loss of performance when choosing n according to (8) instead of $n_{\text{opt}}$, which depends on the number of jumps K and the height $h = \lambda\cdot\sigma$. In case of K = 1 change in the mean the performance of the blocks-estimator worsens slightly. Table 1 shows the average MSE of the ordinary sample variance for normally distributed data and different values of K and h, with N = 1000. Again, 1000 simulation runs are performed where jumps are added to the generated data at randomly chosen positions. We observe that the MSE becomes very large when the number and height of the level shifts increase. Obviously, the blocks-estimator $\hat\sigma^2_{\text{Mean}}$ performs much better than the sample variance.
The results for data from the t 5 -distribution are similar to those obtained for the normal distribution, see Fig. 7 in "Appendix." Again, the blocks-estimator with the block size (8) [Panel (c)] performs well and does not lose too much performance compared to Panel (b) of Fig. 7, where the optimal block size is considered. Similar results are obtained for N = 2500 , see Figs. 9 and 10 in "Appendix".
As the number of change-points K is not known exactly in real applications, there are several possibilities to set K in formula (8):
1. Use a large guess on the value of K.
2. Pre-estimate K with an appropriate procedure.
3. Use prior knowledge about plausible values of K.
We will discuss the first two approaches in the following two subsections.

Using a large guess on the number of jumps
If many change-points are present, a small block size should be chosen, while larger blocks are preferred in the case of only a few level shifts. If a practitioner does not have knowledge about the number of jumps in the mean, we recommend choosing a rather high value of K in formula (8), which results in small blocks. Doing so, we are usually on the safe side: choosing too few blocks can result in a very high MSE, while the performance of the estimator does not worsen as much when choosing many blocks.
As an example we generate 1000 time series of length N = 1000 with K = 3 jumps at random positions. Figure 2 shows the MSE of $\hat\sigma^2_{\text{Mean}}$ depending on the jump height $h = \lambda\sigma$, $\lambda \in \{0, 0.1, \dots, 5\}$, when choosing $n = \lfloor\sqrt{N}/(K+1)\rfloor$ with guessed values K ∈ {0, 1, …, 6}. We observe that choosing too small a number of blocks (i.e., too large a block size) results in large MSE values if the jumps are rather high. On the other hand, the results do not worsen as much when choosing unnecessarily many and thus small blocks. Figure 8 in the "Appendix" shows similar results for K = 5 jumps.
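This safe-side effect can be reproduced with a small Monte Carlo experiment; the following is a sketch under our own conventions (true K = 3 jumps of height 3σ, block size from rule (8) with a guessed K, standard normal noise):

```python
import numpy as np

def blocks_var(y, n):
    """Average of blockwise sample variances (blocks of size n)."""
    m = len(y) // n
    return y[:m * n].reshape(m, n).var(axis=1, ddof=1).mean()

def mse_for_guess(K_guess, K_true=3, h=3.0, N=1000, reps=300, seed=0):
    """Simulated MSE of the blocks-estimator when the block size is chosen
    from rule (8) with a guessed K while the data contain K_true jumps."""
    rng = np.random.default_rng(seed)
    n = max(2, int(np.sqrt(N)) // (K_guess + 1))
    errs = []
    for _ in range(reps):
        y = rng.standard_normal(N)
        jump_times = rng.choice(np.arange(1, N), size=K_true, replace=False)
        for t_k in jump_times:
            y[t_k:] += h          # level shift of height h * sigma
        errs.append((blocks_var(y, n) - 1.0) ** 2)
    return float(np.mean(errs))
```

Guessing K too large (many small blocks) is markedly less harmful here than guessing it too small, in line with Fig. 2.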
Choosing n = 2 results in a non-overlapping difference-based estimator which also performs well but loses efficiency compared to the blocks-estimator with growing block size, which can be fully efficient, see Remarks 2 and 5.
In the following investigation we will not consider the blocks-estimator with the block size n based on some large guessed value of K instead of the true one, since such a guess is a subjective choice of the practitioner.

Pre-estimation of the number of jumps
We will investigate the MOSUM procedure for the detection of multiple change-points proposed by Eichinger and Kirch (2018) to pre-estimate the number of change-points K. According to the simulations in the aforementioned paper, this procedure yields very good results in comparison with many other procedures for change-point estimation. We will describe the procedure briefly in the following. At time point t a statistic

$T_{t,N} = \frac{1}{\sqrt{2G}\,\hat\sigma_{t,N}}\left|\sum_{i=t+1}^{t+G} Y_i - \sum_{i=t-G+1}^{t} Y_i\right|$

is calculated, which compares the observations in windows of length G before and after t. In a window of length 2G, the location t is treated as a possible change-point location; $\hat\sigma^2_{t,N}$ is a local estimator of the variance at location t, for which a mean correction is performed before and after the time point t. A change is signalled when $T_{t,N}$ exceeds a critical value depending on the bandwidth parameter G and a sequence $\alpha_N \to 0$. The corresponding estimated change-point locations $\hat t_1, \dots, \hat t_{\hat K}$ are the local maximizers of the statistic within the regions where it exceeds the critical value. More information on the procedure can be found in Eichinger and Kirch (2018).
We will use the MOSUM procedure to estimate the number of change-points K in the formula (8) for the block size n. The corresponding blocks-estimator is denoted as $\hat\sigma^{2,\text{mosum}}_{\text{Mean}}$. In Sect. 4 we will see that the performance of the two blocks-estimators $\hat\sigma^2_{\text{Mean}}$ [as defined in (2)] and $\hat\sigma^{2,\text{mosum}}_{\text{Mean}}$ is similar in many cases. Moreover, we introduce another estimation procedure which is fully based on the MOSUM method, for comparison. We divide the data into $\hat K + 1$ blocks at the estimated locations $\hat t_1,\dots,\hat t_{\hat K}$ of the level shifts. In every block $j = 1,\dots,\hat K + 1$, the sample variance $S_j^2$ of those values can be computed, and a weighted average of the blockwise estimates is used to estimate the variance, i.e.,

$\hat\sigma^{2,\text{mosum}}_{\text{W}} = \sum_{j=1}^{\hat K+1} \frac{n_j}{N}\, S_j^2,$

where $n_j$ is the size of block $j = 1,\dots,\hat K+1$.

Extension to short-range dependent data
Since many real datasets exhibit autocorrelation, we investigate the averaging estimation procedure under dependence. We consider a strictly stationary linear process with mean zero, finite variance and an absolutely summable autocovariance function $\gamma$.

Theorem 6 Consider a strictly stationary linear process $(X_t)_{t\ge1}$ with an absolutely summable autocovariance function $\gamma$ and finite fourth moments, observed under model (1) with $n \to \infty$ and $m \to \infty$. Then $\hat\sigma^2_{\text{Mean}} \to \sigma^2$ in probability.

Proof Without loss of generality assume that the first B ≤ K out of m blocks are perturbed by the K jumps in the mean. Furthermore, let $\mu = E(X_{j,t})$ for all j, t, and $\mu_{j,t} = E(Y_{j,t})$. Let $S^2_{j,0}$ denote the empirical variance of the uncontaminated data $X_{j,1},\dots,X_{j,n}$ in block $j = 1,\dots,m$, while $S^2_{j,h}$ is the empirical variance in the perturbed block $j = 1,\dots,B$. Then $\hat\sigma^2_{\text{Mean}}$ can be decomposed into three terms $A_{1,N}$, $A_{2,N}$ and $A_{3,N}$, denoted (10), analogously to the proof of Theorem 1. For the first term $A_{1,N}$ of (10), we split off the deviation of the blockwise means in (11); the first part of (11) converges to zero, while the second part converges to zero in quadratic mean, and hence in probability, due to the absolute summability of the autocovariance function $\gamma$. Altogether, the sum (11) converges to zero and $A_{1,N}$ converges to $\sigma^2$ in probability.

The second and the third terms $A_{2,N}$ and $A_{3,N}$ of (10) converge to zero due to the following considerations: due to the absolute summability of the autocovariance function $\gamma$ and the condition on the number and heights of the jumps, the term $A_{2,N}$ converges to zero in rth mean with r = 2, which implies convergence in probability. The argument in the case of $A_{3,N}$ is analogous. ◻

Remark 7
1. If the jump heights are bounded by a constant $h \ge h_k$, $k = 1,\dots,K$, the strongest restriction arises if all heights equal this upper bound, resulting in a constraint on the growth of the number of jumps analogous to that in Remark 2.

Trimmed estimator of the variance
So far we have considered cases where the number of changes in the mean K is rather small with respect to the number of the blocks m and thus to the number of observations N. However, there might be situations in which level shifts occur frequently. Asymptotically, the blocks-estimator ̂ 2 Mean [see (2)] is still a good choice for the estimation of the variance as long as the number of level shifts grows slowly, see e.g. Theorem 1. However, if many jumps are present in a finite sample the blocks-estimator is no longer a good choice and will become strongly biased. We propose an asymmetric trimmed mean of the blockwise estimates instead of their ordinary average, i.e., large estimates are removed and the average value of the remaining ones is calculated. We do not consider a symmetric trimmed mean, since the sample variance is positively biased in the presence of level shifts, so that estimates from blocks containing a level shift are expected to show up as upper outliers. Moreover, we suggest using rather many small blocks to account for potentially many level shifts. In the next Sects. 3.1 and 3.2, the choice of the trimming fraction is discussed.

Estimation with a fixed trimming fraction
The trimmed blocks-estimator is given as

(12) $\hat\sigma^2_{\text{Tr},\alpha} = C_{N,\text{Tr},\alpha}\, \frac{1}{\lfloor m(1-\alpha)\rfloor} \sum_{j=1}^{\lfloor m(1-\alpha)\rfloor} S^2_{(j)},$

where $S^2_{(1)} \le \dots \le S^2_{(m)}$ are the ordered blockwise estimates, $\alpha$ is the trimming fraction, m is the number of blocks and $C_{N,\text{Tr},\alpha}$ is a sample and distribution dependent correction factor to ensure unbiasedness in the absence of level shifts. In practice this constant can be simulated under the assumption of observing data from a known location-scale family. For example, for the standard normal distribution, $\alpha = 0.2$, N = 1000 and n = 20 (m = 50) we generate 1000 samples of length N = 1000 and calculate the average of the uncorrected trimmed variance estimates. The reciprocal of this average value yields $C_{1000,\text{Tr},0.2} = 1.198$.
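A sketch of the trimmed estimator together with a Monte Carlo approximation of the correction factor (helper names ours; upper trimming of the largest α-fraction of blockwise variances, standard normal reference distribution as in the example above):

```python
import numpy as np

def trimmed_blocks_variance(y, n, alpha, correction=1.0):
    """Upper-trimmed mean of blockwise sample variances.

    The largest alpha-fraction of the blockwise variances is discarded before
    averaging; `correction` plays the role of the factor C_{N,Tr,alpha}.
    """
    y = np.asarray(y, dtype=float)
    m = len(y) // n
    s2 = np.sort(y[:m * n].reshape(m, n).var(axis=1, ddof=1))
    keep = int(np.floor(m * (1 - alpha)))   # number of retained blocks
    return correction * s2[:keep].mean()

def simulate_correction(N, n, alpha, reps=1000, rng=None):
    """Monte Carlo correction factor under standard normal data without shifts."""
    rng = np.random.default_rng(rng)
    vals = [trimmed_blocks_variance(rng.standard_normal(N), n, alpha)
            for _ in range(reps)]
    return 1.0 / np.mean(vals)
```

For N = 1000, n = 20 and α = 0.2 the simulated factor comes out close to the value 1.198 reported above.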
As an example, we generate 1000 time series of length N ∈ {1000, 2500} from the normal and the $t_5$-distribution. We add $K = N \cdot p$ jumps to the generated data at randomly chosen positions, as was done in Sect. 2.2, with p ∈ {0, 2/1000, 4/1000, 6/1000, 10/1000} and heights h ∈ {0, 2, 3, 5}. We choose n = 20 to ensure that the number of jump-contaminated blocks is sufficiently smaller than the total number of blocks. Table 2 shows the simulated MSE of the trimmed estimator (12) for $\alpha \in \{0.1, 0.3, 0.5\}$. Clearly, the performance of the trimmed estimator depends on the number of jumps in the mean and the trimming parameter $\alpha$. Larger values of $\alpha$ are required when dealing with many jumps but lead to an increased MSE if there are only a few jumps. Therefore, it is reasonable to choose $\alpha$ adaptively, as will be described in Sect. 3.2.

Adaptive choice of the trimming fraction
Instead of using a fixed trimming fraction, we can choose $\alpha$ adaptively, yielding the adaptive trimmed estimator $\hat\sigma^2_{\text{Tr,ad}}$, defined as in (12) with $\alpha$ replaced by $\alpha_{\text{adapt}}$, where $S^2_{(1)} \le \dots \le S^2_{(m)}$ are the ordered blockwise sample variances and $\alpha_{\text{adapt}}$ is the adaptively chosen percentage of the blocks-estimates which will be removed.

Estimation under normality
We use the approach for outlier detection discussed in Davies and Gather (1993) to determine $\alpha_{\text{adapt}}$, assuming that the underlying distribution is normal. In this case, the distribution of the sample variance is well known, i.e., in block j we have that $(n-1)S_j^2/\sigma^2 \sim \chi^2_{n-1}$. Since the true variance $\sigma^2$ is not known, we propose to replace $\sigma^2$ by an appropriate initial estimate, such as the median of the blocks-estimates, i.e.,

$\hat\sigma^2_{\text{Med}} = C_{N,\text{Med}} \cdot \operatorname{med}(S_1^2,\dots,S_m^2),$

where $C_{N,\text{Med}}$ is a finite sample correction factor. Subsequently, we remove those values $T_j = (n-1)S_j^2/\hat\sigma^2_{\text{Med}}$ which exceed $q_{\chi^2_{n-1},1-\alpha_m}$, the $(1-\alpha_m)$-quantile of the $\chi^2_{n-1}$-distribution, with $\alpha_m = 1 - (1-\alpha)^{1/m}$ and $\alpha \in (0,1)$. We will refer to the adaptively trimmed estimator based on the approach of Davies and Gather (1993) under normality as $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$. The choice of $\alpha_m$ results in a probability of $1-\alpha$ that no block in our setting is falsely trimmed in the absence of level shifts. Furthermore, we expect that roughly $m \cdot \alpha_m$ blocks are trimmed on average in the absence of level shifts. The following simulations suggest that the adaptive trimming fraction is slightly larger than $\alpha_m$, which can be explained by the fact that we need to use an estimate such as $\hat\sigma^2_{\text{Med}}$ instead of the unknown $\sigma^2$. We generated 10,000 sequences of observations of size N ∈ {1000, 2500} for $\alpha \in \{0.05, 0.1\}$. Table 3 shows the average number of trimmed blocks in the absence of level shifts.
We suggest choosing a small block size, e.g. n = 20, to cope with a possibly large number of change-points. In this way, it is ensured that the number of uncontaminated blocks is much larger than the number of perturbed blocks. Moreover, we choose $\alpha = 0.05$.
The correction factor in (13) needs to be simulated taking into account that the percentage of the omitted blocks-estimates is no longer fixed. Therefore, for given N and $\alpha$ we generate 1000 sequences of observations. In each simulation run we calculate the blocks-estimates $S_j^2$, $j = 1,\dots,m$, and the initial estimate of the variance $\hat\sigma^2_{\text{Med}}$. Subsequently, we remove the values $T_j = (n-1)S_j^2/\hat\sigma^2_{\text{Med}}$ which exceed the quantile $q_{\chi^2_{n-1},1-\alpha_m}$. Then the average value of the remaining blocks-estimates is computed. The procedure yields 1000 estimates. The correction factor is the reciprocal of the average of these values. For N = 1000 and $\alpha = 0.05$ the simulated correction factor is $C^{0.05}_{1000,\text{Tr,ad}} = 1.0020$, while for N = 2500 we have $C^{0.05}_{2500,\text{Tr,ad}} = 1.0009$, so both are nearly 1 and could be neglected with little loss.
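The adaptive procedure under normality can be sketched as follows (helper names ours; the χ² quantile is approximated via the Wilson–Hilferty transformation to stay dependency-free, and the correction factors, being close to 1, are omitted):

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile function."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * np.sqrt(c)) ** 3

def adaptive_trimmed_variance(y, n, alpha=0.05):
    """Adaptively trimmed blocks-estimator under an assumed normal distribution.

    Blocks whose statistic T_j = (n-1) S_j^2 / sigma2_med exceeds the
    (1 - alpha_m)-quantile of chi^2_{n-1} are discarded; the finite-sample
    correction factors C_{N,Med} and C_{N,Tr,ad} are omitted in this sketch.
    """
    y = np.asarray(y, dtype=float)
    m = len(y) // n
    s2 = y[:m * n].reshape(m, n).var(axis=1, ddof=1)
    sigma2_med = np.median(s2)                    # robust initial estimate
    alpha_m = 1.0 - (1.0 - alpha) ** (1.0 / m)    # per-block level
    cutoff = chi2_quantile(1.0 - alpha_m, n - 1)
    t = (n - 1) * s2 / sigma2_med
    return s2[t <= cutoff].mean()
```

With a few large level shifts, the contaminated blocks produce clearly outlying blockwise variances and are trimmed, while almost no clean blocks are lost.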

Estimation under unknown distribution
When no distributional assumptions are made and the block size n is large, one can use that $\sqrt{n}(S_j^2 - \sigma^2)$ is asymptotically normal with variance $\nu_4 - \sigma^4$, so that the studentized values $T_j = \sqrt{n}(S_j^2 - \hat\sigma^2_{\text{Med}})\big/(\hat\nu_{4,\text{Med}} - \hat\sigma^4_{\text{Med}})^{1/2}$, with $\hat\nu_{4,\text{Med}}$ a median-based estimate of the fourth moment, can be compared with the $(1-\alpha_m)$-quantile of the standard normal distribution. We denote the resulting adaptively trimmed estimator by $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$. Table 4 shows the average number of trimmed blocks in the absence of level shifts for normally distributed data, analogously to Table 3. We observe that the average number of trimmed blocks-estimates is much larger than for the trimming procedure which is based on the normality assumption.

Remark 8
We do not use distribution dependent correction factors for the estimators $\hat\nu_{4,\text{Med}}$ and $\hat\sigma^2_{\text{Med}}$, since the underlying distribution is not known. The asymptotic distribution of the blockwise empirical variances is normal and therefore, for large block sizes, their distribution is expected to be roughly symmetric. A correction factor can then be neglected with little loss, since the sample median of symmetrically distributed random variables estimates their expected value.

Choice of the trimming fraction based on the MOSUM procedure
In Sect. 2.4 we have used the MOSUM procedure proposed by Eichinger and Kirch (2018) to estimate the unknown number of jumps K. The corresponding estimate $\hat K$ can also be used to determine the trimming fraction of the trimmed estimator in (12), i.e., we can set $\alpha = \hat K/m$. We will denote the corresponding estimator as $\hat\sigma^{2,\text{mos}}_{\text{Tr,ad}}$. We compare the average number of trimmed blocks for the three estimators $\hat\sigma^{2,\text{mos}}_{\text{Tr,ad}}$, $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ in a simulation study. We generate 1000 sequences of observations of length N = 1000 for different values of K and h. To every sequence we add K jumps of height h at randomly chosen positions, as was done in Sect. 2.2. Table 5 shows the average number of trimmed blocks for data sampled from the standard normal distribution and the corresponding MSE in different scenarios. For small jumps, the adaptively trimmed estimators trim fewer blocks than the MOSUM-based procedure, and the performance of $\hat\sigma^2_{\text{Mean}}$ is similar to that of $\hat\sigma^{2,\text{mos}}_{\text{Tr,ad}}$ in that case. As opposed to this, the three trimmed estimators trim similarly many blocks on average if the jumps are large, with $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ leading to smaller MSE values than $\hat\sigma^{2,\text{mos}}_{\text{Tr,ad}}$. This can be explained by the fact that the former methods base the decision whether to trim a block or not directly on the criterion whether the empirical variance in a block is unusual or not. The averaging estimator $\hat\sigma^2_{\text{Mean}}$ is outperformed by the three trimmed estimators when dealing with large jumps.
For the heavy-tailed $t_5$-distribution, the results are found in Table 6. The MOSUM-based procedure yields slightly better results for large jump heights, i.e., h > 2, since the adaptively trimmed estimators overestimate the number of contaminated blocks in many cases. For small jumps, the adaptively trimmed procedures yield smaller MSE values. The three trimmed estimators perform better than $\hat\sigma^2_{\text{Mean}}$ in every case. The results for the data generated by the AR(1) model with $\phi = 0.5$ are displayed in Table 7. The MOSUM-based procedure trims considerably more blocks than the adaptively trimmed estimators $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$, yielding higher MSE values. Moreover, the averaging estimator $\hat\sigma^2_{\text{Mean}}$ is outperformed by the adaptively trimmed estimators $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ in every case. We conclude that the estimators $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ perform better when dealing with large jumps and normally distributed data. For small jumps, fewer contaminated blocks are trimmed by the adaptively trimmed estimators, resulting in higher MSE values in many scenarios; in this case the MOSUM-based estimator $\hat\sigma^{2,\text{mos}}_{\text{Tr,ad}}$ yields the best results, which are similar to those of $\hat\sigma^2_{\text{Mean}}$. For heavy-tailed data, the MOSUM-based procedure is preferable under large jumps, while under small jumps and under dependence the adaptively trimmed estimators $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ perform best. Therefore, the choice of the most appropriate estimator depends on the knowledge about the underlying distribution and the height of the jumps.
In the simulation study in Sect. 4, we concentrate on the adaptively trimmed estimators $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$, since they yield very good results in many scenarios, and on the averaging estimator $\hat\sigma^2_{\text{Mean}}$, since it performs well when dealing with a few small jumps and independent data.

Simulations
In this section we compare the estimators $\hat\sigma^2_{\text{Mean}}$, $\hat\sigma^2_{\text{Diff}}$, $\hat\sigma^2_{\text{Tr},0.5}$ with block size n = 20, $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ with block size n = 20 and $\alpha = 0.05$, $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ with block size n = 40 and $\alpha = 0.05$, $\hat\sigma^{2,\text{mosum}}_{\text{Mean}}$ and $\hat\sigma^{2,\text{mosum}}_{\text{W}}$ in different scenarios. We generate 1000 sequences of observations of length N ∈ {200, 1000, 2500} from the standard normal and the $t_5$-distribution. We add K jumps of heights h ∈ {0, 2, 3, 5, 8} to the data at randomly chosen positions as was done in Sect. 2.2. K is chosen dependent on the number of observations, i.e., $K = p \cdot N$ with p ∈ {0, 2/1000, 4/1000, 6/1000, 10/1000}. In the absence of level shifts, $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ trims more blocks on average than $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$, see Tables 3 and 4. Therefore, we also expect that more blocks are trimmed away by $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ if level shifts are present, reducing the risk of including perturbed blocks in the trimmed estimator $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$. The trimmed estimator $\hat\sigma^2_{\text{Tr},0.5}$ with a fixed trimming fraction also yields good results. However, this estimation procedure requires knowledge of the underlying distribution to compute the finite sample correction factor, see Sect. 3.1. The difference-based estimator $\hat\sigma^2_{\text{Diff}}$ performs well as long as the jumps are moderately high. Table 9 shows the results for the $t_5$-distribution. The estimation procedures $\hat\sigma^{2,\text{norm}}_{\text{Tr,ad}}$ and $\hat\sigma^{2,\text{other}}_{\text{Tr,ad}}$ yield the best results in this scenario. In Table 10 the simulated MSE is presented when the data are generated from the autoregressive (AR) model with $\phi = 0.5$, i.e., the data are positively correlated. The performance of the difference-based estimator worsens considerably then. This is due to the fact that this estimation procedure makes explicit use of the assumption of uncorrelatedness. While $\hat\sigma^2_{\text{Diff}}$ underestimates the true variance drastically (resulting in a high MSE value) when no changes in the mean are present, the performance seems to improve slightly when dealing with many high jumps.
This can be explained by the fact that the positive bias, which arises from the jumps, compensates for the negative bias which arises from the (incorrect) assumption of uncorrelatedness. The blocks-estimator $\hat\sigma^2_{\text{Mean}}$ exhibits a similar behaviour, since the block size is small when the number of jumps is high, while correlated data require large block sizes to ensure satisfying results. For dependent data the best results are obtained when using the adaptively trimmed estimators. Since the variance $\sigma^2$ is underestimated when the data are dependent, the values $T_j$ in (15) and (17) get larger, resulting in a higher trimming parameter $\alpha_{\text{adapt}}$. Therefore, more blocks are trimmed away, ensuring that the perturbed ones are not involved in the calculation of the overall estimate. Figure 3 shows the simulated MSE dependent on the AR-parameter $\phi \in \{0.1, \dots, 0.8\}$ in four different scenarios: no jumps, few small jumps, many small jumps, and many high jumps. We observe that the performance of all estimators worsens with increasing correlation, most notably that of $\hat\sigma^2_{\text{Mean}}$. The trimmed estimation procedures do not suffer much from the increasing strength of correlation in our example. The averaging approach $\hat\sigma^2_{\text{Mean}}$ yields good results if the dependence is rather weak and if only few small jumps are present. The performance of the weighted average $\hat\sigma^{2,\text{mosum}}_{\text{W}}$ worsens drastically when many high jumps are present. During this estimation procedure, the data are segregated into few blocks at estimated change-point locations. The approach is highly biased in the case of many and high level shifts if the number of change-points is underestimated. This is sometimes the case if two level shifts are not sufficiently far away from each other, see Eichinger and Kirch (2018).
Based on the simulation results we recommend the averaging estimation procedure σ̂²_Mean if the data are not expected to be correlated or heavy-tailed and the jump heights are rather small. Otherwise, if no information on the distribution, the dependence structure or the jump heights is available, or if the jumps are expected to be rather large, the adaptively trimmed procedure σ̂²_Tr,ad,other can be recommended.

Application
In this section, we apply the blocks-approach to two datasets in order to estimate the variance.

Nile river flow data
The first dataset contains the widely discussed Nile river flow records in Aswan from 1871 to 1984, see e.g. Hassan (1981), Hipel and McLeod (1994), and Syvitski and Saito (2007), among many others. We consider the N = 114 annual maxima of the average monthly discharge in m³/s, since these values are often assumed to be independent in hydrology. The maxima are determined from January to December; the flooding season lasts from July to September, see Hassan (1981). Figure 4 shows the annual maxima of the average monthly discharge of the Nile river for the years 1871-1984.
The construction of the two Aswan dams in 1902 and from 1960 to 1969 obviously caused changes in the river flow, see Hassan (1981) and Hipel and McLeod (1994). We used Levene's test [see Section 12.4.2 in Fox (2015)] to check the three segments of the data (divided by the years 1902 and 1960) for equality of variances. The null hypothesis of equal variances was not rejected, with a p value of 0.40.
A Q-Q plot of the data indicates that the deviation from a normal distribution is not large, see Fig. 11 in "Appendix". With α = 0.05 and n = 10 (m = 11 blocks) no blocks are trimmed away during the trimming procedures σ̂²_Tr,ad,norm and σ̂²_Tr,ad,other. The ordinary sample variance of the entire data yields the value 3,243,866, see Table 11. For the blocks-estimator of the variance from (2) we choose the block size according to (8) with K = 2, getting n = ⌊√114/3⌋ = 3. All blockwise estimators examined in this paper yield much smaller variance estimates for the whole observation period, ranging from 2,075,819 to 2,684,368.
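For illustration, the block-size computation can be reproduced as follows. The exact form of rule (8) is not restated in this section, so the sketch assumes the hypothetical form n = ⌊√N/(K+1)⌋, which is consistent with the value ⌊√114/3⌋ = 3 obtained above for K = 2:

```python
import math

def block_size(N, K):
    # assumed form of rule (8): the block size shrinks as more jumps are expected
    # (hypothetical reconstruction; floor(sqrt(N)/(K+1)) reproduces the Nile example)
    # max(2, ...) is a safeguard so each block allows a variance estimate
    return max(2, math.floor(math.sqrt(N) / (K + 1)))

print(block_size(114, 2))  # -> 3, as used for the Nile data
```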

PAMONO data
In the second example, we use data obtained from the PAMONO (Plasmon Assisted Microscopy of Nano-Size Objects) biosensor, see Siedhoff et al. (2014). This technique is used for detection of small particles, e.g. viruses, in a sample fluid. For more details, see Siedhoff et al. (2011). PAMONO data sets are sequences of grayscale images. A particle adhesion causes a sustained local intensity change. This results in an obvious level shift in the time series of grayscale values for each corresponding pixel coordinate. To the best of our knowledge, a change of the variance after a jump in the mean is not expected to occur. A Q-Q plot of the data indicates that the assumption of a normal distribution is reasonable, see Fig. 12 in "Appendix".
Since changes in the mean are not expected for the unaffected pixels, we use these data to get some insight into the typical value range of the variance. The sample variance of the contaminated data (upper panel) is 1.59 × 10⁻⁴, which is not within the typical range of values, since it exceeds the upper whisker of the boxplot. The other estimation procedures discussed in this paper yield values within the interval [1.1 × 10⁻⁴, 1.2 × 10⁻⁴], which are well within the interquartile range. We conclude that these approaches yield reasonable estimates for these data.

PAMONO data with trend
Again, we consider a PAMONO dataset, see Sect. 5.2. Panel (a) of Fig. 6 shows a time series corresponding to a pixel which seems to exhibit a virus adhesion as well as a linear trend. Panel (b) of Fig. 6 shows the differenced data, i.e., Y_t − Y_{t−1}, t = 2, …, 388.
The differences of first order appear to be independent and scattered around a fixed mean. A few large differences can be observed, which presumably originate from the jumps in the mean at the corresponding time points. The existence of the trend could be explained by the fact that the surface on which the fluid for virus adhesion is placed was heated up over time. N = 388 observations are available. We apply the estimation procedures discussed in this paper [using K ∈ {1, …, 5} in formula (8) for σ̂²_Mean] and obtain values for the variance which range from σ̂²_Mean = 0.95 × 10⁻⁶ (with K = 5) to σ̂²_mosum,W = 1.66 × 10⁻⁶. The empirical variance of the observations is 26.48 × 10⁻⁶, which is much larger than the other estimates. According to our experience the PAMONO data can be assumed to be uncorrelated after differencing. The sample variance of the differenced data is 1.93 × 10⁻⁶, which is an estimator of 2σ², yielding the value 0.97 × 10⁻⁶ as an estimate for σ², close to σ̂²_Mean. We conclude that the proposed procedures yield reasonable results even in this situation, where a linear trend is present.
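The differencing argument can be checked on synthetic data: for Y_t = a + b·t + X_t with i.i.d. noise X_t, the first differences equal b + X_t − X_{t−1}, so the linear trend drops out and the sample variance of the differences estimates 2σ². A small sketch with hypothetical parameter values (not the actual PAMONO measurements):

```python
import random
import statistics

random.seed(3)
N, slope, sigma = 388, 0.01, 1.0
# linear trend plus i.i.d. Gaussian noise, mimicking the heated-surface drift
y = [slope * t + random.gauss(0.0, sigma) for t in range(N)]
d = [y[t] - y[t - 1] for t in range(1, N)]  # differencing removes the linear trend
est = statistics.variance(d) / 2.0          # Var(X_t - X_{t-1}) = 2*sigma^2 for i.i.d. data
print(round(est, 2))
```

The halved sample variance of the differences recovers σ² = 1 up to sampling error, mirroring the comparison of 0.97 × 10⁻⁶ with σ̂²_Mean above.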

Conclusion
In the presence of level shifts, ordinary variance estimators like the empirical variance perform poorly. In this paper, we considered several estimation procedures in order to account for possible changes in the mean.
Estimation of σ² based on pairwise differences is popular in nonparametric regression and works well in the presence of level shifts and an unknown error distribution if the data are independent and the fraction of shifts is asymptotically negligible. However, we have identified scenarios where estimation based on longer blocks is to be preferred.
If only a few small level shifts are expected in a long sequence of observations our recommendation is to use the mean of the blocks-variances σ̂²_Mean. This estimation procedure does not require knowledge of the underlying distribution, performs well in the aforementioned situation and is asymptotically even as efficient as the ordinary sample variance if there are no level shifts.
If many or large level shifts are expected to occur we recommend using the adaptively trimmed estimators σ̂²_Tr,ad,norm and σ̂²_Tr,ad,other. These procedures are constructed for independent data and use either the exact χ²-distribution or the asymptotic normal distribution of the blockwise estimates, where the second and the fourth moments need to be estimated. We have found these trimming approaches to work reasonably well even under moderate autocorrelations, although many blocks are trimmed away then, presumably due to the underestimation of the unknown variance in formula (17). Therefore, when no changes in the mean are present the trimmed estimators suffer a loss of efficiency. On the other hand, we expect that many perturbed blocks are trimmed away in the presence of level shifts, reducing the bias of the estimator. The trimming approach could be extended to dependent data in future work.
In many applications we rather wish to estimate the standard deviation σ, e.g. for standardization. If only few jumps of moderate heights are expected to occur, either the average value of the blockwise standard deviations or the square root of the blocks-variance estimator σ̂²_Mean can be used. Otherwise, the square root of the trimmed estimator σ̂²_Tr,ad,other can be recommended. For a large sample size N the finite sample correction factors can be neglected with little loss, see "Appendix".
An interesting extension will be to consider situations where not only the level but also the variability of the data can change. Suitable approaches for such scenarios might be constructed by combining the ideas discussed here with those presented by Wornowizki et al. (2017), where tests for changes in variability have been investigated using blockwise approaches, assuming a constant mean. This will be an issue for future work.

Estimation by the blockwise average
We will consider the two blocks-estimators
(18) σ̂_Mean,1 = C_N,1 · (1/m) ∑_{j=1}^m S_j,
(19) σ̂_Mean,2 = C_N,2 · ((1/m) ∑_{j=1}^m S²_j)^{1/2},
where S_j denotes the sample standard deviation in the jth block and C_N,1 and C_N,2 are sample-size-dependent correction factors to ensure unbiasedness when no changes in the mean are present. The block size n can be chosen according to the rule (8). If the number of change-points is not known in practice, it can be estimated as is done in Sect. 2.4.
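Uncorrected versions of the two blocks-estimators (i.e., with C_N,1 = C_N,2 = 1) can be sketched as follows; by Jensen's inequality the mean of the blockwise standard deviations never exceeds the root of the mean blockwise variance:

```python
import math
import random
import statistics

def sd_blocks(y, n):
    """Uncorrected versions of the two blockwise estimators of the standard
    deviation: the mean of the blockwise SDs and the root mean blockwise variance."""
    m = len(y) // n
    s = [statistics.stdev(y[j * n:(j + 1) * n]) for j in range(m)]
    mean_of_sds = sum(s) / m                              # basis of sigma_hat_Mean,1
    root_mean_var = math.sqrt(sum(v * v for v in s) / m)  # basis of sigma_hat_Mean,2
    return mean_of_sds, root_mean_var

random.seed(7)
y = [random.gauss(0.0, 2.0) for _ in range(1000)]
a, b = sd_blocks(y, 20)
print(a <= b)  # Jensen: the mean of SDs never exceeds the root mean variance
```

This ordering is why the two estimators need different correction factors C_N,1 and C_N,2.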
For normally distributed data, the correction factors C_N,1 and C_N,2 can be determined analytically. To derive the correction factor C_N,1 for the estimator σ̂_Mean,1^corr, we first consider the exact distribution of the empirical variance in the presence of jumps in the mean in order to derive the distribution of σ̂_Mean,1 in (18). Given independent identically normally distributed data X_{j,1}, …, X_{j,n}, it is well known that (n−1)S²_j/σ² ∼ χ²_{n−1} in a jump-free block j with n observations. The situation is different in the presence of jumps. Without loss of generality the following lemma is expressed in terms of the first block, consisting of the observation times t = 1, …, n and containing K₁ ≤ K jumps.

Lemma 9
Assume that X₁, …, X_n ∼ N(0, σ²) are independent and Y_t = X_t + ∑_{k=1}^{K₁} h_k I_{t≥t_k} for t = 1, …, n. Then we have for S²₁ = (1/(n−1)) ∑_{t=1}^n (Y_t − Ȳ₁)² that (n−1)S²₁/σ² ∼ χ²_{n−1,λ₁}, a non-central χ² distribution with n − 1 degrees of freedom and non-centrality parameter λ₁ = ∑_{t=1}^n (δ_{1,t} − δ̄₁)²/σ², where δ_{1,t} = ∑_{k=1}^{K₁} h_k I_{t≥t_k} and δ̄₁ = (1/n) ∑_{t=1}^n δ_{1,t}.

Proof Ȳ₁ = X̄₁ + δ̄₁ and Y_t − Ȳ₁ = X_t − X̄₁ + δ_{1,t} − δ̄₁, t = 1, …, n. S²₁ and Ȳ₁ are independent, since X̄₁ and X_t − X̄₁ are independent and the remaining terms are deterministic constants. Furthermore, ∑_{t=1}^n (Y_t − δ̄₁)²/σ² ∼ χ²_{n,λ₁}, since the Y_t ∼ N(δ_{1,t}, σ²) are independent, and n(Ȳ₁ − δ̄₁)²/σ² = nX̄₁²/σ² ∼ χ²₁. Taking the moment-generating function at z ∈ ℝ of both sides of the decomposition ∑_{t=1}^n (Y_t − δ̄₁)² = (n−1)S²₁ + n(Ȳ₁ − δ̄₁)² and using the independence of S²₁ and Ȳ₁ yields the assertion. □

In the following, we assume that B ≤ K blocks are contaminated by K₁, …, K_B jumps, respectively, with ∑_{k=1}^B K_k = K. Without loss of generality assume that the jumps are contained in the first B blocks, while the last m − B > 0 blocks do not contain any jumps. The square root of a χ²_{n−1,λ_j}-distributed random variable (n−1)S²_j/σ² is χ-distributed with n − 1 degrees of freedom and non-centrality parameter √λ_j, see e.g. Miller (1964). We hence have √(n−1) S_j/σ ∼ χ_{n−1,√λ_j}, j = 1, …, m, where λ_j = 0 for the last blocks j = B + 1, …, m, i.e., √(n−1) S_j/σ ∼ χ_{n−1}. The expected value of S_j is given as
(20) E[S_j] = σ √(2/(n−1)) (Γ(n/2)/Γ((n−1)/2)) ₁F₁(−0.5, 0.5(n−1), −0.5λ_j) = σ C_{n,λ_j},
where ₁F₁(a, b, z) represents the generalized hypergeometric function, see Olver et al. (2010) for more details. When no changes in the mean are present we have λ_j = 0 for all j and therefore ₁F₁(−0.5, 0.5(n−1), −0.5λ_j) = 1. The exact finite sample correction factor is given as
C_N,1 = √((n−1)/2) Γ((n−1)/2)/Γ(n/2),
which is the reciprocal of the term C_{n,λ_j} in (20) when no level shifts are present, since ₁F₁(−0.5, 0.5(n−1), −0.5λ_j) = 1, j = 1, …, m, in this case.
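The expectation of S_j can be checked numerically: the mean of a non-central χ distribution with k degrees of freedom and non-centrality λ (in the χ² sense) is √2 Γ((k+1)/2)/Γ(k/2) · ₁F₁(−1/2, k/2, −λ/2). A sketch implementing ₁F₁ by its power series and verifying the formula by Monte Carlo (the parameter values are illustrative):

```python
import math
import random

def hyp1f1(a, b, z, terms=200):
    """Kummer's confluent hypergeometric function 1F1(a; b; z) via its power series."""
    s, term = 1.0, 1.0
    for k in range(terms):
        term *= (a + k) / (b + k) * z / (k + 1)
        s += term
    return s

def mean_noncentral_chi(k, lam):
    """E of the square root of a noncentral chi-square variable with k degrees of
    freedom and noncentrality lam, via the 1F1 formula quoted in the text."""
    return math.sqrt(2.0) * math.exp(math.lgamma((k + 1) / 2) - math.lgamma(k / 2)) \
        * hyp1f1(-0.5, k / 2, -lam / 2)

# Monte Carlo check of the formula
rng = random.Random(42)
k, lam = 19, 4.0          # e.g. n = 20 observations, moderate jump contribution
mu = math.sqrt(lam / k)   # spread the noncentrality evenly over the k components
sims = [math.sqrt(sum(rng.gauss(mu, 1.0) ** 2 for _ in range(k)))
        for _ in range(20000)]
mc = sum(sims) / len(sims)
print(abs(mc - mean_noncentral_chi(k, lam)) < 0.03)
```

Rescaling by σ/√(n−1) with k = n − 1 and λ = λ_j gives E[S_j] as in (20).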
For the second estimator (19), we have the following statements on its expectation and a suitable finite sample correction factor. σ̂_Mean,2 follows a scaled χ_{u,v} distribution with u = m(n−1) degrees of freedom and non-centrality parameter v = √(∑_{j=1}^B λ_j), since (n−1)/σ² ∑_{j=B+1}^m S²_j ∼ χ²_{(m−B)(n−1)} and (n−1)/σ² ∑_{j=1}^B S²_j ∼ χ²_{B(n−1), ∑λ_j}. Using this information we can determine the expectation of the estimator straightforwardly, see Miller (1964) and Olver et al. (2010). The correction factor C_N,2 is the reciprocal of D_{n,λ₁,…,λ_B} with λ₁ = ⋯ = λ_B = 0, i.e.,
C_N,2 = √(0.5 m(n−1)) Γ(0.5 m(n−1)) / Γ(0.5(m(n−1) + 1)).
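For λ₁ = ⋯ = λ_B = 0 the correction factor depends only on u = m(n−1) and can be evaluated stably via log-gamma; it approaches 1 as N grows. A sketch (assuming the closed form is the reciprocal of the χ-mean constant, as in the no-shift case above):

```python
import math

def C_N2(m, n):
    """Finite sample correction factor for the root-mean blockwise variance
    estimator under normality and no level shifts, with u = m(n-1) degrees
    of freedom; log-gamma avoids overflow of Gamma for large u."""
    u = m * (n - 1)
    return math.sqrt(u / 2.0) * math.exp(math.lgamma(u / 2.0) - math.lgamma((u + 1) / 2.0))

for m, n in [(5, 10), (50, 20), (500, 50)]:
    print(m, n, round(C_N2(m, n), 6))  # approaches 1 as N = m*n grows
```

The factor exceeds 1 (the uncorrected estimator underestimates σ) and decays roughly like 1 + 1/(4u), which is why the correction can be neglected for large N.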
where δ_{j,t} and λ_j are defined in the proof of Theorem 1.
Furthermore, C_N,1 → 1 and C_N,2 → 1 as N → ∞. This can be shown with Lemma 1.4A in Serfling (1980), since σ̂_Mean,1 and σ̂_Mean,2 are consistent estimators and their variances tend to zero, which implies convergence of the means and thus the above statement. Therefore, for large N and n we can neglect the correction factors and use the estimators σ̂_Mean,1 and σ̂_Mean,2 instead of σ̂_Mean,1^corr and σ̂_Mean,2^corr with block sizes n → ∞.

Trimmed estimation
When dealing with a large number of level shifts, as discussed in Sect. 3, the square root of the variance estimator σ̂²_Tr,ad from (13) can be used to estimate the standard deviation σ. For large N and n, a correction factor to ensure unbiasedness when no changes in the mean are present can be neglected. Table 12 shows the simulated finite sample correction factors for the adaptively trimmed estimation procedures for normally and t₅-distributed data as well as for the stationary AR(1)-process with normal errors and parameter φ ∈ {0.3, 0.6}, denoted by AR(0.3) and AR(0.6). We observe that the correction factors are nearly one except for strongly correlated data, i.e., the AR-process with parameter φ = 0.6.