1 Introduction

We consider a sequence of random variables \(Y_1,\ldots ,Y_N\) generated by the model

$$\begin{aligned} Y_t = X_t + \sum _{k = 1}^{K} h_k I_{t \ge t_k}. \end{aligned}$$
(1)

Most of the time we assume that \(X_1,\ldots ,X_N\) are i.i.d. random variables with \(E\left( X_t\right) =\mu \) and \({\hbox {Var}}\left( X_t\right) = \sigma ^2\), but this will be relaxed occasionally to allow for a short-range dependent strictly stationary sequence. The observed data \(y_1,\ldots ,y_N\) are affected by an unknown number K of level shifts of possibly different heights \(h_1,\ldots ,h_K\) at different time points \(t_1,\ldots ,t_K\). Our goal is the estimation of the variance \(\sigma ^2\). Without loss of generality, we will set \(\mu = 0\) in the following.

In Sect. 2 we analyse estimators of \(\sigma ^2\) from the sequence of observations \((Y_t)_{t\ge 1}\) by combining estimates obtained from splitting the data into several blocks. Without the need of explicit distributional assumptions the mean of the blockwise estimates turns out to be consistent if the size and the number of blocks increases, and the number of jumps increases slower than the number of blocks. If many jumps in the mean are expected to occur, an adaptively trimmed mean of the blockwise estimates can be used, see Sect. 3. In Sect. 4 a simulation study is conducted to assess the performance of the proposed approaches. In Sect. 5 the estimation procedures are applied to real data sets, while Sect. 6 summarizes the results of this paper.

2 Estimation of the variance by averaging

When dealing with independent identically distributed data the sample variance is the common choice for estimation of \(\sigma ^2\). However, if we are aware of a possible presence of level shifts at unknown locations, it is reasonable to divide the sample \(Y_1,\ldots ,Y_N\) into m non-overlapping blocks of size \(n = \lfloor N / m\rfloor \) and to calculate the average of the m sample variances derived from the different blocks. A similar approach has been used in Dai et al. (2015) in the context of repeated measurements data and in Rooch et al. (2019) for estimation of the Hurst parameter.

The blocks-estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) of the variance investigated here is defined as

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Mean}} = \frac{1}{m}\sum _{j = 1}^{m} S ^2_j, \end{aligned}$$
(2)

where \(S_j^2 = \frac{1}{n-1}\sum _{t = 1}^{n} (Y_{j, t} - \overline{Y}_j)^2\)\(\overline{Y}_j = \frac{1}{n}\sum _{t = 1}^{n} Y_{j,t}\) and \(Y_{j, 1},\ldots ,Y_{j, n}\) are the observations in the jth block. We are interested in finding the block size n which yields a low mean squared error (MSE) under certain assumptions.

In what follows, we will concentrate on the situation where all jump heights are positive. This is a worse scenario than having both, positive and negative jumps, since the data are more spread in the former case resulting in a larger positive bias of most scale estimators.

2.1 Asymptotic properties

We will use some algebraic rules for derivation of the expectation and the variance of quadratic forms in order to calculate the MSE of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\), see Seber and Lee (2012). Let B be the number of blocks with jumps in the mean and \(K\ge B\) the total number of jumps. The expected value and the variance of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) are given as follows:

$$\begin{aligned} E\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right)&= \sigma ^2 + \frac{1}{m(n-1)}\sum _{j = 1}^B \mu _j^{\mathrm{T}} A \mu _j ,\\ {\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right)&= \frac{1}{m}\left( \frac{\nu _4}{n} - \frac{\sigma ^4(n-3)}{n(n-1)}\right) + \frac{4\sigma ^2}{m^2(n-1)^2} \sum _{j = 1}^{B} \mu _j^{\mathrm{T}} A \mu _j, \end{aligned}$$

where \(\nu _4 = E\left( X_1^4\right) ,\, A = {\mathbb {I}}_n - \frac{1}{n}1_n1_n^{\mathrm{T}}\), \({\mathbb {I}}_n\) is the unit matrix, \(1_n = (1,\ldots ,1)^{\mathrm{T}}\) and \(\mu _j\) contains the expected values of the random variables in the perturbed block \(j=1,\ldots ,B\), i.e., \(\mu _j = (\mu _{j, 1}, \ldots , \mu _{j, n})^{\mathrm{T}} = (E(Y_{j, 1}), \ldots , E(Y_{j, n}))^{\mathrm{T}}\). The term \(\mu _j^{\mathrm{T}} A \mu _j/(n-1)\) is the empirical variance of the expected values \(E(Y_{j, 1}), \ldots , E(Y_{j, n})\) in block j. In a jump-free block, we have \(\mu _j^{\mathrm{T}} A \mu _j = 0\), since all expected values and therefore the elements of \(\mu _j\) are equal.

The blocks-estimator (2) estimates the variance consistently if the number of blocks grows sufficiently fast as is shown in Theorem 1.

Theorem 1

Let \(Y_{1},\ldots ,Y_{N}\)with \(Y_t = X_t + \sum _{k = 1}^{K} h_k I_{t \ge t_k}\)from Model (1) be segregated into m blocks of size n, where \(t_1,\ldots ,t_K\)are the time points of the jumps of size \(h_1,\ldots ,h_K\), respectively. Let B out of m blocks be contaminated by \(\widetilde{K}_1,\ldots ,\widetilde{K}_B\)jumps, respectively, with \(\sum _{j = 1}^B \widetilde{K}_j = K = K(N)\). Moreover, let \(E(|X_1|^4)<\infty \), \( K \left( \sum _{k = 1}^K h_k \right) ^2 = o\left( m \right) \)and \(m\rightarrow \infty \), whereas the block size n can be fixed or increasing as \(N\rightarrow \infty \). Then \(\widehat{\sigma ^2}_{\mathrm{Mean}} = \frac{1}{m}\sum _{j = 1}^{m} S_j^2 \rightarrow \sigma ^2\)almost surely.

Proof

Without loss of generality assume that the first B out of m blocks are contaminated by \(\widetilde{K}_1,\ldots ,\widetilde{K}_B\) jumps, respectively. Let the term \(S^2_{j, 0}\) denote the empirical variance of the uncontaminated data in block j, while \(S^2_{j, h}\) is the empirical variance when \(\widetilde{K}_j\) level shifts are present. Moreover, \(Y_{j, 1},\ldots ,Y_{j, n}\) are the observations in the jth block, \(\mu _{j,t} = E\left( Y_{j, t}\right) \) and \(\overline{\mu }_{j} = \frac{1}{n} \sum _{t = 1}^n E\left( Y_{j, t}\right) \). Then we have

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Mean}} & = \frac{1}{m}\sum _{j = 1}^{m} S^2_j = \frac{1}{m}\sum _{j = B + 1}^{m} S^2_{j, 0} + \frac{1}{m}\sum _{j = 1}^{B} S^2_{j, h}\nonumber \\ & = \frac{1}{m}\sum _{j = B + 1}^{m} S^2_{j, 0} + \frac{1}{m}\sum _{j = 1}^{B} \frac{1}{n-1}\sum _{t = 1}^{n} \left( X_{j, t} + \mu _{j,t} - \overline{X}_j - \overline{\mu }_{j} \right) ^2 \nonumber \\ & = \frac{1}{m}\sum _{j = 1}^{m} S^2_{j, 0} + \frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n (X_{j, t} - \overline{X}_j)(\mu _{j,t} - \overline{\mu }_{j}) \nonumber \\&\quad+\, \frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2. \end{aligned}$$
(3)

For the second term in the last Eq. (3), we have almost surely

$$\begin{aligned}&\left| \frac{2}{m(n-1)}\sum _{j = 1}^{B} \sum _{t = 1}^n (X_{j, t} - \overline{X}_j)(\mu _{j,t} - \overline{\mu }_{j})\right| \nonumber \\&\quad \le \frac{2}{m(n-1)}\sum _{j = 1}^{B} \sum _{t = 1}^n \left| (X_{j, t} - \overline{X}_j)\right| \left| (\mu _{j,t} - \overline{\mu }_{j})\right| \nonumber \\&\quad \le \frac{2}{m(n-1)}\sum _{j = 1}^{B} \sum _{t = 1}^n \left| (X_{j, t} - \overline{X}_j)\right| \left| \sum _{k = 1}^K h_k \right| \nonumber \\&\quad = B \left| \sum _{k = 1}^K h_k \right| \frac{2}{m} \frac{n}{n-1} \frac{1}{B} \sum _{j = 1}^{B} \frac{1}{n} \sum _{t = 1}^n \left| (X_{j, t} - \overline{X}_j)\right| \longrightarrow 0. \end{aligned}$$
(4)

The term \(\frac{1}{B} \sum _{j = 1}^{B} \frac{1}{n} \sum _{t = 1}^n \left| (X_{j, t} - \overline{X}_j)\right| \) in (4) is a random variable with finite moments if n and B are fixed. This random variable converges to the term \(E\left( \frac{1}{n}\sum _{t = 1}^n \left| (X_{j, t} - \overline{X}_j)\right| \right) \) almost surely if \(B\rightarrow \infty \). In the case of \(B\rightarrow \infty \) and \(n\rightarrow \infty \) this term converges to \( E\left( |X_1|\right) \) almost surely due to Theorem 2 of Hu et al. (1989) and the condition \(E(|X_1|^4)<\infty \), since \(S_j^2 - E\left( S_j^2\right) \) are uniformly bounded with \(P(|S_j^2 - E\left( S_j^2\right) |>t)\rightarrow 0\,\forall t\) due to Chebyshev’s inequality and \({\hbox {Var}}(S_j^2) \rightarrow 0\). Moreover, we used the fact that \(B \left| \sum _{k = 1}^K h_k \right| \le K \left| \sum _{k = 1}^K h_k \right| = o(m)\).

The following is valid for the third term in (3):

$$\begin{aligned} \frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2&\le \frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \sum _{k = 1}^K h_k \right) ^2 \\&= \frac{B}{m}\frac{n}{n - 1} \left( \sum _{k = 1}^K h_k \right) ^2 \longrightarrow 0. \end{aligned}$$

The first term of the last equation in (3) converges almost surely to \(\sigma ^2\) due to the results on triangular arrays in Theorem 2 of Hu et al. (1989), assuming that the condition \(E(|X_1|^4)<\infty \) holds, since \(S_j^2 - E\left( S_j^2\right) \) are uniformly bounded with \(P(|S_j^2 - E\left( S_j^2\right) |>t)\rightarrow 0\,\forall t\) due to Chebyshev’s inequality and \({\hbox {Var}}(S_j^2) \rightarrow 0\). Application of Slutsky’s Theorem proves the result. \(\square \)

Remark 2

  1. 1.

    If the jump heights are bounded by a constant \(h\ge h_k,\, k = 1,\ldots ,K,\) the strongest restriction arises if all heights equal this upper bound resulting in the constraint \( K \left( \sum _{k = 1}^K h_k \right) ^2 = K^3h^2 = o\left( m \right) \). Consistency is thus guaranteed if the number of blocks grows faster than \(K^3\).

  2. 2.

    By the Central Limit Theorem, the estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is asymptotically normal if no level shifts are present and the block size n is fixed. Its asymptotic efficiency relative to the ordinary sample variance is

    $$\begin{aligned} \frac{{\hbox {Var}}(S^2)}{{\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right) } = \frac{\frac{\nu _4}{N} - \frac{\sigma ^4(N-3)}{N(N-1)} }{\frac{1}{m}\left( \frac{\nu _4}{n} - \frac{\sigma ^4(n-3)}{n(n-1)}\right) } = \frac{ \nu _4 - \frac{\sigma ^4(N-3)}{ (N-1)} }{ \nu _4 - \frac{\sigma ^4(n-3)}{ (n-1)} } \overset{N \rightarrow \infty }{\longrightarrow } \frac{ \nu _4 - \sigma ^4 }{ \nu _4 - \frac{\sigma ^4(n-3)}{ (n-1)} } \end{aligned}$$

    in case of i.i.d data with finite fourth moments, where

    $$\begin{aligned} {\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right) = {\hbox {Var}}\left( \frac{1}{m} \sum _{j = 1}^m S_j^2\right) = \frac{1}{m} {\hbox {Var}}\left( S_1^2\right) = \frac{1}{m}\left( \frac{\nu _4}{n} - \frac{\sigma ^4(n-3)}{n(n-1)}\right) \end{aligned}$$

    (see Angelova (2012) for the variance \({\hbox {Var}}(S_1^2)\) of the sample variance). E.g., under normality the efficiency of the blocks estimater with fixed n is \( (n-1)n^{-1}<1.\)

  3. 3.

    The asymptotic efficiency of the blocks-estimator relative to the sample variance is 1 if \(n\rightarrow \infty \).

The next Theorem shows that \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is asymptotically not only normal but even fully efficient in case of a growing block size.

Theorem 3

Assume that the i.i.d. random variables \(Y_1,\ldots ,Y_N\)are segregated into m blocks of size n, with \(m,\, n\rightarrow \infty \)such that \(m = o\left( n\right) , n = o\left( N\right) \). Moreover, assume that \(\nu _4 = E\left( X_1^4\right) <\infty \). Then we have

$$\begin{aligned} \sqrt{N} \left( \widehat{\sigma ^2}_{\mathrm{Mean}} - \sigma ^2\right) \overset{d}{\longrightarrow } {N}\left( 0,\nu _4 - \sigma ^4\right) . \end{aligned}$$

Proof

Rewriting the estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) we get

$$\begin{aligned} \frac{\widehat{\sigma ^2}_{\mathrm{Mean}} - \sigma ^2}{\sqrt{{\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right) }}&= \frac{\frac{1}{m(n-1)}\sum \nolimits _{j = 1}^{m}\sum \nolimits _{t = 1}^{n}\left( X_{j, t} - \overline{X}_j\right) ^2 - \sigma ^2}{\sqrt{\frac{1}{m} \left( \frac{\nu _4}{n} - \frac{\sigma ^4(n-3)}{n(n-1)}\right) }} \nonumber \\&= \sqrt{N} \frac{\frac{1}{m(n-1)}\sum \nolimits _{j = 1}^{m}\sum \nolimits _{t = 1}^{n} X_{j, t}^2 - \frac{n}{m(n-1)}\sum \nolimits _{j = 1}^{m}\overline{X}_j^2 - \sigma ^2}{\sqrt{ \nu _4 - \frac{\sigma ^4(n-3)}{n-1}}}. \end{aligned}$$
(5)

For the second term of the numerator in (5), we have that

$$\begin{aligned} E\left( \left| \sqrt{N} \frac{n}{m(n-1)}\sum _{j = 1}^{m}\overline{X}_j^2\right| \right)&= \frac{\sqrt{N}}{m} \frac{n}{n-1} \sum _{j = 1}^{m} E\left( \overline{X}_j^2\right) = \sqrt{N} \frac{n}{n-1}\frac{\sigma ^2}{n}\nonumber \\&= \sqrt{mn} \frac{n}{n-1}\frac{\sigma ^2}{n} = \sqrt{\frac{m}{n}}\frac{n}{n-1}\sigma ^2\rightarrow 0, \end{aligned}$$
(6)

since \(m = o(n)\). Convergence of the term (6) in mean implies convergence in probability to zero. Application of the Central Limit Theorem to the remaining terms of (5) yields the desired result. \(\square \)

Remark 4

In the proof of Theorem 3, we have assumed that \(m=o(n)\), i.e., the block size grows faster than the number of blocks. This condition can be dropped using the Lyapunov condition under the assumption of finite eighth moments, as will be shown in the following. We set

$$\begin{aligned} T_{i, j} = \frac{S_{i, j}^2 - \sigma ^2}{\sqrt{\frac{m_i}{n_i}\left( \nu _4 - \frac{\sigma ^4(n_i - 3)}{n_i - 1} \right) }}, \end{aligned}$$

with \(E(T_{i,j}) = 0\) and \(\sum _{j = 1}^{m_i} E(T_{i,j}^2) = 1\,\, \forall i,\) where i denotes the ith row of the triangular array. The Lyapunov condition [see corollary 1.9.3 in Serfling (1980)] is the following:

$$\begin{aligned} \exists \delta >0: \lim _{i\rightarrow \infty } \sum _{j = 1}^{m_i} E\left( |T_{i, j}|^{2 + \delta } \right) = 0 \end{aligned}$$

With \(\delta = 2\) and existing moments \(\nu _1,\ldots ,\nu _8\) of \(X_1\) we get

$$\begin{aligned} \sum _{j = 1}^{m_i}E\left( |T_{i, j}|^{4} \right)&= \sum _{j = 1}^{m_i}\frac{1}{\frac{m_i^2}{n_i^2}\left( \nu _4 - \frac{\sigma ^4(n_i - 3)}{n_i - 1} \right) ^2}\cdot E\left( \left( S_{i,j}^2 - \sigma ^2 \right) ^4\right) \\&= \frac{m_i}{\frac{m_i^2}{n_i^2}\left( \nu _4 - \frac{\sigma ^4(n_i - 3)}{n_i - 1} \right) ^2}\cdot O\left( \frac{1}{n_i^2}\right) \cdot g(\nu _1,\ldots ,\nu _8) \\&= o\left( \frac{1}{m_i}\right) \rightarrow 0, \end{aligned}$$

where g is a function of the existing moments \(\nu _1,\ldots ,\nu _8\) with \(g(\nu _1,\ldots ,\nu _8) = O(1)\). See Angelova (2012) for the fourth central moment of the sample variance. Therefore, the condition \(m = o(n)\) can be dropped.

2.2 Choice of the block size

When choosing blocks of length \(n = 2\), the estimator \(\widehat{\sigma }^2_{\mathrm{Mean}}\) results in a difference-based estimator which considers \(\lfloor N/2\rfloor \) consecutive non-overlapping differences:

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Mean}, n=2} = \frac{1}{2\lfloor N/2\rfloor }\sum _{j = 1}^{\lfloor N/2\rfloor } \left( Y_{2j} - Y_{2j-1}\right) ^2. \end{aligned}$$

Difference-based estimators have been considered in many papers, see Von Neumann et al. (1941), Rice (1984), Gasser et al. (1986), Hall et al. (1990), Dette et al. (1998), Munk et al. (2005), Tong et al. (2013), among many others. Dai and Tong (2014) discussed estimation approaches based on differences in nonparametric regression context, Wang et al. (2017) considered an estimation technique which involves differences of second order, while Tecuapetla-Gómez and Munk (2017) proposed a difference-based estimator for m-dependent data. An ordinary difference-based estimator of first order, which considers all \(N-1\) consecutive differences, is [see, e.g., Von Neumann et al. (1941)]:

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Diff}} = \frac{1}{2(N-1)}\sum _{j = 1}^{N-1} \left( Y_{j + 1} - Y_{j}\right) ^2 = \frac{1}{2(N-1)} Y^{\mathrm{T}} A Y, \end{aligned}$$
(7)

where \(A = \tilde{A}^{\mathrm{T}} \tilde{A} \) and \(\tilde{A}\) such that \(\tilde{A} Y = \left( Y_2 - Y_1,\ldots ,Y_N - Y_{N-1} \right) ^{\mathrm{T}} .\) Theorems 1.5 and 1.6 in Seber and Lee (2012) are used to calculate the expectation and the variance of the estimator \(\widehat{\sigma ^2}_{\mathrm{Diff}}\):

$$\begin{aligned} E\left( \widehat{\sigma ^2}_{\mathrm{Diff}}\right)&= \sigma ^2 + \frac{1}{2(N-1)} \mu ^{\mathrm{T}} A \mu ,\\ {\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Diff}}\right)&= \frac{1}{4(N-1)^2}\left( \nu _4 (4 N - 6) + 2\sigma ^4 + 4\sigma ^2 \mu ^{\mathrm{T}} A^2 \mu + 4\nu _3 \mu ^{\mathrm{T}} A a\right) , \\ \end{aligned}$$

where \(\mu = E(Y) \), \(\nu _i = E\left( X_1^i\right) \) and a is a vector of the diagonal elements of A.

When no changes in the mean are present both, the difference-based and the averaging estimators, are unbiased. For the variance of the estimators we have

$$\begin{aligned} {\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Diff}}\right)&= \nu _4\frac{4N - 6}{4(N-1)^2} + \frac{\sigma ^4}{2(N-1)^2}, \\ {\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right)&= \frac{\nu _4}{N} - \frac{\sigma ^4(n-3)}{N(n-1)}. \end{aligned}$$

For example, for \(N = 100\) and a block size \(n = 10,\) we get \({\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Mean}}\right) = 0.0222\), while \({\hbox {Var}}\left( \widehat{\sigma ^2}_{\mathrm{Diff}}\right) = 0.0302\). Therefore, when no changes in the mean are present the block-estimator can have smaller variance than the difference-based estimator. In Sect. 4 we compare the performance of the proposed estimation procedures with that of the difference-based estimator (7) in different change-point scenarios.

In the following, we investigate the proper choice of the block size for the estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}} \). All calculations in this paper have been performed with the statistical software +R+, version 3.5.2, R Core Team (2018).

For known jump positions the MSE of the blocks-variance estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}} \) can be determined analytically. The position of the jump is relevant for the performance of this approach. Therefore, it is reasonable to consider different positions of the K jumps to get an overall assessment of the performance of the blocks-estimator. For every \(K\in \{1, 3, 5\}\), we generate K jumps of equal heights \(h = \delta \cdot \sigma \), with \(\delta \in \{0, 0.1,0.2,\ldots ,4.9,5\}\), at positions sampled randomly from a uniform distribution on the values \(\max _{n}\left( N - \lfloor N / n\rfloor n \right) + 1,\ldots ,N - \max _{n}\left( N - \lfloor N / n\rfloor n \right) \) without replacement, and calculate the MSE for every reasonable block size \(n\in \{2,3,4,\ldots ,\lfloor N/2\rfloor \}\). This is repeated 1000 times, leading to 1000 MSE values for every h and n based on different jump positions. The average of these MSE values is taken for each h and n. Data are generated from the standard normal or the \(t_5\)-distribution.

Panel (a) of Fig. 1 shows the block size \(n_{\mathrm{opt}}\) which yields the least theoretical MSE value of the estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) depending on the jump height \(h = \delta \cdot \sigma \) with \(K \in \{1, 3, 5\}\) jumps for \(N = 1000\) observations and normal distribution. We observe that \(n_{\mathrm{opt}}\) decreases for \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) as the jump height grows. Blocks of size 2 (resulting in a non-overlapping difference-based estimator) are preferred when \(h \approx 4\sigma \) and \(K = 5\), while larger blocks lead to better results in case of smaller or less jumps.

Fig. 1
figure 1

a MSE-optimal block length \(n_{\mathrm{opt}}\) of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\), b MSE regarding \(n_{\mathrm{opt}}\) of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) and c MSE of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) when choosing \( n = \frac{\sqrt{N}}{K + 1}\) for \(K = 1\) (solid line), \(K = 3\) (dashed line) and \(K=5\) (dotted line) with \(N=1000\), \(Y_t = X_t + \sum _{k = 1}^{K} h I_{t \ge t_k}\), where \(X_t \sim {N}(0, 1)\) and \(h = \delta \cdot \sigma ,\, \delta \in \{0, 0.1, \ldots , 5\}\).

Panel (b) of Fig. 1 depicts the MSE of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) for the respective MSE-optimal block size \(n_{\mathrm{opt}}\). In all three scenarios the MSE increases if the jump height gets larger. Different values for the optimal block size \(n_{\mathrm{opt}}\) are obtained in different scenarios, i.e., for different K and \(h = \delta \cdot \sigma \). As the true number and height of the jumps in the mean are usually not known in practice, we wish to choose a block size which yields good results in many scenarios. We do not consider very high jumps any further, since they can be detected easily and are thus not very interesting. The square root of the sample size N has proven to be a good choice for the block size in many applications, see e.g. Rooch et al. (2019). Moreover, in the upper panel of Fig. 1, we observe that smaller block sizes n are preferred when the number of change-points is high. If the estimation of the variance is in the focus of the application, we suggest to choose the block size depending on K:

$$\begin{aligned} n = \max \left\{ \left\lfloor \frac{\sqrt{N}}{K + 1}\right\rfloor , 2 \right\} , \end{aligned}$$
(8)

which gets smaller if the number of jumps increases, resulting in many jump-free and only few contaminated blocks. Otherwise, if testing is of interest in view of Theorem 3, we suggest a block size which grows slightly faster than \(\sqrt{N}\), e.g. \(n = \max \left\{ \left\lfloor \frac{N^{6 / 10}}{K + 1}\right\rfloor , 2 \right\} \), which yields similar results as (8).

Remark 5

For large N we get \(m = N / n = \sqrt{N}(K+1)\). In this case the number of jumps needs to satisfy \(K = o\left( m^{1/3} \right) = o\left( N^{1/6} K^{1/3} \right) \), i.e., \(K = o\left( N^{1/4} \right) \) can be tolerated, see Remark 2. An even larger rate of shifts can be tolerated by choosing \(m = N / c\) for some constant c, i.e., a fixed block length n. However, this reduces the efficiency of the resulting estimator in the presence of a small rate of shifts, see Fig. 1. For example, for the standard normal distribution, the asymptotic relative efficiency of the blocks-estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is \((n - 1)n^{-1} < 1\) and that of the difference-based estimator [as defined in (7)] is 2/3 [see e.g. Seber and Lee (2012)].

Panel (c) of Fig. 1 shows the MSE of the estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) with the block size n chosen according to (8). For \(K \in \{3, 5\}\) there is only a moderate loss of performance when choosing n according to (8) instead of \(n_{\mathrm{opt}}\) which depends on the number of jumps K and the height \(h = \delta \cdot \sigma \). In case of \(K = 1\) change in the mean the performance of the blocks-estimator worsens slightly.

Table 1 shows the average MSE of the ordinary sample variance for normally distributed data and different values of K and h, with \(N = 1000\). Again, 1000 simulation runs are performed where jumps are added to the generated data at randomly chosen positions. We observe that the MSE becomes very large when the number and height of the level shifts increases. Obviously, the blocks-estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) performs much better than the sample variance.

Table 1 MSE of the sample variance for normally distributed data and different K and h, \(N = 1000\)

The results for data from the \(t_5\)-distribution are similar to those obtained for the normal distribution, see Fig. 7 in “Appendix.” Again, the blocks-estimator with the block size (8) [Panel (c)] performs well and does not lose too much performance compared to Panel (b) of Fig. 7, where the optimal block size is considered. Similar results are obtained for \(N = 2500\), see Figs. 9 and 10 in “Appendix”.

As the number of change-points K is not known exactly in real applications, there are several possibilities to set K in formula (8):

  1. 1.

    Use \(n = \max \left\{ \left\lfloor \frac{\sqrt{N}}{K + 1} \right\rfloor , 2 \right\} \) with a large guess on the value of K.

  2. 2.

    Pre-estimate K with an appropriate procedure.

  3. 3.

    Use prior knowledge about plausible values of K.

We will discuss the first two approaches in the following two subsections.

2.3 Using a large guess on the number of jumps

If many change-points are present, a small block size should be chosen, while larger blocks are preferred in the case of only a few level shifts. If a practitioner does not have knowledge about the number of jumps in the mean we recommend choosing a rather high value K in the formula (8), which results in small blocks. Doing so we are usually on the safe side, since choosing too few blocks can result in a very high MSE, while the performance of the estimator does not worsen so much when choosing many blocks.

As an example we generate 1000 time series of length \(N = 1000\) with \(K = 3\) jumps at random positions. Figure 2 shows the MSE of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) depending on the jump height \(h = \delta \sigma ,\, \delta \in \{0, 0.1,\ldots ,5\},\) when choosing \( n = \frac{\sqrt{N}}{K + 1}\) with values \(K\in \{0, 1, \ldots , 6\}\). We observe that choosing a too small number of blocks (i.e., a too large block size) results in large MSE values if the jumps are rather high. On the other hand, the results do not worsen as much when choosing unnecessarily many and thus small blocks. Figure 8 in “Appendix” shows similar results for \(K = 5\) jumps.

Choosing \(n=2\) results in a non-overlapping difference-based estimator which performs also well but loses efficiency compared to the block-estimator with growing block size, which can be fully efficient, see Remarks 2 and 5.

We will not consider the block-estimator with the block size n depending on the choice of some large value of K instead of the true one in the following investigation, since it is a subjective choice of a practitioner.

Fig. 2
figure 2

MSE of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) when choosing \( n = \frac{\sqrt{N}}{K + 1}\) for true \(K = 3\) (—), \(K = 0\) ( ) \(K = 1\) ( ) \(K = 2\) ( ) \(K = 4\) ( ) \(K = 5\) ( ) and \(K = 6\) ( ) with \(N=1000\) and \(h = \delta \cdot \sigma ,\, \delta \in \{0, 0.1, \ldots , 5\}\), \(Y_t = X_t + \sum _{k = 1}^{3} h I_{t \ge t_k}\), where a \(X_t \sim {N}(0, 1)\) and b \(X_t \sim t_3\)

2.4 Pre-estimation of the number of jumps

We will investigate the MOSUM procedure for the detection of multiple change-points proposed by Eichinger and Kirch (2018) to pre-estimate the number of change-points K. According to the simulations in the aforementioned paper, this procedure yields very good results in comparison with many other procedures for change-point estimation. We will describe the procedure briefly in the following.

At time point t a statistic \(T_{t, N}\) is calculated as

$$\begin{aligned} T_{t, N}(G) = \frac{1}{\sqrt{2G}} \Bigg ( \sum _{i = t + 1}^{t+ G} Y_i - \sum _{t-G+1}^t Y_i \Bigg ), \end{aligned}$$

where \(G = G(N)\) is a bandwidth parameter and \(G\le t \le N - G\). In what follows, we will set the bandwidth parameter to \(G = \sqrt{N}\). The estimated number of change-points \(\widehat{K}\) is the number of pairs \((v_i, w_i)\) which fulfil

$$\begin{aligned}&\frac{|T_{t, N}(G)|}{\widehat{\tau }_{t, N}} \ge D_N(G,\delta _N)\quad {\text { for }} \quad t = v_i,\ldots , w_i\\&\frac{|T_{t, N}(G)|}{\widehat{\tau }_{t, N}}< D_N(G,\delta _N)\quad {\text { for }} \quad t = v_i-1, w_i+1\\&w_i - v_i \ge \eta G \quad {\text { with}} 0<\eta <1/2 {\text {arbitrary but fixed,}} \end{aligned}$$

where \(D_N(G,\delta _N)\) is a critical value depending on the bandwidth parameter G and a sequence \(\delta _N\rightarrow 0\) and \(\widehat{\tau ^2}_{t, N}\) is a local estimator of the variance at location t. In a window of length 2G, the location t is treated as a possible change-point location. A mean correction is performed before and after the time point t for computing \(\widehat{\tau ^2}_{t, N}\).

The corresponding estimated change-point locations \(\widehat{t}_1, \ldots , \widehat{t}_{\widehat{K}}\) are

$$\begin{aligned} \widehat{t}_i = {\text {arg}}\max _{v_i\le t\le w_i} \frac{|T_{t, N}(G)|}{\widehat{\tau }_{t, N}}. \end{aligned}$$

More information on the procedure can be found in Eichinger and Kirch (2018).

We will use the MOSUM procedure to estimate the number of change-points K in the formula (8) for the block size n. The corresponding blocks-estimator is denoted as \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\). In Sect. 4 we will see that the performance of the two blocks-estimators \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) [as defined in (2)] and \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) is similar in many cases.

Moreover, we introduce another estimation procedure which is fully based on the MOSUM method, for comparison. We divide the data into \(\widehat{K} + 1\) blocks at the estimated locations \(\widehat{t}_1, \ldots , \widehat{t}_{\widehat{K}}\) of the level shifts. In every block \(j = 1,\ldots ,\widehat{K} + 1,\) the empirical variance \(S_j^2\) is calculated. A weighted average \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) of those values can be computed to estimate the variance, i.e.,

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}} = \sum _{j = 1}^{\widehat{K}} \frac{n_j}{N}S_j^2, \end{aligned}$$
(9)

where \(n_j\) is the size of block \(j = 1,\ldots ,\widehat{K}.\)

2.5 Extension to short-range dependent data

Since many real datasets exhibit autocorrelation, we investigate the averaging estimation procedure under dependence. We consider a strictly stationary linear process \((X_t)_{t\ge 1}\) with \(X_t = \sum _{i = 0}^\infty \psi _i \zeta _{t - i}\), where \(\zeta _{t}\) are i.i.d. with mean zero and finite variance and \((\psi _i)_{i\ge 0}\) is a sequence of constants with \( \sum _{i = 0}^\infty \psi _i<\infty \). The autocovariance function is defined as \(\gamma (\delta ) = E((X_t - E(X_t))(X_{t+\delta } - E(X_{t+\delta }))) ,\, \delta \in {\mathbb {N}}\), and is absolutely summable in this case, i.e., \(\sum _{\delta = 0}^{\infty } |\gamma (\delta )| < \infty \).

Theorem 6

Consider a strictly stationary linear process \((X_t)_{t\ge 1}\)with an absolutely summable autocovariance function and \(Y_t = X_t + \sum _{k = 1}^{K} h_k I_{t \ge t_k}\), i.e., data with K level shifts. Moreover, let \(K\left( \sum _{k = 1}^{K} h_k\right) ^2 = o(m\log (N)^{-1})\). Then \(\widehat{\sigma ^2}_{\mathrm{Mean}} = \frac{1}{m}\sum _{j = 1}^{m} S_j^2 \rightarrow \sigma ^2\)in probability.

Proof

Without loss of generality assume that the first \(B\le K\) out of m blocks are perturbed by K jumps in the mean. Furthermore, let \(\mu = E(X_{j,t})\, \forall j, t\), \(\mu _{j,t} = E\left( Y_{j, t}\right) \) and \(\overline{\mu }_{j} = \frac{1}{n} \sum _{t = 1}^n E\left( Y_{j, t}\right) \). Let the term \(S^2_{j, 0}\) denote the empirical variance of the uncontaminated data \(X_{j, 1},\ldots ,X_{j,n}\) in the block \(j = 1,\ldots ,m\), while \(S^2_{j, h}\) is the empirical variance in the perturbed block \(j = 1,\ldots ,B\).

Then

$$\begin{aligned}&P\left( \left| \widehat{\sigma ^2}_{\mathrm{Mean}} - \sigma ^2 \right|>\epsilon \right) = P\left( \left| \frac{1}{m}\sum _{j = B+1}^{m} S ^2_{j,0} + \frac{1}{m}\sum _{j = 1}^{B} S ^2_{j,h} - \sigma ^2 \right|>\epsilon \right) \nonumber \\&= P\left( \left| \frac{1}{m}\sum _{j = B+1}^{m} S ^2_{j,0} + \frac{1}{m}\sum _{j = 1}^{B} S ^2_{j,0} \right. \right. \nonumber \\&\quad + \,\frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n (X_{j, t} - \overline{X}_j)(\mu _{j,t} - \overline{\mu }_{j}) \nonumber \\&\quad \left. \left. + \,\frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2 - \sigma ^2 \right|>\epsilon \right) \nonumber \\&= P\left( \left| \frac{1}{m}\sum _{j = 1}^{m } S ^2_{j,0} - \sigma ^2 + \frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n X_{j, t} (\mu _{j,t} - \overline{\mu }_{j}) \right. \right. \nonumber \\&\quad \left. \left. + \,\frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2 \right|>\epsilon \right) \nonumber \\&\le P\left( \left| \frac{1}{m}\sum _{j = 1}^{m } S ^2_{j,0} - \sigma ^2\right|> \frac{\epsilon }{3} \right) \nonumber \\&\quad + \,P\left( \left| \frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n X_{j, t} (\mu _{j,t} - \overline{\mu }_{j})\right|> \frac{\epsilon }{3} \right) \nonumber \\&\quad + \,P\left( \frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2 > \frac{\epsilon }{3} \right) \nonumber \\&=: A_{1,N} +A_{2,N} +A_{3,N} \end{aligned}$$
(10)

For the first term \(A_{1,N}\) of (10), we have:

$$\begin{aligned} A_{1,N}&= P\left( \left| \frac{1}{m}\sum _{j = 1}^{m } S ^2_{j,0} - \sigma ^2\right|> \frac{\epsilon }{3} \right) \nonumber \\&= P\left( \left| \frac{1}{m}\sum _{j = 1}^{m } \frac{1}{n-1}\sum _{t = 1}^n (X_{j, t} - \overline{X}_j)^2 - \sigma ^2\right|> \frac{\epsilon }{3} \right) \nonumber \\&= P\left( \left| \frac{1}{m}\sum _{j = 1}^{m } \frac{1}{n-1}\sum _{t = 1}^n (X_{j, t} - \mu )^2 + \frac{1}{m}\sum _{j = 1}^{m } \frac{n}{n-1} (\mu - \overline{X}_j)^2 \right. \right. \nonumber \\&\quad \left. \left. +\, \frac{2}{m(n-1)} \sum _{j = 1}^{m } \sum _{t = 1}^n (X_{j,t} - \mu )(\mu - \overline{X}_j) - \sigma ^2\right|> \frac{\epsilon }{3} \right) \nonumber \\&\le P\left( \left| \frac{N}{m(n-1)} \frac{1}{N}\sum _{j = 1}^{m } \sum _{t = 1}^n (X_{j, t} - \mu )^2 - \sigma ^2\right|> \frac{\epsilon }{6} \right) \nonumber \\&\quad + \,P\left( \left| -\frac{ n}{m (n-1) } \sum _{j=1}^m(\overline{X}_j - \mu )^2 \right| > \frac{\epsilon }{6}\right) \end{aligned}$$
(11)

The first term of (11) converges to zero, since the term \(\frac{1}{N}\sum _{j = 1}^{m } \sum _{t = 1}^n (X_{j, t} - \mu )^2\) is a consistent estimator of \(\sigma ^2\) and \(\frac{N}{m(n-1)} = \frac{N}{N - m}\rightarrow 1\). For the second term of (11) we have

$$\begin{aligned}&E\left( \left| -\frac{ n}{m(n-1)} \sum _{j=1}^m(\overline{X}_j - \mu )^2 \right| \right) = \frac{n}{m(n-1)} \sum _{j=1}^mE\left( (\overline{X}_j - \mu )^2\right) \\&\quad = \frac{n}{m(n-1) } \sum _{j=1}^m {\hbox {Var}}(\overline{X}_j) \le \frac{n}{(n-1) } \frac{\gamma (0) + 2\sum _{i = 1}^{\infty } \gamma (i)}{n} \\&\quad \longrightarrow 0, \end{aligned}$$

due to absolute summability of the autocovariance function \(\gamma \), which implies convergence in probability. Therefore, the second term of (11) converges to zero. Altogether, the sum (11) converges to zero.

The second and the third terms \(A_{2,N}\) and \(A_{3,N}\) of (10) converge to zero due to the following considerations:

$$\begin{aligned}&E\left( \frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n X_{j, t} (\mu _{j,t} - \overline{\mu }_{j}) \right) \\&\quad = \frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n E(X_{j, t}) (\mu _{j,t} - \overline{\mu }_{j})\\&\quad = \frac{1}{m}\sum _{j = 1}^{B}\frac{2\mu }{n - 1} \sum _{t = 1}^n (\mu _{j,t} - \overline{\mu }_{j}) = 0\\&\quad {\hbox {Var}}\left( \frac{1}{m}\sum _{j = 1}^{B}\frac{2}{n - 1} \sum _{t = 1}^n X_{j, t} (\mu _{j,t} - \overline{\mu }_{j}) \right) \\&\quad = \frac{4}{m^2(n-1)^2}{\hbox {Var}}\left( \sum _{j = 1}^{B} \sum _{t = 1}^n X_{j, t} (\mu _{j,t} - \overline{\mu }_{j}) \right) \\&\quad = \frac{4}{m^2(n-1)^2} Cov\left( \sum _{j = 1}^{B} \sum _{t = 1}^{n}X_{j, t} (\mu _{j,t} - \overline{\mu }_{j}), \sum _{l = 1}^{B} \sum _{s = 1}^{n} X_{l,s} (\mu _{l,s} - \overline{\mu }_{l})\right) \\&\quad \le \frac{4}{m^2(n-1)^2}\sum _{j = 1}^{B} \sum _{t = 1}^{n}\sum _{l = 1}^{B} \sum _{s = 1}^{n} |\mu _{j,t} - \overline{\mu }_{j}| |\mu _{l,s} - \overline{\mu }_{l}| |Cov(X_{j, t}, X_{l, s})| \\&\quad \le \frac{4}{m^2(n-1)^2} \left( \sum _{k = 1}^K h_k\right) ^2 \sum _{j = 1}^{m} \sum _{t = 1}^{n}\sum _{l = 1}^{m} \sum _{s = 1}^{n} |Cov(X_{j, t}, X_{l, s})|\\&\quad = \frac{4}{m^2(n-1)^2} \left( \sum _{k = 1}^K h_k\right) ^2 \left( NVar(X_{1,1}) + 2\sum _{u = 1}^{N - 1}|\gamma (u)|(N - u)\right) \\&\quad \le \frac{4N}{m^2(n-1)^2} \left( \sum _{k = 1}^K h_k\right) ^2 \left( {\hbox {Var}}(X_{1,1}) + 2 \sum _{u = 1}^{\infty }|\gamma (u)| \right) \\&\quad \longrightarrow 0, \end{aligned}$$

due to absolute summability of the autocovariance function \(\gamma \) and the condition \(K\left( \sum _{k = 1}^{K} h_k\right) ^2 = o(m\log (N)^{-1})\). Therefore, the term \(A_{2, N}\) converges to zero in rth mean with \(r = 2\), which implies convergence in probability. The argument in the probability \(A_{3,N}\) is deterministic. Set \(\epsilon _N = \log (N)^{-1}.\) For large N we have

$$\begin{aligned} A_{3,N}&\le P\left( \frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2> \frac{\epsilon _N}{3} \right) \\&= P\left( \log (N) \frac{1}{m}\sum _{j = 1}^{B}\frac{1}{n-1} \sum _{t = 1}^n \left( \mu _{j,t} - \overline{\mu }_{j} \right) ^2> \frac{1}{3} \right) \\&\le P\Bigg (\log (N) \frac{B}{m}\frac{n}{n-1} \left( \sum _{k = 1}^{K} h_k\right) ^2> \frac{1}{3} \Bigg ) \\&\le P\Bigg (\log (N) \frac{K}{m}\frac{n}{n-1} \left( \sum _{k = 1}^{K} h_k\right) ^2> \frac{1}{3} \Bigg ) = 0, \end{aligned}$$

since \(K\left( \sum _{k = 1}^{K} h_k\right) ^2 = o(m\log (N)^{-1})\) holds. Altogether we get the result \(\widehat{\sigma ^2}_{\mathrm{Mean}} \overset{p}{ \rightarrow } \sigma ^2\). \(\square \)

Remark 7

If the jump heights are bounded by a constant \(h\ge h_k,\, k = 1,\ldots ,K,\) the strongest restriction arises if all heights equal this upper bound resulting in the constraint \( K^3h^2 = o\left( m \log (N)^{-1} \right) \).

3 Trimmed estimator of the variance

So far we have considered cases where the number of changes in the mean K is rather small with respect to the number of the blocks m and thus to the number of observations N. However, there might be situations in which level shifts occur frequently. Asymptotically, the blocks-estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) [see (2)] is still a good choice for the estimation of the variance as long as the number of level shifts grows slowly, see e.g. Theorem 1. However, if many jumps are present in a finite sample the blocks-estimator is no longer a good choice and will become strongly biased. We propose an asymmetric trimmed mean of the blockwise estimates instead of their ordinary average, i.e., large estimates are removed and the average value of the remaining ones is calculated. We do not consider a symmetric trimmed mean, since the sample variance is positively biased in the presence of level shifts, so that estimates from blocks containing a level shift are expected to show up as upper outliers. Moreover, we suggest using rather many small blocks to account for potentially many level shifts. In the next Sects. 3.1 and 3.2, the choice of the trimming fraction is discussed.

3.1 Estimation with a fixed trimming fraction

The trimmed blocks-estimator is given as

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Tr}, \alpha } = C_{N, {\mathrm{Tr}}, \alpha } \frac{1}{m-\lfloor \alpha m\rfloor }\sum _{j = 1}^{m-\lfloor \alpha m\rfloor } S^2_{\left( j\right) }, \end{aligned}$$
(12)

where \(S^2_{\left( 1\right) }\le \cdots \le S^2_{\left( m\right) }\) are the ordered blockwise estimates, m is the number of blocks and \(C_{N, {\mathrm{Tr}},\alpha }\) is a sample and distribution dependent correction factor to ensure unbiasedness in the absence of level shifts. In practice this constant can be simulated under the assumption of observing data from a known location-scale family. For example, for standard normal distribution, \(\alpha = 0.2\), \(N = 1000\) and \(n = 20\) (\(m = 50\)) we generate 1000 samples of length \(N = 1000\) and calculate the average of the uncorrected trimmed variance estimates. The reciprocal of this average value yields \(C_{1000, {\mathrm{Tr}}, 0.2} = 1.198 \).

As an example, we generate 1000 time series of length \(N \in \{ 1000,2500\}\) from normal and \(t_5\)-distribution. We add \(K = N\cdot p \) jumps to the generated data at randomly chosen positions, as was done in Sect. 2.2, with \(p\in \{0, 2/1000, 4/1000, 6/1000, 10/1000\}\) and height \(h\in \{0, 2, 3, 5\}\). We choose \(n = 20\) to ensure that the number of jump-contaminated blocks is sufficiently smaller than the total number of blocks.

Table 2 shows the simulated MSE of the trimmed estimator (12) for \(\alpha \in \{0.1, 0.3,0.5\}\). Clearly, the performance of the trimmed estimator depends on the number of jumps in the mean and the trimming parameter \(\alpha \). Larger values of \(\alpha \) are required when dealing with many jumps but lead to an increased MSE if there are only a few jumps. Therefore, it is reasonable to choose \(\alpha \) adaptively, as will be described in the next Sect. 3.2.

Table 2 Simulated MSE\(\cdot 10^2\) of \(\widehat{\sigma ^2}_{\mathrm{Tr}, \alpha }\) for normally and \(t_5\)-distributed data with \(N=1000\) and different \(\alpha \), h and K

3.2 Adaptive choice of the trimming fraction

Instead of using a fixed trimming fraction, we can choose \(\alpha \) adaptively, yielding the adaptive trimmed estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}\) with

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Tr,ad}} = C_{N, {\mathrm{Tr,ad}}} \frac{1}{m-\lfloor \alpha _{\mathrm{adapt}} m\rfloor }\sum _{j = 1}^{m-\lfloor \alpha _{\mathrm{adapt}} m\rfloor } S^2_{\left( j\right) } , \end{aligned}$$
(13)

where \(S^2_{\left( 1\right) }\le \cdots \le S^2_{\left( m\right) }\) are the ordered blockwise sample variances and \(\alpha _{\mathrm{adapt}}\) is the adaptively chosen percentage of the blocks-estimates which will be removed.

3.2.1 Estimation under normality

We use the approach for outlier detection discussed in Davies and Gather (1993) to determine \(\alpha _{\mathrm{adapt}}\), assuming that the underlying distribution is normal. In this case, the distribution of the sample variance is well known, i.e., in block j we have that

$$\begin{aligned} T_j := \frac{(n-1)S^2_j }{ \sigma ^2 }\sim \chi ^2_{n-1}. \end{aligned}$$

Since the true variance \(\sigma ^2\) is not known, we propose to replace \(\sigma ^2\) by an appropriate initial estimate, such as the median of the blocks-estimates, i.e.,

$$\begin{aligned} \widehat{\sigma ^2}_{\mathrm{Med}} = C_{N,{\mathrm{Med}}}\cdot {\text {med}}\{S_1^2,\ldots ,S_m^2\}, \end{aligned}$$
(14)

where \(C_{N,{\mathrm{Med}}}\) is a finite sample correction factor. Subsequently, we remove those values

$$\begin{aligned} \widehat{T}_j = \frac{(n-1)S^2_j }{ \widehat{\sigma ^2}_{\mathrm{Med}}}, \end{aligned}$$
(15)

which exceed \(q_{\chi ^2_{n-1}, 1 - \beta _m}\), the \((1 - \beta _m)\)-quantile of the \(\chi ^2_{n-1}\)-distribution, with

$$\begin{aligned} \beta _m = 1 - (1 -\beta )^{1 / m} \quad {\text {and}} \quad \beta \in (0,1) . \end{aligned}$$
(16)

We will refer to the adaptively trimmed estimator based on the approach of Davies and Gather (1993) under normality as \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\).

The choice of \(\beta _m\) results in a probability of \(1-\beta \) that no observation (block in our case) is trimmed if \(T_1,\ldots ,T_m\) are i.i.d. \(\chi ^2_{n-1}\)-distributed, i.e.,

$$\begin{aligned}P(T_1 \le q_{\chi ^2_{n-1}, 1 - \beta _m},\ldots ,T_m \le q_{\chi ^2_{n-1}, 1 - \beta _m}) &= \left( P(T_1 \le q_{\chi ^2_{n-1}, 1 - \beta _m})\right) ^m \\& = \left( 1 - \beta _m\right) ^m = \left( (1 -\beta )^{1 / m}\right) ^m = 1 -\beta . \end{aligned}$$

Furthermore, we expect that roughly \(m\cdot \beta _m\) blocks are trimmed on average in the absence of level shifts. The following simulations suggest that the adaptive trimming fraction is slightly larger than \(\beta _m\) which can be explained by the fact that we need to use an estimate such as \(\widehat{\sigma ^2}_{\mathrm{Med}} \) instead of the unknown \(\sigma ^2\). We generated 10,000 sequences of observations of size \(N \in \{1000, 2500\}\) for \(\beta \in \{0.05,\, 0.1\}\). Table 3 shows the average number of trimmed blocks in the absence of level shifts.

Table 3 Average number of trimmed blocks in the absence of level shifts for normally distributed data and different N and \(\beta \) for the estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\)

We suggest choosing a small block size, e.g. \(n = 20\), to cope with a possibly large number of change-points. In this way, it is ensured that the number of uncontaminated blocks is much larger than the number of perturbed blocks. Moreover, we choose \(\beta = 0.05 \).

The correction factor in (13) needs to be simulated taking into account that the percentage of the omitted block-estimates is no longer fixed. Therefore, for given N and \(\beta \) we generate 1000 sequences of observations. In each simulation run we calculate the block-estimates \(S_j^2,\, j=1,\ldots ,m,\) and the initial estimate of the variance \(\widehat{\sigma ^2}_{\mathrm{Med}} \). Subsequently, we remove the values \(\widehat{T}_j = (n-1)S^2_j / \widehat{\sigma ^2}_{\mathrm{Med}} \) which exceed the quantile \(q_{\chi ^2_{n-1}, 1 - \beta _{m}}\). Then the average value of the remaining block-estimates is computed. The procedure yields 1000 estimates. The correction factor is the reciprocal of the average of these values. For \(N = 1000\) and \(\beta = 0.05\) the simulated correction factor is \(C_{1000, {\mathrm{Tr,ad}}}^{0.05} = 1.0020\), while for \(N=2500\) we have \(C_{2500, {\mathrm{Tr,ad}}}^{0.05} = 1.0009 \), so both are nearly 1 and could be neglected with little loss.

3.2.2 Estimation under unknown distribution

When no distributional assumptions are made and the block size n is large, one can use that \(\sqrt{n}\left( S^2_j - \sigma ^2\right) / \sqrt{\nu _4-\sigma ^4}\) is approximately standard normal if the fourth moment exists.

The fourth central moment \(\nu _4 = E\left( \left( X_1 - E(X_1)\right) ^4\right) \) needs to be estimated properly in the presence of level shifts. We can estimate it in blocks and then compute the median of the blocks-estimates \(\widehat{\mu }_{4,{\mathrm{Med}}}\), as was done in (14). Then, values

$$\begin{aligned} \widehat{T}_j = \sqrt{n}\left( S^2_j - \widehat{\sigma ^2}_{\mathrm{Med}}\right) / \sqrt{ \widehat{\mu }_{4,{\mathrm{Med}}}-\widehat{\sigma ^2}_{\mathrm{Med}}^2}, \end{aligned}$$
(17)

which exceed the \((1 - \beta _m)\)-quantile of the standard normal distribution are removed. The corresponding adaptively trimmed estimator is denoted as \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\), where, again, \(\beta = 0.05 \) [see (16)] will be used in what follows.

Table 4 Average number of trimmed blocks in the absence of level shifts for normally distributed data and different values of N and \(\beta \) for the estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\)

Table 4 shows the average number of trimmed blocks in the absence of level shifts for normally distributed data, analogously to Table 3. We observe that the average number of trimmed block-estimates is much larger than the values for the trimming procedure which is based on the normality assumption.

Remark 8

We do not use distribution dependent correction factors for the estimators \(\widehat{\mu }_{4,{\mathrm{Med}}}\) and \(\widehat{\sigma ^2}_{\mathrm{Med}}\), since the underlying distribution is not known. The asymptotic distribution of the blockwise empirical variances is normal and therefore, for large block sizes, the distribution is expected to be roughly symmetric. A correction factor can then be neglected with little loss, since the sample median of symmetrically distributed random variables estimates their expected value.

3.3 Choice of the trimming fraction based on the MOSUM procedure

In Sect. 2.4 we have used the MOSUM procedure proposed by Eichinger and Kirch (2018) to estimate the unknown number of jumps K. The corresponding estimate \(\widehat{K}\) can also be used to determine the trimming fraction \(\alpha \) of the trimmed estimator in (12), i.e., we can set \(\alpha = \widehat{K} / m\). We will denote the corresponding estimator as \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{mos}}\). We compare the average number of trimmed blocks for the three estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{mos}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) in a simulation study. We generate 1000 sequences of observations of length \(N = 1000\) for different values of K and h. To every sequence we add K jumps of height h at randomly chosen positions, as was done in Sect. 2.2. Table 5 shows the average number of trimmed blocks for data sampled from the standard normal distribution and the corresponding MSE in different scenarios. Moreover, the MSE of the averaging estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is displayed in the table for comparison.

We observe that \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{mos}}\) trims more blocks on average and yields a smaller MSE than \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) if the jumps are small. The MSE of the averaging estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is similar to that of \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{mos}}\) in that case. As opposed to this, the three trimmed estimators trim similarly many blocks on average if the jumps are large, with \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) leading to smaller MSE values than \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{mos}}\). This can be explained by the fact that the former methods base the decision whether to trim a block or not directly on the criterion whether the empirical variance in a block is unusual or not. The averaging estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is outperformed by the three trimmed estimators when dealing with large jumps.

For the heavy-tailed \(t_5\)-distribution, the results are found in Table 6. The MOSUM-based procedure yields slightly better results for large jump heights, i.e., \(h>2\), since the adaptively trimmed estimators overestimate the number of contaminated blocks in many cases. For small jumps, the adaptively trimmed procedures yield smaller MSE values. The three trimmed estimators perform better than \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) in every case.

The results for the data generated by the AR(1) model with \(\phi = 0.5\) are displayed in Table 7. The MOSUM-based procedure trims considerably more blocks than the adaptively trimmed estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) yielding higher MSE values. Moreover, the averaging estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) is outperformed by the adaptively trimmed estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) in every case.

We conclude that the estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) perform better when dealing with large jumps and normally distributed data. For small jumps less contaminated blocks are trimmed by the adaptively trimmed estimators resulting in a higher MSE values in many scenarios. In this case the MOSUM-based estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{mos}}\) yields the best results which are similar to those of \(\widehat{\sigma ^2}_{\mathrm{Mean}}\). For heavy-tailed data, the MOSUM-based procedure is preferable under large jumps but is inferior to the other trimmed estimates if the jumps are small. When dealing with positively correlated processes the adaptively trimmed estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) perform best. Therefore, the choice of the most appropriate estimator depends on the knowledge about the underlying distribution and the height of the jumps.

Table 5 Average number of trimmed blocks (\(\overline{\widehat{K}}\)) and resulting MSE for normally distributed data, \(N = 1000\) and different values of K and h
Table 6 Average number of trimmed blocks (\(\overline{\widehat{K}}\)) and resulting MSE for \(t_5\)-distributed data, \(N = 1000\) and different values of K and h
Table 7 Average number of trimmed blocks (\(\overline{\widehat{K}}\)) and resulting MSE for data generated from the AR(1) model with \(\phi = 0.5\) and \(N=1000\) for different values of K and h

In the simulation study in Sect. 4, we concentrate on the adaptively trimmed estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) since they yield very good results in many scenarios, and on the averaging estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) since it performs well when dealing with a few small jumps and independent data.

4 Simulations

In this section we compare the estimators \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\) with the block size \(n= 20\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) with the block size \(n= 20\) and \(\beta = 0.05\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) with the block size \(n= 40\) and \(\beta = 0.05\), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) in different scenarios. We generate 1000 sequences of observations of length \(N\in \{200,1000,2500\}\) from the standard normal and the \(t_5\) distribution. We add K jumps of heights \(h \in \{0, 2, 3, 5, 8\}\) to the data at randomly chosen positions as was done in Sect. 2.2. K is chosen dependent on the number of observations, i.e., \(K = p\cdot N\) with \(p\in \{0, 2/1000, 4/1000, 6/1000, 10/1000\}\).

Table 8 shows the simulated MSE for the normal distribution. The estimators \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) and \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) yield similar results. We conclude that the estimation of the number of jumps K [required in the rule (8)] does not have a large effect on the estimator. The estimators \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) and \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) yield, the best results if the jump heights are not very large, i.e., \(h\le 2\). However, the MSE of taking the ordinary average is much larger than that of the other estimators if the jump heights are large. Large jumps result in large blockwise estimates, which have a strong impact on the ordinary average.

The trimmed estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) yields the best results among all methods considered here for normally distributed data if the jumps are rather high. When many small level shifts are present \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) outperforms \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), although the latter makes use of the exact normality assumption. The estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) tends to remove more block-estimates than \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) in the absence of level shifts, see Tables 3 and 4. Therefore, we also expect that more blocks are trimmed away by \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) if level shifts are present, reducing the risk of including perturbed blocks in the trimmed estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\).

The trimmed estimator \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\) with a fixed trimming fraction also yields good results. However, this estimation procedure requires the knowledge of the underlying distribution to compute the finite sample correction factor, see Sect. 3.1. The difference-based estimator \(\widehat{\sigma ^2}_{\mathrm{Diff} }\) performs well as long as the jumps are moderately high.

Table 9 shows the results for the \(t_5\) distribution. The estimation procedures \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) yield the best results in this scenario.

In Table 10 the simulated MSE is presented when the data is generated from the autoregressive (AR) model with \(\phi = 0.5\), i.e., the data is positively correlated. The performance of the difference-based estimator worsens considerably then. This is due to the fact that this estimation procedure makes explicit use of the assumption of uncorrelatedness. While \(\widehat{\sigma ^2}_{\mathrm{Diff} }\) underestimates the true variance drastically (resulting in a high MSE value) when no changes in the mean are present, the performance seems to improve slightly when dealing with many high jumps. This can be explained by the fact that the positive bias, which arises from the jumps, compensates for the negative bias which arises from the (incorrect) assumption of uncorrelatedness. The blocks-estimator \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) exhibits a similar behaviour, since the block size is small when the number of jumps is high, while correlated data require large block sizes to ensure satisfying results. For dependent data the best results are obtained when using the adaptively trimmed estimators. Since the variance \(\sigma ^2\) is underestimated when the data are dependent, the values \(\widehat{T}_j\) in (15) and (17) get larger resulting in a higher trimming parameter \(\alpha _{\mathrm{adapt}}\). Therefore, more blocks are trimmed away ensuring that the perturbed ones are not involved in the calculation of the overall estimate.

Figure 3 shows the simulated MSE dependent on the AR-parameter \(\phi \in \{0.1,\ldots ,0.8\}\) in four different scenarios: no jumps, few small jumps, many small jumps, and many high jumps. We observe that the performance of all estimators \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) worsens when the strength of the correlation (expressed by the parameter \(\phi \)) grows. When no changes in the mean are present [see Panel (a)] the estimators underestimate the variance more as \(\phi \) gets larger. On the other hand, large level shifts result in a positive bias of the estimators. Therefore, when many high jumps are present [Panel (d)] the results are better than in the case of many small jumps [Panel (c)]. This is more obvious for the estimators \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\) and \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\). The trimmed estimation procedures do not suffer much from the increasing strength of correlation in our example. The averaging approach \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) yields good results if the dependence is rather weak and if only few small jumps are present. The performance of the weighted average \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) worsens drastically when many high jumps are present. During this estimation procedure, the data are segregated into few blocks at estimated change-point locations. The approach is highly biased in the case of many and high level shifts if the number of change-points is underestimated. This is sometimes the case if two level shifts are not sufficiently far away from each other, see Eichinger and Kirch (2018).

Based on the simulation results we recommend using the averaging estimation procedure \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) if the data are not expected to be correlated or heavy-tailed and the jump heights are rather small. Otherwise, if either no information on the distribution, the dependence structure of the data and the jump heights is given or the jumps are expected to be rather large, the adaptively trimmed procedure \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) can be recommended.

Table 8 Simulated MSE\(\cdot 10^2\) of \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}} \), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) for normally distributed data and different sample sizes N, jump heights \(h\cdot \sigma \) and number of jumps \(K = p\cdot N\) with \(p\in \{0, 2/1000, 4/1000, 6/1000, 10/1000\}\)
Table 9 Simulated MSE\(\cdot 10^2\) of \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}} \), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) for \(t_5\)-distributed data and different sample sizes N, jump heights \(h\cdot \sigma \) and number of jumps \(K = p\cdot N\) with \(p\in \{0, 2/1000, 4/1000, 6/1000, 10/1000\}\)
Table 10 Simulated MSE\(\cdot 10^2\) of \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}} \), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) for data generated from the AR(1) model with \(\phi = 0.5\) and different sample sizes N, jump heights \(h\cdot \sigma \) and number of jumps \(K = p\cdot N\) with \(p\in \{0, 2/1000, 4/1000, 6/1000, 10/1000\}\)
Fig. 3
figure 3

Simulated MSE of \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) (—), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\) (- - -), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) (\({\cdot \cdot \cdot }\)) , \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\) (\({{-} \cdot {-}}\)), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\) (– – – ), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) with bandwidth \(G = \sqrt{N}\) ( ) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) with bandwidth \(G = \sqrt{N}\) ( ) for a \(K =0, h = 0\), b \(K = 2, h = 2 \), c \(K = 10, h = 2 \) and d \(K = 10, h = 8 \) with \(N=1000\), \(Y_t = X_t + \sum _{k = 1}^{K} h I_{t \ge t_k}\), where \(X_t \) originates from the AR(1)-process with parameter \(\phi \in \{0.1,0.2,\ldots ,0.8\}\)

5 Application

In this section, we apply the blocks-approach to two datasets in order to estimate the variance.

5.1 Nile river flow data

The first dataset contains the widely discussed Nile river flow records in Aswan from 1871 to 1984, see e.g. Hassan (1981), Hipel and McLeod (1994), Syvitski and Saito (2007), among many others. We consider the \(N=114\) annual maxima of the average monthly discharge in \({\text {m}}^3 / {\text {s}}\), since these values are often assumed to be independent in hydrology. The maxima are determined from January to December. The flooding season is from July to September, see Hassan (1981). Figure 4 shows the annual maxima of the average monthly discharge of the Nile river for the years 1871–1984.

Fig. 4
figure 4

Maximal monthly discharge of the Nile river at Aswan in the period 1871–1984

The construction of the two Aswan dams in 1902 and from 1960 to 1969 obviously caused changes in the river flow, see Hassan (1981) and Hipel and McLeod (1994). We used Levene’s test [see Section 12.4.2 in Fox (2015)] to check the three segments of the data (divided by the years 1902 and 1960) for equality of variances. The null hypothesis of equal variances was not rejected with a p value of \(p = 0.40\).

A Q–Q plot of the data indicates that the deviation from a normal distribution is not large, see Fig. 11 in “Appendix”. With \(\beta = 0.05\) and \(n=10\) (\(m = 11\) blocks) no blocks are trimmed away during the trimming procedures \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\). The ordinary sample variance of the entire data yields the value 3,243,866, see Table 11. For the blocks-estimator of the variance from (2) we choose the block size according to (8) with \(K = 2\) getting \(n = \lfloor \sqrt{114} / 3 \rfloor = 3\). All blockwise estimators examined in this paper yield much smaller variance estimates for the whole observation period, ranging from 2,075,819 to 2,684,368.

We conclude that the procedures \(\widehat{\sigma ^2}_{\mathrm{Mean} }\), \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) perform better than the ordinary sample variance, since the estimated values on the whole dataset are similar to those for the period 1903–1960 in between the changes.

Table 11 Rounded estimates (\(\cdot 10^{-3}\)) of the variance for the annual maxima of the average monthly discharge of the Nile river in Aswan

5.2 PAMONO data

In the second example, we use data obtained from the PAMONO (Plasmon Assisted Microscopy of Nano-Size Objects) biosensor, see Siedhoff et al. (2014). This technique is used for detection of small particles, e.g. viruses, in a sample fluid. For more details, see Siedhoff et al. (2011). PAMONO data sets are sequences of grayscale images. A particle adhesion causes a sustained local intensity change. This results in an obvious level shift in the time series of grayscale values for each corresponding pixel coordinate. To the best of our knowledge, a change of the variance after a jump in the mean is not expected to occur. A Q–Q plot of the data indicates that the assumption of a normal distribution is reasonable, see Fig. 12 in “Appendix”.

In Panel (a) of Fig. 5 we see a time series corresponding to one pixel which exhibits a virus adhesion, therefore revealing several level shifts in the mean of the time series. \(N = 1000\) observations are available. Panel (b) of Fig. 5 shows a boxplot of 101,070 values of the ordinary sample variance for time series which correspond to pixels without virus adhesion.

Fig. 5
figure 5

a Intensity over time for one pixel and b a boxplot of variances for the virus-free pixels together with the ordinary sample variance of the above data ( ) and values of \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) and \(\widehat{\sigma ^2}_{\mathrm{Diff} }\) (—), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\) (- - -), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\) (\({\cdot \cdot \cdot \, \cdot }\)), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}} \) (- \({\cdot }\) -), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) (– – –) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) (– - –)

Since changes in the mean are not expected there, we use these data to get some insight into the typical value range of the variance. The sample variance of the contaminated data (upper panel) is \(1.59\times 10^{-4}\) which is not within the typical range of values, since it exceeds the upper whisker of the boxplot. The other estimation procedures discussed in this paper yield values within the interval \([1.1 \times 10^{-4},\, 1.2\times 10^{-4}]\) which are well within the interquartile range. We conclude that these approaches yield reasonable estimates for these data.

5.3 PAMONO data with trend

Again, we consider a PAMONO dataset, see Sect. 5.2. Panel (a) of Fig. 6 shows a time series corresponding to a pixel, which seems to exhibit a virus adhesion as well as a linear trend. Panel (b) of Fig. 6 shows the differenced data, i.e., \(Y_t - Y_{t-1},\, t = 2,\ldots ,388\).

Fig. 6
figure 6

a Intensity over time for one pixel and b corresponding differenced series

The differences of first order appear to be independent and scattered around a fixed mean. Few large differences can be observed which presumably originate from the jumps in the mean at the corresponding time points. The existence of the trend could be explained by the fact that the surface, on which the fluid for virus adhesion is placed, was heated up over time. \(N = 388\) observations are available. We apply the estimation procedures \(\widehat{\sigma ^2}_{\mathrm{Mean} }\) [using \(K \in \{1,\ldots , 5\}\) in the formula (8)], \(\widehat{\sigma ^2}_{\mathrm{Diff} }\), \(\widehat{\sigma ^2}_{\mathrm{Tr}, 0.5}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}}\), \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}}\), \(\widehat{\sigma ^2}_{\mathrm{Mean} }^{\mathrm{mosum}}\) and \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}}\) to the data and get estimated values for the variance, which range from \(\widehat{\sigma ^2}_{\mathrm{Mean} } = 0.95 \times 10^{-6}\) (with \(K = 5\)) to \(\widehat{\sigma ^2}_{\mathrm{W} }^{\mathrm{mosum}} = 1.66 \times 10^{-6}\). The empirical variance of the observations has the value \(26.48 \times 10^{-6}\), which is much larger than the other estimates. According to our experience the PAMONO data can be assumed to be uncorrelated after differencing. The sample variance of differenced data is \(1.93 \times 10^{-6}\), which is an estimator of \(2\sigma ^2\), yielding the value \(0.97 \times 10^{-6}\) as an estimate for \(\sigma ^2\), which is near the estimated value of \(\widehat{\sigma ^2}_{\mathrm{Mean} }\).

We conclude that the proposed procedures yield reasonable results even in this situation, where a linear trend is present.

6 Conclusion

In the presence of level shifts, ordinary variance estimators like the empirical variance perform poorly. In this paper, we considered several estimation procedures in order to account for possible changes in the mean.

Estimation of \(\sigma ^2\) based on pairwise differences is popular in nonparametric regression and works well in the presence of level shifts and an unknown error distribution if the data are independent and the fraction of shifts is asymptotically negligible. However, we have identified scenarios where estimation based on longer blocks is to be preferred.

If only a few small level shifts are expected in a long sequence of observations our recommendation is to use the mean of the blocks-variances \(\widehat{\sigma ^2}_{\mathrm{Mean}}\). This estimation procedure does not require knowledge of the underlying distribution, performs well in the aforementioned situation and is asymptotically even as efficient as the ordinary sample variance if there are no level shifts.

If many or large level shifts are expected to occur we recommend using the adaptive trimmed estimators \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{norm}} \) and \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}} \). These procedures are constructed for independent data and use either the exact \(\chi ^2\)-distribution or the asymptotic normal distribution of the blockwise estimates, where the second and the fourth moments need to be estimated. We have found these trimming approaches to work reasonably well even under moderate autocorrelations, although many blocks are trimmed away then, presumably due to the underestimation of the unknown variance in the formula (17). Therefore, when no changes in the mean are present the trimmed estimators suffer efficiency loss. On the other hand, we expect that many perturbed blocks are trimmed away in the presence of level shifts reducing the bias of the estimator. The trimming approach could be extended to dependent data in future work.

In many applications we rather wish to estimate the standard deviation \(\sigma \), e.g. for standardization. If only few jumps of moderate heights are expected to occur, either the average value of the blockwise standard deviations or the square root of the blocks-variance estimator \(\widehat{\sigma ^2}_{\mathrm{Mean}}\) can be used. Otherwise, the square root of the trimmed estimator \(\widehat{\sigma ^2}_{\mathrm{Tr,ad}}^{\mathrm{other}} \) can be recommended. For a large sample size N the finite sample correction factors can be neglected with little loss, see “Appendix”.

An interesting extension will be to consider situations where not only the level but also the variability of the data can change. Suitable approaches for such scenarios might be constructed by combining the ideas discussed here with those presented by Wornowizki et al. (2017), where tests for changes in variability have been investigated using blockwise approaches, assuming a constant mean. This will be an issue for future work.