1 Motivation

The importance of plotting the data as a first step of a statistical analysis is stressed in numerous textbooks (e.g., Ruppert and Matteson 2015; Brockwell and Davis 2016). For instance, Brockwell and Davis (2016, p. 12) recommend time series plots to check whether there are ‘any apparent sharp changes in behavior’. If such a change is present in the data, yet is ignored in the subsequent analysis, the conclusions drawn from the data may be invalid (see, e.g., Baltagi et al. 2013; Demetrescu and Hanck 2013; Harvey et al. 2013; Xu 2015). For instance, Xu (2015) shows that if a structural break in the error variances is ignored, standard tests for the constancy of regression coefficients suffer from size distortions—even asymptotically. To avoid such misleading test results, applying a formal structural break test is typically recommended as a pre-step to the actual statistical analysis of the data.

However, one problem with the recommendation to look for ‘any apparent sharp changes in behavior’ is that the decision to apply the structural break test has been informed by the data. Hence, any change point test, if it is to be valid, needs to hold size conditional on having looked at the data. Of course, such tests are typically constructed to hold size unconditionally and, hence, suffer from size distortions if applied conditionally. The first main aim of this paper is to quantify these size distortions for structural break tests that are applied conditional on large deviations being observed in the data. We show that these size distortions can become so large that a true null is rejected with certainty.

This changes when one moves from a structural break context, where all data are available in advance, to a monitoring context, where—after having observed some training data—the data become available ‘as you go’ for sequential tests of parameter stability. Of course, ex ante it may be unclear precisely which parameters to monitor for constancy. One possibility may be to monitor a parameter whose estimates have fluctuated somewhat in the training data. Then, similarly to the above, the monitoring procedure needs to hold size conditionally on large fluctuations in the training data. The second main aim of this paper is to show that, unlike for one-shot structural break tests, this is indeed the case.

We mention that the issue of investigating the data to ‘decide’ which hypotheses to test is an old one. Selvin and Stuart (1966) call this ‘hunting’, because the investigator hunts for hypotheses to be tested based on the data. They illustrate the practice with Pearson’s \(\chi ^2\) goodness-of-fit (GoF) test. In applications, a possible distribution for the data is usually not chosen ex ante, but ex post by eye-balling a histogram. This practice is to some extent unavoidable, as in our structural break example. Selvin and Stuart (1966) conclude with regard to hunting that ‘the only criticism to be made is of the delusion that one has to pay no price for the sport.’ Interestingly, while in general the bias incurred by hunting—as in the GoF example—is hard to quantify, it is possible for monitoring procedures and (up to a single unknown parameter) also for structural break tests.

Even outside the statistical literature, testing a hypothesis on the same data that inspired it is known to be harmful. In the social sciences, Kerr (1998) calls this HARKing (Hypothesizing After the Results are Known) and defines it as presenting a post hoc hypothesis (i.e., one informed by the collected data) as an a priori hypothesis. Among others, Simmons et al. (2011) and Gelman and Loken (2014) point out that data analyses in the social sciences are often driven by the observed data, and show that this contributes—among other factors—to the prevalence of false positives in published work (a finding that has contributed to the so-called replication crisis). This is as in our one-shot structural break setting, where the data-inspired hypothesis (‘there is a structural break in the data’) is more likely to be accepted even when wrong, producing a false positive.

The remainder of the paper proceeds as follows. Section 2 states and discusses the main theoretical results of this paper. The proofs of these results are deferred to the Appendix. An empirical application to Bitcoin returns in Sect. 3 demonstrates that if the decision to apply a structural break test is made conditional on the data, different results may be obtained when this fact is taken into account. The final Sect. 4 concludes.

2 Main results

2.1 Structural break tests

Let \(X_1,\ldots ,X_n\) denote the (possibly multivariate) observations to be tested for a structural break in some scalar parameter \(\gamma _i\) of their respective distribution function \(F_i\) (\(i=1,\ldots ,n\)). For instance, \(\gamma _i\) may be the mean or the variance of a component series, or it may denote the correlation or tail dependence coefficient of two components of the series. Structural break tests for these parameters are well-established (see, e.g., Inclan and Tiao 1994; Vogelsang 1998; Wied et al. 2012; Hoga 2018). Interest in change point tests focuses on the null hypothesis

$$\begin{aligned} {\mathcal {H}}_0^{S}\ :\ \gamma =\gamma _1=\cdots =\gamma _n, \end{aligned}$$

i.e., the constancy of the parameter \(\gamma \) over time.

Let \({\widehat{\gamma }}(0,t)\) denote an estimate of \(\gamma \) based on the subsample \(X_1,\ldots ,X_{\lfloor nt\rfloor }\). Here, \(\lfloor \cdot \rfloor \) rounds down to the nearest integer. Further, let D[0, 1] denote the space of real-valued functions on [0, 1] that possess left-hand limits and are right-continuous (Davidson 1994). We make the following high-level assumption under \({\mathcal {H}}_0^{S}\):

Assumption 1

It holds that, as \(n\rightarrow \infty \),

$$\begin{aligned} t\sqrt{k_n}\left[ {\widehat{\gamma }}(0,t)-\gamma \right] \overset{d}{\longrightarrow }\sigma W(t)\qquad \text {in }D[0,1], \end{aligned}$$

where \(\sigma >0\), \(k_n\rightarrow \infty \), and \(\{W(t)\}_{t\in [0,1]}\) denotes a standard Brownian motion.

Such a functional central limit theorem has been shown to hold for many parameters. For the leading case \(k_n=n\), it holds (e.g.) for the mean (Davidson 1994), the variance (Wied et al. 2012), correlation (Wied et al. 2012) and Kendall’s tau (Dehling et al. 2017). For extreme value quantities, whose estimators typically depend only on a vanishing fraction of the sample, a scaling different from \(\sqrt{n}\) is required in Assumption 1. For instance, Assumption 1 holds for the tail index (Hoga 2017a), an extreme quantile estimator (Hoga 2017b) and the tail dependence coefficient (Hoga 2018) for some \(k_n=o(n)\). For feasible break testing, the nuisance parameter \(\sigma \) in Assumption 1 needs to be estimated consistently. To that end, we impose the following assumption under \({\mathcal {H}}_0^{S}\):

Assumption 2

There exists an estimator \({\widehat{\sigma }}\) satisfying \({\widehat{\sigma }}\overset{p}{\longrightarrow }\sigma \), as \(n\rightarrow \infty \).

The usual recursive test statistic for testing \({\mathcal {H}}_0^{S}\) is

$$\begin{aligned} T_n=\frac{1}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left| t\sqrt{k_n}\left[ {\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)\right] \right| \overset{d}{\longrightarrow }\sup _{t\in [0,1]}\left| W(t)-tW(1)\right| ,\qquad n\rightarrow \infty , \end{aligned}$$
(1)

where the convergence follows from Assumptions 1 and 2 and the continuous mapping theorem together with Slutsky’s theorem (Davidson 1994). Here, fluctuations in the recursive parameter estimates \({\widehat{\gamma }}(0,t)\) that deviate ‘too much’ from the full-sample estimate \({\widehat{\gamma }}(0,1)\) are taken as evidence against the null. Based on the \((1-\alpha )\)-quantile \(c_{\alpha }^{S}\) of the limiting distribution in (1), we reject \({\mathcal {H}}_0^{S}\) at significance level \(\alpha \in (0,1)\) if \(T_n>c_{\alpha }^{S}\), since from (1)

$$\begin{aligned} \Pr \left\{ T_n>c_{\alpha }^{S}\right\} \longrightarrow \alpha ,\qquad n\rightarrow \infty . \end{aligned}$$
(2)
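The limit in (1) is the supremum of the absolute value of a Brownian bridge, i.e., it follows the Kolmogorov distribution, so the critical values \(c_{\alpha }^{S}\) are readily available numerically. A minimal sketch (assuming Python with SciPy, whose kstwobign distribution implements exactly this limit law):

```python
# Minimal sketch: the limit in (1) is sup_t |W(t) - t W(1)|, the supremum of the
# absolute value of a Brownian bridge, whose law is the Kolmogorov distribution.
from scipy.stats import kstwobign

for alpha in (0.10, 0.05, 0.01):
    c_alpha_S = kstwobign.isf(alpha)      # (1 - alpha)-quantile of the limit in (1)
    print(f"alpha = {alpha:.2f}  ->  c_alpha^S = {c_alpha_S:.3f}")
# Approximate output: 1.224, 1.358, 1.628
```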

However, often the decision to apply a structural break test (usually to validate the intended subsequent statistical analysis) is not made before the data have been collected. Rather, as pointed out in the Motivation, it is made afterwards based on having observed some large deviations in the series. This eye-balling may be formalized as testing if and only if \(t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|/{\widehat{\sigma }}>\delta /\sqrt{k_n}\) for some \(t\in [0,1]\), where the ‘if and only if’-part of course constitutes a crude approximation. Here, the pre-factor t discounts a large deviation that is based on very few data points for small t; the inclusion of \(\sqrt{k_n}\) reflects smaller expected fluctuations in larger samples; \({\widehat{\sigma }}^2\) estimates the asymptotic variance of \({\widehat{\gamma }}(0,1)\) and, hence, reflects estimation uncertainty; finally, \(\delta >0\) determines the (unknown) sensitivity of the visual inspection. In other words, \(\delta \) is the parameter for which eye-balling for changes can be best approximated by the conditioning event \(\{\sup _{t\in [0,1]}{t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|}/{\widehat{\sigma }}>\delta /\sqrt{k_n}\}\).

Of course, this approximation of the eye-balling heuristic may not be perfect, but we argue that it is close enough to be interesting. For instance, plots of \(t\mapsto {\widehat{\gamma }}(0,t)\) are often used as a diagnostic tool indicating structural change; see, e.g., Quintos et al. (2001, Fig. 3) or Wied et al. (2012, Fig. 1). For obvious reasons, we call the above procedure ‘formalized’ eye-balling.

Now, carrying out the structural break test if and only if \(t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|/{\widehat{\sigma }}>\delta /\sqrt{k_n}\) for some \(t\in [0,1]\), the probability that should be controlled is

$$\begin{aligned} \Pr \left\{ T_n>c_{\alpha }^{S}\ \Bigg \vert \ \frac{1}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left\{ t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\right\} >\delta /\sqrt{k_n}\right\} , \end{aligned}$$
(3)

and not \(\Pr \left\{ T_n>c_{\alpha }^{S}\right\} \) as in (2). Of course, the conditioning event in the above probability may also be written as \(\{T_n>\delta \}\). We prefer to write it as in (3) to emphasize the ‘formalized’ eye-balling rationale of the conditioning.

Remark 1

To appreciate the difference between the conditional and unconditional approach, consider trajectories \(X_{1}^{(b)},\ldots ,X_{n}^{(b)}\) (\(b=1,\ldots ,B\)) for which (2) holds under \({\mathcal {H}}_0^{S}\). Then, for sufficiently large n, one expects to reject \({\mathcal {H}}_{0}^{S}\) in the unconditional procedure for \(B\alpha \) of these trajectories. In contrast, the conditional approach considers only those, say k, trajectories \(\{X_{1}^{(b_i)},\ldots ,X_{n}^{(b_i)}\}_{i=1,\ldots ,k}\) with visually apparent indications of change, i.e., those trajectories satisfying \(\sup _{t\in [0,1]}{\{t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\}}/{\widehat{\sigma }}>\delta /\sqrt{k_n}\). If, among these k trajectories, one rejects \({\mathcal {H}}_{0}^{S}\) whenever the test statistic exceeds \(c_{\alpha }^{S}\), one cannot expect rejections in (the desired number of) \(k\alpha \) cases; instead, the number of rejections is much higher, since only trajectories with large fluctuations are considered in the first place.
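The following Monte Carlo sketch (in Python) makes this concrete for the simplest case of a mean change with \(k_n=n\) and i.i.d. standard normal data; the sample size, number of replications and the threshold \(\delta \) are illustrative choices, not values from the paper, and the limiting value of the conditional rejection rate is the one derived in Theorem 1 below.

```python
import numpy as np
from scipy.stats import kstwobign

rng = np.random.default_rng(0)
n, B, alpha, delta = 500, 20000, 0.05, 1.073   # illustrative choices
c_alpha_S = kstwobign.isf(alpha)               # approx. 1.358

T = np.empty(B)
for b in range(B):
    x = rng.standard_normal(n)                 # H_0^S holds: constant mean
    csum = np.cumsum(x)
    k = np.arange(1, n + 1)
    # t * sqrt(n) * |gamma_hat(0, t) - gamma_hat(0, 1)| / sigma_hat on the grid t = k / n
    stat = np.sqrt(n) * np.abs(csum / n - (k / n) * csum[-1] / n) / x.std(ddof=1)
    T[b] = stat.max()

inspired = T > delta                           # trajectories that 'inspire' a test
print("unconditional rejection rate:", np.mean(T > c_alpha_S))            # approx. alpha
print("conditional rejection rate  :", np.mean(T[inspired] > c_alpha_S))  # much larger
print("limit from Theorem 1 below  :", alpha / kstwobign.sf(delta))       # alpha / f(delta)
```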

Formally, the asymptotic rejection probability is given by the following

Theorem 1

Suppose Assumptions 1 and 2 hold. Then, under \({\mathcal {H}}_0^{S}\), as \(n\rightarrow \infty \),

$$\begin{aligned} \Pr \left\{ T_n>c_{\alpha }^{S}\ \Bigg \vert \ \frac{1}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left\{ t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\right\} >\delta /\sqrt{k_n}\right\} \longrightarrow {\left\{ \begin{array}{ll}1,&{} \delta >c_{\alpha }^{S},\\ \frac{\alpha }{f(\delta )},&{} \delta \le c_{\alpha }^{S}, \end{array}\right. } \end{aligned}$$
(4)

where \(f(x)=2\sum _{j=1}^{\infty }(-1)^{j-1}\exp \{-2 j^2 x^2\}=\Pr \left\{ \sup _{t\in [0,1]}|W(t)-tW(1)|\ge x\right\} \in [0,1]\).

The proof of Theorem 1 is an almost trivial application of the elementary conditional probability formula. The main insight afforded by Theorem 1 is that the textbook recommendation to look for ‘any apparent sharp changes in behavior’ (Brockwell and Davis 2016, p. 12) should come with a warning that, if the formal structural break test is applied as usual (i.e., with the same critical value \(c_{\alpha }^{S}\) suggested by the unconditional test), it rejects the null more often than it should. Moreover, the influence of the conditioning can be quantified up to the parameter \(\delta \). This is in contrast to the example of GoF tests mentioned in the Motivation. There, the act of looking at a histogram and spotting a resemblance with some parametric distribution is much harder to formalize and, hence, the impact on a subsequent GoF test is much harder to quantify.

Specifically, Theorem 1 formalizes the intuition that if there is preliminary (‘formalized’ eye-balling) evidence in the data that a null hypothesis is false, then that null is more likely to be rejected, because the other cases—where there is no preliminary evidence—are not considered. Since the conditioning event in (4) may be equivalently written as \(\{T_n>\delta \}\), it is clear that, in case \(\delta >c_{\alpha }^{S}\), the true null is rejected not only asymptotically with probability one, but even almost surely in finite samples. Even for a less stringent testing condition with \(\delta \le c_{\alpha }^{S}\) the conditional (asymptotic) rejection probability under the null, \(\alpha /f(\delta )\), is larger than \(\alpha \). Only when \(\delta =0\), i.e., when there is no conditioning, is the desired type I error rate of \(\alpha \) attained.

Nonetheless, if \(\delta \) is known (which is typically not the case), there is a way to maintain a desired significance level \(\alpha \) even in the conditional test. For a fixed \(\delta >0\), one simply chooses \(\alpha ^{*}=\alpha f(\delta )\in (0,\alpha )\). Then, by (4),

$$\begin{aligned} \Pr \left\{ T_n>c_{\alpha ^{*}}^{S}\ \Bigg \vert \ \frac{1}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left\{ t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\right\} >\delta /\sqrt{k_n}\right\} \longrightarrow \alpha ^{*}/f(\delta )=\alpha ,\qquad n\rightarrow \infty . \end{aligned}$$

This means that when the test is only applied if the data provide preliminary evidence for a structural break, the resulting bias can be corrected by suitably lowering the significance level used to compute the critical value. In spirit, this is similar to the usual pre-testing strategy of (suitably) lowering the significance level to avoid inflating the type I error (Giles and Giles 1993).
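In code, the correction amounts to a one-line adjustment of the significance level. A minimal sketch (assuming SciPy, whose kstwobign.sf plays the role of \(f\)):

```python
from scipy.stats import kstwobign

def adjusted_significance(alpha, delta):
    """Return alpha* = alpha * f(delta) and the corrected critical value c_{alpha*}^S.

    f(x) = Pr{sup_t |W(t) - t W(1)| >= x} = 2 * sum_{j>=1} (-1)^(j-1) exp(-2 j^2 x^2),
    i.e., the survival function of the Kolmogorov distribution (scipy.stats.kstwobign).
    """
    alpha_star = alpha * kstwobign.sf(delta)
    return alpha_star, kstwobign.isf(alpha_star)

# Example with the values used later in Sect. 3: alpha = 0.05, delta = 1.073
print(adjusted_significance(0.05, 1.073))   # approx. (0.01, 1.628)
```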

Remark 2

Theorem 1 only investigates the behavior of the test under the null and shows it to be oversized. In line with the well-known tradeoff between type I and type II errors, this suggests that the test has higher power under the alternative. We refrain from formally investigating the (local) power of the test, as it does not hold size and, hence, is not a valid statistical test.

2.2 Monitoring parameter changes

Let \(X_1,\ldots ,X_n\) denote the variables in the training period that are available prior to monitoring. The goal in monitoring is to detect breaks in some parameter as new observations \(X_{n+1},X_{n+2},\ldots \) become available, and to do so as quickly as possible. Formally, interest in monitoring centers on sequentially testing

$$\begin{aligned} {\mathcal {H}}_0^{M}\ :\ \gamma =\gamma _{n+1}=\gamma _{n+2}=\cdots \end{aligned}$$

given the non-contamination assumption \(\gamma =\gamma _1=\cdots =\gamma _n\), which imposes structural stability in the training period. Of course, non-contamination can be tested using the methods of Sect. 2.1. We consider so-called closed-end procedures, where monitoring stops after \(X_{n+1},\ldots ,X_{\lfloor nT\rfloor }\) (\(1<T<\infty \)) have been observed.

Let \({\widehat{\gamma }}(a,b)\) denote a \(\gamma \)-estimate based on \(X_{\lfloor na\rfloor +1},\ldots ,X_{\lfloor nb\rfloor }\) (\(0\le a<b\le T\)). Define \(D^2[0,T]\) as the space of all \({\mathbb {R}}^2\)-valued functions on [0, T] that are right-continuous with left-hand limits in each component (Davidson 1994). The following condition is the analogue of Assumption 1 and has likewise been shown to hold for various estimators (see, e.g., Hoga and Wied 2017).

Assumption 3

It holds that, as \(n\rightarrow \infty \),

$$\begin{aligned} \sqrt{k_n}\begin{pmatrix} t[{\widehat{\gamma }}(0,t)-\gamma ]\\ t_0[{\widehat{\gamma }}(t,t+t_0)-\gamma ]\\ \end{pmatrix} \overset{d}{\longrightarrow } \sigma \begin{pmatrix} W(t)\\ W(t+t_0)-W(t) \end{pmatrix}\qquad \text {in }D^2[0,T-t_0], \end{aligned}$$

where \(\sigma >0\), \(t_0>0\), \(T>\max \{t_0,1\}\), and \(\{W(t)\}_{t\in [0,T]}\) denotes a standard Brownian motion.

We base monitoring on the moving-sum detector

$$\begin{aligned} M_n(t)=\frac{1}{{\widehat{\sigma }}}\left| t_0\sqrt{k_n}[{\widehat{\gamma }}(t,t+t_0)-{\widehat{\gamma }}(0,1)]\right| , \qquad t\in [1,T-t_0], \end{aligned}$$

where the estimator \({\widehat{\sigma }}\) from Assumption 2 is typically calculated from the non-contaminated training data. The idea behind \(M_n(t)\) is that large deviations between \({\widehat{\gamma }}(t,t+t_0)\) and the non-contaminated estimate \({\widehat{\gamma }}(0,1)\) indicate a structural change in the monitoring period. Again by the continuous mapping theorem and Slutsky’s theorem, it follows under Assumptions 2 and 3 that

$$\begin{aligned} \sup _{t\in [1,T-t_0]}M_n(t)\overset{d}{\longrightarrow }\sup _{t\in [1,T-t_0]}\left| W(t+t_0)-W(t)-t_0W(1)\right| . \end{aligned}$$

Thus, we reject \({\mathcal {H}}_0^{M}\) at significance level \(\alpha \in (0,1)\) as soon as \(M_n(t)>c_{\alpha }^{M}\) for some \(t\in [1,T-t_0]\), where \(c_{\alpha }^{M}\) is implicitly defined by

$$\begin{aligned} \Pr \left\{ \sup _{t\in [1,T-t_0]}\left| W(t+t_0)-W(t)-t_0W(1)\right| >c_{\alpha }^{M}\right\} =\alpha . \end{aligned}$$

Hence, one controls the asymptotic probability

$$\begin{aligned} \Pr \left\{ \sup _{t\in [1,T-t_0]}M_n(t)>c_{\alpha }^{M}\right\} \longrightarrow \alpha ,\qquad n\rightarrow \infty . \end{aligned}$$
(5)
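Unlike \(c_{\alpha }^{S}\), the boundary \(c_{\alpha }^{M}\) depends on \(T\) and \(t_0\) and is most easily obtained by simulating the limit process. A minimal sketch in Python (the values of \(T\) and \(t_0\) are illustrative and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
T, t0, alpha = 2.0, 0.5, 0.05            # illustrative monitoring horizon and window
m, B = 1000, 20000                       # grid points per unit time, Monte Carlo size

sups = np.empty(B)
for b in range(B):
    # simulate a standard Brownian motion W on a grid of [0, T] with mesh 1/m
    W = np.concatenate(([0.0], np.cumsum(rng.standard_normal(int(T * m)) / np.sqrt(m))))
    idx = np.arange(m, int((T - t0) * m) + 1)        # grid indices of t in [1, T - t0]
    sups[b] = np.abs(W[idx + int(t0 * m)] - W[idx] - t0 * W[m]).max()

c_alpha_M = np.quantile(sups, 1 - alpha)             # simulated boundary in (5)
print("simulated c_alpha^M:", round(c_alpha_M, 3))
```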

When deciding which parameters to monitor, one may choose those whose subsample estimates have exhibited some variation in the training period. This again leads us to consider monitoring for change conditional on \(\sup _{t\in [0,1]}{\{t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\}}/{\widehat{\sigma }}>\delta /\sqrt{k_n}\) for some \(\delta >0\). As before, the parameter \(\delta \) is unknown, depending on the sensitivity of the visual inspection by the practitioner. When monitoring conditionally on having observed some noticeable variation in the training period, one should control

$$\begin{aligned} \Pr \left\{ \sup _{t\in [1,T-t_0]}M_n(t)>c_{\alpha }^{M}\ \Bigg \vert \ \frac{1}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left\{ t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\right\} >\delta /\sqrt{k_n}\right\} . \end{aligned}$$
(6)

If \(\delta \) could actively be chosen in practice, it should not be chosen too large (larger than some critical value \(c_{\alpha }^{S}\)), because this would already be evidence against \(\gamma _1=\cdots =\gamma _n\) (cf. (2)), violating the non-contamination assumption. However, the conditioning in (6) does not matter asymptotically for monitoring, as shown next.

Theorem 2

Suppose Assumptions 2 and 3 hold. Then, under \({\mathcal {H}}_0^{M}\), as \(n\rightarrow \infty \),

$$\begin{aligned} \Pr \left\{ \sup _{t\in [1,T-t_0]}M_n(t)>c_{\alpha }^{M}\ \Bigg \vert \ \frac{1}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left\{ t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\right\} >\delta /\sqrt{k_n}\right\} \longrightarrow \alpha . \end{aligned}$$
(7)

Theorem 2 shows that when the training data inspire a test of \({\mathcal {H}}_{0}^{M}\), the monitoring procedure can be applied as usual, i.e., with the same boundary \(c_{\alpha }^{M}\) as in (5). This result is reminiscent of the intuition that, while one cannot (without modification at least) use the same data that inspired a hypothesis to test it, one can simply wait for fresh (out-of-sample) data to verify it. The limit in (7) would trivially obtain if the two events in the probability were independent. Yet, this is not the case, as \({\widehat{\gamma }}(0,1)\) is common to both events, and the underlying \(X_t\)’s may be serially dependent across the training and monitoring period.

Remark 3

The conclusion of Theorem 2 does not depend on the type of detector used. For instance, using the expanding sum detector \(E_n(t)=\frac{1}{{\widehat{\sigma }}}|(t-1)\sqrt{k_n}[{\widehat{\gamma }}(1,t)-{\widehat{\gamma }}(0,1)]|\) would—under a suitable analogue of Assumption 3—lead to the same result. We omit details for brevity.

Remark 4

Reconsider the set-up of Remark 1. Suppose Assumptions 2 and 3 hold for the (continued) simulated trajectories \(X_{1}^{(b)},\ldots ,X_{n}^{(b)},X_{n+1}^{(b)},\ldots \) (\(b=1,\ldots ,B\)). Then, in sufficiently large samples, one expects to reject \({\mathcal {H}}_0^{M}\) based on unconditional testing for roughly a fraction \(\alpha \) of all trajectories; see (5). (This is as in the structural break setting of Sect. 2.1, where one also rejects \({\mathcal {H}}_0^{S}\) for a fraction \(\alpha \) of all trajectories.) If monitoring is done conditionally, then one expects to reject \({\mathcal {H}}_0^{M}\) for a fraction \(\alpha \) of the k trajectories satisfying the conditioning event. (This contrasts with the structural break tests, where—in the conditional setting—\({\mathcal {H}}_{0}^{S}\) is rejected for a fraction \(\alpha /f(\min \{\delta ,c_{\alpha }^{S}\})\) of said k trajectories.) Hence, while (5) and (7) imply the same asymptotic type I error rate, the trajectories tested are different—in unconditional monitoring all trajectories are considered, whereas conditional monitoring only considers those for which the condition is met.
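A Monte Carlo sketch paralleling the one after Remark 1 illustrates Remark 4 and Theorem 2 for mean monitoring with \(k_n=n\); again, all numerical choices are illustrative, and the data are i.i.d. standard normal, so that both \({\mathcal {H}}_0^{M}\) and the non-contamination assumption hold.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, t0, alpha, delta, B = 500, 2.0, 0.5, 0.05, 1.073, 5000   # illustrative choices
w, N = int(n * t0), int(n * T)                                 # window length, total sample size

# Step 1: approximate c_alpha^M by simulating the limit in (5), as in the sketch above.
m, reps = 1000, 20000
sups = np.empty(reps)
for b in range(reps):
    W = np.concatenate(([0.0], np.cumsum(rng.standard_normal(int(T * m)) / np.sqrt(m))))
    idx = np.arange(m, int((T - t0) * m) + 1)
    sups[b] = np.abs(W[idx + int(t0 * m)] - W[idx] - t0 * W[m]).max()
c_alpha_M = np.quantile(sups, 1 - alpha)

# Step 2: simulate trajectories under H_0^M and reject whenever sup_t M_n(t) > c_alpha^M.
T_train = np.empty(B)    # training-period CUSUM statistic (the conditioning variable)
M_sup = np.empty(B)      # supremum of the moving-sum detector over [1, T - t0]
for b in range(B):
    x = rng.standard_normal(N)
    sigma_hat, gamma_01 = x[:n].std(ddof=1), x[:n].mean()
    csum = np.concatenate(([0.0], np.cumsum(x)))
    k = np.arange(1, n + 1)
    T_train[b] = np.sqrt(n) * np.abs(csum[1:n + 1] / n - (k / n) * gamma_01).max() / sigma_hat
    j = np.arange(n, N - w + 1)                  # floor(n t) for t in [1, T - t0]
    window = (csum[j + w] - csum[j]) / n         # approximates t0 * gamma_hat(t, t + t0)
    M_sup[b] = np.sqrt(n) * np.abs(window - t0 * gamma_01).max() / sigma_hat

inspired = T_train > delta                       # conditioning event from (6)
print("unconditional rejection rate:", np.mean(M_sup > c_alpha_M))           # approx. alpha
print("conditional rejection rate  :", np.mean(M_sup[inspired] > c_alpha_M)) # approx. alpha, cf. (7)
```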

Remark 5

The result illustrated in Remark 4 for the different conditional rejection probabilities in structural break testing and monitoring has some analogy with the following example. Suppose a coin comes up heads 9 out of 10 times. This raises some suspicion about the fairness of the coin. The data-inspired hypothesis that the coin is unfair is then more likely to be accepted as true when tested on the same observations. (This is analogous to Theorem 1.) However, tossing the same coin another 10 times, the ‘unfair coin’ hypothesis—suggested by the first 10 throws—can be tested on fresh data as if no conditioning took place. (This is analogous to Theorem 2, where out-of-sample data become available for monitoring.) Of course, while fresh data is also used in testing \({\mathcal {H}}_0^{M}\), the monitoring situation is more complex than the coin flip example. First, data may be serially dependent in Theorem 2. Second, \({\widehat{\gamma }}(0,1)\) appears in both the detector \(M_n(t)\) and the conditioning event in (6).

3 Empirical application

Here, we illustrate the practical implications of Theorems 1 and 2. We do so using Bitcoin log-returns \(X_1,\ldots ,X_n\) from 01/01/2016 to 31/12/2019, giving \(n=1,461\) observations. Böhme et al. (2015) provide a comprehensive review of the crypto-currency. Due to the rising popularity of crypto-currencies, much research effort has been devoted to studying Bitcoin, which represents the largest market share among all crypto-currencies. For instance, Urquhart (2016) investigates the efficiency of the Bitcoin market using, among others, a classical Ljung–Box test. Most tests (including Ljung–Box tests) rely on the absence of structural breaks for their validity. However, as Bitcoins represent a new asset class, ex ante knowledge of the parameters that may change (e.g., mean, variance, higher order moments, tail index, autocorrelations, etc.) is hard to justify. Testing for change in all conceivable parameters invariably inflates type I errors due to multiple testing issues. So it seems natural to only test for breaks in parameters that have fluctuated noticeably in the data. Hence, in the following, we test for breaks only in parameters for which ‘formalized’ eye-balling—as described in Sect. 2.1—indicates a possible change.

We consider, by way of example, the first two moments as the most important parameters determining location and scale. On stretches of stationarity in the data, Assumptions 1–3 are then likely to be satisfied by the Bitcoin log-returns. This is because GARCH-type volatility models have been used successfully in modeling Bitcoin returns (Cheah and Fry 2015). Carrasco and Chen (2002) show that several GARCH models are mixing at rates that imply the functional central limit theorems required in Assumptions 1 and 3 (Herrndorf 1985). Likewise, the mixing conditions are sufficient for consistent estimation of the long-run variance \(\sigma ^2\) in Assumption 2 (De Jong and Davidson 2000).

Fix the significance level at \(\alpha =0.05\). In the ‘standard’ eye-balling method, one would plot the series of log-returns and look for preliminary evidence of parameter change. If such evidence is found, the series would then be subjected to a formal level-\(\alpha \) test, without reflecting in the test that the data themselves suggested the hypothesis. Alternatively, we use the ‘formalized’ eye-balling approach and assume that visually inspecting the time series for abrupt changes inspires a subsequent test if and only if \(\frac{\sqrt{n}}{{\widehat{\sigma }}}\sup _{t\in [0,1]}\left\{ t|{\widehat{\gamma }}(0,t)-{\widehat{\gamma }}(0,1)|\right\} \) exceeds \(\delta \). Of course, \(\delta \) is unknown in practice, but for illustrative purposes we assume here that it equals the 20%-critical value, i.e., \(\delta =c_{\alpha =0.2}^{S}=1.073\). Thus, we apply a test to one of the two parameters—first and second moments—if the corresponding value of \(T_n\) is larger than \(\delta \). Without any modification of the significance level \(\alpha \), this would yield a test of level \(\alpha /f(\delta )=\alpha /0.2=25\%\) by Theorem 1. By the same theorem, a (conditional) level-\(\alpha \) test is obtained, however, if we reject when \(T_n>c_{\alpha ^{*}}^{S}=1.628\) for \(\alpha ^{*}=\alpha f(\delta )=0.2\alpha =0.01\).

Figure 1 displays the price process and the log-returns of Bitcoin. The prices seem to indicate returns with positive mean until the peak on December 16, 2017, and a negative mean in the year thereafter. This strongly suggests the need for a more formal test of a constant mean. Define \(S_n^{M}(t)=\frac{\sqrt{n}}{{\widehat{\sigma }}^{M}}t|{\widehat{\gamma }}^{M}(0,t)-{\widehat{\gamma }}^{M}(0,1)|\), where \({\widehat{\gamma }}^{M}(0,t)\) denotes the sample mean of \(X_{1},\ldots ,X_{\lfloor nt\rfloor }\), and \({\widehat{\sigma }}^{M}\) is a HAC estimator with Bartlett kernel and bandwidth \(\lfloor \log n\rfloor \). As the test statistic \(T_n^{M}=\sup _{t\in [0,1]}S_n^{M}(t)=1.578\) is larger than the critical value \(c_{\alpha =0.05}^{S}=1.358\), the null of a constant mean is rejected. However, this test incurs a bias that is not accounted for, because the mean break hypothesis was suggested by the same data that were used for testing.
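For completeness, a minimal sketch of how \(S_n^{M}(t)\) (and, below, \(S_n^{V}(t)\)) could be computed in Python; the array `returns` of Bitcoin log-returns is assumed to be available, and the Bartlett-kernel weights follow one common Newey–West convention, which need not coincide in every detail with the HAC estimator behind the reported numbers.

```python
import numpy as np

def hac_bartlett(x, bandwidth):
    """Long-run variance estimate with Bartlett-kernel (Newey-West-type) weights."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    lrv = np.sum(x * x) / n                               # lag-0 autocovariance
    for j in range(1, bandwidth + 1):
        gamma_j = np.sum(x[j:] * x[:-j]) / n              # lag-j autocovariance
        lrv += 2.0 * (1.0 - j / (bandwidth + 1.0)) * gamma_j
    return lrv

def cusum_path(x):
    """S_n(t) on the grid t = k/n for the mean of x; apply to x**2 to obtain S_n^V."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma_hat = np.sqrt(hac_bartlett(x, bandwidth=int(np.log(n))))
    k = np.arange(1, n + 1)
    # t * |gamma_hat(0, t) - gamma_hat(0, 1)| on the grid t = k / n
    dev = np.abs(np.cumsum(x) / n - (k / n) * np.mean(x))
    return np.sqrt(n) * dev / sigma_hat

# Usage (with the hypothetical 'returns' array of Bitcoin log-returns):
# T_M = cusum_path(returns).max()        # mean-change statistic T_n^M
# T_V = cusum_path(returns ** 2).max()   # second-moment-change statistic T_n^V
```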

To account for this bias, we apply the ‘formalized’ eye-balling technique next. The test statistic for a mean change is larger than \(\delta \) (\(T_n^{M}=1.578>1.073=\delta \)). Conditional on this result, a level-\(\alpha \) test rejects if \(T_n^{M}>1.628\). Hence, we do not reject the constant mean hypothesis at a \(5\%\)-significance level using the conditional test.

The plot of \(t\mapsto S_n^{M}(t)\) in Fig. 1 graphically illustrates the conflicting results. The red dotted lines indicate the ‘incorrect’ critical value \(c_{\alpha =0.05}^{S}=1.358\) and the ‘correct’ \(c_{\alpha =0.01}^{S}=1.628\). While the ‘incorrect’ value is exceeded by \(S_n^{M}(t)\), the ‘correct’ value is not. The additional evidence required to exceed the ‘correct’ critical value can be seen as a compensation for the fact that the data themselves indicated the mean break hypothesis.

Fig. 1

From top to bottom: Bitcoin prices in US $, log-returns, plots of \(t\mapsto S_n^{M}(t)\) and \(t\mapsto S_n^{V}(t)\). The dashed red lines in the bottom two panels indicate the \(1\%\)- and \(5\%\)-critical values \(c_{\alpha =0.01}^{S}=1.628\) and \(c_{\alpha =0.05}^{S}=1.358\)

The valid conditional test did not reject the constant mean hypothesis. Nonetheless, the persistence in the price process observed in Fig. 1 indicates the possibility of mean changes. This suggests that it may be useful to monitor for breaks in the mean starting in 2020. For instance, in a risk management context, it is particularly important to detect downward breaks in the mean to avoid losses. The implication of Theorem 2 is that such a monitoring procedure can be applied as if the hypothesis had not been suggested by Fig. 1 in the first place.

As the conditional test did not reject the null of a constant mean, we may validly apply a test for a change in the variance by testing for the constancy of second moments. Define \(S_n^{V}(t)=\frac{\sqrt{n}}{{\widehat{\sigma }}^{V}}t|{\widehat{\gamma }}^{V}(0,t)-{\widehat{\gamma }}^{V}(0,1)|\), where \({\widehat{\gamma }}^{V}(0,t)\) denotes the sample mean of \(X_{1}^2,\ldots ,X_{\lfloor nt\rfloor }^2\), and \({\widehat{\sigma }}^{V}\) again denotes a HAC estimator (with Bartlett kernel and bandwidth \(\lfloor \log n\rfloor \)) for the asymptotic variance of \({\widehat{\gamma }}^{V}(0,1)\). The test statistic for a variance change is once more larger than \(\delta \) (\(T_n^{V}=\sup _{t\in [0,1]}S_n^{V}(t)=1.692>1.073=\delta \)). Conditional on this result, a level-\(\alpha \) test rejects the null of a constant variance, as \(T_n^{V}=1.692>1.628\). Of course, in this case the ‘naive’ test would also have led to a rejection (because \(T_n^{V}=1.692>1.358=c_{\alpha =0.05}^{S}\)). The plot of \(t\mapsto S_n^V(t)\) in Fig. 1 illustrates the result. It also allows us to date the (most prominent) break to around the first half of 2017, where the largest values of \(S_n^{V}(t)\) are attained. Of course, as the variance is not even constant in the training period, monitoring (using Theorem 2) should not be carried out, due to structural change contaminating the training period.

The evidence for a break in the variance of Bitcoin returns suggests that either statistical analyses (such as Ljung–Box tests used to assess market efficiency) need to be robust to unconditional heteroscedasticity or the analyses have to be restricted to break-free subsamples.

4 Conclusion

More often than not, hypotheses are generated by data. While, in general, fresh data is desirable to validly verify a hypothesis, in some applications hypotheses need to be tested on the same data that generated them (e.g., the structural break hypothesis for the Bitcoin returns from 2016 to 2019). In situations like these, the bias of having looked at the data before hypotheses are formulated can frequently not be corrected, e.g., in goodness-of-fit testing. We show in this note that a correction is theoretically possible for structural break tests if the critical value is suitably increased—with the increase depending on a single unknown constant. This provides one further reason for the use of large critical values or, equivalently, small significance levels, as has also been advocated elsewhere (Benjamin 2018). Furthermore, this shows that the textbook recommendation to visually inspect the data for breaks should carry a warning that subsequent formal structural break tests need to take into account that the break hypothesis was suggested by the data. By contrast, when monitoring parameters, ‘hypotheses’ generated from the training data can be tested without any correction.