Quantifying the data-dredging bias in structural break tests

Structural break tests are often applied as a pre-step to ensure the validity of subsequent statistical analyses. Without any a priori knowledge of the type of breaks to expect, eye-balling the data can indicate changes in some parameter, e.g., the mean. This, however, can distort the result of a structural break test for that parameter, because the data themselves suggested the hypothesis. In this paper, we formalize the eye-balling procedure and theoretically derive the implied size distortion of the structural break test. We also show that eye-balling a stretch of historical data for possible changes in a parameter does not invalidate the subsequent procedure that monitors for structural change in new incoming observations. An empirical application to Bitcoin returns shows that taking into account the data-dredging bias, which is incurred by looking at the data, can lead to different test decisions.


Motivation
The importance of plotting the data as a first step of a statistical analysis is stressed in numerous textbooks (e.g., Ruppert and Matteson 2015; Brockwell and Davis 2016). For instance, Brockwell and Davis (2016, p. 12) recommend time series plots to check whether there are 'any apparent sharp changes in behavior'. If such a change is present in the data, yet is ignored in the subsequent analysis, the conclusions drawn from the data may be invalid (see, e.g., Baltagi et al. 2013; Demetrescu and Hanck 2013; Har-
The author is grateful to two referees and Christoph Hanck for their valuable comments and suggestions.
The remainder of the paper proceeds as follows. Section 2 states and discusses the main theoretical results of this paper. The proofs of these results are deferred to the Appendix. An empirical application to Bitcoin returns in Sect. 3 demonstrates that if the decision to apply a structural break test is made conditional on the data, different results may be obtained when this fact is taken into account. The final Sect. 4 concludes.

Structural break tests
Let $X_1, \ldots, X_n$ denote the (possibly multivariate) observations to be tested for a structural break in some scalar parameter $\gamma_i$ of their respective distribution function $F_i$ ($i = 1, \ldots, n$). For instance, $\gamma_i$ may be the mean or the variance of a component series, or it may denote the correlation or tail dependence coefficient of two components of the series. Structural break tests for these parameters are well-established (see, e.g., Inclan and Tiao 1994; Vogelsang 1998; Hoga 2018). Interest in change point tests focuses on the null hypothesis $H_0^S\colon \gamma = \gamma_1 = \cdots = \gamma_n$, i.e., the constancy of the parameter $\gamma$ over time.
Let $\hat\gamma(0, t)$ denote an estimate of $\gamma$ based on the subsample $X_1, \ldots, X_{\lfloor nt \rfloor}$. Here, $\lfloor \cdot \rfloor$ rounds down to the nearest integer. Further, let $D[0, 1]$ denote the space of real-valued functions on $[0, 1]$ that possess left-hand limits and are right-continuous (Davidson 1994). We make the following high-level assumption under $H_0^S$:
Assumption 1 It holds that, as $n \to \infty$,
$$\sqrt{k_n}\, t\, [\hat\gamma(0, t) - \gamma] \overset{d}{\longrightarrow} \sigma W(t) \quad \text{in } D[0, 1],$$
where $\sigma > 0$, $k_n \to \infty$, and $\{W(t)\}_{t \in [0,1]}$ denotes a standard Brownian motion.
Such a functional central limit theorem has been shown to hold for many parameters. For the leading case $k_n = n$, it holds (e.g.) for the mean (Davidson 1994), the variance, the correlation, and Kendall's tau (Dehling et al. 2017). For extreme value quantities, whose estimators typically depend only on a vanishing fraction of the sample, a scaling different from $\sqrt{n}$ is required in Assumption 1. For instance, Assumption 1 holds for the tail index (Hoga 2017a), an extreme quantile estimator (Hoga 2017b) and the tail dependence coefficient (Hoga 2018) for some $k_n = o(n)$. For feasible break testing, the nuisance parameter $\sigma$ in Assumption 1 needs to be estimated consistently. To that end, we impose the following assumption under $H_0^S$:
Assumption 2 There exists an estimator $\hat\sigma$ satisfying $\hat\sigma \overset{p}{\longrightarrow} \sigma$, as $n \to \infty$.

Y. Hoga
The usual recursive test statistic for testing $H_0^S$ is
$$T_n := \sup_{t \in [0,1]} \frac{\sqrt{k_n}}{\hat\sigma}\, t\, |\hat\gamma(0, t) - \hat\gamma(0, 1)| \overset{d}{\longrightarrow} \sup_{t \in [0,1]} |W(t) - tW(1)|, \tag{1}$$
where the convergence follows from Assumptions 1 and 2 and the continuous mapping theorem together with Slutsky's theorem (Davidson 1994). Here, fluctuations in the recursive parameter estimates $\hat\gamma(0, t)$ that deviate 'too much' from the full-sample estimate $\hat\gamma(0, 1)$ are taken as evidence against the null. Based on the $(1 - \alpha)$-quantile $c^S_\alpha$ of the limiting distribution in (1), we reject $H_0^S$ if $T_n > c^S_\alpha$, so that
$$\lim_{n \to \infty} \Pr\{T_n > c^S_\alpha\} = \alpha. \tag{2}$$
However, often the decision to apply a structural break test (usually to validate the intended subsequent statistical analysis) is not made before the data have been collected. Rather, as pointed out in the Motivation, it is made afterwards based on having observed some large deviations in the series. This eye-balling may be formalized as testing if and only if $t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}$ for some $t \in [0, 1]$, where the 'if and only if'-part of course constitutes a crude approximation. Here, the prefactor $t$ discounts a large deviation that is based on very few data points for small $t$; the inclusion of $\sqrt{k_n}$ reflects smaller expected fluctuations in larger samples; $\hat\sigma^2$ estimates the asymptotic variance of $\hat\gamma(0, 1)$ and, hence, reflects estimation uncertainty; finally, $\delta > 0$ determines the (unknown) sensitivity of the visual inspection. In other words, $\delta$ is the parameter for which eye-balling for changes can be best approximated by the conditioning event $\{\sup_{t \in [0,1]} t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}\}$. Of course, this approximation of the eye-balling heuristic may not be perfect, but we argue that it is close enough to be interesting. For instance, plots of $t \mapsto \hat\gamma(0, t)$ are often used as a diagnostic tool indicating structural change; see, e.g., Quintos et al. (2001, Fig. 3) or Wied et al. (2012, Fig. 1). For obvious reasons, we call the above procedure 'formalized' eye-balling.
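Now, carrying out the structural break test if and only if $t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}$ for some $t \in [0, 1]$, the probability that should be controlled is
$$\Pr\Big\{T_n > c^S_\alpha \,\Big|\, \sup_{t \in [0,1]} t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}\Big\} \tag{3}$$
and not $\Pr\{T_n > c^S_\alpha\}$ as in (2). Of course, the conditioning event in the above probability may also be written as $\{T_n > \delta\}$. We prefer to write it as in (3) to emphasize the 'formalized' eye-balling rationale of the conditioning.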
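As an illustration of how the statistic T_n can be evaluated for the mean (with k_n = n), the following minimal Python sketch computes the recursive CUSUM-type statistic with a Bartlett-kernel long-run variance estimate. The function names, the bandwidth floor(log n) (borrowed from the empirical application), and the simulated data are assumptions made for this illustration only, not the author's code.

```python
import math
import random


def bartlett_long_run_variance(x, bandwidth):
    """Bartlett-kernel estimate of the long-run variance of the series x."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    lrv = sum(v * v for v in d) / n  # lag-0 autocovariance
    for lag in range(1, bandwidth + 1):
        weight = 1.0 - lag / (bandwidth + 1)  # Bartlett weight
        autocov = sum(d[i] * d[i - lag] for i in range(lag, n)) / n
        lrv += 2.0 * weight * autocov
    return lrv


def cusum_mean_statistic(x):
    """T_n = sup_t sqrt(n) * t * |mean(x[:floor(nt)]) - mean(x)| / sigma_hat."""
    n = len(x)
    bandwidth = int(math.log(n))  # assumed bandwidth choice: floor(log n)
    sigma_hat = math.sqrt(bartlett_long_run_variance(x, bandwidth))
    total = sum(x)
    partial, sup = 0.0, 0.0
    for k in range(1, n + 1):
        partial += x[k - 1]
        # With t = k/n: t * (mean of first k obs - full mean) = (S_k - (k/n) S_n) / n
        sup = max(sup, abs(partial - k / n * total))
    return sup / (math.sqrt(n) * sigma_hat)


# Hypothetical data: a stable series and the same series with a mean shift.
random.seed(0)
stable = [random.gauss(0.0, 1.0) for _ in range(1000)]
shifted = stable[:500] + [v + 1.0 for v in stable[500:]]
print(cusum_mean_statistic(stable), cusum_mean_statistic(shifted))
```

Under the 'formalized' eye-balling rule, a series would be tested only if the returned value exceeds δ, and the unconditional test would reject if it exceeds the critical value c^S_α.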
Remark 1 To appreciate the difference between the conditional and unconditional approach, consider $B$ simulated trajectories of $X_1, \ldots, X_n$ under $H_0^S$. Then, for sufficiently large $n$, one expects to reject $H_0^S$ in the unconditional procedure for $B\alpha$ of these trajectories. In contrast, the conditional approach considers only those, say $k$, trajectories with visually apparent indications of change, i.e., those trajectories satisfying $\sup_{t \in [0,1]} \{t|\hat\gamma(0, t) - \hat\gamma(0, 1)|\}/\hat\sigma > \delta/\sqrt{k_n}$. If one rejects $H_0^S$ for each of these $k$ trajectories where the test statistic exceeds $c^S_\alpha$, one cannot expect a rejection in (the desired number of) $k\alpha$ cases; instead, the number of rejections is much higher, since only trajectories with large fluctuations are considered in the first place.
Formally, the asymptotic rejection probability is given by the following
Theorem 1 Suppose Assumptions 1 and 2 hold. Then, under $H_0^S$, as $n \to \infty$,
$$\Pr\Big\{T_n > c^S_\alpha \,\Big|\, \sup_{t \in [0,1]} t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}\Big\} \longrightarrow \frac{\alpha}{f(\min\{\delta, c^S_\alpha\})}, \tag{4}$$
where $f(x) = \Pr\{\sup_{t \in [0,1]} |W(t) - tW(1)| > x\}$.
As a simple application of the elementary conditional probability formula, the proof of Theorem 1 is almost trivial. The main insight afforded by Theorem 1 is that the textbook recommendation to look for 'any apparent changes in behavior' (Brockwell and Davis 2016, p. 12) should come with a warning that, if the formal structural break test is applied as usual (i.e., with the same critical value $c^S_\alpha$ suggested by the unconditional test), it rejects the null more often than it should. Moreover, the influence of the conditioning can be quantified up to the parameter $\delta$. This is in contrast to the example of GoF tests mentioned in the Motivation. There, the act of looking at a histogram and spotting a resemblance with some parametric distribution is much harder to formalize and, hence, the impact on a subsequent GoF test much harder to quantify.
Specifically, Theorem 1 formalizes the intuition that if there is preliminary ('formalized' eye-balling) evidence in the data that a null hypothesis is false, then that null is more likely to be rejected, because the other cases, where there is no preliminary evidence, are not considered. Since the conditioning event in (4) may be equivalently written as $\{T_n > \delta\}$, it is clear that, in case $\delta > c^S_\alpha$, the true null is rejected not only asymptotically with probability one, but even almost surely in finite samples. Even for a less stringent testing condition with $\delta \le c^S_\alpha$, the conditional (asymptotic) rejection probability under the null, $\alpha/f(\delta)$, is larger than $\alpha$. Only when $\delta = 0$, i.e., when there is no conditioning, is the desired type I error rate of $\alpha$ attained.
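The size distortion can be made tangible with a small Monte Carlo experiment. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: i.i.d. standard normal data, the mean as the parameter, σ = 1 treated as known, and the values δ = 1.073 and c^S_α = 1.358 quoted in the empirical application. It compares the unconditional rejection rate with the rate among trajectories that pass the 'formalized' eye-balling step.

```python
import math
import random


def cusum_sup(x):
    """sup_k |S_k - (k/n) S_n| / sqrt(n), assuming unit variance (sigma = 1 known)."""
    n = len(x)
    total = sum(x)
    partial, sup = 0.0, 0.0
    for k in range(1, n + 1):
        partial += x[k - 1]
        sup = max(sup, abs(partial - k / n * total))
    return sup / math.sqrt(n)


def simulate_rejection_rates(num_paths=2000, n=500, delta=1.073, crit=1.358, seed=1):
    """Return (unconditional, conditional) rejection rates under a constant mean."""
    rng = random.Random(seed)
    inspected = rejected = 0
    for _ in range(num_paths):
        t_n = cusum_sup([rng.gauss(0.0, 1.0) for _ in range(n)])
        if t_n > crit:
            rejected += 1   # rejection by the usual test
        if t_n > delta:
            inspected += 1  # trajectory passes 'formalized' eye-balling
    # crit > delta, so every rejected trajectory is also an inspected one.
    unconditional = rejected / num_paths
    conditional = rejected / inspected if inspected else float("nan")
    return unconditional, conditional


unconditional, conditional = simulate_rejection_rates()
print(f"unconditional: {unconditional:.3f}, conditional: {conditional:.3f}")
```

In such an experiment the unconditional rate should be in the vicinity of α = 0.05, whereas the conditional rate should be several times larger, in line with the limit α/f(δ); the exact figures depend on the seed and on finite-sample effects.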
Nonetheless, if $\delta$ is known (which is typically not the case), there is a way to keep a desired significance level $\alpha$ even in the conditional test. For a fixed $\delta > 0$, one simply chooses $\alpha^* = \alpha f(\delta) \in (0, \alpha)$. Then, by (4),
$$\lim_{n \to \infty} \Pr\Big\{T_n > c^S_{\alpha^*} \,\Big|\, \sup_{t \in [0,1]} t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}\Big\} = \frac{\alpha^*}{f(\delta)} = \alpha.$$
This means that when the test is only applied if the data provide preliminary evidence for a structural break, the resulting bias can be corrected by suitably lowering the significance level that determines the critical value. In spirit, this is similar to the usual pre-testing strategy of (suitably) lowering the significance level to avoid inflating the type I error (Giles and Giles 1993).
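The correction can be computed numerically. The sketch below assumes that f(δ) is the tail probability of the limit in (1), i.e., of the supremum of the absolute value of a Brownian bridge, which follows the Kolmogorov distribution; this reading is consistent with the critical values 1.073, 1.358 and 1.628 quoted in the empirical application. The function names are hypothetical.

```python
import math


def bridge_sup_tail(x, terms=100):
    """f(x) = Pr{ sup_t |W(t) - t W(1)| > x }, evaluated via the Kolmogorov series."""
    if x <= 0:
        return 1.0
    return 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                     for k in range(1, terms + 1))


def bridge_sup_critical_value(alpha, lo=0.3, hi=5.0):
    """Solve f(c) = alpha by bisection; f is decreasing in c."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if bridge_sup_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha, delta = 0.05, 1.073
f_delta = bridge_sup_tail(delta)      # roughly 0.2
alpha_star = alpha * f_delta          # roughly 0.01
print("f(delta)            =", round(f_delta, 3))
print("naive critical value =", round(bridge_sup_critical_value(alpha), 3))
print("corrected value      =", round(bridge_sup_critical_value(alpha_star), 3))
```

The bisection is used only to keep the example free of external dependencies; any root finder or a tabulated Kolmogorov distribution would serve equally well.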
Remark 2 Theorem 1 only investigates the behavior of the test under the null and shows it to be oversized. In line with the well-known tradeoff between type I and type II error, this suggests that the test has higher power under the alternative. We refrain from formally investigating the (local) power of the test, as it does not hold size and, hence, is not a valid statistical test.

Monitoring parameter changes
Let $X_1, \ldots, X_n$ denote the variables in the training period that are available prior to monitoring. The goal in monitoring is to detect breaks in some parameter as new observations $X_{n+1}, X_{n+2}, \ldots$ become available, and to do so as quickly as possible.
Formally, interest in monitoring centers on sequentially testing
given the non-contamination assumption $\gamma = \gamma_1 = \cdots = \gamma_n$, which imposes structural stability in the training period. Of course, non-contamination can be tested using the methods of Sect. 2.1. We consider so-called closed-end procedures, where monitoring stops after $X_{n+1}, \ldots,$
as the space of all $\mathbb{R}^2$-valued functions on $[0, T]$ that are right-continuous with left-hand limits in each component (Davidson 1994). The following condition is the analogue of Assumption 1 and has likewise been shown to hold for various estimators (see, e.g., Hoga and Wied 2017).
Assumption 3 It holds that, as $n \to \infty$,
We base monitoring on the moving-sum detector
where the estimator $\hat\sigma$ from Assumption 2 is typically calculated from the non-contaminated training data. The idea behind $M_n(t)$ is that large deviations between $\hat\gamma(t, t + t_0)$ and the non-contaminated estimate $\hat\gamma(0, 1)$ indicate a structural change in the monitoring period. Again by the continuous mapping theorem and Slutsky's lemma, it follows under Assumptions 2 and 3 that
Hence, one controls the asymptotic probability
When deciding which parameters to monitor, one may choose those whose subsample estimates have exhibited some variation in the training period. This again leads us to consider monitoring for change conditional on $\sup_{t \in [0,1]} \{t|\hat\gamma(0, t) - \hat\gamma(0, 1)|\}/\hat\sigma > \delta/\sqrt{k_n}$ for some $\delta > 0$. As before, the parameter $\delta$ is unknown, depending on the sensitivity of the visual inspection by the practitioner. When monitoring conditionally on having observed some noticeable variation in the training period, one should control
If $\delta$ could actively be chosen in practice, it should not be chosen too large (larger than some critical value $c^S_\alpha$), because this would already be evidence against $\gamma_1 = \cdots = \gamma_n$ (cf. (2)), violating the non-contamination assumption. However, the conditioning in (6) does not matter asymptotically for monitoring, as shown next.
Theorem 2 Suppose Assumptions 2 and 3 hold. Then, under $H_0^M$, as $n \to \infty$,
Theorem 2 shows that when the training data inspire a test of $H_0^M$, the monitoring procedure can be applied as usual, i.e., with the same boundary $c^M_\alpha$ as in (5). This result is reminiscent of the intuition that, while one cannot (without modification at least) use the same data that inspired a hypothesis to test it, one can simply wait for fresh (out-of-sample) data to verify it. The limit in (7) would trivially obtain if the two events in the probability were independent. Yet, this is not the case, as $\hat\gamma(0, 1)$ is common to both events, and the underlying $X_t$'s may be serially dependent across the training and monitoring period.

Remark 3
The conclusion of Theorem 2 does not depend on the type of detector used. For instance, using the expanding-sum detector $E_n(t) = \frac{1}{\hat\sigma}\left|(t - 1)\sqrt{k_n}\,[\hat\gamma(1, t) - \hat\gamma(0, 1)]\right|$ would, under a suitable analogue of Assumption 3, lead to the same result. We omit details for brevity.
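The claim that conditioning on the training period leaves the monitoring error rate unchanged can be checked by simulation. The following sketch uses the expanding-sum detector E_n(t) for the mean over a closed-end horizon, under assumptions chosen purely for illustration: i.i.d. standard normal data, σ = 1 treated as known, a monitoring horizon of twice the training length, and an empirical quantile of the simulated detector values standing in for the boundary c^M_α.

```python
import math
import random


def training_and_monitoring_stats(x, n):
    """Return (T_n on x[:n], sup_t E_n(t) on x[n:]) for the mean, with sigma = 1."""
    train_mean = sum(x[:n]) / n
    partial, t_n = 0.0, 0.0
    for k in range(1, n + 1):
        partial += x[k - 1]
        t_n = max(t_n, abs(partial - k * train_mean))
    partial, detector = 0.0, 0.0
    for j, value in enumerate(x[n:], start=1):
        partial += value
        # (t - 1) * sqrt(n) * (monitoring mean - training mean), with t = (n + j) / n
        detector = max(detector, abs(partial - j * train_mean))
    return t_n / math.sqrt(n), detector / math.sqrt(n)


def monitoring_rejection_rates(num_paths=2000, n=200, horizon=2.0,
                               delta=1.073, alpha=0.05, seed=7):
    rng = random.Random(seed)
    total_len = int(n * horizon)
    stats = [training_and_monitoring_stats(
                 [rng.gauss(0.0, 1.0) for _ in range(total_len)], n)
             for _ in range(num_paths)]
    # The empirical (1 - alpha)-quantile stands in for the boundary c^M_alpha.
    detectors = sorted(d for _, d in stats)
    boundary = detectors[round((1 - alpha) * num_paths) - 1]
    unconditional = sum(d > boundary for _, d in stats) / num_paths
    selected = [d for t_n, d in stats if t_n > delta]
    conditional = sum(d > boundary for d in selected) / len(selected)
    return unconditional, conditional


mon_unconditional, mon_conditional = monitoring_rejection_rates()
print(f"unconditional: {mon_unconditional:.3f}, conditional: {mon_conditional:.3f}")
```

In contrast to the structural break test, the two rates should be close to each other here, up to Monte Carlo error, because the selection step only involves fluctuations within the training period.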

Remark 4
Reconsider the set-up of Remark 1. Suppose Assumptions 2 and 3 hold for the (continued) simulated trajectories ($b = 1, \ldots, B$). Then, in sufficiently large samples, one expects to reject $H_0^M$ based on unconditional testing for roughly $100\alpha\%$ of all trajectories; see (5). (This is as in the structural break setting of Sect. 2.1, where one also rejects $H_0^S$ for $100\alpha\%$ of all trajectories.) If monitoring is done conditionally, then one expects to reject $H_0^M$ for $100\alpha\%$ of the $k$ trajectories satisfying the conditioning event. (This contrasts with the structural break tests, where, in the conditional setting, $H_0^S$ is rejected for $100\,\alpha/f(\min\{\delta, c^S_\alpha\})\%$ of said $k$ trajectories.) Hence, while (5) and (7) suggest an equal rate of type I errors asymptotically, the trajectories tested are different: in unconditional monitoring all trajectories are considered, whereas conditional monitoring only considers the fraction for which the condition is met.

Remark 5
The result illustrated in Remark 4 for the different conditional rejection probabilities in structural break testing and monitoring has some analogy with the following example. Suppose a coin comes up heads 9 out of 10 times. This then raises some suspicion of the fairness of the coin. The data-inspired hypothesis that the coin is unfair is then more likely to be accepted as true when tested on the same observations. (This result is analogous to Theorem 1.) However, tossing the same coin again 10 times, the 'unfair coin' hypothesis, suggested by the first 10 throws, can be tested on fresh data as if no conditioning took place. (This is analogous to Theorem 2, where out-of-sample data become available for monitoring.) Of course, while fresh data is also used in testing $H_0^M$, the monitoring situation is more complex than the coin flip example. First, data may be serially dependent in Theorem 2. Second, $\hat\gamma(0, 1)$ appears in both the detector $M_n(t)$ and the conditioning event in (6).

Empirical application
Here, we illustrate the practical implications of Theorems 1 and 2. We do so using Bitcoin log-returns $X_1, \ldots, X_n$ from 01/01/2016 to 31/12/2019, giving $n = 1{,}461$ observations. Böhme et al. (2015) provide a comprehensive review of the cryptocurrency. Due to the rising popularity of crypto-currencies, much research effort has been devoted to studying Bitcoin, which represents the largest market share among all crypto-currencies. For instance, Urquhart (2016) investigates the efficiency of the Bitcoin market using, among others, a classical Ljung-Box test. Most tests (including Ljung-Box tests) rely on the absence of structural breaks for their validity. However, as Bitcoins represent a new asset class, ex ante knowledge of the parameters that may change (e.g., mean, variance, higher order moments, tail index, autocorrelations, etc.) is hard to justify. Testing for change in all conceivable parameters invariably inflates type I errors, due to multiple testing issues. So it seems natural to only test for breaks in parameters that have fluctuated noticeably in the data. Hence, in the following, we test for breaks in parameters where 'formalized' eye-balling, as described in Sect. 2.1, indicates a possible change.
As examples, we consider the first two moments as the most important parameters determining location and scale. On stretches of stationarity in the data, Assumptions 1-3 are then likely to be satisfied by the Bitcoin log-returns. This is because GARCH-type volatility models have been successfully used in modeling Bitcoin returns (Cheah and Fry 2015). Carrasco and Chen (2002) show that several GARCH models are mixing with rates implying that suitable functional central limit theorems, as in Assumptions 1 and 3, hold (Herrndorf 1985). Likewise, the mixing conditions are sufficient for consistent estimation of the long-run variance $\sigma^2$ in Assumption 2 (De Jong and Davidson 2000).
Fix the significance level at $\alpha = 0.05$. In the 'standard' eye-balling method, one would plot the series of log-returns, and look for preliminary evidence of parameter change. If such evidence is found, the series would then be subjected to a formal level-$\alpha$ test, without reflecting in the test that the data themselves suggested the hypothesis. Alternatively, we use the 'formalized' eye-balling approach and assume that visually inspecting the time series for abrupt changes inspires a subsequent test if and only if $\frac{\sqrt{n}}{\hat\sigma} \sup_{t \in [0,1]} \{t|\hat\gamma(0, t) - \hat\gamma(0, 1)|\}$ exceeds $\delta$. Of course, $\delta$ is unknown in practice, but for illustrative purposes we assume here that it equals the 20%-critical value, i.e., $\delta = c^S_{\alpha=0.2} = 1.073$. Thus, we apply a test to one of the two parameters (first and second moments) if the corresponding value of $T_n$ is larger than $\delta$. Without any modification of the significance level $\alpha$, this would yield a test of level $\alpha/f(\delta) = \alpha/0.2 = 25\%$ by Theorem 1. By the same theorem, a (conditional) level-$\alpha$ test is however obtained if we reject when $T_n > c^S_{\alpha^*} = 1.628$ for $\alpha^* = \alpha f(\delta) = 0.2\alpha = 0.01$.
Figure 1 displays the price process and the log-returns of Bitcoin. The prices seem to indicate returns with positive mean until the peak on December 16, 2017, and a negative mean in the year thereafter. This strongly suggests the need for a more formal test of a constant mean. Define $S_n^M(t) = \frac{\sqrt{n}}{\hat\sigma_M}\, t|\hat\gamma_M(0, t) - \hat\gamma_M(0, 1)|$, where $\hat\gamma_M(0, t)$ denotes the sample mean of $X_1, \ldots, X_{\lfloor nt \rfloor}$, and $\hat\sigma_M$ is a HAC estimator with Bartlett kernel and bandwidth $\lfloor \log n \rfloor$. As the test statistic $T_n^M = \sup_{t \in [0,1]} S_n^M(t) = 1.578$ is larger than the critical value $c^S_{\alpha=0.05} = 1.358$, the null of a constant mean is rejected. However, this test incurs a not-accounted-for bias, because the mean break hypothesis was suggested by the same data that were used for testing.
To account for this bias, we apply the 'formalized' eye-balling technique next. The test statistic for a mean change is larger than $\delta$ ($T_n^M = 1.578 > 1.073 = \delta$). Conditional on this result, a level-$\alpha$ test rejects if $T_n^M > 1.628$. Hence, we do not reject the constant mean hypothesis at a 5%-significance level using the conditional test.
The plot of $t \mapsto S_n^M(t)$ in Fig. 1 graphically illustrates the conflicting results. The red dotted lines indicate the 'incorrect' critical value $c^S_{\alpha=0.05} = 1.358$ and the 'correct' $c^S_{\alpha=0.01} = 1.628$. While the 'incorrect' value is exceeded by $S_n^M(t)$, the 'correct' value is not. The additional evidence required to exceed the 'correct' critical value can be seen as a compensation for the fact that the data themselves indicated the mean break hypothesis. The valid conditional test did not reject the constant mean hypothesis. Nonetheless, the persistence in the price process observed in Fig. 1 indicates the possibility of mean changes. This suggests that it may be useful to monitor for breaks in the mean starting in 2020. For instance, in a risk management context, it is particularly important to detect downward breaks in the mean to avoid losses. The implication of Theorem 2 is that such a monitoring procedure can be applied as if it had not been Fig. 1 that suggested the hypothesis.
As the conditional test did not reject the null of a constant mean, we may validly apply a test for a change in the variance by testing for the constancy of second moments.
Define $S_n^V(t) = \frac{\sqrt{n}}{\hat\sigma_V}\, t|\hat\gamma_V(0, t) - \hat\gamma_V(0, 1)|$, where $\hat\gamma_V(0, t)$ denotes the sample mean of $X_1^2, \ldots, X_{\lfloor nt \rfloor}^2$, and $\hat\sigma_V$ again denotes a HAC estimator (with Bartlett kernel and bandwidth $\lfloor \log n \rfloor$) for the asymptotic variance of $\hat\gamma_V(0, 1)$. The test statistic for a variance change is once more larger than $\delta$ ($T_n^V = \sup_{t \in [0,1]} S_n^V(t) = 1.692 > 1.073 = \delta$). Conditional on this result, a level-$\alpha$ test rejects the null of a constant variance, as $T_n^V = 1.692 > 1.628$. Of course, in this case the 'naive' test also would have led to a rejection (because $T_n^V = 1.692 > 1.358 = c^S_{\alpha=0.05}$). The plot of $t \mapsto S_n^V(t)$ in Fig. 1 illustrates the result. It also allows us to date the (most prominent) break somewhere around the first half of 2017, where the largest values of $S_n^V(t)$ are attained. Of course, as the variance is not even constant in the training period, monitoring (using Theorem 2) should not be carried out, due to structural change contaminating the training period. The evidence for a break in the variance of Bitcoin returns suggests that either statistical analyses (such as Ljung-Box tests used to assess market efficiency) need to be robust to unconditional heteroscedasticity or the analyses have to be restricted to break-free subsamples.
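The decisions reported in this section can be retraced with a few lines of Python. The sketch below hard-codes the test statistics and critical values quoted in the text; the function name and the three-way outcome labels are hypothetical and serve only to make the decision rule explicit.

```python
def conditional_break_decision(t_n, delta, crit_conditional):
    """Decision of the conditional test under 'formalized' eye-balling."""
    if t_n <= delta:
        return "not tested"  # no visual indication of change, so no test is run
    return "reject" if t_n > crit_conditional else "do not reject"


DELTA = 1.073             # assumed eye-balling sensitivity, c^S at alpha = 0.2
CRIT_NAIVE = 1.358        # c^S at alpha = 0.05
CRIT_CONDITIONAL = 1.628  # c^S at alpha* = 0.01

for name, t_n in [("mean", 1.578), ("variance", 1.692)]:
    naive = "reject" if t_n > CRIT_NAIVE else "do not reject"
    corrected = conditional_break_decision(t_n, DELTA, CRIT_CONDITIONAL)
    print(f"{name}: naive test -> {naive}, conditional test -> {corrected}")
```

For the mean, the naive and the conditional test disagree, while for the variance both reject, matching the discussion above.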

Conclusion
More often than not, hypotheses are generated by data. While, in general, fresh data is desirable to validly verify a hypothesis, in some applications hypotheses need to be tested on the same data that generated them (e.g., the structural break hypothesis for the Bitcoin returns from 2016 to 2019). In situations like these, the bias of having looked at the data before hypotheses are formulated can frequently not be corrected, e.g., in goodness-of-fit testing. We show in this note that a correction is theoretically possible for structural break tests if the critical value is suitably increased, with the increase depending on a single unknown constant. This provides one further reason for the use of large critical values or, equivalently, small significance levels, which has also been advocated elsewhere (Benjamin 2018). Furthermore, this shows that the textbook recommendation to visually inspect the data for breaks should carry a warning that subsequent formal structural break tests need to take into account that the break hypothesis was suggested by the data. By contrast, when monitoring parameters, 'hypotheses' generated from the training data can be tested without any correction.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Proof of Theorem 1 By the definition of conditional probability,
$$\Pr\Big\{T_n > c^S_\alpha \,\Big|\, \sup_{t \in [0,1]} t|\hat\gamma(0, t) - \hat\gamma(0, 1)|/\hat\sigma > \delta/\sqrt{k_n}\Big\} = \Pr\{T_n > c^S_\alpha \mid T_n > \delta\} = \frac{\Pr\{T_n > c^S_\alpha,\, T_n > \delta\}}{\Pr\{T_n > \delta\}} = \frac{\Pr\{T_n > \max\{c^S_\alpha, \delta\}\}}{\Pr\{T_n > \delta\}}.$$
The conclusion now follows from (1) and Proposition 12.3.4 of Dudley (2004).
Proof of Theorem 2 From the continuous mapping theorem, Slutsky's lemma and Assumptions 2 and 3, we obtain, as $n \to \infty$,