Epidemic changepoint detection in the presence of nuisance changes

Many time series problems feature epidemic changes—segments where a parameter deviates from a background baseline. Detection of such changepoints can be improved by accounting for the epidemic structure, but this is currently difficult if the background level is unknown. Furthermore, in practical data the background often undergoes nuisance changes, which interfere with standard estimation techniques and appear as false alarms. To solve these issues, we develop a new, efficient approach to simultaneously detect epidemic changes and estimate unknown, but fixed, background level, based on a penalised cost. Using it, we build a two-level detector that models and separates nuisance and signal changes. The analytic and computational properties of the proposed methods are established, including consistency and convergence. We demonstrate via simulations that our two-level detector provides accurate estimation of changepoints under a nuisance process, while other state-of-the-art detectors fail. In real-world genomic and demographic datasets, the proposed method identified and localised target events while separating out seasonal variations and experimental artefacts. Supplementary Information The online version contains supplementary material available at 10.1007/s00362-022-01307-x.


Introduction
The problem of identifying when the probability distribution of a time series changeschangepoint detection -has been studied since the middle of the 20th century.Early developments, such as Shewhart's control charts (Shewhart, 1930) and Page's CUSUM test (Page, 1954), stemmed from operations research.However, as automatic sensing systems and continuous data collection became more common, many new use cases for changepoint detection have arisen, such as seismic events (Li et al., 2016), epidemic outbreaks (Texier et al., 2016), physiological signals (Vaisman et al., 2010), market movements (Bracker and Smith, 1999), and network traffic spikes (Hochenbaum et al., 2017).Stimulated by such 1 arXiv:2008.08240v1[stat.ME] 19 Aug 2020 practical interest, the growth of corresponding statistical theory has been rapid, as reviewed in e.g.Aminikhanghahi and Cook (2016) and Truong et al. (2020).
Different applications pose different statistical challenges.If a single drastic change may be expected, such as when detecting machine failure, the goal is to find a method with minimal time to detection and a controlled false alarm rate (Lau and Tay, 2019).In many other applications the number of changepoints is unknown a priori; the challenge is then to identify the change locations regardless, preferably in a computationally efficient way.Some problems, such as peak detection in sound (Aminikhanghahi and Cook, 2016) or genetic data (Hocking et al., 2017), feature epidemic segments -changepoints followed by a return to the background level -and incorporating this constraint can improve detection or simplify post-processing of ouputs.
However, current detection methods that do incorporate a background level assume it to be stable throughout the dataset (e.g.Fisch et al., 2018;Zhao and Yau, 2019).This is not realistic in several common applications.In genetic analyses such as measurements of protein binding along DNA there may be large regions where the observed background level is shifted due to structural variation in the genome or technical artefacts (Zhang et al., 2008).Similarly, a standard task in sound processing is to detect speech in the presence of dynamic background chatter, so-called babble noise (Pearce et al., 2000).In various datasets from epidemiology or climatology, such as wave height measurements (Killick et al., 2012), seasonal effects are observed as recurring background changes and will interfere with detection of shorter events.Methods that assume a constant background will be inaccurate in these cases, while ignoring the epidemic structure entirely would cost detection power and complicate the interpretation of outputs.
Our goal is to develop a general method for detecting epidemic changepoints in the presence of nuisance changes in the background.Furthermore, we assume that the only available feature for distinguishing the two types of segments is their duration: this would allow analysing the examples above, which share the property that the nuisance process is slower.The most similar research to ours is that of Lau and Tay (2019) for detecting failure of a machine that can be switched on, and thus undergo an irrelevant change in the background level.However, the setting there concerned a single change with specified background and nuisance distributions; in contrast, we are motivated by the case when multiple changes may be present, with only duration distinguishing their types.Detection then requires two novel developments: 1) rapid estimation of local background level, 2) modelling and distinguishing the two types of potentially overlapping segments.
These developments are presented in this paper as follows: after a background section we provide a new algorithm that simultaneously detects epidemic changepoints and estimates the unknown background level (Section 3).The convergence and consistency of this algorithm are proved in the Supplementary Material.While this algorithm is of its own interest, we use it to build a detector that allows local variations in the background, i.e., nuisance changes, in Section 4, again with proven properties.In Section 5 we test various properties of the algorithms with simulations before showing how the proposed nuisance-robust detector can be applied in two problems: detecting histone modifications in human genome, while ignoring structural variations, and detecting the effects of the COVID-19 pandemic in Spanish mortality data, robustly to seasonal effects.Compared to state-of-the-art methods, the proposed detector produced lower false-alarm rates (or more parsimonious models), while retaining accurate detection of true signal peaks.

Background
In the general changepoint detection setup, the data consists of a sequence of observations x 0:n = {x 0 , x 1 , . . ., x n }, split by k changepoints 0 < τ 1 < τ 2 < . . .τ k < n into k + 1 segments.Within each segment the observations are drawn from a distribution specific to that segment; these are often assumed to belong to some fixed family f θ (x), with different values of parameter θ for each segment.The most common and historically earliest example is the change in mean of a Gaussian (see Truong et al. (2020)), i.e., for each t ∈ [τ i , τ i+1 ), x t ∼ N (µ i , σ 2 ), for known, fixed σ 2 .(We assume θ ∈ R 1 to keep notation clearer, but multidimensional problems are also common, such as when both the mean and variance of a Gaussian are subject to change.)While early research focused on optimal hypothesis testing of a single change against no change in a parameter, modern models aim to estimate the position of all changepoints {τ i } in a given time series, and optionally the vector of segment parameters θ.A common approach is to use a penalised cost: choose a segment cost function C(x a:b ; θ), usually based on negative log-likelihood of the observations x a:b given θ, and a penalty p(k) for the number of changepoints k.The full cost of x τ 0 :τ k (where τ 0 = 0 and τ k = n) is then: (1) Changepoint number and positions are estimated by finding F (n) = min F (n; τ , θ, k).Such estimation has been shown to be consistent for a range of different data generation models (Fisch et al., 2018;Zheng et al., 2019).
For a discrete problem such as this, computation of the true minimum is not triviala naïve brute force approach would require O(2 n ) tests.Approaches to reducing this cost fall in two broad classes: 1) simplifying the search algorithm by memoisation and pruning of paths (Jackson et al., 2005;Killick et al., 2012;Rigaill, 2015); 2) using greedy methods to find approximate solutions faster (Fryzlewicz, 2014;Baranowski et al., 2019).In both classes, there are methods that can consistently estimate multiple changepoints in linear time under certain conditions.
The first category is of more relevance here.It is based on the Optimal Partitioning (OP) algorithm (Jackson et al., 2005).Let the data be partitioned into discrete blocks A function V that maps each set of blocks If each segment incurs a fixed penalty β = p(k)/k, then the function F defined in (1) is block-additive over segments, and can be defined recursively as: In OP, this cost is calculated for each s ≤ n, and thus its minimisation requires O(n 2 ) evaluations of C. Furthermore, when the cost function C is such that for all a ≤ b < c: then at each s, it can be quickly determined that some candidate segmentations cannot be "rescued" by further segments, and so they can be pruned from the search space.This approach reduces the complexity to O(n) in the best case.It gave rise to the family of algorithms called PELT, Pruned Exact Linear Time (Killick et al., 2012;Zhao and Yau, 2019).Note that OP and PELT do not rely on any probabilistic assumptions and find the exact same minimum as the global exponential-complexity search.Since the introduction of PELT, a number of alternative pruning schemes have been developed (Maidstone et al., 2016).
A variation on the basic changepoint detection problem is the identification of epidemic changepoints, when a change in regime appears for a finite time and then the process returns to the background level.The concept of a background distribution f B is introduced, and the segments are defined by pairs of changepoints s i , e i , outside of which the data is drawn from the background model.The data model then becomes: (4) An equivalent of the CUSUM test for the epidemic situation was first proposed by Levin and Kline (1985), and a number of methods for multiple changepoint detection in this setting have been proposed (Olshen et al., 2004;Fisch et al., 2018;Zhao and Yau, 2019).These use a cost function that includes a separate term C 0 for the background points: A common choice for the background distribution is some particular "null" case of the segment distribution family, so that f B (x) = f S (x; θ 0 ) and C 0 (•) = C(•; θ 0 ).However, while the value of θ 0 is known in some settings (such as when presented with two copies of DNA), in other cases it may need to be estimated.Since this parameter is shared across all background points, the cost function is no longer block-additive as in (2), and algorithms such as OP and PELT cannot be directly applied.
One solution is to substitute the unknown parameter with some robust estimate of it, based on the unsegmented series x 0:n .The success of the resulting changepoint estimation then relies on this estimate being sufficiently close to the true value, and so the nonbackground data fraction must be small (Fisch et al., 2018).This is unlikely to be true in our motivating case, when nuisance changes in the background level are possible.
Another option is to define which can be minimised using OP or PELT, and then use gradient descent for the outer minimisation over θ 0 (Zhao and Yau, 2019).The main drawback of this approach is an increase in computation time proportional to the number of steps needed for the optimisation.

Detection of changepoints with unknown background parameter
To estimate the background parameter and changepoints efficiently, while allowing a large proportion of non-background data, we introduce Algorithm 1.The algorithm makes a pass over the data, during which epidemic segments are detected in a standard way, i.e., by minimising penalized cost using OP (steps 3-12).The estimate of parameter θ 0 is iteratively updated at each time point.We will show that this process converges almost surely to the true θ 0 for certain models of background and segment levels, or reaches its neighbourhood (θ 0 − ; θ 0 + ) under more general conditions.
Algorithm 1 Detection of changepoints with unknown background parameter Calculate F B = F (t − 1) + C 0 (x t ; θ 0 ), and Assign B(t) = B(t − 1) ∪ x t , and recalculate θ 0 from B 7: Assign B(t) = B(t − k) (here k: argmin F S in step 4) 10: end if 12: end for 13: Repeat steps 3-12, without updating θ 0 14: Output: θ 0 , changepoint positions The algorithm then makes a second pass over the data (step 13), repeating the segmentation using the final estimate of θ 0 .The step is identical to a single application of a known-background changepoint detector, such as the one described in Fisch et al. (2018).
Its purpose is to update the changepoint positions that are close to the start of the data and so had been determined using less precise estimates of θ 0 .This simplifies the theoretical performance analysis, but an attractive option is to use this algorithm in an online manner, without this step.We evaluate the practical consequences of this omission later in Section 5.
By segmenting and estimating the background simultaneously, this algorithm is computationally more efficient than dedicated optimisation over possible θ 0 values as in Zhao and Yau (2019), while allowing a larger fraction of non-background points than using a robust estimator on unsegmented data, as in Fisch et al. (2018).
We now demonstrate that Algorithm 1 will converge almost surely under certain constraints, and is consistent.

Convergence
Theorem 1.Consider the problem of an epidemic change in mean, with data x 0:n generated as in (4).Assume the distribution f B (x) and marginal distribution of segment points f S (x) = f S (x; θ)f Θ (θ)dθ are symmetric and strongly unimodal, with unknown background mean θ 0 , and that data points within each segment are iid.Here, f Θ is the probability density of parameter θ corresponding to each data point.Denote by w t the estimate of θ 0 obtained by analysing x 0:t by Algorithm 1.The sequence {w t } converges: 1. to θ 0 almost surely, if Ef S = θ 0 .
We refer the reader to Appendix A for the proof.It is based on a result by Bottou (1998), who established conditions under which an online minimisation algorithm almost surely converges to the optimum (including stochastic gradient descent as a special case).
We show that these conditions are either satisfied directly by the updating process in our algorithm, or longer update cycles can be defined that will satisfy the conditions, in which case the convergence efficiency is determined by n.

Consistency
The changepoint model can be understood as a function over an interval that is sampled to produce n observed points.As the sampling density increases, more accurate estimation of the number and locations of changepoints is expected; this property is formalised as consistency of the detector.Fisch et al. (2018) showed that detectors based on minimising penalised cost are consistent in the Gaussian data case, and their result can be adapted to prove consistency of Algorithm 1.We will use the strengthened SIC-type penalty α log(n) 1+δ , δ > 0 as presented in Fisch et al. (2018), but a similar result can be obtained with δ = 0.
Additionally, the minimum signal-to-noise ratio at which segments can be detected is formalised as this assumption: where ∆ i is a function of the relative strength of the parameter changes |µ i −µ 0 | and σ k /σ 0 + σ 0 /σ k .
Theorem 2. Let the data x 0:n be generated from an epidemic changepoint model as in (4), with f B and f S Gaussian, and the changing parameter θ is either its mean or variance (assume the other parameter is known).Further, assume (6) holds for k changepoints.
Analyse the data by Algorithm 1 with penalty β = α log(n) 1+δ , α, δ > 0. The estimated number and position of changepoints will be consistent, i.e. ∀ > 0, n > B: for some A, B, C that do not increase with n.
The proof is given in Appendix B. We use the connection between our Algorithm 1 and stochastic gradient descent to establish error bounds on the backgorund parameter estimates.
These bounds then allow us to apply a previous consistency result (Fisch et al., 2018) to our case.

Pruning
Much of the improvement in performance of change estimation algorithms comes from pruning of the search space.Standard pruning (Killick et al., 2012) can be applied to Algorithm 1.It continues to find optimal solutions in this case, as is shown in Proposition 1.
To implement pruning, the search set for F S in step 4 would be changed to: and step 11a would be introduced to update K as: Proposition 1. Assume a cost function C such that (3) applies.If, for some s > t − l: then t − k will not be a segment start point in the optimal solution, and can be excluded from consideration for all subsequent t > t.
Proof.For all t ≤ s + l, the proof applies as for other pruned algorithms (Killick et al., 2012): For t > s + l, segment (s, t ) will exceed the length constraint and thus cannot be part of the final segmentation.
4 Detecting changepoints with a nuisance process

Problem setup
In this section, we consider the changepoint detection problem when there is an interfering nuisance process.We assume that this process, like the signal, consists of segments, and we index the start and end times as s N j , e N j .Data within these segments is generated from a nuisance-only distribution f N , or from some distribution f N S if a signal occurs at the same time.In total, four states are possible (background, nuisance, signal, or nuisance and signal), so the overall distribution of data points is: We add two more conditions to ensure identifiability: 1) the nuisance process evolves more slowly than the signal process, so min(e N j − s N j ) > max(e i − s i ); 2) signal segments are either entirely contained within a nuisance segment, or entirely out of it: We adapt the changepoint detection method as follows.Define , a penalty p (m) = β m for the number of nuisance segments, and cost functions Then the full cost of a model with k true segments and m nuisance segments is: Note that this cost can also be expressed, using k j as the number of signal segments that overlap a nuisance segment j and k 0 = k − k j , as: Condition (9) ensures that C and C S are independent (no points or θ i are shared between them), so F is block-additive over (signal or nuisance) segments and can be minimised by dynamic programming.

Proposed method
To minimise this cost, we propose the method outlined in Algorithm 2. Its outer loop proceeds over the data to identify segments by the usual OP approach.However, to evaluate the cost C (x a:b ), which allows segment and nuisance overlaps, a changepoint detection problem with unknown background level must be solved over x a:b .This is achieved by an inner loop using Algorithm 1.If C = C S , this method would reduce to a standard detector of epidemic changepoints, with the exception that segments are separated into two types depending on their length; this is very similar to the collective and point anomaly (CAPA) detector in Fisch et al. (2018).
The cost C is minimised using Algorithm 1, so by Theorem 2, the number of positions of true segments will be estimated consistently given accurate assignment of the m nuisance positions s N j , e N j .However, the latter event is subject to a complex set of assumptions on relative signal strength, position, and duration of the segments.Therefore, we do not attempt to describe these in full here, but instead investigate the performance of the method by extensive simulations in Section 5.1.3.
Algorithm 2 Adaptive segmentation by changepoint detection 3: If θ 0 not known: estimate using an appropriate robust estimator 4: for t ∈ 1, . . ., n do 5: Apply Algorithm 1 to data x t +1:t , store returned cost as C and changepoints as chp N S (t ) end for 9:

Pruning
In the proposed method, the estimation of parameter µ j (the mean of segment j) is sensitive to the segment length, therefore the cost C does not necessarily satisfy the additivity condition (3), and so it cannot be guaranteed that PELT-like pruning will be exact.However, we can establish a local pruning scheme that retains the exact optimum with probability → 1 as n → ∞.
Proposition 2. Assume data x 0:n is generated from a Gaussian epidemic changepoint model, and that the distance between changepoints is bounded by some function A(n): At time t, the solution space is pruned by removing: Here Then ∀ > 0, there exist constants B, n 0 , such that when n > n 0 , the true nuisance segment positions are retained with high probability: The proof is given in Appendix C. The assumed distance bound serves to simplify the detection problem: within each window (t − A(n), t], at most 1 true changepoint may be present, and the initial part of Algorithm 2 is identical to a standard epidemic changepoint detector.It can be shown that other candidate segmentations in the pruning window are unlikely to have significantly lower cost than the one associated with s N j , e N j , and therefore s N j , e N j are likely to be retained in pruning.As this scheme only prunes within windows of size A(n), and setting large A(n) may violate the true data generation model, it is less efficient than PELT-type pruning.However, assuming that the overall estimation of nuisance and signal changepoints is consistent, Proposition 2 extends to standard pruning over the full dataset.We show that this holds empirically in Section 5.1.3.

Experiments
In this section, we present the results of simulations used to evaluate the performance of the methods, and demonstrations on real-world datasets.
The methods proposed in this paper are implemented in R 3.5.3,and available on GitHub at https://github.com/jjuod/changepoint-detection.The repository also includes the code for recreating the simulations and data analyses presented here.

Algorithm 1 estimates the background parameter consistently
In the first experiment we tested the performance of the algorithm with a fixed background parameter.Three different sets of data were generated and used: Scenario 1. Gaussian data with one signal segment: n data points were drawn as x t ∼ N (θ t , 1), with a segment of θ t = 3 at times t ∈ (0.3n; 0.5n], and 0 otherwise (background level).
To evaluate the background level (θ 0 ) estimation by Algorithm 1, we generated time series for each of the three scenarios with values of n between 30 and 750, with 500 replications for each n.Note that the segment positions remain the same for all sample sizes, so the increase in n could be interpreted as denser sampling of the time series.
Each time series was analysed using Algorithm 1 to estimate θ 0 .The maximum segment length was set to l = 0.5n and the penalty to β = 3 log(n) 1.1 .The cost was computed using the Gaussian log-likelihood function with known variance.This means that the cost function was mis-specified in Scenario 3, and so provides a test of the robustness of the algorithm (although the variance was set to 3, as expected for T (3)).
For comparison, in each replication we also retrieved the median of the entire time series x 1:n , which is used as the estimate of θ 0 by Fisch et al. (2018).For scenarios (1) and (2), we also computed the quantiles of N (0, σ 2 / √ n B ) as the oracle efficiency limit based on the CLT, with n B the total number of background points: n B = 0.8n for (1) and 0.7n for (2).
Figure 1 shows the results for all three scenarios.It can be seen that Algorithm 1 produced consistent estimates given sufficiently long time series.When the signal-to-noise ratio was large (scenario 1), most of the estimated background values were accurate even at small n, and at larger n our estimator reached efficiency and accuracy similar to the oracle estimator.Predictably, the median of the non-segmented data produced biased estimates, which nonetheless may be preferrable in some cases, as our estimate showed large variability at low n in scenarios 2 and 3.At n > 400, our estimator provides the lowest total error in all tested scenarios, even with the mis-specified cost function in scenario 3.

Segment positions estimated by Algorithm 1 are accurate and consistent
The same simulation setup was used to evaluate the consistency of estimated segment number and positions.Data were generated and analysed with Algorithm 1 as in Section 5.1.1,but we also retrieved the changepoint positions that were estimated in step 12.This corresponds to the online usage of the algorithm, in which segmentation close to the start of the data is based on early θ 0 estimates and may therefore itself be less precise.We extracted the mean number of segments reported by the algorithm and also calculated the true positive rate (TPR) as the fraction of simulations in which the algorithm reported at least 1 changepoint within 0.05n points of each true changepoint.
The results show that the TPR approaches 1 for all three scenarios (Table 1).As expected, detecting a strong segment was easier than multiple weak ones: segmentation in scenario 1 was accurate even at n = 30, while n > 500 was needed to reach ≥ 90% TPR in scenario 2. In scenario 3, the algorithm correctly detected changes at the true segment start and end (TPR ≈ 100%), but the algorithm tended to fit the segment as multiple ones, likely due to the heavy tails of the t distribution.Notably, skipping step 13 had very little impact on the performance of the algorithm, suggesting that the faster online version can be safely used.

Algorithm 2 recovers true signal segments under interference
To evaluate Algorithm 2, we generated time series with both signal and nuisance segments under two different scenarios: Scenario 1. Gaussian data with a signal segment overlapping a nuisance segment: n data points were drawn as x t ∼ N (θ S t +θ N t , 1), with θ S t = 2 when t ∈ (0.3n; 0.5n], 0 otherwise, Table 1: Consistency of the number and position of estimated changepoints.Time series simulated in three different scenarios were analysed using Algorithm 1.For each n, 500 replications were performed.
The mean number of reported segments and the TPR (fraction of iterations when a changepoint was detected within 0.05n of each true changepoint) are shown.The number of true segments was 1, 3, and 1 for scenarios 1, 2, 3 respectively.The same time series were also analysed in an online manner, i.e., without Step 13 of the algorithm; the results are shown on the right of the θ N t = 1 when t ∈ (0.2n; 0.4n], 0 otherwise 3 when t ∈ (0.5n; 0.6n] −3 when t ∈ (0.7n; 0.8n] 0 otherwise Time series were generated for n between 30 and 240, in 500 replications at each n.Each series was analysed by three methods.Algorithm 2 (proposed) was run with penalties β = β = 3 log(n) 1.1 as before.For classical detection of epidemic changes in mean, we used the R package anomaly (Fisch et al., 2018); the implementation allows separating segments of length 1, but we treated them as standard segments and set all penalties to 3 log(n) 1.1 .In both methods, background parameters were set to µ 0 = 0, σ 0 = 1, and maximum segment length was l = 0.33n in scenario 1 and l = 0.15n in scenario 2. As an example of a different approach, we included the narrowest-over-threshold detector implemented in R package not, with default parameters.This is a non-epidemic changepoint detector that was shown to outperform most comparable methods (Baranowski et al., 2019).Since it does not include the background-signal distinction, we define signal segments as regions between two successive changepoints where the mean exceeds µ 0 ± σ 0 .
As before, evaluation metrics were the mean number of segments and the TPR.For the proposed method, we required accurate segment type detection as well, i.e., a true positive is counted when the detector reported a nuisance start/end within 0.05n of each nuisance changepoint and a signal start/end within 0.05n of each signal changepoint.In scenario 1, we also extracted the estimate of θ corresponding to the detected signal segment (or segment closest to (0.3n; 0.5n] if multiple were detected).The average of these over the 500 replications is reported as θS .
To evaluate the effects of pruning, Algorithm 2 was applied without pruning, or with global pruning as in ( 11), with m ∈ (0; t−l) at each t.Only 5 out of 5000 runs (500 iterations × 10 settings) showed any differences between the globally-pruned and non-pruned methods (Supplementary Table S1 in Appendix D), so we only present results obtained with pruning from here on.
We observed that the proposed method (with pruning) successfully detected true signal segments in both scenarios (Figure 2 and Table 2).The number of nuisance detections was accurate in scenario 1, and slightly underestimated in favour of more signal segments in scenario 2, most likely because the simulated nuisance length was close to the cutoff of 0.15n.As expected, when nuisance segments are not included in the model (anomaly and not methods), they were identified as multiple changepoints; as a result, the number of segments was over-estimated up to 3-fold.These models are also unable to capture the signal-specific change in mean θ S : anomaly estimated θS = 3.99, and not estimated θS = 3.95 in scenario 1 at n = 240.These values correspond to the sum of the signal and nuisance effects.While the estimation is accurate and could be used to recover the value of interest by post-hoc analysis, our proposed method estimated the signal-specific change directly, as θS = 2.00 in scenario 1 at n = 240.

ChIP-seq
As an example application of the algorithms proposed in this paper, we demonstrate peak detection in chromatin immunoprecipitation sequencing (ChIP-seq) data.The goal of ChIPseq is to identify DNA locations where a particular protein of interest binds, by precipitating and sequencing bound DNA.This produces a density of binding events along the genome that then needs to be processed to identify discrete peaks.Typically, some knowledge of the expected peak length is available to the researcher, although with considerable uncertainty (Rashid et al., 2011).Furthermore, the background level may contain local shifts of various sizes, caused by sequencing bias or true structural variation in the genome (Zhang et al., 2008).The method proposed in this paper is designed for such cases, and can potentially provide more accurate and more robust detection.
We used datasets from two ChIP-seq experiments, investigating histone modifications in human immune cells.Broad Institute H3K27ac data was obtained from UCSC Genome Browser, GEO accession GSM733771, as mean read coverage in non-overlapping windows of 25 bp.While ground truth is not available for this experiment, we also retrieved an input control track for the same cell line from UCSC (GEO accession GSM733742).We analysed a window near the centromere of chromosome 1, between 120,100,000 to 120,700,000 bp.This window contains nuisance variation in the background level, as seen in the input control (Figure 3, top).To improve runtime, data was downsampled to approximately 1000 points, each corresponding to mean coverage in a window of 500 bp.All read coordinates in this paper correspond to hg19 genome build.
The second dataset was the UCI/McGill dataset, obtained from https://archive.ics.
uci.edu/ml/datasets/chipseq.This dataset was previously used for evaluating peak detection algorithms (Hocking et al., 2017(Hocking et al., , 2018)).Mean read coverage at 1 bp resolution is provided, as well as peak annotations based on visual inspection.The annotations are weak labels in the sense that they indicate peak presence or absence in a region, not their exact positions, to acknowledge the uncertainty when labelling visually.From this data, we used the H3K36me3 modification in monocyte sample ID McGill0104, AM annotation.Around the labels L = {(s i , e i )} in each chromosome, we extracted read coverage for the window between s − (e − s) and e + (e − s) bp, s = min s i , e = max e i , and downsampled to about 1000 points as before.Based on visual inspection of the coverage, we chose a group of labels in chromosome 12, which provides a variety of annotations and a structure that appears to contain both nuisance shifts and signal peaks.
The ChIP-seq datasets were analysed by the method proposed here (Algorithm 2), an epidemic detector from package anomaly, and the non-epidemic detector not.The length of the signal segments was limited to 50 kbp in anomaly and the proposed method.As an estimate of global mean µ 0 , the median of the unsegmented data was used, and σ 0 estimated by the standard deviation of non-segmented data.As before, penalties were set to 3 log(n) 1.1 .
Only segments with estimated θ > µ 0 are shown, as we are a priori interested only in regions of increased binding.
We also used GFPOP, implemented in R package PeakSegDisk (Hocking et al., 2018): this detector has been developed specifically for ChIP-seq data processing, and models a change in Poisson rate parameter.This method does not include a single background level, but enforces alternating up-down constraints.It is intended as a supervised method, with the penalty value λ chosen based on the best segmentation provided by the training data.Therefore, for the Broad Institute dataset, we repeated the segmentation with λ ∈ {10 1 , 10 2 , 10 3 , 10 4 , 10 5 , 10 6 }, and show the λ that produces between 2 and 10 segments.
To evaluate the results quantitatively, we calculated the SIC based on each method's segmentation.We used Gaussian likelihood with parameters matching the detection status (i.e., estimated mean for the points in each segment μ0 for points outside segments, and σ0 as the standard deviation for all points).The number of parameters was set to 3k + 2: a mean and two endpoints for each of the k segments reported, and two for the background parameters.
In the H3K27ac data, all methods detected the three most prominent peaks, but produced different results for smaller peaks and more diffuse change areas (Figure 3, bottom).Both PeakSegDisk and anomaly marked a broad segment in the area around 120,600,000 bp.
Based on comparison with the control data, this change is spurious, and it exceeds the 50 kbp bound for target segments.While this bound was provided to the anomaly detector, it does not include an alternative way to model these changes, and therefore still reports one or more shorter segments.In contrast, our method accurately modelled the area as a nuisance segment with two overlapping sharp peaks.
Using not, the data was partitioned into 10 segments.By defining segments with low mean (θ < μ0 + σ0 ) as background, we could reduce this to 4 signal segments; while this removed the spurious background change, it also discarded the shorter change around 120,200,000 bp, which fits the definition of a signal peak (< 50 kbp) and was retained by the proposed method.This data illustrates that choosing the post-processing required for most approaches is not trivial, and can have a large impact on the results.In contrast, the parameters required for our method have a natural interpretation and may be known a priori or easily estimated, and the outputs are provided in a directly useful form.This data illustrates that choosing the post-processing options is not trivial, and can have a large impact on the results.In contrast, the parameters required for our method have a natural interpretation and may be known a priori or easily estimated, and the outputs are provided in a directly useful form.
Figure 3: ChIP-seq read counts and analysis results.Counts provided as mean coverage in 500 bp windows for a non-specific control sample (top) and H3K27ac histone modification (bottom), chromosome 1.Segments detected in the H3K27ac data by the method proposed here (Algorithm 2) and three other detectors are shown under the counts.Note that the proposed method can also produce longer nuisance changes (yellow) overlapped by signal segments (blue).not does not specifically identify background segments; we show the ones with relatively low mean in light blue.
In the UCI data, segment detections also generally matched the visually determined labels.However, our method produced the most parsimonious models to explain the changes.
Figure 4 shows such an example from chromosome 12, where our method reported two nuisance segments and a single sharp peak around 62,750,000 bp.The nuisance segments correspond to broad regions of mean shift, which were also detected by anomaly and not, but using 6 and 16 segments, respectively.Notably, PeakSeg differed considerably: as this method does not incorporate a single background level, but requires segments to alternate between background and signal, the area around 62,750,000 bp was defined as background, despite having a mean of 4.5 μ0 .In total, 12 segments were reported by this method.
This shows that the ability to separate nuisance and signal segments helps produce more parsimonious models, and in this way minimises the downstream efforts such as experimental replication of the peaks.
The visual annotations provided for this region are shown in the first row in Figure 4.
Note that they do not distinguish between narrow and broad peaks (single annotations in this sample range up to 690 kb in size).Furthermore, comparison with such labels does not account for finer segmentation, coverage in the peak area, or the number of false alarms outside it.For these reasons we are unable to use the labels in a quantitative way.
Quantitative comparison of the segmentations by SIC also favours our proposed method in both datasets.In the Broad dataset, SICs corresponding to segmentations reported by PeakSeg, not and anomaly were 4447.4,4532.2, and 4311.6,respectively, while the segmentation produced by our model had an SIC of 4285.8.The smallest criterion values in UCI data were also produced by our method (4664.5),closely followed by anomaly (4664.8), while not segmentation resulted in an SIC of 4705.7 and PeakSeg 6886.7.This suggests that in addition to the practical benefits of separating unwanted segments, the nuisancesignal structure provides a better fit to these datasets than models that allow only one type of segments.
Figure 4: A window of H3K36me3 ChIP-seq data on chromosome 10.Read coverage in 1100 bp windows (black points), manual annotations of peaks included in the dataset (boxes below), and detection results using Algorithm 2 proposed in this paper, as well as three state-of-the-art methods (lines at the bottom).

European mortality data
The recent pandemic of coronavirus disease COVID-19 prompted a renewed interest in early outbreak detection and quantification.In particular, analysis of mortality data provided an important resource for guiding the public health responses to it (e.g.Baud et al., 2020;Yuan et al., 2020;Dowd et al., 2020).
We analysed Eurostat data of weekly deaths in Spain over a three year period between 2017 and 2020.Data was retrieved from https://ec.europa.euData Explorer.Besides the impact of the pandemic, mortality data contains normal seasonal variations, particularly in older age groups.We use the 60-64 years age group in which these trends are visually clear and thus provide a ground truth.
Data were analysed with the four methods introduced earlier.For the proposed and anomaly methods, we used the median and standard deviation of the first 52 weeks of the dataset as estimates of µ 0 and σ 0 , respectively.Penalties in both were set to 3 log(n) 1.1 , as previously.The maximum length of signal segments was set to 10 weeks, to separate seasonal effects.In addition, not was used with default parameters, defining signal as regions where the mean exceeds µ 0 ± σ 0 , and PeakSegDisk with a basic grid search to select the penalty as before.
The results of the four analyses are shown in Figure 5. Three of the methods, anomaly, PeakSeg, and Algorithm 2, detected a sharp peak around the pandemic period.However, anomaly and PeakSeg also marked one winter period as a signal segment, while ignoring the other two.Four segments were created by not, including a broad peak continuing well past the end of the pandemic spike.In contrast, the proposed method marked the pandemic spike sharply, while also labelling all three winter periods as nuisance segments.The resulting detection using our method is again parsimonious and flexible: if only short peaks are of interest, our method reports those with lower false alarm rate than the other methods, but broader segments are also marked accurately and can be retrieved if relevant.
As in the ChIP-seq data, comparing the results by SIC identifies our method as optimal for this dataset.The values corresponding to PeakSeg, not and anomaly models were 1629.3,1648.2, and 1626.3 respectively, while Algorithm 2 produced a segmentation with SIC 1568.2.Note that the SIC penalizes both signal and nuisance segments, so in this case our model still appears optimal despite having more parameters.

Discussion
In this paper, we have presented a pair of algorithms for improving detection of epidemic changepoints.Similarly to stochastic gradient descent, the iterative updating of the background estimate in Algorithm 1 leads to fast convergence while allowing large fraction of non-background points.This is utilised in Algorithm 2 to analyse nuisance-signal overlaps.
The computational complexity of both algorithms presented here is O(n) in the best case, which is similar to state-of-the-art pruned algorithms (Killick et al., 2012;Hocking et al., 2018).However, note that this is stated in the number of required evaluations of C.
It is usually implicitly assumed that this function can be evaluated and minimised over θ recursively, so that the total number of operations may also be linear.This would not be achievable with methods that estimate the background level strictly offline, such as aPELTprofile (Zhao and Yau, 2019).Therefore, development of Algorithm 1 was essential to create the overlap detector.
One of major practical benefits of the proposed model is the ability to separate non-target segments.We anticipate that this will greatly improve downstream processing, effectively reducing the false alarm rate or the manual load if the detections are reviewed.Despite that, it is difficult to evaluate this benefit at present: while there are recent datasets with annotations specifically for testing changepoint detection (Hocking et al., 2017;van den Burg and Williams, 2020), they are based on labelling all visually apparent changes.In future work, we expect to provide further application-specific comparisons that would measure the impact of separating and neutralising the nuisance process.A Proof of Theorem 1 (Convergence of Algorithm 1) Bottou (1998) analyses the case of an online algorithm iteratively minimising some function f (x, w) (where x represents the complete data and w the parameters).Data points {x t } arrive sequentially, and at each iteration an estimate w t of the location of the minimum w * is obtained using some update function H(x, w) and learning rate γ t as: This updating mechanism gives rise to stochastic gradient descent if EH(x t+1 , w t ) = ∇ w f (x, w), but for the following argument this is not required.
To make the link with Algorithm 1 explicit, the update equation applied by this algorithm can be written as: Then w * = θ 0 (i.e., the background mean that is to be estimated), and we ask whether the sequence of updates converges w t → w * .It was shown by Bottou (1998) that this occurs almost surely if the following three conditions are met: 1. "convexity" -a single optimum w * exists and the expected value of the updates always points towards it: 2. learning rate convergence: 3. bounded variance of the updates: Thus, proof of convergence of our algorithm reduces to showing that these requirements are satisfied.We start with the assumption that the global mean of segment points is also θ 0 , and then relax this requirement.
The following lemma will be needed: Lemma 1.Let f be a unimodal distribution, symmetric around a point µ (so that f (x 1 ) < f (x 2 ) when x 1 < x 2 ≤ µ and f (x 1 ) > f (x 2 ) when µ ≤ x 1 < x 2 ), such as a Gaussian.
Consider a truncated random variable X with pdf: Proof.
EX − m = A.1 When the global mean of segments matches the background mean Consider the case that the background points are independent draws from N (w * , σ 2 ), so that the points within each segment are N (θ i , σ 2 ), with σ 2 known, and θ ∼ N (w * , τ 2 ).Let w t be the value of the background mean estimated by Algorithm 1 after processing t data points.
In this case w t a.s.
Proof.Denote the true class of the next data point x t+1 by δ t+1 (1 for background points, 0 for signal).Algorithm 1 estimates this as: Initially, assume for simplicity that the true maximum segment length is 1 (and so only k = 1 is tested).When δt+1 = 1, the background estimate is updated as: (otherwise w t+1 = w t ).So γ t = 1/( t i=1 δi + 1), and hence the learning rate convergence conditions ( 14) are satisfied.
Substituting in the costs based on a one-dimensional Gaussian pdf φ, δt+1 = 1 if: The distribution of true segment data points f (x t+1 |δ t+1 = 0) is symmetric with mean w * by assumption (as is easily verified for the Gaussian case).The background point distribution is also automatically symmetric.Thus, the overall distribution of the points used to update the w t estimate is a truncation of a symmetric unimodal distribution.In the present case, it is a truncated normal with limits (w t − σ √ 2β, w t + σ √ 2β), based on ( 17); more generally it is a truncated variant of the parent distribution with symmetric limits of the form w t ± a, and parent mean w * (that the acceptance set is an interval follows from the unimodality of f ).
This means that f (x t+1 | δt+1 = 1) satisfies the requirements for Lemma 1 with µ = w * , which implies the "convexity" condition (13): inf Following a similar approach -conditioning on δ t+1 -and using the law of total variance it can be shown that the variance of E(w − x t+1 ) is finite, as required for the condition (15), and so w t a.s.
Remark.So far, we assumed that segment length k = 1.If segments occur and are tested in non-overlapping windows of any fixed size k ≥ 2, the result is similar: ∀j ∈ where xk = t i=t−k+1 x i+1 /k.This can be expressed as truncation limits for accepted xk , analogously to (17): definition of F and additivity of C, we have: where d is some distance function, x−t is the mean of points x t−k+1:t−1 , and A(t) ≥ 0 is a constant depending only on x 1:t .
It is also helpful to note that F (t − 1) ≤ F (t − k) + C 0 (x t−k+1:t−1 ), hence: This corresponds to the case when all x t−k+1:t−1 were identified as background.
Using the Gaussian cost, i.e. d(a, b) = (a − b) 2 /2, and recursive formula for the mean, the acceptance condition for k becomes: By substituting in the value of A(t − 1) from ( 21), we obtain the following lower bound for P (x ∈ S k |x): Thus, we have the following bound for P r at any k: Therefore, as N increases, 1 − P r grows faster than 1 − EX 1 /w t = 1 − wt+p wt−p x 2 f (x)dx.This means that ∃p 0 , and thus ∃n 0 , such that for n > n 0 , and thus p > p 0 , (20) holds.
So far we assumed EX r = µ.Clearly, for larger values of EX r , EX a is even smaller and ( 13) is satisfied.
For the case when EX r < µ, consider the worst case scenario EX r = w t − p (this is the bound to X 1 , and thus to X r , imposed by S 1 ).Similarly to (19), we need: However, based on (23), P r p/w t ≤ wt+p wt−p O(x 2 /p)f (x)dx, so again ∃n > n 0 such that EX a < w t , and condition (13) holds.
And so overall E(x t | δt = 1) satisfies condition (13).Since the distribution of accepted x t still has bounded support imposed by S 1 , condition (15) still holds, and the learning rate condition (14) holds as before, implying convergence.

A.2 When the global mean of segments does not match the background mean
Consider now θ ∼ N (µ, τ 2 ) for some µ = w * , in particular µ > w * , so that the overall mean of segment points is E(θ) > w * .Then if w * < w t < E(θ), any segment points that were misclassified as background will (on average) push the estimates away from the background mean, in violation of the "convexity" condition (13).
We assume that each segment point is followed by no less than n background points.
Proof.Suppose that a misclassification at time T is followed by n correctly classified background points: δ T = 0, δ t = 1 for t ∈ [T + 1; T + n], δt = 1 for t ∈ [T ; T + n].For the points t ∈ [T + 1; T + n], almost sure convergence of w t was established above, i.e. for all > 0, there exists a t 0 such that ∀t : n ≥ t ≥ t 0 , P (|w T +t − w * | < ) = 1.Therefore, given n ≥ t 0 : Indexing the segment-background cycles by i, denote the first estimate of that segment by w i , so the set of these estimates are: The elements of this sequence can be expressed recursively as: with {x i } = {x t : i(n + 1) ≤ t ≤ i(n + 1) + n}.
− − → w * .(When n → ∞, the convergence conditions are satisfied directly without using the sequence {w i }.)

A.3 Martingale Approach
We can also describe the update process over the background points using martingales.The algorithm estimates are random variables w t ; let {W t } be the sequence of σ-algebras such that for each t, w t is measurable with respect to W t .Using Lemma 1, and assuming w * < w t again, within each cycle the estimates comprise a supermartingale E(w t+1 |W t ) < w t over the points T ≤ t < min(T + n, F HT w (w * )), here F HT x (a) = inf{t : x t ≤ a} is the first hitting time of the process realisation {x t } to value a.
Consider again the problematic case when the global mean does not match the background mean and misclassification pushes the estimate away from the background mean, i.e. w * < w T −1 < w T < E(θ).In order for w i to converge, we need the perturbed estimates to return to a value below w T −1 in each cycle.At the extremes, we have: F HT w (w T ) = T starting position F HT w (w T −1 ) < ∞ for sufficiently large n, because w t a.s.
Clearly, F HT w (w T −1 ) ≤ F HT w (w * ).However, the number of background points n required to satisfy F HT w (w T −1 ) < T + n will depend on five factors: the distribution f B and penalty p (since they determine the distribution of update values H), the size of estimated background set at time T (as it determines the relevant γ t ), and w T −1 and w T .
In practice, n is bounded by the available data, so there is a non-zero probability that, over the segment-background cycles indexed by i: Similar reasoning applies when E(θ) < w T < w T −1 < w * .
This lemma, together with results established for classical changepoint detection, can be used to show that the cost of any inconsistent solution will exceed the cost based on true segment positions and parameters (Proposition 8 in Fisch et al. (2018)): Proposition 3 ( (Fisch et al., 2018)).Define {τ } to be any set of segments {(s i , e i )} that does not satisfy the consistency event in (7).Let θ = argmin θ F (n; θ) be the parameters estimated by minimising the for a given segmentation (i.e. the vector of means and/or variances of x s i :e i for each i).Assume E holds.Then there exist constants C 4 and n 2 such that, when n > n 2 : F (n; {τ }, θ, μ, σ) ≥ F (n; {τ }, θ, μ, σ) + C 3 log(n) 1+δ/2 .
See the original publication for a detailed proof of these results.
Thus, when Proposition 3 holds, we have: F (n; {τ }, θ, μ, σ) > F (n; {τ }, θ, μ, σ), and an exact minimisation algorithm will always find a solution in the consistent set.The overall probability of the events required for Proposition 3 is a combination of P (E), established before, and (25), which by Boole's inequality is: for any > 0, n > n 3 and some constants n 3 , C 5 .
Therefore, with the same probability, s j / ∈ s j +A(n)−1 t=s j k pr,t .Also, s j / ∈ n t=s j +A(n) k pr,t because then s j ≤ t − A(n) and is not considered in the pruning scheme, and clearly s j / ∈ s j −1 t=1 k pr,t .The case for e j follows by symmetry, and obviously no true changepoint can be pruned out if s j , e j = ∅, so the overall probability of retaining a true changepoint remains at P ≥ 1 − Bn − .

Figure 1 :
Figure 1: Consistency of the background parameter values estimations.Time series simulated in three different scenarios were analysed by Algorithm 1 (shown in red).Lines are the inter-quartile range (solid) and 2.5-97.5% range (faint) of the background parameter estimates observed in 500 replications at each n.For comparison, we also show the ranges of estimates obtained by the median (blue) and mean of true background points (green; an oracle estimator).

Figure 2 :
Figure 2: Relative bias in the number of changepoints estimated by the proposed Algorithm 2 (pruned), and two alternative detectors: anomaly and not.Data simulated for two scenarios, in 500 replications for each n.For the proposed algorithm, bias is calculated separately in signal and nuisance segments.

Figure 5 :
Figure 5: Weekly deaths in Spain, in the 60-64 years age group, over 2017-2020 (black points).Detection results using the method proposed in this paper and three alternative methods shown as lines below.
(m + y) − f (m − y))dy.(16)When m + a < µ, f is increasing throughout the integration range, and EX − m > 0; the opposite is true for m − a > µ.If m − a < m < µ < m + a, split the integral in (16) as:EX − m = µ−m 0 y(f (m + y) − f (m − y))dy + a µ−m y(f (m + y) − f (m − y))dy.The first integral covers the range where f is increasing, and thus is positive.Since µ−m > 0, |m + y − µ| < |m − y − µ| for y > 0, and f (m + y) > f (m − y) by symmetry of f around µ and monotonicity, so the second interval is positive as well.Similarly, EX − m < 0 for m − a < µ < m < m + a.

F
HT w (w T i −1 ) > T i + n.In that case, define b = min{a : ∀i, F HT w (a) ≤ T i + n}; the final estimate of w i will be bound by [w * , b].As n increases, P (|w * − b| > ) → 0 for any > 0.

table .
Gaussian data with a nuisance segment and two non-overlapping signal segments: n data points were drawn as x t ∼ N (θ S t + θ N t , 1), with:

Table 2 :
The true positive rate of changepoint estimation by the proposed Algorithm 2 (pruned), and two alternative detectors: anomaly and not.Data simulated for two scenarios, in 500 replications for each n.The true positive rate is the fraction of iterations when a changepoint of correct type was detected within 0.05n of each true changepoint (types are signal, S, and nuisance, N).