Detecting multiple generalized change-points by isolating single ones

We introduce a new approach, called Isolate-Detect (ID), for the consistent estimation of the number and location of multiple generalized change-points in noisy data sequences. Examples of signal changes that ID can deal with are changes in the mean of a piecewise-constant signal and changes, continuous or not, in the linear trend. The number of change-points can increase with the sample size. Our method is based on an isolation technique, which prevents the consideration of intervals that contain more than one change-point. This isolation enhances ID's accuracy, as it allows for detection in the presence of frequent changes of possibly small magnitudes. In ID, model selection is carried out via thresholding, or an information criterion, or SDLL, or a hybrid involving the former two. The hybrid model selection leads to a general method with very good practical performance and minimal parameter choice. In the scenarios tested, ID is at least as accurate as the state-of-the-art methods; most of the time it outperforms them. ID is implemented in the R packages IDetect and breakfast, available from CRAN. Supplementary Information: The online version contains supplementary material available at 10.1007/s00184-021-00821-6.


Introduction
Change-point detection is an active area of statistical research that has attracted a lot of interest in recent years. According to the National Research Council (2013), detecting changes in a data sequence in order to extract information about the underlying signal will continue to play an essential role in the development of the mathematical sciences. A non-exhaustive list of application areas includes financial econometrics (Bai and Perron, 2003; Schröder and Fryzlewicz, 2013); credit scoring (Bolton and Hand, 2002; Curry et al., 2007); bioinformatics (Futschik et al., 2014; Muggeo and Adelfio, 2011; Olshen et al., 2004) and cyber security (Siris and Papagalou, 2006; Tartakovsky et al., 2006). Our work's focus is on a posteriori change-point detection, where the aim is to estimate the number and locations of certain changes in the behavior of the data. We work in the model

X_t = f_t + σ ε_t, t = 1, 2, . . ., T, (1.1)

where X_t are the observed data and f_t is a one-dimensional, deterministic signal with structural changes at certain points. Two examples are: change-points in the level when f_t is piecewise-constant, and change-points in the first derivative when f_t is piecewise-linear, with or without the continuity constraint. We highlight, however, that our methodology and analysis apply to more general scenarios, for instance the detection of knots in a piecewise polynomial signal of order k, where k is not necessarily equal to zero (piecewise-constant mean) or one (piecewise-linear mean). The number N of change-points as well as their locations r_1, r_2, . . ., r_N are unknown and our aim is to estimate them. In addition, N can grow with T. The random variables ε_t in (1.1) have mean zero and variance one; further assumptions will be given later.
When f_t is assumed to be piecewise-constant, the existing change-point detection techniques are mainly split into two categories based on whether the change-points are detected all at once or one at a time. The former category mainly includes optimization-based methods, in which the estimated signal is chosen based on its least squares or log-likelihood fit to the data, penalized by a complexity rule in order to avoid overfitting. The most common example of a penalty function is the Schwarz Information Criterion (SIC); see Yao (1988) for details. To solve the implied penalization problem, dynamic programming approaches, such as the Segment Neighborhood (SN) and Optimal Partitioning (OP) methods of Auger and Lawrence (1989) and Jackson et al. (2005), have been developed. In an attempt to improve on OP's computational cost, Killick et al. (2012) introduce the PELT method, based on a pruning step applied to OP's dynamic programming approach. A non-parametric adaptation of PELT is given in Haynes et al. (2017). Rigaill (2015) introduces an improvement over classical SN algorithms, through a pruning approach called PDPa, while Maidstone et al. (2017b) give two algorithms by combining ideas from PELT and PDPa.
In the latter category, in which change-points are detected one at a time, a popular method is binary segmentation, which performs an iterative binary splitting of the data on intervals determined by the previously obtained splits. Vostrikova (1981) introduces and proves the validity of binary segmentation in the setting of change-point detection for piecewise-constant signals. Among others, binary segmentation is used for change-point detection in Chen and Gupta (1997), Yang and Swartz (2005), Fryzlewicz and Subba Rao (2014), and Badagián et al. (2015). The main advantages of binary segmentation are its conceptual simplicity and low computational cost. However, at each step of the algorithm, binary segmentation looks for a single change-point, which leads to its suboptimality in terms of accuracy, especially for signals with frequent change-points. Some variants of binary segmentation that work towards solving this issue are the Circular Binary Segmentation (CBS) of Olshen et al. (2004), the Wild Binary Segmentation (WBS) of Fryzlewicz (2014) and the Narrowest-Over-Threshold (NOT) method of Baranowski et al. (2018).
CBS searches for at most two change-points at each step of the segmentation algorithm.
Instead of initially calculating the contrast value for the whole data sequence, WBS and NOT are based on a random draw of subintervals of the domain of the data, on which an appropriate statistic is tested against a threshold. Apart from binary-segmentation-related approaches, this category also includes methods that control the False Discovery Rate (FDR). For instance, in Li et al. (2016), the FDRSeg method is introduced as a combination of FDR-control and global segmentation methods in a multiscale way. The "pseudo-sequential" (PS) procedure of Venkatraman (1992), as well as the CPM method of Ross (2015), are based on a complete embodiment of online detection algorithms in a posteriori situations and work by bounding the Type I error rate of falsely detecting change-points. Some methods do not fall in either category. For example, the tail-greedy algorithm in Fryzlewicz (2018) achieves a multiscale decomposition of the data using Unbalanced Haar wavelets in an agglomerative way.
Going beyond the detection of change-points in piecewise-constant signals, the literature becomes poorer, which is surprising given the importance of change-point detection in general frameworks. For example, the detection of changes in the first derivative of a continuous piecewise-linear signal is used in tracking the health progress of patients for variables such as the heart rate, electroencephalogram, and electrocardiogram (Aminikhanghahi and Cook, 2017). Other examples come from estimating trends for banks' monetary statistics (Bianchi et al., 1999), climate change (Robbins et al., 2011), and time series of AIDS cases (Stasinopoulos and Rigby, 1992). For change-point detection in a continuous piecewise-linear signal, the principle of minimizing the residual sum of squares subject to a penalty has been used in Bai and Perron (1998), in the trend filtering (TF) approach (Kim et al., 2009; Tibshirani, 2014), as well as in the dynamic programming algorithm CPOP (Maidstone et al., 2017a). The NOT approach of Baranowski et al. (2018) has also been shown to lead to consistent estimation of change-points in such general scenarios.

The terms subinterval and interval will be used interchangeably. Although a detailed explanation of our methodology is provided in Section 2.1, the basic idea is that for an observed data sequence of length T and with λ_T a suitably chosen positive constant, ID first creates two ordered sets of K = ⌈T/λ_T⌉ right- and left-expanding intervals as follows. The j-th right-expanding interval is R_j = [1, jλ_T], while the j-th left-expanding interval is L_j = [T − jλ_T + 1, T]. We collect these intervals in the ordered set S^{RL} = {R_1, L_1, R_2, L_2, . . ., R_K, L_K}. The NOT and WBS methods also include localization ideas; however, the nature of localization in ID means that it is an order of magnitude faster than these two. Furthermore, in NOT and WBS, it is not certain that the intervals drawn will cover the whole data domain without ignoring areas that include change-points. This is an issue of fundamental importance, especially in signals with a large number of change-points, for which NOT and WBS need to increase the number M of intervals drawn. However, doing this also increases the computational cost. In contrast, due to its interval expansion approach, ID will certainly examine all possible change-point locations. No choice of M is required, which leads to better practical performance with more predictable execution times.
The paper is organized as follows. Section 2 gives a formal explanation of the ID methodology along with two different scenarios of use and the associated theory. In Section 3, we first discuss the computational aspects of ID and the choice of parameter values. ID variants, which lead to improved practical performance, are also explained. In Section 4, we provide a thorough simulation study to compare ID with state-of-the-art methods. Real-life data examples are provided in Section 5.
Methodology and Theory

Methodology
The model is given in (1.1) and the unknown number, N, of change-points can possibly grow with T. Let r_0 = 0 and r_{N+1} = T, and let δ_T = min_{j=1,2,...,N+1} |r_j − r_{j−1}|. For clarity of exposition, we start with a simple example before providing a more thorough explanation of how ID works. Figure 2.1 covers a specific case of two change-points, r_1 = 38 and r_2 = 77.
We will be referring to Phases 1 and 2, involving six and four intervals, respectively. These are clearly indicated in the figure and they are only related to this specific example, as cases with more change-points would involve more such phases. At the beginning, s = 1, e = T = 100, and we take λ_T = 10 (how to choose λ_T is described in Section 3.2).
Suppose the threshold ζ_T has been chosen well (more details in Section 3.2), so that r_2 gets detected in {X_{s*}, X_{s*+1}, . . ., X_e}, where s* = 71. After the detection, e is updated as the start-point of the interval where the detection occurred; therefore, e = 71. In Phase 2, indicated in the figure, ID is applied in [s, e] = [1, 71].

We now describe ID more generically. For each change-point, r_j, ID works in two stages: firstly, isolating r_j in an interval that hopefully contains no other change-point, and secondly, detecting r_j through the use of a suitably chosen contrast function, which is denoted by C^b_{s,e}(X), for every integer triple (s, e, b) with 1 ≤ s ≤ b < e ≤ T. Heuristically, the value of C^b_{s,e}(X) is relatively small if b is not a change-point. For instance, in piecewise-constant signals, the contrast function reduces to the absolute value of the CUSUM statistic defined in (2.3), while for the case of continuous, piecewise-linear signals, the contrast function is given in Section 2.2.2.
To achieve isolation, we employ the idea of interval expansion in the following sense: for λ_T < δ_T and with K = ⌈T/λ_T⌉, let c^r_j = jλ_T and c^l_j = T − jλ_T + 1 for j = 1, 2, . . ., K − 1, while c^r_K = T and c^l_K = 1. For a generic interval [s, e], let us define the sequences

I^r_{s,e} = (c^r_{k_1}, c^r_{k_1+1}, . . ., e),   I^l_{s,e} = (c^l_{k_2}, c^l_{k_2+1}, . . ., s),   (2.1)

where k_1 := argmin_{j∈{1,2,...,K}} {jλ_T > s} and k_2 := argmin_{j∈{1,2,...,K}} {T − jλ_T + 1 < e}. In each expanding interval [s*_j, e*_j], the maximum with respect to b of C^b_{s*_j,e*_j}(X) is tested against a threshold ζ_T. At some point in this interval expansion process, there will be an interval [s*_k, e*_k], with k ∈ {1, 2, . . ., 2K}, which contains only one change-point; this being either r_1 or r_N, depending on whether r_1 is closer to s, or r_N is closer to e. The interval [s*_k, e*_k] will not contain any other change-points, because at each step we expand the intervals by the quantity λ_T, which is smaller than the minimum distance between two change-points. The practical choice of λ_T is described in Section 3.2.

The idea of a posteriori change-point detection in which change-points are detected sequentially has appeared previously in the literature; examples are the PS and CPM methods of Venkatraman (1992) and Ross (2015), respectively. ID is conceptually and in practice different from these methods in a number of ways related to the threshold choice, the construction of the estimated change-point locations, as well as the way PS and CPM restart upon detection. Furthermore, ID's isolation technique does not appear in CPM. A comparison between the performance of ID and that of state-of-the-art methods (including CPM) is given in Section 4.
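The interval-expansion construction above can be sketched computationally as follows. The function name and the local handling of the cut-off points are ours; the paper works with the global grids c^r_j and c^l_j, which this sketch restricts to a generic interval [s, e].

```python
import math

def expanding_intervals(s, e, T, lam):
    """Right- and left-expanding intervals used by ID on [s, e] (1-indexed).

    Right-expanding intervals keep the start fixed at s and end at the grid
    points c^r_j = j * lam (last one capped at e); left-expanding intervals
    keep the end fixed at e and start at c^l_j = T - j * lam + 1 (last one
    floored at s)."""
    K = math.ceil(T / lam)
    c_r = [j * lam for j in range(1, K)] + [T]
    c_l = [T - j * lam + 1 for j in range(1, K)] + [1]
    right_ends = [c for c in c_r if s < c < e] + [e]
    left_starts = [c for c in c_l if s < c < e] + [s]
    right = [(s, b) for b in right_ends]   # [s, c^r_{k_1}], [s, c^r_{k_1+1}], ...
    left = [(a, e) for a in left_starts]   # [c^l_{k_2}, e], [c^l_{k_2+1}, e], ...
    return right, left
```

For the example of Figure 2.1 (T = 100, λ_T = 10), the first right-expanding interval is [1, 10] and the first left-expanding interval is [91, 100]; both families terminate in the full interval [1, 100].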

Theoretical behavior of ID
We work under the following assumption.

(A1) The random sequence {ε_t}_{t=1,2,...,T} is independent and identically distributed (i.i.d.) from the normal distribution with mean zero and variance one.
The assumption of i.i.d. normal random variables in (A1) is made for technical convenience; Section 3.5 shows how to use ID under non-Gaussianity. The assumption that σ = 1 is not restrictive. If σ is unknown, then we need to estimate it. In the cases of piecewise-constant and piecewise-linear signals, σ can be estimated via the Median Absolute Deviation method proposed in Hampel (1974). For simplicity of exposition, let σ = 1, so that (1.1) becomes

X_t = f_t + ε_t, t = 1, 2, . . ., T.   (2.2)

With r_0 = 0 and r_{N+1} = T, and for j = 1, 2, . . ., N + 1, we examine the theoretical behavior of ID in the following two illustrative cases.

Piecewise-constant signals: f_t = μ_j for t = r_{j−1} + 1, r_{j−1} + 2, . . ., r_j, and f_{r_j} ≠ f_{r_j+1}.

Continuous, piecewise-linear signals: f_t is linear on each segment [r_{j−1} + 1, r_j], continuous over the whole domain, and its slope changes at every change-point r_j.

The above scenarios are only examples in which the ID methodology can be applied. The isolation aspect of the method allows its application to various different cases, such as the estimation of the number and the position of knots (either continuous or not) in piecewise polynomial functions. The latter problem is considered, for example, in Tibshirani (2014), where the Trend Filtering (TF) approach of Kim et al. (2009) is used and compared to smoothing and locally adaptive regression splines.
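As a concrete illustration of the MAD idea for piecewise-constant signals, the following sketch estimates σ from first differences. The differencing step and the constants 1.4826 and √2 are the standard Gaussian-consistency choices, stated here as our assumptions rather than quoted from the paper.

```python
import math
import statistics

def mad_sigma(x):
    """MAD-based estimate of the noise level sigma for a piecewise-constant
    signal plus noise. Differencing removes the piecewise-constant mean
    (except at the few change-points, which barely affect the median);
    1.4826 makes the MAD consistent for a Gaussian sigma, and 1/sqrt(2)
    undoes the variance inflation caused by differencing."""
    d = [b - a for a, b in zip(x, x[1:])]
    med = statistics.median(d)
    mad = statistics.median(abs(v - med) for v in d)
    return 1.4826 * mad / math.sqrt(2)
```

Dividing the data by this estimate reduces the problem to the σ = 1 setting assumed above.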

Piecewise-constant mean signals
Under piecewise-constancy, the contrast function used is the absolute value of the CUSUM statistic, the latter being

X̃^b_{s,e} = √((e − b)/(n(b − s + 1))) Σ_{t=s}^{b} X_t − √((b − s + 1)/(n(e − b))) Σ_{t=b+1}^{e} X_t,   (2.3)

where 1 ≤ s ≤ b < e ≤ T and n = e − s + 1.

Theorem 2.1. Let {X_t}_{t=1,2,...,T} follow model (2.2), with f_t being a piecewise-constant signal, and assume that (A1) and (A2) hold. Let N and r_j, j = 1, 2, . . ., N, be the number and locations of the change-points, while N̂ and r̂_j, j = 1, 2, . . ., N̂, are their estimates, with the r̂_j sorted in increasing order. In addition, ∆^f_j = |f_{r_j+1} − f_{r_j}|, j = 1, 2, . . ., N. Then, there exist positive constants C_1, C_2, C_3, C_4, which do not depend on T, such that for C_1 √(log T) ≤ ζ_T ≤ C_2 √(δ_T) f_T and for a sufficiently large T, we obtain

P( N̂ = N, max_{j=1,2,...,N} { |r̂_j − r_j| (∆^f_j)^2 } ≤ C_3 log T ) ≥ 1 − C_4/T.   (2.4)

From (2.4), we notice that, in order to be able to match the estimated change-point locations with the true ones, δ_T should be at least as large as max_{j=1,2,...,N} |r̂_j − r_j|, meaning that δ_T must be at least O(log T). For this order of δ_T, Chan and Walther (2013) argue that the smallest possible √(δ_T) f_T that still permits detection is of order √(log(T/δ_T)). In our case, assumption (A2) ensures that the √(log T) rate is attained, which is optimal up to the double logarithmic term. This provides evidence that ID allows for detection in complex scenarios, such as limited spacings between change-points. The quantity on the right-hand side of (2.4) is O(1 − 1/T); the same order as in WBS and NOT. However, ID gives a significantly lower constant C_4 for the bound; see the proof in the online supplement for more details. The rate of the lower bound for the threshold is O(√(log T)) and this is what will be used in practice as the default rate: we use

ζ_T = C √(log T),   (2.5)

and the choice of the constant C will be explained in Section 3. Furthermore, (2.4) indicates that δ_T does not affect the rate of convergence of the estimated change-point locations; these only depend on ∆^f_j.
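The CUSUM contrast in (2.3) and its maximizer can be computed directly. A minimal sketch, 1-indexed to match the notation above (the quadratic scan in `best_split` is for clarity; the sums can of course be updated incrementally):

```python
import math

def cusum(x, s, e, b):
    """CUSUM statistic of (2.3) on x[s..e] (1-indexed, inclusive) at the
    candidate change-point b, with s <= b < e."""
    n = e - s + 1
    left = sum(x[s - 1:b])    # X_s + ... + X_b
    right = sum(x[b:e])       # X_{b+1} + ... + X_e
    return (math.sqrt((e - b) / (n * (b - s + 1))) * left
            - math.sqrt((b - s + 1) / (n * (e - b))) * right)

def best_split(x, s, e):
    """Argmax over b of the contrast |CUSUM|, i.e. the candidate
    change-point location within [s, e)."""
    return max(range(s, e), key=lambda b: abs(cusum(x, s, e, b)))
```

For a noiseless level shift, the maximizer sits exactly at the change-point, illustrating the heuristic that C^b_{s,e}(X) is small when b is not a change-point.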

Continuous piecewise-linear mean signals
Under Gaussianity, and with R^b_{s,e}(X) being the generalized log-likelihood ratio for all possible single change-points within [s, e), the idea is to find a contrast function C^b_{s,e}(X) which is maximized at the same point as R^b_{s,e}(X). The contrast function is constructed by taking inner products of the data with a contrast vector. In the case of continuous piecewise-linear signals, such a construction is given in Baranowski et al. (2018): starting from a vector which is linear with a kink at b + 1, we apply the Gram-Schmidt orthogonalization with respect to the linear vector γ_{s,e}(t) and the constant vector 1_{s,e}(t). Normalizing the obtained vector to unit Euclidean norm returns the contrast vector ψ^b_{s,e}(t) defined in (2.6). The best approximation, in terms of the Euclidean distance, of X_t in [s, e] is a linear combination of γ_{s,e}(t), 1_{s,e}(t), and ψ^b_{s,e}(t), which are mutually orthonormal. This orthonormality leads to R^b_{s,e}(X) = |<X, ψ^b_{s,e}>| = C^b_{s,e}(X) (Baranowski et al., 2018). For the consistency of ID in continuous piecewise-linear signals, we make the following assumption.

(A3) The minimum distance, δ_T, between two change-points and the minimum magnitude of the slope changes, f_T = min_{j=1,2,...,N} |2f_{r_j} − f_{r_j+1} − f_{r_j−1}|, are connected by the requirement δ_T^{3/2} f_T ≥ C √(log T), for a large enough constant C.

Theorem 2.2. Let {X_t}_{t=1,2,...,T} follow model (2.2) with f_t being a continuous, piecewise-linear signal, and assume that (A1) and (A3) hold. We denote by N and r_j, j = 1, 2, . . ., N, the number and locations of the change-points, while N̂ and r̂_j, j = 1, 2, . . ., N̂, are their estimates, with the r̂_j sorted in increasing order. Also, we denote ∆^f_j = |2f_{r_j} − f_{r_j+1} − f_{r_j−1}|. Then, there exist positive constants C_1, C_2, C_3, C_4, which do not depend on T, such that for C_1 √(log T) ≤ ζ_T ≤ C_2 δ_T^{3/2} f_T and for sufficiently large T,

P( N̂ = N, max_{j=1,2,...,N} { |r̂_j − r_j| (∆^f_j)^{2/3} } ≤ C_3 (log T)^{1/3} ) ≥ 1 − C_4/T.   (2.7)

The quantity on the right-hand side of (2.7) is O(1 − 1/T). In addition, in the common case of f_T ~ T^{−1}, ID's change-point detection rate is O(T^{2/3} (log T)^{1/3}), as can be seen from (2.7). This differs from the O(T^{2/3}) rate derived in Raimondo (1998) only by a logarithmic factor. The lower bound of the threshold is O(√(log T)) and therefore we use

ζ_T = C̃ √(log T),   (2.8)

where C̃ is a constant whose choice we comment on in Section 3.2.
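The Gram-Schmidt description above can be made concrete with a short numerical sketch. The hinge parameterization and the validity range s < b < e are our choices; the exact closed form of ψ^b_{s,e}(t) in (2.6) may differ by sign or parameterization.

```python
import math

def linear_contrast(s, e, b):
    """Sketch of the contrast vector for continuous piecewise-linear
    signals: a vector that is linear with a kink, orthogonalized against
    the constant and linear vectors on [s, e] and normalized to unit
    Euclidean norm. Requires s < b < e so the kinked vector is not already
    in the span of the constant and linear vectors."""
    one = [1.0] * (e - s + 1)
    gamma = [float(t) for t in range(s, e + 1)]
    kink = [max(0.0, t - b) for t in range(s, e + 1)]  # hinge after b

    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    def proj_out(v, u):
        coef = dot(v, u) / dot(u, u)
        return [a - coef * c for a, c in zip(v, u)]

    g1 = proj_out(gamma, one)                  # linear part, mean removed
    psi = proj_out(proj_out(kink, one), g1)    # Gram-Schmidt residual
    norm = math.sqrt(dot(psi, psi))
    return [v / norm for v in psi]

def contrast(x, s, e, b):
    """C^b_{s,e}(X) = |<X, psi^b_{s,e}>| on 1-indexed data x."""
    psi = linear_contrast(s, e, b)
    return abs(sum(p * v for p, v in zip(psi, x[s - 1:e])))
```

By construction, the contrast vanishes on any exactly linear stretch of data (it is orthogonal to the constant and linear vectors) and peaks at a kink, which is precisely the behavior the isolation-and-detection scheme needs.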
ID exhibits a great degree of flexibility, as it does not depend on the structure of the signal; what changes is the choice of an appropriate contrast function. Adapting an approach similar to the one for continuous piecewise-linear signals, one can construct contrast functions for the detection of other types of features.

Part 3: We need to ensure that, once S contains N̂ estimates, then for j = 1, 2, . . ., N, each r̂_j is within a distance of C*(log T)^α from r_j. To achieve this, for the remaining estimated change-points after Part 2, we use triplets (s̃_j, r̂_j, ẽ_j), with s̃_j = ⌊(r̂_{j−1} + r̂_j)/2⌋ + 1 and ẽ_j = ⌊(r̂_j + r̂_{j+1})/2⌋. For m = argmin_j C^{r̂_j}_{s̃_j,ẽ_j}(X), if C^{r̂_m}_{s̃_m,ẽ_m}(X) ≤ C̄ √(log T), then we remove r̂_m and relabel the remaining estimates in S in increasing order. We repeat this removal procedure until C^{r̂_m}_{s̃_m,ẽ_m}(X) > C̄ √(log T), which is when we proceed to Part 4.

Information Criterion approach
Part 4: For the estimated change-points that remain in S after Part 3 is completed, we again use the triplets (r̂_{j−1}, r̂_j, r̂_{j+1}) in order to find m = argmin_j {CS(r̂_j)}, and we then remove r̂_m from S. This estimate removal approach is repeated until S = ∅. Theorems 2.3 and 2.4 below give the consistency results for the piecewise-constant and continuous piecewise-linear models, based on the sSIC approach. The proof of Theorem 2.3 is in the supplementary material, and the same approach can be followed to prove Theorem 2.4.

Theorem 2.3. Let {X_t}_{t=1,2,...,T} follow model (2.2) under piecewise-constancy and let the assumptions of Theorem 2.1 hold. Let N and r_j, j = 1, 2, . . ., N, be the number and locations of the change-points. Let N ≤ J, where J is a constant that does not depend on T. In addition, let α > 1 be such that δ_T f_T^2 ≥ C̄ (log T)^α is satisfied, where δ_T and f_T are defined in (A2). With {M_j}_{j=0,1,...,J} being the set of candidate models obtained by the solution path algorithm, we define N̂ = argmin_{j=0,1,...,J} sSIC(j). Then, there exist positive constants C_1, C_2, which do not depend on T, such that for sufficiently large T,

P( N̂ = N, max_{j=1,2,...,N} { |r̂_j − r_j| (∆^f_j)^2 } ≤ C_1 (log T)^α ) ≥ 1 − C_2/T.   (2.11)

Theorem 2.4. Let {X_t}_{t=1,2,...,T} follow model (2.2) under continuous piecewise-linearity and let the assumptions of Theorem 2.2 hold. Let N and r_j, j = 1, 2, . . ., N, be the number and locations of the change-points. Let N ≤ J, where J is a constant that does not depend on T. In addition, let α > 1 be such that δ_T^3 f_T^2 ≥ C̄ (log T)^α is satisfied, where δ_T and f_T are defined in (A3). With {M_j}_{j=0,1,...,J} being the set of candidate models obtained by the solution path algorithm, we define N̂ = argmin_{j=0,1,...,J} sSIC(j). Then, there exist positive constants C_1, C_2, which do not depend on T, such that for sufficiently large T,

P( N̂ = N, max_{j=1,2,...,N} { |r̂_j − r_j| (∆^f_j)^{2/3} } ≤ C_1 (log T)^{α/3} ) ≥ 1 − C_2/T.   (2.12)

The quantities on the right-hand sides of (2.11) and (2.12) are O(1 − 1/T); the same order as those in (2.4) and (2.7). The lowest admissible δ_T f_T^2 and δ_T^3 f_T^2 in Theorems 2.3 and 2.4, respectively, are slightly larger than the same quantities in the thresholding approach. This is expected, because SIC-based approaches tend to exhibit better practical behavior for signals that have a moderate number of change-points with large spacings between them; see Yao (1988) for more details. A hybrid that combines the advantages of the thresholding and the SIC-based approaches is introduced in Section 3.4.

Computational cost
With δ_T being the minimum distance between two change-points and λ_T the interval-expansion parameter, we need λ_T < δ_T. As K = ⌈T/λ_T⌉ > T/δ_T and the total number, M, of distinct intervals required to scan the data is no more than 2K (K intervals created from each expanding direction), in the worst-case scenario we have

M ≤ 2 ⌈T/λ_T⌉.   (3.1)

As a comparison, in WBS and NOT one needs to draw at least M intervals, where M ≥ (9T^2/δ_T^2) log(T^2/δ_T). The lower bound for M in WBS and NOT is O(T^2/δ_T^2) up to a logarithmic factor, whereas the bound in (3.1) is O(T/δ_T) when λ_T is of the order of δ_T. This results in great speed gains of ID over WBS and NOT. In our understanding, the reason behind this significant difference in the computational complexity of the methods is that in WBS and NOT both the start- and end-points of the randomly drawn intervals have to be chosen, whereas in ID, depending on the expanding direction, we keep the start- or end-point fixed.
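To make the comparison concrete, the two interval counts can be evaluated directly; the helper names are ours, and the WBS/NOT expression is simply the lower bound quoted above.

```python
import math

def id_interval_count(T, lam):
    """Worst-case number of expanding intervals examined by ID,
    M <= 2 * ceil(T / lam), as in (3.1)."""
    return 2 * math.ceil(T / lam)

def wbs_not_interval_bound(T, delta):
    """Lower bound on the number M of random intervals for WBS/NOT
    quoted in the text: (9 T^2 / delta^2) * log(T^2 / delta)."""
    return 9 * T**2 / delta**2 * math.log(T**2 / delta)
```

For example, with T = 7000 and λ_T = δ_T = 7, ID needs at most 2000 intervals, while the WBS/NOT lower bound is already in the hundreds of millions; this is the O(T/δ_T) versus O(T^2/δ_T^2) gap described above.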

Parameter choice
Choice of the threshold constants. In order to decide on C and C̃ in (2.5) and (2.8), respectively, we ran a large-scale simulation study involving signals whose jump magnitudes followed the normal distribution with mean zero and variance σ^2 ∈ {1, 3, 5}. Standard Gaussian noise was then added onto the simulated signal. For each value of N_α, σ^2 and T, we generated 1000 replicates and estimated the number of change-points using ID with threshold ζ_T as in (2.5) and (2.8), for a variety of constant values C and C̃, respectively. The best behavior occurred when, approximately, C = 1 and C̃ = 1.4, for piecewise-constant and continuous piecewise-linear signals, respectively. These values will be called the default constants. In the SIC-based approach of Section 2.3, we started by detecting change-points using a threshold ζ̃_T < ζ_T. In practical implementations, we take the constants related to ζ̃_T, namely C_{ζ̃_T} and C̃_{ζ̃_T} as defined in Section 2.3, to be 0.9 and 1.25, respectively.
Choice of the expansion parameter λ_T. Due to the low computational complexity of ID, we take λ_T = 3, leading to good accuracy even for signals with very frequent change-points. For an example of execution speeds on a single core of an Intel Xeon 3.60GHz CPU with 16 GB of RAM, see Table 3.1.

Table 3.1: The average computational time of ID based on the threshold stopping rule, for a simulated data sequence of length 7 × 10^j, j = 3, 4, 5, 6, from a signal that fluctuates between 0 and 4 every 7 observations. The standard deviation is σ = 0.5.

Variants
This section describes a number of different ways to further improve ID's practical performance with respect to both accuracy and speed.
Very long signals: If T is large, we split the given data sequence uniformly into smaller parts (windows), to which ID is then applied. In practical implementations, the length of the window is 3000 and we apply this window structure only when T > 12000, because for smaller values of T there were no significant differences in the execution times of ID and its window-based variant. An idea of the computational improvement that this structure offers is given in Section 1 of the supplement.
Restarting after detection: In practice, instead of starting from the end-point e* (or start-point s*) of the right-expanding (or left-expanding) interval where a detection occurred, we could restart from the estimated change-point, b̂. This alternative, labeled ID_det, leads to accuracy improvements without affecting the speed of the method.
Faster solution path algorithm: In practice, we use only Part 4 of the solution path algorithm described in Section 2.3 because it is quicker and conceptually simpler; it requires only the choice of α, and tends not to affect the accuracy of ID.

A hybrid between thresholding and SIC stopping rules
For signals with a large number of regularly occurring change-points, the threshold-based ID tends to behave better than the SIC-based procedure. As explained after Theorems 2.3 and 2.4, this is unsurprising, because SIC-based approaches typically perform better on signals with a moderate number of change-points separated by large spacings. This difference in ID's behavior between the threshold- and SIC-based versions is what motivates us to introduce a hybrid of these two stopping rules with minimal parameter choice, which works as follows. Firstly, we find the estimated change-points using the threshold-based approach ID_det with λ^th_T = 3. If the estimated number of change-points is larger than a constant J*, then the result is accepted and we stop. Otherwise, the hybrid method proceeds to detect the change-points using the SIC-based approach with λ_T > λ^th_T, since the already-applied thresholding rule has not suggested a signal with many change-points. In the simulation results later, we use J* = 100 and λ_T = 10.
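The hybrid decision rule itself is a small piece of control flow; a sketch, in which `threshold_id` and `sic_id` are hypothetical stand-ins for the threshold-based ID_det and the SIC-based ID (both assumed to take the data and an expansion parameter and to return the sorted estimated change-points):

```python
def hybrid_id(x, threshold_id, sic_id, J_star=100, lam_th=3, lam_sic=10):
    """Hybrid stopping rule (sketch): accept the threshold-based result if
    it suggests many change-points; otherwise defer to the SIC-based rule
    with a larger expansion parameter."""
    cpts = threshold_id(x, lam=lam_th)   # first pass with the small step
    if len(cpts) > J_star:               # many change-points: thresholding wins
        return cpts
    return sic_id(x, lam=lam_sic)        # moderate N: SIC-based rule wins
```

The only tuning left to the user is J*; the constants C*, C̄ of the solution path algorithm are bypassed exactly as described in Section 3.4.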

Extension to different noise structures
This section describes how to use ID when the noise is not Gaussian. We pre-process the data in order to obtain a noise structure that is closer to Gaussianity. For a given scale number s and data {X_t}_{t=1,2,...,T}, let Q = ⌊T/s⌋ and X̄_q = s^{−1/2} Σ_{t=(q−1)s+1}^{qs} X_t, q = 1, 2, . . ., Q. We apply ID on {X̄_q}_{q=1,2,...,Q} to obtain the estimated change-points, namely r̃_1, r̃_2, . . ., r̃_N̂, in increasing order. To estimate the original locations of the change-points, we define r̂_k = ⌊(r̃_k − 1)s + s/2 + 0.5⌋, k = 1, 2, . . ., N̂. There is a trade-off in the choice of the scaling parameter s: the larger the value of s, the closer the distribution of the noise to normal, but the greater the amount of pre-processing. In the simulations presented in Section 4, we use s = 3 for the case of Student-t_5 distributed noise, while if the tails are heavier (Student-t_3), we set s = 5.
The hybrid version of ID is employed on {X̄_q}_{q=1,2,...,Q} and, to be consistent with the choice of the expansion parameter as defined in Section 3.2, we take λ*_T = ⌈λ_T/s⌉. In practice, for unknown noise, our recommendation is to set s = 5.
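The pre-processing step and the mapping back to the original time axis can be sketched as follows. The 1/√s scaling of the block sums is our assumption (it keeps the noise variance at one while the averaging pulls its distribution closer to Gaussianity); the mid-block rounding mirrors the r̂_k formula above.

```python
import math

def preprocess(x, s):
    """Sum the data over non-overlapping blocks of length s and scale by
    1/sqrt(s) (assumed normalization), dropping any incomplete last block."""
    Q = len(x) // s
    return [sum(x[q * s:(q + 1) * s]) / math.sqrt(s) for q in range(Q)]

def map_back(cpts, s):
    """Map change-points estimated on the block scale back to the original
    time axis, placing each estimate mid-block:
    r_hat = floor((r_tilde - 1) * s + s/2 + 0.5)."""
    return [math.floor((r - 1) * s + s / 2 + 0.5) for r in cpts]
```

A block change-point at position r̃ on the coarse scale is thus attributed to the middle of the corresponding block of the raw data, which is the best one can do without re-examining the original sequence.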

Simulations
In this section, we provide a comprehensive simulation study of the performance of ID against the currently best methods in the scenarios of piecewise-constant signals and continuous piecewise-linear signals. As a measure of the accuracy of the estimated signal, we use the mean squared error, MSE = T^{−1} Σ_{t=1}^{T} (f̂_t − f_t)^2, where f̂_t is the ordinary least squares approximation of f_t between two successive estimated change-points. In continuous piecewise-linear signals, f̂_t is the splines fit obtained using the splines package in R. The scaled Hausdorff distance,

d_H = n_s^{−1} max { max_j min_k |r_j − r̂_k|, max_k min_j |r_j − r̂_k| },

where n_s is the length of the largest segment, is also given. The average computational time for all methods, apart from FDR, is also provided. FDR is excluded due to its non-uniform procedure in terms of the execution speed for each signal (if a newly obtained signal has length greater than previously treated signals, then FDR estimates the threshold by 5000 Monte-Carlo simulations, which makes it very slow).

[Figure: ID and TGUH (dashed) estimates for Tower Hamlets and Newham.]

Samsung Stock Prices
In this section, we apply ID to the daily closing stock prices of Samsung Electronics Co. from July 2012 until July 2017. The data are available from https://finance.yahoo.com/quote/005930.KS/history?p=005930.KS and were last accessed in January 2018.
We look for changes in a continuous piecewise-linear mean signal. Figure 5.4 shows the results for the ID, NOT and CPOP methods, which detect 79, 14, and 135 change-points, respectively. From both the fit and the residuals given in Figure 5.4, it is not easy to say which of the three methods gives the "best" number of change-points. ID can return a range of different fits, providing users with the flexibility to choose according to their preference. In Figure 5.5, we use the solution path and obtain the estimated signal and the raw residuals of ID for N̂ = 135 (the change-point number estimated by CPOP).
Our proposed approach, labeled Isolate-Detect (ID), is a generic technique for generalized change-point detection in various different structures, such as piecewise-constant or piecewise-linear signals with or without the continuity constraint. In the paper, we focus on piecewise-constant and continuous piecewise-linear signals. The concept behind ID is simple and is split into two stages: firstly, the attempted isolation of each of the true change-points within subintervals of the domain [1, 2, . . ., T], and secondly, their detection.
For a suitably chosen contrast function, ID identifies the point with the maximum contrast value in R_1. If its value exceeds a certain threshold, denoted by ζ_T, then it is taken as a change-point. If not, then the process tests the next interval in S^{RL}. Upon detection, the algorithm makes a new start from the end-point (or start-point) of the right- (or left-) expanding interval where the detection occurred. Upon correct choice of ζ_T, ID ensures that we work on intervals with at most one change-point.
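Putting the pieces together, the loop just described can be sketched compactly for the piecewise-constant case. This is a simplification of the paper's algorithm, not its implementation: the expansion grid is recomputed locally within each working interval, and the restart simply continues on the unexamined side of the interval where the detection occurred.

```python
import math

def _cusum(x, s, e, b):
    # |CUSUM| contrast on x[s..e] (1-indexed) at candidate split b, s <= b < e.
    n = e - s + 1
    left, right = sum(x[s - 1:b]), sum(x[b:e])
    return abs(math.sqrt((e - b) / (n * (b - s + 1))) * left
               - math.sqrt((b - s + 1) / (n * (e - b))) * right)

def isolate_detect(x, zeta, lam=3):
    """Threshold-based ID sketch for piecewise-constant signals: inside the
    current interval [s, e], alternate right- and left-expanding intervals;
    on a detection, record the maximizer and restart on the remaining part
    of [s, e]; stop when no expanding interval exceeds the threshold."""
    found = []

    def search(s, e):
        if e - s < 1:
            return
        for j in range(1, math.ceil((e - s) / lam) + 1):
            re_, le_ = min(s + j * lam, e), max(e - j * lam, s)
            for a, c in ((s, re_), (le_, e)):
                if c - a < 1:
                    continue
                b = max(range(a, c), key=lambda t: _cusum(x, a, c, t))
                if _cusum(x, a, c, b) > zeta:
                    found.append(b)
                    # continue on the unexamined side of the interval
                    search(c, e) if (a, c) == (s, re_) else search(s, a)
                    return

    search(1, len(x))
    return sorted(found)
```

Because the expanding intervals grow by λ_T < δ_T, each detection happens inside an interval containing a single change-point, which is the isolation property that the generic description above relies on.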

Figure 2.1: An example with two change-points, r_1 = 38 and r_2 = 77. The dashed line is the interval in which the detection took place in each phase.
At the beginning, [s, e] = [1, T] and we have k_1 = k_2 = 1. ID starts by looking for change-points interchangeably in right- and left-expanding intervals, denoted by [s*_j, e*_j]. For example, the first four intervals are [s*_1, e*_1] = [1, λ_T], [s*_2, e*_2] = [T − λ_T + 1, T], [s*_3, e*_3] = [1, 2λ_T], and [s*_4, e*_4] = [T − 2λ_T + 1, T].

If, in one of these expanding intervals, the maximum of the contrast function exceeds ζ_T at a location b_1, then b_1 is assigned as the estimate of r_1. After that, ID restarts and looks for change-points in [s, e] = [c^r_{k*}, T], where now the s*_j and e*_j are taken from I^l_{c^r_{k*},T} and I^r_{c^r_{k*},T} as in (2.1), respectively. The algorithm stops when it is applied to an interval [s, e] such that all expanding intervals [s*_j, e*_j] ⊆ [s, e] do not include a point b*_j with C^{b*_j}_{s*_j,e*_j}(X) > ζ_T.

In (2.3), 1 ≤ s ≤ b < e ≤ T and n = e − s + 1. Under assumption (A1), it can be shown that argmax_b |X̃^b_{s,e}| = argmax_b R^b_{s,e}(X), where R^b_{s,e}(X) is the generalized log-likelihood ratio statistic for all potential single change-points within [s, e]. For i.i.d. Gaussian noise, the CUSUM estimator of the change-point locations is the maximum likelihood estimator. For the main result of Theorem 2.1, we also make the following assumption.

(A2) The minimum distance, δ_T, between two change-points and the minimum magnitude of jumps, f_T, are connected by √(δ_T) f_T ≥ C √(log T), for a large enough constant C.

We explain after Theorem 2.1 that the above √(log T) rate for the lower bound of √(δ_T) f_T is optimal up to a double logarithmic term. The number of change-points, N, is assumed to be neither known nor fixed. It can grow with T, and the only indirect assumption on N is due to the minimum distance, δ_T, between two change-points, in the sense that N + 1 ≤ T/δ_T. Below, we give the main theoretical result for the consistency of the number and location of the estimated change-points. The proof is in Section 4 of the online supplement.

Misspecification of the threshold in the ID algorithm can lead to misestimation of the number of change-points. To address this, we develop an approach which starts by possibly overestimating the number of change-points and then creates a solution path, with the estimates ordered according to a predefined criterion. The best fit is then chosen by optimizing a model selection criterion.

The solution path algorithm: The estimated number of change-points depends on ζ_T, which allows us to write N̂ = N̂(ζ_T). For given data, we employ ID using first ζ_T and then ζ̃_T, where ζ̃_T < ζ_T. Let C_{ζ_T} and C_{ζ̃_T} be the associated constants in (2.5) and (2.8), respectively. With J ≥ N̂(ζ_T), we obtain estimates r̃_j, j = 1, 2, . . ., J, which are sorted in increasing order in S = [r̃_1, r̃_2, . . ., r̃_J]. The algorithm is split into four parts. Although the description of each part is fairly technical, the parts are very similar and are all based on the idea of removing change-point estimates according to their contrast function values as well as their distance to neighboring estimates. The algorithm involves three parameters: C*, C̃, and α. Although we do not give a recipe for the choice of C* and C̃, Section 3.4 describes how to circumvent their choice. The default value of α is 1.01; an explanation of this choice is given before Theorem 2.3. We now give the four parts of the solution path algorithm. All events below occur with probability tending to one with T.

Part 1: With C* a positive constant, the aim is to prune the estimates in S so that, for each true change-point, there are at least one and at most four estimated change-points within a distance of C*(log T)^α. To achieve this, for each j ∈ {1, 2, . . ., J}, we collect the triplet (r̃_{j−1}, r̃_j, r̃_{j+1}) and calculate CS(r̃_j) := C^{r̃_j}_{r̃_{j−1}, r̃_{j+1}}(X), with C^b_{s,e}(X) being the relevant contrast function. For m = argmin_j {CS(r̃_j)}, we first check whether CS(r̃_m) ≤ C̃√(log T), for C̃ > 0; in the proofs of Theorems 2.3 and 2.4, C̃ = 2√2, but smaller values could be sufficient. If CS(r̃_m) ≤ C̃√(log T) and also r̃_{m+1} − r̃_{m−1} ≤ 2C*(log T)^α, we remove r̃_m from S, reduce J by 1, relabel the remaining estimates (in increasing order) in S, and repeat this estimate-removal process. We proceed to Part 2 when CS(r̃_m) > C̃√(log T).

Part 2: The aim is to continue the pruning so that, at the end of Part 2, there is at least one estimate within a distance of C*(log T)^α from each true change-point, and at most two estimates between any pair of consecutive true change-points. For the relabeled estimates in S after the completion of Part 1, if r̃_j − r̃_{j−1} ≤ C*(log T)^α, then we remove r̃_j, relabel the remaining estimates, and keep removing estimates until there is no pair (r̃_{j−1}, r̃_j) with r̃_j − r̃_{j−1} ≤ C*(log T)^α. We then calculate CS(r̃_j) as in Part 1 and, for m = argmin_j {CS(r̃_j)}, if CS(r̃_m) ≤ C̃√(log T), we remove r̃_m and relabel the remaining elements of S. This estimate-removal process is repeated, and we proceed to Part 3 only when CS(r̃_m) > C̃√(log T).
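The Part 1 removal loop can be sketched as follows for the mean-change contrast. This is a simplified illustration, not the paper's algorithm in full: the distance condition involving C*(log T)^α is omitted, the threshold value C̃√(log T) with C̃ = 2√2 is taken from the theory above, and the function names are our own.

```python
import math

def triplet_contrast(x, left, b, right):
    """CUSUM of x on (left, right] evaluated at b, mimicking
    CS(r_j) = C^{r_j}_{r_{j-1}, r_{j+1}}(X); 0-based, left < b < right."""
    s, e = left + 1, right
    n, m = e - s + 1, b - s + 1
    l, r = sum(x[s:b + 1]), sum(x[b + 1:e + 1])
    return abs(math.sqrt((e - b) / (n * m)) * l
               - math.sqrt(m / (n * (e - b))) * r)

def prune(x, cands, thr):
    """Part-1-style pruning: repeatedly drop the candidate whose triplet
    contrast is smallest, as long as that contrast falls below thr."""
    S = sorted(cands)
    while S:
        ext = [-1] + S + [len(x) - 1]   # pseudo-estimates at the boundaries
        cs = [triplet_contrast(x, ext[j], ext[j + 1], ext[j + 2])
              for j in range(len(S))]
        m = min(range(len(S)), key=cs.__getitem__)
        if cs[m] > thr:
            break
        S.pop(m)
    return S
```

On a noiseless signal with one true change-point, spurious candidates have small triplet contrasts and are removed first, while the true location survives the threshold check.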
At the end of Part 4, we collect the estimates in a vector

b = (b_1, b_2, . . ., b_J),  (2.9)

where b_J is the estimate that was removed first, b_{J−1} the one that was removed second, and so on. From now on, b is called the solution path and is used to give a range of different fits. We define the collection {M_j}_{j=0,1,...,J}, where M_0 = ∅ and M_j = {b_1, b_2, . . ., b_j}. For j = 2, . . ., J, let b̂_1 < . . . < b̂_j be the sorted elements of M_j. Among the collection of models {M_j}_{j=0,1,...,J}, we propose to select the one that minimizes the strengthened Schwarz Information Criterion (Liu et al. 1997; Fryzlewicz 2014), defined as

sSIC(j) = −2 Σ_{k=1}^{j+1} ℓ(X_{b̂_{k−1}+1}, . . ., X_{b̂_k}; θ̂_k) + n_j (log T)^α,  (2.10)

where ℓ denotes the log-likelihood, b̂_0 = 0 and, for each collection M_j, b̂_{j+1} = T, while θ̂_1, θ̂_2, . . ., θ̂_{j+1} are the maximum likelihood estimators of the segment parameters for model (2.2) with change-point locations b_1, b_2, . . ., b_j. The quantity n_j is the total number of estimated parameters related to M_j. Taking α = 1 in (2.10) gives the standard SIC penalty, but our theory requires α > 1; in practice we use α = 1.01 in order to stay close to SIC.

Theorems 2.3 and 2.4 [...] To find good values for the constants in (2.5) and (2.8), respectively, we ran a large-scale simulation study involving a wide range of signals. The number of change-points, N, was generated from the Poisson distribution with rate parameter in {4, 8, 12}. For T ∈ {100, 200, 500, 1000, 2000, 5000}, we distributed the change-points uniformly in {1, 2, . . ., T}. Then, for piecewise-constant (or continuous piecewise-linear) signals, at each change-point location we introduced a jump (or a slope change)
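For Gaussian data with a piecewise-constant mean, the likelihood term in (2.10) can be evaluated, up to constants, via residual sums of squares. The sketch below makes two simplifying assumptions that are ours, not the paper's: the parameter count is taken as n_j = 2j + 1 (j locations plus j + 1 segment means), and the variance is profiled out through T log(RSS_j/T).

```python
import math

def sSIC_best(x, solution_path, alpha=1.01):
    """Pick the model along the solution path minimizing a Gaussian sSIC
    surrogate: T*log(RSS_j/T) + n_j*(log T)^alpha, with n_j = 2j + 1."""
    T = len(x)
    best, best_val = [], float("inf")
    for j in range(len(solution_path) + 1):
        cpts = sorted(solution_path[:j])     # first j points of the path
        bounds = [-1] + cpts + [T - 1]
        rss = 0.0
        for a, b in zip(bounds[:-1], bounds[1:]):
            seg = x[a + 1:b + 1]
            mu = sum(seg) / len(seg)         # MLE of the segment mean
            rss += sum((v - mu) ** 2 for v in seg)
        # guard against log(0) for exactly-fitted (noiseless) segments
        val = (T * math.log(max(rss, 1e-12) / T)
               + (2 * j + 1) * math.log(T) ** alpha)
        if val < best_val:
            best, best_val = cpts, val
    return best
```

The nested structure M_0 ⊂ M_1 ⊂ . . . ⊂ M_J means only J + 1 candidate models are compared, rather than all subsets of the J estimates.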

Figure 5.3: Top row: the original time series and the fitted piecewise-constant mean signal obtained by ID for both Tower Hamlets and Hackney. Bottom row: NOT (solid)

Figure 5.4: Top row: from left to right, the fits for ID, NOT and CPOP, respectively. Bottom row: the raw residuals ε̂_t = Y_t − f̂_t for each method.
[...] will not be re-examined in Phase 2, and r_1 gets, upon a good choice of ζ_T, detected in {X_s, X_{s+1}, . . ., X_{e*}}, where e* = 40. After the detection, s is updated as the end-point of the interval in which the detection occurred; see Figure 2.1.
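The interval-expansion mechanism of this example can be sketched as follows for the mean-change case. This is a deliberately simplified loop, not the IDetect implementation: the expansion step `lam` and the threshold `zeta` are illustrative assumptions, and the loop simply restarts to the right of each detection interval instead of reproducing ID's full Phase 1/Phase 2 bookkeeping.

```python
import math

def cusum(x, s, e, b):
    # CUSUM contrast for a change in mean at b within [s, e] (0-based)
    n, m = e - s + 1, b - s + 1
    left, right = sum(x[s:b + 1]), sum(x[b + 1:e + 1])
    return abs(math.sqrt((e - b) / (n * m)) * left
               - math.sqrt(m / (n * (e - b))) * right)

def isolate_detect(x, zeta, lam=3):
    """Alternately examine right- and left-expanding intervals inside
    [s, e]; record a change-point as soon as the contrast exceeds zeta,
    then restart to the right of the detection interval."""
    T = len(x)
    cpts, s, e = [], 0, T - 1
    while e - s >= 1:
        found, j = None, 1
        while found is None and j * lam < (e - s + 1) + lam:
            for (a, b_end) in [(s, min(s + j * lam, e)),   # right-expanding
                               (max(e - j * lam, s), e)]:  # left-expanding
                v, b = max((cusum(x, a, b_end, b), b) for b in range(a, b_end))
                if v > zeta:
                    found = (b, b_end if a == s else e)
                    break
            j += 1
        if found is None:
            break
        b, new_s = found
        cpts.append(b)
        s = new_s   # do not re-examine data left of the detection interval
    return sorted(cpts)
```

Because each expanding interval contains at most one change-point when a detection first occurs, the contrast is maximized near that isolated change-point, which is the isolation idea the method is named after.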

Table 4.2 below shows the competitors used.

Table 4.2: The competing methods used in the simulation study.

[...] neighborhood of zero, depending on the example) and those within 10% of the highest are given in bold. As a measure of the accuracy of the detected locations, we provide Monte-Carlo estimates of the mean squared error, MSE = T^{−1} Σ_{t=1}^{T} E(f̂_t − f_t)².

Table 4.3: Distribution of N̂ − N over 100 simulated data sequences of the piecewise-constant signals (M1)-(M4). The average MSE, d_H and computational time are also given.

Table 4.4: Distribution of N̂ − N over 100 simulated data sequences from the piecewise-constant signal (M5). The average MSE, d_H and computational time are also given.

Table 4.5: Distribution of N̂ − N over 100 simulated data sequences from the piecewise-constant signal (M6). The average MSE, d_H and computational time are also given.

Table 4.6: Distribution of N̂ − N over 100 simulated data sequences from the piecewise-constant signal (M7). The average MSE, d_H and computational time are also given.

Table 4.8: Distribution of N̂ − N over 100 simulated data sequences of the continuous piecewise-linear signal (W3). The average MSE, d_H and computational time are also given.

Table 4.9: Distribution of N̂ − N over 100 simulated data sequences of the continuous piecewise-linear signal (W4). The average MSE, d_H and computational time are also given.
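The two accuracy measures reported in the tables can be computed as follows. This is an illustrative sketch: d_H is taken here to be the plain (unscaled) Hausdorff distance between the sets of estimated and true change-point locations, which is an assumption on our part about its exact definition.

```python
def mse(f_hat, f):
    """Monte-Carlo building block for MSE = T^{-1} * sum_t (f_hat_t - f_t)^2."""
    T = len(f)
    return sum((a - b) ** 2 for a, b in zip(f_hat, f)) / T

def d_H(est, true):
    """Hausdorff distance between two nonempty sets of change-point
    locations: the largest distance from a point in one set to the
    nearest point of the other set."""
    def one_sided(A, B):
        return max(min(abs(a - b) for b in B) for a in A)
    return max(one_sided(est, true), one_sided(true, est))
```

A small d_H indicates that every true change-point has a nearby estimate and every estimate lies near a true change-point, complementing the signal-level accuracy measured by the MSE.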