Detecting linear trend changes in data sequences

We propose TrendSegment, a methodology for detecting multiple change-points corresponding to linear trend changes in one-dimensional data. A core ingredient of TrendSegment is a new Tail-Greedy Unbalanced Wavelet transform: a conditionally orthonormal, bottom-up transformation of the data through an adaptively constructed unbalanced wavelet basis, which results in a sparse representation of the data. Due to its bottom-up nature, this multiscale decomposition focuses on local features in its early stages and on global features later, which enables the detection of both long and short linear trend segments at once. To reduce the computational complexity, the proposed method merges multiple regions in a single pass over the data. We show the consistency of the estimated number and locations of change-points. The practicality of our approach is demonstrated through simulations and two real data examples, involving Iceland temperature data and sea ice extent of the Arctic and the Antarctic. Our methodology is implemented in the R package trendsegmentR, available from CRAN.


Introduction
Multiple change-point detection is a problem of importance in many applications; recent examples include automatic detection of change-points in cloud data to maintain the performance and availability of an app or a website (James et al., 2016), climate change detection in tropical cyclone records (Robbins et al., 2011), detecting exoplanets from light curve data (Fisch et al., 2018), detecting changes in the DNA copy number (Olshen et al., 2004; Jeng et al., 2012; Bardwell et al., 2017), estimation of stationary intervals in potentially cointegrated stock prices (Matteson et al., 2013), estimation of change-points in multi-subject fMRI data (Robinson et al., 2010) and detecting changes in vegetation trends (Jamali et al., 2015). This paper considers the change-point model

X_t = f_t + ε_t, t = 1, . . . , T,   (1)

where f_t is a deterministic and piecewise-linear signal containing N change-points, i.e. time indices at which the slope and/or the intercept of f_t undergoes a change. These changes occur at unknown locations η_1, η_2, . . . , η_N. In this article, we assume that the ε_t's are iid N(0, σ²); in the supplementary material, we show how our method can be extended to dependent and/or non-Gaussian noise, such as ε_t following a stationary Gaussian AR process or a t-distribution. The true change-points {η_i}_{i=1}^N (with η_0 = 0 and η_{N+1} = T) are such that

f_t = θ_{ℓ,1} + θ_{ℓ,2} t for t ∈ [η_{ℓ−1} + 1, η_ℓ], ℓ = 1, . . . , N + 1, where f_{η_ℓ} + θ_{ℓ,2} ≠ f_{η_ℓ+1} for ℓ = 1, . . . , N.   (2)

This definition permits both continuous and discontinuous changes in the linear trend.
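To make the model concrete, the sketch below simulates a noisy piecewise-linear signal of the form described above; the change-point locations, intercepts and slopes are hypothetical values chosen purely for illustration.

```python
import numpy as np

def piecewise_linear_signal(T, cpts, intercepts, slopes):
    """Build f_t = theta_{l,1} + theta_{l,2} * t on each segment
    [eta_{l-1}+1, eta_l], with eta_0 = 0 and eta_{N+1} = T."""
    bounds = [0] + list(cpts) + [T]
    f = np.empty(T)
    for l in range(len(bounds) - 1):
        t = np.arange(bounds[l] + 1, bounds[l + 1] + 1)  # time indices of segment l
        f[t - 1] = intercepts[l] + slopes[l] * t
    return f

rng = np.random.default_rng(0)
T = 300
f = piecewise_linear_signal(T, cpts=[100, 200],
                            intercepts=[0.0, 30.0, -10.0],
                            slopes=[0.1, -0.2, 0.2])
x = f + rng.normal(0.0, 1.0, T)   # X_t = f_t + eps_t, eps_t ~ iid N(0, sigma^2)
```

Depending on the chosen intercepts and slopes, the changes at the η_i can be continuous or discontinuous; no continuity constraint is imposed, matching the model.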
Our main interest is in the estimation of N and η_1, η_2, . . . , η_N under assumptions that quantify the difficulty of detecting each η_i; our aim is thus to segment the data into sections of linearity in f_t. In detail, a change-point located close to its neighbouring ones can only be detected when the size of its linear trend change is large enough, while a change-point exhibiting a small linear trend change requires a longer distance from its adjacent change-points in order to be detected. Detecting linear trend changes is an important applied problem in a variety of fields, including climate change, as illustrated in Section 5.
The change-point detection procedure proposed in this paper is referred to as TrendSegment; it is designed to work well in the presence of either long or short spacings between neighbouring change-points, or a mixture of both. The engine underlying TrendSegment is a new Tail-Greedy Unbalanced Wavelet (TGUW) transform: a conditionally orthonormal, bottom-up transformation for univariate data sequences through an adaptively constructed unbalanced wavelet basis, which results in a sparse representation of the data. In this article, we show that TrendSegment offers good performance in estimating the number and locations of change-points across a wide range of signals containing constant and/or linear segments. TrendSegment is also shown to be statistically consistent and computationally efficient.
In earlier related work regarding linear trend changes, Bai and Perron (1998) consider the estimation of linear models with multiple structural changes by least-squares and present Wald-type tests for the null hypothesis of no change. Kim et al. (2009) and Tibshirani et al. (2014) consider 'trend filtering' with the L1 penalty and Fearnhead et al. (2019) detect changes in the slope with an L0 regularisation via a dynamic programming algorithm. Spiriti et al. (2013) study two algorithms for optimising the knot locations in least-squares and penalised splines. Baranowski et al. (2019) propose a multiple change-point detection device termed Narrowest-Over-Threshold (NOT), which focuses on the narrowest segment among those whose contrast exceeds a pre-specified threshold. Anastasiou and Fryzlewicz (2022) propose the Isolate-Detect (ID) approach, which continuously searches expanding data segments for changes. Yu et al. (2022) propose a two-step algorithm for detecting multiple change-points in piecewise polynomials with general degrees. Keogh et al. (2004) mention that sliding windows, top-down and bottom-up approaches are the three principal categories into which most time series segmentation algorithms can be grouped. Keogh et al. (2004) apply these three approaches to the detection of changes in linear trends in 10 different signals and discover that the performance of bottom-up methods is better than that of top-down methods and sliding windows, notably when the underlying signal has jumps, sharp cusps or large fluctuations. Bottom-up procedures have rarely been used in change-point detection. Matteson and James (2014) use an agglomerative algorithm for hierarchical clustering in the context of change-point analysis. Keogh et al. (2004) merge adjacent segments of the data according to a criterion involving the minimum residual sum of squares (RSS) from a linear fit, until the RSS falls under a certain threshold; the lack of precise recipes for the choice of this threshold parameter causes the performance of this method to be somewhat unstable, as we report in Section 4.
As illustrated later in this paper, our TGUW transform, which underlies TrendSegment, is designed to work well in detecting frequent change-points or abrupt local features, where many existing change-point detection methods for the piecewise-linear model fail. The TGUW transform constructs, in a bottom-up way, an adaptive wavelet basis by consecutively merging neighbouring segments of the data starting from the finest level (throughout the paper, we refer to a wavelet basis as adaptive if it is constructed in a data-driven way). This enables it to identify local features at an early stage, before it proceeds to focus on more global features corresponding to longer data segments. Fryzlewicz (2018) introduces the Tail-Greedy Unbalanced Haar (TGUH) transform, a bottom-up, agglomerative, data-adaptive transformation of univariate sequences that facilitates change-point detection in the piecewise-constant sequence model. The current paper extends this idea to adaptive wavelets other than adaptive Haar, which enables change-point detection in the piecewise-linear model (and, in principle, in higher-order piecewise polynomials; the details can be found in Section G of the supplementary material). We emphasise that this extension from TGUH to TGUW is both conceptually and technically non-trivial, because it is not a priori clear how to construct a suitable wavelet basis in TGUW for wavelets other than adaptive Haar; this is due to the non-uniqueness of the local orthonormal matrix transformation for performing each merge in TGUW, which does not occur in TGUH. We solve this issue by imposing certain guiding principles on the way the merges are performed, which enables the detection not only of long trend segments, but also of frequent change-points including abrupt local features. The computational cost of TGUW is the same as that of TGUH.
Important properties of the TGUW transform include orthonormality conditional on the merging order, nonlinearity and "tail-greediness"; these are investigated in Section 2. The TGUW transform is the first step of the TrendSegment procedure, which consists of four steps in total.
The remainder of the article is organised as follows. Section 2 gives a full description of the TrendSegment procedure and the relevant theoretical results are presented in Section 3. The supporting simulation studies are described in Section 4 and our methodology is illustrated in Section 5 through climate datasets. The proofs of our main theoretical results are in Appendix 6. The supplementary material includes theoretical results for dependent and/or non-Gaussian noise, extension to piecewise-quadratic signal, details of robust threshold selection and extra simulation and data application results. The TrendSegment procedure is implemented in the R package trendsegmentR, available from CRAN.

Summary of TrendSegment
The TrendSegment procedure for estimating the number and the locations of change-points includes four steps. We give the broad picture first and outline details in later sections.
1. TGUW transformation. Perform the TGUW transform, a bottom-up unbalanced adaptive wavelet transformation of the input data X_1, . . . , X_T, by recursively applying local conditionally orthonormal transformations. This produces a data-adaptive multiscale decomposition of the data with T − 2 detail-type coefficients and 2 smooth coefficients. The resulting conditionally orthonormal transform of the data aims to encode most of the energy of the signal in only a few detail-type coefficients arising at coarse levels (see Figure 1 for an example output). This sparsity of representation justifies thresholding in the next step.

2. Thresholding. Set to zero those detail coefficients whose magnitude is smaller than a pre-specified threshold, subject to the requirement that all the non-zero detail coefficients remain connected to each other in the tree structure. This step performs "pruning" as a way of deciding the significance of the sparse representation obtained in step 1.

3. Inverse TGUW transformation. Obtain an initial estimate of f_t by carrying out the inverse TGUW transformation of the thresholded coefficient tree. The resulting estimator is discontinuous at the estimated change-points; it can be shown to be l2-consistent, but not yet consistent for N or η_1, . . . , η_N.
4. Post-processing. Post-process the estimate from step 3 by removing some change-points perceived to be spurious, which enables us to achieve estimation consistency for N and η_1, . . . , η_N.

Figure 2 illustrates the first three steps of the TrendSegment procedure. We devote the following four sections to describing each step above in order.

TGUW transformation

Key principles of the TGUW transform
In the initial stage, the data are considered smooth coefficients, and the TGUW transform iteratively updates the sequence of smooth coefficients by merging the adjacent sections of the data which are most likely to belong to the same segment. The merging is done by applying an adaptively constructed orthonormal transformation to the chosen triplet of smooth coefficients, and in doing so a data-adaptive unbalanced wavelet basis is established. The TGUW transform is completed after T − 2 such orthonormal transformations, and each merge is performed under the following principles.

Figure 1: Multiscale decomposition of the data through the TGUW transform when the data has no change-points ((a), (b)) or one change-point ((c), (d)). s_1 and s_2 are the smooth coefficients obtained through the TGUW transform and d_k is the detail coefficient obtained in the k-th merge. When the data has no noise ((a), (c)), d_k = 0 for all k in (a), while two non-zero coefficients d_6 and d_7 encode the single change in (c).
1. In each merge, three adjacent smooth coefficients are selected and the orthonormal transformation converts those three values into one detail and two (updated) smooth coefficients. The size of the detail coefficient gives information about the strength of the local linearity and the two updated smooth coefficients are associated with the estimated parameters (intercept and slope) of the local linear regression performed on the raw observations corresponding to the initially chosen three smooth coefficients.
Figure 2: Illustration of the first three steps of the TrendSegment procedure with the observed data X_t (dots), the true signal f_t (grey line) and the tree of mergings; (a) the TGUW transform constructs a tree by merging neighbouring segments, (b) in thresholding, the surviving coefficients (solid lines in the tree) are chosen by a pre-specified threshold, which determines the locations of the estimated change-points (red), (c) the inverse TGUW transform gives the estimated signal (green) based on the estimated change-points obtained in thresholding.
2. "Two together" rule. The two smooth coefficients returned by the orthonormal transformation are paired in the sense that both contain information about one local linear regression fit.
Thus, we require that any such pair of smooth coefficients cannot be separated when choosing triplets in any subsequent merges. We refer to this recipe as the "two together" rule.
3. To decide which triplet of smooth coefficients should be merged next, we compare the corresponding detail coefficients as their magnitude represents the strength of the corresponding local linear trend; the smaller the (absolute) size of the detail, the smaller the local deviation from linearity. Smooth coefficients corresponding to the smallest detail coefficients have priority in merging.
As merging continues under the "two together" rule, all mergings can be classified into one of three forms:
• Type 1: merging three initial smooth coefficients,
• Type 2: merging one initial smooth coefficient and one pair of smooth coefficients,
• Type 3: merging two sets of paired smooth coefficients,
where Type 3 is composed of two merges of triplets; more details are given in Section 2.2.2.

Table 1: Notation. s^{[1]}_{p,r}, s^{[2]}_{p,r} denote the smooth coefficients obtained from {X_p, . . . , X_r}, paired under the "two together" rule; d^{[1]}_{p,q,r}, d^{[2]}_{p,q,r} denote the paired detail coefficients obtained by merging two adjacent subintervals {X_p, . . . , X_q} and {X_{q+1}, . . . , X_r}, where r > q + 2 and q > p + 1 (merge of Type 3); s denotes the data sequence vector containing the (recursively updated) smooth and detail coefficients from the initial input s_0.

Example
We now provide a simple example of the TGUW transformation; the accompanying illustration is in Figure 3. The notation for this example and for the general algorithm introduced later is given in Table 1. This example shows a single merge at each pass through the data, i.e. the algorithm running in a purely greedy way. We will later generalise it to multiple merges per pass through the data, which speeds up computation (this device is referred to as "tail-greediness", as the algorithm merges those triplets corresponding to the lower tail of the distribution of the local deviation from linearity in X). We refer to the j-th pass through the data as scale j. Assume that we have the initial input s_0 = (X_1, X_2, . . . , X_8), so that the complete TGUW transform consists of 6 merges. We show the 6 merges one by one under the rules introduced in Section 2.2.1; this example demonstrates all three possible types of merges.
Scale j = 1. The candidate triplets are (X_1, X_2, X_3), (X_2, X_3, X_4), . . . , (X_6, X_7, X_8). Suppose that (X_2, X_3, X_4) gives the smallest size of detail coefficient |d_{2,3,4}| among the candidates; we merge this triplet (a merge of Type 1), which returns the detail coefficient d_{2,3,4} and the paired smooth coefficients (s^{[1]}_{2,4}, s^{[2]}_{2,4}), and the data sequence is updated to s = (X_1, s^{[1]}_{2,4}, s^{[2]}_{2,4}, X_5, X_6, X_7, X_8).

Scale j = 2. From now on, the "two together" rule is applied. Ignoring any detail coefficients, the candidate triplets are (X_1, s^{[1]}_{2,4}, s^{[2]}_{2,4}), (s^{[1]}_{2,4}, s^{[2]}_{2,4}, X_5), (X_5, X_6, X_7) and (X_6, X_7, X_8). We note that (s^{[2]}_{2,4}, X_5, X_6) cannot be considered as a candidate for the next merge under the "two together" rule, as this triplet contains only one (not both) of the paired smooth coefficients returned by the previous merge. Assume that (X_5, X_6, X_7) gives the smallest size of detail coefficient |d_{5,6,7}| among the four candidates; we then merge this triplet through the orthonormal transformation formulated in (8), which updates the data sequence to s = (X_1, s^{[1]}_{2,4}, s^{[2]}_{2,4}, s^{[1]}_{5,7}, s^{[2]}_{5,7}, X_8).

In the subsequent scales, a merge of Type 3 also becomes available: merging the paired smooth coefficients (s^{[1]}_{2,4}, s^{[2]}_{2,4}) with (s^{[1]}_{5,7}, s^{[2]}_{5,7}) is composed of two merges of triplets, and if we were to treat these two triplets separately, we would be violating the "two together" rule. The summary detail coefficient for this pair of triplets is obtained as d_{2,4,7} = max(|d^{[1]}_{2,4,7}|, |d^{[2]}_{2,4,7}|), which is compared with those of the other candidate triplets. Now suppose that (X_1, s^{[1]}_{2,4}, s^{[2]}_{2,4}) has the smallest size of detail; we merge this triplet, returning d_{1,1,4}, and update the data sequence to s = (s^{[1]}_{1,4}, s^{[2]}_{1,4}, s^{[1]}_{5,7}, s^{[2]}_{5,7}, X_8). The two remaining merges proceed analogously: merging (s^{[1]}_{5,7}, s^{[2]}_{5,7}, X_8) returns d_{5,7,8}, and the final Type 3 merge of the regions {X_1, . . . , X_4} and {X_5, . . . , X_8} returns the paired detail coefficients (d^{[1]}_{1,4,8}, d^{[2]}_{1,4,8}) together with the final smooth coefficients (s^{[1]}_{1,8}, s^{[2]}_{1,8}). The transformation is now complete: the updated data sequence contains T − 2 = 6 detail and 2 smooth coefficients.

Some important features of TGUW transformation
Before formulating the TGUW transformation in generality, we describe how it achieves a sparse representation of the data. Sometimes we will refer to a detail coefficient d·_{p,q,r} as d^{(j,k)}_{p,q,r} or d^{(j,k)}, where j = 1, . . . , J is the scale of the transform (i.e. the consecutive pass through the data) at which d·_{p,q,r} was computed, k = 1, . . . , K(j) is the location index of d·_{p,q,r} within all scale-j coefficients, and d·_{p,q,r} is d^{[1]}_{p,q,r} or d^{[2]}_{p,q,r} or d_{p,q,r}, depending on the type of merge. The TGUW transform eventually converts the input data sequence X of length T into a sequence containing 2 smooth and T − 2 detail coefficients through T − 2 orthonormal transforms,

(s^{[1]}_{1,T}, s^{[2]}_{1,T}, {d^{(j,k)}}_{j=1,...,J, k=1,...,K(j)}) = Ψ^⊤ X,   (3)

where Ψ is a data-adaptively chosen orthonormal unbalanced wavelet basis for R^T. The detail coefficients d^{(j,k)} can be regarded as scalar products between X and particular unbalanced wavelet vectors ψ^{(j,k)}: formally, d^{(j,k)} = ⟨X, ψ^{(j,k)}⟩ for the detail coefficients, and s^{[1]}_{1,T} = ⟨X, ψ^{(0,1)}⟩, s^{[2]}_{1,T} = ⟨X, ψ^{(0,2)}⟩ for the two smooth coefficients. The TGUW transform is nonlinear, but it is conditionally linear and orthonormal given the order in which the merges are performed. The orthonormality of the unbalanced wavelet basis {ψ^{(j,k)}} implies Parseval's identity,

Σ_{t=1}^{T} X_t² = (s^{[1]}_{1,T})² + (s^{[2]}_{1,T})² + Σ_{j,k} (d^{(j,k)})².   (4)

Furthermore, the filters (ψ^{(0,1)}, ψ^{(0,2)}) corresponding to the two smooth coefficients s^{[1]}_{1,T} and s^{[2]}_{1,T} form an orthonormal basis of the subspace of linear vectors {v ∈ R^T : v_t = a + bt}; see Section E of the supplementary materials for further details. This implies that X̂ = s^{[1]}_{1,T} ψ^{(0,1)} + s^{[2]}_{1,T} ψ^{(0,2)} is the best linear regression fit to X, achieved by minimising the sum of squared errors. This, combined with Parseval's identity above, implies Σ_{t=1}^{T} (X_t − X̂_t)² = Σ_{j,k} (d^{(j,k)})². By construction, the detail coefficients |d^{(j,k)}| obtained in the initial stages of the TGUW transform tend to be small in magnitude. Then Parseval's identity in (4) implies that a large portion of Σ_{t=1}^{T} (X_t − X̂_t)² is explained by only a few large |d^{(j,k)}|'s arising in the later stages of the transform; in this sense, the TGUW transform provides sparsity of signal representation.
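The role of orthonormality in this energy bookkeeping can be checked numerically: for any orthonormal basis of R^T, the coefficient energies sum to the energy of the data. The sketch below uses an arbitrary orthonormal matrix in place of the adaptive TGUW basis (which it is not), purely to illustrate Parseval's identity.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 64
x = rng.normal(size=T)

# An arbitrary orthonormal matrix stands in for the (conditionally chosen) basis Psi.
Q, _ = np.linalg.qr(rng.normal(size=(T, T)))
coeffs = Q.T @ x          # inner products <x, psi^{(j,k)}>

# Parseval: sum of squared coefficients equals sum of squared data values.
print(np.allclose(np.sum(coeffs**2), np.sum(x**2)))  # True
```

The identity holds for every orthonormal Q; what is special about the TGUW basis is not the identity itself but the adaptive choice that concentrates the energy in few coefficients.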

TGUW transformation: general algorithm
In this section, we formulate in generality the TGUW transformation illustrated in Section 2.2.2 by showing how an adaptive orthonormal unbalanced wavelet basis, Ψ in (3), is constructed. One of the important principles is "tail-greediness" (Fryzlewicz, 2018) which enables us to reduce the computational complexity by performing multiple merges over non-overlapping regions in a single pass over the data. More specifically, it allows us to perform up to max{2, ⌈ρα j ⌉} merges at each scale j, where α j is the number of smooth coefficients in the data sequence s and ρ ∈ (0, 1) (the lower bound of 2 is essential to permit a Type 3 transformation, which consists of two merges).
We now describe the TGUW algorithm.
1. At each scale j, find the set of triplets that are candidates for merging under the "two together" rule and compute the corresponding detail coefficients. Regardless of the type of merge, a detail coefficient d·_{p,q,r} is, in general, obtained as

d·_{p,q,r} = a s¹_{p:r} + b s²_{p:r} + c s³_{p:r},   (7)

where p ≤ q < r, s^k_{p:r} is the k-th smooth coefficient of the subvector s_{p:r} of length r − p + 1, and the constants a, b, c are the elements of the detail filter h = (a, b, c)^⊤. We note that (a, b, c) also depends on (p, q, r), but for simplicity this is not reflected in the notation.
The detail filter is a weight vector used in computing the weighted sum of a triplet of smooth coefficients which should satisfy the condition that the detail coefficient is zero if and only if the corresponding raw observations over the merged regions have a perfect linear trend.
If (X p , . . . , X r ) are the raw observations associated with the triplet of the smooth coefficients (s 1 p:r , s 2 p:r , s 3 p:r ) under consideration, then the detail filter h is obtained in such a way as to produce zero detail coefficient only when (X p , . . . , X r ) has a perfect linear trend, as the detail coefficient itself represents the extent of non-linearity in the corresponding region of data.
This implies that the smaller the size of the detail coefficient, the closer the alignment of the corresponding data section with linearity.
2. Summarise all d·_{p,q,r} constructed in step 1 into an (equal-length or shorter) sequence of d_{p,q,r} by computing a summary detail coefficient d_{p,q,r} = max(|d^{[1]}_{p,q,r}|, |d^{[2]}_{p,q,r}|) for each pair of detail coefficients constructed by a Type 3 merge.
3. Sort the summarised detail coefficients |d_{p,q,r}| obtained in step 2 in non-decreasing order of magnitude.
4. Extract the (non-summarised) detail coefficient(s) |d·_{p,q,r}| corresponding to the smallest (summarised) detail coefficient |d_{p,q,r}|; e.g. both |d^{[1]}_{p,q,r}| and |d^{[2]}_{p,q,r}| should be extracted if d_{p,q,r} = max(|d^{[1]}_{p,q,r}|, |d^{[2]}_{p,q,r}|). Repeat the extraction until max{2, ⌈ρα_j⌉} detail coefficients (or all possible ones, whichever is the smaller number) have been obtained, as long as the region of the data corresponding to each extracted detail coefficient does not overlap with the regions corresponding to the detail coefficients already drawn.
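Steps 2-4 above can be sketched as follows. Candidate merges are represented as (|d|, p, r) triples, with Type 3 pairs assumed to be already summarised by their maximum; the function then accepts the smallest details whose data regions do not overlap, up to the tail-greedy budget max{2, ⌈ρα_j⌉}. The function name and the candidate list are illustrative, not taken from the paper's implementation.

```python
import math

def select_merges(candidates, alpha_j, rho=0.04):
    """candidates: list of (detail_size, p, r), where detail_size = |d|
    (for a Type 3 pair, the summarised maximum of the two details) and
    [p, r] is the region of data the merge would cover."""
    budget = max(2, math.ceil(rho * alpha_j))
    accepted = []
    for d, p, r in sorted(candidates):       # non-decreasing in |d|
        if len(accepted) == budget:
            break
        # accept only if [p, r] does not overlap an already-accepted region
        if all(r < p2 or p > r2 for _, p2, r2 in accepted):
            accepted.append((d, p, r))
    return accepted

print(select_merges([(0.1, 1, 3), (0.05, 2, 4), (0.2, 5, 7), (0.3, 6, 8)],
                    alpha_j=8))   # -> [(0.05, 2, 4), (0.2, 5, 7)]
```

In the example, (0.1, 1, 3) is skipped because its region overlaps the already-accepted [2, 4], and the budget max{2, ⌈0.04 · 8⌉} = 2 then stops the extraction.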

5. For each |d·_{p,q,r}| extracted in step 4, merge the corresponding smooth coefficients by updating the corresponding triplet in s through the orthonormal transform

(d·_{p,q,r}, s^{[1]}, s^{[2]})^⊤ = Λ (s¹_{p:r}, s²_{p:r}, s³_{p:r})^⊤.   (8)

The key step is finding the 3 × 3 orthonormal matrix Λ, whose rows consist of one detail filter and two low-pass filter vectors. First the detail filter h^⊤ is determined to satisfy the condition described in step 1, and then the two low-pass filters (ℓ_1^⊤, ℓ_2^⊤) are obtained from the orthonormality of Λ. The choice of (ℓ_1^⊤, ℓ_2^⊤) is not unique, but this has no effect on the transformation itself. The details of this mechanism can be found in Section E of the supplementary materials.
6. Go to step 1 and repeat at new scale j = j + 1 as long as we have at least three smooth coefficients in the updated data sequence s.
More specifically, when a Type 1 merge is performed in step 1 (i.e. s_{p:r} in (7) consists of three initial smooth coefficients, which implies r = p + 2), the corresponding detail filter h is obtained as a unit normal vector to the plane {(x, y, z) : x − 2y + z = 0}; thus the detail coefficient represents the projection of the three initial smooth coefficients onto this unit normal vector. In the same manner, due to the orthonormality of Λ in (8), the two low-pass filters (ℓ_1^⊤, ℓ_2^⊤) form an arbitrary orthonormal basis of the plane {(x, y, z) : x − 2y + z = 0}. In practice, the detail filter h in step 1 is obtained by updating so-called weight vectors of constancy and linearity, whose initial inputs have the form (1, 1, . . . , 1)^⊤ and (1, 2, . . . , T)^⊤, respectively. The details can be found in Section F of the supplementary materials.
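For a Type 1 merge, this construction can be checked directly: with the unit normal h = (1, −2, 1)/√6, the detail coefficient vanishes exactly when the three points are collinear, and any pair of orthonormal vectors spanning the plane completes an orthonormal Λ. The particular low-pass filters below are one arbitrary such choice, not the ones used by the algorithm.

```python
import numpy as np

# Type 1 detail filter: unit normal to the plane {(x, y, z) : x - 2y + z = 0}
h = np.array([1.0, -2.0, 1.0]) / np.sqrt(6.0)

print(h @ np.array([2.0, 3.5, 5.0]))   # collinear triplet -> detail = 0
print(h @ np.array([2.0, 3.5, 6.0]))   # kink -> non-zero detail

# One (non-unique) choice of low-pass filters lying in the plane,
# completing an orthonormal 3x3 matrix Lambda:
l1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
l2 = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0)
Lam = np.vstack([h, l1, l2])
print(np.allclose(Lam @ Lam.T, np.eye(3)))   # True: Lam is orthonormal
```

Replacing (l1, l2) by any other orthonormal basis of the same plane changes the two smooth coefficients but not the detail coefficient, illustrating why the non-uniqueness is harmless.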
We now comment briefly on the computational complexity of the TGUW transform. Assume that α_j smooth coefficients are available in the data sequence s at scale j and that we allow the algorithm to merge up to ρα_j triplets (unless their corresponding data regions overlap), where ρ ∈ (0, 1) is a constant. This leaves at most (1 − ρ)^j T smooth coefficients in s after j scales. Solving (1 − ρ)^J T ≤ 2 gives the largest number of scales J as ⌈log(T/2)/log{(1 − ρ)^{-1}}⌉, at which point the TGUW transform terminates with two smooth coefficients remaining. Considering that the most expensive step at each scale is the sorting, which takes O(T log(T)) operations, the computational complexity of the TGUW transformation is O(T log²(T)).
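The scale count can be verified numerically: solving (1 − ρ)^J T ≤ 2 for the smallest integer J gives J = ⌈log(T/2)/log{(1 − ρ)^{-1}}⌉, which grows logarithmically in T. A quick check with the default ρ = 0.04:

```python
import math

def num_scales(T, rho=0.04):
    """Smallest J with (1 - rho)**J * T <= 2, i.e.
    J = ceil(log(T / 2) / log(1 / (1 - rho)))."""
    return math.ceil(math.log(T / 2) / math.log(1 / (1 - rho)))

for T in (100, 1000, 10000, 100000):
    print(T, num_scales(T))   # J grows like log(T)
```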

Thresholding
Because at each stage the TGUW transform constructs the smallest possible detail coefficients, while remaining orthonormal and thus preserving the l2 energy of the input data, the variability (i.e. deviation from linearity) of the signal tends to be encoded mainly in the few detail coefficients computed at the later stages of the transform. The resulting sparsity of representation of the input data in the domain of TGUW coefficients justifies thresholding as a way of deciding the significance of each detail coefficient (which measures the local deviation from linearity).

Figure 4: The tree of mergings in the example of Section 2.2.2. The left-hand trees show examples of trees obtained from initial hard thresholding; the right-hand trees come from processing the respective left-hand ones by applying (a) the "connected" rule and (b) the "two together" rule, described in Section 2.3. The circled detail coefficients are the surviving ones.
We propose to threshold the TGUW detail coefficients under two important rules, which must be satisfied simultaneously; we refer to these as the "connected" rule and the "two together" rule. The "two together" rule in thresholding is similar to the one in the TGUW transformation, except that it targets pairs of detail rather than smooth coefficients, and it only applies to pairs of detail coefficients arising from Type 3 merges. Figure 4b shows one such pair in the example of Section 2.2.2, and the "two together" rule means that both such detail coefficients should be kept if at least one survives the initial thresholding. This is a natural requirement, as a pair of Type 3 detail coefficients effectively corresponds to a single merge of two adjacent regions.
The "connected" rule prunes the branches of the tree of TGUW detail coefficients: a detail coefficient is removed if and only if the coefficient itself and all of its children coefficients fall below a certain threshold in absolute value. This is illustrated in Figure 4a for the example of Section 2.2.2; if both d_{2,3,4} and (d^{[1]}_{1,4,8}, d^{[2]}_{1,4,8}) were to survive the initial thresholding, the "connected" rule would mean that we also had to keep d_{1,1,4}, which is the child of (d^{[1]}_{1,4,8}, d^{[2]}_{1,4,8}) and the parent of d_{2,3,4} in the TGUW coefficient tree.
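The "connected" rule can be phrased recursively: a detail coefficient survives exactly when its own magnitude, or that of some descendant in the coefficient tree, exceeds the threshold. A sketch on a toy tree (the (detail, children) encoding is an illustrative choice, not the paper's data structure):

```python
def survives(node, lam):
    """node = (detail_magnitude, children). A node survives thresholding
    if its own detail exceeds lam or if any descendant's does; equivalently,
    it is pruned iff it and all of its children fall below lam."""
    d, children = node
    return d > lam or any(survives(c, lam) for c in children)

# toy tree: only one deep detail coefficient is large
big = (3.0, [])
leaf = (0.2, [])
root = (0.5, [(0.4, [big]), leaf])

print(survives(root, lam=1.0))   # True: kept, since a descendant is large
print(survives(leaf, lam=1.0))   # False: pruned together with its empty subtree
```

This matches the rule above: every ancestor of a surviving coefficient survives too, so the non-zero coefficients remain connected in the tree.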
Through the thresholding, we wish to estimate the underlying signal f in (1) by estimating the coefficients µ^{(j,k)} = ⟨f, ψ^{(j,k)}⟩, where {ψ^{(j,k)}} is the orthonormal unbalanced wavelet basis constructed from the data in the TGUW transform. Throughout the entire thresholding procedure, the "connected" and "two together" rules are applied in this order. We first threshold and apply the "connected" rule, which gives µ̂^{(j,k)}_0, the initial estimator of µ^{(j,k)}: µ̂^{(j,k)}_0 = d^{(j,k)} if |d^{(j',k')}| > λ for (j', k') = (j, k) or for at least one descendant (j', k') of (j, k) in the coefficient tree, and µ̂^{(j,k)}_0 = 0 otherwise, where λ is the threshold. Now the "two together" rule is applied to the initial estimators µ̂^{(j,k)}_0 to obtain the final estimators µ̂^{(j,k)}. We first note that two detail coefficients d^{(j,k)}_{p,q,r} and d^{(j',k+1)}_{p',q',r'} are called "paired" when they are formed by a Type 3 merge and when (j, p, q, r) = (j', p', q', r'). The "two together" rule then keeps both members of a pair, setting µ̂^{(j,k)} = d^{(j,k)} for both, whenever at least one of the two initial estimators is non-zero, and otherwise sets µ̂^{(j,k)} = µ̂^{(j,k)}_0. It is important to note that the application of the two rules ensures that f̃ is a piecewise-linear function composed of best linear fits (in the least-squares sense) for each interval of linearity.
As an aside, we note that the number of surviving detail coefficients does not necessarily equal the number of change-points in f̃, as a pair of detail coefficients arising from a Type 3 merge is associated with a single change-point.

Inverse TGUW transformation
The estimator f̃ of the true signal f in (1) is obtained by inverting (i.e. transposing) the orthonormal transformations in (8) in reverse order to that in which they were originally performed. This inverse TGUW transformation is referred to as TGUW^{-1}, and thus

f̃ = TGUW^{-1}( (s^{[1]}_{1,T}, s^{[2]}_{1,T}) ⊕ {µ̂^{(j,k)}}_{j,k} ),   (12)

where ⊕ denotes vector concatenation.
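Since each merge is an orthonormal 3 × 3 transformation acting on part of the coefficient vector, inversion amounts to applying the transposes in the reverse order. A toy illustration with arbitrary orthonormal matrices standing in for the TGUW filters:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=6)

def embed(Lam, idx, T=6):
    """Orthonormal T x T matrix acting as Lam on positions idx, identity elsewhere."""
    M = np.eye(T)
    M[np.ix_(idx, idx)] = Lam
    return M

# two arbitrary orthonormal 3x3 transforms, applied in sequence
L1, _ = np.linalg.qr(rng.normal(size=(3, 3)))
L2, _ = np.linalg.qr(rng.normal(size=(3, 3)))
M1, M2 = embed(L1, [0, 1, 2]), embed(L2, [2, 3, 4])

coeffs = M2 @ (M1 @ x)                   # forward: merges in order
x_back = M1.T @ (M2.T @ coeffs)          # inverse: transposes, reverse order
print(np.allclose(x_back, x))            # True
```

In the actual procedure the thresholded coefficients, rather than the raw ones, are passed through this inversion, which is what produces the piecewise-linear estimate.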

Post-processing for consistency of change-point detection
As will be formalised in Theorem 1 of Section 3, the piecewise-linear estimator f̃ in (12) is l2-consistent but may overestimate the number of change-points; we therefore post-process it in the following two stages.

Stage 1. We transform the estimated function f̃ in (12), with change-points (η̃_1, η̃_2, . . . , η̃_Ñ), into a new estimator f̄ with corresponding change-points (η̄_1, η̄_2, . . . , η̄_N̄). Using f̃ in (12) as the input data sequence s, we perform the TGUW transform as presented in Section 2.2.4, but in a greedy rather than tail-greedy way, such that only one detail coefficient d^{(j,1)} is produced at each scale j, and thus K(j) = 1 for all j. We keep producing detail coefficients until the first detail coefficient with |d^{(j,1)}| > λ is obtained, where λ is the parameter used in the thresholding procedure described in Section 2.3. Once the condition |d^{(j,1)}| > λ is satisfied, we stop merging, relabel the surviving change-points as (η̄_1, η̄_2, . . . , η̄_N̄) and construct the new estimator f̄ as

f̄_t = θ̂_{i,1} + θ̂_{i,2} t for t ∈ [η̄_{i−1} + 1, η̄_i], i = 1, . . . , N̄ + 1,   (13)

where η̄_0 = 0, η̄_{N̄+1} = T and (θ̂_{i,1}, θ̂_{i,2}) are, respectively, the OLS intercept and slope coefficients fitted to the data in the i-th segment. The exception is when the region under consideration contains only a single data point X_{t_0}, in which case fitting a linear regression is impossible; we then set f̄_{t_0} = X_{t_0}.
Stage 2. From the estimator f̄_t obtained in Stage 1, we obtain the final estimator f̂ by pruning the change-points (η̄_1, η̄_2, . . . , η̄_N̄) in f̄_t: for each i = 1, . . . , N̄, we compute the corresponding detail coefficient and prune those change-points whose detail coefficients fall below the threshold. The estimated function f̂ is obtained by simple linear regression on each region determined by the final change-points η̂_1, . . . , η̂_N̂, as in (13), with the exception of the single-data-point case described in Stage 1 above.
Through these two stages of post-processing, the estimation of the number and the locations of change-points becomes consistent; further details can be found in Section 3.

Theoretical results
We study the l2 consistency of f̃ and f̄, and the change-point detection consistency of f̂, where the estimators are defined in Section 2. For an estimator f* of f, the l2 risk is measured by T^{-1} Σ_{t=1}^{T} (f*_t − f_t)², where f is the underlying signal in (1). We first investigate the l2 behaviour of f̃. The proofs of Theorems 1-3 can be found in Appendix 6.
Theorem 1 Let X_t follow model (1) with σ = 1 and let f̃ be the estimator in (12). If the threshold λ is of order (log T)^{1/2}, then, with probability approaching 1 as T → ∞, f̃ is l2-consistent and the piecewise-linear estimator f̃ contains Ñ ≤ CN log(T) change-points, where C is a constant.
Thus, f̃ is l2-consistent under the strong sparsity assumption (i.e. if N is finite), or even under the relaxed condition that N is of order log T. The crucial mechanism behind the l2 consistency is the "tail-greediness", which allows up to K(j) ≥ 1 smooth coefficients to be removed at each scale j.
In other words, consistency is generally unachievable if we proceed in a greedy (as opposed to tail-greedy) way, i.e. if we only merge one triplet at each scale of the TGUW transformation.
We now move on to the estimator f̄ obtained in the first stage of post-processing.
Theorem 2 Let X_t follow model (1) with σ = 1 and let f̄ be the estimator in (13), with the threshold λ as in Theorem 1. Then, with probability approaching 1 as T → ∞, f̄ is l2-consistent and there exist at most two estimated change-points between each pair of true change-points.

We see that f̄ is l2-consistent, but inconsistent for the number of change-points. We now investigate the final estimators, f̂ and N̂.
Theorem 3 Let X_t follow model (1) with σ = 1 and let (f̂, N̂) be the estimators obtained in Section 2.5. Let the threshold λ be as in Theorem 1 and suppose that the number of true change-points, N, is allowed to grow with T. Our theory indicates that when min_i f̄_{i,T} ∼ T^{-1}, the change-point detection rate of the TrendSegment procedure is O_p(T^{2/3} log T). If the number of true change-points, N, is finite, then the detection accuracy becomes O_p(T^{2/3} (log T)^{2/3}). Comparing it with the rate of O_p(T^{2/3} (log T)^{1/3}) derived by Baranowski et al. (2019) and Anastasiou and Fryzlewicz (2022), and with the rate of O_p(T^{2/3}) derived by Raimondo (1998), our detection accuracy differs by only a logarithmic factor. In the case in which min_i f̄_{i,T} is bounded away from zero, consistent estimation of the number and locations of change-points is achieved under the condition on T^{1/3} R_T^{1/3} given in the theorem. In addition, when there exists a separate data segment containing only one data point, the two consecutive change-points η_{k−1} and η_k enclosing it can both be detected. In what follows, we disable Stages 1 and 2 of post-processing by default: our empirical experience is that Stage 1 rarely makes a difference in practice but comes with an additional computational cost, and Stage 2 occasionally over-prunes change-point estimates.
Choice of the "tail-greediness" parameter
ρ ∈ (0, 1) is a constant which controls the greediness level of the TGUW transformation, in the sense that it decides how many merges are performed in a single pass over the data. A large ρ reduces the computational cost but makes the procedure less adaptive, whereas a small ρ has the opposite effect. Based on our empirical experience, the best performance is stably achieved in the range ρ ∈ (0, 0.05], and we use ρ = 0.04 as the default in the simulation study and data analyses.
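To build intuition, here is a minimal, self-contained sketch (in Python, not the R package) of how ρ trades off the number of bottom-up passes; the merge-count rule K(j) = ⌈ρ·m_j⌉, with m_j the current number of regions, is our simplified paraphrase of the tail-greedy rule, not the exact TGUW bookkeeping.

```python
import math

def tguw_pass_counts(T, rho):
    """Count bottom-up passes until one region remains, assuming
    (as a simplification) that each pass performs K(j) = ceil(rho * m_j)
    merges, where m_j is the current number of regions."""
    m, passes = T, 0
    while m > 1:
        k = max(1, math.ceil(rho * m))   # "tail-greedy": many merges per pass
        m -= min(k, m - 1)               # each merge removes one region
        passes += 1
    return passes

# Larger rho => fewer passes => lower computational cost, less adaptivity.
print(tguw_pass_counts(1000, 0.04), tguw_pass_counts(1000, 0.5))
```

With ρ = 0.04 the number of passes is of order log(T)/ρ, consistent with the bound J ≤ ⌈log T / log((1 − ρ)^{-1}) + log 2 / log(1 − ρ)⌉ used in the proofs.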

Choice of the minimum segment length
We can give a condition on the minimum segment length of the estimated signal returned by the TrendSegment algorithm. If it is set to 1, two consecutive data points can be detected as change-points. As shown theoretically in the supplementary material, the minimum length of an estimated segment should be of order log(T) to achieve estimation consistency in the case of dependent and/or non-Gaussian errors. To avoid overly short segments, and to cover non-iid-Gaussian noise, we set the minimum segment length to C log(T) and use the default C = 0.9 in the remainder of the paper; a much larger C would prevent the detection of the short segments in (M6).
This constraint can be adjusted by users in the R package trendsegmentR.
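As a quick numerical illustration (floor-rounding as in the ⌊0.9 log T⌋ default quoted in the supplementary material; the use of the natural logarithm is our assumption):

```python
import math

def min_segment_length(T, C=0.9):
    """Default minimum segment length floor(C * log(T)); C = 0.9 is the
    paper's default, and the natural log is our assumption here."""
    return max(1, math.floor(C * math.log(T)))

# For the Reykjavik series (T = 251 years), segments shorter than this
# cannot be returned by the algorithm under the default setting.
print(min_segment_length(251))
```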

Continuity at change-points
As described in Section 2, the TrendSegment algorithm works by detecting change-points first (in the thresholding step) and then estimating the best linear fit (in the least-squares sense) for each segment (in the inverse TGUW transform). These procedures normally yield fits that are discontinuous at change-points; however, our R package trendsegmentR has an option for enforcing continuity at change-points by approximating f with a linear spline fit whose knots are placed at the detected change-points.
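A minimal sketch of this continuity option (a plain least-squares linear-spline fit in Python with numpy; the trendsegmentR implementation details may differ):

```python
import numpy as np

def continuous_piecewise_fit(x, knots):
    """Least-squares linear-spline fit with knots at detected change-points,
    i.e. a fit forced to be continuous (a sketch of the continuity option,
    not the trendsegmentR implementation)."""
    t = np.arange(len(x))
    # Design matrix: intercept, global slope, and one hinge term per knot.
    cols = [np.ones_like(t, dtype=float), t.astype(float)]
    cols += [np.maximum(t - k, 0).astype(float) for k in knots]
    A = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(A, x, rcond=None)
    return A @ beta

# Noise-free check: a continuous piecewise-linear signal is recovered exactly.
t = np.arange(100)
f = np.where(t < 50, 2.0 * t, 100.0 - 1.0 * (t - 50))
fit = continuous_piecewise_fit(f.astype(float), knots=[50])
print(np.max(np.abs(fit - f)))
```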

Choice of threshold λ
Motivated by Theorem 1, we consider the simplest naïve threshold of the form λ_Naïve = C σ̂ √(2 log T), (16) where σ can be estimated in different ways depending on the type of noise. Under iid Gaussian noise, we can estimate σ using the Median Absolute Deviation (MAD) estimator (Hampel, 1974), calibrated via the quantile function of the Gaussian distribution. We found that under iid Gaussian noise C = 1.3 empirically leads to the best performance over a sequence of candidate values of C; the details, and the corresponding results for non-Gaussian and/or dependent errors, can be found in Section C of the supplementary material. For completeness, we now present an algorithm for a threshold that works well in all circumstances. When the noise is not iid Gaussian, it is reasonable to assume that the threshold is affected by the serial dependence structure and/or the extent of heavy-tailedness of the noise, which motivates a threshold of the form λ_Robust = C I g(K) √(2 log T), (17) where I is the long-run standard deviation, K is the kurtosis and g is a function. To estimate the unknown parameters in (17), we follow Algorithm 1.
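A sketch of the naïve threshold in Python (the use of second differences to cancel a linear trend before applying MAD, and the resulting √6 scaling, are our assumptions; the exact estimator used in trendsegmentR may differ):

```python
import numpy as np

def mad_sigma(x):
    """MAD estimate of the noise level from second differences of the data.
    Second differences cancel a linear trend; under iid noise they inflate
    the standard deviation by sqrt(6), and 0.6745 = qnorm(0.75) calibrates
    MAD to a Gaussian standard deviation. (Our assumed variant.)"""
    d2 = np.diff(x, n=2)
    mad = np.median(np.abs(d2 - np.median(d2)))
    return mad / (0.6744897501960817 * np.sqrt(6.0))

def naive_threshold(x, C=1.3):
    # lambda_Naive = C * sigma_hat * sqrt(2 log T), as in (16)
    T = len(x)
    return C * mad_sigma(x) * np.sqrt(2.0 * np.log(T))

rng = np.random.default_rng(1)
x = 0.5 * np.arange(2000) + rng.normal(0.0, 2.0, 2000)  # linear trend + noise
print(mad_sigma(x))   # close to the true sigma = 2
```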
Algorithm 1 Robust threshold selection
INPUT: X, λ_Naïve, C, η_max
1. Pre-estimate the fit, f̂_t, via the TrendSegment algorithm with η_max, where η_max is a pre-specified maximum number of estimated change-points.
2. Compute the empirical residuals, ε̂_t, from the pre-fit obtained in step 1.
3. From ε̂_t, compute the sample kurtosis (K̂) and the long-run standard deviation estimator (Î) based on an AR(1) model. See the comments underneath the algorithm for details of this step.

4. Plug Î and K̂ into the threshold formula in (17) with a pre-specified C.
5. To estimate the function g, find a non-parametric regression fit with X = K̂ and Y = λ_Naïve / (C Î √(2 log T)), where λ_Naïve is chosen as the best performing threshold by repeating the simulations over a range of values of the threshold constant C and over different types of noise.
OUTPUT: The robust threshold λ Robust .
We now describe the details of each step in Algorithm 1.
Pre-estimated fit in Step 1. In (17), the heavy-tailedness and dependence structure of the noise are captured by K and I, respectively. In practice, estimating I and K is challenging as the observations include change-points in the underlying signal. One of the most straightforward ways is pre-estimating the fit f̂_t via the TrendSegment algorithm with a parameter η_max, the maximum number of estimated change-points. As long as η_max is not too large, some degree of overestimation is acceptable, and we use η_max = ⌈0.15T⌉ as the default in practice, as it empirically led to the best performance and the simulation results do not vary much over the range η_max ∈ [⌈0.1T⌉, ⌈0.2T⌉]. The pre-fitting gives us the estimated noise ε̂_t = X_t − f̂_t, from which we can estimate both I and K.
Pre-specified constant C in Step 4. We set C = 1.3 as it empirically led to the best performance for iid Gaussian noise with the naïve approach in (16). Thus we hope to have both Î and ĝ(K̂) close to 1 under iid Gaussian noise, but larger than 1 when the noise has serial dependence and/or heavy tails.
I and K in Step 3. I and K capture the dependence and heavy-tailedness of the noise, respectively.
First, the kurtosis is estimated from the estimated noise as K̂ = T⁻¹ Σ_{t=1}^T ((ε̂_t − ε̄̂)/ŝ_ε)⁴, (18) where ε̄̂ and ŝ_ε are the sample mean and sample standard deviation of ε̂, respectively. For estimating I, we consider the case when Gaussian noise has a dependence structure. The dependence then increases the marginal variance of the CUSUM statistic, and one way of addressing this issue is to inflate the threshold by the factor in (19), where φ is the true parameter of an AR(1) process (Fearnhead and Fryzlewicz, 2022). We can estimate φ by fitting an AR(1) model to the estimated noise ε̂_t = X_t − f̂_t, and this gives us the estimated long-run standard deviation Î. Although in theory the inflation factor in (19) is valid only for Gaussian noise, we use the estimator of (19) as an estimated long-run standard deviation even when the noise has both serial dependence and heavy tails, hoping that the heavy-tailedness is captured reasonably well by K.
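The two estimators of this step can be sketched as follows (the AR(1) inflation factor √((1+φ)/(1−φ)) is our reading of (19), and the lag-1 least-squares estimate of φ is one simple choice; neither is taken verbatim from the package):

```python
import numpy as np

def sample_kurtosis(eps):
    """K-hat = T^{-1} * sum_t ((eps_t - mean)/sd)^4, as in (18)."""
    z = (eps - eps.mean()) / eps.std()
    return np.mean(z ** 4)

def longrun_sd_inflation(eps):
    """Fit AR(1) to the residuals by lag-1 least squares, then inflate by
    sqrt((1 + phi)/(1 - phi)) -- the factor we assume (19) denotes,
    following Fearnhead and Fryzlewicz (2022)."""
    e = eps - eps.mean()
    phi = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)
    return np.sqrt((1.0 + phi) / (1.0 - phi))

rng = np.random.default_rng(7)
# AR(1) noise with phi = 0.6 and Gaussian innovations, as in scenario (c).
T, phi = 5000, 0.6
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = phi * eps[t - 1] + rng.normal()
print(sample_kurtosis(eps), longrun_sd_inflation(eps))
```

For this Gaussian AR(1) example the sample kurtosis is near 3 (no heavy tails) while the inflation factor is near √(1.6/0.4) = 2, so the threshold is widened by the dependence alone.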
Kurtosis function g in Step 5. We fit a non-parametric regression, as described in step 5 of Algorithm 1, over different models and noise scenarios. We found that g(K̂) has no particular functional form in K̂, and is scattered between 0.9 and 1.6 over all noise scenarios and all simulation models considered in the paper. Therefore, the resulting non-parametric fit ĝ(K̂) also has a flat shape over the range of K̂, and we use this fit in computing the robust threshold in practice. This is due to the condition on the minimum segment length described earlier, which helps the method to be robust to spikes.
The detailed procedure of estimating g is presented in Section C.2 of the supplementary material. Also, the simulation results using Algorithm 1 for dependent and/or heavy-tailed noise can be found in Tables C.1 -C.10 in Section C.1 of the supplementary material. The proposed robust threshold selection algorithm can also be applied to iid Gaussian noise without any knowledge on type of noise and the corresponding simulation results are given in Section 4.3.
We consider iid Gaussian noise and simulate data from model (1) under the signals (M1)-(M8). One of these signals has three particularly short segments containing 12, 9 and 6 data points, respectively, and (M6) has isolated spike-type short segments containing 6 data points each. (M7) is piecewise-constant, and (M8) is a linear signal without change-points. The signals and R code for all simulations can be downloaded from our GitHub repository (Maeng and Fryzlewicz, 2021), and the simulation results under dependent or heavy-tailed errors can be found in Section C of the supplementary material.

Competing methods and estimators
We perform the TrendSegment procedure based on the parameter choices in Section 4.1 and compare its performance with that of the following competitors: Narrowest-Over-Threshold detection (NOT, Baranowski et al. (2019)) implemented in the R package not from CRAN, Isolate-Detect (ID, Anastasiou and Fryzlewicz (2022)) implemented in the R package IDetect, trend filtering (TF), CPOP, and a bottom-up baseline (BUP). TrendSegment itself is implemented in the R package trendsegmentR.
As BUP requires a pre-specified number of change-points (or a well-chosen stopping criterion, which can vary depending on the data), we include it in the simulation study (with the stopping criterion optimised for the best performance using knowledge of the truth) but not in the data applications. We do not include the methods of Spiriti et al. (2013) and Bai and Perron (2003), implemented in the R packages freeknotsplines and strucchange, as we have found them to be particularly slow. For instance, the minimum segment size in strucchange can be made small, as long as it is greater than or equal to 3, for detecting linear trend changes. This is suitable for detecting very short segments (e.g. in (M6) lin.sgmts), but the setting comes with extremely heavy computation: with this minimum segment size in place, a single signal simulated from (M6) took us over three hours to process on a standard PC.
Out of the competing methods tested, ID, TF and CPOP return continuous fits at change-points, while the estimated signals of TrendSegment and BUP are in principle discontinuous at change-points. For NOT, we use the contrast function for not-necessarily-continuous piecewise-linear signals. Regarding the tuning parameters of the competing methods, we follow the recommendations of each respective paper or the corresponding R package.

Results
The summary of the results for all models and methods can be found in Tables 2 and 3. We run 100 simulations and, as a measure of accuracy of the estimators, we use Monte-Carlo estimates of the Mean Squared Error of the estimated signal, defined as MSE = E{(1/T) Σ_{t=1}^T (f_t − f̂_t)²}. The empirical distribution of N̂ − N is also reported, where N̂ is the estimated number of change-points and N is the true one. In addition, for comparing the accuracy of the locations of the estimated change-points η̂_i, we show estimates of the scaled Hausdorff distance. We first emphasise that results are reported for both the naïve and the robust thresholds (λ_Naïve in (16) and λ_Robust in (17)). TrendSegment's fit can exhibit discontinuity at change-points; however, it shows the best performance in terms of localisation (i.e. the smallest mean Hausdorff distance), as it tends to estimate more than one (and somewhat frequent) change-point at discontinuous change-points. We see that TrendSegment has a particular advantage over the other methods especially in (M5) and (M6), where frequent change-points composed of isolated spike-type short segments of length 6 exist. This is due to the bottom-up nature of TrendSegment, which focuses on local features in the early stages of merging and enables TrendSegment to detect those short segments.
TrendSegment shows relative robustness in estimating the number and locations of change-points, while NOT, ID and CPOP significantly underperform.
For the estimation of the piecewise-constant signal (M7), no method performs well: NOT, ID and TrendSegment tend to underestimate the number of change-points, while CPOP and TF overestimate it. In the case of the no-change-point signal (M8), all methods estimate well except TF and BUP. In summary, TrendSegment is never among the worst methods, is almost always among the best ones, and is particularly attractive for signals containing frequent change-points with short segments. With respect to computation time, NOT and ID are very fast in all cases; TrendSegment is slower than these two but faster than TF, CPOP and BUP, especially when the length of the time series exceeds 2000.
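As a concrete sketch, one common form of the scaled Hausdorff distance used above (our reading; the paper's exact convention, e.g. for empty change-point sets, may differ) can be computed as:

```python
def scaled_hausdorff(true_cpts, est_cpts, T):
    """d_H = T^{-1} * max( max_j min_k |eta_hat_k - eta_j|,
                           max_k min_j |eta_hat_k - eta_j| ).
    Our reading of the scaled Hausdorff distance between the true and
    estimated change-point sets."""
    if not true_cpts or not est_cpts:
        return float("inf")
    d1 = max(min(abs(e - c) for e in est_cpts) for c in true_cpts)
    d2 = max(min(abs(e - c) for c in true_cpts) for e in est_cpts)
    return max(d1, d2) / T

# Perfect detection gives 0; a spurious extra estimate inflates the distance.
print(scaled_hausdorff([100, 200], [100, 200], 1000))
print(scaled_hausdorff([100, 200], [100, 200, 500], 1000))
```

Both directions matter: the first term penalises missed change-points, the second penalises spurious ones.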
5 Data applications

Average January temperatures in Iceland
We analyse a land temperature dataset available from https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data, consisting of average January temperatures recorded in Reykjavik from 1763 to 2013. Figure 6 shows the data; the point corresponding to 1918 appears to be anomalous. This is sometimes attributed to the unusually cold winter of 1918, which was responsible for the freezing of Reykjavik Harbour as well as for the unusually large number of polar bear sightings in northern Iceland. Out of the competing methods tested, ID, TF and CPOP are in principle able to classify two consecutive time points as change-points, and therefore they are able to detect separate data segments containing only one data point each. NOT and BUP are not designed to classify two consecutive time points as change-points, as their minimum distance between two consecutive change-points is restricted to be at least two. The TrendSegment algorithm, with its minimum segment length set to 1, detects not only change-points in linear trend but can also identify a separate data segment at the same time, which the competing methods do not achieve.

Monthly average sea ice extent of Arctic and Antarctic
We analyse the average sea ice extent of the Arctic and the Antarctic, available from https://nsidc.org, to estimate the change-points in its trend, as discussed in Serreze and Meier (2018).

Extension to non-Gaussian and/or dependent noise
Our TrendSegment algorithm can be extended to more realistic settings, e.g. when the noise ε_t is possibly dependent and/or non-Gaussian. The extension is performed by slightly altering the estimators f̂, f̃ and the final estimator, while keeping the rate of the threshold the same as the one used in Theorems 1-3 (i.e. λ = O((log T)^{1/2})), which was established under iid Gaussian noise. We add an additional step to ensure that only the detail coefficients d^{(j,k)}_{p,q,r} corresponding to a long enough interval [p, r] survive, as this step enables us to apply strong asymptotic normality to the partial sums Σ_{t=p}^r ε_t. Under dependent or non-Gaussian noise, Theorems 1-3 presented in Section 3 still hold with rates inflated by only a logarithmic factor; the corresponding theory and proofs can be found in Section B of the supplementary material.
In Algorithm 1 in Section 4.1.5, we propose a robust way of selecting the threshold that works well in all circumstances, including iid Gaussian noise. To demonstrate the robustness of our threshold selection when the noise has serial dependence and/or heavy tails, additional simulations are performed for five distributions of the noise: (a) ε_t ~ iid scaled t₅ distribution with unit variance, (b) ε_t follows a stationary AR(1) process with φ = 0.3 and Gaussian innovations, (c) the same setting as (b) but with φ = 0.6, (d) ε_t follows a stationary AR(1) process with φ = 0.3 and t₅ innovations, and (e) the same setting as (d) but with φ = 0.6; the results are summarised in Tables C.1-C.10 in Section C.1 of the supplementary material. Lastly, in Section D.2 of the supplementary material, we demonstrate that our TrendSegment algorithm performs well on London air quality data that possibly has some non-negligible autocorrelation.

Appendix A Technical proofs
The proofs of Theorems 1-3 are given below; Lemmas 1 and 2 can be found in Section A of the supplementary material.
Proof of Theorem 2. Let B̂ and B̃ be the unbalanced wavelet bases corresponding to f̂ and f̃, respectively. As the change-points in f̃ are a subset of those in f̂, establishing f̃ can be considered as applying the TGUW transform again to f̂, which is just a repetition of the procedure used in estimating f̂, performed in the greediest way. Thus B̃ falls into two categories: 1) all basis vectors ψ^{(j,k)} ∈ B̂ such that ψ^{(j,k)} is not associated with the change-points in f̃ and |⟨X, ψ^{(j,k)}⟩| = |d^{(j,k)}| < λ, and 2) all vectors ψ^{(j,1)} produced in Stage 1 of post-processing.
We now investigate how many scales are used for this particular transform. First, the detail coefficients d^{(j,k)} corresponding to the basis vectors ψ^{(j,k)} ∈ B̃ live on no more than J = O(log T) scales, and we have |S¹_j| ≤ N by the argument used in the proof of Theorem 1. In addition, the vectors ψ^{(j,1)} in the second category correspond to different change-points in f̂, and there exist at most Ñ = O(N log T) change-points in f̂, which we examine one at a time (i.e. |S¹_j| ≤ 1); thus at most Ñ scales are required for the d^{(j,1)}. Combining the results for the two categories, the equivalent of quantity II in the proof of Theorem 1 for f̃ is bounded by II ≤ C₃ N T⁻¹ log² T, where C₃ is a large enough positive constant, and this completes the proof of the ℓ2 result, ‖f̃ − f‖²_T = O(N T⁻¹ log²(T)).
Proof of Theorem 3. From the assumptions of Theorem 3, the following hold.
• Given any ε > 0 and C > 0, for some T₁ and all T > T₁, it holds that P(‖f̃ − f‖²_T > (C₃/4) R_T) ≤ ε, where f̃ is the estimated signal specified in Theorem 2.
Following the argument used in the proof of Theorem 19 in Lin et al. (2016), we take T ≥ T*, where T* = max{T₁, T₂}, and let r_{ℓ,T} = ⌊C^{1/3} T^{1/3} R_T^{1/3} (f̄_{ℓ,T})^{-2/3}⌋ for ℓ = 1, . . . , N. Suppose that there exists at least one η_ℓ whose closest estimated change-point is not within distance r_{ℓ,T}. Then there are no estimated change-points in f̃ within r_{ℓ,T} of η_ℓ, which means that f̃_j displays a linear trend over the entire segment j ∈ {η_ℓ − r_{ℓ,T}, . . . , η_ℓ + r_{ℓ,T}}. Hence the estimation error can be bounded from below; the first inequality in that bound holds by Lemma 20 of Lin et al. (2016), and the second one by the definition of r_{ℓ,T}. Assuming that at least one η_ℓ does not have an estimated change-point within distance r_{ℓ,T} implies that the estimation error exceeds (C₃/4) R_T, which is a contradiction, as this is an event that we know occurs with probability at most ε. Therefore, there must exist at least one estimated change-point within distance r_{ℓ,T} of each true change-point η_ℓ.
Throughout Stage 2 of post-processing, η̂_{ℓ₀} either is the closest estimated change-point to some η_ℓ or it is not. If η̂_{ℓ₀} is not the closest estimated change-point to the nearest true change-point on either its left or its right then, by the construction of the detail coefficients in Stage 2 of post-processing, Lemma 2 guarantees that the corresponding detail coefficient has magnitude less than λ and η̂_{ℓ₀} gets removed. Suppose η̂_{ℓ₀} is the closest estimated change-point to a true change-point η_ℓ and is within distance C T^{1/3} R_T^{1/3} (f̄_{ℓ,T})^{-2/3} of η_ℓ. If the corresponding detail coefficient has magnitude less than λ and η̂_{ℓ₀} is removed, there must exist another η̂_ℓ within distance C T^{1/3} R_T^{1/3} (f̄_{ℓ,T})^{-2/3} of η_ℓ. If there is no such η̂_ℓ then, by the construction of the detail coefficient, the magnitude of d_{p_{ℓ₀},q_{ℓ₀},r_{ℓ₀}} would be such that d_{p_{ℓ₀},q_{ℓ₀},r_{ℓ₀}} > λ, and thus η̂_{ℓ₀} would not get removed. Therefore, after Stage 2 of post-processing is finished, each true change-point η_ℓ has its unique estimator within distance C T^{1/3} R_T^{1/3} (f̄_{ℓ,T})^{-2/3}.
Lemma 1 For the basis vectors {ψ^{(j,k)}}_{j=1,...,J, k=1,...,K(j)} and {g_l}, we have the bound stated in the supplementary material, where λ is as in Theorem 1 and C₂ is a positive constant.
Proof. We first show that, for any fixed (j, k), g^{(j,k)}_i and φ^{(j,k)}_i satisfy the stated conditions. Depending on the type of merge, ψ^{(j,k)} falls into one of the following forms,
Type 1: ψ^{(j,k)}_{p,q,r} = α₁ e_p + α₂ e_{p+1} + α₃ e_{p+2}, (S.2)
Type 2: ψ^{(j,k)}_{p,q,r} = β₁ e_p + β₂ (0, . . . , 0, ℓ⊤_{1,p+1,r}, 0, . . . , 0)⊤ + · · · ,
Type 3: ψ^{(j,k)}_{p,q,r} = γ₁ (0, . . . , 0, ℓ⊤_{1,p,q}, 0, . . . , 0)⊤ + · · · ,
where e_i is a vector of length T having 1 in its i-th element and zeros elsewhere. As shown in Section E, ℓ_{1,i,j} and ℓ_{2,i,j} form an arbitrary orthonormal basis of the corresponding subspace. In any case, we can obtain the representation of ψ^{(j,k)} in the Type 3 form, and g^{(j,k)}_i is the corresponding vector. From the orthonormality of the basis (ℓ_{1,m,n}, ℓ_{2,m,n}) for any (m, n), we see that the stated conditions hold for i ≠ i′. In addition, as the ψ^{(j,k)} preserve orthonormality, we can argue that φ^{(j,k)}_i is bounded. If we predefine the pairs (ℓ_{1,m,n}, ℓ_{2,m,n}) for any (m, n) by choosing an orthonormal basis of the corresponding subspace, we obtain the stated tail bound, where φ_Z is the p.d.f. of a standard normal Z. This completes the proof.
where λ is as in Theorem 1.
Proof. On the set A_T, the following holds for j = 1, . . . , J and k ∈ S⁰_j, where ε = (ε₁, . . . , ε_T)⊤ and the ψ^{(j,k)}_{p,q,r} are as in (S.2). The condition Σ_i (φ^{(j,k)}_i)² = 1 for any fixed (j, k), given in the proof of Lemma 1, implies that max_i |φ^{(j,k)}_i| ≤ 1 for any (j, k); thus we have (S.6) when the constant C₁ for λ in (S.6) is larger than or equal to four times the C₁ used in (S.1).

B. Extension to dependent non-Gaussian noise
In this section, we extend the TGUW methodology to more realistic settings where the noise ε_t is possibly dependent and/or non-Gaussian. We borrow the idea proposed in the supplementary material of Fryzlewicz (2018), in the sense that the extension is performed by altering the estimators f̂, f̃ and the final estimator while keeping the rate of the threshold, O((log T)^{1/2}), used in Theorems 1-3 of the main article, established under iid Gaussian noise. However, our technique is distinguished from Fryzlewicz (2018) in that we add a step which ensures that only the detail coefficients d^{(j,k)}_{p,q,r} corresponding to a long enough interval [p, r] survive, while Fryzlewicz (2018) requires that both the left ([p, q]) and the right ([q + 1, r]) segments be long enough. This enables us to use the same order of threshold, O((log T)^{1/2}), as in the iid Gaussian model, without any further procedure such as the basis rearrangement proposed in Fryzlewicz (2018).
We now define the sets of short-segment and long-segment coefficients at each scale j, where a will be specified later in this section. The detail coefficients obtained from short segments are set to zero in the construction of the new estimators f̂_L, f̃_L and the final estimator, where the subscript L stands for "Long-segment". The initial estimator f̂_L is obtained from the estimator of μ^{(j,k)} for j ≥ 1 by applying the "connected" rule, modified from the original one in Section 2.3 of the main article so as to satisfy the condition that the minimum segment length is longer than a, as in (S.8), where I is an indicator function. We then apply the "two together" rule to (S.8), under which both of the paired detail coefficients (formed by Type 3 mergings) must survive if at least one survives, as in the thresholding step of the main article. Compared to the estimator μ̂^{(j,k)} obtained under the iid Gaussian setting, the only added step is setting all short-segment coefficients d^{(j,k)}_{p,q,r} to zero.

B.1 Preparatory lemmas
Lemma 3 Let the distribution of ε_t in model (1) of the main article be as in Theorem 4. Then, for a constant C₃ > 0 and λ as in Theorem 1, we have P(A^L_T) → 1, where ℓ^t_{k,t₁,t₂} is the t-th element of the vector ℓ_{k,t₁,t₂} of length t₂ − t₁ + 1, and the pairs (ℓ_{1,t₁,t₂}, ℓ_{2,t₁,t₂}) are predetermined for any (t₁, t₂) by choosing an orthonormal basis of the corresponding subspace. Proof. In the following, we consider the single sum Σ_{t=1}^a w_t ε_t over the interval [1, a], where w_t = ℓ^t_{k,1,a} for a fixed k ∈ {1, 2}. The results can in principle be applied to an interval with different endpoints, given that the length of the interval is at least a. Since ε_t is m-dependent, we have α(l) = 0 for l > m, where α(·) denotes the α-mixing coefficients of ε_t.
From Theorem 1.4 in Bosq (1998), if the relevant moment condition holds, then for each ε > 0 and a constant c > 0 we obtain the stated exponential bound. The moment assumption is reasonably achievable, as we can show that a · max_t(w_t²) is bounded by a constant, using the two conditions imposed on w_t: 1) w₁ − w₂ = · · · = w_{a−1} − w_a and 2) Σ_{t=1}^a w_t² = 1. By setting ε = λ/√a, a = C log(T) and λ = C₁ log^{1/2} T for large enough C > 0 and C₁ > 0, and setting q = [c₁ a] with a small c₁ (which gives a/(q + 1) ≥ m + 1), we have that a₁ is bounded by a constant and α([a/(q + 1)]) = 0; thus (S.10) can be bounded as stated, where C₂ > 0 is suitably large. Since there exist at most T² sub-intervals [t₁, t₂], applying a simple Bonferroni inequality we obtain the result as T → ∞, for a large enough C > 0 and a certain constant C₃ > 0.
Lemma 4 Let S¹_j and S⁰_j be as in Lemma 2. On the set A^L_T in (S.9), which satisfies P(A^L_T) → 1 as T → ∞, we have the analogous maximal bound, where λ is as in Theorem 1.
Proof. The argument follows the proof of Lemma 2.

B.2 Theoretical results of the length-lowerbounded-basis estimators
We now describe the behaviour of the initial estimatorf L that is built from the basis vectors whose non-zero elements have length larger than a.
Theorem 4 Let the distribution of ε_t in model (1) of the main article be as follows: (a) ε_t has mean zero and satisfies Cramér's condition, where c > 0.
(b) {ε_t}_t is a stationary, m-dependent sequence, i.e. σ(ε_s, s ≤ t) and σ(ε_s, s ≥ t + k) are independent for k > m.
Let f̄ = max_t f_t − min_t f_t be bounded and let the estimator f̂_L be obtained from the estimator μ̂^{(j,k)} in (S.8), with a = C log(T) and the threshold λ = C₁ log^{1/2}(T), for large enough C and C₁. Then, on the set A^L_T in (S.9), we have the stated ℓ2 bound for a constant C̃ > 0.
Term II: As there is no short-segment parent coefficient whose children are long-segment coefficients, due to the principle of bottom-up merging, the indicator function returns zero and term II simplifies accordingly. We now examine the bound on the individual μ^{(j,k)}_{p,q,r}. Note that only Type 2 and Type 3 basis vectors need to be considered, due to the minimum length constraint imposed on the set A^L_T. Borrowing the generalised form of ψ^{(j,k)}_{p,q,r} in (S.2), for a Type 3 basis vector we obtain
μ^{(j,k)}_{p,q,r} = ⟨f, ψ^{(j,k)}_{p,q,r}⟩ = γ₁ ℓ⊤_{1,p,q} f_{p:q} + γ₂ ℓ⊤_{2,p,q} f_{p:q} + γ₃ ℓ⊤_{1,q+1,r} f_{q+1:r} + γ₄ ℓ⊤_{2,q+1,r} f_{q+1:r} ≤ |γ₁| ‖f_{p:q}‖ + |γ₂| ‖f_{p:q}‖ + |γ₃| ‖f_{q+1:r}‖ + |γ₄| ‖f_{q+1:r}‖, (S.12)
where f_{p:q} is the subvector of f containing q − p + 1 elements. The inequality (S.12) is obtained from the orthonormality of ℓ_{1,p,q}, ℓ_{2,p,q}, ℓ_{1,q+1,r}, ℓ_{2,q+1,r} and the definition of the inner product a·b = ‖a‖·‖b‖·cos(θ), where θ is the angle between a and b. Note that if f_{p:q} does not contain a change-point, the corresponding cos(θ) = 0, as f_{p:q} has a perfect linear trend; when f_{p:q} includes a change-point, the size of the angle is bounded as |cos(θ)| ≤ 1. As f̄ = max_t f_t − min_t f_t is assumed to be bounded and ‖f_{p:q}‖² ≤ C[(q − p + 1)² + f̄²] regardless of whether there exists a true change-point in [p, q], we have
(μ^{(j,k)}_{p,q,r})² ≤ (γ₁ + γ₂)² · ‖f_{p:q}‖² + (γ₃ + γ₄)² · ‖f_{q+1:r}‖² + 2 · (γ₁ + γ₂) · (γ₃ + γ₄) · ‖f_{p:q}‖ · ‖f_{q+1:r}‖,
where c_i > 0 and C > 0.
Without loss of generality, we assume r − q + 1 ≤ a. Then, using a = O(log(T)) and applying the same upper bounds, J ≤ ⌈log(T)/ log((1 − ρ)⁻¹) + log(2)/ log(1 − ρ)⌉ and |S¹_j| ≤ N, used in the proof of Theorem 1 in the main article, we obtain the bound on term II. Term III: Denote by B the event that there exists (j′, k′) ∈ C_{j,k} with |d^{(j′,k′)}| > λ and k′ ∈ W^L_{j′}(a); on the set A^L_T in (S.9) we can bound its contribution. Following the same argument used in the proof of Theorem 1 in the main article, we bound term III. To complete the proof, considering all terms in (S.11), we finally obtain (S.14), where C̃ > 0. Comparing it with Theorem 1 of the main article, which is presented under the iid Gaussian noise assumption, the ℓ2 rate in (S.14) differs by only a logarithmic factor.
Theorem 5 Let X_t follow model (1) with the noise as in Theorem 4. Proof. The proof proceeds in the same way as the proof of Theorem 2 of the main article.
Theorem 6 Let X_t follow model (1) with the noise as in Theorem 4. Proof. The proof proceeds in the same way as the proof of Theorem 3 of the main article.
C. Threshold selection and additional simulation results

C.1 Simulation results for non-Gaussian and/or dependent noise

In addition to the simulations in Section 4 of the main article, here we present the results for the cases when ε_t is possibly dependent and/or non-Gaussian. Including the standard Gaussian noise, we consider the following six scenarios for ε_t: (i) standard Gaussian, (ii) iid t₅ distribution with unit variance, (iii) a stationary Gaussian AR(1) process with φ = 0.3, zero mean and unit variance, (iv) the same setting as (iii) except φ = 0.6, (v) a stationary AR(1) process with φ = 0.3 and t₅ innovations, (vi) the same setting as (v) except φ = 0.6.
In summary, (ii) is iid but heavy-tailed, (iii) and (iv) are Gaussian AR(1) errors with relatively mild and strong dependence, respectively, and (v) and (vi) are both heavy-tailed, with different strengths of dependence; a summary of the simulation results can be found in Tables C.1-C.10.
Following the theoretical results presented in Sections B.1-B.2, we need to set the minimum segment length to be of order log(T). As in the main paper, we set ⌊0.9 log(T)⌋ as the default minimum segment length. We follow Algorithm 1 introduced in the main paper and use λ_Robust as the default threshold, as it is designed to work well in all circumstances.
The simulation results under this robust threshold selection are presented in Tables C.1-C.10; TrendSegment generally outperforms its competitors over all noise scenarios and almost all simulation models considered in this paper. Among the competitors, only ID provides an option for heavy-tailed noise in its R package IDetect; the other methods are run at their default settings. Table C.1: Distribution of N̂ − N for models (M1)-(M4) and all methods, with the noise term ε_t ~ iid t₅, over 100 simulation runs. Also shown are the average MSE (Mean Squared Error) of the estimated signal f̂_t, the average Hausdorff distance d_H and the average computational time in seconds using an Intel Core i5 2.9 GHz CPU with 8 GB of RAM, all over 100 simulations. Bold: methods within 10% of the highest empirical frequency of N̂ − N = 0 or within 10% of the lowest empirical average d_H (×10²). Note that TrendSegment is shortened to TS.

C.2 Choice of the threshold
In this section, we describe the details of how the thresholds λ_Naïve and λ_Robust are built under the different noise scenarios introduced in Section C.1.
The naive threshold selection. We first explain how the best performing threshold constant C in the naïve threshold is chosen. To cover all the noise scenario settings, including dependent and/or non-Gaussian noise, here we use a simpler version of the naïve threshold. Similarly, we can define C^η_med and C^η_max by replacing the minimum condition with the median and maximum, respectively. For evaluating the accuracy of the change-point locations, we define C^{d_H}_{max,j} as the maximum of those constants C that give the minimum value of the average Hausdorff distance for the j-th model, computed from 100 simulation runs. Note that, in contrast to (S.17), where the maximal number of cases {N̂ − N = 0} is considered, the minimum value of the Hausdorff distance is used in (S.18), as the smaller the Hausdorff distance, the better the estimation of the change-point locations. Similarly to (S.17), the maximum condition is active only when there is more than one constant giving the same minimum average Hausdorff distance, and C^{d_H}_med and C^{d_H}_min can be defined by replacing the maximum condition with the median and minimum, respectively.
The best performing constants over all noise scenarios are reported in Table C.11. Compared to the iid Gaussian noise, a larger threshold constant tends to be chosen when the noise is heavy-tailed and/or dependent. Also, compared to the other noise scenarios, when the noise is dependent but generated with Gaussian innovations ((iii) and (iv)), the best performing constant has a narrower range between C_max and C_min. The naïve threshold, λ_Naïve, is an essential element in building the robust threshold, λ_Robust, as shown in Step 5 of Algorithm 1 in the main paper. For this, we use the default constants C^η_med in Table C.11; there is not much difference in the simulation performance presented in Tables C.1-C.10 when C^η_max is used instead.
The robust threshold selection. We now describe the details of how λ_Robust is built. We first justify using the ratio λ_Naïve / (1.3 σ̂ √(2 log T)) (S.19) in the process of building the kurtosis function g in Step 5 of Algorithm 1 in the main paper.
We first recall that the ratio in (S.19) corresponds to the kurtosis function g(K̂) in the following robust threshold: (S.20) Figure C.1 shows that g(K̂) behaves like a constant over a range of K̂ under all models and noise scenarios we considered. This is due to the condition on the minimum segment length imposed for stable and good performance. Under this condition, we found that the constant-like behaviour is also observed when the noise has a relatively extreme heavy tail, e.g. t_{2.1}; however, we do not include such an extreme case in estimating the non-parametric function g.
We are now ready to estimate the function g(·). To avoid the non-parametric fit being dominated by extremely large values of K̂, we split the range of K̂ into two at the 99% quantile of κ̂, as shown in Figure C.1. We then estimate a non-parametric regression fit on each interval, g_1(K̂) and g_2(K̂) respectively, and use these functions for computing the robust threshold.

In this section, we demonstrate that our TrendSegment algorithm performs well on a real-world dataset that possibly has non-negligible autocorrelation. The London air quality data was recently studied by Cho and Fryzlewicz (2020) in the context of detecting multiple changes in the mean of a possibly autocorrelated time series. Using the same data but in a different context, we now detect changes in linear trend. We use the daily average concentrations of nitrogen dioxide (NO2) measured from September 1, 2000 to September 30, 2020 at Marylebone Road in London, United Kingdom, which gives T = 7139 time points. The data is downloaded from https://github.com/haeran-cho/wem.gsc, where the original data can be obtained from Defra (https://uk-air.defra.gov.uk/). We follow the pre-processing steps used in Cho and Fryzlewicz (2020) by taking the square-root transform and by removing weekly, seasonal and bank holiday effects.
Considering that the data possibly exhibits serial dependence and/or heavy-tailedness, we use the robust threshold (λ_Robust) introduced in Section 4.1.5 of the main paper. The top plot in Figure D.6 shows the change-points detected using the robust threshold selection. From the two bottom plots, we see that the persistent autocorrelations are no longer visible after removing the linear trends, although some autocorrelation still remains.

E. Shape of the unbalanced wavelet basis
We now explore the shape of the adaptively constructed unbalanced wavelet basis. First, we note that ψ^(j,k) is sometimes referred to as ψ^(j,k)_{p,q,r}. One of the important properties of the unbalanced wavelet basis is that ψ^(j,k)_{p,q,r} always has the shape of a linear trend in regions that were merged previously, and this linearity is also preserved in future merges, as long as later transforms are performed under the "two together" rule. For example, the two vectors (ψ^(0,1), ψ^(0,2)) corresponding to the two smooth coefficients s^[1]_{1,T} and s^[2]_{1,T} have linear trends over the region [1, T], as they form an orthonormal basis of the subspace {(x_1, x_2, . . . , x_T) | x_1 − x_2 = x_2 − x_3 = · · · = x_{T−1} − x_T} of R^T. This is because each local orthonormal transform extends the dimension of the subspace in which the orthonormal basis lives.
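The claim that (ψ^(0,1), ψ^(0,2)) span the subspace of exactly linear sequences can be checked numerically. The Gram–Schmidt construction below is only one admissible choice of basis (the low filters are not unique); T = 8 and the example slope and intercept are arbitrary.

```python
import numpy as np

T = 8
wc = np.ones(T)                          # weight vector of constancy
wl = np.arange(1, T + 1, dtype=float)    # weight vector of linearity

# Gram-Schmidt: one admissible orthonormal basis of the 'linear' subspace
psi1 = wc / np.linalg.norm(wc)
v = wl - (wl @ psi1) * psi1
psi2 = v / np.linalg.norm(v)

assert abs(psi1 @ psi2) < 1e-12 and abs(psi2 @ psi2 - 1) < 1e-12

# any exactly linear sequence is reproduced by projecting onto (psi1, psi2)
x = 3.0 + 0.5 * np.arange(1, T + 1)      # f_t = 3 + 0.5 t
proj = (x @ psi1) * psi1 + (x @ psi2) * psi2
assert np.allclose(proj, x)
```

The subspace {x | x_1 − x_2 = · · · = x_{T−1} − x_T} is two-dimensional, so the two vectors above are a complete orthonormal basis for it.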
Through an illustrative example, we now show how a basis vector ψ^(j,k)_{p,q,r} keeps its linearity on subregions already merged at previous scales; this also gives a geometric interpretation of the TGUW transformation. Suppose that the initial data sequence is s_0 = (X_1, . . . , X_5) and the initial weight vectors of constancy and linearity are w^c_0 = (1, 1, 1, 1, 1)⊤ and w^l_0 = (1, 2, 3, 4, 5)⊤, respectively. As the data sequence has length 5, the complete TGUW transform consists of 3 orthonormal transformations, and the key task in each transform is finding an appropriate orthonormal matrix.
First merge. Suppose that (X_3, X_4, X_5) are selected to be merged first. The corresponding local orthonormal transform is

Λ (X_3, X_4, X_5)⊤ = (s^[1]_{3,4,5}, s^[2]_{3,4,5}, d_{3,4,5})⊤, where Λ = (ℓ_1, ℓ_2, h)⊤ and h = (1, −2, 1)⊤/√6, (S.22)
where the constants (e^c_1, e^c_2) and (e^l_1, e^l_2) are obtained from Λ w^c_{0,3:5} = (e^c_1, e^c_2, 0)⊤ and Λ w^l_{0,3:5} = (e^l_1, e^l_2, 0)⊤, respectively. As ℓ_1 and ℓ_2 form an orthonormal basis of the plane {(x, y, z) | x − 2y + z = 0}, e^c_1, e^c_2 and e^l_1, e^l_2 are the unique constants that represent w^c_{0,3:5} and w^l_{0,3:5} as a linear span of the basis vectors ℓ_1 and ℓ_2 as follows: w^c_{0,3:5} = e^c_1 ℓ_1 + e^c_2 ℓ_2, w^l_{0,3:5} = e^l_1 ℓ_1 + e^l_2 ℓ_2. (S.23) Importantly, the orthonormal transform matrix Ψ_{T×T} introduced in (5) (i.e. an orthonormal basis of R^5 in this example) is constructed by recursively updating its initial input Ψ_0 = I_{5×5} through local orthonormal transforms. For example, if the (p, q, r)th elements of s are selected to be merged, then we extract the corresponding (p, q, r)th columns of Ψ⊤ and update them through matrix multiplication with the Λ used in that merge. Therefore, the first orthonormal transform performed in (S.22) updates the initial matrix Ψ⊤_0 by applying Λ to the (3, 4, 5)th columns of Ψ⊤_0. The 5th column of Ψ⊤ is now fixed (it will not be updated again) as it corresponds to a detail coefficient, but the other four columns, corresponding to the smooth coefficients, are updated as the merging continues. Second merge. Suppose that (X_2, s^[1]_{3,4,5}, s^[2]_{3,4,5}) are selected to be merged next under the "two together" rule. Once the merging continues until the number of merged data points reaches the length of the data (T), the first two columns of the finalised Ψ⊤ build the two smooth coefficients (s^[1]_{1,T}, s^[2]_{1,T}) and always keep a linear trend of length T, while the shape of the other columns of Ψ⊤, corresponding to the detail coefficients, depends on the type of merge and follows one of the forms in (S.2).
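The column-updating step for the first merge can be sketched as below. The detail filter h = (1, −2, 1)⊤/√6 annihilates both weight vectors on (3, 4, 5); the two low filters ℓ_1, ℓ_2 are non-unique, and the QR-based completion used here is only one admissible choice, not the rule used in the package.

```python
import numpy as np

h = np.array([1.0, -2.0, 1.0]) / np.sqrt(6)   # detail filter for (X3, X4, X5)

# complete h to an orthonormal 3x3 matrix; QR gives one admissible
# pair of low filters l1, l2 (non-unique), each orthogonal to h
M = np.column_stack([h, np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])])
Q, _ = np.linalg.qr(M)
Lam = np.vstack([Q[:, 1], Q[:, 2], h])        # Lambda = (l1, l2, h)^T

Psi_T = np.eye(5)                              # Psi_0^T = I_5
Psi_T[:, 2:5] = Psi_T[:, 2:5] @ Lam.T          # update the (3,4,5)th columns

# Psi^T stays orthonormal after the local transform
assert np.allclose(Psi_T.T @ Psi_T, np.eye(5))
# the low-filter columns keep a linear trend on the merged region {3,4,5}:
# they lie in the plane x - 2y + z = 0, i.e. equal successive differences
for col in (Psi_T[2:5, 2], Psi_T[2:5, 3]):
    assert np.isclose(col[0] - col[1], col[1] - col[2])
```

Any other orthonormal completion of h would pass the same checks, which is the non-uniqueness of the low filters discussed next.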
As shown above, the non-uniqueness of the low filters has no effect on preserving the linearity of the subregions that are already merged. In our simulation studies, we found empirically that the choice of low filters has no qualitative effect on the results as long as they satisfy the orthonormality condition of the transform. We therefore use a fixed rule for choosing the low filters, rather than drawing an arbitrary orthonormality-satisfying set at each run, which also saves computational cost.

F. A practical way to implement the TGUW transformation
In this section, we explore a practical way of implementing the TGUW transform. As briefly mentioned in Section 2.2.3, it is implemented by consecutively updating the so-called weight vectors of constancy and linearity. These two weight vectors are initially used in the first stage of the TGUW transform to obtain the detail filter h, and are then updated through the orthonormal transforms. In detail, Steps 1 and 5 of the TGUW algorithm presented in Section 2.2.3 can be reformulated in terms of the weight vectors as follows.
Step 1. At each scale j, find the set of triplets that are candidates for merging under the "two together" rule and compute the corresponding detail coefficients. Regardless of the type of merge, a detail coefficient d^·_{p,q,r} is, in general, obtained as d^·_{p,q,r} = a s^1_{p:r} + b s^2_{p:r} + c s^3_{p:r}, (S.32) where p ≤ q < r, s^k_{p:r} is the kth smooth coefficient of the subvector s_{p:r} of length r − p + 1, and the constants a, b, c are the elements of the detail filter h = (a, b, c)⊤. Specifically, the detail filter h is established by solving the following equations, a w^{c,1}_{p:r} + b w^{c,2}_{p:r} + c w^{c,3}_{p:r} = 0, a w^{l,1}_{p:r} + b w^{l,2}_{p:r} + c w^{l,3}_{p:r} = 0, a^2 + b^2 + c^2 = 1, (S.33) where w^{·,k}_{p:r} is the kth non-zero element of the subvector w^·_{p:r} of length r − p + 1, and w^c and w^l are the weight vectors of constancy and linearity, respectively.

(Figure: a Type 1 merging illustrated on the sequence X_1, . . . , X_19.)
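The system (S.33) asks for a unit vector h orthogonal to the two (restricted) weight vectors; in R^3 such an h is, up to sign, their normalised cross product. A minimal sketch, using the weights of the first merge in the worked example above (constancy (1, 1, 1), linearity (3, 4, 5)); the function name is ours, not the package's.

```python
import numpy as np

def detail_filter(wc, wl):
    """Solve (S.33): unit h = (a, b, c) with h . wc = h . wl = 0.

    For non-collinear 3-vectors wc, wl, h is (up to sign) the
    normalised cross product of the two weight vectors.
    """
    h = np.cross(wc, wl)
    return h / np.linalg.norm(h)

# weights for merging (X3, X4, X5): constancy (1,1,1), linearity (3,4,5)
h = detail_filter(np.array([1.0, 1.0, 1.0]), np.array([3.0, 4.0, 5.0]))
print(np.round(h * np.sqrt(6)))   # -> [ 1. -2.  1.]  i.e. h = (1, -2, 1)/sqrt(6)
```

The resulting filter annihilates both constant and linear trends on the merged region, so the detail coefficient is exactly zero when no change-point falls inside it.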