1 Introduction

Change-point detection algorithms have been actively developed and investigated by the scientific community. Their ability to segment data into smaller, homogeneous parts has helped researchers and practitioners develop flexible statistical models that can adapt to non-stationary environments. Due to the natural data heterogeneity in many real problems, such algorithms have been applied in a wide range of application areas, such as bioinformatics (Picard et al. 2011; Hocking et al. 2013), cyber security (Siris and Papagalou 2004), and finance (Lavielle and Teyssiere 2007; Schröder and Fryzlewicz 2013). The advantages of detecting changes in the behaviour of data fall into two main categories: interpretation and forecasting. Interpretation comes naturally since the detected changes are usually connected with real-life events that took place near the estimated time points. Associating the changes with such phenomena can lead to a better understanding and quantification of the effect that these events had on the behaviour of the stochastic process. With respect to forecasting, the role of the final segment is important because it allows for a more accurate prediction of the future values of the data sequence at hand.

Based on whether we have full knowledge of the data to be analysed, change-point detection is split into two main categories: offline detection, where all the data have already been collected, and online detection, in which observations arrive sequentially. With respect to the dimensionality of the data, change-point detection methods can be further separated into algorithms that act only on univariate data and those that are suitable for multivariate data sequences. In this paper, we focus on multivariate, possibly high-dimensional, offline settings; our aim is to estimate the number and locations of certain structural changes in the behaviour of given multivariate data. The model is

$$\begin{aligned} \varvec{X_{t}} = \varvec{f_{t}} + \varvec{\epsilon _{t}}, \quad t=1, \ldots , T, \end{aligned}$$
(1)

where \(\varvec{X_{t}} \in \mathbb {R}^{d\times 1}\) are the observed data and \(\varvec{f_{t}} \in \mathbb {R}^{d \times 1}\) is the d-dimensional deterministic signal with structural changes at certain points. The signals treated in the current manuscript are those in which changes appear in the mean structure or in the vector of the first-order derivatives. The noise terms \(\varvec{\epsilon _{t}} \in \mathbb {R}^{d \times 1}\) are random vectors with zero mean vector and covariance matrix \(\Sigma \), which is positive definite and not necessarily diagonal. The true number, N, of change-points, as well as their locations \(r_1, \ldots , r_N\), are unknown and our aim is to estimate them; N is free to grow with the sample size, T, and the dimensionality, d.

The initial purpose of change-point detection algorithms was to detect a single change in the mean structure of a univariate signal under Gaussian noise, but much progress has since been made. Researchers have heavily focused on the detection of multiple change-points in the mean structure of a univariate data sequence. Towards this purpose, optimization-based methods have been developed, in which the estimated signal is chosen based on its fit to the data, penalized by a complexity rule. To solve the implied penalization problem, dynamic programming approaches, such as the Segment Neighborhood (SN) and Optimal Partitioning (OP) methods of Auger and Lawrence (1989) and Jackson et al. (2005), have been developed. Killick et al. (2012) and Rigaill (2015) introduce improvements over the classical OP and SN algorithms, respectively. In the context of regression problems, Frick et al. (2014) introduced the simultaneous multiscale change-point estimator (SMUCE) for change-point detection in exponential family regression. Apart from optimization-based algorithms, a popular method in the literature is binary segmentation, where changes are detected one at a time through an iterative binary splitting of the data. Recent variants of binary segmentation with improved performance are the Wild Binary Segmentation (WBS) of Fryzlewicz (2014) and its recently developed second version of Fryzlewicz (2020), the Narrowest-Over-Threshold (NOT) method of Baranowski et al. (2019), and the Seeded Binary Segmentation of Kovács et al. (2020). The Isolate–Detect (ID) algorithm was developed in Anastasiou and Fryzlewicz (2022) to detect, one by one, structural changes in a data sequence. It is based on an isolation technique, which leads to very good accuracy in the estimated number and locations of the change-points, particularly in scenarios with many frequent change-points. Our proposed method is partly based on ID; therefore, we elaborate on its important parts later. For a more thorough review of the literature on the detection of multiple change-points in the mean of univariate data sequences, see Cho and Kirch (2020) and Yu (2020). Apart from changes in the mean of a univariate data sequence, research has also been done for the detection of change-points under more complex scenarios, such as changes in the slope for piecewise-linear models (Anastasiou and Fryzlewicz 2022; Baranowski et al. 2019; Fearnhead et al. 2019; Maeng and Fryzlewicz 2019; Tibshirani 2014), changes in the variance (Inclán and Tiao 1994), as well as distributional changes under a non-parametric setting (Matteson and James 2014; Zou et al. 2014; Arlot et al. 2019).

Even though there is an extensive literature on change-point detection for univariate data sequences, the multivariate, possibly high-dimensional, setting, which is the focus of this paper, has not been investigated to the same degree. Working under the model in (1), Vert and Bleakley (2010) proposes a method for approximating the signal \(\varvec{f_{t}}\) as the solution of a convex optimization problem. In order to achieve this, the problem is first reformulated as a group LASSO problem and then the group least-angle regression (LARS) procedure explained in Yuan and Lin (2006) is employed. Another interesting approach with very good behaviour has been introduced in Wang and Samworth (2018). The algorithm, called inspect, estimates the number and locations of the change-points in the mean structure of \(\varvec{f_t}\) as in (1). Firstly, inspect applies a cumulative sum (CUSUM) transformation to the original data matrix. Secondly, a projection direction of the transformed matrix is computed as its leading left singular vector, and, finally, a univariate change-point detection algorithm is applied to the projected series. It is among the many methods that employ CUSUM-type statistics for change-point detection in the multivariate setting. In general, methods that belong to this category mainly use CUSUM aggregations of the d component data sequences either to test the obtained values against a threshold or to construct alternative test statistics. For instance, Groen et al. (2013) examines the asymptotic behaviour of the maximum absolute and average CUSUM and gives finite-sample performance results. Focusing on testing for the existence of a change-point in the mean structure of the multivariate signal, Enikeeva and Harchaoui (2019) and Horváth and Hušková (2012) propose tests based on the \(\ell _2\) aggregation of the CUSUM statistics for each univariate component, while Jirak (2015) employs the \(\ell _\infty \) aggregation of the aforementioned values. In Cho (2016), the Double CUSUM (DC) operator is introduced, which takes as input the ordered absolute CUSUM values of each individual component and performs a weighted \(\ell _1\) aggregation to construct the DC statistic, which is then compared against a test criterion. Departing from the detection of changes in the mean structure of a multivariate signal, Cho and Fryzlewicz (2015) propose the Sparsified Binary Segmentation algorithm (SBS) for the detection of multiple change-points in the second-order structure of a multivariate data sequence. SBS is based on a first, “sparsifying” step which is used to exclude individual component data sequences from an \(\ell _1\) aggregation; a pre-defined threshold is used for the exclusion. The recent work of Anastasiou et al. (2022) introduces Cross-covariance isolate detect (CCID), which, motivated by the need to estimate changes in time-varying functional connectivity networks, detects multiple change-points in the second-order (cross-covariance or network) structure of multivariate, possibly high-dimensional time series. Ombao et al. (2005) investigates the application of smooth localized complex exponentials (SLEX) waveforms to the detection of changes in spectral characteristics of EEG data. Lavielle and Teyssiere (2006) detects changes in the covariance structure of i.i.d. multivariate time series based on the minimization of a penalized Gaussian log likelihood, while Bücher et al.
(2014) uses a test statistic based on sequential empirical copula processes to detect changes in the cross-covariance structure. For a survey of various offline change-point detection algorithms on multivariate time series see Truong et al. (2020).

In this paper, we propose a method called Multivariate Isolate–Detect (MID) for the consistent estimation of multiple change-points under the multivariate, possibly high-dimensional, structure of the model in (1). Our method builds on the foundations of the ID algorithm developed in Anastasiou and Fryzlewicz (2022); that is, we first isolate each true change-point within subintervals of the domain \([1,\ldots ,T]\) and then we proceed to detect them. Isolation enhances detection power, especially in frameworks with frequently occurring change-points. MID is explained in detail in Sect. 2; here, we only give a brief description of its important steps. The main idea is that, for the observed data sequences \(x_{t,j},\; t=1,\ldots , T,\; j=1,\ldots ,d,\) and with \(\lambda _{T}\) a positive constant playing the role of an expansion step as in Anastasiou and Fryzlewicz (2022), our method first creates two ordered sets of \(K=\lceil T/\lambda _{T}\rceil \) right- and left-expanding intervals. For \(i =1,\ldots , K\), the \(i\)th right-expanding interval is \(R_i=[1,\min \left\{ i\lambda _T, T\right\} ]\), while the \(i\)th left-expanding interval is \(L_{i}=[\max \left\{ 1, T-i\lambda _{T}+1\right\} ,T].\) We collect these intervals in the ordered set \(S_{RL}=\{R_1,L_1,R_2,L_2,\ldots ,R_K,L_K\}\). The algorithm first acts on the interval \(R_1 = [1, \lambda _T]\) by calculating, for every univariate component data sequence, the contrast function values for the Q possible candidates in this interval (details are given in Sect. 2). This process returns Q vectors \(\varvec{y_j}, j=1,\ldots ,Q\), of length d each; for example, the elements of \(\varvec{y_1} \in \mathbb {R}^{d}\) are the contrast function values related to the first change-point candidate in \(R_1\), for each of the d component data sequences; the elements of \(\varvec{y_2} \in \mathbb {R}^{d}\) are the relevant values for the second candidate in \(R_1\), and so on. The next step is to apply to each \(\varvec{y_j}\) a mean-dominant norm \(L: \mathbb {R}^d \rightarrow \mathbb {R}\). To be more precise, since the contrast function values within each \(\varvec{y_j}\) are all non-negative, in our case \(L: (\mathbb {R}^d)^{+} \rightarrow \mathbb {R}\). The definition of mean-dominant norms can be found in Sect. 2 of Carlstein (1988) and examples include

$$\begin{aligned}&L_2 := L_2(\varvec{y_j}) = \frac{1}{\sqrt{d}}\sqrt{\sum _{i=1}^d y_{j,i}^2}\nonumber \\&L_{\infty } := L_{\infty }(\varvec{y_j}) = \sup _{i=1,\ldots ,d}\left\{ y_{j,i}\right\} . \end{aligned}$$
(2)

Applying \(L(\cdot )\) to each \(\varvec{y_j}\) returns a vector \({\varvec{v}}\) of length Q. We identify \(\tilde{b}_{R_1} := {\textrm{argmax}}_j\left\{ v_j \right\} \). If \(v_{\tilde{b}_{R_1}}\) exceeds a certain threshold, denoted by \(\zeta _{T,d}\), then \(\tilde{b}_{R_1}\) is taken as a change-point. If not, the process tests the next interval in \(S_{RL}\). Upon detection, the algorithm makes a new start from the end-point (respectively, start-point) of the right- (respectively, left-) expanding interval where the detection occurred. Upon correct choice of \(\zeta _{T,d}\), MID ensures that we work on intervals with at most one change-point.
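To make the interval construction concrete, the following minimal sketch in Python (illustrative only; it is not part of any accompanying software) builds the ordered set \(S_{RL}\) of right- and left-expanding intervals for given T and \(\lambda_T\).

```python
import math

def expanding_intervals(T, lambda_T):
    """Build the ordered set S_RL = {R_1, L_1, ..., R_K, L_K} of right- and
    left-expanding intervals, returned as 1-based (start, end) pairs."""
    K = math.ceil(T / lambda_T)
    S_RL = []
    for i in range(1, K + 1):
        R_i = (1, min(i * lambda_T, T))          # i-th right-expanding interval
        L_i = (max(1, T - i * lambda_T + 1), T)  # i-th left-expanding interval
        S_RL.extend([R_i, L_i])
    return S_RL

# Example: T = 200, lambda_T = 10 gives K = 20 intervals per direction.
print(expanding_intervals(200, 10)[:4])  # [(1, 10), (191, 200), (1, 20), (181, 200)]
```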

The rest of the paper is structured as follows. Section 2 describes our proposed methodology and its associated theory. The computational complexity of MID, useful variants of our algorithm, and the choice of important parameter values are explained in Sect. 3. In Sect. 4, we perform a thorough simulation study to compare our algorithm with state-of-the-art methods. Furthermore, we show and discuss the practical performance of MID in the cases of spatially dependent data (spatial independence does not need to be assumed for the theoretical results related to the consistency of the proposed method) and when the normality assumption for the error terms is violated. Section 5 illustrates the behaviour of MID on two examples of real data: the monthly percentage changes in the UK house price index over a period of twenty-three years in twenty London Boroughs, and the daily number of new COVID-19 cases in the four constituent countries of the United Kingdom: England, Northern Ireland, Scotland, and Wales. In Sect. 6, we first discuss how we can treat cases where the data exhibit temporal dependence, and then we conclude the paper with general remarks and reflections on our proposed algorithm. The proof of Theorem 1 is given in the Appendix.

Fig. 1 An example of a three-dimensional data sequence that undergoes three changes in its mean structure at locations \(r_1 = 27, r_2 = 73\) and \(r_3 = 165\)

Fig. 2 An example with three change-points in the mean structure; \(r_1=27, r_2 = 73\) and \(r_3 = 165\). The dashed line is the interval in which the detection took place in each phase

2 Methodology and theory

2.1 Methodology

We work under the model given in (1). The objective is to estimate both the number, N, and the locations \(r_1,\ldots , r_N\) where the multivariate deterministic signal \(\varvec{f_{t}}\) exhibits structural changes. We note that N can possibly grow with the sample size T and with the dimensionality d. In addition, a change-point does not necessarily appear in all univariate component signals; any level of sparsity is allowed. Before providing a full, step-by-step explanation of our algorithm, two simple examples are given to aid understanding. In Fig. 1, we have a three-dimensional data sequence of length \(T = 200\) with three change-points in the mean vector at locations \(r_1=27, r_2=73\) and \(r_3=165\). To be more precise,

$$\begin{aligned}&f_{t,1} = {\left\{ \begin{array}{ll} 0, &{} t=1,\ldots ,27\\ 6, &{} t=28,\ldots ,165\\ 0, &{} t=166,\ldots ,200 \end{array}\right. },\nonumber \\&f_{t,2} = {\left\{ \begin{array}{ll} 0, &{} t=1,\ldots ,73\\ -6, &{} t=74,\ldots ,165\\ 0, &{} t=166,\ldots ,200 \end{array}\right. },\nonumber \\&f_{t,3} = 0, t = 1, \ldots , 200. \end{aligned}$$
(3)

From now on, \({\mathcal {N}}_d(\varvec{\mu }, \Sigma )\) denotes the \(d-\)variate normal distribution with mean vector \(\varvec{\mu }\in \mathbb {R}^{d\times 1}\) and covariance matrix \(\Sigma \in \mathbb {R}^{d \times d}\). For the example, we take \(\varvec{\epsilon _t} \sim {\mathcal {N}}_3({\varvec{0}}, \Sigma _1)\), where

$$\begin{aligned} \Sigma _1 = \left( \begin{matrix} 9&{}0&{}0\\ 0&{}1&{}0\\ 0&{}0&{}4 \end{matrix}\right) . \end{aligned}$$

The component data sequences \(X_{t,1}\) and \(X_{t,2}\) share a common change-point at \(t=165\), while they also have their own change-points at \(t=27\) and \(t=73\), respectively. There are no change-points in \(X_{t,3}\).

Let us denote \(r_0=0\) and \(r_{N+1}=T\). In this toy example, we take the expansion parameter \(\lambda _T=10\), while \(\zeta _{T,d}\) is a well-chosen predefined threshold. The choice of the aforementioned parameters is discussed in detail in Sect. 3.3; special attention will be given to the dimensionality of the data sequence in order to make a robust threshold choice. Figure 2 shows the steps of MID until all change-points are detected. We will be referring to Phases 1, 2 and 3, involving five, six, and five intervals, respectively; these are clearly indicated in Fig. 2. These phases are only related to this specific example of three change-points; in cases with a different number of change-points we would have a different number of such phases. At the beginning of the detection process, we have that \(s = 1\) and \(e = T = 200\). As already mentioned in Sect. 1, the proposed algorithm acts both sequentially and interchangeably on subintervals of the full domain; this being \([1,\ldots ,200]\) for this example. For a well-chosen threshold \(\zeta _{T,d}\), \(r_1=27\) is the first change-point to be detected; this occurs in the interval [1, 30], as shown in Phase 1 of Fig. 2. We now briefly explain how the detection occurred for \(r_1\). Let A be the \(30 \times 3\) matrix, with each column being the first 30 observations (since we are working in the interval [1, 30]) of each of the three univariate data sequences. The next step is to compute the contrast function values (in this specific case of piecewise-constant signals, the function is the absolute value of the widely used CUSUM statistic as given in (5)) for each candidate point and for all three component data sequences. We end up with a matrix \(B \in \mathbb {R}^{29 \times 3}\), with \(B_{i,j}\) being the value of the contrast function for the \(i\)th data point of the \(j\)th data sequence when we work in the interval [1, 30]; the last point of the interval is not among the change-point candidates. Applying a mean-dominant norm to each row of B gives us a vector of length 29. Figure 3 (Detection 1) illustrates these values when we employ the \(L_2\) and the \(L_\infty \) mean-dominant norms as defined in (2). In Fig. 3, we see that, for both employed norms, \(t=27\) has the highest value, which exceeds the predefined threshold value obtained as in Sect. 3.3. Therefore, \(\hat{r}_1 = 27\) is assigned as the estimated location for \(r_1\). After the detection, s is updated as the end-point of the (right-expanding) interval where the detection occurred; therefore \(s = 30\), and MID is, in Phase 2, applied to the interval [30, 200]. Then, \(r_3=165\) is detected at the sixth step of Phase 2 in the interval [161, 200]. After this second detection, MID proceeds to Phase 3, where it is applied to the interval [30, 161] and \(r_2\) is isolated (for the first time) and detected in the interval [30, 80], as shown in Figs. 2 and 3. In the end, MID is applied to the interval [80, 161], where no expanding interval contains a point with an aggregated CUSUM value that surpasses the threshold \(\zeta _{T,d}\); therefore, the process terminates after scanning all the data.
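For readers who prefer code to prose, the sketch below (Python; a self-contained illustration of this walkthrough under the stated assumptions, not the authors' implementation) simulates the example in (3) with covariance \(\Sigma_1\), computes the absolute CUSUM values of (5) for each component on the interval [1, 30], and aggregates them with the \(L_2\) and \(L_\infty\) norms of (2).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 3

# Signal from (3): change-points at r1 = 27, r2 = 73 and r3 = 165.
f = np.zeros((T, d))
f[27:165, 0] = 6     # rows t = 28, ..., 165 (0-based slicing)
f[73:165, 1] = -6    # rows t = 74, ..., 165
eps = rng.multivariate_normal(np.zeros(d), np.diag([9.0, 1.0, 4.0]), size=T)
X = f + eps

def cusum(x, s, e):
    """Absolute CUSUM values of (5) for the 1-based interval [s, e];
    one value per candidate b = s, ..., e - 1."""
    y = x[s - 1:e]
    n = e - s + 1
    b = np.arange(1, n)                  # number of points in the left part
    left = np.cumsum(y)[:-1]             # sums over [s, b]
    right = y.sum() - left               # sums over [b + 1, e]
    stat = np.sqrt((n - b) / (n * b)) * left - np.sqrt(b / (n * (n - b))) * right
    return np.abs(stat)

# Matrix B of contrast values on [1, 30]: 29 candidates x 3 components.
B = np.column_stack([cusum(X[:, j], 1, 30) for j in range(d)])
v_L2 = np.sqrt(np.mean(B ** 2, axis=1))  # L_2 aggregation from (2)
v_Linf = np.max(B, axis=1)               # L_infinity aggregation from (2)
# Candidate with the largest aggregated value; expected at (or very near) t = 27.
print(1 + int(np.argmax(v_L2)), 1 + int(np.argmax(v_Linf)))
```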

In Fig. 4, we provide an example of a three-dimensional data sequence of length \(T = 200\) with three change-points in the slope at locations \(r_1=53, r_2=100\) and \(r_3=124\). To be more precise,

$$\begin{aligned}&f_{t,1} = {\left\{ \begin{array}{ll} -t+1, &{} t=1,\ldots ,53\\ 2t-158, &{} t=54,\ldots ,124\\ -t+214, &{} t=125,\ldots ,200 \end{array}\right. }, \\&f_{t,2} = {\left\{ \begin{array}{ll} -t+1, &{} t=1,\ldots ,100\\ 2t-299, &{} t=101,\ldots ,124\\ -t+73, &{} t=125,\ldots ,200 \end{array}\right. },\\&f_{t,3} = t, t = 1, \ldots , 200. \end{aligned}$$

We take \(\varvec{\epsilon _t} \sim {\mathcal {N}}_3({\varvec{0}}, \Sigma _2)\), with \(\Sigma _2 = 49 I_{3\times 3}\), where for \(k \in \mathbb {Z}^{+}\), the matrix \(I_{k\times k}\) is the \(k \times k\) identity matrix. The first two component data sequences have two change-points each; for \(X_{t,1}\) at locations \(t=53\) and \(t=124\), while for \(X_{t,2}\) at locations \(t = 100\) and \(t=124\). There are no change-points in \(X_{t,3}\).

Fig. 3 CUSUM values in the relevant detection intervals explained in Fig. 2. The vertical dashed line indicates the time point with the highest value in the corresponding interval, while the horizontal dashed line is the optimal threshold value. The left column shows the results when the \(L_2\) aggregation method was used, while for the right column we employed the \(L_{\infty }\)-based aggregation approach

Fig. 4 An example of a three-dimensional data sequence with piecewise-linear structure, which undergoes three changes in its first derivative at locations \(r_1 = 53, r_2 = 100\) and \(r_3 = 124\)

Having given two examples of structures to which MID can be applied, our method can now be described in a general framework. Our proposed algorithm is based on the same isolation technique as the univariate change-point detection method ID and, therefore, extensive details of how this isolation is achieved are omitted; they can be found in Section 3.1 of Anastasiou and Fryzlewicz (2022). For a better understanding of MID, we provide its step-by-step outline through pseudocode, followed by a succinct narrative of the steps. For \(K=\lceil T/\lambda _T\rceil \), let \(c_{j}^r = \min \left\{ j\lambda _T, T\right\} \) and \(c_{j}^l = \max \left\{ 1,T - j\lambda _T + 1\right\} \) for \(j=1,\ldots , K\). For a generic interval \([s,e]\), define the sequences

$$\begin{aligned} {\textrm{R}_{s,e} = \left[ c_{k_1}^r, c_{k_1+1}^r, \ldots ,e \right] , \quad \textrm{L}_{s,e} = \left[ c_{k_2}^l, c_{k_2+1}^l, \ldots , s\right] ,}\nonumber \\ \end{aligned}$$
(4)

where \(k_1:= \textrm{argmin}_{j \in \left\{ 1,2,\ldots ,K \right\} }\left\{ j\lambda _T > s \right\} \) and \(k_2:= \textrm{argmin}_{j\in \left\{ 1,2,\ldots ,K \right\} }\left\{ T-j\lambda _T+1 < e \right\} \). We denote by

$$\begin{aligned} \varvec{C_{s,e}^{b}(X)} = (C_{s,e}^{b}(\varvec{X^{(1)}}), \ldots , C_{s,e}^{b}(\varvec{X^{(d)}})) \end{aligned}$$

the vector in \((\mathbb {R}^{+})^{d}\) that contains the contrast function values for each of the d component data sequences, \(\varvec{X^{(i)}}, i = 1,\ldots ,d\), at the location b when we work in the interval \([s,e]\). Then, denoting by |A| the cardinality of any sequence A, by A(j) its \(j\)th element, and with \(L(\cdot )\) being any mean-dominant norm (such as those in (2)) employed for the aggregation of the contrast function values, the pseudocode of the main function of the proposed algorithm is as follows:

[Pseudocode of the main MID function]

A brief explanation of the pseudocode follows. With K already defined above, the intervals \([s_1,e_1], \ldots , [s_{2K},e_{2K}]\) are those used for the isolation step. Notice that in the odd intervals \([s_1, e_1], [s_3, e_3], \ldots , [s_{2K - 1}, e_{2K - 1}]\) the start-point is fixed, unchanged, and equal to s, meaning that \(s_1 = s_3 = \ldots = s_{2K-1} = s\). In the even intervals \([s_2, e_2], [s_4, e_4], \ldots , [s_{2K}, e_{2K}]\), it is the end-point that is kept fixed and equal to e, meaning that \(e_2 = e_4 = \ldots = e_{2K} = e\). The process continues for as long as there are intervals to check. The term “expanding intervals”, used throughout the paper, refers to this one-sided expansion (of magnitude \(\lambda _T\)) of the intervals. As shown in Fig. 2 for the specific example in Fig. 1, the pseudocode also makes it clear that, in general, MID looks for change-points interchangeably in right- and left-expanding intervals which, with high probability, contain at most one change-point. In each of these intervals, MID acts in the same way as shown in the main part of the pseudocode. The MID procedure is launched by the call MID(\(1, T, \lambda _T, \zeta _{T,d}, L\)).
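To complement the verbal description, the following Python sketch mimics the control flow of the recursion. It is illustrative only and deliberately simplified: it assumes a user-supplied function contrast_matrix(X, s, e), returning the matrix of contrast values for the candidates b = s, ..., e - 1 and all d components (such as the matrix B built above), and an aggregation norm L acting on its rows; it is not a line-by-line transcription of the pseudocode.

```python
import math

def mid(X, s, e, lambda_T, zeta, contrast_matrix, L, cpts=None):
    """Sketch of the MID recursion on the generic 1-based interval [s, e]."""
    if cpts is None:
        cpts = []
    if e - s < 1:
        return sorted(cpts)
    T = X.shape[0]
    K = math.ceil(T / lambda_T)
    # End-points of right-expanding and start-points of left-expanding intervals, as in (4).
    r_ends = [c for c in (min(j * lambda_T, T) for j in range(1, K + 1)) if c > s] + [e]
    l_starts = [c for c in (max(1, T - j * lambda_T + 1) for j in range(1, K + 1)) if c < e] + [s]
    # Interchange right- and left-expanding intervals.
    intervals = []
    for i in range(max(len(r_ends), len(l_starts))):
        if i < len(r_ends):
            intervals.append((s, min(r_ends[i], e), "right"))
        if i < len(l_starts):
            intervals.append((max(l_starts[i], s), e, "left"))
    for s_i, e_i, side in intervals:
        if e_i - s_i < 1:
            continue
        B = contrast_matrix(X, s_i, e_i)   # one row per candidate b = s_i, ..., e_i - 1
        v = L(B)                           # aggregated contrast values
        b_hat = s_i + int(v.argmax())
        if v.max() > zeta:                 # detection: restart on the remaining data
            cpts.append(b_hat)
            if side == "right":
                return mid(X, e_i, e, lambda_T, zeta, contrast_matrix, L, cpts)
            return mid(X, s, s_i, lambda_T, zeta, contrast_matrix, L, cpts)
    return sorted(cpts)
```

Paired, for instance, with the cusum function of the previous sketch through contrast_matrix(X, s, e) = np.column_stack([cusum(X[:, j], s, e) for j in range(X.shape[1])]) and L(B) = np.sqrt(np.mean(B ** 2, axis=1)), the call mid(X, 1, T, 3, zeta, contrast_matrix, L) returns the estimated change-point locations.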

2.2 Theory

We work under the setting in (1). Setting \(r_0 = 0\) and \(r_{N+1} = T\), the illustration scenarios are:

(S1) Changes in the mean structure: For \(k = 1,\ldots , N+1\), \(\varvec{f_t} = \varvec{\mu _k} \in \mathbb {R}^{d}\) for \(t = r_{k-1}+1, \ldots , r_k\). In this case, the univariate component signals \(f_{t,j}\), for \(j = 1,\ldots , d\) are piecewise-constant.

(S2) Changes in the first derivative: For \(k = 1,\ldots , N+1\), \(\varvec{f_t} = \varvec{\mu _{1,k}} + \varvec{\mu _{2,k}}t\) for \(t = r_{k-1}+1,\ldots , r_k\), where \(\varvec{\mu _{1,k}}\) and \(\varvec{\mu _{2,k}}\) are vectors in \(\mathbb {R}^{d}\). In addition, we require that for \(j = 1,\ldots , N\), the equality \(\varvec{\mu _{1,j}} + \varvec{\mu _{2,j}}r_j = \varvec{\mu _{1,j+1}} + \varvec{\mu _{2,j+1}}r_j\) is satisfied. Under this framework, the change-points, \(r_k\), satisfy that \(\varvec{f_{r_{k} - 1}} + \varvec{f_{r_{k} + 1}} \ne 2\varvec{f_{r_{k}}}\). Therefore, the univariate component signals \(f_{t,j}\), for \(j = 1,\ldots , d\) are continuous and piecewise-linear.

The aforementioned scenarios are only two specific illustration cases in which the proposed MID algorithm can be applied. Due to its change-point isolation step prior to detection, our algorithm can be applied in more complicated scenarios where each univariate signal could be, for example, piecewise-polynomial or piecewise-exponential. In Sects. 2.2.1 and 2.2.2, we provide the main theorems for the consistency of our method in accurately estimating the true number and the location of the change-points in Scenarios (S1) and (S2), respectively. The theoretical results presented in this section are for either the \(L_{\infty }\) or the \(L_2\) mean-dominant norms discussed in the paper for the aggregation of the information from the component data sequences; under a similar proof strategy, results can be obtained for other mean-dominant norms as well.

2.2.1 Scenario (S1)

As already discussed, the first step of the detection process depends on an appropriately chosen contrast function, which, for every component data sequence, is applied to each change-point candidate. In Scenario (S1), the contrast function applied to the component data sequences \(X_{t,j}, \forall j \in \left\{ 1,\ldots , d\right\} \) is the absolute value of the widely used CUSUM statistic, with the latter being defined as

$$\begin{aligned} \tilde{X}_{s,e}^{b,j} = \sqrt{\frac{e-b}{n(b-s+1)}}\sum _{t=s}^{b}X_{t,j} - \sqrt{\frac{b-s+1}{n(e-b)}}\sum _{t=b+1}^{e}X_{t,j},\nonumber \\ \end{aligned}$$
(5)

where \(1\le s \le b < e\le T\) and \(n=e-s+1\). Before proceeding with the main theoretical result for the consistency of our method, and with \(L(\cdot )\) denoting any one of the mean-dominant norms in (2), allow us to introduce some more notation, as below.

$$\begin{aligned}&{\delta _T := \min _{j=1,\ldots , N+1}\mid r_j - r_{j-1} \mid , }\nonumber \\&{\varvec{\Delta _{j}} := \mid \varvec{f_{r_{j+1}}} - \varvec{f_{r_j}} \mid , \quad j=1,\ldots , N}\nonumber \\&{\underline{f}_{T} := \inf _{j=1,\ldots ,N}\left\{ L(\varvec{\Delta _{j}})\right\} .} \end{aligned}$$
(6)

The absolute value of the vector \(\varvec{\Delta _j}\in \mathbb {R}^d\) in (6) is taken component-wise. For the consistency result with respect to the estimated number of change-points and their estimated locations obtained by MID, we work under the assumption (A1) as follows:

  1. (A1)

    The quantities \(\delta _T\) and \(\underline{f}_T\), defined in (6), are connected by \(\sqrt{\delta _T}\underline{f}_T \ge \underline{C}\sqrt{\log \left( Td^{1/4}\right) }\), for a large enough constant \(\underline{C}\).

The number of change-points, N, is allowed to grow with the sample size T and the dimensionality d. Theorem 1 provides the main theoretical result for Scenario (S1) when either \(L_\infty \) or \(L_2\) are employed for the aggregation of the contrast function values. The proof is given in the appendix.

Theorem 1

Let \(\left\{ \varvec{X_t} \right\} _{t=1,\ldots ,T}\) follow model (1) with \(\varvec{f_t}\) as in Scenario (S1) and \(\varvec{\epsilon _{t}} \sim {\mathcal {N}}_d({\varvec{0}}, \Sigma )\), where \(\Sigma \in \mathbb {R}^{d \times d}\) is positive definite. Let N and \(r_j, j=1,\ldots ,N\) be the number and locations of the change-points, while \(\hat{N}\) and \(\hat{r}_j, j=1,\ldots ,\hat{N}\) are their estimates sorted in increasing order. Assuming that (A1) holds, then, there exist positive constants \(C_1, C_2, C_3,\) and \(C_4\), which do not depend on T or d, such that for \(C_1\sqrt{\log \left( Td^{1/4}\right) } \le \zeta _{T,d} < C_2\sqrt{\delta _T}\underline{f}_T\), we have:

For \(L(\cdot ) = L_{\infty }(\cdot )\),

$$\begin{aligned} {\mathbb {P}\,\left( \hat{N} = N, V_{\infty } \le C_3\log \left( Td^{\frac{1}{4}}\right) \right) \ge 1 - \frac{C_4}{T},} \end{aligned}$$
(7)

where \(V_{\infty }:= \underset{j = 1,\ldots , N}{\max }\left\{ \mid \hat{r}_j - r_j \mid \left( \Delta _j^{q_j}\right) ^2\right\} \) and \(q_j:= \textrm{argmax}_{k=1,\ldots ,d}\mid \tilde{X}_{s_j,e_j}^{\hat{r}_j,k} \mid \), for \([s_j, e_j]\) being the interval where \(\hat{r}_j\) is obtained.

For \(L(\cdot ) = L_{2}(\cdot )\),

$$\begin{aligned} {\mathbb {P}\,\left( \hat{N} = N, V_2 \le C_3\log \left( Td^{\frac{1}{4}}\right) \right) \ge 1 - \frac{C_4}{T},} \end{aligned}$$
(8)

where \(V_2:= \underset{j = 1,\ldots , N}{\max }\left\{ \mid \hat{r}_j - r_j \mid L_2^2(\varvec{\Delta _j})\right\} \).

The lower bounds for the probabilities in (7) and (8) do not depend on the dimensionality d and their order is \(1 - {\mathcal {O}}\left( T^{-1}\right) \). Furthermore, the rate of convergence of the estimated change-point locations does not depend on the minimum distance between two change-points, \(\delta _T\); the aggregated jump magnitude, \(\Delta _j^{q_j}\) for \(L_{\infty }\) and \(L_2(\varvec{\Delta _j})\) for \(L_2\), is the only quantity that affects the rate. We notice, though, that to be able to match the estimated change-point locations with the true ones, \(\delta _T\) should be larger than the distance between the estimated and the true change-point locations. Therefore, based on (7) and (8), we deduce that \(\delta _T\) must be at least \({\mathcal {O}}\left( \log \left( Td^{1/4}\right) \right) \). To put the obtained rate of convergence into perspective, it has already been proven in the literature (see, for example, Chan and Walther 2013) that in the univariate case the smallest possible \(\delta _T\underline{f}_T^2\), which allows for the detection of changes in the mean of a data sequence, is \({\mathcal {O}}(\log T - \log (\log T))\). In our case, for \(d = 1\), the \({\mathcal {O}}(\log T)\) rate is attained, which is near-optimal up to the rather negligible double logarithmic term. Moving now to the case where \(d > 1\), we compare the consistency results of MID to those obtained for two recent procedures for multiple change-point detection in multivariate settings: the inspect algorithm of Wang and Samworth (2018) and the kernel change-point (KCP) algorithm of Arlot et al. (2019). Finite-sample bounds like those in (7) and (8) imply a rate of convergence in an asymptotic setting. The comparison of the consistency results between the aforementioned methods is carried out following a known convention in the literature (see, for example, Venkatraman (1992)). For \({\mathcal {P}}_T\) being a class of distributions of \({\varvec{X}} \in \mathbb {R}^{d \times T}\), which is as in (1), we state that the procedure under consideration is consistent for \({\mathcal {P}}_T\) with rate of convergence \(\rho _T\) if

$$\begin{aligned} \inf _{P \in {\mathcal {P}}_T}\mathbb {P}\,_P\left( \hat{N} = N, \max _{j = 1, \ldots , N}\left| r_j - \hat{r}_j\right| \le T\rho _T\right) \xrightarrow [T \rightarrow \infty ]{} 1.\nonumber \\ \end{aligned}$$
(9)

For MID, Theorem 1 shows that the relevant rate of convergence is \(\rho _{T}^{\textrm{MID}} = {\mathcal {O}}\left( T^{-1}\underline{f}_T^{-2}\log \left( Td^{1/4}\right) \right) \). Theorem 2 of Wang and Samworth (2018) indicates that, apart from its dependence on the magnitude of the changes, the rate of convergence of the estimated change-point locations in inspect is also affected by the minimum distance between consecutive change-points; as already mentioned, this is not the case for MID. More specifically, employing our notation, Wang and Samworth (2018) show that, as long as \(\delta _T \ge 14\), for inspect the rate of convergence of the estimated locations is \(\rho _T^{insp} = {\mathcal {O}}\left( T^3\log (Td)\vartheta ^{-2}\delta _T^{-4}\right) \), where \(\vartheta \) is such that \(\vartheta \le \Vert \varvec{\theta ^{(i)}}\Vert _2, \forall i \in \left\{ 1,\ldots ,N\right\} \), where \(\varvec{\theta ^{(i)}} \in \mathbb {R}^{d}\) is the vector of the change magnitudes associated to the \(i^{\textrm{th}}\) change-point. Corollary 3 of Wang and Samworth (2018) explains that if \(\log (d) = {\mathcal {O}}(\log (T))\), \(\vartheta \asymp T^{- \alpha }\) and \(\delta _T \asymp T^{1-\beta }\), with \(2\alpha + 5\beta < 1\), then inspect estimates all change-points with rate of convergence \(\rho _T^{insp} = o(T^{-(1-2\alpha -4\beta )+\delta })\), for any \(\delta > 0\). Under the same scenario, and since \(\underline{f}_T \le \Vert \varvec{\theta ^{(i)}}\Vert _2, \forall i \in \left\{ 1,\ldots ,N\right\} \), meaning that we can take \(\underline{f}_T \asymp \vartheta \), our method estimates the change-points with rate of convergence \(\rho _T^{\textrm{MID}} = o(T^{-(1-2\alpha )+\delta })\), which is an improvement over the rate attained by inspect. Consistency results for the KCP algorithm of Arlot et al. (2019) have been extensively studied in Garreau and Arlot (2018). More specifically, for \(k: \mathbb {R}^d \times \mathbb {R}^d \rightarrow \mathbb {R}\) a positive semidefinite kernel, and with M a positive constant such that \(k(\varvec{X_t}, \varvec{X_t}) \le M^2 < \infty , \forall t \in \left\{ 1,\ldots , T\right\} \), Theorem 3.1 of Garreau and Arlot (2018) shows that the change-points are estimated by KCP with rate of convergence \(\rho _T^{\textrm{KCP}} = {\mathcal {O}}\left( T^{-1}\underline{f}_T^{-2}NM^2\log (T)\right) \), where N is the true number of change-points; the order of the lower bound for the probability is the same as in (7) and (8). To compare the consistency result of KCP with that of our method, we first highlight that an upper bound on the true number of change-points, N, is needed in order for KCP to be able to estimate the change-points; this is not the case in MID since there is no prior information on N which is allowed to grow with the sample size, T, and the dimensionality, d. Furthermore, in our understanding the positive constant \(M^2\) appearing in the rate of convergence of KCP depends on the dimensionality d. For example in the special case of the linear kernel (see Arlot et al. (2019) for classical examples of kernels) \(k^{\textrm{lin}}({\varvec{x}}, {\varvec{y}}) = \left\langle {\varvec{x}}, {\varvec{y}}\right\rangle _{\mathbb {R}^{d}}\), for \({\varvec{x}}, {\varvec{y}} \in \mathbb {R}^d\), we deduce that \(M^2 = {\mathcal {O}}(d)\), leading to \(\rho _T^{\textrm{KCP}} = {\mathcal {O}}\left( NdT^{-1}\underline{f}_T^{-2}\log T\right) \), which is worse than \(\rho _T^{\textrm{MID}}\) if either d or N are allowed to diverge with T.

Assumption (A1) requires \(\delta _T(\underline{f}_T)^2\) to be of order at least \({\mathcal {O}}\left( \log \left( Td^{1/4}\right) \right) \). Combining this with the fact that the minimum distance, \(\delta _T\), between successive change-points is at least \({\mathcal {O}}\left( \log \left( Td^{1/4}\right) \right) \), means that \(\underline{f}_T\) could decrease with T in cases where \(\delta _T\) is of order higher than \({\mathcal {O}}\left( \log \left( Td^{1/4}\right) \right) \).

With respect to the threshold parameter, \(\zeta _{T,d}\), the rate of its lower bound is \({\mathcal {O}}\left( \sqrt{\log \left( Td^{1/4}\right) }\right) \); this will also be used in practice as the default rate. Therefore, we have that

$$\begin{aligned} \zeta _{T,d} = C\sqrt{\log \left( Td^{1/4}\right) }, \end{aligned}$$
(10)

where C is a positive constant. More details on the choice of C are given in Sect. 3.3.

2.2.2 Scenario (S2)

We are under the scenario where for any \(j \in \left\{ 1,\ldots , d\right\} \), the underlying signal \(f_{t,j}, t=1,\ldots ,T\) has a continuous and piecewise-linear structure as in Fig. 4. In this case, the contrast function applied to the component data sequences \(X_{t,j}, \forall j \in \left\{ 1,\ldots , d\right\} \) is

$$\begin{aligned} C_{s,e}^b(\varvec{X_j}) = \mid \left\langle \varvec{X_j},\varvec{\phi _{s,e}^b}\right\rangle \mid , \end{aligned}$$
(11)

where for \(n=e-s+1\) and

$$\begin{aligned}&\alpha _{s,e}^b = \sqrt{\frac{6}{n(n^2-1)(1+(e-b+1)(b-s+1)+(e-b)(b-s))}}\\&\beta _{s,e}^b = \sqrt{\frac{(e-b+1)(e-b)}{(b-s+1)(b-s)}}, \end{aligned}$$

we have that the contrast vector, \(\phi _{s,e}^b(t)\), is equal to

$$\begin{aligned}&\alpha _{s,e}^b\beta _{s,e}^b\left[ (e+2b-3s+2)t - (be +bs - 2s^2+2s)\right] , \text{ for } t \in [s,b]\\ {}&\frac{\alpha _{s,e}^b}{\beta _{s,e}^b}\left[ (2e^2+2e-be-bs) -(3e-2b-s+2)t\right] , \text{ for } t \in [b+1,e]\\ {}&0, \text{ otherwise }. \end{aligned}$$
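As an illustration, the short Python function below transcribes the formulas above for a single component (a sketch only, not the authors' code); it evaluates \(\phi_{s,e}^b\) and returns the contrast value in (11).

```python
import numpy as np

def linear_contrast(x, s, e, b):
    """Contrast value (11) for the 1-based interval [s, e] and candidate b,
    using the contrast vector phi_{s,e}^b defined above; beta is finite and
    nonzero only for b = s + 1, ..., e - 1."""
    n = e - s + 1
    alpha = np.sqrt(6.0 / (n * (n ** 2 - 1)
                           * (1 + (e - b + 1) * (b - s + 1) + (e - b) * (b - s))))
    beta = np.sqrt((e - b + 1) * (e - b) / ((b - s + 1) * (b - s)))
    t = np.arange(s, e + 1, dtype=float)
    phi = np.zeros(n)
    left = t <= b
    phi[left] = alpha * beta * ((e + 2 * b - 3 * s + 2) * t[left]
                                - (b * e + b * s - 2 * s ** 2 + 2 * s))
    phi[~left] = (alpha / beta) * ((2 * e ** 2 + 2 * e - b * e - b * s)
                                   - (3 * e - 2 * b - s + 2) * t[~left])
    return abs(float(np.dot(x[s - 1:e], phi)))
```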

For more details on how this vector is constructed for (S2), please see section B.2 in the online supplementary material of Baranowski et al. (2019). Under Scenario (S2), we denote by

$$\begin{aligned}&{\varvec{\Delta _{j}} := \mid \varvec{f_{r_{j}-1}} + \varvec{f_{r_{j}+1}} - 2\varvec{f_{r_j}} \mid , \quad j=1,\ldots , N}\nonumber \\&{\underline{f}_{T} := \inf _{j=1,\ldots ,N}\left\{ L(\varvec{\Delta _{j}})\right\} ,} \end{aligned}$$
(12)

whereas \(\delta _T\) has the same expression as in (6). We are ready to proceed with the consistency result of MID when applied under Scenario (S2). We work under the assumption:

  1. (A2)

    The quantities \(\delta _T\) and \(\underline{f}_T\) are connected by \(\delta _T^{3/2}\underline{f}_T \ge \underline{C}^*\sqrt{\log \left( Td^{1/4}\right) }\), where \(\underline{C}^*\) is a large enough constant.

Theorem 2 provides the main theoretical result for Scenario (S2). The steps of its proof are similar to those employed in the proof of Theorem 1; therefore, the proof is given in the supplementary material.

Theorem 2

Let \(\left\{ \varvec{X_t} \right\} _{t=1,\ldots ,T}\) follow model (1) with \(\varvec{f_t}\) as in Scenario (S2) and \(\varvec{\epsilon _{t}} \sim {\mathcal {N}}_d({\varvec{0}}, \Sigma )\), where \(\Sigma \in \mathbb {R}^{d \times d}\) is positive definite. Let N and \(r_j, j=1,\ldots ,N\) be the number and locations of the change-points, while \(\hat{N}\) and \(\hat{r}_j, j=1,\ldots ,\hat{N}\) are their estimates sorted in increasing order. Assuming that (A2) holds, then, there exist positive constants \(C_1, C_2, C_3,\) and \(C_4\), which do not depend on T or d, such that for \(C_1\sqrt{\log \left( Td^{1/4}\right) } \le \zeta _{T,d} < C_2\delta _T^{3/2}\underline{f}_T\), we obtain that:

For \(L(\cdot ) = L_{\infty }(\cdot )\),

$$\begin{aligned} {\mathbb {P}\,\left( \hat{N} = N, \tilde{V}_{\infty } \le C_3\left( \log \left( Td^{\frac{1}{4}}\right) \right) ^{1/3}\right) \ge 1 - \frac{C_4}{T},} \end{aligned}$$
(13)

where \(\tilde{V}_{\infty }:= \underset{j = 1,\ldots , N}{\max }\left\{ \mid \hat{r}_j - r_j\mid \left( \Delta _j^{q_j}\right) ^{2/3}\right\} \) and \(q_j:= \textrm{argmax}_{k=1,\ldots ,d}\left\{ C_{s_j,e_j}^{\hat{r}_j}(\varvec{X_k})\right\} \), for \([s_j, e_j]\) denoting the interval where \(\hat{r}_j\) is obtained during MID.

For \(L(\cdot ) = L_{2}(\cdot )\),

$$\begin{aligned} {\mathbb {P}\,\left( \hat{N} = N, \tilde{V}_{2} \le C_3\left( \log \left( Td^{\frac{1}{4}}\right) \right) ^{1/3}\right) \ge 1 - \frac{C_4}{T},} \end{aligned}$$
(14)

where \(\tilde{V}_{2}:= \underset{j = 1,\ldots , N}{\max }\left\{ \mid \hat{r}_j - r_j\mid \left( L_2(\varvec{\Delta _j})\right) ^{2/3}\right\} \).

As in (S1), the lower bounds for the probabilities in (13) and (14) do not depend on the dimensionality of the data sequence \(\varvec{X_t}\); their order is \(1 - {\mathcal {O}}\left( T^{-1}\right) \). Furthermore, the rate of convergence of the estimated change-point locations depends only on the aggregated change magnitude, \(\Delta _j^{q_j}\) for \(L_{\infty }\) and \(L_2(\varvec{\Delta _j})\) for \(L_2\). Under (S2), the change-points are estimated with rate of convergence \({\mathcal {O}}\left( T^{-1}\underline{f}_T^{-2/3}\left( \log \left( Td^{1/4}\right) \right) ^{1/3}\right) \). In the special case of \(\underline{f}_T \asymp T^{-1}\), the rate of convergence becomes \({\mathcal {O}}\left( T^{-1/3}\left( \log \left( Td^{1/4}\right) \right) ^{1/3}\right) \). In the univariate case, it is proven in Raimondo (1998) that the asymptotic minimax rate for the change-point detection problem under (S2) is \({\mathcal {O}}\left( T^{-1/3}\right) \), which differs from the convergence rate obtained by MID only by the logarithmic factor.

As in Scenario (S1), the lower bound for the threshold, \(\zeta _{T,d}\), is of order \({\mathcal {O}}\left( \sqrt{\log \left( Td^{1/4}\right) }\right) \); this is the default rate used in practice in Sects. 4 and 5. Therefore,

$$\begin{aligned} \zeta _{T,d} = C^*\sqrt{\log \left( Td^{1/4}\right) }, \end{aligned}$$
(15)

where \(C^* > 0\). More details on the choice of the values for \(C^*\) are given in Sect. 3.3.

3 Computational complexity and practicalities

3.1 Computational complexity

With \(K = \lceil T/\lambda _T\rceil \), the total number of distinct intervals, \({\mathcal {I}}\), required in order to cover the whole data sequence is at most 2K (K intervals from each expanding direction). Choosing the expansion step, \(\lambda _T\), small enough leads to isolation with high probability; more information on how to choose \(\lambda _T\) in order to obtain good accuracy performance while maintaining low computational cost can be found in Sect. 3.3. With \(\delta _T\) the minimum distance between consecutive change-points, isolation is guaranteed as long as \(\lambda _T < \delta _T\). We use this inequality, which leads to \(K > \lceil T/\delta _T\rceil \) and therefore, in the worst case scenario \({\mathcal {I}} = 2K > 2\lceil T/\delta _T\rceil \). The lower bound is of order \({\mathcal {O}}(T/\delta _T)\). For generic intervals \([s,e], 1\le s < e \le T\) considered throughout the MID algorithm, the relevant contrast function values \(C_{s,e}^b(\varvec{X_j})\) will be calculated \(\forall b \in [s,e)\) and \(\forall j \in \left\{ 1,\ldots ,d\right\} \) before aggregation takes place. In both scenarios studied in our paper, for fixed j, the cost of computing \(C_{s,e}^b(\varvec{X_j})\) is \({\mathcal {O}}(e-s+1)\), meaning that it is linear in time; see Section B.1 of the online supplement in Baranowski et al. (2019). Therefore, the calculation of the contrast function values for all the component data sequences has computational cost of order \({\mathcal {O}}(KdT)\). Applying now the mean-dominant norms, as in (2), to the obtained values has an order \({\mathcal {O}}(d)\) computational cost. Combining the complexities explained at each step of MID, we conclude that the total computational complexity of the algorithm is of order \({\mathcal {O}}(Kd^2T) = {\mathcal {O}}\left( d^2T^2\delta _T^{-1}\right) \).

3.2 Mean-dominant norms

The proposed MID methodology is based on an aggregation step of the contrast function values obtained from each component data sequence, using mean-dominant norms as defined in Section 2 of Carlstein (1988). Theorems 1 and 2 for Scenarios (S1) and (S2), respectively, cover the theoretical behaviour of MID under both the \(L_{\infty }\) and \(L_2\) mean-dominant norms given in (2). Therefore, in the remaining sections, the discussion on the choice of the parameter values, as well as the results on the practical performance of MID, focuses on the cases where our method is combined with the \(L_{\infty }\) or the \(L_2\) mean-dominant norm. A data-adaptive variant of MID that chooses the most suitable mean-dominant norm for the aggregation step will also be introduced.

3.3 Choice of parameter values

In order to choose the constants C in (10) and \(C^*\) in (15), we ran a large-scale simulation study involving data sequences \(\left\{ \varvec{X_t} \right\} _{t=1,\ldots ,T}\), for \(T = 700, 1400\), following model (1), where \(\varvec{f_t}={\varvec{0}}\) and the \(\varvec{\epsilon _{t}}\) follow the d-variate standard normal distribution. Specifically, for each \(d \in \{1,\ldots ,50\}\) we generated 500 replicates and applied MID under Scenarios (S1) and (S2) to each one of those replicates using various threshold constant values C and \(C^*\), respectively, in order to estimate the number of change-points. For each dimension, in order to control the Type I error rate, \(\alpha \), of falsely detecting change-points, we chose the default constant to be the one for which the number of replicates with no detected change-points was closest to \((1-\alpha )\times 500\). For \(d>50\), we keep the threshold that gave the best results for the simulated 50-dimensional data sequence. Table 1 presents the results for \(\alpha \in \left\{ 0.05, 0.1\right\} \) under Scenarios (S1) and (S2). From now on, the obtained values for C and \(C^*\) will be referred to as the default constants.
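The following Python sketch outlines the calibration just described. It is schematic: mid_estimate is a placeholder for a call to MID returning the estimated change-point locations for a given threshold, and the grid of candidate constants is illustrative rather than the one used to produce Table 1.

```python
import numpy as np

def default_threshold(C, T, d):
    """Threshold of the form (10)/(15): zeta_{T,d} = C * sqrt(log(T * d^{1/4}))."""
    return C * np.sqrt(np.log(T * d ** 0.25))

def calibrate_constant(mid_estimate, T, d, grid, alpha=0.05, reps=500, seed=1):
    """Choose the constant whose no-detection frequency on pure-noise data is
    closest to (1 - alpha) * reps; mid_estimate(X, zeta) is a placeholder for a
    run of MID returning the estimated change-point locations."""
    rng = np.random.default_rng(seed)
    no_detection = np.zeros(len(grid), dtype=int)
    for _ in range(reps):
        X = rng.standard_normal((T, d))           # model (1) with f_t = 0
        for i, C in enumerate(grid):
            if len(mid_estimate(X, default_threshold(C, T, d))) == 0:
                no_detection[i] += 1
    target = (1 - alpha) * reps
    return grid[int(np.argmin(np.abs(no_detection - target)))]
```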

Regarding the expansion parameter, \(\lambda _T\), we have already discussed that isolation of the change-points is guaranteed in theory as long as we take \(\lambda _T < \delta _T\). More specifically, in the proofs of Theorems 1 and 2, we show change-point isolation and detection for \(\lambda _T \le \delta _T/3\). However, any value of \(\lambda _T \le \delta _T/m\), for \(m > 1\), suffices; the adjustments to be made in the proof in such a case, as well as why \(m = 3\) leads to a more symmetric approach for the intervals where detection occurs, are explained in Remark 1 of the online supplement of Anastasiou and Fryzlewicz (2022). In practice, \(\delta _T\) is unknown, and to ensure isolation, \(\lambda _T\) can be taken as small as 1. If \(\lambda _T > 1\), then isolation is guaranteed with high probability. The computational cost of running our algorithm is inversely proportional to the size of \(\lambda _T\), because the smaller \(\lambda _T\) is, the larger the number of created intervals that are examined for change-points. However, the computational speed of MID provides flexibility in the choice of the expansion parameter; in all practical examples in Sects. 4 and 5 we take \(\lambda _T = 3\). Tables 2, 3 and 4 indicate the high accuracy of our method even in quite complex scenarios that exhibit low sparsity (see (16) for the relevant definition) and/or a large number of regularly occurring change-points.

Table 1 The optimal values for the threshold constants, C and \(C^*\), which control the Type I error rate \(\alpha \) under the Scenarios (S1) and (S2), respectively, for \(d = 1,\ldots ,50\). Results are presented for the \(L_2\) and the \(L_\infty \) norms

3.4 Decision on the aggregation method

In Cho (2016), Cho and Fryzlewicz (2015), and Anastasiou et al. (2022), it has been explained that the \(L_\infty \)-based aggregation of the contrast functions for each component data sequence tends to exhibit a better behaviour (compared to the \(L_2\)-based aggregation) in scenarios where the true change-points appear only in a small number of the component data sequences. In contrast, due to spuriously large contrast function values, the \(L_{\infty }\) approach could suffer in situations where the change-points are not sparse across the panel of the data sequences.

This difference in behaviour between the \(L_2\) and the \(L_{\infty }\) mean-dominant norms examined in our paper is what has motivated us to introduce a preliminary step in MID, where the sparsity in a given d-dimensional data sequence, \(\varvec{X_t}\), is first estimated. For \(r_1, \ldots , r_N\) being the N true change-points, we define, for any \(j \in \left\{ 1,\ldots ,N\right\} \), \(A_j \subseteq \left\{ 1,\ldots ,d\right\} \) to be the set of indices of the univariate component data sequences that contain the change-point \(r_j\). Then, for \(\varvec{X_t}\), the sparsity is given by

$$\begin{aligned} sp = \max _{j = 1,\ldots ,N}\left\{ \mid A_j\mid \right\} /d, \end{aligned}$$
(16)

where \(\mid A_j\mid \) is the cardinality of the set \(A_j\). It is straightforward that \(sp \in [0,1]\). Our proposed hybrid approach aims to data-adaptively decide on the aggregation method to be used and eliminates the necessity of choosing between the \(L_2\) and \(L_\infty \) mean-dominant norms. The steps to achieve our purpose are as follows.

Step 1 We apply MID paired with the \(L_\infty \) aggregation rule in order to obtain \(\hat{r}_1, \ldots , \hat{r}_M\). It has already been explained that the \(L_\infty \) norm is preferable and provides very good results in cases with sparse change-points, while it tends to overestimate the number of change-points (due to the contrast function taking spuriously large values) under the scenario of dense change-points. Therefore, \(M \ge N\).

Step 2 We now estimate the sparsity in the given data as defined in (16). With \(\hat{r}_0 = 0\) and \(\hat{r}_{M+1} = T\), we first collect the triplets \((\hat{r}_{m-1}+1,\hat{r}_m,\hat{r}_{m+1}), \forall m \in \left\{ 1,\ldots , M \right\} \). After this, \(\forall i \in \left\{ 1,\ldots ,d \right\} \), we calculate \(CS^{(i)}(\hat{r}_m):= C^{\hat{r}_m}_{\hat{r}_{m-1}+1,\hat{r}_{m+1}}(\varvec{X_i})\), with \(C^b_{s,e}(\varvec{X_i})\) being the relevant contrast function value (based on whether we are under Scenario (S1) of Sect. 2.2.1 or Scenario (S2) of Sect. 2.2.2) for the point b, when we work in the interval \([s,e]\), for the univariate component data sequence \(\varvec{X_i}\). The d contrast function values for each \(\hat{r}_m\) are collected in the sets

$$\begin{aligned} S_m = \left\{ CS^{(1)}(\hat{r}_m), \ldots , CS^{(d)}(\hat{r}_m)\right\} , \quad m = 1,\ldots , M. \end{aligned}$$
(17)

Step 3 For each \(m = 1, \ldots , M\), all the elements of \(S_m\) are tested against the relevant threshold value, \(\zeta _T\), for univariate change-point detection which is of the order \({\mathcal {O}}(\sqrt{\log T})\) in both scenarios of piecewise-constant and continuous piecewise-linear signals (representing scenarios (S1) and (S2), respectively, covered in this paper). In terms of the threshold constants, we employ those of Anastasiou and Fryzlewicz (2022). We denote by \(\hat{sp}_m:= \#\left\{ i:CS^{(i)}(\hat{r}_m) > \zeta _T\right\} /d\). The estimated sparsity is then \(\hat{sp} = \max _{m \in \left\{ 1, \ldots , M\right\} }\left\{ \hat{sp}_m \right\} .\)

Table 2 Distribution of \(\hat{N} - N\) over 100 simulated multivariate data sequences with 3 change-points under Scenario (S1) of Sect. 2.2.1. The signal strength, as defined in (18), is equal to 2 for each change-point. The average ARI, \(d_H\), and computational times are also given

Step 4 For cases where \(\hat{sp} \le 0.4\), we accept the result from Step 1, where MID has been paired with the \(L_{\infty }\) mean-dominant norm, whereas if \(\hat{sp} \ge 0.6\), then MID is paired with the \(L_2\) mean-dominant norm. Extensive simulations have shown that there is no significant difference in MID’s practical performance with respect to accuracy (on both the estimated number of change-points and the estimated locations) when \(\hat{sp} \in (0.4,0.6)\); therefore, MID could be paired with either of the two aforementioned norms and give very good results. For computational cost reasons, in such cases we accept the result of Step 1. From now on, we denote by \(\textrm{MID}_{\textrm{opt}}\) the data-adaptive, sparsity-based MID version explained in this section, where the aggregation method is chosen based on the estimated sparsity of the change-points in the given data. A schematic sketch of Steps 1–4 is given below.
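The sketch condenses Steps 1–4 into Python. It is schematic only: mid_estimate stands for a run of MID with the indicated aggregation norm, contrast for the relevant univariate contrast function of Sect. 2.2.1 or 2.2.2, and zeta_T for the univariate threshold of Anastasiou and Fryzlewicz (2022); none of these names refer to actual software.

```python
import numpy as np

def mid_opt(X, mid_estimate, contrast, zeta_T):
    """Sketch of the sparsity-based choice between L_inf and L_2 (Steps 1-4)."""
    T, d = X.shape
    cpts = mid_estimate(X, norm="Linf")              # Step 1: L_inf-based run
    if len(cpts) == 0:
        return cpts
    r = [0] + sorted(cpts) + [T]                     # r_hat_0 = 0, r_hat_{M+1} = T
    sp_hat = 0.0
    for m in range(1, len(r) - 1):                   # Steps 2-3: estimate the sparsity
        s, b, e = r[m - 1] + 1, r[m], r[m + 1]
        values = np.array([contrast(X[:, i], s, e, b) for i in range(d)])
        sp_hat = max(sp_hat, float(np.mean(values > zeta_T)))
    if sp_hat >= 0.6:                                # Step 4: dense change-points -> L_2
        return mid_estimate(X, norm="L2")
    return cpts                                      # sparse or borderline -> keep Step 1
```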

3.5 A permutation-based approach

MID is a threshold-based algorithm because at each step, the largest aggregated contrast function value is tested against a predefined threshold in order to decide whether there is a change-point at the corresponding location. In Sect. 3.3 we have explained how the threshold constant is carefully chosen to control the Type I error taking into account the dimensionality of the data. However, misspecification of the threshold can possibly lead to the misestimation of the number of change-points. To solve such issues, we propose a variant of MID based on permutation.

The idea of a “data-adaptive threshold” through permutations or bootstrapping is not new. In Cho (2016), a bootstrap procedure is proposed, which is motivated by the representation theory developed for the Generalised Dynamic Factor Model, in order to approximate the quantiles of the developed double CUSUM test statistic under the null hypothesis of no change-points; the obtained quantiles are used as a test criterion for detection. In Cabrieto et al. (2018), a permutation-based approach is used to test the presence of correlation changes in multivariate time series. Under a univariate framework, in Antoch and Hušková (2001) a permutation scheme is proposed for deriving critical values for test statistics based on functionals of the partial sums \(\sum _{i=1}^{k}(X_i-\bar{X}_n), k=1,\ldots ,n\), where \(X_i\) are the observed data and \(\bar{X}_n\) their mean value. The proposed scheme consists of three steps: a) compute the test statistic using the original data, b) construct the permutation distribution by computing the relevant test statistic on permuted versions of the data, and c) reject the null hypothesis of no change-points if the test statistic lies in the tails of the distribution.

We propose a variant that combines the isolation technique of MID with an extension of the permutation procedure used in Antoch and Hušková (2001) to the multivariate framework. Although this permutation procedure tends to be computationally expensive, it has a straightforward implementation. As generally holds for permutation procedures, the test statistic obtained from the original data is compared to those obtained when applying the same steps to several permuted versions of the data. In our proposed permutation scheme, all steps remain the same as in MID, with the only difference being the way the algorithm chooses to accept or reject a change-point within each interval. To be more precise, suppose that, for a given data sequence \(\left\{ \varvec{X_t}\right\} _{t=1,\ldots ,T}\), the MID algorithm is at a step where it looks for a change-point in the interval \(I=[s^*,e^*]\), where \(1 \le s^* < e^* \le T\). As described in Sect. 2.1, MID returns a vector \({\varvec{v}} \in \mathbb {R}^{J}\), where J is the number of change-point candidates. The elements of \({\varvec{v}}\) correspond to the aggregated contrast function values for each candidate point in I. The next step is to store \(T_{I_{max}} = \underset{i \in \{1,\ldots , J\}}{\max }\left\{ v_i\right\} \) and repeat the following procedure a prespecified number of times, denoted by K:

  1. 1.

    Generate a random permutation of \((s^{*},\ldots , e^{*})\).

  2. 2.

    Reorder the data according to the permutation.

  3. 3.

    Calculate and store the maximum aggregated contrast function value for each permutation.

The empirical distribution obtained from the maximum values is used to construct our test. More precisely, we identify \(\hat{b}_{I}=\textrm{argmax}_t\{v_t\}\) as a change-point if, for given \(\alpha \in (0,1)\), \(T_{I_{max}} > q_{1-\alpha }\), where \(q_{1-\alpha }\) is the \(100(1-\alpha )\%\) quantile of the empirical distribution.

The parameter \(\alpha \) controls the probability of false detections. Small values of \(\alpha \) will make it harder for the algorithm to reject the null hypothesis of no change-points, whereas large values can reduce the probability of a Type II error. For the simulations in Sect. 4, we take \(\alpha = 0.01\). Regarding the parameter K, the simulation results provided in Antoch and Hušková (2001)—although for a univariate framework—suggest that the empirical quantiles get stabilised quickly. Therefore, for our simulations, we take \(K=1000\).
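A Python sketch of the interval-level permutation decision is given below (illustrative only; aggregated_contrast is a placeholder returning the vector v of aggregated contrast values for the candidates in [s, e], and the defaults alpha = 0.01 and K = 1000 follow the discussion above).

```python
import numpy as np

def permutation_decision(X, s, e, aggregated_contrast, alpha=0.01, K=1000, seed=1):
    """Permutation-based decision in the 1-based interval [s, e]: returns the
    candidate location if it is accepted as a change-point, and None otherwise."""
    rng = np.random.default_rng(seed)
    v = aggregated_contrast(X, s, e)         # aggregated contrast values of the candidates
    T_I_max = v.max()
    b_hat = s + int(v.argmax())
    null_max = np.empty(K)
    for k in range(K):
        idx = rng.permutation(np.arange(s - 1, e))   # permute the time points in [s, e]
        X_perm = X.copy()
        X_perm[s - 1:e] = X[idx]
        null_max[k] = aggregated_contrast(X_perm, s, e).max()
    q = np.quantile(null_max, 1 - alpha)             # empirical (1 - alpha) quantile
    return b_hat if T_I_max > q else None
```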

4 Numerical studies

4.1 Comparative simulation study

In this section, we investigate the performance of our method in Scenarios (S1) and (S2) covered in Sects. 2.2.1 and 2.2.2, respectively. Furthermore, in (S1), MID is compared with state-of-the-art methods through a comprehensive simulation study. The competitors are the Double Cusum (DC) method of Cho (2016), the Sparsified Binary Segmentation (SBS) algorithm of Cho and Fryzlewicz (2015), the INSPECT algorithm of Wang and Samworth (2018), and the KCP method of Arlot et al. (2019). DC and SBS are implemented in the hdbinseg R package. For INSPECT we have used the InspectChangepoint R package, while KCP is implemented in the kcpRS R package. SBS and INSPECT were called with their default arguments. For DC, the parameters used were the ones that gave the best results in the simulation study carried out in the relevant paper ( Cho 2016). Regarding KCP, an upper bound, \(K_{\textrm{max}}\), for the true number, N, of change-points needs to be provided in order for the method to work. In the simulations carried out in this section, the true number of change-points is no more than 50; we take \(K_{\mathrm{{max}}} = 100\). Furthermore, in the kcpRS R package, the KCP-based estimated number of change-points is selected either through a model selection procedure where the penalty constants are decided through an extensive grid search, or through a scree test. In our simulation results that follow, the grid-based procedure is denoted by KCP_Grid, while the scree-test-based estimation result is denoted by KCP_Scree. With respect to our algorithm, we give results for the data-adaptive, sparsity-based MID version of Sect. 3.4, denoted by \(\textrm{MID}_{\textrm{opt}}\), where the threshold values are taken from Table 1 for \(\alpha = 0.05\). In addition, results are presented for the permutation-based variants explained in Sect. 3.5; these are denoted by \(\textrm{MIDPERL}_2\) and \(\textrm{MIDPERL}_\infty \) for the \(L_2\) and \(L_\infty \) mean-dominant norms, respectively.

Simulation setup We considered the settings where \(T=1500\), \(d \in \left\{ 30,100\right\} \) and \(N \in \left\{ 3,20,50 \right\} \). With the definition of sparsity as in (16), we took \(sp \in \left\{ 0.2,0.5,0.8 \right\} \), meaning that at least one true change-point appears in \(d \times sp\) data sequences. In an attempt to increase the difficulty of the simulations, the remaining true change-points appear in only a few data sequences. The signal strength at the \(i{\textrm{th}}\) change-point is defined as

$$\begin{aligned} ss^{(S1)}_i:= \Vert \varvec{f_{r_i}} - \varvec{f_{r_i+1}}\Vert _2 \end{aligned}$$
(18)

in the case of Scenario (S1) and as

$$\begin{aligned} ss^{(S2)}_i:= \Vert \varvec{f_{r_i-1}} + \varvec{f_{r_i+1}} - 2\varvec{f_{r_i}}\Vert _2, \end{aligned}$$
(19)

in the case of Scenario (S2), with \(\Vert \cdot \Vert _2\) denoting the Euclidean norm. For our simulation study, for Scenario (S1), we take \(ss^{(S1)}_i = s_1, \forall i \in \left\{ 1,\ldots , N\right\} \) and we test the performance of all methods on two signal strength settings with \(s_1 \in \left\{ 2, 2.5\right\} \). Regarding Scenario (S2), where changes in the vector of the first order derivatives are treated, we have \(ss^{(S2)}_i = s_2, \forall i \in \left\{ 1,\ldots , N\right\} \) and we test the performance of MID on two signal strength settings with \(s_2 \in \left\{ 0.3, 0.5 \right\} \). Standard Gaussian noise was added to the signals. In total, we tested the methods in 36 different setups covering a range of different multivariate sequences regarding dimensionality, sparsity, number of change-points, and signal strength.
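As an illustration of this setup, the R sketch below generates one (S1) replication under simplifying assumptions that are ours, not the exact design of the study: the change-points are equally spaced, every change affects the first \(\lceil d \times sp \rceil\) coordinates, and the jump directions alternate in sign.

```r
# Simplified sketch of one (S1) replication: a piecewise-constant d-variate
# signal with N equally spaced change-points, each affecting the first
# ceiling(d * sp) coordinates and scaled so that the Euclidean jump size is s1.
simulate_S1 <- function(T = 1500, d = 30, N = 3, sp = 0.2, s1 = 2) {
  cp <- round(seq_len(N) * T / (N + 1))        # change-point locations
  k  <- ceiling(d * sp)                        # number of affected coordinates
  f  <- matrix(0, nrow = d, ncol = T)
  level <- rep(0, d)
  start <- 1
  for (i in seq_len(N)) {
    f[, start:cp[i]] <- level
    jump <- rep(0, d)
    jump[1:k] <- s1 / sqrt(k)                  # so that the L2 norm of the jump equals s1
    level <- level + (-1)^i * jump             # alternate the jump direction
    start <- cp[i] + 1
  }
  f[, start:T] <- level
  list(X = f + matrix(rnorm(d * T), d, T), f = f, cp = cp)  # standard Gaussian noise added
}
```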

We ran 100 replications for each setup. The frequency distribution of \(\hat{N}-N\) is provided, while the accuracy of the estimated locations is evaluated through the Adjusted Rand Index (ARI) of the estimated segmentation against the true one ( Hubert and Arabie 1985), and the scaled Hausdorff distance,

$$\begin{aligned} d_H = n_s^{-1}\max \bigg \{\max _j\min _k \mid r_j - \hat{r}_k\mid ,\; \max _k\min _j \mid r_j-\hat{r}_k \mid \bigg \}, \end{aligned}$$

where \(n_s\) is the length of the largest segment. The average computational times, in seconds, are also provided. The results, for MID and the competing methods under Scenario (S1) and for the signal strength, \(s_1\), being equal to 2, are given in Tables 2, 3, and 4; the results for \(s_1 = 2.5\) are given in the online supplement. For each simulation setup, the method with the highest empirical frequency of \(\hat{N} - N\) being equal to zero (or close to zero, depending on the example) and those within \(5\%\) of the highest are given in bold. For Scenario (S2) with \(s_2 = 0.5\), the results are presented in Table 5, while for \(s_2 = 0.3\) the results can be found in the online supplement.
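For clarity, a short R sketch of the two accuracy measures is given below; the function names are ours, and the ARI can be obtained from the induced segmentations with, for example, mclust::adjustedRandIndex.

```r
# Sketch of the accuracy measures used in the tables. "true" and "est" are the
# true and estimated change-point locations, T is the series length and n_s the
# length of the largest segment.
scaled_hausdorff <- function(true, est, n_s) {
  if (length(est) == 0 || length(true) == 0) return(NA_real_)
  d1 <- max(sapply(true, function(r) min(abs(r - est))))   # true points to nearest estimate
  d2 <- max(sapply(est,  function(r) min(abs(r - true))))  # estimates to nearest true point
  max(d1, d2) / n_s
}

segment_labels <- function(cp, T) {
  # label every time point by the segment (between consecutive change-points) it falls in
  findInterval(seq_len(T), cp + 1)
}

# ARI of the estimated segmentation against the true one, e.g.:
# mclust::adjustedRandIndex(segment_labels(true_cp, T), segment_labels(est_cp, T))
```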

Table 3 Distribution of \(\hat{N} - N\) over 100 simulated multivariate data sequences with 20 change-points under Scenario (S1) of Sect. 2.2.1. The signal strength, as defined in (18), is equal to 2 for each change-point. The average ARI, \(d_H\), and computational times are also given
Table 4 Distribution of \(\hat{N} - N\) over 100 simulated multivariate data sequences with 50 change-points under Scenario (S1) of Sect. 2.2.1. The signal strength, as defined in (18), is equal to 2 for each change-point. The average ARI, \(d_H\), and computational times are also given

As the tables show, MID performs extremely well in all setups in both (S1) and (S2). More specifically, for (S1) our method is either the best method overall or within \(5\%\) of the best method with respect to the estimated number of change-points in all signals. Furthermore, in all cases, MID attains very high values for the ARI and quite small ones (in most cases it actually attains the smallest value) for the scaled Hausdorff distance; these results show that, apart from being extremely accurate in estimating the correct number of change-points, MID is also very accurate regarding the estimated change-point locations. We highlight that there seems to be a significant advantage in the performance of MID compared to the rest of the methods in cases with a large number of regularly occurring change-points; see, more specifically, Tables 3 and 4. Regarding the permutation-based variants of MID, both \(\textrm{MIDPERL}_2\) and \(\textrm{MIDPERL}_\infty \) exhibit very good behaviour in all scenarios in terms of accuracy with respect to both the estimated number and the estimated locations of the change-points. Although these permutation-based variants do not require specification of the threshold, their computational cost is quite large (see Tables 2, 3, 4).

Regarding the competing methods, INSPECT behaves very accurately in Table 2, which concerns the case of three change-points. Although its behaviour is reasonable across the combinations examined, Tables 3 and 4 show that INSPECT struggles to accurately estimate the change-points when N is relatively high. The performances of DC, KCP_Scree, and SBS are also very good in the cases with three change-points, but these methods tend to underestimate N (SBS most prominently) in the cases with more, regularly occurring change-points. KCP_Grid does not perform well in any of the scenarios tested.

With respect to Scenario (S2) as in Sect. 2.2.2, the results in Table 5 exhibit MID's very strong performance in accurately estimating both the number and the locations of the change-points. In conclusion, taking into account its low computational time, we can deduce that our proposed method is reliable and quick in accurately detecting change-points under multivariate settings that vary with respect to the number of change-points, the sparsity, the dimensionality of the data, the signal strength, and the structure of the changes. R code and instructions to replicate the results in this section are available on https://github.com/apapan08/Simulations-MID.

Table 5 Distribution of \(\hat{N} - N\) over 100 simulated multivariate data sequences under Scenario (S2) of Sect. 2.2.2. We present the results of MID for different levels of sparsity and of dimensionality of the data sequence. The number of change-points is equal to 3, 20, or 50 and the signal strength, as defined in (19), is equal to 0.5. The average ARI, \(d_H\), and computational times are also given

4.2 Spatially dependent data

It has already been shown in the proofs of Theorems 1 and 2 that the covariance matrix, \(\Sigma \), of the noise terms, \(\varvec{\epsilon _t}\), does not need to be diagonal. This means that MID allows for the detection of changes in data structures that exhibit spatial dependence. Apart from this theoretical justification, in this section we investigate the practical performance of MID for data following two spatial dependence structures under both Scenarios (S1) and (S2). In the first case, the covariance matrix is \(\Sigma ^{(1)}_{i,j} = 2^{-|i-j|}\), while in the second one the spatial dependence is much stronger, with the covariance matrix being \(\Sigma ^{(2)} = {\varvec{1}}_d{\varvec{1}}_d^{\intercal }\), where \({\varvec{1}}_d = (1,1,\ldots ,1)^{\intercal } \in \mathbb {R}^{d\times 1}\). With the notation as in Sect. 4.1, we take \(T = 1500, d \in \left\{ 30, 100\right\} , sp \in \left\{ 0.2,0.5\right\} , N \in \left\{ 3, 20 \right\} \), and the signal strength for Scenarios (S1) and (S2) is taken to be equal to 2 and 0.5, respectively, for each one of the change-points. The results over 100 simulations are given in Table 6 for Scenario (S1) and in the online supplement for (S2).
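The two noise structures can be simulated directly; the sketch below is one possible construction (the function name is ours), where \(\Sigma^{(2)}\) is realised by giving every coordinate the same noise term at each time point.

```r
# Sketch of generating the two spatially dependent noise structures considered here:
# Sigma1 with entries 2^{-|i-j|}, and Sigma2, the rank-one matrix of ones (all
# coordinates share a single common noise term at every time point).
spatial_noise <- function(T, d, type = c("Sigma1", "Sigma2")) {
  type <- match.arg(type)
  if (type == "Sigma1") {
    Sigma <- 2^(-abs(outer(1:d, 1:d, "-")))
    L <- chol(Sigma)                      # Sigma = t(L) %*% L
    t(matrix(rnorm(T * d), T, d) %*% L)   # d x T matrix of spatially correlated noise
  } else {
    z <- rnorm(T)
    matrix(rep(z, each = d), nrow = d)    # identical noise across all d coordinates
  }
}
```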

Table 6 indicates the strong practical performance of MID under such spatial dependence structures. More specifically, for the case of 3 change-points, when the covariance matrix is \(\Sigma ^{(1)}\), MID succeeds in returning \(\hat{N} = 3\) in at least 92 out of the 100 replications for all combinations of the dimensionality of the data sequence and the sparsity level employed. For the case where the covariance matrix is \(\Sigma ^{(2)}\), MID exhibits excellent behaviour in all settings apart from the one where \(d = 100\) and \(sp = 0.5\), in which our method estimates the correct number of change-points in 78% of the replications; this is still a good performance level. Moving now to the scenario of 20 change-points, we see in Table 6 that for either \(\Sigma ^{(1)}\) or \(\Sigma ^{(2)}\) and for any combination of d and sp, the number of change-points estimated by MID satisfies \(\hat{N} \in \left\{ 19, 20,21\right\} \) in at least 98 out of the 100 replications. In addition to the outstanding behaviour of MID with respect to the estimated number of change-points, both the Adjusted Rand Index and the Hausdorff distance values indicate the method's accuracy regarding the estimated change-point locations in all settings studied.

Table 6 Distribution of \(\hat{N} - N\) when MID was employed for change-point detection over 100 simulated multivariate data sequences under Scenario (S1), with the multivariate data exhibiting two settings of spatial dependence. We take \(T = 1500, d \in \left\{ 30,100\right\} , sp \in \left\{ 0.2,0.5\right\} , N \in \left\{ 3,20\right\} \) and the signal strength is taken to be equal to 2. The average ARI, \(d_H\), and computational times are also given

4.3 Non-Gaussian noise

In this section, we investigate the practical performance of MID in cases where we deviate from the Gaussianity of the error terms, an assumption made in Theorems 1 and 2 in order to prove the consistency of our method in accurately estimating the true number and the locations of the change-points. The simulation setup of Sect. 4.1 is repeated for MID; however, instead of standard Gaussian noise being added to each one of the d component signals, we now explore the performance of our method under the following two noise structures:

\({\textbf {S}}_{\textbf {Unif}}:\) \(\epsilon _{t,j} \sim ^{\textrm{iid}}\textrm{Unif}(-\sqrt{3}, \sqrt{3})\),

\({\textbf {S}}_{{\textbf {t}}_{\textbf {8}}}\): \(\epsilon _{t,j} \sim ^{\textrm{iid}}\sqrt{6/8}\textrm{t}_8\),

where \(\textrm{Unif}(a,b)\) denotes the continuous Uniform distribution on the interval [a, b], while \(\textrm{t}_v\) is the Student-t distribution with v degrees of freedom. The performance of MID in the case of \(S_{Unif}\) is very good and there are no significant deviations from the performance obtained under Gaussianity. However, preliminary simulations have shown that MID could lead to moderate overestimation of the number of change-points in the case of heavy-tailed distributions, such as the Student-t covered in the \(S_{t_{8}}\) scenario. Therefore, in such cases of heavy-tailed noise, we can take advantage of the Central Limit Theorem and pre-average the data over time in order to obtain a noise structure that is closer to Gaussianity; this is a technique also followed in Section 4.5 of Anastasiou and Fryzlewicz (2022). MID is then applied to the obtained, pre-averaged data in order to estimate the change-points. The results over 100 simulations for Scenario (S1) (changes in the mean structure) are given in Tables 7 and 8 for the cases of uniformly or Student-t distributed error terms, respectively. Similar results for Scenario (S2) (changes in the vector of the first order derivatives) are provided and discussed in the online supplement. From Tables 7 and 8 we can conclude that, apart from relatively minor overestimation in the case of Student-t noise when \(d = 100\) and \(N = 3\), MID maintains its strong practical performance under non-Gaussian noise and across various scenarios involving the dimensionality of the given data, the sparsity level, and the true number of change-points.
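A possible implementation of the pre-averaging step is sketched below (the function name and interface are ours): the data are averaged over non-overlapping blocks of, for instance, 5 observations, as in Table 8, and any detected locations are mapped back to the original time scale.

```r
# Pre-averaging over non-overlapping blocks of "scale" observations; the noise
# in the block means is closer to Gaussian by the Central Limit Theorem.
pre_average <- function(X, scale = 5) {
  T_new <- floor(ncol(X) / scale)              # X: d x T data matrix
  blocks <- sapply(seq_len(T_new), function(b) {
    rowMeans(X[, ((b - 1) * scale + 1):(b * scale), drop = FALSE])
  })
  matrix(blocks, nrow = nrow(X))               # d x T_new matrix of block averages
}

# An estimated location b_hat on the pre-averaged scale corresponds
# (approximately) to scale * b_hat in the original data.
```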

Table 7 Distribution of \(\hat{N} - N\) when MID was employed for change-point detection over 100 simulated multivariate data sequences under Scenario (S1). The signal strength, as defined in (18), is equal to 2 for every change-point, and we present the results for different combinations involving the true number of change-points, the sparsity level, and the dimensionality of the data sequence at hand. The added noise at each component signal follows the \(\textrm{Unif}(- \sqrt{3}, \sqrt{3})\) distribution. The average ARI, \(d_H\), and computational times are also given

5 Real data examples

5.1 UK House Price Index

In this section, the performance of our method is studied on monthly percentage changes in the UK house price index for all property types from January 2000 to January 2023 in twenty London Boroughs. The data are available from http://landregistry.data.gov.uk/app/ukhpi and they were last accessed in March 2023. We have a multivariate data sequence \(\varvec{X_t} = \left( X_{t,1}, \ldots , X_{t,20}\right) ^{\intercal }\), with \(t = 1,\ldots , 276\). The boroughs used are: Barnet (Ba), Bexley (Be), Bromley (Br), Camden (Ca), Croydon (Cr), Ealing (Ea), Enfield (En), Greenwich (Gr), Hackney (Ha), Hammersmith and Fulham (HF), Harrow (Har), Islington (Is), Kensington and Chelsea (KC), Lambeth (La), Merton (Me), Newham (Ne), Richmond upon Thames (RuT), Sutton (Su), Tower Hamlets (TH), and Wandsworth (Wa). Similar data have been investigated in Baranowski et al. (2019), Fryzlewicz (2014), Fryzlewicz (2020), and Anastasiou and Fryzlewicz (2022), however under a univariate setting. Figure 5 presents the results when \(\textrm{MID}_{\textrm{opt}}\) of Sect. 3.4 was employed for the detection of changes under Scenario (S1) as explained in Sect. 2.2; with respect to the threshold constant, we used the optimal values as in Table 1 for \(\alpha = 0.05\).

Table 8 Distribution of \(\hat{N} - N\) when MID was employed for change-point detection over 100 simulated multivariate data sequences under Scenario (S1). The signal strength, as defined in (18), is equal to 2 for every change-point, and we present the results for different combinations involving the true number of change-points, the sparsity level, and the dimensionality of the data sequence at hand. The added noise at each component signal follows the \(\textrm{t}_8\) distribution and before applying MID we pre-average the data at every 5 observations. The average ARI, \(d_H\), and computational times are also given
Fig. 5 The monthly percentage changes in the UK house price index for the twenty London boroughs under consideration. The estimated change-point locations when \(\textrm{MID}_{\textrm{opt}}\) was employed can be seen as red, vertical lines

Our method identifies 27 change-points in the mean structure of the multivariate data sequence at hand. We have analysed the same data using the competing methods of Sect. 4. INSPECT detects 26 change-points at locations very similar to those detected by our method, DC detects one change-point at location 41, while SBS does not detect any change-points. We need to highlight, though, that in INSPECT, multiple change-points are estimated using a wild binary segmentation scheme, which, due to the randomness involved, does not necessarily detect the same change-points when it is employed more than once over the same data set.

5.2 The COVID-19 outbreak

The performance of our method is also investigated on data from the COVID-19 pandemic. In this case, we focus on the detection of changes under Scenario (S2) as described in Sect. 2.2. The data under consideration consist of the daily number of new lab-confirmed COVID-19 cases in the four constituent countries of the United Kingdom: England, Northern Ireland, Scotland, and Wales. The period under investigation is from 01/04/2020 until 30/04/2022; there are no data for Northern Ireland after the 15\(^{\textrm{th}}\) of May, 2022. The data are available from https://coronavirus.data.gov.uk and they were last accessed on the \(7^{\textrm{th}}\) of March 2023. Based on this description, in this example we have a multivariate data sequence \(\varvec{X_t} = \left( X_{t,1}, \ldots , X_{t,4}\right) ^{\intercal }\), with \(t = 1,\ldots , 760\). Since the data are counts, we apply the Anscombe transform, \(a:\mathbb {N} \rightarrow \mathbb {R}\), with \(a(x) = 2\sqrt{x+3/8}\), to each \(X_{t,j}\); this transform brings the distribution of the component data sequences closer to a Gaussian with approximately constant variance.
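For concreteness, and assuming the downloaded case counts are held in a \(760 \times 4\) matrix, here hypothetically named cases, the transform amounts to:

```r
# Anscombe (variance-stabilising) transform applied element-wise to the counts.
anscombe <- function(x) 2 * sqrt(x + 3 / 8)
X <- t(anscombe(cases))   # d x T input for the change-point method; "cases" is hypothetical
```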

Our method has detected 32 change-points in the vector of the first order derivatives (changes in the slope of the component univariate data sequences), which seem to capture the important movements in the data; see Fig. 6.

Fig. 6 The daily number of COVID-19 cases for England, Northern Ireland, Scotland, and Wales. The estimated change-point locations in the trend of each component data sequence when our proposed \(\textrm{MID}_{\textrm{opt}}\) method was employed can be seen as red, vertical lines

6 Further discussion

6.1 Temporal dependence

In this section, we explore how our method can be employed for change-point detection in cases where the data exhibit temporal dependence. For \(d=1\), MID's univariate analogue has already earned independent praise for its robustness to moderate deviations from serial independence; see, for example, Fearnhead and Rigaill (2020), where the Isolate–Detect algorithm is shown through an extensive simulation study to have very strong performance in a number of scenarios where either auto-correlated or heavy-tailed noise is added to the signals. Heuristically, such good behaviour is expected because of the isolation aspect of MID, which ensures that the change-points are detected one by one while they are still in intervals that contain no other change-points.

Practical ways to enhance MID's performance in cases of very high serial correlation in the data are the subsampling and pre-averaging techniques, which have already been described in detail in Section 5.1 of Anastasiou et al. (2022); here, we only give a brief outline of the steps involved. Starting with subsampling, the strategy is to choose a positive integer s and subsample every s data points from our original multivariate data sequence \(\varvec{X_t}\); this creates s mutually exclusive (sub-sampled) data sequences, in which the autocorrelation is expected to be lower than in the original data. The next step is to apply MID to each one of the s created data sequences. This gives back s different sets of estimated change-points. A majority voting rule is applied to these sets and we keep only those estimated values that appear at least a pre-decided number of times, \(\eta \), with \(\eta \le s\). Once the estimated change-point values based on this majority voting rule are extracted, they are transformed to represent the change-point locations with respect to the original data sequence; a sketch of this subsampling strategy is given after this paragraph. Regarding pre-averaging, which we have already applied in Sect. 4 in order to deal with heavy-tailed noise structures, the data are uniformly averaged over pre-specified, short time periods; this can significantly reduce autocorrelation. After the aforementioned pre-processing, MID is applied to the new, pre-averaged data.
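The following R sketch illustrates the subsampling-and-voting idea under simplifying choices of our own: detect stands for a call to the change-point detector on a \(d \times T\) matrix, and the tolerance tol used to match estimates across sub-sampled sequences is our assumption rather than part of the procedure in Anastasiou et al. (2022).

```r
# Subsampling with majority voting: split the series into s interleaved
# subsequences, detect in each, map locations back to the original time scale
# and keep those supported by at least eta of the s subsequences.
subsample_vote <- function(X, detect, s = 3, eta = 2, tol = 10) {
  all_cps <- unlist(lapply(seq_len(s), function(j) {
    idx <- seq(j, ncol(X), by = s)                # every s-th time point, starting at j
    cps <- detect(X[, idx, drop = FALSE])
    idx[cps]                                      # locations on the original time scale
  }))
  all_cps <- sort(all_cps)
  out <- c()
  while (length(all_cps) > 0) {
    grp <- all_cps[abs(all_cps - all_cps[1]) <= tol]   # estimates within tol of each other
    if (length(grp) >= eta) out <- c(out, round(mean(grp)))
    all_cps <- all_cps[abs(all_cps - all_cps[1]) > tol]
  }
  out
}
```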

6.2 Concluding reflections

In this paper, the MID methodology has been proposed for multiple generalized change-point detection in multivariate, possibly high-dimensional, data sequences, which could exhibit spatial dependence. Mean-dominant norms (see (2)) are employed for the aggregation of the information across the different components of the multivariate data sequence. The aggregated values of the relevant contrast function (depending on the structure of the changes) are collected and compared to a threshold value, \(\zeta _{T,d}\). The rate of \(\zeta _{T,d}\) with respect to both the length, T, of the data sequence and its dimensionality, d, has been proven to be \({\mathcal {O}}\left( \sqrt{\log (Td^{1/4})}\right) \). The optimal multiplicative threshold constant, C, such that \(\zeta _{T,d} = C\sqrt{\log (Td^{1/4})}\), has been derived through a large scale simulation study, with special attention to controlling the Type I error rate, \(\alpha \), of falsely detecting change-points for various values of the dimensionality of the data sequence.

Misspecification of the threshold is possible and could lead to misestimation of the underlying signal. To address such issues, in Sect. 3.5, a permutation-based variant of our MID algorithm has been proposed, which requires no threshold choice. The algorithmic steps of MID remain the same; the difference lies in the way the method decides whether to accept or reject a change-point within the interval under consideration, denoted by \(I=[s^*,e^*]\), where \(1 \le s^* < e^* \le T\). In Sect. 2.1, we provide pseudocode for a better understanding of the method, and it has been explained in detail that MID first returns a vector \({\varvec{v}} \in \mathbb {R}^{J}\), where J is the number of change-point candidates. The elements of \({\varvec{v}}\) correspond to the aggregated contrast function values for each candidate point in the interval I. The next step is to store the maximum value of \({\varvec{v}}\) as obtained from the original data and compare it to the appropriate quantile of the empirical distribution of the values obtained when applying the same steps to several permuted versions of the data; see the exact steps in Sect. 3.5. The proposed permutation-based variant of MID, though computationally expensive, performs very well in terms of accuracy with respect to the estimated number and locations of the change-points; see the results in Sect. 4.

The choice of the mean-dominant norm to be employed in the aggregation step of MID has already been discussed in the current paper, more specifically in Sect. 3.4. Our aim has been to provide a method that in practice requires minimal parameter choice from the user; towards this purpose, a data-adaptive variant of the method, named \(\textrm{MID}_{\textrm{opt}}\), has been constructed. We first estimate the sparsity level in the given multivariate data; the steps are explained in detail in Sect. 3.4. Depending on the value of the estimated sparsity, \(\textrm{MID}_{\textrm{opt}}\) estimates the change-points employing either the \(L_{\infty }\) or the \(L_2\) mean-dominant norm as defined in (2).

Through the simulated and real-life data examples presented in Sects. 4 and 5, respectively, it has been shown that MID has very good performance in terms of accuracy and speed. Specifically, in terms of the accurate estimation of the number and the locations of the change-points, MID is in every setup within \(5\%\) of the best-performing method among the state-of-the-art competitors considered. In addition, \(\textrm{MID}_{\textrm{opt}}\) is a very quick detection method which in a few seconds can analyse signals with length in the range of thousands and dimensionality in the range of hundreds; this is carried out automatically, with minimal decision making from the user on the aggregation method to be employed. More details on the computational complexity of our proposed algorithm can be found in Sect. 3.1.

MID has been proven to be a consistent method, with a near-optimal rate, in accurately estimating the true number and the locations of the change-points. The consistency result also holds in the presence of spatial dependence between the component data sequences; the very good practical performance of the proposed algorithm (without any modification) in scenarios of multivariate data sequences that exhibit spatial dependence can be found in Table 6. Regarding temporal dependence, the method of proof for the consistency result requires that the data are independent over time. In the univariate case, the method has already been shown (and has received independent praise on the matter) to be robust in the presence of auto-correlated data. In order to reduce the effect that possibly strong temporal dependence can have on our method's accuracy in practice, in Sect. 6.1, we explain two different approaches that can be followed prior to applying MID to the data; the first is based on a sub-sampling scheme, while the second requires pre-averaging the given data.

No algorithm is perfect, and we now present limitations of MID regarding its practical behaviour. Firstly, the method can be slow for long signals that do not exhibit any changes. The reason behind this behaviour is that, due to its expanding-intervals characteristic, in such scenarios MID keeps testing for change-points in growing, overlapping intervals, which makes the method slower than usual. In these settings, one can eliminate this weakness by splitting the data uniformly into smaller windows and then detecting the change-points separately within each window; a sketch of this workaround is given below. Even though MID has been shown to be quite robust to moderate deviations from Gaussianity (see, for example, Table 7), the method seems to require some light data pre-processing in cases of heavy-tailed noise (see Table 8) or in the presence of temporal dependence.
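A minimal R sketch of the windowing workaround, assuming (as before) that detect denotes a call to the change-point detector on a \(d \times T\) matrix and that the window length w is chosen by the user:

```r
# Split the time axis into windows of length w, detect within each window and
# report the detected locations on the original time scale.
window_detect <- function(X, detect, w = 500) {
  starts <- seq(1, ncol(X), by = w)
  unlist(lapply(starts, function(s) {
    e <- min(s + w - 1, ncol(X))
    s - 1 + detect(X[, s:e, drop = FALSE])   # shift back to the original indexing
  }))
}
```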