1 Introduction

During the last decade, there has been a significant focus on the important challenge of efficient and accurate detection of changes in both univariate and multivariate data sequences (Cho and Fryzlewicz 2014; Fisch et al. 2022; Kovács et al. 2023; Truong et al. 2020; Tveten et al. 2022; Wang and Samworth 2018). More recently, focus has turned to translating the efficiency of such approaches to the online setting, typically motivated by an applied challenge such as how to deal with limited computational power (e.g. Ward et al. 2024). Recent major contributions to the online setting include Adams and MacKay (2007), Tartakovsky et al. (2014), Yu et al. (2023), Chen et al. (2022), and Romano et al. (2023). In this paper we consider a less studied scenario: monitoring edge-behaviour within distributed sensor networks, which are common architectures within the Internet of Things (IoT) framework. The importance of efficiently detecting changes at the edge, whilst minimising communication between sensors and the cloud, is perhaps best appreciated by considering two key applications: detecting cyber-attacks on smart cities (Alrashdi et al. 2019) and optimising the performance of base stations (Wu et al. 2019).

Consider, by way of example, Fig. 1, which shows a schematic representation of real-time monitoring within a distributed network. Here we assume that d data streams are monitored, each by its own sensor. Communication between the sensors and the centre is possible, as shown by the dashed lines. An unusual event happens at time \(\tau \), and we want to detect this event as quickly as possible. However, in modern sensor networks that deploy IoT devices, the computational resources of the sensors can be substantial. Moreover, communication between the sensors and the cloud can be problematic due to the heavy energy usage involved in transmitting data (Varghese et al. 2016; Pinto and Castor 2017). As such, we need algorithms that can identify the times at which it is important for information to be shared with the cloud. More specifically, in this article, we seek to develop a new method to detect changes within such a network in real time, with high statistical power and as little communication and computation as possible.

Fig. 1
figure 1

Schematic representation of a sensor network made up of d sensors, where \(S_i\) is the index for sensor i, \(X_{i,t}\) is the data observed at sensor i, and \(M_{i,t}\) is the message transmitted from sensor i to the centre at time t

Changepoint methods which can be applied in the fully centralised problem, where the data from the sensors are transmitted to the centre (cloud) and processed at every time step, are well studied. Approaches typically seek to calculate the maximum or the sum of all the test statistics (see, e.g., Mei 2010; Xie and Siegmund 2013; Chan 2017; Chen et al. 2022; Gösmann et al. 2022). The rationale behind these methods is to set thresholds and raise the alarm if the aggregated test statistics from multiple streams exceed pre-defined thresholds. Numerical experiments in Mei (2010) indicate that taking the maximum is the optimal method when there are only a few affected data streams, what we will term a sparse change. Conversely, taking the sum is optimal when most data streams are affected, also known as a dense change.

Contributions to this distributed problem include Rago et al. (1996), Veeravalli (2001), Appadwedula et al. (2005), Mei (2005, 2011), Tartakovsky and Kim (2006), and Banerjee and Veeravalli (2015). Among them, two recent papers of particular interest develop communication-efficient schemes for monitoring a large number of data streams (Zhang and Mei 2018; Liu et al. 2019). The key idea is that each sensor computes a local monitoring statistic and then employs a thresholding step, only sending the statistic to the centre if there is some evidence of a change. The information from multiple sensors is then combined at the centre. This approach reduces unnecessary transmission by ignoring streams with little evidence for a change, focusing only on data streams that show signs of change.

Although computationally feasible, these existing approaches assume that the pre- and post-change means are known. In practice, the pre-change mean can be estimated from historical data. However, assuming a known post-change mean is typically unrealistic, with an incorrect value potentially leading to a failure to detect, or poor detection power. Liu et al. (2019) approximate the post-change mean recursively but, as a consequence, sacrifice some of the statistical power of the algorithm.

Our approach builds on recent work developing the moving sum (MOSUM) as a window-based changepoint method (see, e.g., Aue et al. 2012; Kirch and Kamgaing 2015; Kirch and Weber 2018). Specifically, we propose an online communication-efficient changepoint detection algorithm (distributed MOSUM) to detect changes in real time within the distributed network setting. A local threshold is chosen to filter out unimportant information, so that only statistically important test statistics are transmitted to the centre. An alarm is raised when the aggregated test statistic in the central cloud exceeds a pre-defined global threshold. The low time complexity and communication-efficient scheme of our proposed method make it suitable for online monitoring. We also establish that the proposed method can achieve similar statistical power to the idealistic setting, where there is no communication constraint, at detecting large changes, whilst substantially reducing the transmission cost. Moreover, we show how to make the detection performance of distributed MOSUM close to that of the idealised setting by increasing the window size, which only increases the storage cost and adds a little transmission cost.

The key differences between our work and previous distributed changepoint detection contributions (e.g., Liu et al. 2019) are as follows. Firstly, a moving window-based test statistic, the MOSUM, is chosen to avoid requiring knowledge of the post-change mean. Secondly, earlier works have been based on the framework that controls the average run length (ARL), that is, the average amount of time until a change is incorrectly detected. However, such a metric gives a somewhat limited amount of information, since the distribution of the run length is usually unknown. For instance, the ARL would be the same whether most replications stop quickly and a few run for much longer, or all replications terminate at around the same time. Conversely, in this work, we present methods in terms of controlling the error rate under the null at a specific level, and with asymptotic power 1 under alternatives. Furthermore, our ideas generalise trivially to methods controlling the average run length.

The structure of this paper is as follows. In Sect. 2, the problem setting is outlined, before introducing the distributed MOSUM methodology in Sect. 3. Several theoretical results for this new approach are given in Sect. 4. Simulation studies are carried out in Sect. 5, before ending with some concluding remarks (Sect. 6).

2 Problem setting

We begin by assuming that we have d sensors and that, at every time point \(t \in \mathbb {N}\), we observe \(\mathbf {X_t}=(X_{1,t},X_{2,t},X_{3,t},\ldots ,X_{d,t})\), where \(X_{i,t}\) is the observation from sensor i. Here \(\mathbf {X_t}\) could be raw data or the residuals after pre-processing the data. These observations are assumed to be identically distributed and independent across series. Such assumptions are common in the problem of detecting changes within a distributed system setting (e.g., Tartakovsky and Veeravalli 2002; Mei 2010; Xie and Siegmund 2013; Liu et al. 2019). We do not strictly assume time independence here, but our method is optimal when this assumption holds. Moreover, the impact of time dependence will be numerically studied in Sect. 5.3.

We assume that at some unknown time, \(\tau \), the distribution of some unknown subset of the d data streams changes. For simplicity, we only consider a change in mean, but the ideas below are easily extended to other changepoint settings. Therefore, in this illustrative change-in-mean setting, the model for the data is expressed as follows:

$$\begin{aligned} X_{i,t} = \mu _i + \delta _{i} \mathbb {1}_{ \{ t > \tau \}} + \epsilon _{i,t}, \quad t \in \mathbb {N}, 1 \le i \le d, \end{aligned}$$
(2.1)

where \(\mu _i\) is the known pre-change mean, \(\delta _i\) is the mean shift, and \(\{\epsilon _{i,t}: t \in \mathbb {N}\}\) are strictly stationary error sequences. After time \(\tau \), the mean of the i-th data stream changes immediately from \(\mu _i\) to \(\mu _i + \delta _i\). Here it is useful to note that our setting also permits some \(\delta _i=0\), which means that only a subset of data streams are affected by the change. Without loss of generality, we assume \(\mu _i=0\). Under the null hypothesis, the model for the data can be rewritten as

$$\begin{aligned} X_{i,t} = \epsilon _{i,t}, \quad t \in \mathbb {N}, 1 \le i \le d. \end{aligned}$$
(2.2)

Moreover, under the alternative hypothesis, the model for \(t > \tau \) is \(X_{i,t} = \delta _i+\epsilon _{i,t}, \ 1 \le i \le d\). Our aim is to monitor such a system and raise the alarm as soon as possible following the event at time \(\tau \). One way of achieving this is to perform hypothesis testing sequentially, i.e., evaluate the null hypothesis of no change in mean at each time point \(t \in \mathbb {N}\). The algorithm will stop and declare a change when we can reject the null hypothesis.

In the classical sequential changepoint detection problem, we evaluate the performance of an algorithm subject to a constraint on its false alarm rate. First, consider an open-ended stopping rule, where we have an infinite time-window of measurements and the algorithm never halts until it detects a change. The false alarm rate can be evaluated in two ways. Assume there is no change, and let \({\widehat{\tau }}\) be the time at which we detect a change, with the convention that \({\widehat{\tau }}=\infty \) if we detect no change. One approach is to control the average run length, \(E^{\infty }({\widehat{\tau }})\), the expected time to a false alarm. This makes sense for procedures with a constant threshold for detection, for which we are certain to detect a change under the null if we monitor for an infinite time period. Alternatively, one can control the false alarm rate, \(P^\infty ({\widehat{\tau }} <\infty )\), the probability of a false alarm. To control this over an infinite time horizon requires increasing the threshold for detecting a change over time. Equivalently, this can be achieved by multiplying the test statistic by a weight function \(w(\cdot )<1\). See Leisch et al. (2000); Zeileis et al. (2005); Horváth et al. (2008); Aue et al. (2012); Kirch and Kamgaing (2015); Weber (2017); Yau et al. (2017); Kirch and Weber (2018); Kengne and Ngongo (2022) for examples of how to choose an appropriate weight function.

In our paper, we focus on controlling the false alarm rate. However, Aue et al. (2012) note that “applying open-ended procedures built from the asymptotic critical values have a tendency to be too conservative in finite samples”. Therefore, our paper considers a closed-end stopping rule. In this approach, the algorithm will stop either upon detecting a change or upon reaching the predefined monitoring time T. We thus control the false alarm rate over a time window of length T. However, the ideas we present can easily be adapted to the open-ended setting, and also to methods which control the average run length.

In the context of the distributed changepoint detection problem, we additionally evaluate the average transmission cost \({\bar{\Delta }}\). This is the average number of transmissions at each time step across the d sensors, and it should be smaller than the pre-specified transmission cost \(\Delta \).

Before introducing our proposed method, we first review relevant work. At time t, a local monitoring statistic \(\mathcal {T}_{i}\) is calculated for the ith stream. All the local statistics \(\mathcal {T}_{i}\) can then be combined into a global monitoring statistic \(\mathcal {T}\) at the fusion centre. There are two common choices of message combination for monitoring changes within a distributed system. The first, the SUM scheme (Mei 2010), declares a change when the sum of all the local monitoring statistics exceeds a pre-defined threshold, that is:

$$\begin{aligned}&{\hat{\tau }}_{\textrm{sum}}(c_{\textrm{Global}})=\inf \left\{ t\ge 1: \mathcal {T} \ge c_{\textrm{Global}} \right\} \\&\quad =\inf \left\{ t\ge 1: \sum _{i=1}^d \mathcal {T}_{i} \ge c_{\textrm{Global}} \right\} , \end{aligned}$$

where \(c_{\textrm{Global}}\) is the global threshold. This way of combining statistics across streams is known to be good if the series are independent and the changes are dense. However, implementing this method on the distributed system requires sending every \(\mathcal {T}_{i}\) to the fusion centre, which is expensive. The sum-shrinkage method (Liu et al. 2019) was proposed to reduce the communication cost by thresholding the test statistics before summing them:

$$\begin{aligned}&{\hat{\tau }}_{\textrm{sum}}(c_{\textrm{Local}}, c_{\textrm{Global}})=\inf \left\{ t\ge 1: \mathcal {T} \ge c_{\textrm{Global}} \right\} \\&\quad =\inf \left\{ t\ge 1: \sum _{i=1}^d \mathcal {T}_{i}\mathbb {I}(\mathcal {T}_{i}\ge c_{\textrm{Local}}) \ge c_{\textrm{Global}} \right\} . \end{aligned}$$

Empirically, the sum-shrinkage method achieves similar performance to the SUM scheme in the dense case and, surprisingly, performs better in the sparse case.

When the change is sparse, it has been shown both theoretically and empirically (Mei 2010; Liu et al. 2019; Chen et al. 2022) that monitoring the maximum of the test statistics across series is best. In such a setting, the MAX procedure (Tartakovsky and Veeravalli 2002) monitors the maximum of the test statistics and raises the alarm when the maximum of the local test statistics exceeds the threshold, that is:

$$\begin{aligned} {\hat{\tau }}_{\textrm{max}}(c_{\textrm{Global}})&=\inf \left\{ t\ge 1: \mathcal {T}\ge c_{\textrm{Global}} \right\} \\ {}&=\inf \left\{ t\ge 1: \max _{ 1\le i\le d} \mathcal {T}_{i} \ge c_{\textrm{Global}} \right\} . \end{aligned}$$

The best choice of scheme depends on the sparsity of the change, that is, on the number of affected data streams p. This can be made precise if we consider an asymptotic setting where \(d\rightarrow \infty \) (Enikeeva and Harchaoui 2019), and define a change to be sparse if the number of affected streams is \(p=o(\sqrt{d})\), and to be dense otherwise. A recent paper (Chen et al. 2022) combines the SUM and MAX procedures to achieve good performance regardless of the sparsity. In the context of distributed monitoring, the MAX procedure is trivially implemented without any communication. Specifically, each sensor has the threshold for the max-statistic and flags a change if its local statistic is above this threshold. Therefore, within this paper, we only focus on developing a communication-efficient version of the SUM scheme. Our aim is a method that performs well for dense changes, but limits the communication cost. We will use the SUM scheme as the ideal method to compare against since it has no restrictions on communication.

3 Distributed change point detection method

Our proposed methodology is summarized in Algorithm 1, and described in detail below. The method essentially comprises three steps. The first step involves the parallel local monitoring of each data stream by the sensors. As the monitoring unfolds, messages are occasionally sent from the sensors to the centre to indicate the presence of a potential change. Finally, at the centre these messages are aggregated to find changes that occur across a number of data streams.

Algorithm 1
figure a

Centralized and distributed MOSUM

3.1 Local monitoring

3.1.1 Estimating the baseline parameters

Our sequential testing approach requires a historic data set of length m to estimate the baseline parameters. Theoretical results are obtained later in the paper when \(m \rightarrow \infty \). The parameters of interest are the mean of each data stream \(\mu _i\) and the variance of the errors \(\sigma _i^2\). For the ith data stream these estimates are,

$$\begin{aligned} \begin{aligned} \hat{\mu }_i&= \frac{1}{m} \sum _{t=1}^{m} X_{i,t}, \\ \hat{\sigma }_i^2&= \frac{1}{m} \sum _{t=1}^{m} \left( X_{i,t} - \hat{\mu }_i \right) ^2. \end{aligned} \end{aligned}$$
(3.1)

If the errors cannot be assumed to be independent, we can instead estimate the long-run variance. This requires specifying a kernel function \(K(\cdot )\):

$$\begin{aligned}&\hat{\sigma }_i^2 = \frac{1}{m} \sum _{t=1}^{m} \left( X_{i,t} - \hat{\mu }_i \right) ^2 + 2\sum _{j=1}^{m-1} K\left( \frac{j}{l} \right) \hat{\gamma }_{j}^{(i)}, \end{aligned}$$
(3.2)
$$\begin{aligned}&\text {where } \hat{\gamma }_{j}^{(i)} = \frac{1}{m - j} \sum _{t=1}^{m-j} \left( X_{i,t} - \hat{\mu }_i \right) \left( X_{i,t+j} - \hat{\mu }_i \right) . \end{aligned}$$
(3.3)

In this setting, the kernel function can be seen as a weighting function for the sample covariances \(\hat{\gamma }_{j}^{(i)}\). The kernel function must be symmetric and satisfy \(K(0)=1\). Various kernel functions have been proposed; standard choices include the truncated (White and Domowitz 1984), Bartlett (Newey and West 1986) and Parzen (Gallant 2009) kernels, amongst others. The Bartlett kernel is frequently used in econometrics and takes the form:

$$\begin{aligned} K_{\text {Bartlett}}\left( \frac{j}{l}\right) = {\left\{ \begin{array}{ll} 1-\frac{j}{l}, &{} \hbox { for}\ 0 \le j \le l-1,\\ 0, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

For more details, see Horváth and Hušková (2012); Kiefer and Vogelsang (2002a, 2002b).
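To make the estimation step concrete, the following minimal sketch (in Python, with our own function name and an illustrative bandwidth default) computes the estimates of Eqs. (3.1)-(3.3) for a single stream, using the Bartlett kernel when a long-run variance is required.

```python
import numpy as np

def baseline_estimates(x_hist, l=None, bartlett=False):
    """Estimate the baseline mean and (long-run) variance from a historic sample
    x_hist of length m for one stream, cf. Eqs. (3.1)-(3.3)."""
    m = len(x_hist)
    mu_hat = x_hist.mean()
    centred = x_hist - mu_hat
    sigma2_hat = np.mean(centred ** 2)                   # Eq. (3.1)
    if bartlett:
        if l is None:
            l = int(np.floor(m ** (1 / 3)))              # illustrative bandwidth choice
        for j in range(1, l):                            # Bartlett weights vanish for j >= l
            gamma_j = np.sum(centred[:m - j] * centred[j:]) / (m - j)   # Eq. (3.3)
            sigma2_hat += 2 * (1 - j / l) * gamma_j      # Eq. (3.2) with the Bartlett kernel
    return mu_hat, sigma2_hat
```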

3.1.2 Starting local monitoring

Once the baseline parameters have been estimated, data \(X_{i,m+1}, X_{i,m+2}, \hdots \) are observed sequentially from time \(m + 1\) and monitored for a change. This is achieved using a MOSUM statistic which, at monitoring time k, takes a window containing the most recent h observations:

$$\begin{aligned} \mathcal {T}_i(m,k,h) = \frac{1}{\hat{\sigma }_i} \left| \sum _{t = m + k - h + 1}^{m+k}\left( X_{i,t} - \hat{\mu }_i \right) \right| . \end{aligned}$$
(3.4)

Following Aue et al. (2012), the MOSUM statistic declares a change at time k when the weighted local MOSUM statistic \( w(k,h)\mathcal {T}_i(m,k,h)\) exceeds a pre-defined threshold. A weight function \(w(\cdot ,\cdot )\) is introduced to control the asymptotic size of the detection procedure. Typically \(w(\cdot ,\cdot )\) depends on the monitoring time k and the window size h,

$$\begin{aligned} w(k,h) = \frac{1}{\sqrt{h}} \rho \left( \frac{k}{h} \right) , \end{aligned}$$
(3.5)

for some appropriate \(\rho (\cdot )\). The choice of the weight function controls the sensitivity of the test. A wide range of weight functions can be used as long as they are continuous functions that satisfy \(\inf _{0\le t \le T} \rho (t)> 0\). In this paper, we use the weight function proposed in Leisch et al. (2000) and Zeileis et al. (2005):

$$\begin{aligned} \rho (t) = \max ( 1, \log \left( 1 + t \right) )^{-1/2}. \end{aligned}$$

Intuitively, if there is no change the weighted MOSUM will remain small, but it will be large if there is a change. Figure 2 shows the behaviour of the weighted MOSUM statistic under the null and the alternative hypotheses for one data stream.
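For illustration, a minimal sketch of Eqs. (3.4)-(3.5) for a single stream is given below; the moving sum is updated recursively so each monitoring step costs O(1). The function and variable names are ours, and the historic stretch of the stream is assumed to be stored in the same array as the monitored observations.

```python
import numpy as np

def rho(t):
    """Weight function of Leisch et al. (2000): rho(t) = max(1, log(1 + t))^(-1/2)."""
    return max(1.0, np.log(1.0 + t)) ** (-0.5)

def weighted_local_mosum(x, m, h, mu_hat, sigma_hat):
    """Return w(k,h) * T_i(m,k,h) for k = 1, ..., len(x) - m (Eqs. (3.4)-(3.5));
    x is a 1-d array holding the whole stream, with x[:m] the historic data."""
    n = len(x)
    stats = np.empty(n - m)
    moving_sum = np.sum(x[m + 1 - h: m + 1] - mu_hat)    # window for k = 1
    for k in range(1, n - m + 1):
        t_ik = abs(moving_sum) / sigma_hat               # Eq. (3.4)
        stats[k - 1] = rho(k / h) / np.sqrt(h) * t_ik    # weighted by w(k,h), Eq. (3.5)
        if m + k < n:                                    # slide the window one step forward
            moving_sum += (x[m + k] - mu_hat) - (x[m + k - h] - mu_hat)
    return stats
```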

Fig. 2
figure 2

Example time series with no change (a) and a single change (b) in the top row. The bottom row shows the weighted MOSUM statistic with a historic period of length \(m = 100\) and a window size of \(h = 50\)

3.2 Message passing

The local monitoring described in the previous section is applied to each sensor independently. In order to make global decisions about the state of the system, messages from the sensors must be passed to the central hub (see Fig. 1). However, since there are constraints on communication in the system, the message passing process must be carefully designed.

At time \(t = m + k\), where m is the length of the historic period and k is the monitoring time, each sensor makes a decision as to whether or not to transmit a message to the centre. The vector of messages is denoted \(\textbf{M}_t = (M_{1,t}, M_{2,t}, \hdots , M_{d,t})\). We consider two different messaging regimes:

  • Centralized messaging regime: \(M_{i,t} = \mathcal {T}_{i}(m,k,h)\).

  • Distributed messaging regime:

    $$\begin{aligned} M_{i,t} = {\left\{ \begin{array}{ll} \mathcal {T}_{i}(m,k,h) &{}\quad \text {if } w(k,h)\mathcal {T}_{i}(m,k,h) > c_{\text {Local}}, \\ NULL &{}\quad \text {otherwise.} \\ \end{array}\right. } \end{aligned}$$
    (3.6)

The centralized messaging regime is one where there is no constraint on the communication between the sensors and the centre, so all sensors send a message to the centre at each time instant. This is similar to the “SUM” scheme changepoint detection method proposed by Mei (2010). However, when communication is expensive, a “distributed” messaging regime can be used, where each sensor only sends its local monitoring statistic if it exceeds a chosen threshold. A NULL value means that no message is sent. The threshold \(c_{\text {Local}}\) can be chosen to control the fraction of transmitting sensors when there is no change. It is worth noting that when \(c_{\text {Local}}=0\), the “distributed” messaging regime is equivalent to the “centralized” messaging regime.
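As a small illustration of Eq. (3.6), a sensor's transmission decision at one time step could be sketched as follows, with Python's None standing in for NULL; all names are ours.

```python
def message(stat_raw, stat_weighted, c_local, distributed=True):
    """Decide what sensor i transmits at the current time step (Eq. (3.6)).
    stat_raw is T_i(m,k,h) and stat_weighted is w(k,h) * T_i(m,k,h)."""
    if not distributed:            # centralized regime: always transmit the statistic
        return stat_raw
    if stat_weighted > c_local:    # distributed regime: transmit only strong evidence
        return stat_raw
    return None                    # NULL: nothing is sent to the centre
```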

3.3 Global monitoring

In our paper, we assume that there is no communication delay between the sensors and the central hub, so a message sent at time t is received by the centre immediately. Based on the messages received, the centre makes the decision as to whether or not to flag a change.

3.3.1 Combining messages

Depending on different messaging regimes, the global MOSUM statistics are constructed as follows:

  • Centralized global MOSUM statistic:

    $$\begin{aligned} \mathcal {T}(m,k,h) = \sqrt{ \sum _{i=1}^{d} M_{i,t}^2 }, \end{aligned}$$
    (3.7)

    This is similar to the SUM scheme mentioned in Sect. 2, and Eq. (3.7) serves as the idealistic benchmark under a dense change.

  • Distributed global MOSUM statistic:

    $$\begin{aligned} \mathcal {T}(m,k,h) = \sqrt{ \sum _{i=1}^{d} M_{i,t}^2 \mathbb {1}_{ w(k,h)\mathcal {T}_i(m,k,h) > c_{\text {Local}}} }, \end{aligned}$$
    (3.8)

    where NULL values in Formula 3.6 are taken to be zeros in the sum. The form of Eq. (3.8) is taken from the multivariate MOSUM (Kirch and Kamgaing 2015; Weber 2017; Kirch and Weber 2018).

3.3.2 Declaring the change

Similar to the local monitoring procedure, a change is declared as soon as the weighted global MOSUM exceeds a threshold. A closed-end stopping rule can be used when the aim is to monitor changes within a fixed time. This can be formalised as

$$\begin{aligned} \tau _{m,\tilde{T}} {=} \min \left\{ 1 \le k \le \lfloor m\tilde{T} \rfloor : w(k,h)\mathcal {T}(m,k,h) {>} c_{\text {Global}} \right\} , \end{aligned}$$
(3.9)

where \(\min \{ \emptyset \} = \infty \) and the total length of the data \(T=m\tilde{T}\). If no change is detected by this stopping rule prior to \(\lfloor m\tilde{T} \rfloor \), the monitoring procedure is terminated. The parameter \(\tilde{T} > 0\) governing the length of the monitoring period is chosen in advance (Horváth et al. 2008; Aue et al. 2012).
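A sketch of the centre-side logic is given below, assuming the messages of Eq. (3.6) arrive as one list per monitoring time; Eq. (3.8) is evaluated on the received messages and the closed-end rule (3.9) stops at the first exceedance. Names and the interface are illustrative.

```python
import numpy as np

def global_statistic(messages):
    """Distributed global MOSUM statistic (Eq. (3.8)); NULL messages count as zero."""
    return np.sqrt(sum(m_i ** 2 for m_i in messages if m_i is not None))

def monitor_centre(message_stream, h, c_global, T_tilde, m):
    """Closed-end stopping rule (Eq. (3.9)): message_stream yields, for each
    monitoring time k = 1, 2, ..., the list of messages received from the d sensors.
    Returns the detection time k, or None if no change is declared by floor(m * T_tilde)."""
    for k, messages in enumerate(message_stream, start=1):
        if k > int(np.floor(m * T_tilde)):
            return None                                  # monitoring period over, no alarm
        w_k = max(1.0, np.log(1.0 + k / h)) ** (-0.5) / np.sqrt(h)   # w(k,h), Eq. (3.5)
        if w_k * global_statistic(messages) > c_global:
            return k                                     # alarm raised at time m + k
    return None
```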

Figure 3 shows the weighted global MOSUM statistic for the distributed and centralized messaging regimes on the same dataset. Whenever the weighted global MOSUM of the distributed regime hits zero, there is no communication between the sensors and the centre at that time.

Fig. 3
figure 3

Example of the weighted global MOSUM statistic for the distributed (red dashed line) and centralized (black line) regime. The result is obtained with \(T=1000, d=100, m=100\), \(h=50, \delta =0.5\) and the number of affected sensors \(p=50\). A value of \(c_{\text {Local}} = 3.44\) was used in the distributed regime. (Color figure online)

In the next section, we will show the theoretical properties of our proposed method under \(H_0\) and \(H_A\).

4 Theoretical properties for distributed MOSUM

This section considers the theoretical properties of the closed-end stopping rule, \(\tau _{m,\tilde{T}}\) defined in Eq. (3.9) as \(m \rightarrow \infty \). Firstly, in Sect. 4.1 we find the limiting distribution under the null hypothesis for the different procedures. Then, appropriate choices for the thresholds, \(c_{\text {Local}}\) and \(c_{\text {Global}}\) are given in Sect. 4.2 using these results. Finally, in Sect. 4.3 we prove that the detection procedures we have studied are consistent under alternatives.

Three key assumptions are made in order to derive asymptotic results; these are the same as in Horváth et al. (2008), Aue et al. (2012), and Weber (2017):

Assumption 1

(Clean historic data) \(h \rightarrow \infty \) as \(m \rightarrow \infty \) and the location of the changepoint \(\tau > m\) for \(1 \le i \le d\).

This assumption guarantees that we can obtain good estimators from the training dataset, and it is easily satisfied in real applications.

Assumption 2

(Asymptotic regime) \(h \rightarrow \infty \) as \(m \rightarrow \infty \) and

$$\begin{aligned} \lim _{m \rightarrow \infty } \frac{h}{m} \rightarrow \beta \in (0,1]. \end{aligned}$$

This assumption quantifies the long run connection between the length of the historical period m and the window size \(h:= h(m)\).

Assumption 3

(FCLT on errors)

$$\begin{aligned} \lim _{m \rightarrow \infty } \frac{1}{\sqrt{m}} S_i(mt) \overset{\mathcal {D}}{\longrightarrow } \sigma _i W_{i}(t) \end{aligned}$$

where \(\sigma _i > 0\) and \(\{ W_{i}(t), 0 \le t < \infty \}\) is a standard Brownian motion, with \(S_i(x) = \sum _{t = 1}^{\lfloor x \rfloor } \epsilon _{i,t}\). The parameter \(\sigma _i\) can be estimated by \({\hat{\sigma }}_i\), which satisfies \({\hat{\sigma }}_i \overset{\mathcal {P}}{\longrightarrow } \sigma _i\) as \(m \rightarrow \infty \).

This assumption is a functional central limit theorem on the errors, \(\epsilon \), in the model for the data (2.1).

4.1 Asymptotics under the null

In this section, we give the asymptotic theory for our proposed method, which guides the choice of thresholds.

The local monitoring process of our proposed method within each sensor is the same as the univariate MOSUM detection process. Thus, Theorem 1 and Corollary 1 for the local MOSUM follow directly from Horváth et al. (2008), Aue et al. (2012) and Weber (2017). For simplicity, we denote

$$\begin{aligned}&Z_i(t) = \left| W_{i}\left( \frac{1}{\beta } + t \right) - W_i\left( \frac{1}{\beta } + t - 1 \right) \right. \nonumber \\&\quad \left. - \beta W_{i}\left( \frac{1}{\beta }\right) \right| , \quad 1 \le i \le d \end{aligned}$$
(4.1)

where \(\{ W_{i}(t), 0 \le t < \infty \}\) are independent standard Brownian motions.

Theorem 1

(Local MOSUM) If Assumptions 1–3 and model (2.2) hold, then under \(H_0\), with \(k = ht\) for any \(t > 0\),

$$\begin{aligned} \lim _{m \rightarrow \infty } w(k,h)\mathcal {T}_i(m,k,h) \overset{\mathcal {D}}{\longrightarrow } \rho (t)Z_i(t). \end{aligned}$$

Corollary 1

(Local MOSUM - asymptotic type-I error) Under \(H_0\), for any \(\tilde{T} > 0\) and ith data stream,

$$\begin{aligned} \lim _{m \rightarrow \infty } P \left( \tau _{m,\tilde{T}}^{(i)} < \infty \right) = P \left( \sup _{0 \le t \le \tilde{T}/\beta } \rho (t)Z_{i}(t) > c_{\text {Local}} \right) . \end{aligned}$$

Thus, \(c_{\text {Local}}\) can be chosen so that the false alarm rate for one data stream is asymptotically equal to a pre-specified type-I error in (0, 1).

Following the results for the local MOSUM, similar results for the global MOSUM are readily obtained. These can be used to choose thresholds given a pre-defined type-I error. Below we obtain two limiting distributions, for the centralized and distributed regimes of Sect. 3 respectively.

Theorem 2

(Global MOSUM) Let \(k = ht\) for any \(t > 0\), then under \(H_0\),

$$\begin{aligned}&\lim _{m \rightarrow \infty } w(k,h)\mathcal {T}(m,k,h)\\&\quad \overset{\mathcal {D}}{\longrightarrow } \rho (t) {\left\{ \begin{array}{ll} \sqrt{ \sum _{i=1}^{d} Z_{i}(t)^2 } &{}\quad \text {centralized case,} \\ \sqrt{ \sum _{i=1}^{d} Z_i(t)^2\mathbb {1}_{ \rho (t)Z_{i}(t) > c_{\text {Local}} }} &{}\quad \text {distributed case}. \end{array}\right. } \end{aligned}$$

Proof

See Appendix A.1. \(\square \)

Thus, the limiting distribution is a functional of Gaussian processes. Using Theorem 2, the following may be obtained:

Corollary 2

(Global MOSUM—asymptotic type-I error) Under \(H_0\), for any \(\tilde{T} > 0\),

$$\begin{aligned}&\lim _{m \rightarrow \infty } P \left( \tau _{m,\tilde{T}} < \infty \right) = {\left\{ \begin{array}{ll} P\left( \sup _{0 \le t \le \tilde{T}/\beta } \rho (t) \sqrt{ \sum _{i=1}^{d} Z_{i}(t)^2 }> c_{\text {Global}} \right) &{}\quad \text {centralized case,} \\ P\left( \sup _{0 \le t \le \tilde{T}/\beta } \rho (t) \sqrt{ \sum _{i=1}^{d} Z_i(t)^2\mathbb {1}_{\rho (t)Z_{i}(t)> c_{\text {Local}} }} > c_{\text {Global}} \right) &{}\quad \text {distributed case}. \end{array}\right. } \end{aligned}$$

This result allows us to find local and global thresholds that achieve a pre-defined type-I error.

4.2 Obtaining critical values

Using the results of the previous section, appropriate critical values can be found such that the asymptotic type-I error is controlled for the different procedures. To achieve this the stochastic processes \(\{ Z_{i}(t), 0 \le t \le \tilde{T}/\beta , 1 \le i \le d\}\) need to be approximated on a fine grid. This is done in the same way as Aue et al. (2012), simulating the component standard Brownian motions using ten thousand i.i.d. standard normal random variables. The parameters used were \(\beta = 1/2\) and \(\tilde{T} = 10\). Tables 1 and 2 give critical values for \(\alpha \in \{0.10, 0.05, 0.01\}\).
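This Monte Carlo recipe can be implemented directly. The sketch below (our own code, with illustrative defaults) approximates \(Z_i(t)\) of Eq. (4.1) on a grid by simulating the Brownian motions from i.i.d. normal increments, and returns the \((1-\alpha )\) quantile of the supremum appearing in Corollary 2 as the global threshold.

```python
import numpy as np

def critical_value(d, alpha=0.05, beta=0.5, T_tilde=10.0, c_local=None,
                   n_grid=10_000, n_rep=5_000, seed=1):
    """Monte Carlo approximation of c_Global from Corollary 2; if c_local is None
    the centralized limit is used, otherwise the distributed one.  A sketch only."""
    rng = np.random.default_rng(seed)
    horizon = 1.0 / beta + T_tilde / beta                # W_i is needed on [0, 1/beta + T_tilde/beta]
    dt = horizon / n_grid
    t_grid = np.arange(0.0, T_tilde / beta + dt / 2, dt)
    rho_t = np.maximum(1.0, np.log(1.0 + t_grid)) ** (-0.5)
    to_idx = lambda s: np.minimum(np.rint(s / dt).astype(int), n_grid)
    sups = np.empty(n_rep)
    for r in range(n_rep):
        incr = rng.standard_normal((d, n_grid)) * np.sqrt(dt)
        W = np.hstack([np.zeros((d, 1)), np.cumsum(incr, axis=1)])    # W_i on the grid
        Z = np.abs(W[:, to_idx(1.0 / beta + t_grid)]
                   - W[:, to_idx(1.0 / beta + t_grid - 1.0)]
                   - beta * W[:, to_idx(1.0 / beta)][:, None])        # Eq. (4.1)
        if c_local is None:
            G = rho_t * np.sqrt((Z ** 2).sum(axis=0))                 # centralized limit
        else:
            G = rho_t * np.sqrt((Z ** 2 * (rho_t * Z > c_local)).sum(axis=0))  # distributed limit
        sups[r] = G.max()
    return np.quantile(sups, 1.0 - alpha)
```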

Table 1 Critical values for the centralized procedures, results averaged over five thousand replications
Table 2 Critical values for the distributed procedure with different values for \(c_{\text {Local}}\), results averaged over five thousand replications

Since the critical values obtained above are valid asymptotically (in m), an important question to consider is how they perform in finite samples. Numerical results on the empirical size in finite samples are shown in Table 3. These indicate that the implementation in the finite sample setting can be conservative, as per Aue et al. (2012). However, approximately, the type-I error is controlled at the correct level for both of the global procedures in finite samples.

Table 3 Empirical size, results averaged over one thousand replications with \(\alpha =0.05, \tilde{T}=10\), and \(\beta =1/2\)

4.2.1 The choice of local threshold \(c_{\text {Local}}\)

The values for \(c_{\text {Local}}\) used in Table 2 are somewhat arbitrary. The main role of the local threshold is to control the proportion of messages that the system passes (on average) per iteration. For d streams, the expected number of sensors passing a message at each time step is given by the following result:

Corollary 3

(Transmission cost) For any \(t>0\) and \(k=ht\), the expected number of transmitting sensors at each time step is

$$\begin{aligned} {\bar{\Delta }}_t=d P\left( \rho (t)|Z| >c_{\text {Local}}\right) . \end{aligned}$$

where Z follows a standard normal distribution.

Therefore, the local threshold can be chosen to satisfy the restriction on the transmission cost. Combined with the pre-defined type-I error, the global threshold is then obtained from Theorem 2.
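To make Corollary 3 operational, the sketch below (function names are ours) evaluates the expected number of transmitting sensors for a candidate \(c_{\text {Local}}\) and inverts this relationship numerically against a transmission budget; since \(\rho (t)\le 1\), evaluating at a small t gives a conservative bound.

```python
import math
import numpy as np

def expected_transmissions(c_local, t, d):
    """Expected number of transmitting sensors at monitoring point t under the null
    (Corollary 3): d * P(rho(t) |Z| > c_Local) with Z ~ N(0,1)."""
    rho_t = max(1.0, math.log(1.0 + t)) ** (-0.5)
    return d * math.erfc(c_local / (rho_t * math.sqrt(2.0)))   # P(|Z| > x) = erfc(x / sqrt(2))

def smallest_threshold(budget, t, d, grid=np.linspace(0.0, 6.0, 601)):
    """Illustrative budget check: smallest c_Local on a grid whose expected
    per-step transmission count does not exceed the budget Delta."""
    for c in grid:
        if expected_transmissions(c, t, d) <= budget:
            return float(c)
    return float(grid[-1])
```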

4.3 Asymptotics under the alternative

Under the alternative it is assumed that there is a changepoint at monitoring time \(k^{*}\) and a subset \(\mathcal {S}\) of the data streams have an altered mean

$$ \begin{aligned} H_A: \tau = m + k^{*} \quad \& \quad \exists \mathcal {S} \subset \{1, 2, \hdots , d\}: \delta _i \ne 0 \quad \text {for } i \in \mathcal {S}. \end{aligned}$$

Deriving sharp asymptotic results on the detection delay of the proposed method is challenging, and thus we focus only on giving consistency results. A procedure is consistent if it stops in finite time with probability approaching one as \(m \rightarrow \infty \). In other words, the test statistic should tend to infinity as \(m \rightarrow \infty \). In the asymptotic regime of interest, we additionally assume that the changepoint \(k^{*}\) grows at the same order as h, that is \(\frac{k^*}{h} \rightarrow \gamma \ge 0\), and that the size of the change \(\delta _{i,t}\) satisfies \(\sqrt{h}|\delta _{i,t}|\rightarrow \infty \) as \(m \rightarrow \infty \) and \(h \rightarrow \infty \). These assumptions are the same as in Aue et al. (2012).

Theorem 3

(Global MOSUM: Consistency) Suppose the assumptions above hold and that, under \(H_A\),

  1. (i)

    the changepoint \(k^{*} \le \lfloor h \nu \rfloor \) for some \(0< \nu < \tilde{T}\frac{m}{h}\),

  2. (ii)

    there exists a constant \(c > 0\) such that \(\rho (x+1) \ge c\) for all \(x \in (\nu , \tilde{T}\frac{m}{h} - 1)\).

Then, as \(m \rightarrow \infty \) and \(h \rightarrow \infty \)

$$\begin{aligned} \max _{1 \le k \le \lfloor m\tilde{T} \rfloor } w(k,h)\mathcal {T}(m,k,h) \overset{\mathcal {P}}{\longrightarrow } \infty . \end{aligned}$$

Proof

See Appendix A.2. \(\square \)

Thus, our proposed method is consistent.

5 Simulations

In this section, we present the numerical performance of our algorithm. Since the SUM procedure is optimal when the change is dense, we evaluate performance in the dense case, specifically when the number of affected data streams is \(p=d\). Firstly, the different practical choices of thresholds at a fixed type-I error are investigated, and the performance of our proposed method is compared against the idealistic setting. Finally, the effect of the parameters and of the violation of the independence assumption is investigated.

The set-up of the simulations is as follows. For simplicity, the data-generating process under the null is \(X_{i,t} \sim N(0,1)\) for \(1 \le i \le d\) and \(1 \le t \le T\). For a fair comparison, the type-I error of all procedures is controlled at 0.05 under the null.

The family of alternatives considered is that

$$\begin{aligned}&X_{i,t} \sim N(0,1)\quad \text {for} \quad 1 \le i \le d, 1 \le t < \tau \quad \text {and} \\&\quad X_{i,t} \sim N(\delta _i,1) \quad \text {for} \quad 1 \le i \le d, \tau \le t \le T. \end{aligned}$$

We assume the change affects all the sensors instantaneously, but the size of the change is unknown. We consider two mean-shift scenarios: (1) same size, \(\delta _i=\delta \) for some constant \(\delta \) and all \(1 \le i \le d \); (2) random size, \(\delta _i=\eta N(0,1)\), where \(\eta \) is a scale factor controlling the magnitude of the shift. The average detection delay (ADD) \({\bar{D}}\) and the average communication cost \(\bar{\Delta }\) are then measured:

$$\begin{aligned} {\bar{D}}&= E({\hat{\tau }} -\tau |{\hat{\tau }}>\tau )\\ \bar{\Delta }&=\sum _{t=m+1}^{\hat{\tau }}\frac{\sum _{i=1}^d\mathbb {1}(w(k,h)\mathcal {T}_i(m,k,h)>c_{\text {Local}})}{\hat{\tau }-m -1}. \end{aligned}$$
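For concreteness, the following sketch (with illustrative names and defaults of our own) generates one replicate of the simulated data under this alternative; the detection time returned by either messaging regime can then be compared with \(\tau \) to record the detection delay.

```python
import numpy as np

def simulate_streams(d, T, tau, delta=None, eta=None, seed=0):
    """One replicate of the simulation design: d independent N(0,1) streams of
    length T, with a mean shift from time tau + 1 onwards (cf. model (2.1)).
    'Same size': pass a constant delta.  'Random size': pass a scale factor eta."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d, T))
    shift = np.full(d, float(delta)) if delta is not None else eta * rng.standard_normal(d)
    X[:, tau:] += shift[:, None]        # columns tau, ..., T-1 hold times tau+1, ..., T
    return X

# A detection at time tau_hat > tau contributes tau_hat - tau to the average detection
# delay; replicates with tau_hat <= tau are treated as false alarms and discarded.
```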

5.1 The numerical dependency on local thresholds

Our proposed method requires specifying two thresholds. Usually, \(c_{\text {Global}}\) can be obtained from Theorem 2 once \(\alpha \) and \(c_{\text {Local}}\) are fixed. Therefore, it is crucial to pick an appropriate local threshold. This section gives numerical results for different values of the local threshold, which may provide some guidance for its choice.

Fig. 4
figure 4

The average number of messages transmitted to the centre (top) and average detection delay across varying mean shifts (bottom). Results are obtained when \(m = 200\), \(h = 100\), \(T=10{,}000, \tau = 5000, \alpha =0.05\). Each line corresponds to a different local threshold, which is labelled on the top right. The colour changes from orange to blue as the local thresholds increase from 0 to 5.2. When the local threshold is 5.2, the global threshold will be 0. So all possible combinations of thresholds are covered. (Color figure online)

Figure 4 gives the average detection delay and transmission cost for different values of the local threshold. There is a trade-off between communication savings and detection performance when choosing the local threshold. Larger local thresholds reduce the transmission cost but also lead to longer delays, especially when the change is small. However, as the mean shift increases, the detection performance with larger thresholds approaches that with small thresholds.

The centralized framework can be seen as an idealistic setting, and is equivalent to the distributed setting when \(c_{\text {Local}}=0\). Compared with the idealistic setting, distributed MOSUM achieves similar performance when the size of the change is not small, while substantially reducing transmission costs; however, some power is lost in detecting small changes. We show below that distributed MOSUM can approximate the overall performance of the idealistic setting by increasing the window size.

5.2 The numerical dependency on parameters

One advantage of using MOSUM statistics is that we do not need to specify the post-change mean. Instead, our proposed method requires specifying the window size h and the training size m. In this section, we will investigate the impact of bandwidth and training size.

5.2.1 The impact of bandwidth

As shown in Fig. 5a, increasing the window size increases the power to detect small changes, while leading to a slight delay in detecting large changes. Although increasing the window size increases the storage cost, it does not significantly increase the transmission cost, as shown in Fig. 5b. This motivates us to ask whether the ability of distributed MOSUM with a large threshold to detect small changes can be improved by increasing the window size. Ideally, we would expect distributed MOSUM with an increased window size to achieve similar performance to the idealistic setting.

Fig. 5
figure 5

The influence of window size. Results are obtained over 1000 replications and take \(m=200, d=100, T=10{,}000, \tau =5000, \alpha =0.05, c_{\text {Local}}=3.44\)

Recovering detectability

For simplicity, denote the default window size for centralized MOSUM by \(h^0\), and let \(h^*\) be the smallest window size that allows distributed MOSUM to achieve similar performance to the idealistic setting. It is difficult to derive a neat theoretical relationship between \(h^*\) and \(h^0\), but \(h^*\) can be found approximately under alternatives by simulation. Our idea can be summarised as follows, with Fig. 6 giving a graphical explanation:

  • \({{\bar{D}}}\) decreases dramatically when the mean shift lies within a certain range (grey area). We therefore take the median or mean \(\delta \) of this range, denoted by \(\delta ^0\), and calculate the corresponding ADD \({{\bar{D}}}^0\).

  • Fixing \(\delta ^0\), calculate \({{\bar{D}}} ^ {c_{\text {Local}}}(h)\) iteratively for distributed MOSUM, where \(h\in [h^0,m]\).

  • The optimal window size is \(h^*=\arg \min _h \left\{ {{\bar{D}}} ^ {c_{\text {Local}}}(h)\right. \left. -{{\bar{D}}}^0 (h^0) \right\} \); in Fig. 6, the blue arrow (for \(h^*\)) is shorter than the yellow arrow (for h).

Fig. 6
figure 6

A graphical explanation of our proposed idea. The black line is the ADD for centralized MOSUM with window size h. The yellow dashed line is the ADD for distributed MOSUM with window size h, while the blue dashed line is the ADD for distributed MOSUM with window size \(h^*\). (Color figure online)

Figure 7 displays simulation results showing that, for the distributed regime, we can recover the detectability of the centralized statistic by inflating h.

5.2.2 The impact of the training dataset

Fixing the bandwidth h, the impact of the size of the training dataset can be investigated. Table 4 gives the thresholds, empirical size, and mean square errors (MSE) of the estimated baseline parameters in our simulation. As expected, the larger the training size, the more accurate the estimators. Figure 8 indicates that, overall, the detection power for the four different training-set sizes is similar. A larger training size can slightly increase the detection power for small changes, which is attributed to more accurate estimators. Thus, in real applications, it is beneficial to choose a large training dataset, since the estimation is not expensive and can be done offline.

Fig. 7
figure 7

A simple example showing that distributed MOSUM can approximate the detection power of centralized MOSUM by inflating the window size. Results are obtained over 500 replications and take \(m=200, d=100, T=1000, \tau =600, \alpha =0.05\), and \(c_{\text {Local}}\in [0,4.4]\). When \(c_{\text {Local}} = 4.4\), \(c_{\text {Global}} = 0\), so all possible local thresholds are covered. For the centralized setting, the window size is \(h^0=50\)

5.3 The violation of the independence assumption

So far, we have assumed temporal independence among the observations. However, this may not always hold in real applications. This section investigates the performance when this assumption is violated. Here we evaluate our algorithm under an AR(1) noise process, that is

$$\begin{aligned} X_{i,t}= \delta _{i,t}\mathbb {1}_{ \{ t>\tau \} } +\epsilon _{i,t}, \end{aligned}$$

where \(\epsilon _{i,t} = \phi \epsilon _{i,t-1}+v_t\) with \(v_t \sim N(0,1)\), and \(|\phi |<1\) measures the strength of the auto-correlation.
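A minimal sketch (our own naming) of how such AR(1) errors can be generated is given below; initialising from the stationary distribution is one convenient choice.

```python
import numpy as np

def ar1_noise(phi, T, d, seed=0):
    """Generate d streams of AR(1) errors eps_t = phi * eps_{t-1} + v_t, v_t ~ N(0,1)."""
    rng = np.random.default_rng(seed)
    eps = np.zeros((d, T))
    v = rng.standard_normal((d, T))
    eps[:, 0] = v[:, 0] / np.sqrt(1.0 - phi ** 2)   # start from the stationary distribution
    for t in range(1, T):
        eps[:, t] = phi * eps[:, t - 1] + v[:, t]
    return eps
```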

Auto-correlation inflates the variance of the data. There are two possible ways to handle this problem: the first is to estimate the long-run variance as shown in Sect. 3.1.1, and the second is to inflate the thresholds. We measure the false positive rate, average detection delay and number of transmitted messages of these two solutions, with fixed type-I error, under different scenarios. For comparison, we also show the results of MOSUM without any adjustment. This indicates the extent to which our method fails to detect the change if the auto-correlation is ignored.

As Table 5 shows, our proposed method without adjustment can lose the ability to detect changes when auto-correlation is introduced, either failing to detect the change or constantly raising alarms. The performance of MOSUM with inflated thresholds is generally better than that of MOSUM with the long-run variance (LRV) estimator, since the former detects the change in most scenarios. However, in those scenarios where MOSUM with LRV does detect the change (usually when \(\delta \) is not small and \(\phi \) is not large), it has the lowest transmission cost and reasonable detection power. For example, when \(p=100, \delta =1\), and \(\phi =0.25\), both solutions have similar false positive rates and average detection delay, while MOSUM with LRV has a lower transmission cost. It is surprising that estimating the LRV gives the lowest false positive rates and average detection delay when \(\phi =0\) and \(p=100\) or 50; this may be because it underestimates the variance.

Table 4 Empirical size, and MSE for estimated mean and standard deviation, results averaged over one thousand replications with \(c_{\text {Local}}=3.44, h=50, T=6000\) and \(\alpha =0.05\)
Fig. 8
figure 8

\({{\bar{D}}}\) versus \(\delta \) when varying the size of the training dataset. Results averaged over 500 replications with \(\alpha =0.05, c_{\text {Local}}=3.44, T=6000\), \(\tau =3000\) and \(h=50\). The corresponding global thresholds are shown in Table 4

However, when the auto-correlation is severe, it is not appropriate to apply our method to the raw data. Instead, it is more reasonable to apply it to pre-processed data, such as the residuals from fitted AR models.

Table 5 Results are obtained over 1000 replications with \(T=10{,}000\), \(m=200\), \(h=100\), \(\tau =5000\), \(d=100, c_{\text {Local}}=3.44\), and \(\alpha =0.05\) for all three methods

6 Conclusion

In this paper, we have proposed an online communication-efficient distributed changepoint detection method which can achieve similar performance to an idealistic setting while substantially reducing transmission costs. Numerically, we show that the local threshold and the window size both affect the performance of our algorithm, and that there is a trade-off in choosing them. In applications, we recommend choosing a large local threshold in general. When the change is extremely small, however, the choice of local threshold depends on the communication and storage budgets: if the communication budget is the tighter constraint, choosing a large threshold with a large window size is sensible; if storage is more expensive, choosing a small threshold with a small window size will approximately achieve the idealistic performance.

Violation of the independence assumption negatively affects the power of our proposed method. We addressed this by inflating the thresholds or estimating the long-run variance; both can, to some extent, improve our algorithm when the auto-correlation is not severe. However, both approaches fail to detect changes in highly auto-correlated data. Therefore, one future research direction is how to detect changes within highly auto-correlated data in real time.