1 Introduction

Standard changepoint models rely on partitioning the passage of time into segments, and fitting relatively simple models within each segment. In particular, the data within each segment are often assumed to be independent and identically distributed conditional on some segment specific parameter (Green 1995; Fearnhead 2006; Fryzlewicz 2014). This construction assumes the data within each segment are exchangeable, rendering the order in which the data are observed irrelevant when calculating their joint likelihood (Bernardo and Smith 1993). There are many examples of applications where this assumption is suitable; see for example Olshen et al. (2004), Fryzlewicz (2014) and Fearnhead and Rigaill (2019) for the detection of changes in the mean or variance of time series.

Fig. 1

Counts of computer network events each second between 10:00 and 10:20 on day 22 of the data collection period. Vertical lines indicate estimated changepoints for the proposed changepoint model (middle panel) and for the standard changepoint model (Fearnhead 2006) (bottom panel). Numbers in red indicate the maximum a posteriori (MAP) order of dependence m for each segment for the moving-sum model

However, for some applications it can be reductive to assume data are exchangeable within segments. For illustration purposes, we consider an application of changepoint detection in computer network monitoring. A cyber-attack typically changes the behaviour of the target network. Therefore, to detect the presence of a network intrusion, it might be informative to monitor for changes in the volumes of different types of traffic passing through a network over time. Yet, cyber data are often subject to population drifts, seasonal variations and other temporal trends that are unlikely to be evidence for cyber-attacks. As a result, traditional changepoint detection methods, which assume the data are exchangeable within segments, will fail to capture temporal dynamics and consequently fit many more changepoints than necessary. For example, consider the second-by-second counts of network events recorded on the Los Alamos National Laboratory enterprise network (Turcotte et al. 2017) that are displayed in Fig. 1; the data will be presented in more detail in Sect. 7.1. The vertical lines in the bottom plot indicate the maximum a posteriori (MAP) changepoints obtained with the standard changepoint model described in Fearnhead (2006), which assumes data are exchangeable within segments. No labels are available to indicate the optimal segmentation for these data, but we argue that more changepoints are fitted than would be preferable for network monitoring. It is desirable for abrupt changes, such as the ones observed near the 420th and 880th seconds, to be detected because they correspond to behavioural changes of the network that are sufficiently drastic that cyber-analysts may want to investigate whether they correspond to malicious activities. However, changes due to small fluctuations, such as the ones between the 100th and 380th seconds, may not be relevant to cyber-analysts because they result from local temporal dynamics in traffic volumes that are not suggestive of behavioural changes of the network. There are various possible causes for such small fluctuations, including data transfer mechanisms in networks. For example, when a large volume of data is transferred across a network, it is often divided into batches following protocols that involve routing and queueing steps, which may lead to temporal correlations in traffic volumes. Hence, a changepoint model that detects clear discontinuities in the presence of non-exchangeable data is needed. Moreover, since temporal dynamics for cyber data may change when a clear discontinuity occurs, it would not be satisfactory to assume the dependence structure of the data is the same for each segment. For example, an attack may involve reconnaissance techniques such as port scanning that could change temporal dynamics in traffic volumes.

Existing models to detect clear discontinuities in the presence of non-exchangeable data typically assume the dependence structure is identical for each segment (Albert and Chib 1993; Sparks et al. 2011; Wyse et al. 2011; Chakar et al. 2017; Romano et al. 2021; Cho and Fryzlewicz 2021). In particular, it is often assumed the data within each segment are Markov conditional on some segment parameter. Moreover, changepoint models for dependent data are often designed for a specific marginal distribution, for example the normal distribution (Chakar et al. 2017; Romano et al. 2021; Cho and Fryzlewicz 2021), the negative binomial distribution (Sparks et al. 2011; Yu et al. 2013) or the Poisson distribution (Weiß 2011; Franke et al. 2012).

This article extends a standard changepoint model (Fearnhead 2006), relaxing the assumption that data are exchangeable within segments. The proposed changepoint model, named the moving-sum changepoint model, supposes a segment model that is related to a model for m-dependent, stationary data discussed in Joe (1996); a sequence \(x_1, x_2, \ldots \) is m-dependent if \((x_{t+m+1}, x_{t+m+2}, \ldots )\) is unconditionally independent of \((x_1, x_2, \ldots , x_t)\) for all \(t \geqslant 1\). Within each segment, our model assumes the data are m-dependent and identically distributed conditional on some parameter \(\theta \), where both \(\theta \) and \(m \geqslant 0\) are unknown and change from one segment to the next. Whilst \(\theta \) denotes a parameter of the marginal distribution of the data such as the mean or the variance, m corresponds to the level of dependency within the segment. To maintain tractability, the marginal distribution of the observed data is assumed to belong to the class of convolution-closed infinitely divisible distributions, which includes, for example, the normal, negative binomial and Poisson distributions. Therefore, the moving-sum changepoint model is suitable for various settings where it is of interest to detect clear discontinuities in the presence of non-exchangeable data. For example, consider the MAP changepoints obtained with the moving-sum model for the counts of network events displayed in the middle panel of Fig. 1. In comparison with the standard changepoint model, the moving-sum changepoint model captures temporal dynamics of the network behaviour, resulting in a segmentation of the data that is more adapted to network monitoring.

A common approach to sampling changepoints for a time series is that of Green (1995), using a reversible jump Markov chain Monte Carlo (MCMC) algorithm to explore the state space of changepoints. At each iteration of the algorithm, one of the following move types is proposed: sample a segment parameter, propose a new changepoint, or delete or shift an existing changepoint to a new position. This article proposes a sampling strategy within that framework to sample from the moving-sum changepoint model. In particular, our approach exploits an analysis of the constraints of the parameter space; for example, for non-negative data such as count data, the constraints of the parameter space depend on the observed data, and this must be understood to build proposals for segment specific dependency levels.

The remainder of the article is organised as follows. Section 2 introduces a novel changepoint model for non-exchangeable data. Section 3 gives an approach for deriving the likelihood of the data conditional on proposed changepoints, characterising the segment model in terms of a stochastic difference equation with an unknown initial condition. Section 4 provides a detailed analysis of the constraints to the parameter space, along with asymptotic results on the behaviour of segment parameters. A reversible jump MCMC sampling strategy is given in Sect. 5. Section 6 presents results demonstrating the benefits of the proposed changepoint model, via a comparison with the standard model (Fearnhead 2006), PELT (Killick et al. 2012), DeCAFS (Romano et al. 2021) and WCMgSa (Cho and Fryzlewicz 2021). Section 7 considers two applications of changepoint detection showing the benefits of the proposed changepoint model: computer network monitoring via change detection in count data, and detection of breaks in daily prices of a stock. Section 8 concludes.

2 Moving-sum changepoint model

This section introduces the moving-sum model, which is used as a segment model to define a novel changepoint model for non-exchangeable data.

2.1 Moving-sum model

A moving-sum model assumes that observed data \(x_1, \ldots , x_n\) satisfy

$$\begin{aligned} x_t = \sum _{i=0}^{m}y_{t-i}, \end{aligned}$$
(1)

for \(t=1, \ldots , n\), where \(y_{-(m-1)}, \ldots , y_n\) are \(m+n\) iid latent random variables with common parametric density \(f_m( \cdot \, | \, \theta )\) for some unknown parameters \(\theta \in \Theta \) and \(m \geqslant 0\). If \(m=0\), the construction in (1) implies that, for all \(t=1, \ldots , n\),

$$\begin{aligned} x_t \overset{\text {iid}}{\sim } f_0( \cdot \, | \, \theta ), \end{aligned}$$
(2)

and is consequently equivalent to assuming the data are exchangeable. However, if \(m>0\), the sequence of observed data (1) is m-dependent and therefore non-exchangeable.
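To make the construction in (1) concrete, the following minimal Python sketch simulates a single segment by taking moving sums of \(m+1\) iid latent variables; the helper name and the illustrative parameter values are ours, anticipating Example 1 below.

```python
import numpy as np

def simulate_moving_sum(n, m, rlatent, rng):
    """Simulate x_1,...,x_n from (1): x_t is the sum of the m + 1 latent
    variables y_{t-m},...,y_t, where y_{-m+1},...,y_n are iid draws."""
    y = rlatent(n + m, rng)                          # the n + m latent variables
    return np.convolve(y, np.ones(m + 1), mode="valid")

# e.g. normal latents N(mu/(m+1), sigma^2/(m+1)) with mu = 2, sigma = 1,
# so that x_t is marginally N(2, 1) with m-dependence (cf. Example 1)
rng = np.random.default_rng(0)
m = 3
x = simulate_moving_sum(
    200, m, lambda size, rng: rng.normal(2 / (m + 1), 1 / np.sqrt(m + 1), size), rng)
```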

Definition 1

(m-dependence) For \(m \geqslant 0\), the sequence \(x_1, x_2, \ldots \) is \(\, m\)-dependent if \((x_{t+m+1}, x_{t+m+2}, \ldots )\) is unconditionally independent of \((x_1, x_2, \ldots , x_t)\) for all \(t \geqslant 1\). Note that if a sequence is m-dependent, then it is also Markov of order m.

For all t, \(x_t\) in (1) is the sum of \(m+1\) latent random variables, leading to m-dependence. Reflecting this correspondence, for simplicity of presentation in the following discussion we use the notational convention

$$\begin{aligned} {\bar{m}} = m+1. \end{aligned}$$

It will be helpful to identify a class of distributions for which the construction in (1) gives rise to a tractable marginal distribution of the observed data. Recall that the distribution of a random variable x is infinitely divisible if, for all \(m \geqslant 0\), there exists a sequence of iid random variables \(y_0, \ldots , y_m\) such that \(\sum _{i=0}^{m}y_i\) has the same distribution as x. For any infinitely divisible marginal distribution F for \(x_t\) in (1), there exists a distribution \(F_m\) for the latent random variables for every \(m \geqslant 0\), and \(F_m\) has a known form if F is also closed under convolution. In this article, it will be assumed that the marginal distribution of \(x_t\) is an infinitely divisible distribution that is closed under convolution, so that the corresponding density \(f_m( \cdot \, | \, \theta )\) of the iid latent variables is available for all m.

We consider in detail three instances of the moving-sum segment model based on such distributions, one for continuous data with unbounded domain given in Example 1, one for continuous data with domain bounded below given in Example 2, and one for discrete data with domain bounded below given in Example 3, which we will refer back to throughout the article for illustration. Conjugate priors for \(\theta \) are given for each example.

Example 1

(Normal distribution) Suppose that \(f_m( \cdot \, | \, \theta )\) corresponds to the density of the normal distribution with mean \( \mu / {\bar{m}}\) and variance \(\sigma ^2 / {\bar{m}}\), for some \(\theta = ( \mu , \sigma )\) where \(\mu \in {\mathbb {R}}\) and \(\sigma >0\). It follows that \((x_t)\) is marginally \(N(\mu , \sigma ^2)\) with m-dependence. Moreover, a priori \(\sigma ^{-2} \sim \text {Gamma}(\alpha , \beta )\), for some \(\alpha >0\) and \(\beta >0\), and \(\mu \sim N(\mu _0, \sigma ^2 / \lambda )\) for some \(\mu _0 \in {\mathbb {R}}\) and \(\lambda > 0\).

Fig. 2

Data generated from the moving-sum changepoint model for negative binomial data given in Example 3 with three changepoints (\(\tau _1\), \(\tau _2\) and \(\tau _3\)) indicated by thick grey lines. The black dashed lines indicate the positions of the MAP changepoints obtained by fitting the standard changepoint model (2) to the data

Example 2

(Gamma distribution) Suppose that \(f_m( \cdot \, | \, \theta )\) corresponds to the density of the gamma distribution with shape parameter \(\lambda /{\bar{m}}\) and rate \(\theta \), where \(\lambda >0\) and \(\theta >0\). It follows that \((x_t)\) is marginally \(\Gamma (\lambda , \theta )\) with m-dependence. The prior for \(\theta \) is assumed to be \(\Gamma (\alpha , \beta )\) for some \(\alpha >0\) and \(\beta >0\).

Example 3

(Negative binomial distribution) Suppose that \(f_m( \cdot \, | \, \theta )\) corresponds to the density of the negative binomial distribution with number of failures \( r / {\bar{m}}\) and success probability \(\theta \in [0, 1]\), for some fixed \(r>0\). It follows that \((x_t)\) is marginally \(NB(r, \theta )\) with m-dependence. Moreover, a priori \(\theta \sim \text {Beta}(\alpha , \beta )\), for some \(\alpha >0\) and \(\beta >0\).

Other examples of such distributions include the Poisson, Cauchy and chi-squared distributions.
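As a sketch of how such convolution-closed families can be handled generically, the helper below (our naming) returns a sampler for the latent density \(f_m( \cdot \, | \, \theta )\) for each of the three examples; note that numpy's negative binomial sampler accepts a real-valued first parameter, as required for \(r/{\bar{m}}\), although its parameterisation convention should be checked against the one intended here.

```python
import numpy as np

def latent_sampler(family, theta, m):
    """Sampler for f_m(. | theta), chosen so that the moving sum (1) is
    marginally N(mu, sigma^2), Gamma(shape, rate) or NB(r, p)."""
    mbar = m + 1
    if family == "normal":                 # Example 1, theta = (mu, sigma)
        mu, sigma = theta
        return lambda size, rng: rng.normal(mu / mbar, sigma / np.sqrt(mbar), size)
    if family == "gamma":                  # Example 2, theta = (shape, rate)
        shape, rate = theta
        return lambda size, rng: rng.gamma(shape / mbar, 1.0 / rate, size)
    if family == "negbin":                 # Example 3, theta = (r, p)
        r, p = theta
        return lambda size, rng: rng.negative_binomial(r / mbar, p, size)
    raise ValueError(f"unknown family: {family}")
```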

2.2 Bayesian changepoint analysis with moving-sums

Suppose we observe real-valued discrete time data \(x_{1:T} = (x_{1}, \ldots , x_{T})\). The changepoint model assumes \(k \geqslant 0 \) changepoints with ordered positions \(\tau _{1:k} = (\tau _1, \ldots , \tau _k)\), such that \(1 \equiv \tau _{0}< \tau _{1}< \cdots< \tau _{k} < \tau _{k+1} \equiv T+1\), which partition the passage of time into \(k + 1\) independent segments. The changepoints are assumed to follow a Bernoulli process, implying a joint prior probability mass function

$$\begin{aligned} \pi ( k, \tau _{1:k} ) = p^{k}(1-p)^{T-1-k}, \end{aligned}$$

for some \(0<p<1\) that characterises the expected number of changepoints.

For the moving-sum changepoint model, within each segment j, the data \(x_{\tau _{ j-1}}, \ldots , x_{\tau _{j}-1}\) are assumed to follow the moving-sum model (1) conditional on some unknown dependency level \(m_j\geqslant 0\) and parameter \(\theta _j\in \Theta \), which both change from one segment to the next. Dependency levels \(m_1, \ldots , m_{k+1}\) and segment parameters \(\theta _1, \ldots , \theta _{k+1}\) are assumed to be independent. For all j, it is assumed a priori that \(m_j\) is drawn from a geometric distribution with parameter \(0< \rho <1\), so that

$$\begin{aligned} \pi (m_j) = \rho (1-\rho )^{m_j}, \end{aligned}$$

meaning that for each segment the order of dependence may be increased or decreased at a fixed cost. Moreover, motivated by computational considerations, the prior for \(\theta _j\) is chosen to be conjugate for \(f_0\). For notational simplicity, we denote by \(\pi \) the density of the prior distribution of both \(m_j\) and \(\theta _j\).

2.3 Simulation from the model

Figure 2 displays data generated from the moving-sum changepoint model, given a fixed sequence of changepoints, for negative binomial data given in Example 3 with \(r = 200\), \(\alpha = 20\) and \(\beta = 10\). It is apparent that the changepoints correspond to changes in both the dependence structure and the mean of the data. In particular, we note that for larger values of \(m_j\), the data tend to be smoother in the corresponding segment. For segments with \(m_j>0\), it is reductive to judge the data to be exchangeable since there are clear temporal dynamics.
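Data of this kind can be generated with a short sketch of the full generative model; the function name and the values of p and \(\rho \) below are ours, and the negative binomial convention follows numpy, which may differ from the paper's.

```python
import numpy as np

def simulate_cpt_model(T, p, rho, r, alpha, beta, rng):
    """Simulate from the moving-sum changepoint model with the negative
    binomial segment model of Example 3."""
    tau = [t for t in range(2, T + 1) if rng.random() < p]   # Bernoulli process
    bounds = [1] + tau + [T + 1]
    segments = []
    for j in range(len(bounds) - 1):
        n = bounds[j + 1] - bounds[j]                # segment length
        m = rng.geometric(rho) - 1                   # prior pi(m) = rho (1 - rho)^m
        theta = rng.beta(alpha, beta)                # conjugate prior of Example 3
        y = rng.negative_binomial(r / (m + 1), theta, n + m)
        segments.append(np.convolve(y, np.ones(m + 1), mode="valid"))
    return np.concatenate(segments), tau

x, tau = simulate_cpt_model(T=1000, p=0.005, rho=0.2, r=200, alpha=20, beta=10,
                            rng=np.random.default_rng(1))
```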

The bottom panel of Fig. 2 displays the positions of the MAP changepoints obtained by fitting the standard changepoint model for exchangeable data given in (2) to the simulated data using Metropolis-Hastings sampling of the changepoints as described in Denison et al. (2002). Within segments where the data are not exchangeable, the standard model cannot capture the temporal dynamics, and therefore more changepoints are inferred than is preferable.

3 Conditional likelihood for the moving-sum changepoint model

Since changepoints split the data into independent segments, the joint posterior density of changepoints is tractable, up to a normalising constant, if the conditional likelihood of data within each segment can be computed. This section discusses the computation of the conditional likelihood of some data \(x_1, \ldots , x_n\) observed within a single segment, assuming the moving-sum model defined in Sect. 2.1 for some \(m \geqslant 0\) and \(\theta \).

3.1 Relationship between the observed data and the latent variables

This section gives insight into the relationship between the observed data and the latent variables, which is exploited to compute the conditional likelihood of the observed data within a generic segment in Sect. 3.2. We show that, for the latent m-dependence framework (1), there are m free latent variables subject to some constraints, and all further latent variables are then implied by the observed data sequence. Some of the notation introduced in this section will be useful in the remainder of the article.

Consider the backward difference operator \(\triangledown \) and the forward difference operator \( \vartriangle \, \) defined such that, for all t,

$$\begin{aligned} \triangledown x_t&= x_t - x_{t-1}, \nonumber \\ \vartriangle x_t&= x_t - x_{t+1} = - \triangledown x_{t+1} . \end{aligned}$$
(3)

The equation given in (1) may be equivalently expressed as

$$\begin{aligned} y_t = y_{t-{\bar{m}}} + \triangledown x_t , \end{aligned}$$
(4)

for \(t=2, \ldots , n\), and \(y_1 = x_1 - (y_{-m+1} + \cdots + y_{0})\). Iterating the expression in (4) shows that, given the initial m latent random variables \(y_{-m+1}, \ldots , y_{0}\), there is a one-to-one relationship between the finite differences of \(x_{1:n}\) (3) and the remaining latent variables \(y_{1:n}\). Let the first m latent variables be \(\gamma _{1:m}= (\gamma _1, \ldots , \gamma _m)\), with

$$\begin{aligned} \gamma _{r} = y_{-m+r} \end{aligned}$$

for \(r = 1, \ldots , m\). Explicitly, for all \(t=1, \ldots , n\), letting r be the remainder and q the quotient of the Euclidean division of \(t-1\) by \({\bar{m}}\) such that \(t=q{\bar{m}}+r+1\), we have

(5)

where \(x_0 \equiv 0\) and

(6)

The role played by \(\gamma _{1:m}\) is akin to the role played by the unknown initial conditions of a stochastic difference equation.
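In code, the recursion (4) recovers the remaining latent variables from the data and the initial latent variables; a minimal sketch (the helper name is ours):

```python
import numpy as np

def latents_from_data(x, gamma):
    """Recover y_1,...,y_n from x_1,...,x_n and gamma = (y_{-m+1},...,y_0)
    via (4): y_t = y_{t-(m+1)} + (x_t - x_{t-1})."""
    m = len(gamma)
    z = list(gamma)                       # z holds y_{-m+1},...,y_0,y_1,...
    z.append(x[0] - sum(gamma))           # y_1 = x_1 - (y_{-m+1} + ... + y_0)
    for t in range(2, len(x) + 1):
        z.append(z[t - 2] + x[t - 1] - x[t - 2])   # z[t-2] holds y_{t-(m+1)}
    return np.asarray(z[m:])
```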

The choice to condition on the first m latent random variables is arbitrary; given any sequence of m consecutive latent random variables, there is a one-to-one relationship between the other latent variables and the observed data. The following definition gives a transformation that may be used to obtain \(\gamma _{1:m}\) from any m consecutive latent random variables, so that it is sufficient to consider conditioning on the first m latent variables in all subsequent discussion. The transformation is also useful in the later sections of the article, where, for example, we need to obtain the initial latent variables when some data are added to, or removed from, an edge of a segment.

Definition 2

Let \(x_1, \ldots , x_n\) be data observed within a segment, assuming the model (1) for some \(m \geqslant 0\). Let \(S[y_{t}, \ldots , y_{t+m-1} ] = (y_{t+1}, \ldots , y_{t+m} )\) denote the ‘shift’ map, where \(y_{t+m} = x_{t+m} - \sum _{i=0}^{m-1} y_{t+i} \), for all suitable t. Clearly, S is iterable and invertible, and for all sequences of m consecutive latent random variables \(y_{(-m+1+u):u}\), with \( 0 \leqslant u \leqslant n\), \(S^{-u}[y_{(-m+1+u):u}] = y_{(-m+1):0} = \gamma _{1:m} \).
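A minimal sketch of the shift map and its inverse (names ours; t follows the 1-based time convention of the text, while the array x is stored 0-based):

```python
import numpy as np

def shift(gamma, x, t):
    """S[(y_t,...,y_{t+m-1})] = (y_{t+1},...,y_{t+m}), where
    y_{t+m} = x_{t+m} - sum_{i=0}^{m-1} y_{t+i}."""
    m = len(gamma)
    return np.append(gamma[1:], x[t + m - 1] - np.sum(gamma))

def shift_inv(gamma, x, t):
    """Inverse map: recover (y_{t-1},...,y_{t+m-2}) from (y_t,...,y_{t+m-1}),
    using y_{t-1} = x_{t+m-1} - sum_{i=0}^{m-1} y_{t+i}."""
    m = len(gamma)
    return np.append(x[t + m - 2] - np.sum(gamma), gamma[:-1])
```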

3.2 Conditional likelihood of the data within a segment

Given m and \(\gamma _{1:m}\), (5) provides a one-to-one deterministic mapping between \(x_{1:n}\) and \(y_{1:n}\) with unit Jacobian. Hence, if we treat the sequence \(\gamma _{1:m}\) as an additional unknown segment parameter, whose elements are independent and identically distributed with density \(f_m(\cdot \, |\, \theta )\), then the conditional likelihood of the observed data within a segment is

$$\begin{aligned} L(x_{1:n} \, | \,\theta , m, \gamma _{1:m})&= L(y_{1:n} \, | \, \theta , m, \gamma _{1:m}) \nonumber \\&= \prod _{i=1}^{n} f_m(y_i | \theta ). \end{aligned}$$
(7)

Thus, using the notation introduced in Sect. 2.2, but ignoring the subscripts corresponding to the indices of segments, the unknown segment parameters are \((\theta , m , \gamma _{1:m})\), with prior density \(\pi (\theta ) \pi (m) \pi (\gamma _{1:m} \, | \theta , m)\) where \(\pi (\gamma _{1:m} \, | \theta , m) = \prod _{r=1}^{m} f_m(\gamma _{r} \, | \, \theta )\).

Recall it is assumed that the prior for \(\theta \) is chosen to be conjugate for \(f_m(\cdot \, |\, \theta )\) conditional on m. Consequently, the joint likelihood of the data and the initial latent variables conditional on m can be derived by marginalising \(\theta \),

$$\begin{aligned}{} & {} L(x_{1:n}, \gamma _{1:m} \, | \, m) \nonumber \\{} & {} \quad = \int L(y_{1:n} \, | \, \theta , m, \gamma _{1:m} ) \pi (\gamma _{1:m} \, | \, \theta , m) \pi (\theta ) d \theta . \end{aligned}$$
(8)

Let \({\mathcal {F}}\) denote the support of \(f_m(\cdot \, |\, \theta )\) and, given m, \(x_{1:n}\) and (5), let

$$\begin{aligned} {\mathcal {Y}}_m&\equiv {\mathcal {Y}}_m(x_{1:n}) \nonumber \\&= \{ \gamma _{1:m} \in {\mathcal {F}}^m ; y_t \in {\mathcal {F}} \text { for all } t=1, \ldots , n \}. \end{aligned}$$
(9)

\({\mathcal {Y}}_{m}\) is the set of sequences \(\gamma _{1:m}\) such that the joint conditional probability density (8) is non-zero. As stated in Remark 1, it is not guaranteed that, for all m, \({\mathcal {Y}}_m\) will be non-empty. An expression for (8) is given below for the three exemplar segment models introduced in Sect. 2.1.

Remark 1

(Set \({\mathcal {Y}}_m\)) Note that in the case of \(m=0\), where the sequence \(x_{1:n}\) is assumed to be exchangeable, then \(\gamma _{1:m}\) is the empty sequence, and the expression in (7) is always well defined. Now, if \(m>0\), two cases need to be considered separately. If \({\mathcal {F}}\) is unbounded, for all m and sequence \(x_{1:n}\), the set \({\mathcal {Y}}_m\) is \({\mathcal {F}}^{m}\). However, if \({\mathcal {F}}\) is bounded then \({\mathcal {Y}}_m\) is a proper subset of \({\mathcal {F}}^{m}\) and is not necessarily non-empty for all \(m>0\) and \(x_{1:n}\). For example, if \({\mathcal {F}}\) is bounded below by 0, for any non-negative sequence \(x_{1:n}\) with \(x_2 > x_1 + x_3\), the set \({\mathcal {Y}}_m\) is empty for all \(m>0\).

Example 1

(Normal distribution) Given parameters \(( m , \gamma _{1:m})\) and known hyperparameters \(\mu _0 \in {\mathbb {R}}\) and \(\lambda , \alpha , \beta >0\), it follows that

$$\begin{aligned} L(x_{1:n}, \gamma _{1:m} \, | \, m)&= \left( \frac{ {\bar{m}} }{ 2 \pi } \right) ^{( n+m)/2} \left( \frac{ \lambda }{ \lambda ^\prime } \right) ^{1/2} \frac{\beta ^{ \alpha } }{\Gamma ( \alpha ) } \frac{ \Gamma ( \alpha ^\prime ) }{ (\beta ^{\prime })^{\alpha ^\prime } } , \end{aligned}$$

where \(\lambda ^\prime = \frac{(n + m + {\bar{m}} \lambda )}{ {\bar{m}} }\), \(\alpha ^\prime = \frac{ ( n+m) }{2} + \alpha \) and \(\beta ^\prime = \beta + \frac{ {\bar{m}} }{2} (\sum _{t=-m+1}^n y_t^2) + \frac{ \lambda }{2} \mu _0^2 - \frac{ (\lambda \mu _0 + \sum _{t=-m+1}^n y_t )^2 }{ 2 \lambda ^\prime }\).
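For concreteness, a direct log-scale transcription of this closed form, reusing latents_from_data from the sketch in Sect. 3.1 (all helper names are ours):

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_normal(x, gamma, mu0, lam, alpha, beta):
    """log L(x_{1:n}, gamma_{1:m} | m) for Example 1, with (mu, sigma)
    integrated out under the conjugate normal/gamma prior."""
    gamma = np.asarray(gamma, dtype=float)
    m = len(gamma); mbar = m + 1; n = len(x)
    s = np.concatenate([gamma, latents_from_data(x, gamma)])   # all n + m latents
    lam_p = (n + m + mbar * lam) / mbar
    alpha_p = (n + m) / 2 + alpha
    beta_p = (beta + 0.5 * mbar * np.sum(s ** 2) + 0.5 * lam * mu0 ** 2
              - (lam * mu0 + np.sum(s)) ** 2 / (2 * lam_p))
    return (0.5 * (n + m) * np.log(mbar / (2 * np.pi))
            + 0.5 * np.log(lam / lam_p)
            + alpha * np.log(beta) - gammaln(alpha)
            + gammaln(alpha_p) - alpha_p * np.log(beta_p))
```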

Example 2

(Gamma distribution) Given parameters \(( m , \gamma _{1:m})\) and known hyperparameters \(\lambda , \alpha , \beta >0\), it follows that

$$\begin{aligned} L(x_{1:n}, \gamma _{1:m} \, | \, m) = \frac{ \prod _{t=-m+1}^{n} y_t^{\lambda _m - 1} }{ \Gamma (\lambda _m)^{n+m} } \, \frac{\beta ^{ \alpha } }{\Gamma ( \alpha ) } \, \frac{ \Gamma ( \alpha ^\prime ) }{ (\beta ^{\prime })^{\alpha ^\prime } } , \end{aligned}$$

where \(\lambda _{m} = \lambda /{\bar{m}}\), \(\alpha ^\prime = \alpha + (n+m) \lambda _m\) and \(\beta ^\prime = \beta + \sum _{t=-m+1}^{n} y_t\).

Example 3

(Negative binomial distribution) Given parameters \(( m , \gamma _{1:m})\) and known hyperparameters \(r, \alpha , \beta >0\), it follows that

$$\begin{aligned} L(x_{1:n}, \gamma _{1:m} \, | \, m) = \prod _{t=-m+1}^{n} \frac{ \Gamma (y_t + r_m) }{ \Gamma (r_m) \, y_t ! } \, \frac{ B(\alpha + s, \, \beta + (n+m) r_{m}) }{ B(\alpha , \beta ) } , \end{aligned}$$

where \(r_{m} = r/{\bar{m}}\), \(s = \sum _{t=-m+1}^{n} y_t\) and \(B(\cdot , \cdot )\) denotes the beta function.

Fig. 3

Cartoons of the set \({\mathcal {Y}}_m\) for \(m=1,2,3\) with \({\mathcal {F}}\) assumed to be continuous

The segment parameters \((m, \gamma _{1:m})\) cannot be marginalised, and consequently, to sample from the posterior distribution of the changepoints, we need to sample \((m, \gamma _{1:m})\) for each segment. Two challenges arise when attempting to do so: the dimension of the segment parameter space is unknown, and, as first hinted in Remark 1, the parameter space depends on the observed data when the support of \(f_m(\cdot \, |\, \theta )\) is bounded. Hence, in order to develop a sampling strategy that is computationally realistic, it is necessary to characterise the parameter space we are seeking to navigate.

4 Analysis of the latent parameter space

In Sect. 3.2 we defined \({\mathcal {Y}}_{m}\) to be the set of sequences \(\gamma _{1:m}\) such that the joint conditional probability density of \(\gamma _{1:m}\) and \(x_{1:n}\) is non-zero within a generic segment. We now define

$$\begin{aligned} {\mathcal {M}} \equiv {\mathcal {M}}(x_{1:n}) = \{ m \in {\mathbb {N}}_{0} ; \, {\mathcal {Y}}_{m} \ne \emptyset \} \end{aligned}$$

to be the set of \(m \geqslant 0\) for which \({\mathcal {Y}}_{m}\) is non-empty. In other words, given some observed data \(x_1, \ldots , x_n\) within a generic segment, \({\mathcal {M}}\) and \({\mathcal {Y}}_{m}\) provide the values of m and \(\gamma _{1:m}\) for which the segment model (1) is valid.

As stated in Remark 1, it is always possible to assume the sequence \(x_{1:n}\) to be exchangeable, and hence \(0\in {\mathcal {M}}\). If \({\mathcal {F}}\), the support of \(f_m(\cdot \, |\, \theta )\), is unbounded, such as in Example 1, then \({\mathcal {M}} = {\mathbb {N}}_{0}\), and \({\mathcal {Y}}_{m}\) is \({\mathcal {F}}^{m}\) for all \(m>0\). However, if \({\mathcal {F}}\) is bounded, such as in Examples 2, 3, then both \({\mathcal {M}}\) and \({\mathcal {Y}}_{m}\) depend on the observed data \(x_{1:n}\), and \({\mathcal {M}}\) and \({\mathcal {Y}}_{m}\) are proper subsets of \({\mathbb {N}}_{0}\) and \({\mathcal {F}}^{m}\), respectively. Infinitely divisible distributions with support bounded from below and from above have zero variance (Steutel 1975), and therefore we only consider the case where \({\mathcal {F}}\) is bounded below but not above, without loss of generality.

Section 4.1 explicitly states the relationship between the observed data \(x_{1:n}\) and the sets \({\mathcal {M}}\) and \({\mathcal {Y}}_{m}\) for \(m\in {\mathcal {M}}\) when \({\mathcal {F}}\) is bounded below. Sections 4.2 and 4.3 provide results on the structure of the constrained parameter space that are used in Sect. 5 to design a sampling strategy for the model parameters. Section 4.4 discusses the asymptotic behaviour of the parameter space as the number of observations increases; in particular, for the segment parameters \(\gamma _{1:m}\), which may be considered to be the unknown initial conditions of a process governed by a difference equation as argued in Sect. 3.1, it is shown that, asymptotically, the uncertainty on the initial conditions vanishes.

4.1 Characterisation of the parameter space in terms of the observed data

Suppose that \({\mathcal {F}}\) is unbounded above but bounded below by a constant, which can be set to 0 without loss of generality. It follows that, given a sequence \(x_{1:n}\) and some \(m>0\), the set \({\mathcal {Y}}_{m}\) consists of those \(\gamma _{1:m} \in {\mathcal {F}}^{m}\) such that \(y_{t} \geqslant 0\) for all \(t=1, \ldots , n\). According to (5) and (6), defining

$$\begin{aligned} \begin{array}{l} U^{m} \equiv U^{m}(x_{1:n}) = \min \{ \triangledown x_1 , \ldots , \sum _{q = 0 }^{\kappa _{0}} \triangledown x_{q{\bar{m}}+1} \}, \\ L_r^{m} \equiv L_{r}^{m}(x_{1:n}) = \max \{ 0, \vartriangle x_{r} , \ldots , \sum _{q = 0 }^{\kappa _{r}} \vartriangle x_{q{\bar{m}}+r} \}, \end{array} \end{aligned}$$
(10)

where \(x_0 \equiv 0\), so that \(\triangledown x_1 = x_1\), and, for \(r = 0, 1, \ldots , m\), \(\kappa _r = \kappa _r(n)\) is the largest \(q \in {\mathbb {N}}_{0}\) such that \(q{\bar{m}}+r+1 \leqslant n\), then

$$\begin{aligned} {\mathcal {Y}}_{m} = \mathbb {Y}[U^{m}, L^{m}_{1:m}], \end{aligned}$$
(11)

where

$$\begin{aligned} \mathbb {Y}[U, L_{1:m}] = \Big \{ \gamma _{1:m} \in {\mathcal {F}}^{m} ; \, \gamma _r \geqslant L_{r} \text { for } r=1, \ldots , m, \text { and } \sum _{r=1}^{m} \gamma _r \leqslant U \Big \}. \end{aligned}$$

It will be useful both in the remainder of this section and in Sect. 5 to characterise \({\mathcal {Y}}_{m}\) (9) in terms of \(\mathbb {Y}\) (11). Moreover, for \(m\geqslant 0\), \(m\in {\mathcal {M}}\), that is \({\mathcal {Y}}_{m}\) is non-empty, if and only if

$$\begin{aligned} D^m \equiv D^{m}(x_{1:n}) = U^{m} - \sum _r L_{r}^{m} \geqslant 0. \end{aligned}$$
(12)

Example 4 gives expanded expressions of the bounds of \({\mathcal {Y}}_{m}\) for a small sequence of observed data.

Example 4

(Bounds of \({\mathcal {Y}}_{m}\)) Let \(x_1, \ldots , x_{7}\) be count data. If \(m=1\) (\({\bar{m}}=2\)), then

$$\begin{aligned} \left\{ \begin{array}{l} U^{1} = \min \{ x_1, x_1 + (x_3 - x_2), x_1 + (x_3 - x_2) + (x_5 - x_4), \\ \quad \quad \quad x_1 + (x_3 - x_2) + (x_5 - x_4) + (x_7-x_6) \} \\ L_1^{1} = \max \{ 0, (x_1 - x_2 ), (x_1 - x_2 ) + (x_3 - x_4 ), \\ \quad \quad \quad (x_1 - x_2 ) + (x_3 - x_4 ) + (x_5 - x_6 ) \}, \end{array} \right. \end{aligned}$$

whilst if \(m=3\) (\({\bar{m}}=4\)), then

$$\begin{aligned} \left\{ \begin{array}{l} \, U^{3} \, = \min \{ x_1, x_1 + (x_5 - x_4) \} \\ L_1^{3} = \max \{ 0, (x_1- x_2), (x_1- x_2) + (x_5 -x_6) \}\\ L_2^{3} = \max \{ 0, (x_2- x_3), (x_2- x_3) + (x_6-x_7) \} \\ L_3^{3} = \max \{ 0, (x_3- x_4) \}. \end{array} \right. \end{aligned}$$
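A sketch of these bounds in code (names ours), which reproduces Example 4 and provides the membership check (12):

```python
import numpy as np

def bounds(x, m):
    """Compute U^m, (L^m_1,...,L^m_m) and D^m of (10) and (12) for data
    bounded below by 0; m belongs to M if and only if D^m >= 0."""
    mbar = m + 1
    dx = np.diff(x, prepend=0.0)          # backward differences, with x_0 = 0
    U = np.cumsum(dx[0::mbar]).min()      # partial sums of differences at t = q*mbar + 1
    L = []
    for r in range(1, mbar):
        fwd = -dx[r::mbar]                # forward differences at t = q*mbar + r
        L.append(max(0.0, np.cumsum(fwd).max()) if fwd.size else 0.0)
    return U, L, U - sum(L)

x = np.array([5.0, 4, 5, 4, 5, 4, 5])
U, L, D = bounds(x, 1)                    # U = 5, L = [3], D = 2 >= 0: 1 is in M
```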
Fig. 4

Cartoon representation of the mapping of \(\gamma _{1:m} \in {\mathcal {Y}}_m\) to \(\gamma _{1:{m^\prime }}^{\prime } = J_{m^\prime }(\gamma _{1:m}) \in {\mathcal {Y}}_{m^\prime }\) with \(m=5\) and \(m^\prime =2\)

Figure 3 displays cartoon representations of \({\mathcal {Y}}_{m}\) for \(m=1,2\) and 3. For all \(m>0\), if non-empty, \({\mathcal {Y}}_{m}\) is a convex polyhedron in \({\mathcal {F}}^{m}\), whose vertices are determined by the bounds \(U^{m}\) and \(L^{m}_{1:m}\), and whose size depends only on m and \(D^{m}\).

Given a sequence of observed data \(x_{1:n}\) within a segment, it was shown in (11) that the sets \({\mathcal {M}}\) and \({\mathcal {Y}}_{m}\) may be expressed in terms of the first observation \(\triangledown x_1 =x_1 - x_0 = x_1\) and all of the subsequent finite differences of the data. Essentially, for all m, the larger the finite differences, or jumps, are relative to \(x_1\), the smaller \({\mathcal {Y}}_m\) becomes, unless the jumps happen to exhibit negative \({\bar{m}}\)-lagged autocorrelations. In particular, if the observed data \(x_{1:n}\) consist of a succession of gradual drifts, whose directions flip every \({\bar{m}}\) time points, then \({\mathcal {Y}}_m\) is large with respect to \(x_1\). Furthermore, in order to stress that the size of the jumps must be considered relative to \(x_1\), we note that for all sequences \(x_{1:n}\) such that \({\mathcal {Y}}_m\) is non-empty for some m, and for any constant \(\mu \), \(D^{m}( x_{1:n} + \mu ) = D^{m}( x_{1:n} )+ \mu \), where \(x_{1:n} + \mu \) denotes the sequence \((x_{t} + \mu )_{t=1}^{n}\).

4.2 Adding and removing data

To develop a sampling strategy for the latent variables, it is important to determine how \({\mathcal {M}}\) and \({\mathcal {Y}}_m\), \(m\in {\mathcal {M}}\), may change when data are added to, or removed from, an edge of a segment. Proposition 1 guarantees that, for all sequences of data, if there exist segment parameters \(\gamma _{1:m}\) for some m, then we can obtain valid parameters of the same dimension for all sequences of data that could be obtained by removing data from the beginning or the end of the original sequence.

Proposition 1

Let \(x_{1:n}\) be observed data and \( 1 \leqslant s \leqslant t \leqslant n\). Then \({\mathcal {M}}(x_{1:n}) \subseteq {\mathcal {M}}(x_{s:t})\), and \(S^{s-1}[{\mathcal {Y}}_{m}(x_{1:n})]\subseteq {\mathcal {Y}}_{m}(x_{s:t})\) for all \(m\in {\mathcal {M}}(x_{1:n})\), where the shift map S is defined in Definition 2. In particular, if \(s=1\) we have \({\mathcal {Y}}_{m}(x_{1:n})\subseteq {\mathcal {Y}}_{m}(x_{1:t})\).

Proof

Suppose \(\gamma _{1:m} = y_{(-m+1):0} \in {\mathcal {Y}}_{m}(x_{1:n})\) for some \(m \in {\mathcal {M}}(x_{1:n})\). Let \(y_{1:n}\) denote the latent variables obtained from \(\gamma _{1:m}\) and \(x_{1:n}\) via (5). By definition of the shift map, the latent variables obtained from \(S^{s-1}[y_{(-m+1):0}] = y_{(-m+s):(s-1)}\) and \(x_{s:t}\) via (5) are equal to \(y_{s:t}\). Hence \(S^{s-1}[y_{(-m+1):0}] \in {\mathcal {Y}}_{m}(x_{s:t})\), and therefore \(m \in {\mathcal {M}}(x_{s:t})\). \(\square \)

However, if some data are added to the beginning or the end of a sequence of data for which we currently have valid parameters \(\gamma _{1:m}\) for some m, then it is not guaranteed that \(m \in {\mathcal {M}}\) for the extended sequence of data. For example, it follows directly from (10) that, as more data are added to the end of a segment, for all m, \(U^{m}\) may only decrease and \(L^{m}_{r}\) may only increase for all r, and therefore \({\mathcal {Y}}_{m} = \mathbb {Y}[U^{m}, L^{m}_{1:m}]\) may only shrink.

4.3 Different ranges of dependence

Given a segment of data, there is a non-negligible computational cost in verifying via (12) that some integer m belongs to the corresponding set \({\mathcal {M}}\). Hence, it is instructive to determine whether knowing that some m belongs to \({\mathcal {M}}\) can inform whether some \(m^\prime \ne m\) is also an element of \({\mathcal {M}}\).

Suppose \(m \in {\mathcal {M}}\) and \(0 \leqslant m^\prime \leqslant m\) such that \({\bar{m}}^{\prime }\) divides \({\bar{m}}\) (written \({\bar{m}}^{\prime } \, | \, {\bar{m}}\)), meaning \({\bar{m}} = \ell {\bar{m}}^{\prime }\) for some integer \(\ell \geqslant 1\). Then define a mapping

$$\begin{aligned} J_{m^\prime }(\gamma _{1:m}) = \left( \sum _{j=0}^{\ell -1} \gamma _{j{\bar{m}}^\prime +1}, \ldots , \sum _{j=0}^{\ell -1} \gamma _{j{\bar{m}}^\prime +m^{\prime }} \right) \end{aligned}$$

for aggregating the latent variables. Figure 4 displays a cartoon representation of the mapping, and a minimal code sketch is given below. The mapping is required for the proposition that follows.
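A minimal sketch of the aggregation map (the function name is ours; gamma is 0-based, so gamma[i] stores \(\gamma _{i+1}\)):

```python
import numpy as np

def J(gamma, mbar_prime):
    """Aggregate gamma_{1:m} to gamma'_{1:m'} when mbar' divides
    mbar = len(gamma) + 1, summing latents in the same residue class."""
    mbar = len(gamma) + 1
    assert mbar % mbar_prime == 0, "mbar' must divide mbar"
    ell = mbar // mbar_prime
    return np.array([sum(gamma[j * mbar_prime + r] for j in range(ell))
                     for r in range(mbar_prime - 1)])
```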

Proposition 2

For all \(m \in {\mathcal {M}}\), \(\gamma _{1:m} \in {\mathcal {Y}}_m\) and \(m^\prime < m\) such that \({\bar{m}}^\prime \, | \, {\bar{m}}\),

  1. (i)

    \(m^{\prime } \in {\mathcal {M}}\) and \(J_{m^{\prime }}(\gamma _{1:m}) \in {\mathcal {Y}}_{m^{\prime }}\).

  2. (ii)

    \(J_{m^{\prime }}({\mathcal {Y}}_m ) = \mathbb {Y}[{\tilde{U}} , J_{m^{\prime }}(L^{m}_{1:m} )] \subset \mathbb {Y}[U^{{ m^{\prime }}}, L^{ m^{\prime } }_{1:m^{\prime }}] = {\mathcal {Y}}_{m^{\prime }}\), where \({\tilde{U}} = U^{m}- \sum _{j=1}^{\ell -1} L^{m}_{j{\bar{m}}^\prime }\).

Proof

See Appendix A.1. \(\square \)

To provide some intuition for Proposition 2, it is helpful to consider the bounds given in Example 4, noting that \(3+1 = 2(1+1)\) and that the finite differences appearing in \(L^{3}_1\) and \(L^{3}_3\) for \(m=3\) coincide with the finite differences in \(L^{1}_1\) for \(m=1\).

Proposition 2 says that for all \(m \in {\mathcal {M}}\) it follows immediately that \({\mathcal {D}}(m) \subset {\mathcal {M}}\), where

$$\begin{aligned} {\mathcal {D}}(m) = \{ m^\prime \in {\mathbb {N}}_{0}; \, {\bar{m}}^\prime \mid {\bar{m}} \} \end{aligned}$$
(13)

consists of the integers \(m^\prime \) such that \({\bar{m}}^\prime \) divides \({\bar{m}}\).

4.4 Asymptotic properties of the parameter space

In Sect. 4.2 it was shown that the parameter space may only shrink as more data are observed within a segment. Proposition 3 sheds further light upon the asymptotic properties of \({\mathcal {M}}\) and \({\mathcal {Y}}_m\) for all m.

Proposition 3

For \(m \in {\mathbb {N}}_{0}\), suppose a sequence of latent variables \(y_{-m+1}, \ldots , y_{n}\), and a sequence of observed data \(x_{1}, \ldots , x_n\) are generated from model (1), assuming some density \(f_m(\cdot \, |\, \theta )\) with support \({\mathbb {N}}_{0}\) or \({\mathbb {R}}^{+}\). As \(n \rightarrow \infty \), for all \(m^\prime >0\),

$$\begin{aligned} U^{m^\prime }(x_{1:n}) \rightarrow \sum _{r=1}^{m^\prime } \gamma ^{\prime }_{r} \;\; \text {and} \;\; L^{m^\prime }_{r}(x_{1:n}) \rightarrow \gamma ^{\prime }_{r} \;\; \text {almost surely if } {\bar{m}}^{\prime } \, | \, {\bar{m}}, \;\; \text {and } D^{m^\prime }(x_{1:n}) \rightarrow - \infty \;\; \text {almost surely otherwise}, \end{aligned}$$

for all \(r=1, \ldots , m^\prime \), with \(\gamma ^{\prime }_{1:m^\prime } = J_{m^\prime }(\gamma _{1:m})\), and thus, almost surely, \({\mathcal {M}}(x_{1:n})\) converges to \({\mathcal {D}}(m)\) (13) and, for all \(m^\prime \in {\mathcal {D}}(m)\), \({\mathcal {Y}}_{m^\prime }(x_{1:n}) = \mathbb {Y}[U^{m^\prime }, L^{m^\prime }_{1:m^\prime }]\) converges to \(\{ \gamma ^{\prime }_{1:m^\prime } \}\).

Proof

See Appendix A.2. \(\square \)

Proposition 3 tells us that if some data are generated from model (1) for some \(m>0\) and some initial latent variables \(\gamma _{1:m}\), then almost surely, as the number of observations tends to infinity, \({\mathcal {M}}\) converges to \({\mathcal {D}}(m)\) and \({\mathcal {Y}}_{m^\prime }\) converges to a set containing a unique sequence, namely the transformation of \(\gamma _{1:m}\) by \(J_{m^\prime }\), for all \(m^\prime \in {\mathcal {D}}(m)\). We note that the result shows that both the structure in \({\mathcal {M}}\) and the transformation identified in Proposition 2 are fundamental, and increasingly important as the length of a changepoint segment increases.

Within the changepoint detection framework, it is appealing that, the more data are observed within a segment, the more information the parameter space gives us about the nature of the dependence within the segment, so that a changepoint may be forced immediately upon observing some data generated from a different dependency structure.

Recall that in Sect. 3.1 it was argued that the segment parameters \(\gamma _{1:m}\) may be considered as the unknown initial conditions of a process governed by a difference equation. From this point of view, Proposition 3 says that, asymptotically, the uncertainty on the initial conditions vanishes.

Fig. 5

The colours indicate the proportion of simulations for which \(m^\prime \in {\mathcal {M}}\), denoted \(Q( m^\prime \in {\mathcal {M}} \, | \, n, m, \theta )\), based on 50 simulations from the moving-sum segment model for negative binomial data given in Example 3 for \(m=0, \ldots , 30\), \(n=200, 400, 800\) and \(\theta = 0.4, 0.8\)

An experiment was performed to illustrate Proposition 3. For six different parameter configurations corresponding to different fixed values of n and \(\theta \), and for all m in \(\{ 0, \ldots , 30 \} \), we performed 50 simulations from the moving-sum segment model for negative binomial data given in Example 3 with \(r=300\). For all \(m^\prime \in \{ 0, \ldots , 30 \}\) we computed the proportion of simulations for which \(m^\prime \in {\mathcal {M}}\), denoted \(Q( m^\prime \in {\mathcal {M}} \, | \, n, m, \theta )\). Figure 5 displays the results of the experiment. As expected from Proposition 3, for all parameter choices it is apparent that \(m^\prime \in {\mathcal {M}}\) with estimated probability 1 for all \(m^\prime \in {\mathcal {D}}(m)\), and that \(Q( m^\prime \in {\mathcal {M}} \, | \, n, m, \theta )\) tends to decrease as n increases for all \(m^\prime \notin {\mathcal {D}}(m)\). Moreover, three other trends are worth mentioning. First, \(Q( m^\prime \in {\mathcal {M}} \, | \, n, m, \theta )\) tends to increase as m increases for all \(m^\prime \notin {\mathcal {D}}(m)\) given n and \(\theta \). Second, it tends to be more likely for \(m^\prime _1\) to be in \({\mathcal {M}}\) than for \(m^\prime _2\) to be in \({\mathcal {M}}\) for all \(m^\prime _1, m^\prime _2 \notin {\mathcal {D}}(m)\) such that \(m^\prime _1 < m^\prime _2\). Third, one may observe that \(Q( m^\prime \in {\mathcal {M}} \, | \, n, m, \theta )\) tends to increase as \(\theta \) increases for all \(m^\prime \notin {\mathcal {D}}(m)\) given all m and n.
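The experiment is straightforward to reproduce in outline, reusing the bounds helper from the sketch in Sect. 4.1; the configuration below is one illustrative choice, and Q estimates \(Q( m^\prime \in {\mathcal {M}} \, | \, n, m, \theta )\) via the criterion \(D^{m^\prime } \geqslant 0\).

```python
import numpy as np

rng = np.random.default_rng(2)
r, theta, n, m, n_sim = 300, 0.4, 400, 6, 50
Q = np.zeros(31)
for _ in range(n_sim):
    y = rng.negative_binomial(r / (m + 1), theta, n + m)    # latents of Example 3
    x = np.convolve(y, np.ones(m + 1), mode="valid")        # moving-sum data
    for mp in range(31):
        Q[mp] += bounds(x, mp)[2] >= 0                      # m' in M iff D^{m'} >= 0
Q /= n_sim
```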

5 Markov chain Monte Carlo changepoint inference

For the Bayesian changepoint model given in Sect. 2.2, the reversible jump Markov chain Monte Carlo (MCMC) algorithm (Green 1995), which is a Metropolis-Hastings algorithm suitable for target distributions of varying dimension, may be used to sample from the posterior distribution of the positions of an unknown number of changepoints. Four types of moves are considered to explore the support of the target distribution: shift of a randomly selected changepoint; change of a randomly chosen segment parameter; birth of a new changepoint chosen uniformly over the time period; and death of a randomly selected changepoint.

In this section, within the framework given in Green (1995), we propose a strategy to sample from the posterior distribution of changepoints when the moving-sum model defined in Sect. 2 is assumed for each segment. To address the challenges that the dimensions of the segment parameters are unknown and that the segment parameter space depends on the observed data within each segment, we exploit the analysis of the segment parameter space in Sect. 4.

5.1 Description of the sampler

Suppose that the current state of the sample chain consists of k changepoints, whose positions are \(\tau _{1:k}\), and \((k+1)\) sets of segment parameters. For segment j, recall from Sect. 3.2 that we assume the latent variable density parameter \(\theta _{j}\) may be marginalised, and thus the segment parameters consist of the order of dependence \(m_{j}\) and the initial latent variables \(\gamma _{ j,:} \equiv \gamma _{ j,1:m_j} = (\gamma _{j,1}, \ldots , \gamma _{j,m_j})\). To explore the support of the target distribution, we propose the next element of the chain via one of the following moves.

5.1.1 Shift move

The shift move proposes to modify the position of one randomly chosen changepoint. The index j is uniformly chosen from \(\{1, \ldots , k \}\), and a new position \(\tau _j^{\prime }\) is uniformly sampled from \(\{\tau _{j-1}+1, \ldots , \tau _{j+1}-1 \}\). The parameters \(m_j\), \(m_{j+1}\) and \(\gamma _{j,:}\) are not modified and we replace \(\gamma _{j+1,:}\) by \(\gamma _{j+1,:}^{\prime } = S^{u}(\gamma _{j+1,:})\), with \(u = \tau _j^{\prime } - \tau _j\).

As noted in Sect. 4.2, when the support of \(f_m(\cdot \, |\, \theta )\) is bounded, the move may be rejected because the updated latent variables are invalid: If the length of the j-th segment is reduced by the shift move, i.e. \(\tau _j^{\prime } - \tau _j\) is negative, then it is guaranteed that \(\gamma _{j,:}\) remains a valid sequence of initial latent variables but not that \(\gamma _{j+1,:}^{\prime }\) is valid for the extended \((j+1)\)-th segment; on the other hand, if \(\tau _j^{\prime } - \tau _j\) is positive then the sequence \(\gamma _{j+1,:}^{\prime }\) is valid but it must be checked that \(\gamma _{j,:}\) is valid for the extended j-th segment.

5.1.2 Sampling a segment parameter

A segment j is uniformly chosen amongst the \(k+1\) segments, and the corresponding segment parameters are changed: either the initial latent variables are sampled conditional on the order of dependence, which is left unchanged; or the order of dependence is sampled, and the initial latent variables must then be adapted. Here, the focus is on one segment only, and therefore we temporarily drop the segment index j from the notation as in Sect. 4, and the data observed within the segment are denoted by \(x_1, \ldots , x_n\), where \(n = \tau _j - \tau _{j-1}\).

5.1.3 An approximation to the posterior distribution of \(\theta \)

First, we consider an approximation to the posterior distribution of \(\theta \) that will be useful to build proposals for m and \(\gamma _{1:m}\). In the absence of knowledge of m and \(\gamma _{1:m}\), and motivated by computational considerations, it is convenient to consider the posterior distribution of \(\theta \) conditional on \(m=0\). When the data are assumed to be exchangeable, by conjugacy of the prior for \(\theta \), the posterior distribution of \(\theta \) is tractable,

$$\begin{aligned} {\hat{\pi }}( \theta \, | \, x_{1:n} ) = \pi (\theta \, | \, m=0, x_{1:n} ). \end{aligned}$$

Based on this approximation, a natural estimator for \(\theta \) is

$$\begin{aligned} {\hat{\theta }} = \text {arg max}_{\theta } \, {\hat{\pi }}( \theta \, | \, x_{1:n} ). \end{aligned}$$
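For instance, under the negative binomial segment model of Example 3, conjugacy with \(m=0\) gives a Beta posterior, whose mode provides \({\hat{\theta }}\); a sketch, assuming the parameterisation in which the likelihood contributes \(\theta ^{nr}(1-\theta )^{\sum _t x_t}\) (with the opposite convention the two arguments swap):

```python
def theta_hat_negbin(x, r, alpha, beta):
    """MAP of theta given m = 0 for Example 3: the posterior is
    Beta(alpha + n r, beta + sum(x)); return its mode."""
    a = alpha + len(x) * r
    b = beta + sum(x)
    return (a - 1.0) / (a + b - 2.0)   # Beta mode, assuming a, b > 1
```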

5.1.4 Updating \(\gamma _{1:m}\) conditional on m

The move consists of proposing \(\gamma _{1:m}^\prime \) conditional on m and \(\gamma _{1:m}\). Recall from Sect. 4.1 that, conditional on the observed data, the support of the initial latent variables is \({\mathcal {Y}}_m\), and note that \(\gamma _{1:m} \in {\mathcal {Y}}_m\) if and only if, for all \(r=1, \ldots , m\), \(\gamma _r \in {\mathcal {Y}}_m^r\) where

$$\begin{aligned} {\mathcal {Y}}_m^r = \left\{ \gamma _r \in {\mathcal {F}} \, | \, L^{m}_{r} \leqslant \gamma _r \leqslant U^{m} - \sum _{i \ne r} \gamma _{i} \right\} , \end{aligned}$$

if \({\mathcal {F}}\) is bounded and \({\mathcal {Y}}_m^r = {\mathcal {F}}\) if \({\mathcal {F}}\) is unbounded.

We consider two distinct scenarios based on the nature of the latent variables. If the latent variables are discrete valued, then \(\gamma _{1:m}^\prime \) are proposed via Gibbs sampling. It follows from the discussion in Sect. 3.2 that, for all \(r=1, \ldots , m\), the full conditional distribution of \(\gamma _r\) is

$$\begin{aligned} \pi ( \gamma _r | x_{1:n}, \gamma _{-r}, m) \propto L( x_{1:n}, \gamma _{1:m} | m) \mathbb {1}_{ {\mathcal {Y}}_m^{r} }( \gamma _r), \end{aligned}$$

where \(\gamma _{-r} = (\gamma _{1}, \ldots ,\gamma _{r-1}, \gamma _{r+1}, \ldots , \gamma _{m})\). If the domain of the latent variables is continuous, then Gibbs sampling is not possible in general; instead, for all \(r=1, \ldots , m\), sample \(\gamma _r^\prime \) from the distribution with step function density

$$\begin{aligned} q(\gamma _r | x_{1:n}, \gamma _{-r}, m) \propto \sum _{i=1}^{N-1} L( x_{1:n}, \gamma _{1:m}^{(i)} \, | \, m) \mathbb {1}_{[\gamma ^{(i)}, \gamma ^{(i+1)})}( \gamma _r), \end{aligned}$$
(14)

where \(\gamma _{1:m}^{(i)}\) denotes \(\gamma _{1:m}\) with \(\gamma _r\) replaced by \(\gamma ^{(i)}\), and \(\gamma ^{(1)}< \cdots < \gamma ^{(N)}\) form an equally spaced grid on the largest interval \({\mathcal {Y}}_m^{r*} \subseteq {\mathcal {Y}}_m^r\) satisfying

$$\begin{aligned} \int _{ {\mathcal {Y}}_m^{r*}} f_m(\gamma _r \, | \, {\hat{\theta }}) \, \text {d} \gamma _{r} \leqslant \eta . \end{aligned}$$

The greater \(N>1\) and \( 0< \eta < 1\), the more accurate the step function approximation of the full conditional distribution of \(\gamma _r^\prime \) in (14). The tuning parameters N and \(\eta \) can be chosen via pilot runs investigating the trade-off between precision and computational cost.
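A sketch of the resulting grid proposal for a single \(\gamma _r\); here log_lik is a placeholder closure returning \(\log L(x_{1:n}, \gamma _{1:m} \, | \, m)\) as a function of \(\gamma _r\), for instance built from log_marginal_normal above.

```python
import numpy as np

def propose_gamma_r(grid, log_lik, rng):
    """Draw gamma_r from the step-function density (14): weight each grid
    cell by the likelihood at its left endpoint, then sample uniformly
    within the chosen cell."""
    logw = np.array([log_lik(g) for g in grid[:-1]])
    w = np.exp(logw - logw.max())            # stabilised unnormalised weights
    i = rng.choice(len(w), p=w / w.sum())
    return rng.uniform(grid[i], grid[i + 1])
```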

5.1.5 Updating m and \(\gamma _{1:m}\)

When m is replaced by some \(m^{\prime }\), it is necessary that \(m^{\prime } \in {\mathcal {M}}\) for the move to have a positive probability of being accepted, and we must propose some revised initial latent variables \(\gamma ^{\prime }_{1:m^{\prime }}\).

To sample \(m^{\prime }\), whose full conditional distribution is not tractable in general, we consider a proposal distribution that relies on the following observations: the joint likelihood of the jumps \(( \nabla x_t)\) defined in (3) is not tractable in general due to the dependence of the jumps; yet, the jumps are identically distributed with mean \(E[ \nabla x_t ] = 0\) and variance \(V[ \nabla x_t ] = 2 g(\theta , m)\) for some function g that depends on the marginal distribution of the latent variables \(f_m(\cdot \, |\, \theta )\); and therefore the approximation of the likelihood of the jumps

$$\begin{aligned} {\hat{L}}( \nabla x_{1:n} | {\hat{\theta }}, m) = \prod _{t} \phi \left( \nabla x_t | 0, 2 g({\hat{\theta }}, m) \right) , \end{aligned}$$

where \(\phi (\cdot \, | \, \mu , \sigma ^2)\) is the density function of the normal distribution with mean \(\mu \) and variance \(\sigma ^2\), is tractable and depends on m. The proposed order of dependence \(m^\prime \) is sampled from the distribution with probability mass function

$$\begin{aligned} q(m^\prime | \nabla x_{1:n} ) \propto {\hat{L}}( \nabla x_{1:n} | {\hat{\theta }}, m^\prime ) \pi (m^\prime ) \mathbb {1}_{ {\mathcal {M}} }( m^\prime ) . \end{aligned}$$

Then, according to Sect. 5.1.2, \(\gamma _{1:m^{\prime }}^\prime \) is proposed conditional on \(m^\prime \) and \(\gamma _{1:m^\prime }^{*}\), where

$$\begin{aligned} \gamma _{1:m^\prime }^{*} = \text {arg max}_{ \, \gamma _{1:m^\prime } \in {\mathcal {Y}}_{m^\prime } } \prod _{r} f_{m^\prime } \left( \gamma _r | {\hat{\theta }} \right) \end{aligned}$$

is an estimator of the initial latent variables in \({\mathcal {Y}}_{m^\prime }\) that can be derived efficiently but does not take into account the dependence of the data.
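For the normal segment model of Example 1, one can check directly that \(V[ \nabla x_t ] = 2 \sigma ^2 / {\bar{m}}\), that is \(g(\theta , m) = \sigma ^2/{\bar{m}}\). The sketch below (our naming) implements the proposal for \(m^\prime \) in this case; the set M_set of valid candidate orders is assumed precomputed, for example with the bounds helper of Sect. 4.1, or truncated to a maximal order when the support is unbounded.

```python
import numpy as np
from scipy.stats import norm, geom

def propose_m(x, sigma_hat, rho, M_set, rng):
    """Sample m' from q(m' | grad x): normal approximation of the jump
    likelihood times the geometric prior, restricted to valid orders."""
    dx = np.diff(x)                              # jumps, t = 2,...,n
    M_set = list(M_set)
    logq = np.array([
        norm.logpdf(dx, 0.0, np.sqrt(2 * sigma_hat ** 2 / (mp + 1))).sum()
        + geom.logpmf(mp + 1, rho)               # pi(m') = rho (1 - rho)^{m'}
        for mp in M_set])
    w = np.exp(logq - logq.max())
    return M_set[rng.choice(len(M_set), p=w / w.sum())]
```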

5.1.6 Death and birth moves

If a death move is proposed, an index j of one element of \(\tau _{1:k}\) is uniformly chosen, and the corresponding changepoint is removed, resulting in \(k^{\prime } = k-1\) changepoints with positions \(\tau _{1:k^{\prime }}^{\prime } = \left( \tau _{1},\ldots , \tau _{j-1}, \tau _{j+1}, \ldots , \tau _{k} \right) \). The parameters corresponding to the segments that are not impacted by the move are re-indexed but kept unchanged, and it is natural to propose the parameters for the j-th segment resulting from \(\tau _{1:k^{\prime }}^{\prime }\), namely \(( m^{\prime }_j, \gamma _{j,:}^{\prime } )\), based on the parameters of either the original j-th or \((j+1)\)-th segment. Specifically, let i be either the index j or \(j+1\) with probability proportional to the length of the segment with index i, and then set \(m_j^{\prime }\) to \(m_i\), and \(\gamma _{j,:}^{\prime }\) to \(S^{(\tau _j - \tau _i)}(\gamma _{i,:})\). Note that the death move may then be seen as the extension of one of the segments on either side of the deleted changepoint.

The above death move may be reversed by the following birth move. Draw \(\tau _j^{\prime }\) uniformly from \(\{2, \ldots , T\} \setminus \tau _{1:k}\), and obtain \(\tau _{1:k^{\prime }}^{\prime }\) by inserting \(\tau _j^{\prime }\) into \(\tau _{1:k}\) at the j-th position, resulting in \(k^{\prime } = k + 1\) changepoints. Let i be either the index j or \(j+1\) with probability proportional to \(\tau _{i}^\prime - \tau _{i-1}^\prime \). Set \(m_i^{\prime }\) to \(m_j\) and \(\gamma _{i,:}^{\prime }\) to \(S^{(\tau _i^{\prime } - \tau _j^{\prime })}(\gamma _{j,:} )\), and finally propose the segment parameters of the new segment using the approach given in Sect. 5.1.2.

5.2 Sampler initialisation

To speed up the convergence of the sampler for the moving-sum changepoint model, the sample chain is initialised as follows: the changepoint parameters are set to be the changepoint estimates corresponding to the standard changepoint model; and, for each segment, the order of dependence is set to 0. Hence, the sampler begins with a sensible positioning of the changepoints obtained at a limited computational cost.

5.3 Parameter estimation

To give an account of the posterior distribution of changepoints, following Green (1995), it is natural to consider the posterior marginal distribution of the number of changepoints k, the posterior distribution of the changepoint positions \(\tau _{1:k}\) conditional on k, and the posterior distribution of the orders of dependence \(m_{1:k+1}\) conditional on \((k, \tau _{1:k})\). However, for some applications such as cyber security where timely investigations rely on succinct representations of the most probable changepoints, it will also be of interest to report a point estimate \(({\hat{k}}, {\hat{\tau }}_{1:{\hat{k}}}, {\hat{m}}_{1:{\hat{k}}+1})\) for the parameters \((k, \tau _{1:k}, m_{1:k+1})\). In this article, the point estimate \(({\hat{k}}, {\hat{\tau }}_{1:{\hat{k}}}, {\hat{m}}_{1:{\hat{k}}+1})\) is defined as follows: \({\hat{k}}\) is the MAP number of changepoints; \({\hat{\tau }}_{1:{\hat{k}}}\) are the MAP changepoint positions of dimension \({\hat{k}}\); and \({\hat{m}}_{1:{\hat{k}}+1}\) are the MAP orders of dependence conditional on MAP changepoint parameters \(({\hat{k}}, {\hat{\tau }}_{1:{\hat{k}}})\).

5.4 Validating the estimated orders of dependence

Appendix C details an importance sampling approach to estimating the posterior distribution of the order of dependence within a segment. Conditional on MAP changepoint parameters obtained via MCMC, it can be used to validate the MAP orders of dependence.

6 Simulation study

This section describes a simulation study that demonstrates the benefits of the moving-sum changepoint model in comparison to changepoint models that assume the data are exchangeable within segments, namely the standard Bayesian changepoint model (2) and PELT (Killick et al. 2012), and models that detect abrupt mean changes in normal data subject to local fluctuations and autocorrelated noise, namely DeCAFS (Romano et al. 2021) and WCMgSa (Cho and Fryzlewicz 2021).

6.1 Synthetic data

Different scenarios were assumed to sample time series of length \(T \in \{1200, 3200, 5200, 7200, 9200 \}\) from the moving-sum changepoint model with segment model defined in Example 1: within each segment, the data are m-dependent and marginally normally distributed with mean \(\mu \) and variance \(\sigma ^2\) for some segment specific parameters m, \(\mu \) and \(\sigma ^2\). Different scenarios were considered for the number of changepoints \(k\in \{0, 1, 3, 5, 7, 9\}\), and changepoint positions \(\tau _{1:k}\) were specified such that, for all \(j=1, \ldots , k\), \( \tau _j = \lfloor jT/(k+1)\rfloor + u_j \) for some \(u_j\) drawn uniformly from \(\{- \lfloor T/(2k+2) \rfloor , \ldots , \lfloor T/(2k+2) \rfloor \}\), so that segment lengths may be uneven. The orders of dependence for each segment were sampled independently from \(\text {Geometric}(\upsilon )\) for some \(0 < \upsilon \leqslant 1\); let \({\bar{\upsilon }} = (1-\upsilon )/\upsilon \) denote the expected order of dependence for each segment. The segment mean parameters were set such that, for all \(j=1, \ldots , k+1\), \(\mu _j = \mu \) if j is odd and \(-\mu \) otherwise, for some \(\mu \in {\mathbb {R}}\). The segment parameters \(\sigma ^{-2}_1, \ldots , \sigma ^{-2}_{k+1}\) were sampled independently from \(\text {Gamma}(\alpha _0, 100)\) for some \(\alpha _0>0\). A grid of parameters such that \(\upsilon \in \{1, 0.33, 0.2, 0.1, 0.06\}\), \(\mu \in \{0.5, 1, 2, 4, 8\}\) and \(\alpha _0 \in \{1, 5, 10, 25, 50, 75\}\) was considered for the experiments. For each scenario, 10 simulations were performed.

6.2 Model comparison

For each simulation, different models were used to infer changepoint estimates from the data: the moving-sum changepoint model assuming different hyperparameters, the standard changepoint model (2), DeCAFS, PELT with different penalties, and WCMgSa.

For the moving-sum changepoint model, the data within segments are assumed to follow the model given in Example 1 with \(\alpha = \alpha _0\), \(\beta = 100\) and \(\lambda = 0.05 \beta / \alpha \); and the prior for the orders of dependence is assumed to be \(\text {Geometric}(\rho )\) for \(\rho \in \{0.01, 0.1, 0.5\}\). Ten independent sample chains of size 20000, after a burn-in of size 10000, were obtained via the MCMC algorithm described in Sect. 5 with tuning parameters \(N=100\) and \(\eta =0.99\) for the proposal of the initial latent variables (14). Changepoint estimates obtained from each independent sample chain, as described in Sect. 5.3, will be compared to assess the convergence of the sampler. Moreover, for the standard Bayesian changepoint model, changepoint estimates are obtained as for the moving-sum changepoint model with the only difference that, for each segment, we fix the order of dependence \(m=0\), ensuring the data are assumed to be exchangeable. For PELT, we used the default implementation in the R package changepoint (Killick and Eckley 2014) with the MBIC penalty rescaled by a constant \(\phi \in \{0.5, 1, 1.5\}\); as the penalty increases, more evidence is required from the data to flag changepoints. Finally, for DeCAFS and WCMgSA, changepoints were estimated with the default implementations described in Romano et al. (2021) and in Cho and Fryzlewicz (2021), respectively.

To compare changepoint estimations, we use the F1 score. A changepoint \(\tau \) is said to be detected if there is an estimated changepoint \({\hat{\tau }}\) such that \(|\tau - {\hat{\tau }}| \leqslant \epsilon \) for some error tolerance of size \(\epsilon = 5\). Given some changepoints, the F1 score of the changepoint estimates is

$$\begin{aligned} F1 = \frac{2 P R}{ P + R} \in [0, 1], \end{aligned}$$

where R and P denote the recall, the proportion of simulated changes that are detected, and the precision, the proportion of detected changes that are correct, respectively. The greater the F1 score, the better the estimation. To compare the changepoint models of interest, for each simulation the F1 score was computed for each estimation of the simulated changepoints \((k, \tau _{1:k})\).
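The matching rule translates directly into code; a short sketch (names ours):

```python
def f1_score(true_cps, est_cps, eps=5):
    """F1 score of estimated changepoints against simulated ones, with a
    detection tolerance of eps time points."""
    if not true_cps and not est_cps:
        return 1.0
    recall = (sum(any(abs(t - e) <= eps for e in est_cps) for t in true_cps)
              / len(true_cps)) if true_cps else 0.0
    precision = (sum(any(abs(e - t) <= eps for t in true_cps) for e in est_cps)
                 / len(est_cps)) if est_cps else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```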

6.3 Sampler convergence

Recall from Sect. 6.2 that ten independent sample chains for the moving-sum changepoint model were obtained via the proposed MCMC algorithm for each time series analysed. Figure 6 displays the variance of the MAP number of changepoints obtained for the moving-sum model computed using the first M elements of the sample chains for a collection of scenarios; results are similar for other scenarios. As M increases, the variances tend to 0, suggesting the samplers converge; a burn-in of size 10000 appears adequate for all scenarios considered. The rate of convergence tends to decrease with the number of observations T and with the expected order of dependence \({\bar{\upsilon }}\) for each segment, but tends to increase with the number of changepoints k.

Moreover, to illustrate the benefits of the sampler initialisation proposed in Sect. 5.2, we assess the convergence of the sampler for one randomly selected simulation with \(k=7\). Following Sect. 5.2, the ten independent sample chains for the moving-sum changepoint model were initialised with the changepoint estimates corresponding to the standard changepoint model. For comparison purposes only, twenty extra independent sample chains, initialised with randomly selected changepoint parameters such that \(0 \leqslant k \leqslant 20\), were computed for the moving-sum model, as described in Sect. 6.2. Figure 7 displays the thirty independent sample chains for the number of changepoints k. All sample chains converge to the same number of changepoints, namely \(k=7\), and the ten chains corresponding to the proposed initialisation, indicated in blue, converge faster, illustrating the benefits of the proposed sampling strategy.

Fig. 6
figure 6

Variance of the MAP number of changepoints for the moving-sum model computed using the first M elements of sample chains obtained via the proposed MCMC algorithm for varying number of observations T, number of changepoints k and expected order of dependence \({\bar{\upsilon }}\) for each segment

Fig. 7
figure 7

Thirty independent sample chains for the number of changepoints, obtained via the MCMC algorithm for the moving-sum changepoint model, for one selected time series with \(k=7\). In blue: ten sample chains initialised with the proposed initialisation strategy. In grey: twenty sample chains initialised with randomly selected changepoint parameters such that \(0 \leqslant k \leqslant 20\)

Fig. 8
figure 8

Average runtime in seconds per iteration for the MCMC algorithm for the moving-sum changepoint model for varying number of observations T, number of changepoints k and expected order of dependence \({\bar{\upsilon }}\) for each segment

Fig. 9
figure 9

Average F1 scores of changepoint estimates for the changepoint models of interest (STD for the standard changepoint model (2); MV\(\rho \) for the moving-sum changepoint model with hyperparameter \(\rho \) for the orders of dependence; PELT\(\phi \) for PELT with the MBIC penalty scaled by \(\phi \); WCM for WCMgSA; and DeCAFS) for a collection of scenarios considered in the study

6.4 Runtime

The moving-sum changepoint model is computationally more demanding than the standard changepoint model because orders of dependence and initial latent variables must also be proposed. For our Python implementations, which could be optimised, the runtime is on average 8 times greater for the moving-sum model than for the standard changepoint model, although this ratio could vary greatly across implementations. Here it is of interest to demonstrate how runtimes are affected by different model parameters. Figure 8 displays the average runtime of an iteration of the MCMC algorithm for a collection of scenarios for the moving-sum changepoint model. The runtime increases linearly with the number of observations T and with the expected order of dependence \({\bar{\upsilon }}\) for each segment, but does not tend to increase with the number of changepoints k. Hence, settings with large segments where the order of dependence is high may be computationally demanding.

6.5 Results

Figure 9 displays the average F1 scores of the changepoint models of interest for different scenarios considered in the study; the corresponding average precisions and recalls are displayed in Figs. 11 and 12. For each scenario, the moving-sum model outperforms the other changepoint models of interest, and similar results for each \(\rho \) demonstrate the model is robust to the choice of \(\rho \). A decrease in performance can be observed when \(\mu \) is low, that is, when the magnitude of mean changes is small; when \(\alpha _0\) is low, meaning the variance of the data is large; and when \(\upsilon \) is low, that is, when the mean and variance of the orders of dependence are high.

The standard changepoint model and PELT assume data are exchangeable within segments, and consequently their segmentations of the simulated data are not satisfactory. In particular, F1 scores decrease as the orders of dependence increase, because more changepoints are fitted than necessary. For PELT, F1 scores tend to increase with the penalty, because more evidence is required from the data to flag changepoints. For the standard changepoint model, F1 scores are satisfactory when \(\upsilon =1\), that is, when the simulated data are exchangeable within segments.

The performances of DeCAFS and WCMgSA, which allow data to be non-exchangeable within segments, tend to be relatively good when \(k=0\), but deteriorate as the number of segments increases. In particular, for \(k>0\), F1 scores decrease as \(\upsilon \) or \(\alpha _0\) decrease, because DeCAFS and WCMgSA assume that the dependence structure of the data is the same for each segment and that the mean of the data, but not their variance, changes from one segment to the next.

These results demonstrate that the moving-sum changepoint model can capture various temporal dynamics of m-dependent data within segments that challenge the other changepoint models considered in the simulation study.

Fig. 10
figure 10

Daily prices of the stock SBRY.L from \(20^{\text {th}}\) February 2006 to \(16^{\text {th}}\) November 2010. Vertical lines indicate estimated changepoints for three models: moving-sum, standard and DeCAFS changepoint models. Numbers in red indicate the MAP order of dependence m for each segment for the moving-sum changepoint model

7 Applications

Two applications are considered to demonstrate the benefits of the proposed changepoint model for non-exchangeable data: computer network monitoring via change detection in count data, and detection of breaks in daily prices of a stock.

For each application, the moving-sum changepoint model is compared with the standard changepoint model for exchangeable data within segments (2). For each model, changepoint estimates from five independent samples of size \(50 \, 000\) were obtained via the MCMC algorithm proposed in Sect. 5, with a burn-in of \(10 \, 000\) iterations; changepoint estimates were identical for all independent chains, suggesting the sampler converged. Moreover, since DeCAFS is not suitable for count data (Romano et al. 2021), the moving-sum model is compared with DeCAFS for the daily stock prices only.

7.1 Change detection in enterprise-wide computer network traffic

A cyber-attack typically changes the behaviour of the target network. Therefore, to detect the presence of a network intrusion, it can be informative to monitor for changes in computer network traffic.

Turcotte et al. (2017) presents a data set summarising 90 days of network events collected from the Los Alamos National Laboratory enterprise network, which is available online at http://lanl.ma.ic.ac.uk/data/2017/. Each recorded network event gives the start time and the duration of a transfer of packets from one network device to another. In addition, a destination port is associated with each network event, which describes the purpose of the transfer of packets: for example, web, email, remote login or file transfer. For the purpose of this article, events that do not correspond to the 100 most recurrent destination ports in the data were discarded, thereby restricting the analysis to the most common network activities.

It can be informative to monitor for temporal changes in counts of network events. For demonstration purposes, we consider the data \(x_{1}, \ldots , x_{T}\) where, for all t, \(x_t\) denotes the number of network events that are in progress across the network during the t-th second between 10:00 and 10:20 on day 22 of the data collection period. The data are displayed in Fig. 1. It is of interest to detect temporal changes in the distribution of \(x_{1}, \ldots , x_{T}\).
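
For concreteness, counts of this form can be computed from the raw event records with a sketch such as the one below; how events overlapping the boundaries of a second are attributed is a simplifying convention of ours, and the variable names are illustrative.

```python
import numpy as np

def events_in_progress(starts, durations, T):
    """x[t-1] = number of events in progress during second t, t = 1, ..., T.

    starts are measured in seconds from the beginning of the window of
    interest; an event with start s and duration d is counted as in progress
    during seconds s, ..., s + d (a simplifying convention).
    """
    diff = np.zeros(T + 2, dtype=int)  # difference array, 1-indexed
    for s, d in zip(starts, durations):
        lo, hi = max(1, int(s)), min(T, int(s + d))
        if lo <= hi:
            diff[lo] += 1      # event becomes active
            diff[hi + 1] -= 1  # event no longer active
    return np.cumsum(diff)[1:T + 1]
```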

Two models are used to estimate changepoints for the network data: the moving-sum and the standard changepoint model. For each changepoint model, the segment model for negative binomial data defined in Example 3 is assumed with \(\alpha =3\), \(\beta =1\) and \(r=3000\). For each segment, \(m \sim \text {Geometric}(0.1)\) for the moving-sum model, and \(m=0\) for the standard changepoint model.

Figure 1 displays the changepoint estimates for each model, and the MAP segment orders of dependence for the moving-sum changepoint model. No labels are available to indicate the optimal segmentation for these data, but we argue that the moving-sum model gives a more suitable segmentation than the standard changepoint model. Both changepoint models detect clear discontinuities, such as the ones observed near the 420th and 880th s, which correspond to behavioural changes of the network that are sufficiently drastic that cyber-analysts may want to investigate whether they correspond to malicious activities. However, local temporal correlations and small fluctuations, such as the ones between the 100th and 380th s, and the one between the 500th and 550th s, which is characterised by a small decrease quickly followed by a symmetric increase in the mean, give rise to changepoints for the standard model but not for the moving-sum changepoint model. Such local temporal correlations and small fluctuations may be caused by normal temporal dynamics of the network that are of less interest to cyber-analysts, for example routine transfers of large volumes of data, which are divided into batches and subject to routing and queueing steps that may induce temporal correlations in traffic volumes. Hence, the proposed changepoint model results in a segmentation of the data that is better suited to network monitoring.

7.2 Change detection in financial time series

For economists and investors, it can be of interest to detect changepoints in financial time series, such as daily prices of a stock. Changes can be monitored to assess the impact of economic policies, or can indicate shifts in market behaviours. It is often reductive to assume the data are exchangeable within segments.

For demonstration purposes, this article considers the price of the stock Sainsbury plc (SBRY.L). For all t, let \(x_t\) be the closing price of the stock SBRY.L on the t-th day from \(20^{\text {th}}\) February 2006 to \(16^{\text {th}}\) November 2010. The data \(x_1, \ldots , x_{T}\), which are available online at finance.yahoo.com, are displayed in Fig. 10. The stock data are subject to changes that may be associated with takeover bids and the 2008 financial crisis: on 2nd February 2007, major private equity groups revealed they were considering a bid for Sainsbury plc; on 5th November 2007, takeover bids were abandoned in the wake of the global credit crunch; November 2008 marked the end of a period of major uncertainty in financial markets; in August 2009, major decisions to develop the business were announced, including acquisitions of new shops and a substantial increase in workforce; and, on 7th July 2010, a possible takeover bid by Qatari investors was announced.
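
The series can be retrieved with, for example, the third-party Python package yfinance, an assumption of convenience rather than a tool used in this article; any interface to the Yahoo Finance data would do.

```python
import yfinance as yf

# Daily closing prices of Sainsbury plc over the window analysed;
# the end date is exclusive, so 16th November 2010 is included.
x = yf.download("SBRY.L", start="2006-02-20", end="2010-11-17")["Close"]
```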

Three models are considered for the data: the moving-sum changepoint model, the standard changepoint model, and DeCAFS using an inflated BIC beta penalty as suggested in Romano et al. (2021). For the moving-sum and the standard changepoint models, the segment model for normal data defined in Example 1 is assumed with \(\alpha =7\), \(\beta =1\), \(\mu _0=350\) and \(\lambda =1\). For each segment, \(m \sim \text {Geometric}(0.1)\) for the moving-sum model, and \(m=0\) for the standard changepoint model.

Figure 10 displays the changepoint estimates for each model of interest. All models detect the changes associated with the takeover bids on 2nd February 2007 and 5th November 2007. In contrast with the standard changepoint model, the moving-sum changepoint model captures temporal dynamics of the stock price, so that market trends are not unnecessarily segmented. For example, smooth drifts of the stock price, such as the ones before 2nd February 2007 and between 5th November 2007 and 26th November 2008, give rise to multiple changepoints for the standard model but not for the moving-sum model. Moreover, the proposed model detects changes in the mean as well as changes in the level of dependence of the data, so that it identifies both shifts in price levels and changes in the temporal dynamics of prices, which are both of interest to financial analysts. In contrast with DeCAFS, the moving-sum changepoint model detects the change associated with the takeover bid on 7th July 2010 and, perhaps less importantly, the change in August 2009 associated with business developments after the financial crisis. The estimated orders of dependence for the moving-sum model vary greatly across segments, suggesting prices have been subject to distinct market dynamics. In particular, the order of dependence decreases drastically in November 2008, marking the end of strong market dynamics associated with the 2008 financial crisis, and it increases again in July 2010 as the announcement of a possible takeover bid leads to increased market activity. The MAP orders of dependence obtained via MCMC were validated, conditional on the MAP changepoint parameters, with the method proposed in Sect. 5.4: estimates are identical for each segment except the second, for which the importance sampling approach gives 39 instead of 38.

8 Discussion

Standard changepoint models, which typically assume data are exchangeable within each segment, may be inadequate when data are subject to local temporal dynamics, often resulting in many more changepoints being fitted than is preferable. This article proposes a novel changepoint model that relaxes the assumption of exchangeable data within segments, reducing the excess fitting of changepoints.

The proposed model supposes that within each segment the data are m-dependent and identically distributed conditional on some parameter \(\theta \), where both \(\theta \) and \(m \geqslant 0\) are unknown and vary between segments. In contrast with most changepoint models that are robust to non-exchangeable data, it is also suitable for count data. The benefits of the model are demonstrated on both synthetic and real data.

Future work could consider a model extension relaxing the assumption that both m and \(\theta \) change at each changepoint: changepoint parameters could be augmented with indicator vectors specifying, for each changepoint, whether only m, only \(\theta \), or both change. The MCMC algorithm would need to be adapted to sample these indicator vectors alongside the changepoint parameters.

Future work could also consider building alternative segment models that are robust to other possible dependence structures within segments.