1 Introduction

Segmented regression is a standard tool in many fields, including epidemiology (Ulm 1991), occupational medicine, toxicology, ecology, biology (Betts et al. 2007), and more recently, higher education (Li et al. 2019; Priulla et al. 2021).

Segmented or broken-line models are regression models where the relationships between the response and one or more explanatory variables are piecewise linear, namely represented by two or more straight lines connected at unknown values. These values are commonly referred to as changepoints or breakpoints. The main advantage of these models lies in the interpretability of the results, while retaining much of the flexibility typically achieved by non-parametric approaches.

This paper deals with the task of selecting the number of changepoints in segmented regression models, a topic widely discussed by many authors, such as Lerman (1980) and Kim et al. (2000). This is a common problem in segmented regression models: if the number of changepoints is too small, the model may not capture all the changes in the relationship between the variables, resulting in bias and reduced model fit. On the other hand, if the number of changepoints is too large, the model may overfit the data and generalize poorly to new data.

Other common problems in segmented regression models include the identification of the location of the changepoints in a segmented variable and the estimation of their effects on the response variable, to cite a few. For instance, Horváth et al. (2004) propose two classes of monitoring schemes to (sequentially) detect a structural change in a linear model. Aue et al. (2006) develop an asymptotic theory for two monitoring schemes aimed at detecting a change in the regression parameters, showing that they have correct asymptotic size and detect a change with probability approaching unity. Then, Chen et al. (2011) deal with two problems concerning the location of changepoints in a linear regression model, namely one involving jump discontinuities (not covered in this paper) and one involving regression lines connected at unknown points. The latter is the main framework covered in our work. Muggeo and Adelfio (2011) present a computationally efficient method to obtain estimates of the number and location of the changepoints in genomic sequences or, more generally, in mean-shift regression models. Adelfio (2012) introduces a new approach based on the fit of a generalized linear regression model for detecting changepoints in the variance of heteroscedastic Gaussian variables with piecewise constant variance function, and D’Angelo et al. (2022) extend this approach to detect changepoints in the variance of multivariate Gaussian variables, allowing simultaneous detection of changepoints in functional time series.

Moving to the main topic of our work, several approaches have been proposed in the literature to select the optimal number of changepoints in segmented regression models. One common approach is to use information criteria, such as the Akaike Information Criterion (AIC) (Akaike 1974) or the Bayesian Information Criterion (BIC) (Schwarz 1978), which balance the model’s goodness of fit against its complexity. These criteria penalize models with more parameters, which can prevent overfitting.

Another approach is to use cross-validation (Zou et al. 2020; Pein 2023), which involves splitting the data into training and validation sets, fitting models with different numbers of changepoints to the training set, and then selecting the number of changepoints that minimizes the prediction error on the validation set. Cross-validation can be computationally intensive, but it can also provide a more accurate estimate of predictive performance than the AIC or BIC.

A further approach is hypothesis testing to determine the significance of adding a changepoint to the model. Typically, this consists of performing different hypothesis tests starting from testing \({\mathcal {H}}_0 : K_{0} = 0\) vs \({\mathcal {H}}_1 : K_{0} = K_{max}\), where \(K_0\) is the true number of changepoints and \(K_{max}\) is the maximum number of potential changepoints fixed a priori, as done by Kim et al. (2000). However, this well-established procedure, which requires sequentially testing for the existence of a changepoint, does not allow testing for an arbitrary number of additional changepoints.

In light of this, this paper proposes a novel sequential hypothesis testing procedure that overcomes this problem, with the advantage of not being limited to testing for a maximum number of additional changepoints fixed a priori. Starting from the work of Kim et al. (2000), we enhance their method with a sequential testing scheme to identify the correct number of changepoints.

First, we provide an overview of segmented regression models and a review of the main tools useful for selecting the number of changepoints. Regarding the information-based criteria, we consider the AIC, the BIC and the generalized Bayesian Information Criterion (gBIC). As regards hypothesis testing, we consider Davies’ test (Davies 1977) and the Score test (Muggeo 2016). The performance of the different tools and of our proposed procedure is then assessed through simulation studies. Finally, to explore the applicability of the considered framework, two original applications are proposed: the first one deals with the crime events that occurred in Valencia in 2019, available from the stopp package (D’Angelo and Adelfio 2023) of the software R (R Core Team 2023); the second one deals with global temperature anomalies from the NOAA Merged Land Ocean Global Surface Temperature Analysis Data set (Smith et al. 2008). All the analyses are performed using the segmented package (Muggeo 2008) in the R statistical software (R Core Team 2023), together with original code from the authors.

The structure of the paper is as follows. Section 2 introduces the segmented regression model and Section 3 reviews suitable criteria for model selection in this context. Section 4 illustrates our proposal. Section 5 presents simulations to study the performance of the given criteria, and Section 6 proposes two applications dealing with crime events in Valencia and with global temperature anomaly data. The paper ends with conclusions in Section 7. The Appendix contains supplementary material in support of the experiments.

2 Background on segmented regression models

The segmented linear regression model is expressed as

$$\begin{aligned} g({\mathbb {E}}[Y|x_i,z_i])= \beta _0 +\theta z_i+\beta _1 x_i+\sum _{k=1}^{K_0}\delta _k (x_{i}-\psi _k)_+ \end{aligned}$$
(1)

where g is the link function, x is a broken-line covariate and z is a covariate whose relationship with the response variable is not broken-line. Multiple covariates of either type can be accommodated, but we limit our study to a single covariate of each kind. We denote by \(K_0\) the true number of changepoints and by \(\psi _k\) the \(K_0\) locations of the changes in the relationship, which we call, from now on, changepoints. These are selected among all the possible values in the range of x. The notation \((x_i-{\psi }_k)_+\) is to be read as \((x_i-{\psi }_k)I(x_i>{\psi }_k)\). The coefficient \(\theta\) represents the non-broken-line effect of z, \(\beta _1\) represents the effect of x when \(x_i < \psi _1\), while \(\varvec{\delta }= \{\delta _k\}_{k=1}^{K_0}\) is the vector of the differences in the effects.

The basic statistical problem dealt with in this paper is the identification of the number of changepoints \(K_0\). The estimation of their locations, that is, the vector of the \(\psi _k\), and of the broken-line effects, represented by \(\beta _1\) and the vector \(\varvec{\delta }\), may also be of interest.

For estimation purposes, a reparametrization of the segmented model in Equation (1) is considered, dropping the non-segmented covariate z without any loss of generality. This reparametrization allows an efficient estimation approach via the algorithm discussed in Muggeo (2003, 2008), which iteratively fits the generalized linear model:

$$\begin{aligned} g({\mathbb {E}}[Y|x_i])=\beta _1x_i+\sum _k \delta _k \tilde{U}_{ik}+\sum _k \gamma _k \tilde{V}_{ik}^{-}, \end{aligned}$$
(2)

where \(\tilde{U}_{ik}=(x_i-\tilde{\psi }_k)_+\), \(\tilde{V}_{ik}^{-}=-I(x_i>\tilde{\psi }_k)\). The parameters \(\beta _1\) and \(\delta _k\) are the same as in Equation (1), while the \(\gamma _k\) are working coefficients used only in the estimation procedure. At each iteration, the working model in Equation (2) is fitted and new estimates of the changepoints are obtained via \(\hat{\psi }_k=\tilde{\psi }_k+\frac{\hat{\gamma }_k}{\hat{\delta }_k}\), iterating the process up to convergence. Inference on \(\hat{\psi }\) is usually the main interest and can be drawn by means of bootstrap, likelihood-based or Wald-type methods. In particular, Muggeo (2003) discusses and implements the use of the Wald statistic. The standard error of \(\hat{\psi }\) is obtained through a linear approximation for the ratio of two random variables using the Delta Method: \(SE(\hat{\psi })=\{ [\text {var}(\hat{\gamma }) +\text {var}(\hat{\delta })(\hat{\gamma }/\hat{\delta })^2-2(\hat{\gamma }/\hat{\delta })\text {cov}(\hat{\gamma },\hat{\delta })]/\hat{\delta }^2 \}^{\frac{1}{2}},\) with \(\text {var}(\cdot )\) and \(\text {cov}(\cdot ,\cdot )\) the variance and covariance, respectively.

For further details on the estimation procedure, we refer to Muggeo (2003).
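As an illustration, a minimal R sketch of this estimation step is given below, using the segmented package on simulated data; the coefficients, the true changepoint at 0.6 and the starting value are purely illustrative. The fitted object returns the changepoint estimate together with its Delta-Method standard error.

library(segmented)

set.seed(1)
n  <- 200
x  <- seq(0, 1, length.out = n)
mu <- exp(1 + 2 * x - 4 * pmax(x - 0.6, 0))    # broken-line predictor, one changepoint at 0.6
y  <- rpois(n, mu)

fit0 <- glm(y ~ x, family = poisson)            # fit without changepoints
fit1 <- segmented(fit0, seg.Z = ~x, psi = 0.5)  # psi is the starting value for the changepoint

fit1$psi        # initial value, estimate and Delta-Method standard error of psi
summary(fit1)   # beta_1, the difference-in-slope delta_1 and the changepoint
confint(fit1)   # confidence interval for the estimated changepoint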

3 Selecting the number of changepoints

In a more general context, with multiple changepoints as in Equation (1), we need to select only the significant changepoints by removing the spurious ones. Indeed, if the generic \(\hat{\psi }_k\) is not significant, the corresponding covariate \(V_k\) in Equation (2) should be a noise variable, since \(\hat{\delta }_k \approx 0\). Therefore, selecting the number of significant changepoints in model (1) means selecting the significant variables among \(V_1, \dots , V_{K^*}\) in model (2), where \(K^*\) is the number of estimated changepoints. The fitted optimal model will have \(\hat{K} \le K^*\) changepoints selected by any criterion. It is important to notice that these models are not nested, so likelihood ratio tests cannot be used for model selection.

Furthermore, the usual statistics cannot be used to verify the existence of a changepoint, since the changepoint is present only under the alternative hypothesis. This leads to a non-regular problem, because the regularity conditions of the log-likelihood are not satisfied.

Basically, we need to select \(\hat{\psi }_1,\dots ,\hat{\psi }_{\hat{K}}\) among \(\hat{\psi }_1,\dots ,\hat{\psi }_{K^*}\) via a selection criterion. The changepoints \(\hat{\psi }_1,\dots ,\hat{\psi }_{\hat{K}}\) will be a subset of the estimates \(\hat{\psi }_1,\dots ,\hat{\psi }_{K^*}\), since the deletion of one or more variables \(V_k\) by the given selection criterion removes the corresponding changepoints. Therefore, it should be noticed that, while \(\hat{\psi }_1,\dots ,\hat{\psi }_{K^*}\) are the estimates maximising the likelihood with \(K^*\) changepoints, there is no guarantee that the retained subset \(\hat{\psi }_1,\dots ,\hat{\psi }_{\hat{K}}\) also constitutes the maximum-likelihood estimates for a model with \(\hat{K}\) changepoints.

Much of the literature deals with the problem of determining the ‘best’ subset of independent variables: Hocking (1976) summarizes various selection criteria, reviewed below. These can be classified under two major approaches: information criteria and hypothesis testing.

The first information criterion is the well-known Akaike Information Criterion (Akaike 1974), expressed as \(AIC=-2\log L + 2 p\), where L represents the likelihood function and p stands for the actual model dimension, quantified by the number of estimated parameters; in segmented regression models these include \(\hat{\beta }_1\) and the \(\hat{\varvec{\delta }}\) and \(\hat{\varvec{\psi }}\) vectors, that is, \(p=1+2\hat{K}\). Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. Thus, the AIC rewards goodness of fit (as assessed by the likelihood function), but also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of fit.

The second criterion is the Bayesian Information Criterion (Schwarz 1978), \(BIC=-2\log L + p \log (n)\), whose penalty involves both the number of estimated parameters p and the logarithm of the number of observations n. In the most common Gaussian case, denote by \(y_i\) the response variable and by \(\hat{\mu }_i\) the expectation estimated through a generalized linear model. We can therefore express the BIC as \(BIC= n\log {\hat{\sigma }^2} + p \log (n),\) where \(\hat{\sigma }^2\) is the error variance, defined as \(\frac{1}{n-p}\sum _{i=1}^n (y_i-\hat{\mu }_i)^2\), which is an unbiased estimator of the true variance. When choosing among several models, the one with the lowest BIC is preferred. The BIC is an increasing function of the error variance \(\hat{\sigma }^2\) and an increasing function of p. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of the BIC. Hence, a lower BIC implies either fewer explanatory variables, better fit, or both. The BIC generally penalizes parameters more strongly than the AIC, although the difference depends on n and p. For a typical linear regression model, it is well understood that the traditional best subset selection method with the BIC can identify the true model consistently (Shao 1997; Shi and Tsai 2002). With a fixed predictor dimension, Wang et al. (2009) showed that tuning parameters selected by a BIC-type criterion for high-dimensional model selection procedures lead to consistent identification of the true model, and similar results have been extended to the situation with a diverging number of parameters for both unpenalized and penalized estimators.

The third criterion is the generalized BIC (gBIC), defined for Gaussian distributed iid errors as \(gBIC=\log (\hat{\sigma }^2) + p \frac{\log (n)}{n} C_n\), where \(C_n\) is a known constant (e.g. 1, \(\sqrt{n}\), \(\log n\), \(\log \log n\)). For non-Gaussian errors (the form we will also refer to when dealing with Binomial and Poisson responses), the definition becomes \(gBIC= -2\log L + p \log (n) C_n\). In general, the larger \(C_n\), the more parsimonious the selected model. Note that the gBIC reduces to the usual BIC when \(C_n=1\). The same considerations as for the BIC hold, that is, when choosing among several models, the one with the lowest gBIC is to be preferred.
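To illustrate how the three criteria are compared across candidate fits, the sketch below computes them on the \(-2\log L\) scale with \(p=1+2K\), as described above; the helper crit() is ours and purely illustrative (the intercept, common to all candidate models, is omitted from p, as it does not affect the comparison).

## AIC, BIC and gBIC for a fitted model with K changepoints
crit <- function(fit, K, Cn = log(log(nobs(fit)))) {
  ll <- as.numeric(logLik(fit))   # maximised log-likelihood
  n  <- nobs(fit)
  p  <- 1 + 2 * K                 # beta_1, the K slope differences and the K changepoints
  c(AIC  = -2 * ll + 2 * p,
    BIC  = -2 * ll + p * log(n),
    gBIC = -2 * ll + p * log(n) * Cn)
}

## e.g., with the fits of the previous sketch:
## rbind("K=0" = crit(fit0, K = 0), "K=1" = crit(fit1, K = 1))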

An alternative approach to selecting the number of changepoints relies on sequential hypothesis testing. Typically, this consists of performing different hypothesis tests starting from

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_0 : K_{0} = 0\\ {\mathcal {H}}_1 : K_{0} = K_{max} \end{array}\right. } \end{aligned}$$

where \(K_{max}\) is fixed a priori. Depending on whether the null hypothesis is rejected or not, the procedure moves to the next hypothesis system by either increasing the number of changepoints specified under \({\mathcal {H}}_0\) or decreasing the one under \({\mathcal {H}}_1\), respectively (Kim et al. 2000).

4 Proposed sequential hypothesis testing

In this paper, we propose a novel sequential procedure to identify the correct number of changepoints, resorting to the pseudo-score test (Muggeo 2016) or Davies’ test (Davies 1977).

Testing for the existence of a changepoint means that we are dealing with the following system of hypotheses:

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_{0} : &{} \delta _{k}=0\\ {\mathcal {H}}_{1} : &{} \delta _{k}\ne 0 \end{array}\right. }. \end{aligned}$$

Evaluating the existence of a changepoint is actually a non-regular problem, because \(\psi _k\) is present only under the alternative \({\mathcal {H}}_{1}\). This makes the usual statistical tests, such as the Wald or the likelihood ratio test, inapplicable, because of the lack of a reference null distribution, even asymptotically. Therefore, we review below two tests used to evaluate the presence of a changepoint.

The first test is Davies’ test (Davies 1977), an asymptotic test useful for dealing with hypothesis testing when a nuisance parameter is present only under the alternative. Assuming K fixed and known candidate changepoints, the procedure computes K ‘naive’ test statistics \(S(\psi _k)\) for the difference-in-slope \(\delta _k\), seeks the smallest naive p-value (according to the alternative hypothesis), and then corrects this minimum p-value by means of the K values of the test statistic.

Considering the case of multiple changepoints \(\psi _{1}<\psi _{2}< \dots <\psi _{K}\) and the corresponding K test statistics, Davies defined an upper bound for the p-value given by

$$\begin{aligned} \text {p-value} \approx \Phi (-M)+V\exp {(-M^2/2)}(8\pi )^{-1/2} \end{aligned}$$

where \(\Phi (\cdot )\) is the cumulative Normal distribution function. M is the supremum of the test statistics \(S(\psi )\), that is, \(M = \sup \{ S(\psi ) : {\mathcal {L}} \le \psi \le {\mathcal {U}} \}\) where \(\{{\mathcal {L}},{\mathcal {U}}\}\) is the range of possible values of \(\psi\) (typically, the support of the segmented covariate).

Then, V is the total variation of \(S(\psi )\), computed as \(V=\int _{{\mathcal {L}}}^{{\mathcal {U}}}\left| \frac{\partial S(\psi )}{\partial \psi }\right| \text {d}\psi =|S(\psi _1)-S({\mathcal {L}})|+|S(\psi _2)-S(\psi _1)|+\ldots +|S({\mathcal {U}})-S(\psi _m)|\), with \(\psi _1,\ldots ,\psi _m\) the successive turning points of \(S(\psi )\).

Although Davies’ test is useful to test for the existence of a changepoint, it is not considered ideal to identify the number of changepoints or their location. Indeed, the alternative hypothesis \({\mathcal {H}}_{1}\) actually states the existence of at least one additional changepoint, that is \(K_0 \ge k\) when \(\delta _k \ne 0\).

The second one is a pseudo-score test proposed by Muggeo (2016), which is based on an adjustment of the score statistic. This approach requires quantities only from the null fit and thus it has the advantage that it is not necessary to estimate the nuisance parameter under the alternative. The proposed statistic has the form:

$$\begin{aligned} s_{0}=\frac{\bar{\varphi }^{T}(I_{n}-A)y}{\sigma \{\bar{\varphi }^{T}(I_{n}-A)\bar{\varphi }\}^{\frac{1}{2}}} \end{aligned}$$

where \((I_{n}-A)y\) is the residual vector under \({\mathcal {H}}_{0}\), with \(I_{n}\) the identity matrix, A the hat matrix, y the observed response vector, and \(\bar{\varphi }=\{\bar{\varphi }_{1},\ldots ,\bar{\varphi }_{n}\}^{T}\) the vector obtained by averaging \(\varphi (x_{i},\psi )\) over K fixed values of the nuisance parameter in the range \({\{{\mathcal L},{\mathcal U}\}}\), i.e. \(\bar{\varphi }_i=K^{-1}\sum _{k=1}^{K}\varphi (x_{i},\psi _{k})\), \(i = 1,\ldots ,n\). The resulting statistic does not depend on \(\psi _k\), so the score can be computed even under \({\mathcal {H}}_{0} : \delta _k = 0\), when \(\psi _k\) is not defined. The function \(\varphi (x_{i},\psi _{k})\) covers both the case of a discontinuous changepoint, \(\varphi (x_{i},\psi _{k}) = I(x_{i} > \psi _{k})\), and the linear segmented case, \(\varphi (x_{i},\psi _{k}) = (x_{i} - \psi _{k})_{+}\), which is the one covered in this paper.
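To make the construction concrete, the following R sketch computes \(s_0\) for a Gaussian null fit with K fixed changepoint values spread over the range of x. It is a bare-bones illustration of the formula only; in practice the pscore.test() function of the segmented package should be used, as it also provides the p-value and handles generalized linear models.

set.seed(1)
n <- 100
x <- seq(0, 1, length.out = n)
y <- 2 + 1.5 * pmax(x - 0.5, 0) + rnorm(n, sd = 0.3)   # one true changepoint at 0.5

fit0 <- lm(y ~ x)                             # null fit: no changepoint
X0   <- model.matrix(fit0)
A    <- X0 %*% solve(crossprod(X0), t(X0))    # hat matrix
res  <- (diag(n) - A) %*% y                   # residual vector under H0
sig  <- summary(fit0)$sigma                   # estimate of sigma

psi_k   <- quantile(x, probs = seq(0.1, 0.9, length.out = 10))   # K = 10 fixed values of psi
phi     <- sapply(psi_k, function(p) pmax(x - p, 0))             # phi(x_i, psi_k), linear segmented case
bar_phi <- rowMeans(phi)                                          # average over the psi_k

s0 <- drop(crossprod(bar_phi, res)) /
  (sig * sqrt(drop(t(bar_phi) %*% (diag(n) - A) %*% bar_phi)))
s0   # approximately standard Normal under H0 for large n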

Contrary to the procedure of Kim et al. (2000), our proposal has the advantage of not being limited to testing for a maximum number of additional changepoints fixed a priori. Indeed, the previously explained procedure makes testing for more than two additional changepoints with the pseudo-score unfeasible. Our proposal overcomes this problem by making it possible to test for any number of additional changepoints thanks to the sequential procedure.

Starting from

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_{0} : K_0=0\\ {\mathcal {H}}_{1} : K_0=1 \end{array}\right. } \end{aligned}$$

and depending on the tests’ results, the procedure ends by testing at most

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_{0} : K_0=K_{max}-1\\ {\mathcal {H}}_{1} : K_0=K_{max} \end{array}\right. } \end{aligned}$$

and selecting up to \(K_{max}\) changepoints. The p-value for each hypothesis can be obtained via the Davies’ or the pseudo-score test. Furthermore, we control for over-rejection of the null hypotheses at the overall level \(\alpha\) employing the Bonferroni correction, comparing each p-value with \(\alpha /K_{max}\). Of course, setting the Bonferroni correction to \(\alpha /K_{max}\) means putting ourselves in the most conservative setting.

For simplicity, we outline the algorithm when the maximum number of changepoints is \(K_{max}=3\), restricting the analyses to a limited number of changepoints.

The procedure works by iteratively fitting models following Muggeo (2003), as sketched in Section 2, and proceeds as follows:

  1.

    Fit a segmented model to the data with \(K=1\) and test

    $$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_{0} : &{} \delta _{1}=0\quad (K_0=0)\\ {\mathcal {H}}_{1} : &{} \delta _{1}\ne 0\quad (K_0 \ge 1) \end{array}\right. } \end{aligned}$$

    via the Score or Davies’ test. If \({\mathcal {H}}_{0}\) is not rejected, then \(\hat{K}=0\) and the procedure stops at this step. Otherwise, we proceed with the algorithm;

  2.

    Fit a segmented model with \(K=2\) and test

    $$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_{0} : &{} \delta _{2}=0 \quad (K_0=1)\\ {\mathcal {H}}_{1} : &{} \delta _{2}\ne 0\quad (K_0 \ge 2) \end{array}\right. } \end{aligned}$$

    If \({\mathcal {H}}_{0}\) is not rejected then \(\hat{K}=1\) and the procedure stops. Otherwise, we proceed to fit the following model;

  3.

    Fit a segmented model with \(K=3\) and test

    $$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {H}}_{0} : &{} \delta _{3}=0\quad (K_0=2)\\ {\mathcal {H}}_{1} : &{} \delta _{3}\ne 0\quad (K_0 \ge 3) \end{array}\right. } \end{aligned}$$

    If \({\mathcal {H}}_{0}\) is not rejected then \(\hat{K}=2\). Otherwise, \(\hat{K} \ge 3\).

It is important to note that, when using Davies’ test, even if the rejection of the last test leads to selecting 3 changepoints (or, in general, \(K_{max}\)), the actual number could be larger, since at each step we are actually testing for (at least) one additional changepoint.
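A compact R sketch of the procedure with \(K_{max}=3\) is given below. It relies on the pscore.test() function of the segmented package (with more.break = TRUE to test for an additional changepoint on an already segmented fit); davies.test() could be used analogously. The helper select_K() is purely illustrative, assumes a single segmented covariate named x, and is not the authors’ released code.

library(segmented)

select_K <- function(fit0, Kmax = 3, alpha = 0.05) {
  thr <- alpha / Kmax                     # Bonferroni-corrected threshold
  fit <- fit0                             # start from the model with no changepoints
  for (k in seq_len(Kmax)) {
    ## test H0: K0 = k - 1 vs H1: K0 >= k (i.e., delta_k = 0 vs delta_k != 0)
    if (k == 1) {
      pv <- pscore.test(fit, seg.Z = ~x)$p.value
    } else {
      pv <- pscore.test(fit, seg.Z = ~x, more.break = TRUE)$p.value
    }
    if (pv >= thr) return(k - 1)          # H0 not rejected: K_hat = k - 1, stop
    if (k < Kmax)                         # H0 rejected: refit with k changepoints
      fit <- segmented(fit0, seg.Z = ~x, npsi = k)
  }
  Kmax                                    # all Kmax tests rejected: K_hat >= Kmax
}

## e.g., select_K(glm(y ~ x, family = poisson)) with data simulated as in Section 2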

5 Simulation studies

This section is devoted to simulation studies for comparing the performance of our proposed method to the previously introduced criteria for selecting the true number of changepoints, considering Gaussian, Binomial, and Poisson responses.

We simulated from four different scenarios, generating and then fitting models with different true values of the number of changepoints, namely \(K_0\in \{0,1,2,3\}\). We consider three different sample sizes \(n \in \{100,250,500\}\) and include one covariate \(x_i\), whose effect on the response is assumed broken-line and which takes equispaced values ranging from 0 to 1. An example for each scenario, with \(n = 100\), is represented in Figure 3. The segmented models used for the simulations are reported in Table 1, firstly considering iid Gaussian errors with standard deviation equal to \(\sigma =0.3\).

Table 1 Linear segmented regression models fitted for the simulations

We set \(\alpha =0.05\) for the hypothesis testing, while the penalization of the gBIC is chosen as \(C_n=\log \log n\). For each \(K_0\), we fit four models with \(K\in \{0,1,2,3\}\): the estimated number of changepoints is obtained by fitting segmented models using the segmented package (Muggeo 2003, 2008) over 500 simulations.
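For concreteness, one simulation replicate in the Gaussian case may look as follows (coefficients and changepoint locations are illustrative and do not reproduce Table 1); the information criteria are computed with \(p=1+2K\) as in Section 3.

set.seed(123)
n  <- 250
x  <- seq(0, 1, length.out = n)
mu <- 1 + 2 * x - 3 * pmax(x - 0.3, 0) + 4 * pmax(x - 0.7, 0)   # K0 = 2 true changepoints
y  <- mu + rnorm(n, sd = 0.3)                                    # iid Gaussian errors

fit0 <- lm(y ~ x)
fits <- lapply(1:3, function(K)
  segmented::segmented(fit0, seg.Z = ~x, npsi = K))   # candidate fits with K = 1, 2, 3

ll <- sapply(c(list(fit0), fits), function(f) as.numeric(logLik(f)))
p  <- 1 + 2 * (0:3)
-2 * ll + p * log(n)   # BIC for K = 0,...,3; the lowest value gives K_hat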

Table 2 reports the simulation results in terms of the percentage of the correctly selected number of changepoints for each criterion.

Table 2 Percentages of the correctly selected number of changepoints by each criterion (based on 500 runs and three different sample sizes \(n\in \{100,250,500\}\)) - Gaussian response variable

With regard to the information-based criteria, we select the ‘best’ model by choosing the one with the lowest value of the given information criterion. As for the hypothesis testing, we choose the best model by applying the procedure proposed in Sect. 4. Conditional frequencies are reported in the rows of the Table, and the interpretation of the results is as follows. For instance, simulating a model with \(K_0=0\) and \(n=100\), the AIC picks the right number of changepoints \(60\%\) of the time. Therefore, a criterion that perfectly selects the right number of changepoints should report values equal to 1 on the main diagonal of the table, and zeros in all the other entries.

It appears evident that the AIC overestimates the number of changepoints more frequently than the other considered criteria. This is a reasonable result since it is well known that the AIC tends to overestimate the number of parameters. Also, this is the reason why it might seem to correctly pick the number of changepoints when \(K_0=3\), as the percentage of correct identification approaches 1. This is because we have not considered alternative hypotheses with \(K_0>3\), that would likely be selected. The BIC and gBIC seem to behave better, as well as the Davies’ and pseudo-score tests. An exception is represented by the case in which \(K_0=3\) and \(n=100\), since Davies’ test underestimates the number of changepoints on average. Overall, we notice that the gBIC outperforms its competitors in almost all the considered scenarios, especially as n increases. Other simulation studies, omitted for brevity, show that a sample size larger than \(n=500\) leads to the same results.

We also explore 5-fold cross-validation (CV). Table 12 of the Appendix contains the percentages of the correctly selected number of changepoints by the CV criterion, based on the same 500 runs and three different sample sizes \(n \in \{100, 250, 500\}\) for the Gaussian response variable. We notice that CV leads to results similar to those of the AIC, performing only slightly better as n increases but with the additional disadvantage of requiring much more computational time. Based on these results, we believe the CV criterion is not worth exploring further with other response variable distributions.

Then, we perform other simulations, again considering the same models of Table 1, with an additional non-broken-line variable \(z_i\), whose effect is set equal to \(\theta =4\). The models considered are reported in Table 13 of the Appendix. We consider both a continuous variable \(Z \sim Beta(\alpha _1=1,\alpha _2=2)\) and a dichotomous variable \(Z \sim Bernoulli(\pi =0.5)\). These additional results are reported in Tables 14 and 15 of the Appendix, respectively. Overall, we do not identify any relevant differences in the results when a non-broken-line variable is added to the linear predictor, especially as n increases.

In Table 3, we report the results of fitting logit models, whose linear predictors are reported in Table 4. We add an extra error term to the linear predictors to make the data noisier. This is done to obtain the same degree of variability for the simulations from each considered distribution and to achieve a realistic compromise between simulated and real data.

Table 3 Percentages of correctly selected number of changepoints by each criterion (based on 500 runs and three different sample sizes \(n \in \{100, 250, 500\}\)) - Binomial response variable
Table 4 Generalized linear segmented regression models fitted for the simulations - Binomial case

Both the information criteria and the tests struggle to identify the third changepoint even when n increases. The only exception is the AIC, whose performance is worse for \(K_0 \in \{0,1,2\}\), but better with \(K_0=3\). As evident from Figure 3, this third changepoint corresponds to a moderate change in the slope of the relationship between the response variable y and the segmented covariate x. The better performance of the AIC in this last scenario could indicate its ability to spot even slight changes in the segmented relationship in the Binomial case.

Finally, we carry out simulations under the Poisson case. Table 5 contains the results of fitting Poisson models, whose linear predictors are reported in Table 6. For ease of comparison, we set the same sample sizes used for the Binomial and Gaussian cases, even though the results show that a larger sample size could be needed to achieve equally good results. In detail, the information criteria, especially the AIC, almost completely fail to select the correct number of changepoints. Among the information criteria, the gBIC achieves a good performance when \(K_0\ne 0\). The same holds for the Davies' test for all the sample sizes.

Table 5 Percentages of the correctly selected number of changepoints by each criterion (based on 500 runs and three different sample sizes \(n \in \{100, 250, 500\}\)) - Poisson response variable
Table 6 Generalized linear segmented regression models fitted for the simulations - Poisson case

6 Applications to real data

6.1 Application to crime data

In this subsection, we apply the sequential procedure for selecting the number of changepoints to the valenciacrimes dataset available in the stopp package in R (D’Angelo and Adelfio 2023). This database includes information about the crime events that occurred in Valencia, Spain, in 2019. The time and spatial location of the events are the most important variables of the dataset, but variables based on distances from events to “places of concentration” are also available, including the Euclidean distance from the nearest: atm, bank, bar, cafe, industrial site, market, nightclub, police, pub, restaurant, or taxi.

In detail, we are interested in exploring the relationship between the number of crimes and the hour of occurrence. To this aim, we compute the hourly number of crimes within a day (number_of_crimes), accumulating the yearly counts separately for each level of the variable week_day. This makes the number of statistical units equal to 168, that is, the number of days of the week times the hours within a day. Therefore, we first fit a Poisson regression model

$$\begin{aligned} \log ({\mathbb {E}}[\texttt {number\_of\_crimes}_{i}])=\beta _0+\beta _1 \texttt {crime\_hour}_{i} \end{aligned}$$
(3)

that does not assume any segmented relationship between the covariate crime_hour and the response variable number_of_crimes, i.e. \(K=0\).

The summary of the estimated coefficients of model (3) is reported in Table 7.

Table 7 Coefficients of non-segmented model (3)

The second step is to test if this relationship can actually be assumed to be broken-line. This would indicate that the expected number of crimes significantly changes as hours go by. To this aim, we estimate three segmented regression models of the form:

$$\begin{aligned} \log ({\mathbb {E}}[\texttt {number\_of\_crimes}_{i}])=\beta _0 + \beta _1 \texttt {crime\_hour}_i+\sum _{k=1}^{K}\delta _k (\texttt {crime\_hour}_{i}-\psi _k)_+ \end{aligned}$$
(4)

with \(K=1,2,3\).
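A sketch of this analysis in R is reported below. It assumes that the 168-row aggregated data frame, here called counts, has columns number_of_crimes and crime_hour as described above; the construction of counts from the valenciacrimes object in stopp is omitted, as it depends on the package version. The p-values correspond to the three steps of the sequential procedure of Section 4 and are compared with \(\alpha /3\).

library(segmented)

fit0 <- glm(number_of_crimes ~ crime_hour, family = poisson, data = counts)   # model (3)
fits <- lapply(1:3, function(K)
  segmented(fit0, seg.Z = ~crime_hour, npsi = K))                             # models (4), K = 1, 2, 3

pscore.test(fit0, seg.Z = ~crime_hour)$p.value                            # H0: K0 = 0 vs H1: K0 >= 1
pscore.test(fits[[1]], seg.Z = ~crime_hour, more.break = TRUE)$p.value    # H0: K0 = 1 vs H1: K0 >= 2
pscore.test(fits[[2]], seg.Z = ~crime_hour, more.break = TRUE)$p.value    # H0: K0 = 2 vs H1: K0 >= 3

summary(fits[[2]])   # selected model with K_hat = 2: slopes and changepoint hours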

In Table 8, we report the information criteria values for each estimated model and the p-values for each step of the sequential hypothesis testing procedure outlined in Section 4.

Table 8 Values of the information-based criteria of the fitted models (4) and p-values of each step of the sequential procedure to be compared to \(\alpha / 3 = 0.016\) for the crime data

A segmented relationship is clearly more appropriate than a classical linear relationship, given that neither the information-based criteria nor the sequential procedure ever select the model with \(K=0\). In detail, all the considered information-based criteria select the model with \(K=3\). In contrast, the procedure based on sequential hypothesis testing selects the number of changepoints for which the corresponding test is no longer significant, that is, unequivocally \(K=2\).

Following the results of the simulation studies, which have shown that the proposed sequential procedure outperforms its information-based criteria competitors in Poisson segmented models, we choose the model with \(\hat{K}=2\).

The summary of the coefficients of the selected model is reported in Table 9, and the broken-line relationship between the two variables is in Figure 1.

Table 9 Coefficients of the chosen segmented model with \(\hat{K}=2\) for the crime data
Fig. 1

Segmented relationship between the number of crimes and the hour of event occurrence for the crime data. The points are located in correspondence with the estimated changepoints

In particular, \(\hat{\beta }_1\) is the effect of crime_hour when \(x_i<\hat{\psi }_1\), that is, when the crime occurred before 10 am, while \(\hat{\delta }_1\) and \(\hat{\delta }_2\) are the changes in the slope when \(\hat{\psi }_1<x_i<\hat{\psi }_2\) and \(x_i>\hat{\psi }_2\), respectively. The positive value of \(\hat{\delta }_1\) indicates an increase in the number of crimes after 10 am, while the negative value of \(\hat{\delta }_2\) indicates a slowdown after 1 pm: the sum \(\hat{\beta }_1+\hat{\delta }_1+\hat{\delta }_2\) is still positive, so the number of crimes keeps increasing after 1 pm, although at a lower rate.

In summary, the relationship between the number of crimes and the time of crime occurrence is confirmed. Furthermore, we have shown the presence of two changepoints in the hour of crime occurrence, after which the trend in the number of crimes changes. In detail, the number of crimes decreases from midnight to 10 am and then increases after 10 am, more rapidly between 10 am and 1 pm than afterwards.

This result is achieved thanks to the flexibility of the approach. Note that other non-parametric approaches, such as splines, could fit the data equally well. However, our proposed procedure allows us to identify the estimated times at which the trend changes, which is potentially crucial to policymakers. In addition, the main advantage of applying segmented models lies in the possibility of estimating and, therefore, interpreting the slopes, that is, the risk of observing a crime.

6.2 Application to global land temperature data

In this subsection, we apply the sequential procedure for selecting the number of changepoints to the global land temperature data available from the https://www.ncei.noaa.gov/access/monitoring/global-temperature-anomalies/anomalies site. This database includes the global annual time series of temperature anomalies with respect to the 20th century average (1901-2000). Monthly and annual global anomalies are available through the most recent complete month and year, respectively.

Here we only analyze land data, excluding ocean temperatures, and we consider all the available years, that is, from 1850 to 2022. As the main interest lies in identifying the years where a shift in the temperature anomalies trend occurred, we fit segmented regression models with Gaussian distribution for the response variable. The only covariate for which a piecewise relationship with the temperature anomalies can be assumed is the year of observation.

Therefore, we fit four models, starting from the one with no changepoints, up to the one with three changepoints.
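A sketch of this analysis in R is given below, assuming a data frame noaa with columns year (1850-2022) and anomaly (annual land anomaly in degrees Celsius) downloaded from the NOAA page cited above; the object name and the download step are illustrative.

library(segmented)

fit0 <- lm(anomaly ~ year, data = noaa)                                    # no changepoints
fits <- lapply(1:3, function(K) segmented(fit0, seg.Z = ~year, npsi = K))

sapply(c(list(fit0), fits), AIC)                                   # information criteria for K = 0,...,3 (see Table 10)
pscore.test(fit0, seg.Z = ~year)$p.value                           # H0: K0 = 0 vs H1: K0 >= 1
pscore.test(fits[[1]], seg.Z = ~year, more.break = TRUE)$p.value   # H0: K0 = 1 vs H1: K0 >= 2 (not rejected)

slope(fits[[1]])   # per-year slopes before/after the changepoint; x10 gives the per-decade rates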

Table 10 reports the information criteria values for each estimated model and the p-values for each step of the sequential hypothesis testing procedure.

Table 10 Values of the information-based criteria of the fitted segmented models and p-values of each step of the sequential procedure to be compared to \(\alpha / 3 = 0.016\) for the global land temperature data

Note that the information criteria always select three changepoints, while our proposed procedure always selects one, regardless of whether the Davies’ or the Score test is used. This result is in line with the well-documented single crucial change in trend in recent decades (Yu and Ruggieri 2019).

The estimated broken-line relationship is depicted in Figure 2. It is easy to observe the existence of three changepoints, correctly identified by the information criteria, while our hypothesis testing method detects the single most important change in trend.

The coefficients estimated by the selected model are reported in Table 11, indicating 1979 as the changepoint year. Yu and Ruggieri (2019) detected 1963 as the unique changepoint in the temperature anomalies of land-based records. Note, however, that a changepoint for their method could either represent a change in the mean, trend, or variance, while our method is particularly suited for assessing the presence of a change in the effect of the segmented covariate. For this reason, we attribute differences in the location of estimated changepoints to the different nature of the methodologies.

Moving to the estimated effects, we obtain that temperatures were increasing at a rate of 0.05 \(^\circ\)C/decade prior to the changepoint year, while they increased at a rate of 0.28 \(^\circ\)C/decade afterwards. According to Yu and Ruggieri (2019), these temperature increases are, respectively, 0.06 and 0.27 \(^\circ\)C/decade before and after their estimated changepoint.

Fig. 2

Segmented relationship between the temperature anomalies and the years for the global land temperature data. The red point is located in correspondence with the estimated changepoint

Table 11 Coefficients of chosen segmented model with \(\hat{K}=1\) for the global land temperature anomalies data

7 Conclusions

In this paper, we tackle the problem of selecting the optimal number of changepoints in segmented regression models. Firstly, we provide an overview of segmented regression models and a review of various methods used to estimate the number of changepoints. The effectiveness of these methods is assessed through simulation studies.

One well-established procedure, proposed by Kim et al. (2000), uses sequential testing to identify the existence of a single changepoint, but is unable to test for an arbitrary number of additional changepoints. To address this limitation, we propose a sequential procedure and evaluate its performance through simulations. We also compare the performance of our proposed method with that of information-based criteria in simulation studies. Our results show that the gBIC outperforms the other criteria for Gaussian response variables. Moreover, our sequential procedure performs well in the Gaussian case and is overall superior to information-based criteria for Binomial and Poisson response variables, particularly when multiple changepoints are present. Note, however, that the satisfactory performance of our proposed testing procedure pertains to some specific scenarios, and therefore one should take this into account when applying the procedure.

These results have some limitations. We ran the simulation studies fixing a limited number of changepoints, namely \(K_{max}=3\). Of course, our method can be applied with any fixed \(K_{max}\). However, a small \(K_{max}\) is often reasonable in real-life applications, as shown in Priulla et al. (2021), which deals with higher education data.

Moreover, a further topic to explore in the future is the quantification of the uncertainty of the selected number of changepoints. In addition, a principled criterion for establishing the Bonferroni correction could be studied, given that the current implementation depends on the a priori chosen \(K_{max}\). Note that the applications presented in this paper proved robust to the current choice of the Bonferroni correction.

Furthermore, we have shown the applicability of our methods to the urban context through the analysis of crime data, and to environmental phenomena. In particular, concerning the latter application, our method provides results in line with previous works on temperature anomaly data. In general, segmented regression models are valuable tools for addressing diverse environmental challenges and driving sustainable practices across various fields.

Following such considerations, we believe our automatic procedure for selecting the number of changepoints in regression models will help address many real-life applications in environmental research.