Application of the bootstrap method for change points analysis in generalized linear models
Abstract
In this paper, we focus on methods for constructing the prediction model, estimating the change point locations, and constructing confidence intervals for the generalized linear model with piecewise different coefficients. The hierarchical splitting algorithm is widely used as a standard approach for multiple change point analysis. However, it carries a high risk that the standard error of the change point estimators becomes large and, therefore, that the prediction accuracy of the estimated model decreases. To deal with this problem, we consider the application of a bootstrap method based on the hierarchical splitting algorithm. Through simulation studies, we compare the algorithms in terms of the prediction accuracy of the estimated model, the bias and variance of the change point estimators, and the accuracy of the confidence intervals of the change points. From the results, we confirmed the utility of the bootstrap-based methods for change point analysis: increased prediction accuracy of the obtained model, decreased standard error of the change point estimators, and, depending on the situation, construction of better confidence intervals. We also present the results of a simple example to demonstrate the utility of the method.
Keywords
Bagging · Break point · Confidence interval · Ensemble method · Hierarchical splitting
1 Introduction
Generalized linear models are widely used for modeling a response variable of interest based on explanatory variables. In an ordinary analysis based on a generalized linear model (GLM), the model is assumed to hold for the entire data set. However, it is widely understood that this assumption does not hold in several situations. For example, in epidemiological studies in occupational medicine, there is often a threshold concentration of a specific agent above which an adverse health effect appears (Ulm 1991). As another example, in medical research, the mortality rate for a certain disease may change suddenly at a certain threshold age. To deal with such data, we can consider linear models whose structure changes at certain values of an explanatory variable. These points are called change points or break points. In this study, we focus on the construction method of the prediction model, estimation methods of the change points, and confidence intervals for the GLM with piecewise different coefficients.
Change point analysis has been studied for a number of years. For example, Hawkins (1977), Worsley (1979), Inclán (1993), and Chen and Gupta (1997) studied the detection of change point locations in a sequence of random variables that follows a normal distribution. Hawkins (1977) and Worsley (1979) described methods based on likelihood ratio test procedures. On the other hand, Inclán (1993) proposed a Bayesian-based approach, and Chen and Gupta (1997) studied an approach based on the Bayesian information criterion (Schwarz 1978). If there are multiple change points, a grid search could be used with each method. However, if the number of search points is large, this approach is not practical from the viewpoint of computational complexity. To deal with this problem, the hierarchical splitting (HS) algorithm, which dichotomizes data recursively in the same way as a classification and regression tree (Breiman et al. 1984), is widely used (Chen and Gupta 2012). As an improved version of the HS algorithm, Hawkins (2001) proposed the dynamic programming (DP) algorithm, which can revise the locations of the change points according to their number. In cases where the number of change points is unknown, information criteria or tests based on limit theorems are used to estimate it (Chen and Gupta 2012). Studies on change point analysis for a sequence of random variables are summarized by Csörgő and Horváth (1997) and Chen and Gupta (2012).
There are also many studies on change point analysis for ordinary linear models (OLM). For example, the likelihood-ratio-based methods are discussed in Quandt (1958, 1960), Kim and Siegmund (1989), and Kim (1994). Brown et al. (1975) and James et al. (1987) introduced the regression residual-based method, and Brown et al. (1975) also described the recursive residual-based method. These methods are based on the use of the limiting distribution under the null hypothesis that there is no change point. In addition to these methods, there are regression spline-based approaches (Smith 1979), and Bayesian-based approaches (Holbert 1982), etc. Recently, a method for carrying out change point analysis and variable selection simultaneously has also been proposed (Wu 2008). For the case of OLM, the HS algorithm is generally used to search for the locations of multiple change points, and there is also research into methods using the DP algorithm (Bai and Perron 2003). In cases where the number of change points is unknown, information criteria are generally used to estimate it. Studies into change point analysis for OLM have been summarized by Chen and Gupta (2012).
Although there are fewer studies on change point analysis in GLM than on the sequence of random variables or OLM, several have been published. Stasinopoulos and Rigby (1992) discussed the detection method of a change point in univariate GLM, and showed the results of its application to medical data. Ulm (1991) and Gurevich and Vexler (2005) discussed the detection method of a change point in logistic regression models for epidemiological data analysis. Küchenhoff and Carroll (1997) discussed the estimation methods of change points in a segmented GLM with measurement error. As with the case of a sequence of random variables or OLM, application of the HS algorithm to the detection of the multiple change points in GLM can naturally be considered. On the other hand, the DP algorithm cannot be applied under the assumption that the dispersion parameters in each segment are equal. Because we treat the GLM by assuming that the coefficients are different in each segment but the dispersion parameters are equal, methods based on the HS algorithm are considered in this study. In cases where the number of change points is unknown, we consider the use of the information criteria.
A disadvantage of the HS algorithm is that the estimated locations of change points are fixed until the end of the algorithm. Consequently, the optimal combination of change points may not be found in some cases, and there is a high risk that the variance of the estimators becomes large. Moreover, if the locations of the change points are estimated incorrectly, the prediction accuracy of the finally obtained model is expected to decrease. To deal with this problem, we consider the application of the HS algorithm with a bootstrap method in this study. Aggregating the estimators obtained from the models fitted by the HS algorithm to resampled data is expected to decrease the variance of the change point estimators. Moreover, aggregating (bagging) the models obtained from the resampled data and the HS algorithm is expected to increase the prediction accuracy.
As another disadvantage of the HS algorithm, the distributions of the change point estimators are not clear. As discussed above, many studies investigate the limiting distributions of statistics under the null hypothesis that there is no change point, whereas the distribution of the estimators given by the HS algorithm is not well known. Therefore, we compare the confidence intervals of the estimators through simulation studies. For the construction of confidence intervals, we mainly compare two methods: a method that assumes asymptotic normality of the estimators, and a method based on the empirical distribution.
The remainder of this paper is organized as follows: In Sect. 2, we introduce the notation and models treated in this study. In Sect. 3, we describe the HS algorithm and the bagging algorithm, which aggregates models obtained by the HS algorithm for prediction. In Sect. 4, we present the construction methods of confidence intervals for the change point estimators, which are compared in this study. In Sect. 5, the results of the simulation studies are described. The results of applying the method to simple example data are shown in Sect. 6. Finally, we describe the conclusions of this paper in Sect. 7.
2 Notation and model
Because \({\varvec{\tau }}\) is actually unknown, some iterative search method is needed that estimates \({\varvec{\beta }}\), \({\varvec{\alpha }}\) and \(\phi \) under possible combinations of fixed \({\varvec{\tau }}\). It seems to be intuitive to use the grid search over all possible \({\varvec{\tau }}\), but there is a problem from the perspective of computational effort. That is, the order of a grid search for the known number of change points d is \(O(n^d)\), and this method is not realistic when the number of samples n or segments d is large. A more efficient method that dichotomizes samples recursively is introduced in Sect. 3.
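To make the computational argument concrete: a full grid search over d change points must evaluate on the order of \(n^d\) candidate combinations, since it chooses d of the \(n-1\) possible split positions. The following is a small counting sketch (the helper name is hypothetical):

```python
from math import comb

# Number of candidate change-point combinations a full grid search must
# evaluate: choosing d ordered change points among the n - 1 possible
# split positions grows on the order of n^d.
def grid_search_size(n, d):
    return comb(n - 1, d)

print(grid_search_size(100, 2))   # 4851 combinations for n = 100, d = 2
print(grid_search_size(300, 3))   # about 4.4 million for n = 300, d = 3
```

The hierarchical splitting method of Sect. 3 avoids this blow-up by evaluating only O(n) candidate splits at each of the d steps.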
3 Model construction
3.1 Hierarchical splitting
The HS algorithm is a recursive method similar to classification and regression trees. As the first step of the algorithm, all learning samples \({\mathcal {L}}\) are split into two segments based on the splitting rule \(x_{i1}\le \tau '\). To determine the splitting rule \(x_{i1}\le \tau '\), that is, to estimate the change point \(\tau '\), all possible splits are evaluated, and the split that maximizes the sum of the log likelihoods of the two segments is selected as the optimum.
As the second step, the next splitting rule \(x_{i1}\le \tau ''\) is determined under the assumption that the splitting rule \(x_{i1}\le \tau '\) is retained. That is, all learning samples \({\mathcal {L}}\) are split into three segments based on the two splitting rules \(x_{i1}\le \tau '\) and \(x_{i1}\le \tau ''\). The second rule \(x_{i1}\le \tau ''\) is chosen, from all possible splits, as the one that maximizes the sum of the log likelihoods of the three segments. It should be noted that all possible splits must satisfy the condition that the number of samples included in each segment is larger than q.
- 1.
The initial set of change points is given by \(T_1 = \{-\infty , +\infty \}\).
- 2.
For \(k\leftarrow 2\) to the number of known segments d, or predefined maximum search number of segments \(d^{\max }\) do
- 3.
Find the set of possible change points \(T_k'=\{(T_{k-1}, \tau '_{(1)}), (T_{k-1}, \tau '_{(2)}), \ldots \}\), which segments the data into k segments under the assumption that the splitting rule \(T_{k-1}\) is given. Here \(\tau '_{(1)}, \tau '_{(2)}, \ldots \) are the candidates for the kth change point.
- 4. Define the optimal set of k change points by
$$\begin{aligned} T_k = \arg \max _{(T_{k-1}, \tau '_{(l)})\in T_k'} l\left( \hat{\varvec{\beta }},{\hat{\varvec{\alpha }}},\hat{\phi }\,\big |\,(T_{k-1}, \tau '_{(l)}),{\varvec{y}}\right) . \end{aligned}$$
- 5.
end
- 6. If the number of segments d is unknown, the optimal set of change points is estimated by using (1) or (2) as follows:
$$\begin{aligned} T_d = \arg \min _{T_{d'}\in \{T_1, T_2, \ldots , T_{d^{\max }}\}} AIC(d'), \end{aligned}$$ (1)
or
$$\begin{aligned} T_d = \arg \min _{T_{d'}\in \{T_1, T_2, \ldots , T_{d^{\max }}\}} BIC(d'). \end{aligned}$$ (2)
- 7. The estimated linear predictor model is given by
$$\begin{aligned} {\hat{\eta }}_i^\mathrm{HS} = \sum _{k=1}^{d} I({\hat{\tau }}_{k-1} < x_{i1} \le {\hat{\tau }}_{k}){\varvec{x}}_i'\hat{\varvec{\beta }}_k + {\varvec{z}}_i'{\hat{\varvec{\alpha }}}, \end{aligned}$$ (3)
where \(I(\cdot )\) represents the indicator function, and \(-\infty ={\hat{\tau }}_0< {\hat{\tau }}_1< {\hat{\tau }}_2< \cdots< {\hat{\tau }}_{d-1}< {\hat{\tau }}_d=+\infty \) are the values obtained by rearranging the elements of \(T_d\) in ascending order.
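The greedy recursion above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it uses a Gaussian model with a piecewise-constant mean (so that maximizing the sum of segment log likelihoods reduces to minimizing the total within-segment sum of squares), a single split variable, and the minimum-segment-size rule q; the function names and data are hypothetical.

```python
import numpy as np

def segment_sse(y):
    """Within-segment sum of squares about the segment mean."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def hs_change_points(x, y, d, q=10):
    """Greedy HS: keep earlier splits fixed and add, at each step, the split
    that minimizes the total SSE (equivalently, maximizes the Gaussian log
    likelihood), subject to each segment containing at least q samples."""
    taus = []                                   # T_1 = {-inf, +inf} implicitly
    for _ in range(d):
        best = (np.inf, None)
        for t in np.unique(x)[:-1]:             # candidate change points
            bounds = [-np.inf] + sorted(taus + [t]) + [np.inf]
            segs = [y[(x > lo) & (x <= hi)] for lo, hi in zip(bounds, bounds[1:])]
            if min(len(s) for s in segs) < q:   # minimum-segment-size rule
                continue
            sse = sum(segment_sse(s) for s in segs)
            if sse < best[0]:
                best = (sse, t)
        taus.append(best[1])
    return sorted(taus)

# Hypothetical data: piecewise-constant mean with true change points at 3 and 6.
rng = np.random.default_rng(0)
x = rng.uniform(0, 9, 300)
y = np.where(x <= 3, 0.0, np.where(x <= 6, 2.0, 4.0)) + rng.normal(0, 0.3, 300)
taus = hs_change_points(x, y, d=2)
print(taus)   # both estimates should fall near the true points 3 and 6
```

For a real GLM, the SSE criterion would be replaced by the sum of the segment-wise maximized log likelihoods, but the greedy structure of the search is unchanged.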
On the other hand, the disadvantage is that the change points determined as optimal in previous steps are fixed and cannot be changed in later steps. Because of this lack of flexibility in the HS algorithm, there are often cases in which the optimal combination of change points is not found, and thus the variance of the change point estimates becomes large. This is also shown in the simulation results of Sect. 5. Owing to this disadvantage, there is a risk that the model obtained by the HS algorithm has low prediction accuracy. Moreover, as discussed in the introduction, the distribution of the estimators given by the HS algorithm is not well known; obviously, there are cases where the estimates of the change points are not the MLE. Therefore, in the construction of a confidence interval, for example, there is a risk that the interval will be substantially miscalculated if asymptotic normality is assumed. To deal with these problems, we consider the application of the bootstrap method to the model and study its performance through simulations in the following sections.
3.2 Bagging
To construct a model with better prediction accuracy, we consider the use of the bagging algorithm. The bagging algorithm (Breiman 1996) is a representative parallel ensemble method, which constructs a set of base models and combines them. In the field of machine learning, the base models are called base learners or base predictors. It is well known that bagging reduces the prediction error dramatically by exploiting the independence between the base learners. As stated by Zhou (2012), for a regression problem, the degree of improvement in the mean squared error achieved by bagging depends on the instability of the base learners. Because the variance of the change point estimators given by the HS algorithm is expected to be large, as discussed above, the instability of the obtained model will also be high. Therefore, the bagging algorithm is expected to work effectively in the construction of models that include change points.
To construct multiple base models, bootstrap samples are used. In the framework of linear models, there are two main bootstrap methods (Fox 2015). One treats the explanatory variables as random and constructs a set of bootstrap samples directly from the observed learning sample. The other treats the explanatory variables as fixed and samples a set of residuals from the model fitted by using \({\mathcal {L}}\). The bootstrap sample of the response variable \(y_i\) corresponding to \(({\varvec{x}}_i, {\varvec{z}}_i)\) is then constructed as the sum of the predicted value from the fitted model given \(({\varvec{x}}_i, {\varvec{z}}_i)\) and a sampled residual.
The latter approach assumes that the regression model fitted using \({\mathcal {L}}\) is correct and that the errors are identically distributed. This assumption roughly holds in an ordinary OLM, but it is unlikely to hold in our model. First, the theoretical errors of a GLM corresponding to different explanatory variables are usually different. In addition, the model with change points fitted by the HS algorithm is at high risk of instability for the reasons discussed above. For these reasons, we use the former method, which directly resamples the original data set \({\mathcal {L}}\).
- 1.
For \(b\leftarrow 1\) to the predefined iterative number B do
- 2.
Construct a set of bootstrap samples \({\mathcal {L}}^{(b)}\) by sampling with replacement from \({\mathcal {L}}\).
- 3. Estimate a linear predictor model \({\hat{\eta }}_i^{HS(b)}\) by using the HS algorithm based on \({\mathcal {L}}^{(b)}\):
$$\begin{aligned} {\hat{\eta }}_i^{HS(b)} = \sum _{k=1}^{d^{(b)}} I\left( {\hat{\tau }}_{k-1}^{(b)} < x_{i1} \le {\hat{\tau }}_{k}^{(b)}\right) {\varvec{x}}_i'\hat{\varvec{\beta }}_k^{(b)} + {\varvec{z}}_i'{\hat{\varvec{\alpha }}}^{(b)}, \end{aligned}$$
where \(d^{(b)}\) is the known number of change points d or, if the number of change points is unknown, is estimated by using the AIC or BIC based on \({\mathcal {L}}^{(b)}\).
- 4.
end
- 5. The estimated linear predictor model is given by
$$\begin{aligned} {\hat{\eta }}_i^\mathrm{Bag} = \frac{1}{B}\sum _{b=1}^B \left\{ \sum _{k=1}^{d^{(b)}} I\left( {\hat{\tau }}_{k-1}^{(b)} < x_{i1} \le {\hat{\tau }}_{k}^{(b)}\right) {\varvec{x}}_i'\hat{\varvec{\beta }}_k^{(b)} + {\varvec{z}}_i'{\hat{\varvec{\alpha }}}^{(b)} \right\} . \end{aligned}$$ (4)
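The bagging procedure above can be sketched as follows. This is an illustrative toy, not the paper's MATLAB implementation: it resamples cases with replacement (the former bootstrap method described above), refits a piecewise Gaussian-mean model with a compact HS routine, and averages the B predictions as in (4); all names and the data-generating model are hypothetical, and the number of change points is treated as known.

```python
import numpy as np

def fit_piecewise(x, y, taus):
    """Fit per-segment means (a Gaussian GLM toy) and return a predictor."""
    cuts = sorted(taus)
    bounds = [-np.inf] + cuts + [np.inf]
    means = np.array([y[(x > lo) & (x <= hi)].mean()
                      for lo, hi in zip(bounds, bounds[1:])])
    return lambda xn: means[np.searchsorted(cuts, xn, side='left')]

def hs(x, y, d, q=10):
    """Compact hierarchical splitting: greedily add the split minimizing total SSE."""
    taus = []
    for _ in range(d):
        best = (np.inf, None)
        for t in np.unique(x)[:-1]:
            bounds = [-np.inf] + sorted(taus + [t]) + [np.inf]
            segs = [y[(x > lo) & (x <= hi)] for lo, hi in zip(bounds, bounds[1:])]
            if min(len(s) for s in segs) < q:      # minimum-segment-size rule
                continue
            sse = sum(((s - s.mean()) ** 2).sum() for s in segs)
            if sse < best[0]:
                best = (sse, t)
        taus.append(best[1])
    return sorted(taus)

def bagged_predictor(x, y, d, B=20, seed=1):
    """Average the predictions of B piecewise fits on bootstrap samples."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(B):
        idx = rng.integers(0, len(x), len(x))      # resample cases with replacement
        xb, yb = x[idx], y[idx]
        fits.append(fit_piecewise(xb, yb, hs(xb, yb, d)))
    return lambda xn: np.mean([f(xn) for f in fits], axis=0)

rng = np.random.default_rng(1)
x = rng.uniform(0, 9, 200)
y = np.where(x <= 3, 0.0, np.where(x <= 6, 2.0, 4.0)) + rng.normal(0, 0.3, 200)
predict = bagged_predictor(x, y, d=2)
preds = predict(np.array([1.5, 4.5, 7.5]))
print(preds)   # each value should be close to the segment means 0, 2 and 4
```

Averaging the B base predictions, rather than averaging the change points themselves, is what smooths the discontinuities of any single HS fit.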
By the central limit theorem, if the bootstrap distribution used in bagging mimics the population distribution well, the estimator \({\hat{\tau }}_k^*\) is expected to follow a normal distribution asymptotically. On the other hand, \({\tilde{\tau }}_k^*\) is expected to be robust against extreme estimates. Because the variance of the change point estimators given by the HS algorithm is expected to be large, extreme estimates can occur; we expect \({\tilde{\tau }}_k^*\) to deal with this problem. To clarify the notation, we write the estimate of \(\tau _k\) given by the HS algorithm on \({\mathcal {L}}\) as \({\hat{\tau }}_k^\mathrm{HS}\). The performance of these estimators is compared through simulations in Sect. 5.
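The defining equations of the aggregated estimators ((5) and (6)) are not reproduced in this excerpt; following the later description of an "estimator based on the mean value of the bootstrap estimates" and the robustness remark above, we assume \({\hat{\tau }}_k^*\) is the mean and \({\tilde{\tau }}_k^*\) the median of the B replicate estimates \({\hat{\tau }}_k^{(b)}\), as in this sketch with hypothetical replicate values:

```python
import numpy as np

# Hypothetical bootstrap replicate estimates of one change point, with
# one extreme value; the two aggregations behave differently under it.
taus_b = np.array([2.9, 3.1, 3.0, 3.2, 2.8, 6.5])
tau_hat_star = taus_b.mean()        # pulled upward by the extreme replicate 6.5
tau_tilde_star = np.median(taus_b)  # robust against the extreme replicate
print(tau_hat_star, tau_tilde_star)
```

With these values the mean lands well above 3 while the median stays near 3, illustrating the robustness trade-off discussed above.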
4 Confidence intervals
When the empirical distribution has a long tail or is very skewed, the interval (8) is expected to be much longer than (9); in this case, (8) will be too conservative compared with (9). On the other hand, when the distribution is almost symmetric, the two intervals will be almost the same. In the next section, we compare the performance of these intervals through simulation studies.
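Since the definitions of the intervals (7)-(9) are not reproduced in this excerpt, the following sketch shows one plausible reading consistent with the names used in Sect. 5 (\({\mathrm{CI}}_{\mathrm{basic}}\), \({\mathrm{CI}}_{\mathrm{equal}}\), \({\mathrm{CI}}_{\mathrm{unequal}}\)): a basic bootstrap interval, an equal-tailed percentile interval, and a shortest (unequal-tailed) percentile interval, all built from the bootstrap replicates. This mapping is an assumption, not the paper's definition.

```python
import numpy as np

def bootstrap_intervals(theta_hat, reps, alpha=0.05):
    """Three candidate bootstrap intervals from replicate estimates `reps`."""
    reps = np.sort(np.asarray(reps, dtype=float))
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    ci_basic = (2 * theta_hat - hi, 2 * theta_hat - lo)  # basic bootstrap
    ci_equal = (lo, hi)                                  # equal-tailed percentile
    # shortest window containing a fraction (1 - alpha) of the replicates
    k = int(np.ceil((1 - alpha) * len(reps)))
    widths = reps[k - 1:] - reps[:len(reps) - k + 1]
    i = int(np.argmin(widths))
    ci_unequal = (reps[i], reps[i + k - 1])
    return ci_basic, ci_equal, ci_unequal

# Hypothetical, roughly symmetric replicates around a change point estimate
# of 3.0: for a symmetric distribution all three intervals nearly coincide.
rng = np.random.default_rng(0)
reps = 3.0 + rng.normal(0.0, 0.5, 2000)
ci_basic, ci_equal, ci_unequal = bootstrap_intervals(3.0, reps)
print(ci_equal)
```

For a skewed replicate distribution, the equal-tailed interval becomes longer than the shortest-window interval, matching the conservativeness noted above.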
5 Simulations
5.1 Models and setting
The numbers of learning samples n used are 100 and 300. For efficient calculation, we set the minimum number of learning samples included in a segment to 10. When the number of change points is unknown, we set the maximum search number of segments \(d^{\max }\) to 6. The number of bootstrap iterations B is set to 500. Since there were no major differences in the simulation results under other settings of these values, we show only the results under these settings here. Simulations are repeated 300 times for every data group.
We used a workstation with an Intel Core i7-4600U CPU (Intel Corp.). All simulations and analyses were performed in MATLAB (Version R2017b, MathWorks Inc.). A typical single simulation run to obtain \({\hat{\eta }}^\mathrm{HS}\) and \({\hat{\eta }}^\mathrm{Bag}\) for Model 1 with \(n = 100\) required 2 and 443 s, respectively. For \(n=300\), it was 8 s for \({\hat{\eta }}^\mathrm{HS}\) and 2330 s for \({\hat{\eta }}^\mathrm{Bag}\). For Model 2 with \(n=100\), the typical running times for \({\hat{\eta }}^\mathrm{HS}\) and \({\hat{\eta }}^\mathrm{Bag}\) were 1.5 and 313 s, respectively. For \(n=300\), it was 7 s for \({\hat{\eta }}^\mathrm{HS}\) and 2070 s for \({\hat{\eta }}^\mathrm{Bag}\).
5.2 Comparison of model prediction accuracy
Table 1 Comparison of the simulation results of model prediction accuracy for the HS algorithm and the bagging algorithm when the true model contains change points (standard deviations in parentheses)
| Model | n | Algorithm | d known | d unknown (AIC) | d unknown (BIC) |
| --- | --- | --- | --- | --- | --- |
| Model 1 | 100 | \({\hat{\eta }}^\mathrm{HS}\) | 0.22 (0.13) | 0.24 (0.12) | 0.24 (0.14) |
| | | \({\hat{\eta }}^\mathrm{Bag}\) | 0.16 (0.07) | 0.18 (0.07) | 0.16 (0.07) |
| | 300 | \({\hat{\eta }}^\mathrm{HS}\) | 0.08 (0.05) | 0.12 (0.04) | 0.06 (0.03) |
| | | \({\hat{\eta }}^\mathrm{Bag}\) | 0.06 (0.03) | 0.07 (0.02) | 0.05 (0.02) |
| Model 2 | 100 | \({\hat{\eta }}^\mathrm{HS}\) | 9.17 (3.65) | 7.58 (4.28) | 7.14 (3.78) |
| | | \({\hat{\eta }}^\mathrm{Bag}\) | 5.56 (2.15) | 5.65 (2.50) | 5.00 (2.17) |
| | 300 | \({\hat{\eta }}^\mathrm{HS}\) | 6.54 (1.81) | 3.25 (1.34) | 1.92 (1.06) |
| | | \({\hat{\eta }}^\mathrm{Bag}\) | 3.61 (1.05) | 2.12 (0.77) | 1.63 (0.68) |
For Model 1, which is the multivariate regression model with two change points, there is only a slight difference in the results of \({\hat{\eta }}^\mathrm{Bag}\) between the cases where the number of change points is known and unknown. For the results of \({\hat{\eta }}^\mathrm{HS}\) when \(n=300\), the model obtained by using BIC is about twice as accurate as that obtained by using AIC. The reason for this is that AIC tends to overestimate the number of change points.
For Model 2, which is the Poisson regression model with two change points, the model obtained by using BIC has the greatest accuracy of the three patterns (d known, AIC, and BIC). This result is somewhat strange, because the accuracy of the model obtained when d is unknown is higher than when d is known. One reason is that the change point estimates for \({\hat{\eta }}^\mathrm{HS}\) may be substantially incorrect; this is discussed with the next simulation result.
For \({\hat{\eta }}^\mathrm{Bag}\), this result seems to be due to the diversity of the base models. That is, when the number of change points is known, the base models that constitute the model (4) have the same number of segments, whereas when the number of change points is unknown, the base models have several different numbers of segments. As a result, the diversity of the base models included in the estimated model is higher when d is unknown than when d is known. See Zhou (2012) for details on the diversity of the bagging algorithm.
5.3 Comparison of change point estimators
Table 2 Comparison of the three change point estimators by simulation. The values in the table represent the averages of the estimators over 300 simulations; the values in parentheses represent the standard deviations of the estimators
| Model | n | Estimator | \(\tau _1 = 3\) | \(\tau _2 = 6\) |
| --- | --- | --- | --- | --- |
| Model 1 | 100 | \({\hat{\tau }}_k^\mathrm{HS}\) | 3.16 (1.02) | 5.88 (0.63) |
| | | \({\hat{\tau }}_k^*\) | 3.29 (0.44) | 5.97 (0.35) |
| | | \({\tilde{\tau }}_k^*\) | 3.31 (0.69) | 5.95 (0.36) |
| | 300 | \({\hat{\tau }}_k^\mathrm{HS}\) | 2.99 (0.76) | 5.98 (0.19) |
| | | \({\hat{\tau }}_k^*\) | 3.14 (0.41) | 5.92 (0.16) |
| | | \({\tilde{\tau }}_k^*\) | 3.11 (0.60) | 5.99 (0.13) |
| Model 2 | 100 | \({\hat{\tau }}_k^\mathrm{HS}\) | 3.29 (0.78) | 5.15 (0.84) |
| | | \({\hat{\tau }}_k^*\) | 3.24 (0.39) | 5.31 (0.41) |
| | | \({\tilde{\tau }}_k^*\) | 3.33 (0.56) | 5.24 (0.68) |
| | 300 | \({\hat{\tau }}_k^\mathrm{HS}\) | 3.41 (0.63) | 5.13 (0.82) |
| | | \({\hat{\tau }}_k^*\) | 3.33 (0.30) | 5.13 (0.37) |
| | | \({\tilde{\tau }}_k^*\) | 3.38 (0.46) | 5.12 (0.71) |
For Model 1, the average values of the estimates for \({\hat{\tau }}_k^\mathrm{HS}\), \({\hat{\tau }}_k^*\), and \({\tilde{\tau }}_k^*\) are similar. On the other hand, the standard deviations differ greatly: that of \({\hat{\tau }}_k^*\) is the smallest, followed by that of \({\tilde{\tau }}_k^*\), and that of \({\hat{\tau }}_k^\mathrm{HS}\) is the largest. From Fig. 3, it can be seen that the empirical distributions of the estimates \({\hat{\tau }}_1^\mathrm{HS}\), \({\hat{\tau }}_1^*\), and \({\tilde{\tau }}_1^*\) of \(\tau _1\) are unimodal, and the dispersion of \({\hat{\tau }}_1^*\) is the smallest, though its center is slightly biased. Although the center of \({\hat{\tau }}_1^\mathrm{HS}\) is almost unbiased, its standard deviation is about twice as large as that of \({\hat{\tau }}_1^*\). The empirical distributions of \({\hat{\tau }}_2^\mathrm{HS}\) and \({\tilde{\tau }}_2^*\) for \(\tau _2\) are slightly bimodal towards the center, whereas \({\hat{\tau }}_2^*\) has a unimodal empirical distribution with an almost unbiased center.
For Model 2, the average values of the estimates for \({\hat{\tau }}_k^\mathrm{HS}\), \({\hat{\tau }}_k^*\), and \({\tilde{\tau }}_k^*\) are slightly biased. In particular, for \(\tau _2\), the differences between the average estimates and the true value are about 0.8 for all estimators. As for Model 1, the standard deviation of \({\hat{\tau }}_k^*\) is the smallest, followed by that of \({\tilde{\tau }}_k^*\), and that of \({\hat{\tau }}_k^\mathrm{HS}\) is the largest; the standard deviation of \({\hat{\tau }}_k^\mathrm{HS}\) is almost double that of \({\hat{\tau }}_k^*\) in all patterns. From Fig. 4, it is clear that the empirical distributions of the estimators are biased. As for Model 1, \({\hat{\tau }}_1^*\) and \({\hat{\tau }}_2^*\) have unimodal empirical distributions. The empirical distributions of \({\hat{\tau }}_1^\mathrm{HS}\) and \({\tilde{\tau }}_1^*\) for \(\tau _1\) are skewed towards \(\tau _2\). The empirical distributions of \({\hat{\tau }}_2^\mathrm{HS}\) and \({\tilde{\tau }}_2^*\) for \(\tau _2\) are clearly bimodal, with one mode at about the true value of \(\tau _2\) and the other biased towards \(\tau _1\).
These simulations show that, depending on the model, the obtained estimates of the change points can have an obvious bias, which tends to be pulled in the direction of the other true change point. The empirical distribution of \({\hat{\tau }}_k^*\) tends to be unimodal with a small standard deviation. On the other hand, the empirical distributions of \({\hat{\tau }}_k^\mathrm{HS}\) and \({\tilde{\tau }}_k^*\) have similar shapes, but the standard deviation of \({\tilde{\tau }}_k^*\) tends to be smaller than that of \({\hat{\tau }}_k^\mathrm{HS}\).
5.4 Comparison of confidence intervals
To compare the accuracy and length of the confidence intervals (7), (8), and (9) for the change points, we used the percentage of simulations in which the true change point is included in the constructed interval, together with the average and standard deviation of the interval lengths over all simulations. The level for constructing the intervals is set to \(1-\alpha =0.95\). The results are described in Table 3.
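The coverage percentages and interval lengths just described can be computed from the simulation output as in the following sketch; the helper name and the interval values are hypothetical.

```python
import numpy as np

# For each of the R simulated data sets, record whether the interval
# covered the true change point and how long it was; summarize both.
def summarize_intervals(intervals, tau_true):
    intervals = np.asarray(intervals, dtype=float)   # shape (R, 2): lower, upper
    covered = (intervals[:, 0] <= tau_true) & (tau_true <= intervals[:, 1])
    lengths = intervals[:, 1] - intervals[:, 0]
    return 100 * covered.mean(), lengths.mean(), lengths.std()

cis = [(2.1, 4.0), (2.5, 3.6), (3.2, 4.1), (1.9, 3.3)]  # four simulated intervals
prop, mean_len, std_len = summarize_intervals(cis, tau_true=3.0)
print(prop, mean_len)   # 75.0% coverage; average length 1.325
```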
Table 3 Comparison of the simulation results of the three confidence intervals. Prop. denotes the percentage of simulations in which the interval contained the true change point; Length (std.) denotes the average interval length with its standard deviation in parentheses
| Model | n | Interval | \(\tau _1 = 3\): Prop. | \(\tau _1 = 3\): Length (std.) | \(\tau _2 = 6\): Prop. | \(\tau _2 = 6\): Length (std.) |
| --- | --- | --- | --- | --- | --- | --- |
| Model 1 | 100 | \({\mathrm{CI}}_{\mathrm{basic}}\) | 56.0 | 3.9 (0.8) | 82.7 | 3.2 (1.3) |
| | | \({\mathrm{CI}}_{\mathrm{equal}}\) | 100.0 | 3.9 (0.8) | 100.0 | 3.2 (1.3) |
| | | \({\mathrm{CI}}_{\mathrm{unequal}}\) | 100.0 | 3.6 (0.8) | 100.0 | 2.8 (1.3) |
| | 300 | \({\mathrm{CI}}_{\mathrm{basic}}\) | 60.3 | 2.9 (0.6) | 87.7 | 0.9 (0.8) |
| | | \({\mathrm{CI}}_{\mathrm{equal}}\) | 99.7 | 2.9 (0.6) | 93.7 | 0.9 (0.8) |
| | | \({\mathrm{CI}}_{\mathrm{unequal}}\) | 98.7 | 2.7 (0.6) | 91.0 | 0.7 (0.7) |
| Model 2 | 100 | \({\mathrm{CI}}_{\mathrm{basic}}\) | 49.0 | 2.8 (0.6) | 43.7 | 3.0 (1.0) |
| | | \({\mathrm{CI}}_{\mathrm{equal}}\) | 98.3 | 2.8 (0.6) | 97.3 | 3.0 (1.0) |
| | | \({\mathrm{CI}}_{\mathrm{unequal}}\) | 97.0 | 2.5 (0.6) | 96.0 | 2.7 (0.9) |
| | 300 | \({\mathrm{CI}}_{\mathrm{basic}}\) | 44.3 | 2.1 (0.5) | 40.7 | 2.0 (0.2) |
| | | \({\mathrm{CI}}_{\mathrm{equal}}\) | 98.3 | 2.1 (0.5) | 88.7 | 2.0 (0.2) |
| | | \({\mathrm{CI}}_{\mathrm{unequal}}\) | 96.0 | 1.9 (0.5) | 86.7 | 1.9 (0.3) |
For Model 1, all intervals are too wide when the sample size is small, and the results of \({\mathrm{CI}}_{\mathrm{equal}}\) and \({\mathrm{CI}}_{\mathrm{unequal}}\) become conservative. \({\mathrm{CI}}_{\mathrm{basic}}\) does not work well. When the number of samples increases, all intervals give results closer to the nominal level. Since the length of \({\mathrm{CI}}_{\mathrm{basic}}\) is the same as that of \({\mathrm{CI}}_{\mathrm{equal}}\), the performance of \({\mathrm{CI}}_{\mathrm{equal}}\) is higher than that of \({\mathrm{CI}}_{\mathrm{basic}}\) in these situations. This is presumably because the empirical distribution of the estimators is symmetric and centered on \({\hat{\tau }}_k^\mathrm{HS}\), which is confirmed by the nearly symmetric shape of the histogram in Fig. 3a.
For Model 2, \({\mathrm{CI}}_{\mathrm{basic}}\) often fails to work. A possible reason is that the confidence interval constructed from the bootstrap distribution around the estimate does not converge to a correct confidence interval for the true parameter under the population distribution. \({\mathrm{CI}}_{\mathrm{equal}}\) and \({\mathrm{CI}}_{\mathrm{unequal}}\) deviate slightly from the nominal level, especially for \(\tau _2\) when \(n = 300\); this is considered to be due to the bias of the estimator \({\hat{\tau }}_k^\mathrm{HS}\) from the true value. As a result of these simulations, we recommend using \({\mathrm{CI}}_{\mathrm{equal}}\) or \({\mathrm{CI}}_{\mathrm{unequal}}\) when constructing a confidence interval for a change point based on the HS algorithm.
6 Example
We present a simple application of the change point analysis discussed in this study, using data from a study of female horseshoe crabs on an island in the Gulf of Mexico. These data were used by Agresti (2013), and the data set is available from his website. The data consist of 173 female crabs. The response variable of interest is the number of male crabs that cluster around a female crab during spawning. In this study, we used the carapace width as the explanatory variable with piecewise different coefficient vectors within the model. In addition to the carapace width, we used the crab's weight as the explanatory variable with a common coefficient vector. That is, \(x_{i} = \mathrm{``carapace~width~(cm)''}\) and \(z_{i} = \mathrm{``crab's~weight~(kg)''}\). As in the simulation studies, the number of bootstrap iterations B is set to 500.
To proceed with the analysis discussed in this study, we need to fix the number of change points. It is well known that the AIC tends to overestimate the number of parameters in a model; in addition, it is not an asymptotically consistent estimator of the model order. Taking this into account, we considered 2 to be the optimal number of change points for these data. Therefore, in the following analysis, we treat the number of change points as known, with \(d - 1 = 2\).
For the estimation of the locations of the change points, the estimates given by the HS algorithm are \({\hat{\tau }}_1^\mathrm{HS} = 25.15\) and \({\hat{\tau }}_2^\mathrm{HS} = 26.05\), as described in (14). On the other hand, the estimates given by (5) are \({\hat{\tau }}_1^* = 25.28\) and \({\hat{\tau }}_2^* = 27.16\), and the estimates given by (6) are \({\tilde{\tau }}_1^* = 25.15\) and \({\tilde{\tau }}_2^* = 27.35\). For \(\tau _1\), there is only a small difference between the three estimates. For \(\tau _2\), however, the estimates based on the bootstrap method are more than one unit larger than the value of \({\hat{\tau }}_2^\mathrm{HS}\). Because the simulation results in Table 2 showed that the standard error of \({\hat{\tau }}_2^\mathrm{HS}\) can be high, we need to be aware that the value of \(\tau _2\) may be higher than 26.05.
Based on the simulation results in Table 3, we consider \({\mathrm{CI}}_{\mathrm{equal}}\) and \({\mathrm{CI}}_{\mathrm{unequal}}\) to be more suitable than \({\mathrm{CI}}_{\mathrm{basic}}\). Therefore, we calculate the confidence intervals of \({\hat{\tau }}_k^\mathrm{HS}\) given by (8) and (9). The 95% confidence intervals for \({\hat{\tau }}_1^\mathrm{HS}\) are \({\mathrm{CI}}_{\mathrm{equal}}=[24.05, 26.40]\) and \({\mathrm{CI}}_{\mathrm{unequal}}=[24.05, 26.15]\). The difference between these intervals is small, and we conclude that the first change point of the carapace width against the number of clusters is in the interval of about 24-26 cm.
The 95% confidence intervals for \({\hat{\tau }}_2^\mathrm{HS}\) given by (8) and (9) are \({\mathrm{CI}}_{\mathrm{equal}}=[25.55, 28.40]\) and \({\mathrm{CI}}_{\mathrm{unequal}}=[25.85, 28.45]\), respectively. Considering that the model (13), which has three change points, gives 28.10 as the third change point, that the values \({\hat{\tau }}_2^*\) and \({\tilde{\tau }}_2^*\) are greater than 27, and that the upper bounds of \({\mathrm{CI}}_{\mathrm{equal}}\) and \({\mathrm{CI}}_{\mathrm{unequal}}\) exceed 28, we conclude that the second change point of the carapace width against the number of clusters is in the interval of about 25.5-28.5 cm.
7 Conclusion
The application of the HS algorithm is widely used as a standard approach for multiple change point analysis. The algorithm is easy to execute and computationally efficient. However, there is a risk that the change point estimates obtained by the algorithm do not coincide with the maximum likelihood estimates; as a result, their variance increases and consistency and asymptotic normality are lost. To deal with this problem, we focused on the application of the bootstrap method based on the HS algorithm in GLMs with piecewise different coefficients. In particular, we studied the utility of the method from three viewpoints: improvement of the prediction accuracy by bagging, reduction of the standard error of the change point estimators, and construction of confidence intervals for the change points.
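The HS algorithm proceeds by greedy binary segmentation: find the best single split of the data, then recurse on each resulting segment until no split improves the fit enough. The following is a minimal mean-shift sketch on a one-dimensional sequence with a least-squares criterion, not the paper's GLM-based procedure; the gain threshold and all names are illustrative.

```python
import numpy as np

def best_split(y, lo, hi):
    """Best single split of y[lo:hi] by residual sum-of-squares reduction."""
    seg = y[lo:hi]
    total = len(seg) * seg.var()
    best_k, best_gain = None, 0.0
    for k in range(lo + 2, hi - 1):          # at least 2 points per side
        left, right = y[lo:k], y[k:hi]
        ss = len(left) * left.var() + len(right) * right.var()
        if total - ss > best_gain:
            best_k, best_gain = k, total - ss
    return best_k, best_gain

def hierarchical_split(y, min_gain=10.0):
    """Greedy binary segmentation: split, then recurse on each half."""
    points, stack = [], [(0, len(y))]
    while stack:
        lo, hi = stack.pop()
        k, gain = best_split(y, lo, hi)
        if k is not None and gain > min_gain:
            points.append(k)
            stack.extend([(lo, k), (k, hi)])
    return sorted(points)

# Toy sequence with mean shifts at indices 50 and 100.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 0.5, 50),
                    rng.normal(3, 0.5, 50),
                    rng.normal(0, 0.5, 50)])
points = hierarchical_split(y)
```

The greediness is the source of the risk discussed above: each split is locally optimal given the earlier splits, so the resulting set of change points need not maximize the overall likelihood.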
The following main results were obtained from the simulation studies. First, the prediction accuracy of the model obtained by the bagging algorithm was almost always higher than that of the model obtained by the HS algorithm. A surprising result was that the model obtained by the bagging algorithm when the number of change points was estimated in each base model was more accurate than the model obtained when the number of change points was known. The likely reason is the diversity of the base models; therefore, further developments of the algorithm, such as the VR-Tree ensemble (Liu et al. 2008), could be considered.
Second, there was little difference between the average values of the change point estimators obtained by the HS algorithm and the bootstrap method. Depending on the settings of the true model, both estimators are biased, but the standard error of the bootstrap estimator is smaller than that of the HS estimator. Moreover, the estimator based on the mean of the bootstrap estimates has a unimodal and nearly symmetric distribution.
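The mean-of-bootstrap estimator referred to above can be sketched as follows. The base estimator here is a crude least-squares threshold estimate standing in for the HS-based GLM fit, and all names and settings are illustrative assumptions rather than the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def threshold_estimate(xy):
    """Least-squares threshold in x at which the mean of y shifts."""
    order = np.argsort(xy[:, 0])
    xs, ys = xy[order, 0], xy[order, 1]
    n = len(ys)
    best_tau, best_ss = xs[0], np.inf
    for k in range(5, n - 5):            # keep a few points on each side
        ss = k * ys[:k].var() + (n - k) * ys[k:].var()
        if ss < best_ss:
            best_tau, best_ss = xs[k], ss
    return best_tau

def bagged_estimate(xy, estimator, n_boot=100):
    """Average the estimator over pairs-bootstrap resamples of the rows."""
    n = len(xy)
    boots = [estimator(xy[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return float(np.mean(boots))

# Toy data: the mean of y jumps by 1 at x = 26.
x = rng.uniform(20.0, 32.0, 200)
y = (x > 26.0).astype(float) + rng.normal(0.0, 0.3, 200)
xy = np.column_stack([x, y])
tau_single = threshold_estimate(xy)                 # single-fit estimate
tau_bagged = bagged_estimate(xy, threshold_estimate)
```

Averaging over resamples smooths the discrete, jumpy criterion surface of the single-fit estimator, which is the intuition behind the reduced standard error and the more nearly symmetric distribution observed in the simulations.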
Third, \({\mathrm{CI}}_{\mathrm{equal}}\) and \({\mathrm{CI}}_{\mathrm{unequal}}\) are recommended when constructing a confidence interval for a change point based on the HS algorithm. Although the coverage of these intervals deviates somewhat from the nominal level, they are closer to it than the intervals based on the basic bootstrap method.
Through the change point analysis of the study of female horseshoe crabs, we showed a simple application of the discussed methods. The plot of the mean models given by the HS algorithm and the bagging algorithm showed that the latter model is reasonable and represents the data well. Based on the plot, the estimates of the change points, and the confidence intervals, we identified a somewhat conservative range in which the change points are likely to lie.
Several tasks require further research. In this study, we compared the widely used HS algorithm and the bagging algorithm from the perspective of prediction accuracy. Other methods worth comparing include models constructed by non-parametric or Bayesian approaches; a comparison of these methods through an extensive simulation study is a subject for future research.
In addition, many elements are necessary for constructing an appropriate model, such as variable selection, detection of interactions, sensitivity analysis, and confirmation of the linearity of the explanatory variables in the linear predictor. Moreover, in this study we focused on changes in the mean and/or variance structure by considering piecewise different coefficients in the GLM. As an extension, the detection of variance change points in the model is an important problem, and further extended methods need to be considered to deal with it. These tasks will be the subject of future studies.
References
- Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Hoboken, New Jersey: Wiley.
- Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csáki (Eds.), Proceedings of the 2nd International Symposium on Information Theory (pp. 267–281). Budapest.
- Bai, J., & Perron, P. (2003). Computation and analysis of multiple structural change models. Journal of Applied Econometrics, 18, 1–22.
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
- Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. (1984). Classification and Regression Trees. California: Wadsworth.
- Brown, R. L., Durbin, J., & Evans, J. M. (1975). Techniques for testing the constancy of regression relationships over time. Journal of the Royal Statistical Society Series B, 37, 149–192.
- Chen, J., & Gupta, A. K. (1997). Testing and locating variance changepoints with application to stock prices. Journal of the American Statistical Association, 92, 739–747.
- Chen, J., & Gupta, A. K. (2012). Parametric Statistical Change Point Analysis (2nd ed.). New York: Birkhäuser.
- Csörgő, M., & Horváth, L. (1997). Limit Theorems in Change-Point Analysis. New York: John Wiley & Sons.
- Davis, R. A., Lee, T. C. M., & Rodriguez-Yam, G. A. (2006). Structural break estimation for nonstationary time series models. Journal of the American Statistical Association, 101, 223–239.
- Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and their Application. Cambridge: Cambridge University Press.
- Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Boca Raton, Florida: Chapman and Hall/CRC Press.
- Fox, J. (2015). Applied Regression Analysis and Generalized Linear Models (3rd ed.). Thousand Oaks: Sage Publications.
- Gurevich, G., & Vexler, A. (2005). Change point problems in the model of logistic regression. Journal of Statistical Planning and Inference, 131, 313–331.
- Hawkins, D. M. (1977). Testing a sequence of observations for a shift in location. Journal of the American Statistical Association, 72, 180–186.
- Hawkins, D. M. (2001). Fitting multiple change-point models to data. Computational Statistics & Data Analysis, 37, 323–341.
- Holbert, D. (1982). A Bayesian analysis of a switching linear model. Journal of Econometrics, 19, 77–87.
- Inclán, C. (1993). Detection of multiple changes of variance using posterior odds. Journal of Business and Economic Statistics, 11, 289–300.
- James, B. J., James, K. L., & Siegmund, D. (1987). Tests for a change-point. Biometrika, 74, 71–84.
- Kim, H. (1994). Tests for a change-point in linear regression. IMS Lecture Notes–Monograph Series, 23, 170–176.
- Kim, H., & Siegmund, D. (1989). The likelihood ratio test for a change-point in simple linear regression. Biometrika, 76, 409–423.
- Küchenhoff, H., & Carroll, R. J. (1997). Segmented regression with errors in predictors: Semi-parametric and parametric methods. Statistics in Medicine, 16, 169–188.
- Liu, F. T., Ting, K. M., Yu, Y., & Zhou, Z. H. (2008). Spectrum of variable-random trees. Journal of Artificial Intelligence Research, 32, 355–384.
- Lu, Q., Lund, R., & Lee, T. C. M. (2010). An MDL approach to the climate segmentation problem. The Annals of Applied Statistics, 4, 299–319.
- Quandt, R. E. (1958). The estimation of parameters of a linear regression system obeying two separate regimes. Journal of the American Statistical Association, 53, 873–880.
- Quandt, R. E. (1960). Tests of the hypothesis that a linear regression system obeys two separate regimes. Journal of the American Statistical Association, 55, 324–330.
- Rissanen, J. (2007). Information and Complexity in Statistical Modeling. New York: Springer.
- Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.
- Smith, P. L. (1979). Splines as a useful and convenient statistical tool. The American Statistician, 33, 57–62.
- Stasinopoulos, D. M., & Rigby, R. A. (1992). Detecting break points in generalised linear models. Computational Statistics & Data Analysis, 13, 461–471.
- Ulm, K. (1991). A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine, 10, 341–349.
- Worsley, K. J. (1979). On the likelihood ratio test for a shift in location of normal populations. Journal of the American Statistical Association, 74, 365–367.
- Wu, Y. (2008). Simultaneous change point analysis and variable selection in a regression problem. Journal of Multivariate Analysis, 99, 2154–2171.
- Zhou, Z. H. (2012). Ensemble Methods: Foundations and Algorithms. Boca Raton: Chapman and Hall/CRC Press.