Regularization techniques in joinpoint regression

Abstract

Joinpoint regression models are popular in many settings, such as modeling trends in economic, mortality, and incidence series, epidemiological studies, and clinical trials. The literature on joinpoint regression mostly adopts either a frequentist or a Bayesian point of view, and in both cases the model selection step considers only a limited set of candidate models from which the final model is chosen. We present a different estimation approach: the final model is selected from all possible alternatives admitted by the data. We apply the \(L_{1}\)-regularization idea and, via the sparsity principle, identify significant joinpoint locations to construct the final model. Some theoretical results and practical examples are given as well.


References

  • Blanks RG, Moss SM, McGahan CE, Quinn MJ, Babb PJ (2000) Effect of NHS breast screening programme on mortality from breast cancer in England and Wales, 1990–8: comparison of observed with predicted mortality. BMJ 321:665–669

  • Bosetti C, Bertuccio P, Levi F, Lucchini F, Negri E, La Vecchia C (2008) Cancer mortality in the European Union, 1970–2003, with a joinpoint analysis. Ann Oncol 16:631–640

  • Carlin BP, Gelfand AE, Smith AFM (1992) Hierarchical Bayesian analysis of changepoint problems. Appl Stat 41:389–405

  • Efron B, Hastie T, Tibshirani R (2004) Least angle regression. Ann Stat 32:407–499

  • Feder P (1975) The log likelihood ratio in segmented regression. Ann Stat 3:84–97

  • Harchaoui Z, Lévy-Leduc C (2010) Multiple change-point estimation with a total variation penalty. J Am Stat Assoc 105(492):1480–1493

  • Hinkley DV (1971) Inference in two-phase regression. J Am Stat Assoc 66(336):736–743

  • Hudecová Š (2011) Jak na odhad joinpoint regrese. Inf Bull České Stat Společnosti 22:7–20 [in Czech]

  • Hudson D (1966) Fitting segmented curves whose join points have to be estimated. J Am Stat Assoc 61:1097–1129

  • Kim HJ, Fay MP, Feuer EJ, Midthune DN (2000) Permutation tests for joinpoint regression with application to cancer rates. Stat Med 19:335–351

  • Kim HJ, Fay MP, Yu B (2004) Comparability of segmented line regression models. Biometrics 60(4):1005–1014

  • Kim HJ, Yu B, Feuer EJ (2009) Selecting the number of change-points in segmented line regression. Stat Sin 19:597–609

  • Koenker R, Mizera I (2004) Penalized triograms: total variation regularization for bivariate smoothing. J R Stat Soc Ser B 66:1681–1736

  • Koenker R, Ng P, Portnoy S (1994) Quantile smoothing splines. Biometrika 81:673–680

  • Lerman P (1980) Fitting segmented regression models by grid search. Appl Stat 29:77–84

  • Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R (2013) A significance test for the LASSO. Ann Stat 42(2):413–468

  • Mammen E, Van De Geer S (1997) Locally adaptive regression splines. Ann Stat 25(1):387–413

  • Martinez-Beneito M, García-Donato G, Salmerón D (2011) A Bayesian joinpoint regression model with an unknown number of break-points. Ann Appl Stat 5(3):2150–2168

  • Massart P (1996) Concentration inequalities and model selection. École d'Été de Probabilités. Springer, New York

  • Peto R (1996) Five years of tamoxifen or more? J Natl Cancer Inst 88:1791–1793

  • R Core Team (2016) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/

  • Qiu D, Katanoda K, Tomomi M, Tomotaka S (2009) A joinpoint regression analysis of long-term trends in cancer mortality in Japan (1958–2004). Int J Cancer 24:443–448

  • Sprent P (1961) Some hypotheses concerning two-stage regression lines. Biometrics 17:634–645

  • Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc Ser B 58(1):267–288

  • Yu B, Barrett M, Kim HJ, Feuer EJ (2007) Estimating Joinpoints in continuous time scale for multiple change-point models. Comput Stat Data Anal 51(5):2420–2427

  • Zhang NR, Siegmund DO (2007) A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics 63:22–32

Acknowledgments

The authors would like to express their sincere thanks to both referees and the editors for their valuable suggestions, which led to improvements of the paper.

Author information

Corresponding author

Correspondence to Matúš Maciak.

Appendix

Proof of Lemma 1

Let us start with the minimization formulation given in (4). Note that the vector of unknown parameters \(\varvec{\beta } = (a_{1}, b_{1}, b_{2} - b_{1}, \dots , b_{n - 1} - b_{n - 2})^\top \in \mathbb {R}^{n}\) can be decomposed into two parts: the first two parameters \(a_{1}, b_{1} \in \mathbb {R}\) represent the overall intercept and slope of a classical linear regression and are not included in the \(L_{1}\) penalty. The second part consists of the differences \((b_{2} - b_{1}), \dots , (b_{n - 1} - b_{n - 2})\), which we denote by \(\varvec{\beta }_{(-2)} = (\beta _{2}, \dots , \beta _{n - 1})^\top \in \mathbb {R}^{n - 2}\). These parameters are all included in the LASSO penalty in (4).

Thus, the minimization problem (4) can be equivalently expressed as

$$\begin{aligned} \mathop {\mathrm {Minimize}}\limits _{\varvec{\beta }_{0} \in \mathbb {R}^2,\, \varvec{\beta }_{(-2)} \in \mathbb {R}^{n - 2}}\quad \frac{1}{n}\left\| \varvec{Y} - \left( \mathbb {X}_{0}, \mathbb {X}_{(-2)}\right) \left( \begin{array}{c} \varvec{\beta }_{0} \\ \varvec{\beta }_{(-2)} \end{array} \right) \right\| _{2}^{2} + \lambda _{n} \sum _{j = 2}^{n - 1} |\beta _{j}|, \end{aligned}$$
(7)

where \(\varvec{\beta }_{0} = (a_{1}, b_{1})^{\top }\), \(\varvec{\beta }_{(-2)} = (b_{2} - b_{1}, \dots , b_{n - 1} - b_{n - 2})^{\top }\), \(\mathbb {X}_{0}\) is the matrix consisting of the first two columns of the design matrix in (5), and \(\mathbb {X}_{(-2)}\) is the matrix consisting of all remaining columns.
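For readers who wish to experiment with this decomposition, the following minimal numpy sketch builds an illustrative design matrix for a continuous piecewise-linear fit and splits it into \(\mathbb {X}_{0}\) and \(\mathbb {X}_{(-2)}\). Since the design matrix (5) is not reproduced in this excerpt, the truncated-line basis used below is only an assumption chosen for illustration.

```python
import numpy as np

def design_matrix(x):
    """Illustrative design for a continuous piecewise-linear fit:
    columns [1, x_i, (x_i - x_1)_+, ..., (x_i - x_{n-2})_+], so the
    coefficient of the k-th truncated column acts as the slope
    difference beta_k = b_k - b_{k-1}.  (Assumed form; equation (5)
    of the paper is not reproduced in this excerpt.)"""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cols = [np.ones(n), x] + [np.clip(x - x[k], 0.0, None) for k in range(1, n - 1)]
    return np.column_stack(cols)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
X = design_matrix(x)

X0  = X[:, :2]   # unpenalized block: overall intercept and slope (beta_0)
Xm2 = X[:, 2:]   # penalized block: slope differences (beta_{(-2)})
print(X0.shape, Xm2.shape)   # (50, 2) (50, 48)
```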

We can define a hat matrix

$$\begin{aligned} \mathbb {H} = \mathbb {X}_{0} \Big (\mathbb {X}_{0}^{\top } \mathbb {X}_{0}\Big )^{-1} \mathbb {X}_{0}^{\top } \end{aligned}$$
(8)

for the projection onto the linear span of the first two columns of the design matrix \(\mathbb {X}\). Simple matrix algebra (for a fixed \(\varvec{\beta }_{(-2)}\), the unpenalized part of (7) is minimized by \(\widehat{\varvec{\beta }}_{0} = (\mathbb {X}_{0}^{\top }\mathbb {X}_{0})^{-1}\mathbb {X}_{0}^{\top }(\varvec{Y} - \mathbb {X}_{(-2)}\varvec{\beta }_{(-2)})\)) verifies that the projection of \(\varvec{Y}\) onto the linear span of the columns of \(\mathbb {X}\), defined as \(\mathbb {X}_{0}\widehat{\varvec{\beta }}_{0} + \mathbb {X}_{(-2)}\widehat{\varvec{\beta }}_{(-2)}\), for \(\widehat{\varvec{\beta }} = (\widehat{\varvec{\beta }}_{0}^\top , \widehat{\varvec{\beta }}_{(-2)}^\top )^\top \) being the solution of (7), can be equivalently expressed as \(\mathbb {H}\varvec{Y} + (\mathbb {I} - \mathbb {H})\mathbb {X}_{(-2)}\widehat{\varvec{\beta }}_{(-2)}\), where \(\widehat{\varvec{\beta }}_{(-2)}\) now solves the minimization problem

$$\begin{aligned} \mathop {\mathrm {Minimize}}\limits _{\varvec{\beta }_{(-2)} \in \mathbb {R}^{n - 2}}\quad \frac{1}{n} \Vert (\mathbb {I} - \mathbb {H})\varvec{Y} - (\mathbb {I} - \mathbb {H})\mathbb {X}_{(-2)}\varvec{\beta }_{(-2)}\Vert _{2}^{2} + \lambda _{n} \Vert \varvec{\beta }_{(-2)}\Vert _{1}. \end{aligned}$$
(9)

It only remains to set \(\widetilde{\varvec{Y}} = (\mathbb {I} - \mathbb {H})\varvec{Y}\) and \(\widetilde{\mathbb {X}} = (\mathbb {I} - \mathbb {H})\mathbb {X}_{(-2)}\), which completes the proof of Lemma 1. \(\square \)
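The matrix algebra behind this proof is easy to check numerically. The sketch below (again with an assumed truncated-line design, since (5) is not shown here) fixes an arbitrary \(\varvec{\beta }_{(-2)}\), profiles out \(\varvec{\beta }_{0}\) by ordinary least squares, and verifies both the identity \(\mathbb {X}_{0}\widehat{\varvec{\beta }}_{0} + \mathbb {X}_{(-2)}\varvec{\beta }_{(-2)} = \mathbb {H}\varvec{Y} + (\mathbb {I} - \mathbb {H})\mathbb {X}_{(-2)}\varvec{\beta }_{(-2)}\) and the equality of the corresponding residual sums of squares in (7) and (9).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x = np.sort(rng.uniform(0.0, 1.0, size=n))
# illustrative design (assumed form of (5), see the previous sketch)
X = np.column_stack([np.ones(n), x] +
                    [np.clip(x - x[k], 0.0, None) for k in range(1, n - 1)])
X0, Xm2 = X[:, :2], X[:, 2:]
Y = rng.normal(size=n)

H = X0 @ np.linalg.inv(X0.T @ X0) @ X0.T        # hat matrix (8)
I = np.eye(n)
Yt, Xt = (I - H) @ Y, (I - H) @ Xm2             # tilde-Y and tilde-X of Lemma 1

beta_m2 = rng.normal(size=n - 2)                # any fixed beta_{(-2)}
# profiled (unpenalized) beta_0 for this beta_{(-2)}
beta_0 = np.linalg.solve(X0.T @ X0, X0.T @ (Y - Xm2 @ beta_m2))

fit_full = X0 @ beta_0 + Xm2 @ beta_m2
fit_proj = H @ Y + (I - H) @ Xm2 @ beta_m2
print(np.allclose(fit_full, fit_proj))                       # True
print(np.allclose(np.sum((Y - fit_full) ** 2),
                  np.sum((Yt - Xt @ beta_m2) ** 2)))         # True
```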

Proof of Theorem 1

We start from the assertion of Lemma 1. The design points are all drawn from the interval (0, 1) and, without any loss of generality, we assume that they are centered (\(\sum _{i = 1}^{n} X_{i} = 0\)). Using the fact that \(\widehat{\varvec{\beta }}_{(-2)} = (\widehat{\beta }_{2}, \dots , \widehat{\beta }_{n - 1})^\top \in \mathbb {R}^{n - 2}\) is the minimizer of (9), we obtain that

$$\begin{aligned} \frac{1}{n}\Big \Vert \widetilde{\varvec{Y}} - \widetilde{\mathbb {X}}\widehat{\varvec{\beta }}_{(-2)} \Big \Vert _{2}^{2} + \lambda _{n} \Big \Vert \widehat{\varvec{\beta }}_{(-2)} \Big \Vert _{1} \le \frac{1}{n}\Big \Vert \widetilde{\varvec{Y}} - \widetilde{\mathbb {X}}\varvec{\beta }_{(-2)} \Big \Vert _{2}^{2} + \lambda _{n} \Big \Vert \varvec{\beta }_{(-2)} \Big \Vert _{1}, \end{aligned}$$

where \(\varvec{\beta }_{(-2)} = (\beta _{2}, \dots , \beta _{n - 1})^\top \in \mathbb {R}^{n - 2}\) denotes the true but unknown vector of parameters. Using the model formula we have

$$\begin{aligned}&\frac{1}{n}\Big \Vert \widetilde{\mathbb {X}} \Big (\varvec{\beta }_{(-2)} - \widehat{\varvec{\beta }}_{(-2)} \Big ) \Big \Vert _{2}^{2} \le \frac{2}{n} \Big (\varvec{\beta }_{(-2)} - \widehat{\varvec{\beta }}_{(-2)} \Big )^\top \widetilde{\mathbb {X}}^{\top }\varvec{\varepsilon } + \lambda _{n}\Big ( \Vert \varvec{\beta }_{(-2)} \Vert _{1} - \Vert \widehat{\varvec{\beta }}_{(-2)} \Vert _{1} \Big )\\&\quad \le \frac{2}{n} \Big (\varvec{\beta }_{(-2)} - \widehat{\varvec{\beta }}_{(-2)} \Big )^\top \widetilde{\mathbb {X}}^{\top }\varvec{\varepsilon } + \lambda _{n} \left( \sum _{j; \beta _{j} \ne 0} |\beta _{j}| - |\widehat{\beta }_{j}| \right) - \lambda _{n} \sum _{j; \beta _{j} = 0} |\widehat{\beta }_{j}|, \end{aligned}$$

where we distinguish between two disjoint sets of indices for the elements of \(\varvec{\beta }_{(-2)} = (\beta _{2}, \dots , \beta _{n - 1})^\top \) and \(\widehat{\varvec{\beta }}_{(-2)} = (\widehat{\beta }_{2}, \dots , \widehat{\beta }_{n - 1})^\top \), depending on whether the true value of each parameter is zero or not. Recall that we assume \(\varvec{\beta }_{(-2)}\) to be a sparse vector with most elements equal to zero.

Using the definition of the matrix \(\mathbb {H}\) in (8), of \(\widetilde{\mathbb {X}}\) in the proof of Lemma 1, and of the design matrix \(\mathbb {X}\) in (5), we obtain that

$$\begin{aligned}&\Big (\varvec{\beta }_{(-2)} - \widehat{\varvec{\beta }}_{(-2)} \Big )^\top \widetilde{\mathbb {X}}^{\top }\varvec{\varepsilon } \nonumber \\&\quad = \sum _{k = 2}^{n - 1} \big ( \widehat{\beta }_{k} - \beta _{k} \big ) \big ( X_{k} - X_{k - 1} \big ) \left[ \sum _{i = k}^{n} \varepsilon _{i} - \sum _{i = 1}^{n} \sum _{j = k}^{n} \Big (\frac{1}{n} + \frac{X_{i}X_{j}}{\sum _{l = 1}^{n} X_{l}^{2}}\Big ) \varepsilon _{i} \right] \\&\quad \le \sum _{k = 2}^{n - 1}\big ( \widehat{\beta }_{k} - \beta _{k} \big ) \left[ \sum _{i = k}^{n} \underbrace{\Big ( 1 - \sum _{j = k}^n \Big ( \frac{1}{n} + \frac{X_{i}X_{j}}{\sum _{l = 1}^{n} X_{l}^{2}} \Big ) \Big )}_{m_{i}(\varvec{X}, n, k)}\, \varepsilon _{i} - \sum _{i = 1}^{k - 1} \underbrace{\Big ( \sum _{j = k}^n \Big ( \frac{1}{n} + \frac{X_{i}X_{j}}{\sum _{l = 1}^{n} X_{l}^{2}} \Big ) \Big )}_{h_{i}(\varvec{X}, n, k)}\, \varepsilon _{i} \right] , \nonumber \end{aligned}$$
(10)

Note that the expressions \(h_{i}(\varvec{X}, n, k)\) and \(m_{i}(\varvec{X}, n, k)\) are defined only as combinations of elements of the projection matrices \(\mathbb {H}\) and \((\mathbb {I} - \mathbb {H})\). Therefore, conditionally on \(\varvec{X} = (X_{1}, \dots , X_{n})^\top \), we have that

$$\begin{aligned} \sum _{i = 1}^{n} \widetilde{\varepsilon }_{i} = \sum _{i = k}^{n} m_{i}(\varvec{X}, n, k)\varepsilon _{i} - \sum _{i = 1}^{k - 1} h_{i}(\varvec{X}, n, k) \varepsilon _{i}, \end{aligned}$$

where \(\widetilde{\varepsilon }_{i} = - h_{i}(\varvec{X}, n, k) \varepsilon _{i}\) if \(i \le k - 1\) and \(\widetilde{\varepsilon }_{i} = m_{i}(\varvec{X}, n, k)\varepsilon _{i}\) otherwise. This is a sum of independent Gaussian random variables with zero means and variances of at most \((\kappa _{0} + 1)^{2} \sigma ^2\), since one can easily verify that, uniformly for any \(n \in \mathbb {N}\), \(i = 1, \dots , n\) and \(k = 2, \dots , n - 1\), it holds that

$$\begin{aligned} Var ( h_{i}(\varvec{X}, n, k) \varepsilon _{i})&\le \sigma ^2 \left[ \sum _{j = k}^n \Big ( \frac{1}{n} + \frac{1}{\sum _{l = 1}^{n} X_{l}^{2}} \Big )\right] ^2 \le \sigma ^2 \left[ \sum _{j = k}^n \Big (\frac{\kappa _{0} + 1}{n}\Big )\right] ^2\\&= \sigma ^2 \left[ \Big (\frac{n - k + 1}{n}\Big )(\kappa _{0} + 1)\right] ^2 \le (\kappa _{0} + 1)^2 \sigma ^2, \end{aligned}$$

and analogously also

$$\begin{aligned} Var ( m_{i}(\varvec{X}, n, k) \varepsilon _{i})&= \sigma ^2 \left[ 1 - \sum _{j = k}^n \Big ( \frac{1}{n} + \frac{X_{i}X_{j}}{\sum _{l = 1}^{n} X_{l}^{2}} \Big )\right] ^2 \\&\le \sigma ^2 \left( \frac{k - 1}{n} - \frac{X_{i} \sum _{j = k}^{n}X_{j}}{\sum _{l = 1}^{n} X_{l}^{2}}\right) ^2\\&\le \sigma ^2 \left( 1 + \frac{2 n}{\sum _{l = 1}^{n}X_{l}^{2}} + \frac{(n - k + 1)^2}{(\sum _{l = 1}^{n}X_{l}^{2})^2}\right) \le (\kappa _{0} + 1)^2 \sigma ^2, \end{aligned}$$

where in both cases we used the eigenvalue restriction assumption \(\sum _{l = 1}^{n}X_{l}^2 \ge n/\kappa _0 > 0\).
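These variance bounds can also be verified numerically. The sketch below generates uniform design points as in the theorem, takes \(\kappa _0 = n/\sum _{l}X_{l}^{2}\) (the smallest constant compatible with the eigenvalue restriction), evaluates \(h_{i}(\varvec{X}, n, k)\) and \(m_{i}(\varvec{X}, n, k) = 1 - h_{i}(\varvec{X}, n, k)\) over all \(i\) and \(k\), and confirms that their absolute values never exceed \(\kappa _0 + 1\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(0.0, 1.0, size=n)
X = X - X.mean()                       # centering, as assumed in the proof
S = np.sum(X ** 2)
kappa0 = n / S                         # smallest kappa_0 with S >= n / kappa_0

worst = 0.0
for k in range(2, n):                  # k = 2, ..., n - 1, as in the proof
    tail = np.arange(k - 1, n)         # 0-based indices of j = k, ..., n
    h = len(tail) / n + X * np.sum(X[tail]) / S   # h_i(X, n, k) for all i
    m = 1.0 - h                                   # m_i(X, n, k) for all i
    worst = max(worst, np.max(np.abs(h)), np.max(np.abs(m)))

print(worst <= kappa0 + 1.0, worst, kappa0 + 1.0)   # True, with worst well below kappa0 + 1
```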

Using Theorem 3.8 of Massart (1996) we obtain that

$$\begin{aligned} P\left( \frac{1}{n}\sum _{i = 1}^{n} \widetilde{\varepsilon }_{i} > \frac{\lambda _{n}}{2}\right) \le 2 \exp {\Big (- \frac{\lambda _{n}^2 n }{8 (\kappa _{0} + 1)^2 \sigma ^2}\Big )}, \end{aligned}$$

and thus, for \(\lambda _{n} = \sqrt{\frac{\log n}{n}} \sigma K\) we have that with probability at least \(1 - 2e^{-n}\) it holds that

$$\begin{aligned} \frac{1}{n} \Big \Vert \widetilde{\mathbb {X}} \Big (\varvec{\beta }_{(-2)} - \widehat{\varvec{\beta }}_{(-2)} \Big ) \Big \Vert _{2}^{2} \le&\ \lambda _{n} \sum _{k = 2}^{n - 1} | \widehat{\beta }_{k} - \beta _{k} | + \lambda _{n} \Big ( \sum _{j; \beta _{j} \ne 0} |\beta _{j}| - |\widehat{\beta }_{j}| \Big ) \\&- \lambda _{n} \sum _{j; \beta _{j} = 0} |\widehat{\beta }_{j}|, \end{aligned}$$

and given that the first sum can be decomposed with respect to the true zero and nonzero parameters as \(\sum _{k = 2}^{n - 1} | \widehat{\beta }_{k} - \beta _{k} | = \sum _{k; \beta _{k} \ne 0}|\widehat{\beta }_{k} - \beta _{k}| + \sum _{k; \beta _{k} = 0}|\widehat{\beta }_{k}|\), the terms over the zero indices cancel while the triangle inequality gives \(|\widehat{\beta }_{k} - \beta _{k}| + |\beta _{k}| - |\widehat{\beta }_{k}| \le 2|\beta _{k}|\) for the nonzero ones; we thus finally obtain that with probability at least \(1 - 2e^{-n}\) it holds that

$$\begin{aligned}&\frac{1}{n} \Big \Vert \widetilde{\mathbb {X}} \Big (\varvec{\beta }_{(-2)} - \widehat{\varvec{\beta }}_{(-2)} \Big ) \Big \Vert _{2}^{2} \le 2 \lambda _{n} \sum _{k; \beta _{k} \ne 0} |\beta _{k}| \le 2 K \sigma \sqrt{\frac{\log n}{n}} \overline{\beta } M, \end{aligned}$$

where \(M \in \mathbb {N}\) is the maximum number of changepoints in the model and \(\overline{\beta }\) is the maximum allowed magnitude of a change. Using the fact that \((\mathbb {I} - \mathbb {H})\) is an orthogonal projection matrix, we can directly apply the assertion of Lemma 1 to obtain the result of Theorem 1. \(\square \)
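To illustrate Theorem 1 in practice, the following sketch simulates a piecewise-linear signal, applies the reduction of Lemma 1, and solves the resulting LASSO problem (9) with the tuning level \(\lambda _{n} = K\sigma \sqrt{\log n / n}\). The truncated-line design, the constants \(K\) and \(\sigma \), and the use of scikit-learn's Lasso solver (whose objective is \(\tfrac{1}{2n}\Vert \cdot \Vert _{2}^{2} + \alpha \Vert \cdot \Vert _{1}\), hence \(\alpha = \lambda _{n}/2\)) are illustrative assumptions rather than the authors' implementation; the estimated joinpoints are read off as the design points with nonzero slope-difference coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, sigma, K = 300, 0.02, 1.0                     # sigma and K: illustrative choices
x = np.sort(rng.uniform(0.0, 1.0, size=n))

# true signal: continuous, piecewise linear, joinpoints at 0.3 and 0.7
f = 1.0 + 0.5 * x + 3.0 * np.clip(x - 0.3, 0, None) - 4.0 * np.clip(x - 0.7, 0, None)
Y = f + sigma * rng.normal(size=n)

# illustrative design (assumed form of (5)): intercept, slope, slope changes
X = np.column_stack([np.ones(n), x] +
                    [np.clip(x - x[k], 0.0, None) for k in range(1, n - 1)])
X0, Xm2 = X[:, :2], X[:, 2:]

# Lemma 1: project out the unpenalized intercept and slope
H = X0 @ np.linalg.inv(X0.T @ X0) @ X0.T
Yt = Y - H @ Y
Xt = Xm2 - H @ Xm2

lam = K * sigma * np.sqrt(np.log(n) / n)         # lambda_n of Theorem 1
# scikit-learn minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1, while (9)
# uses (1/n)||.||^2 + lambda_n*||.||_1, hence alpha = lam / 2
fit = Lasso(alpha=lam / 2.0, fit_intercept=False, max_iter=100000).fit(Xt, Yt)

joinpoints = x[1:-1][np.abs(fit.coef_) > 1e-8]
print(np.round(joinpoints, 3))   # typically a few locations clustered near 0.3 and 0.7
```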

Let us conclude by noting that for some specific scenarios we can provide stronger asymptotic results. For instance, if the design points \(X_{1}, \dots , X_{n}\) are fixed and, moreover, equidistant, the differences \((X_{i} - X_{i - 1})\) are of order \(1/n\) for all \(i = 2, \dots , n\), which can be used to improve the inequality in (10). Similarly, if the design points are random but follow a known distribution, that distribution can be used to bound the terms \((X_{i} - X_{i - 1})\) in probability, which again improves the overall rate.

About this article

Cite this article

Maciak, M., Mizera, I. Regularization techniques in joinpoint regression. Stat Papers 57, 939–955 (2016). https://doi.org/10.1007/s00362-016-0823-2


Keywords

  • Joinpoint regression
  • Segmented regression
  • Piecewise linear
  • Changepoints
  • LASSO
  • Regularization
  • Model selection

Mathematics Subject Classification

  • 62J07
  • 62F99
  • 62P25