Segmentation of the mean of heteroscedastic data via cross-validation

Arlot, Sylvain; Celisse, Alain

doi:10.1007/s11222-010-9196-x

Published: 24 August 2010

Statistics and Computing, Volume 21, pages 613–632 (2011)

Sylvain Arlot¹ & Alain Celisse²

Abstract

This paper tackles the problem of detecting abrupt changes in the mean of a heteroscedastic signal by model selection, without knowledge of the variations of the noise. A new family of change-point detection procedures is proposed, showing that cross-validation methods can be successful in the heteroscedastic framework, whereas most existing procedures are not robust to heteroscedasticity. The robustness to heteroscedasticity of the proposed procedures is supported by an extensive simulation study, together with recent partial theoretical results. An application to Comparative Genomic Hybridization (CGH) data is provided, showing that robustness to heteroscedasticity can indeed be required for their analysis.
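
The abstract does not spell out the procedures themselves, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes independent noise, fits least-squares segmentations by dynamic programming, and picks the number of segments with a single odd/even hold-out split, which is a much cruder scheme than the cross-validation family the paper proposes. The function names `best_segmentations` and `cv_number_of_segments`, the simulated signal, and all parameter values are hypothetical.

```python
import numpy as np

def best_segmentations(y, max_segments):
    """Least-squares segmentations of y into 1..max_segments pieces.

    Returns {d: sorted list of the d-1 interior change-point indices}.
    """
    n = len(y)
    s1 = np.concatenate(([0.0], np.cumsum(y)))       # prefix sums of y
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))  # prefix sums of y^2

    def cost(i, j):
        # Residual sum of squares of y[i:j] around its own mean.
        return s2[j] - s2[i] - (s1[j] - s1[i]) ** 2 / (j - i)

    best = np.full((max_segments + 1, n + 1), np.inf)
    arg = np.zeros((max_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for d in range(1, max_segments + 1):
        for j in range(d, n + 1):
            # i is the start of the last segment y[i:j].
            cands = [best[d - 1, i] + cost(i, j) for i in range(d - 1, j)]
            k = int(np.argmin(cands))
            best[d, j] = cands[k]
            arg[d, j] = k + d - 1

    out = {}
    for d in range(1, max_segments + 1):
        starts, j = [], n
        for m in range(d, 0, -1):   # backtrack the segment starts
            j = int(arg[m, j])
            starts.append(j)
        out[d] = sorted(starts)[1:]  # drop the leading 0
    return out

def cv_number_of_segments(y, max_segments):
    """Choose the number of segments by an odd/even hold-out split (assumption)."""
    train, test = y[0::2], y[1::2]
    risks = {}
    for d, bps in best_segmentations(train, max_segments).items():
        edges = [0] + bps + [len(train)]
        pred = np.empty(len(train))
        for a, b in zip(edges[:-1], edges[1:]):
            pred[a:b] = train[a:b].mean()
        # Each held-out point is predicted by the fit at its left training neighbour.
        risks[d] = float(np.mean((test - pred[:len(test)]) ** 2))
    return min(risks, key=risks.get), risks

# Simulated heteroscedastic signal: the noise level changes across segments.
rng = np.random.default_rng(0)
mean = np.repeat([0.0, 2.0, -1.0, 1.0], 50)
sd = np.repeat([0.2, 1.0, 0.2, 0.5], 50)
y = mean + sd * rng.normal(size=mean.size)

d_hat, risks = cv_number_of_segments(y, max_segments=8)
print(d_hat, best_segmentations(y, d_hat)[d_hat])
```

The held-out squared error is estimated from the data alone, so no noise variance has to be supplied. That is the property the abstract points to when it contrasts cross-validation with procedures that are not robust to heteroscedasticity.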

Author information

Authors and Affiliations

Willow Project-Team, Laboratoire d’Informatique de l’Ecole Normale Superieure, CNRS/ENS/INRIA UMR 8548, 23 avenue d’Italie, 75214, Paris Cedex 13, France
Sylvain Arlot
Laboratoire de Mathématique Paul Painlevé UMR 8524 CNRS, Université Lille 1, 59 655, Villeneuve d’Ascq Cedex, France
Alain Celisse

Corresponding author

Correspondence to Alain Celisse.

Electronic Supplementary Material

Electronic supplementary material accompanies this article (PDF, 1.20 MB).

About this article

Cite this article

Arlot, S., Celisse, A. Segmentation of the mean of heteroscedastic data via cross-validation. Stat Comput 21, 613–632 (2011). https://doi.org/10.1007/s11222-010-9196-x

Received: 08 April 2009
Accepted: 26 July 2010
Published: 24 August 2010
Issue Date: October 2011
DOI: https://doi.org/10.1007/s11222-010-9196-x
