Methodology and Computing in Applied Probability, Volume 16, Issue 4, pp 987–1008

Non-Parametric Change-Point Estimation using String Matching Algorithms

  • Oliver Johnson
  • Dino Sejdinovic
  • James Cruise
  • Robert Piechocki
  • Ayalvadi Ganesh

Abstract

Given the output of a data source taking values in a finite alphabet, we wish to estimate change-points, that is, times when the statistical properties of the source change. Motivated by ideas of match lengths in information theory, we introduce a novel non-parametric estimator which we call CRECHE (CRossings Enumeration CHange Estimator). We present simulation evidence that this estimator performs well, both for simulated sources and for real data formed by concatenating text sources. For example, we show that we can accurately estimate the point at which a source changes from a Markov chain to an IID source with the same stationary distribution. Our estimator requires no assumptions about the form of the source distribution, and avoids the need to estimate its probabilities. Further, we establish consistency of the CRECHE estimator under a related toy model, by establishing a fluid limit and using martingale arguments.
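As an illustration of the test case mentioned in the abstract, the following minimal sketch generates a sequence that switches from a Markov chain to an IID source with the same stationary distribution. It does not implement CRECHE. The function names, the two-state transition matrix, the sequence length and the change-point are all assumptions chosen for illustration and are not taken from the paper.

```python
import random


def stationary_distribution(P, iters=1000):
    """Approximate the stationary distribution of the transition matrix P
    by repeatedly applying P to a uniform starting vector (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def simulate_change(P, n, change_point, seed=0):
    """Return (sequence, pi): the first change_point symbols follow the
    Markov chain P started from its stationary distribution pi, and the
    remaining n - change_point symbols are IID draws from pi."""
    rng = random.Random(seed)
    states = list(range(len(P)))
    pi = stationary_distribution(P)
    x = [rng.choices(states, weights=pi)[0]]
    for _ in range(1, change_point):
        # Next symbol depends only on the current one (Markov segment).
        x.append(rng.choices(states, weights=P[x[-1]])[0])
    # After the change-point, symbols are independent draws from pi.
    x.extend(rng.choices(states, weights=pi, k=n - change_point))
    return x, pi


# Hypothetical example: a "sticky" two-state chain, change at position 1200.
P = [[0.9, 0.1], [0.2, 0.8]]
seq, pi = simulate_change(P, n=2000, change_point=1200)
```

Both segments have the same marginal symbol frequencies, so a method that only compares single-symbol counts before and after a candidate point would see no difference here. The change is visible only in the dependence between successive symbols, which is why this is a demanding test case for a change-point estimator.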

Keywords

Change-point estimation, Entropy, Non-parametric, String matching

AMS 2000 Subject Classifications

Primary 62L10; Secondary 62M09, 68W32

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Oliver Johnson (1)
  • Dino Sejdinovic (2)
  • James Cruise (3)
  • Robert Piechocki (4)
  • Ayalvadi Ganesh (1)

  1. School of Mathematics, University of Bristol, Bristol, UK
  2. Gatsby Computational Neuroscience Unit, University College London, London, UK
  3. The Department of Actuarial Mathematics and Statistics, and the Maxwell Institute for Mathematical Sciences, Heriot-Watt University, Edinburgh Campus, Edinburgh, Scotland
  4. Centre for Communications Research, University of Bristol, Bristol, UK
