Optimal segmentation using tree models

  • Regular Paper
Knowledge and Information Systems

Abstract

Sequence data are abundant in application areas such as computational biology, environmental sciences, and telecommunications. Many real-life sequences have a strong segmental structure, with segments of different complexities. In this paper we study the description of sequence segments using variable length Markov chains (VLMCs), also known as tree models. We discover the segment boundaries of a sequence and at the same time we compute a VLMC for each segment. We use the Bayesian information criterion (BIC) and a variant of the minimum description length (MDL) principle that uses the Krichevsky-Trofimov (KT) code length to select the number of segments of a sequence. On DNA data the method selects segments that closely correspond to the annotated regions of the genes.
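
The idea of choosing segment boundaries and the number of segments by code length can be illustrated with a minimal sketch. It is a deliberate simplification and not the paper's algorithm: each segment is described by a zero-order (memoryless) model instead of a VLMC, the per-segment cost is the sequential KT code length, boundaries for a fixed number of segments are found by dynamic programming, and the number of segments is chosen by adding an assumed cost of log2(n) bits per boundary. The function names, the boundary penalty, and the `max_segments` parameter are all hypothetical.

```python
import math
from collections import Counter


def kt_symbol_bits(count, seen, alphabet_size):
    """Bits to code one symbol under the KT estimator.

    count: earlier occurrences of this symbol in the segment.
    seen:  number of symbols already coded in the segment.
    """
    p = (count + 0.5) / (seen + alphabet_size / 2.0)
    return -math.log2(p)


def kt_code_length(segment, alphabet_size):
    """Zero-order KT code length of a segment, in bits."""
    counts = Counter()
    bits = 0.0
    for seen, symbol in enumerate(segment):
        bits += kt_symbol_bits(counts[symbol], seen, alphabet_size)
        counts[symbol] += 1
    return bits


def segment_sequence(sequence, max_segments, alphabet_size):
    """Return boundaries [0, ..., n] of the best segmentation.

    Assumes a non-empty sequence. The penalty of log2(n) bits per
    boundary is an assumption of this sketch, not taken from the paper.
    """
    n = len(sequence)
    # cost[i][j] = KT code length of sequence[i:j], built incrementally.
    cost = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        counts = Counter()
        bits = 0.0
        for j in range(i, n):
            symbol = sequence[j]
            bits += kt_symbol_bits(counts[symbol], j - i, alphabet_size)
            counts[symbol] += 1
            cost[i][j + 1] = bits

    # best[k][j] = minimal cost of covering sequence[:j] with k segments.
    k_max = min(max_segments, n)
    best = [[math.inf] * (n + 1) for _ in range(k_max + 1)]
    back = [[0] * (n + 1) for _ in range(k_max + 1)]
    best[0][0] = 0.0
    for k in range(1, k_max + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i

    # Choose the number of segments by total description length.
    def total_bits(k):
        return best[k][n] + (k - 1) * math.log2(n)

    best_k = min(range(1, k_max + 1), key=total_bits)

    # Walk the back-pointers to recover the boundaries.
    boundaries = [n]
    j = n
    for k in range(best_k, 0, -1):
        j = back[k][j]
        boundaries.append(j)
    return boundaries[::-1]
```

For example, `segment_sequence("A" * 40 + "C" * 40, 4, 4)` splits the toy sequence at position 40. Replacing `kt_code_length` with the code length of a fitted tree model per segment would bring the sketch closer to the setting the abstract describes, at the price of a more expensive cost table.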

Author information

Corresponding author

Correspondence to Robert Gwadera.

About this article

Cite this article

Gwadera, R., Gionis, A. & Mannila, H. Optimal segmentation using tree models. Knowl Inf Syst 15, 259–283 (2008). https://doi.org/10.1007/s10115-007-0091-5

  • DOI: https://doi.org/10.1007/s10115-007-0091-5
