Knowledge and Information Systems, Volume 15, Issue 3, pp 259–283

Optimal segmentation using tree models

Regular Paper

Abstract

Sequence data are abundant in application areas such as computational biology, environmental sciences, and telecommunications. Many real-life sequences have a strong segmental structure, with segments of different complexities. In this paper we study the description of sequence segments using variable length Markov chains (VLMCs), also known as tree models. We discover the segment boundaries of a sequence and at the same time we compute a VLMC for each segment. We use the Bayesian information criterion (BIC) and a variant of the minimum description length (MDL) principle that uses the Krichevsky-Trofimov (KT) code length to select the number of segments of a sequence. On DNA data the method selects segments that closely correspond to the annotated regions of the genes.
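The idea of scoring candidate segmentations by KT code length can be illustrated with a minimal sketch. It is not the paper's method: each segment is coded with a memoryless (order-0) model rather than a VLMC, the number of segments k is given rather than selected by BIC or MDL, and the function names (kt_code_length, segment_costs, best_segmentation) are hypothetical. The sequential KT estimator assigns the next symbol a the probability (n_a + 1/2) / (n + m/2), where n_a is the count of a so far, n the number of symbols seen, and m the alphabet size.

```python
import math
from collections import Counter


def kt_code_length(seq, alphabet_size=4):
    """KT code length of seq in bits, assuming a memoryless (order-0) model."""
    counts = Counter()
    bits = 0.0
    for n, sym in enumerate(seq):
        # Sequential KT estimate: (n_a + 1/2) / (n + m/2)
        bits -= math.log2((counts[sym] + 0.5) / (n + alphabet_size / 2))
        counts[sym] += 1
    return bits


def segment_costs(seq, alphabet_size=4):
    """cost[i][j] = KT code length of seq[i:j], built incrementally in O(n^2)."""
    n = len(seq)
    cost = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        counts = Counter()
        bits = 0.0
        for j in range(i, n):
            sym = seq[j]
            bits -= math.log2((counts[sym] + 0.5) / ((j - i) + alphabet_size / 2))
            counts[sym] += 1
            cost[i][j + 1] = bits
    return cost


def best_segmentation(seq, k, alphabet_size=4):
    """Split seq into exactly k segments minimizing the total KT code length.

    Returns (total_bits, internal_boundaries). Assumes 1 <= k <= len(seq).
    """
    n = len(seq)
    cost = segment_costs(seq, alphabet_size)
    inf = float("inf")
    # best[s][j]: minimal cost of coding seq[:j] with s segments
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                c = best[s - 1][i] + cost[i][j]
                if c < best[s][j]:
                    best[s][j] = c
                    back[s][j] = i
    # Walk the back-pointers to recover the start index of every segment.
    starts = []
    j = n
    for s in range(k, 0, -1):
        j = back[s][j]
        starts.append(j)
    starts.reverse()
    return best[k][n], starts[1:]  # drop the leading 0


if __name__ == "__main__":
    total_bits, boundaries = best_segmentation("A" * 20 + "C" * 20, k=2)
    print(boundaries, round(total_bits, 2))
```

Choosing the number of segments would additionally require a penalty for the boundaries and model parameters, which this sketch leaves out; replacing the order-0 cost with a per-segment tree-model cost would bring it closer to the setting the abstract describes.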

Keywords

Sequence segmentation; MDL; DNA segmentation; Sequence data mining

Copyright information

© Springer-Verlag London Limited 2007

Authors and Affiliations

  • Robert Gwadera (1)
  • Aristides Gionis (1)
  • Heikki Mannila (1)

  1. HIIT, Basic Research Unit, Helsinki University of Technology and University of Helsinki, Helsinki, Finland
