Optimal segmentation using tree models

Gwadera, Robert; Gionis, Aristides; Mannila, Heikki

doi:10.1007/s10115-007-0091-5

Regular Paper
Published: 28 July 2007

Volume 15, pages 259–283, (2008)
Knowledge and Information Systems

Abstract

Sequence data are abundant in application areas such as computational biology, environmental sciences, and telecommunications. Many real-life sequences have a strong segmental structure, with segments of different complexities. In this paper we study the description of sequence segments using variable length Markov chains (VLMCs), also known as tree models. We discover the segment boundaries of a sequence and at the same time we compute a VLMC for each segment. We use the Bayesian information criterion (BIC) and a variant of the minimum description length (MDL) principle that uses the Krichevsky-Trofimov (KT) code length to select the number of segments of a sequence. On DNA data the method selects segments that closely correspond to the annotated regions of the genes.
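
The idea of jointly choosing boundaries and the number of segments by code length can be illustrated with a small sketch. This is not the authors' algorithm: it replaces the per-segment VLMC with a memoryless (order-0) model, scores each segment by its KT code length, finds the best k-segmentation by dynamic programming, and picks k with an assumed penalty of log2(n) bits per boundary. The names `kt_code_length` and `segment` are hypothetical.

```python
import math

def kt_code_length(counts, m):
    """KT code length in bits of a segment with the given symbol counts,
    under an order-0 model over an alphabet of size m."""
    n = sum(counts)
    # log of prod_a Gamma(n_a + 1/2)/Gamma(1/2) * Gamma(m/2)/Gamma(n + m/2)
    log_p = math.lgamma(m / 2) - math.lgamma(n + m / 2)
    log_p += sum(math.lgamma(c + 0.5) - math.lgamma(0.5) for c in counts)
    return -log_p / math.log(2)

def segment(seq, max_k, alphabet):
    """Return (interior boundaries, total bits) of the best segmentation
    with at most max_k segments. Order-0 stand-in for a VLMC per segment."""
    n, m = len(seq), len(alphabet)
    index = {a: i for i, a in enumerate(alphabet)}
    # prefix[t][a] = number of occurrences of symbol a in seq[:t]
    prefix = [[0] * m]
    for x in seq:
        row = prefix[-1][:]
        row[index[x]] += 1
        prefix.append(row)

    def cost(i, j):
        return kt_code_length([prefix[j][a] - prefix[i][a] for a in range(m)], m)

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(max_k + 1)]
    back = [[0] * (n + 1) for _ in range(max_k + 1)]
    best[0][0] = 0.0
    # best[k][j] = cheapest encoding of seq[:j] using exactly k segments
    for k in range(1, max_k + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], back[k][j] = c, i

    def total(k):
        # Assumed penalty: log2(n) bits to encode each interior boundary.
        return best[k][n] + (k - 1) * math.log2(n)

    k_star = min(range(1, min(max_k, n) + 1), key=total)
    starts, j = [], n
    for k in range(k_star, 0, -1):
        j = back[k][j]
        starts.append(j)
    return starts[::-1][1:], total(k_star)  # drop the leading 0

boundaries, bits = segment("A" * 30 + "C" * 30, max_k=4, alphabet="ACGT")
```

On this toy input the selected segmentation has a single boundary between the run of A's and the run of C's. The triple loop is quadratic in the sequence length for each k, so the sketch is only suitable for short sequences.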

Author information

Authors and Affiliations

HIIT, Basic Research Unit, Helsinki University of Technology and University of Helsinki, Helsinki, Finland
Robert Gwadera, Aristides Gionis & Heikki Mannila

Corresponding author

Correspondence to Robert Gwadera.

About this article

Cite this article

Gwadera, R., Gionis, A. & Mannila, H. Optimal segmentation using tree models. Knowl Inf Syst 15, 259–283 (2008). https://doi.org/10.1007/s10115-007-0091-5

Received: 28 March 2007
Accepted: 28 April 2007
Published: 28 July 2007
Issue Date: June 2008
DOI: https://doi.org/10.1007/s10115-007-0091-5
