Computational Statistics, Volume 22, Issue 1, pp 49–69

Using a VOM model for reconstructing potential coding regions in EST sequences

Original Paper

Abstract

This paper presents a method for annotating coding and noncoding DNA regions using variable order Markov (VOM) models. A main advantage of VOM models is that their order may vary across sequences, depending on the sequences’ statistics. As a result, VOM models are more flexible in their parameterization and can be trained on relatively short sequences and on low-quality datasets, such as expressed sequence tags (ESTs). The paper also presents a modified VOM model for detecting and correcting the insertion and deletion sequencing errors that are commonly found in ESTs. In a series of experiments, the proposed method is found to be robust to random errors in these sequences.
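The idea of a model whose order depends on the data can be illustrated with a minimal sketch. This is not the paper's modified VOM model or its error-correction procedure. It assumes a simple count threshold (`min_count`) in place of statistical pruning of a context tree, and add-one smoothing over the four nucleotides. The names `train_vom`, `next_symbol_prob`, and `log_likelihood` are hypothetical and chosen for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: plain nucleotide alphabet, no ambiguity codes


def train_vom(sequences, max_order=3, min_count=2):
    """Count next-symbol frequencies for every context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(0, min(max_order, i) + 1):
                context = seq[i - k:i]  # the k symbols preceding position i
                counts[context][symbol] += 1
    # Keep only contexts seen at least min_count times (illustrative pruning);
    # the empty context is always kept as the order-0 fallback.
    return {
        ctx: dict(nxt)
        for ctx, nxt in counts.items()
        if ctx == "" or sum(nxt.values()) >= min_count
    }


def next_symbol_prob(model, history, symbol, max_order=3):
    """Predict with the longest suffix of history that is a retained context."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - k:]
        if context in model:
            nxt = model[context]
            total = sum(nxt.values())
            # Add-one smoothing so unseen symbols keep nonzero probability.
            return (nxt.get(symbol, 0) + 1) / (total + len(ALPHABET))
    return 1 / len(ALPHABET)  # untrained model: uniform guess


def log_likelihood(model, seq, max_order=3):
    """Base-2 log-likelihood of seq under the model."""
    return sum(
        math.log2(next_symbol_prob(model, seq[:i], s, max_order))
        for i, s in enumerate(seq)
    )
```

In this sketch the effective order differs from position to position: a long context is used only where the training data supports it, and shorter contexts are used elsewhere. Two such models, one trained on coding and one on noncoding sequences, could then be compared by their log-likelihoods on a query sequence, again as an illustration rather than the authors' procedure.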

Keywords

Variable order Markov model; Coding and noncoding DNA; Context tree; Gene annotation; Sequencing error detection and correction

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  1. Department of Information Systems Engineering, Ben-Gurion University, Beer-Sheva, Israel
  2. Department of Industrial Engineering, Tel-Aviv University, Tel-Aviv, Israel
