Pattern Analysis and Applications, Volume 19, Issue 3, pp 793–805

Hidden Markov models for gene sequence classification

Classifying the VSG gene in the Trypanosoma brucei genome
  • Andrea Mesa
  • Sebastián Basterrech
  • Gustavo Guerberoff
  • Fernando Alvarez-Valin
Short Paper


This article presents an application of hidden Markov models (HMMs) to pattern recognition in genome sequences. We apply HMMs to identify genes encoding the variant surface glycoprotein (VSG) in the genomes of Trypanosoma brucei (T. brucei) and other African trypanosomes. These parasitic protozoa are the causative agents of sleeping sickness and of several diseases in domestic and wild animals. They evade the host’s immune system with a peculiar strategy: they periodically change their predominant cell-surface protein (VSG). The motivation for using pattern recognition methods to identify these genes, instead of traditional homology-based ones, is that the levels of sequence identity (amino acid and DNA) among these genes are often below what is considered reliable for homology-based methods. Among pattern recognition approaches, HMMs are particularly suitable for this problem because they handle the determination of gene boundaries more naturally. We evaluate the model with different numbers of states in the Markov model and with several performance metrics. The model is applied to public genomic data. Our empirical results show that the VSG genes of T. brucei can be reliably identified (high sensitivity and a low false-positive rate) using HMMs.
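To make the idea in the abstract concrete, the following is a minimal sketch of how an HMM can label each position of a DNA sequence and thereby locate gene boundaries, followed by the two metrics the abstract mentions (sensitivity and false-positive rate). It is an illustration under stated assumptions, not the authors' model: the two states `N` (non-gene) and `G` (gene), all probabilities, and the function names `viterbi` and `sensitivity_and_fpr` are hypothetical, and the metrics are computed per position rather than per gene.

```python
import math

STATES = ("N", "G")  # hypothetical labels: N = non-gene, G = gene

# Illustrative parameters only; none of these values come from the article.
START = {"N": 0.9, "G": 0.1}
TRANS = {
    "N": {"N": 0.9, "G": 0.1},
    "G": {"N": 0.1, "G": 0.9},
}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq):
    """Return the most probable state path for seq (Viterbi in log space)."""
    if not seq:
        return []
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = best_prev
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][symbol]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


def sensitivity_and_fpr(true_labels, predicted_labels, positive="G"):
    """Per-position sensitivity TP/(TP+FN) and false-positive rate FP/(FP+TN)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    return sensitivity, fpr


if __name__ == "__main__":
    toy_sequence = "AAAAAACCCCCCCCAAAAAA"  # toy input, not real genomic data
    predicted = viterbi(toy_sequence)
    print("".join(predicted))
```

In this toy setting the points where the decoded path switches between `N` and `G` play the role of gene edges. In practice the parameters would be estimated from annotated sequences (for example with the Baum-Welch algorithm), and the number of states would be varied as the abstract describes.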


Hidden Markov model · Classification · Gene sequence classification · Trypanosoma brucei · Variant surface glycoprotein



This article was prepared within the framework of the project New creative teams in priorities of scientific research, reg. no. CZ.1.07/2.3.00/30.0055, supported by the Operational Programme Education for Competitiveness and co-financed by the European Social Fund and the state budget of the Czech Republic. It was also supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, and by the project SP2015/105 DPDM-Database of Performance and Dependability Models of the Student Grant System, VSB-Technical University of Ostrava.



Copyright information

© Springer-Verlag London 2015

Authors and Affiliations

  • Andrea Mesa (1)
  • Sebastián Basterrech (2), corresponding author
  • Gustavo Guerberoff (3)
  • Fernando Alvarez-Valin (4)
  1. Departamento de Métodos Matemáticos Cuantitativos, Facultad de Ciencias Económicas y Administración, Universidad de la República, Montevideo, Uruguay
  2. National Supercomputing Center, VŠB-Technical University of Ostrava, Ostrava-Poruba, Czech Republic
  3. Facultad de Ingeniería, Instituto de Matemática y Estadística, Universidad de la República, Montevideo, Uruguay
  4. Sección Biomatemática, Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay
