A Similar Fragments Merging Approach to Learn Automata on Proteins

  • François Coste
  • Goulven Kerbellec
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)


We propose learning automata to characterize protein families, in order to overcome the limitations of the position-specific characterizations classically used in Pattern Discovery. We introduce a new heuristic approach for learning non-deterministic automata, based on selecting and ordering significantly similar fragments to be merged and on identifying physico-chemical properties. The quality of the characterization of the major intrinsic protein (MIP) family is assessed by leave-one-out cross-validation over a wide range of model specificities.


Hidden Markov Model, Pattern Discovery, Learn Automaton, Fragment Pair, Negative Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • François Coste (1)
  • Goulven Kerbellec (1)
  1. Symbiose, IRISA, Rennes Cedex, France
