Multi-relational Data Mining for Tetratricopeptide Repeats (TPR)-Like Superfamily Members in Leishmania spp.: Acting-by-Connecting Proteins

  • Karen T. Girão
  • Fátima C. E. Oliveira
  • Kaio M. Farias
  • Italo M. C. Maia
  • Samara C. Silva
  • Carla R. F. Gadelha
  • Laura D. G. Carneiro
  • Ana C. L. Pacheco
  • Michel T. Kamimura
  • Michely C. Diniz
  • Maria C. Silva
  • Diana M. Oliveira
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5265)


The multi-relational data mining (MRDM) approach looks for patterns that span multiple tables of a relational database, where the complex, structured objects being mined require several tables in their normalized representation. We applied MRDM methods (relational association rule discovery and probabilistic relational models) together with hidden Markov models (HMMs) and the Viterbi algorithm (VA) to mine tetratricopeptide repeat (TPR), pentatricopeptide repeat (PPR) and half-a-TPR (HAT) motifs in the genomes of pathogenic protozoa of the genus Leishmania. The TPR is a protein-protein interaction module, and TPR-containing proteins (TPRPs) act as scaffolds for the assembly of different multiprotein complexes. Our aim is to build a comprehensive panel of the TPR-like superfamily in Leishmania. Distributed relational state representations for complex stochastic processes were applied to the identification, clustering and classification of Leishmania genes, and we detected the following putative members of the TPR-like superfamily: 104 TPRPs, 36 PPR-containing proteins (PPRPs) and 8 HAT-containing proteins (HATPs). We also compared currently available resources (Pfam, SMART, SUPERFAMILY and TPRpred) with our approach (MRDM/HMM/VA).
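As a rough illustration of the Viterbi step mentioned in the abstract, the sketch below decodes the most likely state path through a toy two-state HMM (background vs. repeat) over a reduced hydrophobic/polar alphabet. It is not the model used in the paper: the state names, the alphabet and every probability are hypothetical values chosen only to show the recurrence.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # V[t][s]: best log-probability of any state path ending in s at position t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path

# Hypothetical toy model: "bg" = background, "rep" = inside a helical repeat.
# Symbols: "h" = hydrophobic residue, "p" = polar residue.
STATES = ("bg", "rep")
START_P = {"bg": 0.9, "rep": 0.1}
TRANS_P = {"bg": {"bg": 0.9, "rep": 0.1}, "rep": {"bg": 0.1, "rep": 0.9}}
EMIT_P = {"bg": {"h": 0.3, "p": 0.7}, "rep": {"h": 0.8, "p": 0.2}}

if __name__ == "__main__":
    log_prob, path = viterbi("pppphhhhhhhhpppp", STATES, START_P, TRANS_P, EMIT_P)
    print(log_prob, path)
```

Working in log space avoids numerical underflow on long sequences. A profile HMM for a real 34-residue TPR motif would have many more states (match, insert and delete per column), but the dynamic-programming recurrence is the same.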


Keywords: multi-relational data mining; hidden Markov models; Viterbi algorithm; tetratricopeptide repeat motif; Leishmania proteins



Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  1. Núcleo Tarcisio Pimenta de Pesquisa Genômica e Bioinformática - NUGEN, Faculdade de Veterinária, Universidade Estadual do Ceará - UECE, Fortaleza, Brazil (all authors)
