Abstract
Multi-relational data mining (MRDM) looks for patterns that span multiple tables of a relational database, which suits complex, structured objects whose normalized representation requires more than one table. We applied MRDM methods (relational association rule discovery and probabilistic relational models) together with hidden Markov models (HMMs) and the Viterbi algorithm (VA) to mine tetratricopeptide repeat (TPR), pentatricopeptide repeat (PPR) and half-a-TPR (HAT) motifs in the genomes of pathogenic protozoa of the genus Leishmania. The TPR is a protein-protein interaction module, and TPR-containing proteins (TPRPs) act as scaffolds for the assembly of different multiprotein complexes. Our aim is to build a comprehensive panel of the TPR-like superfamily in Leishmania. Distributed relational state representations for complex stochastic processes were applied to the identification, clustering and classification of Leishmania genes, and we detected 104 putative TPRPs, 36 PPRPs and 8 HATPs, which together comprise the TPR-like superfamily. We also compared currently available resources (Pfam, SMART, SUPERFAMILY and TPRpred) with our approach (MRDM/HMM/VA).
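To illustrate the decoding step named in the abstract, here is a minimal sketch of the Viterbi algorithm on a toy two-state HMM. It is not the authors' pipeline: the state names (`background`, `repeat`), the three residue classes and all probabilities are hypothetical values chosen only to keep the example small. A real profile HMM for TPR-like motifs would have many more states and would be trained from sequence alignments.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the most likely path
    through a discrete HMM, computed in log space."""
    # Each trellis column maps state -> (best log score, previous state).
    first = observations[0]
    trellis = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev_col = trellis[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximises the score of reaching s.
            best_prev = max(
                states,
                key=lambda p: prev_col[p][0] + math.log(trans_p[p][s]),
            )
            score = (prev_col[best_prev][0]
                     + math.log(trans_p[best_prev][s])
                     + math.log(emit_p[s][obs]))
            column[s] = (score, best_prev)
        trellis.append(column)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: trellis[-1][s][0])
    best_log_prob = trellis[-1][last][0]
    path = [last]
    for t in range(len(trellis) - 1, 0, -1):
        last = trellis[t][last][1]
        path.append(last)
    path.reverse()
    return best_log_prob, path

# Hypothetical toy model: residues reduced to three coarse classes.
states = ["background", "repeat"]
start_p = {"background": 0.6, "repeat": 0.4}
trans_p = {
    "background": {"background": 0.7, "repeat": 0.3},
    "repeat": {"background": 0.4, "repeat": 0.6},
}
emit_p = {
    "background": {"polar": 0.5, "small": 0.4, "hydrophobic": 0.1},
    "repeat": {"polar": 0.1, "small": 0.3, "hydrophobic": 0.6},
}

observed = ["polar", "small", "hydrophobic"]
best_log_prob, best_path = viterbi(observed, states, start_p, trans_p, emit_p)
print(best_path, math.exp(best_log_prob))
```

With these toy parameters the most probable path is background, background, repeat, with path probability 0.01512. Working in log space avoids numerical underflow on sequences of realistic protein length.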
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Girão, K.T. et al. (2008). Multi-relational Data Mining for Tetratricopeptide Repeats (TPR)-Like Superfamily Members in Leishmania spp.: Acting-by-Connecting Proteins. In: Chetty, M., Ngom, A., Ahmad, S. (eds) Pattern Recognition in Bioinformatics. PRIB 2008. Lecture Notes in Computer Science, vol 5265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88436-1_31
DOI: https://doi.org/10.1007/978-3-540-88436-1_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88434-7
Online ISBN: 978-3-540-88436-1
eBook Packages: Computer Science (R0)