Abstract
Identifying adenylation domains (A-domains) and their substrate specificity can aid the detection of nonribosomal peptide synthetases (NRPS) at the genome/proteome level and allows the structure of oligopeptides with relevant biological activities to be inferred. However, this is a challenging task because A-domains are highly diverse in sequence (~10–40% amino acid identity) and are selective for 50 different natural/unnatural amino acids. Together, these characteristics make both their detection and the prediction of their substrate specificity difficult with traditional sequence alignment methods, e.g., BLAST searches. In this chapter we describe two workflows based on alignment-free methods for the identification and substrate specificity prediction of A-domains.
To identify A-domains we introduce a graphical-numerical method, implemented in TI2BioP version 2.0 (topological indices to biopolymers), which in a first step uses protein four-color maps to represent A-domains. In a second step, simple topological indices (TIs), called spectral moments, are derived from the graphical representations of known A-domains (positive set) and of unrelated but well-characterized sequences (negative set). The spectral moments are then used as input predictors for statistical classification techniques to build alignment-free models. Finally, the resulting models can be used to explore entire proteomes for unannotated A-domains. In addition, this graphical-numerical methodology works as a sequence-search method that can be combined in an ensemble with homology-based tools to explore the A-domain signature in depth and cope with the diversity of this class (Aguero-Chapin et al., PLoS One 8(7):e65926, 2013).
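As a rough illustration of the second step, the sketch below computes spectral moments as traces of successive powers of an adjacency matrix, mu_k = tr(A^k). It is a toy under stated assumptions and not the TI2BioP procedure: the graph is a simplified stand-in for a four-color map, it uses a node adjacency matrix, and the four-class residue grouping, the `window` parameter, and all function names are hypothetical.

```python
# Toy sketch only: this graph is a simplified stand-in, NOT the four-color
# map construction implemented in TI2BioP.

# Hypothetical four-class grouping of the 20 standard amino acids.
CLASS_LETTERS = {
    "polar": "STNQCYGH",
    "nonpolar": "AVLIMFWP",
    "acidic": "DE",
    "basic": "KR",
}
RESIDUE_CLASS = {aa: cls for cls, letters in CLASS_LETTERS.items() for aa in letters}


def adjacency(seq, window=3):
    """Connect sequence neighbours, plus same-class residues within `window` positions."""
    n = len(seq)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + window + 1)):
            if j == i + 1 or RESIDUE_CLASS[seq[i]] == RESIDUE_CLASS[seq[j]]:
                adj[i][j] = adj[j][i] = 1
    return adj


def matmul(a, b):
    """Plain-Python product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(adj, max_order=5):
    """Return [mu_1, ..., mu_max_order], where mu_k is the trace of adj**k."""
    power = [row[:] for row in adj]
    moments = []
    for _ in range(max_order):
        moments.append(sum(power[i][i] for i in range(len(adj))))
        power = matmul(power, adj)
    return moments


if __name__ == "__main__":
    # "ADK" gives a three-node path graph under the assumptions above.
    print(spectral_moments(adjacency("ADK"), max_order=4))  # [0, 4, 0, 8]
```

In a workflow like the one described, a vector of such moments per sequence would serve as the predictor set for a classifier trained on the positive and negative sets.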
The second workflow, for the prediction of A-domain substrate specificity, is based on alignment-free models constructed with transductive support vector machines (TSVMs) that incorporate information from uncharacterized A-domains. The construction of the models was implemented in the NRPSpredictor. In a first step, it uses the physicochemical fingerprint of the 34 residues lining the active site of the phenylalanine-adenylation domain of gramicidin S synthetase A [PDB ID 1AMU] to derive a feature vector. Homologous positions were extracted for A-domains with known and unknown substrate specificities and turned into feature vectors. At the same time, A-domains with known specificities towards similar substrates were clustered by the physicochemical properties of their amino acid (AA) substrates. In a second step, support vector machines (SVMs) were optimized from the feature vectors of characterized A-domains in each of the resulting clusters. The SVMs were then applied in their transductive variant (TSVMs), which integrates a fraction of uncharacterized A-domains during training, to predict unknown specificities. Finally, uncharacterized A-domains were scored by each of the constructed alignment-free models (TSVMs), one per substrate specificity cluster. The model producing the largest score for an uncharacterized A-domain assigns its substrate specificity (Rausch et al., Nucleic Acids Res 33:5799–5808, 2005).
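The final assignment step (score a domain with every cluster model and keep the largest score) can be sketched as follows. This is a minimal illustration, not NRPSpredictor code: the property table, the two cluster labels, and the linear weights are invented for the example, the models are plain linear decision functions standing in for trained TSVMs, and only two residue positions are encoded instead of 34.

```python
# Illustrative only: all numbers below are made up, and the linear models are
# untrained stand-ins for the TSVMs described in the text.

# Hypothetical per-residue properties: (hydrophobicity-like, size-like).
PROPERTIES = {
    "A": (0.6, 0.2),
    "D": (-0.9, 0.3),
    "F": (0.9, 0.8),
    "K": (-0.8, 0.7),
    "S": (-0.2, 0.1),
}


def encode(residues):
    """Concatenate the property values of each active-site residue into one vector."""
    vector = []
    for aa in residues:
        vector.extend(PROPERTIES[aa])
    return vector


def decision_value(model, vector):
    """Linear decision function w.x + b; `model` is a (weights, bias) pair."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, vector)) + bias


def predict_specificity(models, residues):
    """Score the domain with every cluster model and return the best label and all scores."""
    vector = encode(residues)
    scores = {label: decision_value(model, vector) for label, model in models.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# One hypothetical model per substrate cluster (weights are assumptions).
MODELS = {
    "hydrophobic-aromatic": ([1.0, 1.0, 0.5, 0.0], -0.5),
    "acidic": ([-1.0, 0.0, -0.5, 0.0], 0.0),
}

if __name__ == "__main__":
    label, scores = predict_specificity(MODELS, "FA")
    print(label, scores)
```

Taking the maximum over independent per-cluster decision values is a one-vs-rest scheme; it assumes the scores of the different models are comparable in scale.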
References
Jenke-Kodama H, Dittmann E (2009) Bioinformatic perspectives on NRPS/PKS megasynthases: advances and challenges. Nat Prod Rep. doi:10.1039/b810283j
Ansari MZ, Yadav G, Gokhale RS et al (2004) NRPS-PKS: a knowledge-based resource for analysis of NRPS/PKS megasynthases. Nucleic Acids Res. doi:10.1093/nar/gkh359
Randic M, Zupan J, Balaban AT et al (2011) Graphical representation of proteins. Chem Rev. doi:10.1021/cr800198j
Gonzalez-Diaz H, Vilar S, Santana L et al (2007) Medicinal chemistry and bioinformatics – current trends in drugs discovery with networks topological indices. Curr Top Med Chem 7:1015–1029
Estrada E, Uriarte E (2001) Recent advances on the role of topological indices in drug discovery research. Curr Med Chem 8:1573–1588
Randic M (2004) Graphical representation of DNA as a 2-D map. Chem Phys Lett 386:468–471
Randic M, Zupan J, Vikic-Topic D (2007) On representation of proteins by star-like graphs. J Mol Graph Model 26:290–305
Randic M, Zupan J (2004) Highly compact 2D graphical representation of DNA sequences. SAR QSAR Environ Res 15:191–205
Nandy A (1994) Recent investigations into global characteristics of long DNA sequences. Indian J Biochem Biophys 31:149–155
Randic M, Zupan J (2001) On interpretation of well-known topological indices. J Chem Inf Comput Sci 41:550–560
Gonzalez-Diaz H, Gonzalez-Diaz Y, Santana L et al (2008) Proteomics, networks and connectivity indices. Proteomics. doi:10.1002/pmic.200700638
Aguero-Chapin G, Molina-Ruiz R, Maldonado E et al (2013) Exploring the adenylation domain repertoire of nonribosomal peptide synthetases using an ensemble of sequence-search methods. PLoS One 8(7):e65926. doi:10.1371/journal.pone.0065926
Aguero-Chapin G, Varona-Santos J, de la Riva GA et al (2009) Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from Coffea arabica and prediction of a new sequence. J Proteome Res 8:2122–2128
Aguero-Chapin G, Sánchez-Rodríguez A, Hidalgo-Yanes PI et al (2011) An alignment-free approach for eukaryotic ITS2 annotation and phylogenetic inference. PLoS One 6:e26638
Cruz-Monteagudo M, Gonzalez-Diaz H, Borges F et al (2008) 3D-MEDNEs: an alternative “in silico” technique for chemical research in toxicology. 2. Quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy. Chem Res Toxicol. doi:10.1021/tx700296t
Estrada E (1996) Spectral moments of the edge adjacency matrix in molecular graphs. 1. Definition and applications to the prediction of physical properties of alkanes. J Chem Inf Comput Sci 36:844–849
von Döhren H, Dieckmann R, Pavela-Vrancic M (1999) The nonribosomal code. Chem Biol 6:273–279
Stachelhaus T, Mootz HD, Marahiel MA (1999) The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases. Chem Biol. doi:10.1016/s1074-5521(99)80082-9
Conti E, Franks NP, Brick P (1996) Crystal structure of firefly luciferase throws light on a superfamily of adenylate-forming enzymes. Structure 4:287–298
Conti E, Stachelhaus T, Marahiel MA, Brick P (1997) Structural basis for the activation of phenylalanine in the non-ribosomal biosynthesis of gramicidin S. EMBO J 16(14):4174–4183
Challis GL, Ravel J, Townsend CA (2000) Predictive, structure-based model of amino acid recognition by nonribosomal peptide synthetase adenylation domains. Chem Biol 7:211–224
Rausch C, Weber T, Kohlbacher O et al (2005) Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res 33:5799–5808
Bastanlar Y, Ozuysal M (2014) Introduction to machine learning. Methods Mol Biol. doi:10.1007/978-1-62703-748-8_7
Schölkopf B, Tsuda K, Vert J (eds) (2004) Kernel methods in computational biology. MIT Press, Cambridge, MA
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–370
Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119
Miller G, Lipman M (1973) Release of infectious Epstein-Barr virus by transformed marmoset leukocytes. Proc Natl Acad Sci 70:190–194
Kohlbacher O, Lenhof H (2000) BALL: rapid software prototyping in computational molecular biology. Biochemical Algorithms Library. Bioinformatics 16:815–824
Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28:374
Randic M, Lers N, Plavšić D et al (2005) Four-color map representation of DNA or RNA sequences and their numerical characterization. Chem Phys Lett 407:205–208
Randic M, Mehulic K, Vukicevic D et al (2009) Graphical representation of proteins as four-color maps and their numerical characterization. J Mol Graph Model. doi:10.1016/j.jmgm.2008.10.004
Aguero-Chapin G, Gonzalez-Diaz H, Molina R et al (2006) Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett 580:723–730
Aguero-Chapin G, de la Riva GA, Molina-Ruiz R et al (2011) Non-linear models based on simple topological indices to identify RNase III protein members. J Theor Biol 273:167–178
Estrada E (1997) Spectral moments of the edge-adjacency matrix of molecular graphs. 2. Molecules containing heteroatoms and QSAR applications. J Chem Inf Comput Sci 37:320–328
Estrada E (1995) Edge adjacency relationships and a novel topological index related to molecular volume. J Chem Inf Comput Sci 35:31–33
Cornell WD, Cieplak P, Bayly CI et al (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197
Statsoft (2008) STATISTICA 8.0 (data analysis software system for windows). Version 8.0 edn
Rivals I, Personnaz L (1999) On cross validation for model selection. Neural Comput 11:863–870
Zhou D, Bousquet O, Lal T et al (2004) Learning with local and global consistency. Adv Neural Inform Process Syst 16:321–328
Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel methods. MIT Press, Cambridge, MA, pp 169–184
Röttig M, Medema MH, Blin K et al (2011) NRPSpredictor2 – a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. doi:10.1093/nar/gkr323
Shen HB, Chou KC (2008) PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 373:386–388
Boekhorst J, Snel B (2007) Identification of homologs in insignificant blast hits by exploiting extrinsic gene properties. BMC Bioinformatics 8:356
Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712
Acknowledgments
The authors acknowledge the Portuguese Fundação para a Ciência e a Tecnologia (FCT) for the financial support to GACH (SFRH/BPD/92978/2013). This work was partially supported by the European Regional Development Fund (ERDF) through the COMPETE—Operational Competitiveness Programme and national funds through FCT under the projects PEst-C/MAR/LA0015/2013 and the projects PTDC/AAC-CLI/116122/2009 (FCOMP-01-0124-FEDER-014029) and PTDC/MAR/115199/2009 (FCOMP-01-0124-FEDER-015451). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright information
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Agüero-Chapin, G., Pérez-Machado, G., Sánchez-Rodríguez, A., Santos, M.M., Antunes, A. (2016). Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains. In: Evans, B. (eds) Nonribosomal Peptide and Polyketide Biosynthesis. Methods in Molecular Biology, vol 1401. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3375-4_16
DOI: https://doi.org/10.1007/978-1-4939-3375-4_16
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3373-0
Online ISBN: 978-1-4939-3375-4
eBook Packages: Springer Protocols