Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains

Agüero-Chapin, Guillermin; Pérez-Machado, Gisselle; Sánchez-Rodríguez, Aminael; Santos, Miguel Machado; Antunes, Agostinho

doi:10.1007/978-1-4939-3375-4_16

Part of the book series: Methods in Molecular Biology (MIMB, volume 1401)

Abstract

Identifying adenylation domains (A-domains) and their substrate specificity can aid the detection of nonribosomal peptide synthetases (NRPS) at the genome/proteome level and allow the structure of oligopeptides with relevant biological activities to be inferred. However, this is a challenging task because of the high sequence diversity of A-domains (~10–40% amino acid identity) and their selectivity for 50 different natural/unnatural amino acids. Together, these characteristics make their detection and the prediction of their substrate specificity difficult with traditional sequence alignment methods, e.g., BLAST searches. In this chapter we describe two workflows based on alignment-free methods intended for the identification and substrate specificity prediction of A-domains.

To identify A-domains we introduce a graphical-numerical method, implemented in TI2BioP version 2.0 (topological indices to biopolymers), which in a first step uses protein four-color maps to represent A-domains. In a second step, simple topological indices (TIs), called spectral moments, are derived from the graphical representations of known A-domains (positive dataset) and of unrelated but well-characterized sequences (negative dataset). Spectral moments are then used as input predictors for statistical classification techniques to build alignment-free models. Finally, the resulting alignment-free models can be used to explore entire proteomes for unannotated A-domains. In addition, this graphical-numerical methodology works as a sequence-search method that can be combined in an ensemble with homology-based tools to explore the A-domain signature in depth and cope with the diversity of this class (Aguero-Chapin et al., PLoS One 8(7):e65926, 2013).
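The idea of a spectral moment can be illustrated with a minimal sketch. It assumes the common definition of the k-th spectral moment as the trace of the k-th power of a graph's adjacency matrix, and it uses a toy three-node path graph rather than a real four-color map. The helper names (`mat_mul`, `spectral_moments`) are hypothetical, and the matrix TI2BioP actually builds from a protein map may differ from the plain node adjacency matrix used here.

```python
from typing import List

Matrix = List[List[int]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices of the same size."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def spectral_moments(adj: Matrix, max_order: int) -> List[int]:
    """Return [tr(A^1), ..., tr(A^max_order)] for adjacency matrix A."""
    moments = []
    power = adj
    for _ in range(max_order):
        # The trace is the sum of the diagonal entries of the current power.
        moments.append(sum(power[i][i] for i in range(len(adj))))
        power = mat_mul(power, adj)
    return moments


# Toy graph (assumption for illustration): a path 0 - 1 - 2.
path3: Matrix = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

moments = spectral_moments(path3, 4)
print(moments)
```

In a classification setting, a vector of such moments per sequence would play the role of the input predictors mentioned above.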

The second workflow, for the prediction of A-domain substrate specificity, is based on alignment-free models constructed by transductive support vector machines (TSVMs) that incorporate information from uncharacterized A-domains. The construction of the models was implemented in NRPSpredictor. In a first step, the physicochemical fingerprint of the 34 residues lining the active site of the phenylalanine-adenylation domain of gramicidin synthetase A [PDB ID 1AMU] was used to derive a feature vector. Homologous positions were extracted for A-domains with known and unknown substrate specificities and turned into feature vectors. At the same time, A-domains with known specificities towards similar substrates were clustered by the physicochemical properties of their amino acid (AA) substrates. In a second step, support vector machines (SVMs) were optimized from the feature vectors of characterized A-domains in each of the resulting clusters. The SVMs were then used in their transductive variant (TSVMs), which integrates a fraction of uncharacterized A-domains during training, to predict unknown specificities. Finally, uncharacterized A-domains were scored by each of the constructed alignment-free models (TSVMs), one per substrate specificity resulting from the clustering. The model producing the largest score for an uncharacterized A-domain assigns its substrate specificity (Rausch et al., Nucleic Acids Res 33:5799–5808, 2005).
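The final assignment rule (the model with the largest score wins) can be sketched as follows. This is only an illustration under stated assumptions: the models are hand-written linear decision functions rather than trained TSVMs, the two-dimensional feature vector stands in for the physicochemical fingerprint of the 34 active-site positions, and the cluster labels, weights, and function names are all hypothetical. It does not reproduce NRPSpredictor's code or parameters.

```python
from typing import Dict, List, Tuple

# Hypothetical per-cluster linear models: label -> (weights, bias).
# Real models would be trained TSVMs, one per substrate cluster.
MODELS: Dict[str, Tuple[List[float], float]] = {
    "cluster_A": ([1.0, -0.5], 0.0),
    "cluster_B": ([-0.2, 0.8], 0.1),
    "cluster_C": ([-1.0, -1.0], 0.5),
}


def decision_score(weights: List[float], bias: float, features: List[float]) -> float:
    """Linear decision value w . x + b for one model."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_specificity(
    features: List[float], models: Dict[str, Tuple[List[float], float]]
) -> Tuple[str, Dict[str, float]]:
    """Score the feature vector with every model and return the top label."""
    scores = {
        label: decision_score(weights, bias, features)
        for label, (weights, bias) in models.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Made-up feature vector for an uncharacterized A-domain.
label, scores = predict_specificity([0.2, 1.5], MODELS)
print(label, scores)
```

Keeping the full score dictionary alongside the winning label makes it possible to see how close the runner-up models were, which is useful when a prediction is borderline.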

References

Jenke-Kodama H, Dittmann E (2009) Bioinformatic perspectives on NRPS/PKS megasynthases: advances and challenges. Nat Prod Rep. doi:10.1039/b810283j
Ansari MZ, Yadav G, Gokhale RS et al (2004) NRPS-PKS: a knowledge-based resource for analysis of NRPS/PKS megasynthases. Nucleic Acids Res. doi:10.1093/nar/gkh359
Randic M, Zupan J, Balaban AT et al (2011) Graphical representation of proteins. Chem Rev. doi:10.1021/cr800198j
Gonzalez-Diaz H, Vilar S, Santana L et al (2007) Medicinal chemistry and bioinformatics – current trends in drugs discovery with networks topological indices. Curr Top Med Chem 7:1015–1029
Estrada E, Uriarte E (2001) Recent advances on the role of topological indices in drug discovery research. Curr Med Chem 8:1573–1588
Randic M (2004) Graphical representation of DNA as a 2-D map. Chem Phys Lett 386:468–471
Randic M, Zupan J, Vikic-Topic D (2007) On representation of proteins by star-like graphs. J Mol Graph Model 26:290–305
Randic M, Zupan J (2004) Highly compact 2D graphical representation of DNA sequences. SAR QSAR Environ Res 15:191–205
Nandy A (1994) Recent investigations into global characteristics of long DNA sequences. Indian J Biochem Biophys 31:149–155
Randic M, Zupan J (2001) On interpretation of well-known topological indices. J Chem Inf Comput Sci 41:550–560
Gonzalez-Diaz H, Gonzalez-Diaz Y, Santana L et al (2008) Proteomics, networks and connectivity indices. Proteomics. doi:10.1002/pmic.200700638
Aguero-Chapin G, Molina-Ruiz R, Maldonado E et al (2013) Exploring the adenylation domain repertoire of nonribosomal peptide synthetases using an ensemble of sequence-search methods. PLoS One 8(7):e65926. doi:10.1371/journal.pone.0065926
Aguero-Chapin G, Varona-Santos J, de la Riva GA et al (2009) Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from Coffea arabica and prediction of a new sequence. J Proteome Res 8:2122–2128
Aguero-Chapin G, Sánchez-Rodríguez A, Hidalgo-Yanes PI et al (2011) An alignment-free approach for eukaryotic ITS2 annotation and phylogenetic inference. PLoS One 6:e26638
Cruz-Monteagudo M, Gonzalez-Diaz H, Borges F et al (2008) 3D-MEDNEs: an alternative “in silico” technique for chemical research in toxicology. 2. Quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy. Chem Res Toxicol. doi:10.1021/tx700296t
Estrada E (1996) Spectral moments of the edge adjacency matrix in molecular graphs. 1. Definition and applications to the prediction of physical properties of alkanes. J Chem Inf Comput Sci 36:844–849
von Döhren H, Dieckmann R, Pavela-Vrancic M (1999) The nonribosomal code. Chem Biol 6:273–279
Stachelhaus T, Mootz HD, Marahiel MA (1999) The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases. Chem Biol. doi:10.1016/s1074-5521(99)80082-9
Conti E, Franks NP, Brick P (1996) Crystal structure of firefly luciferase throws light on a superfamily of adenylate-forming enzymes. Structure 4:287–298
Conti E, Stachelhaus T, Marahiel MA, Brick P (1997) Structural basis for the activation of phenylalanine in the non-ribosomal biosynthesis of gramicidin S. EMBO J 16(14):4174–4183
Challis GL, Ravel J, Townsend CA (2000) Predictive, structure-based model of amino acid recognition by nonribosomal peptide synthetase adenylation domains. Chem Biol 7:211–224
Rausch C, Weber T, Kohlbacher O et al (2005) Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res 33:5799–5808
Bastanlar Y, Ozuysal M (2014) Introduction to machine learning. Methods Mol Biol. doi:10.1007/978-1-62703-748-8_7
Schölkopf B, Tsuda K, Vert J (eds) (2004) Kernel methods in computational biology. MIT Press, Cambridge, MA
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–370
Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119
Miller G, Lipman M (1973) Release of infectious Epstein-Barr virus by transformed marmoset leukocytes. Proc Natl Acad Sci 70:190–194
Kohlbacher O, Lenhof H (2000) BALL-rapid software prototyping in computational molecular biology. Biochemicals Algorithms Library. Bioinformatics 16:815–824
Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28:374
Randic M, Lers N, Plavšić D et al (2005) Four-color map representation of DNA or RNA sequences and their numerical characterization. Chem Phys Lett 407:205–208
Randic M, Mehulic K, Vukicevic D et al (2009) Graphical representation of proteins as four-color maps and their numerical characterization. J Mol Graph Model. doi:10.1016/j.jmgm.2008.10.004
Aguero-Chapin G, Gonzalez-Diaz H, Molina R et al (2006) Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett 580:723–730
Aguero-Chapin G, de la Riva GA, Molina-Ruiz R et al (2011) Non-linear models based on simple topological indices to identify RNase III protein members. J Theor Biol 273:167–178
Estrada E (1997) Spectral moments of the edge-adjacency matrix of molecular graphs. 2. Molecules containing heteroatoms and QSAR applications. J Chem Inf Comput Sci 37:320–328
Estrada E (1995) Edge adjacency relationships and a novel topological index related to molecular volume. J Chem Inf Comput Sci 35:31–33
Cornell WD, Cieplak P, Bayly CI et al (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197
StatSoft (2008) STATISTICA 8.0 (data analysis software system for Windows). Version 8.0 edn
Rivals I, Personnaz L (1999) On cross validation for model selection. Neural Comput 11:863–870
Zhou D, Bousquet O, Lal T et al (2004) Learning with local and global consistency. Adv Neural Inform Process Syst 16:321–328
Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel methods. MIT Press, Cambridge, MA, pp 169–184
Rottig M, Medema MH, Blin K et al (2011) NRPSpredictor2 – a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. doi:10.1093/nar/gkr323
Shen HB, Chou KC (2008) PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 373:386–388
Boekhorst J, Snel B (2007) Identification of homologs in insignificant blast hits by exploiting extrinsic gene properties. BMC Bioinformatics 8:356
Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712

Acknowledgments

The authors acknowledge the Portuguese Fundação para a Ciência e a Tecnologia (FCT) for financial support to GACH (SFRH/BPD/92978/2013). This work was partially supported by the European Regional Development Fund (ERDF) through the COMPETE – Operational Competitiveness Programme and by national funds through FCT under the project PEst-C/MAR/LA0015/2013 and the projects PTDC/AAC-CLI/116122/2009 (FCOMP-01-0124-FEDER-014029) and PTDC/MAR/115199/2009 (FCOMP-01-0124-FEDER-015451). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

CIMAR/CIIMAR, Centro Interdisciplinar de Investigação Marinha e Ambiental, Universidade do Porto, Rua dos Bragas, 177, Porto, 4050-123, Portugal
Guillermin Agüero-Chapin, Miguel Machado Santos & Agostinho Antunes
Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas (UCLV), Santa Clara, 54830, Cuba
Guillermin Agüero-Chapin & Gisselle Pérez-Machado
Departamento de Ciencias Naturales, Universidad Técnica Particular de Loja, San Cayetano Alto, S/N, Loja, Ecuador
Aminael Sánchez-Rodríguez
Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, Porto, 4169-007, Portugal
Miguel Machado Santos & Agostinho Antunes

Corresponding author

Correspondence to Agostinho Antunes.

Editor information

Editors and Affiliations

Donald Danforth Plant Science Center, Saint Louis, Missouri, USA
Bradley S. Evans

About this protocol

Cite this protocol

Agüero-Chapin, G., Pérez-Machado, G., Sánchez-Rodríguez, A., Santos, M.M., Antunes, A. (2016). Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains. In: Evans, B. (eds) Nonribosomal Peptide and Polyketide Biosynthesis. Methods in Molecular Biology, vol 1401. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3375-4_16

DOI: https://doi.org/10.1007/978-1-4939-3375-4_16
Published: 02 February 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3373-0
Online ISBN: 978-1-4939-3375-4
eBook Packages: Springer Protocols