Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains

  • Protocol
Nonribosomal Peptide and Polyketide Biosynthesis

Abstract

Identifying adenylation domains (A-domains) and their substrate specificity can aid the detection of nonribosomal peptide synthetases (NRPS) at the genome/proteome level and allows the structure of oligopeptides with relevant biological activities to be inferred. However, this is a challenging task because of the high sequence diversity of A-domains (~10–40 % amino acid identity) and their selectivity for 50 different natural/unnatural amino acids. Together, these characteristics limit the usefulness of traditional sequence alignment methods, e.g., BLAST searches, for detecting A-domains and predicting their substrate specificity. In this chapter we describe two workflows based on alignment-free methods intended for the identification and substrate specificity prediction of A-domains.

To identify A-domains we introduce a graphical-numerical method, implemented in TI2BioP (Topological Indices to BioPolymers) version 2.0, which in a first step uses protein four-color maps to represent A-domains. In a second step, simple topological indices (TIs), called spectral moments, are derived from the graphical representations of known A-domains (positive set) and of unrelated but well-characterized sequences (negative set). The spectral moments are then used as input predictors for statistical classification techniques to build alignment-free models. Finally, the resulting models can be used to explore entire proteomes for unannotated A-domains. In addition, this graphical-numerical methodology works as a sequence-search method that can be combined in an ensemble with homology-based tools to explore the A-domain signature in depth and cope with the diversity of this class (Aguero-Chapin et al., PLoS One 8(7):e65926, 2013).
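
As a rough illustration of the numerical step, the sketch below computes spectral moments as traces of successive powers of a graph matrix, mu_k = tr(A^k). It does not reproduce the four-color map construction of TI2BioP: the graph here is a simple path over the residues with a class weight on the diagonal, and the four-class grouping, the integer weights, and the names `CLASS_WEIGHT`, `toy_sequence_graph`, and `spectral_moments` are all assumptions made for this example.

```python
from typing import List

Matrix = List[List[int]]

# Hypothetical four-class grouping of the 20 standard amino acids with
# arbitrary integer weights (assumptions for illustration, not TI2BioP's).
CLASS_WEIGHT = {}
for residues, weight in (("AVLIPFWMG", 1), ("STCYNQ", 2), ("DE", 3), ("KRH", 4)):
    for aa in residues:
        CLASS_WEIGHT[aa] = weight


def toy_sequence_graph(seq: str) -> Matrix:
    """Build a weighted matrix for a path graph over the residues.

    Stand-in for a four-color map: consecutive residues are connected,
    and each residue carries its class weight on the diagonal.
    """
    n = len(seq)
    adj = [[0] * n for _ in range(n)]
    for i, aa in enumerate(seq):
        adj[i][i] = CLASS_WEIGHT[aa]
        if i + 1 < n:
            adj[i][i + 1] = adj[i + 1][i] = 1
    return adj


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def spectral_moments(adj: Matrix, max_order: int) -> List[int]:
    """Return [tr(A^0), tr(A^1), ..., tr(A^max_order)]."""
    n = len(adj)
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    moments = []
    for _ in range(max_order + 1):
        moments.append(sum(power[i][i] for i in range(n)))
        power = mat_mul(power, adj)
    return moments


if __name__ == "__main__":
    # Each sequence becomes a fixed-length vector of moments that could be
    # passed to any statistical classifier as input predictors.
    print(spectral_moments(toy_sequence_graph("MKDEAVL"), 5))
```

Because every sequence maps to a vector of the same length regardless of how long it is, no alignment is needed before the classification step.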

The second workflow, for the prediction of A-domain substrate specificity, is based on alignment-free models constructed by transductive support vector machines (TSVMs) that incorporate information from uncharacterized A-domains. The construction of the models was implemented in NRPSpredictor. In a first step, it uses the physicochemical fingerprint of the 34 residues lining the active site of the phenylalanine-adenylation domain of gramicidin synthetase A [PDB ID 1AMU] to derive a feature vector. Homologous positions were extracted for A-domains with known and unknown substrate specificities and turned into feature vectors. At the same time, A-domains with known specificities towards similar substrates were clustered by the physicochemical properties of their amino acid (AA) substrates. In a second step, support vector machines (SVMs) were optimized on the feature vectors of the characterized A-domains in each of the resulting clusters. The SVMs were then extended to their transductive variant (TSVMs), which integrates a fraction of uncharacterized A-domains during training to predict unknown specificities. Finally, uncharacterized A-domains were scored by each of the constructed alignment-free models (TSVMs), one per substrate specificity resulting from the clustering. The model producing the largest score for an uncharacterized A-domain determines the substrate specificity assigned to it (Rausch et al., Nucleic Acids Res 33:5799–5808, 2005).
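
The encoding and final assignment steps can be sketched as follows. This is a minimal illustration, not the NRPSpredictor implementation: it encodes each active-site residue with only two descriptors (Kyte-Doolittle hydropathy and a simplified side-chain charge, chosen here as an assumption), and the per-cluster models are hypothetical linear scorers with hand-set weights rather than trained TSVMs. The names `encode_signature`, `predict_specificity`, `toy_models`, `cluster_1`, and `cluster_2` are invented for the example, and a 4-residue signature is used instead of 34 for brevity.

```python
from typing import Dict, List, Tuple

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charge near neutral pH (His treated as neutral);
# an assumption made for this sketch.
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}

LinearModel = Tuple[List[float], float]  # (weights, bias)


def encode_signature(signature: str) -> List[float]:
    """Turn the active-site residues into a physicochemical feature vector."""
    features: List[float] = []
    for aa in signature:
        features.append(HYDROPATHY[aa])
        features.append(CHARGE.get(aa, 0.0))
    return features


def score(model: LinearModel, features: List[float]) -> float:
    """Linear decision value w.x + b of one per-cluster model."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_specificity(models: Dict[str, LinearModel], signature: str) -> str:
    """Assign the cluster whose model yields the largest score."""
    features = encode_signature(signature)
    return max(models, key=lambda name: score(models[name], features))


def toy_models(length: int) -> Dict[str, LinearModel]:
    """Hand-set stand-ins for trained models (hypothetical weights)."""
    return {
        "cluster_1": ([0.1, 0.0] * length, 0.0),   # rewards hydropathy
        "cluster_2": ([0.0, -1.0] * length, 0.0),  # rewards negative charge
    }


if __name__ == "__main__":
    models = toy_models(4)
    for sig in ("LLVV", "DDEE"):
        print(sig, predict_specificity(models, sig))
```

In practice the weights would come from training on characterized A-domains, with uncharacterized ones included during the transductive step; only the "score with every model, keep the largest" logic is taken from the description above.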

References

  1. Jenke-Kodama H, Dittmann E (2009) Bioinformatic perspectives on NRPS/PKS megasynthases: advances and challenges. Nat Prod Rep. doi:10.1039/b810283j

  2. Ansari MZ, Yadav G, Gokhale RS et al (2004) NRPS-PKS: a knowledge-based resource for analysis of NRPS/PKS megasynthases. Nucleic Acids Res. doi:10.1093/nar/gkh359

  3. Randic M, Zupan J, Balaban AT et al (2011) Graphical representation of proteins. Chem Rev. doi:10.1021/cr800198j

  4. Gonzalez-Diaz H, Vilar S, Santana L et al (2007) Medicinal chemistry and bioinformatics – current trends in drugs discovery with networks topological indices. Curr Top Med Chem 7:1015–1029

  5. Estrada E, Uriarte E (2001) Recent advances on the role of topological indices in drug discovery research. Curr Med Chem 8:1573–1588

  6. Randic M (2004) Graphical representation of DNA as a 2-D map. Chem Phys Lett 386:468–471

  7. Randic M, Zupan J, Vikic-Topic D (2007) On representation of proteins by star-like graphs. J Mol Graph Model 26:290–305

  8. Randic M, Zupan J (2004) Highly compact 2D graphical representation of DNA sequences. SAR QSAR Environ Res 15:191–205

  9. Nandy A (1994) Recent investigations into global characteristics of long DNA sequences. Indian J Biochem Biophys 31:149–155

  10. Randic M, Zupan J (2001) On interpretation of well-known topological indices. J Chem Inf Comput Sci 41:550–560

  11. Gonzalez-Diaz H, Gonzalez-Diaz Y, Santana L et al (2008) Proteomics, networks and connectivity indices. Proteomics. doi:10.1002/pmic.200700638

  12. Aguero-Chapin G, Molina-Ruiz R, Maldonado E et al (2013) Exploring the adenylation domain repertoire of nonribosomal peptide synthetases using an ensemble of sequence-search methods. PLoS One 8(7):e65926. doi:10.1371/journal.pone.0065926

  13. Aguero-Chapin G, Varona-Santos J, de la Riva GA et al (2009) Alignment-free prediction of polygalacturonases with pseudofolding topological indices: experimental isolation from Coffea arabica and prediction of a new sequence. J Proteome Res 8:2122–2128

  14. Aguero-Chapin G, Sánchez-Rodríguez A, Hidalgo-Yanes PI et al (2011) An alignment-free approach for eukaryotic ITS2 annotation and phylogenetic inference. PLoS One 6:e26638

  15. Cruz-Monteagudo M, Gonzalez-Diaz H, Borges F et al (2008) 3D-MEDNEs: an alternative “in silico” technique for chemical research in toxicology. 2. Quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy. Chem Res Toxicol. doi:10.1021/tx700296t

  16. Estrada E (1996) Spectral moments of the edge adjacency matrix in molecular graphs. 1. Definition and applications to the prediction of physical properties of alkanes. J Chem Inf Comput Sci 36:844–849

  17. von Döhren H, Dieckmann R, Pavela-Vrancic M (1999) The nonribosomal code. Chem Biol 6:273–279

  18. Stachelhaus T, Mootz HD, Marahiel MA (1999) The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases. Chem Biol. doi:10.1016/s1074-5521(99)80082-9

  19. Conti E, Franks NP, Brick P (1996) Crystal structure of firefly luciferase throws light on a superfamily of adenylate-forming enzymes. Structure 4:287–298

  20. Conti E, Stachelhaus T, Marahiel MA, Brick P (1997) Structural basis for the activation of phenylalanine in the non-ribosomal biosynthesis of gramicidin S. EMBO J 16(14):4174–4183

  21. Challis GL, Ravel J, Townsend CA (2000) Predictive, structure-based model of amino acid recognition by nonribosomal peptide synthetase adenylation domains. Chem Biol 7:211–224

  22. Rausch C, Weber T, Kohlbacher O et al (2005) Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res 33:5799–5808

  23. Bastanlar Y, Ozuysal M (2014) Introduction to machine learning. Methods Mol Biol. doi:10.1007/978-1-62703-748-8_7

  24. Schölkopf B, Tsuda K, Vert J (eds) (2004) Kernel methods in computational biology. MIT Press, Cambridge, MA

  25. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O’Donovan C, Phan I (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31:365–370

  26. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115–D119

  27. Miller G, Lipman M (1973) Release of infectious Epstein-Barr virus by transformed marmoset leukocytes. Proc Natl Acad Sci 70:190–194

  28. Kohlbacher O, Lenhof H (2000) BALL-rapid software prototyping in computational molecular biology. Biochemical Algorithms Library. Bioinformatics 16:815–824

  29. Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28:374

  30. Randic M, Lers N, Plavšić D et al (2005) Four-color map representation of DNA or RNA sequences and their numerical characterization. Chem Phys Lett 407:205–208

  31. Randic M, Mehulic K, Vukicevic D et al (2009) Graphical representation of proteins as four-color maps and their numerical characterization. J Mol Graph Model. doi:10.1016/j.jmgm.2008.10.004

  32. Aguero-Chapin G, Gonzalez-Diaz H, Molina R et al (2006) Novel 2D maps and coupling numbers for protein sequences. The first QSAR study of polygalacturonases; isolation and prediction of a novel sequence from Psidium guajava L. FEBS Lett 580:723–730

  33. Aguero-Chapin G, de la Riva GA, Molina-Ruiz R et al (2011) Non-linear models based on simple topological indices to identify RNase III protein members. J Theor Biol 273:167–178

  34. Estrada E (1997) Spectral moments of the edge-adjacency matrix of molecular graphs. 2. Molecules containing heteroatoms and QSAR applications. J Chem Inf Comput Sci 37:320–328

  35. Estrada E (1995) Edge adjacency relationships and a novel topological index related to molecular volume. J Chem Inf Comput Sci 35:31–33

  36. Cornell WD, Cieplak P, Bayly CI et al (1995) A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J Am Chem Soc 117:5179–5197

  37. Statsoft (2008) STATISTICA 8.0 (data analysis software system for windows). Version 8.0 edn

  38. Rivals I, Personnaz L (1999) On cross validation for model selection. Neural Comput 11:863–870

  39. Zhou D, Bousquet O, Lal T et al (2004) Learning with local and global consistency. Adv Neural Inform Process Syst 16:321–328

  40. Joachims T (1999) Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A (eds) Advances in Kernel methods. MIT Press, Cambridge, MA, pp 169–184

  41. Rottig M, Medema MH, Blin K et al (2011) NRPSpredictor2 – a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. doi:10.1093/nar/gkr323

  42. Shen HB, Chou KC (2008) PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 373:386–388

  43. Boekhorst J, Snel B (2007) Identification of homologs in insignificant blast hits by exploiting extrinsic gene properties. BMC Bioinformatics 8:356

  44. Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712

Acknowledgments

The authors acknowledge the Portuguese Fundação para a Ciência e a Tecnologia (FCT) for financial support to GACH (SFRH/BPD/92978/2013). This work was partially supported by the European Regional Development Fund (ERDF) through the COMPETE Operational Competitiveness Programme and by national funds through FCT under the project PEst-C/MAR/LA0015/2013 and the projects PTDC/AAC-CLI/116122/2009 (FCOMP-01-0124-FEDER-014029) and PTDC/MAR/115199/2009 (FCOMP-01-0124-FEDER-015451). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Corresponding author

Correspondence to Agostinho Antunes.

Copyright information

© 2016 Springer Science+Business Media New York

About this protocol

Cite this protocol

Agüero-Chapin, G., Pérez-Machado, G., Sánchez-Rodríguez, A., Santos, M.M., Antunes, A. (2016). Alignment-Free Methods for the Detection and Specificity Prediction of Adenylation Domains. In: Evans, B. (ed) Nonribosomal Peptide and Polyketide Biosynthesis. Methods in Molecular Biology, vol 1401. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3375-4_16

  • DOI: https://doi.org/10.1007/978-1-4939-3375-4_16

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-3373-0

  • Online ISBN: 978-1-4939-3375-4

  • eBook Packages: Springer Protocols