Skip to main content
Log in

Predicting substrate specificity of adenylation domains of nonribosomal peptide synthetases and other protein properties by latent semantic indexing

  • Original Article
  • Published:
Journal of Industrial Microbiology & Biotechnology

Abstract

Successful genome mining is dependent on accurate prediction of protein function from sequence. This often involves dividing protein families into functional subtypes (e.g., with different substrates). In many cases, there are only a small number of known functional subtypes, but in the case of the adenylation domains of nonribosomal peptide synthetases (NRPS), there are >500 known substrates. Latent semantic indexing (LSI) was originally developed for text processing but has also been used to assign proteins to families. Proteins are treated as ‘‘documents’’ and it is necessary to encode properties of the amino acid sequence as ‘‘terms’’ in order to construct a term-document matrix, which counts the terms in each document. This matrix is then processed to produce a document-concept matrix, where each protein is represented as a row vector. A standard measure of the closeness of vectors to each other (cosines of the angle between them) provides a measure of protein similarity. Previous work encoded proteins as oligopeptide terms, i.e. counted oligopeptides, but used no information regarding location of oligopeptides in the proteins. A novel tokenization method was developed to analyze information from multiple alignments. LSI successfully distinguished between two functional subtypes in five well-characterized families. Visualization of different ‘‘concept’’ dimensions allows exploration of the structure of protein families. LSI was also used to predict the amino acid substrate of adenylation domains of NRPS. Better results were obtained when selected residues from multiple alignments were used rather than the total sequence of the adenylation domains. Using ten residues from the substrate binding pocket performed better than using 34 residues within 8 Å of the active site. Prediction efficiency was somewhat better than that of the best published method using a support vector machine.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1

Similar content being viewed by others

References

  1. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL et al (2009) BLAST + : architecture and applications. BMC Bioinf 10:421

    Article  Google Scholar 

  2. Challis GL, Ravel J, Townsend CA (2000) Predictive, structure-based model of amino acid recognition by nonribosomal peptide synthetase adenylation domains. Chem Biol 7:211–224

    Article  CAS  PubMed  Google Scholar 

  3. Couto BR, Ladeira AP, Santos MA (2007) Application of latent semantic indexing to evaluate the similarity of sets of sequences without multiple alignments character-by-character. Genet Mol Res 6:983–999

    CAS  PubMed  Google Scholar 

  4. Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R et al (1990) Indexing by latent semantic analysis. J Am Soc Inform Sci 41:391–407

    Article  Google Scholar 

  5. Diminic J, Zucko J, Trninic Ruzic I, Gacesa R, Hranueli D, Long PF, Cullum J, Starcevic A et al (2013) Databases of the Thiotemplate Modular Systems (CSDB) and their in silico recombinants (r-CSDB). J Ind Microbiol Biotechnol 40:653–659

    Article  CAS  PubMed  Google Scholar 

  6. Eddy SR (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 4:e1000069

    Article  PubMed Central  PubMed  Google Scholar 

  7. Goldstein P, Zucko J, Vujaklija D, Krisko A, Hranueli D, Long PF, Etchebest C, Basrak B, Cullum J et al (2009) Clustering of protein domains for functional and evolutionary studies. BMC Bioinformatics 10:335

    Article  PubMed Central  PubMed  Google Scholar 

  8. Hannenhalli SS, Russell RB (2000) Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 303:61–76

    Article  CAS  PubMed  Google Scholar 

  9. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG et al (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23:2947–2948

    Article  CAS  PubMed  Google Scholar 

  10. MATLAB®. http://www.mathworks.com/products/matlab/

  11. National Center for Biotechnology Information (NCBI GenBank database). http://www.ncbi.nlm.nih.gov/

  12. Rausch C, Weber T, Kohlbacher O, Wohlleben W, Huson DH et al (2005) Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res 33:5799–5808

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  13. Röttig M, Medema MH, Blin K, Weber T, Rausch C, Kohlbacher O (2011) NRPSpredictor2–a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res (Web Server issue) 39:W362–W367

    Article  Google Scholar 

  14. Stachelhaus T, Mootz HD, Marahiel MA (1999) The specificity-conferring code of adenylation domains in nonribosomal peptide synthetases. Chem Biol 6:493–505

    Article  CAS  PubMed  Google Scholar 

  15. Starcevic A, Zucko J, Simunkovic J, Long PF, Cullum J, Hranueli D (2008) ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res 36:6882–6892

    Article  CAS  PubMed Central  PubMed  Google Scholar 

  16. Strieker M, Tanovic A, Marahiel MA (2010) Nonribosomal peptide synthetases: structures and dynamics. Curr Opin Struct Biol 20:234–240

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgments

This work was supported by the grant 09/5 (to DH) from the Croatian Science Foundation, Republic of Croatia, and by a cooperation grant of the German Academic Exchange Service (DAAD) and the Ministry of Science, Education and Sports, Republic of Croatia (to DH and JC). It was also supported by King’s College London (to PFL).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Antonio Starcevic.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Baranašić, D., Zucko, J., Diminic, J. et al. Predicting substrate specificity of adenylation domains of nonribosomal peptide synthetases and other protein properties by latent semantic indexing. J Ind Microbiol Biotechnol 41, 461–467 (2014). https://doi.org/10.1007/s10295-013-1322-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10295-013-1322-2

Keywords

Navigation