The Little Known Universe of Short Proteins in Insects: A Machine Learning Approach

Short Views on Insect Genomics and Proteomics

Part of the book series: Entomology in Focus (ENFO, volume 3)

Abstract

Modern genomics and proteomics technologies produce immense quantities of protein sequences. The only feasible way to assign functions to this flood of sequences is to apply state-of-the-art computational methods for automated functional annotation. We illustrate the value of machine learning tools for identifying and annotating short bioactive proteins and peptides in insect genomes. Over 500,000 full-length insect proteins are currently archived in databases, of which ~15% are short proteins, and most of these short sequences remain uncharacterized. We developed a platform to systematically identify short toxin-like peptides in metazoa and their functional class. We present data from eight representative genomes (140,000 proteins) that cover the main phylogenetic branches of Hexapoda. The platform, a trained machine learning predictor, identified ~800 toxin-like candidates, 250 of them with high confidence. The functions of these proteins include ion channel inhibition, protease inhibition, antimicrobial activity, and roles in the innate immune system. Our systematic approach can be extended to new genomes and to other biological classes of proteins. Using similar methodologies, we illustrate the successful identification of overlooked neuropeptide precursors. The systematic discovery of insect neuropeptides and short toxin-like proteins enables the development of new strategies for pest control and for manipulating insect behavior. We discuss these overlooked short secreted peptides with respect to their evolution and potential applications in biotechnology.
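As a rough illustration of the first step such a screen might take, the sketch below filters a FASTA-formatted proteome for short sequences and computes two simple per-sequence features. It is not the authors' platform. The 100-residue cutoff, the function names, and the choice of cysteine fraction as a feature are assumptions made for this example (many animal toxins are cysteine-rich, but the abstract does not state which features the predictor uses).

```python
from typing import Dict, Iterator, Tuple

# Hypothetical cutoff: the abstract does not define "short" numerically.
MAX_SHORT_LENGTH = 100


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def short_protein_features(
    text: str, max_length: int = MAX_SHORT_LENGTH
) -> Dict[str, Dict[str, float]]:
    """Keep sequences of at most max_length residues and compute toy features."""
    features = {}
    for header, seq in parse_fasta(text):
        if 0 < len(seq) <= max_length:
            features[header] = {
                "length": len(seq),
                # Illustrative feature only, not taken from the chapter.
                "cys_fraction": seq.count("C") / len(seq),
            }
    return features


# Toy input: one short cysteine-rich sequence and one long sequence.
EXAMPLE = ">toy_short_cys_rich\nMKCLLCACGCKCSCC\n>toy_long\n" + "A" * 150

if __name__ == "__main__":
    for name, feats in short_protein_features(EXAMPLE).items():
        print(name, feats)
```

In a real pipeline, feature vectors of this kind would be passed to a trained classifier; the filter here only narrows the proteome to candidates worth scoring.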

Abbreviations

ANN: Artificial neural network
AUC: Area under the ROC curve
CAFA: Critical Assessment of Functional Annotation
ClanTox: Classifier of animal toxins
CNS: Central nervous system
CV: Cross validation
ETH: Ecdysis-triggering hormone
HMM: Hidden Markov model
ICI: Ion channel inhibitor
MFS: Major facilitator superfamily
ML: Machine learning
MS: Mass spectrometry
nAChR: Nicotinic acetylcholine receptor
NGF: Nerve growth factor
NP: Neuropeptide
NPP: Neuropeptide precursor
OCLP: ω-Conotoxin-like protein
PSSM: Position-specific scoring matrix
SP: Signal peptide
SVM: Support vector machine
TIL: Trypsin inhibitor like
TOLIPs: Toxin-like proteins
TLS: Toxin-like stability

Acknowledgments

We thank Noam Kaplan for his seminal contribution to the work. Many of the ideas and formulation of the problem resulted from many stimulating discussions. N.K.’s beautiful thesis was used as a starting point for this chapter. The development and maintenance of ClanTox, NeuroPID, and the protein database have been a long-term effort of many people over the last 10 years. We thank Solange Karsenty for her contribution to the development of ClanTox and NeuroPID platforms and the system group in the School of Computer Science and Engineering for long-term support of our web servers.

Author information

Corresponding author

Correspondence to Michal Linial.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Ofer, D., Rappoport, N., Linial, M. (2015). The Little Known Universe of Short Proteins in Insects: A Machine Learning Approach. In: Raman, C., Goldsmith, M., Agunbiade, T. (eds) Short Views on Insect Genomics and Proteomics. Entomology in Focus, vol 3. Springer, Cham. https://doi.org/10.1007/978-3-319-24235-4_8
