Abstract
Modern genomics and proteomics technologies produce immense quantities of sequenced proteins. The only feasible way to assign functions to this flood of sequences is automated functional annotation by state-of-the-art computational methods. We illustrate the value of machine learning tools for identifying and annotating short bioactive proteins and peptides from insect genomes. Over 500,000 full-length insect proteins are currently archived in databases, of which ~15% are short proteins, and most of these short sequences remain uncharacterized. We developed a platform to systematically identify the functional class of short toxin-like peptides in metazoa. We present data from eight representative genomes (140,000 proteins) that cover the main phylogenetic branches of Hexapoda. The platform, a trained machine learning predictor, identified ~800 toxin-like candidates, 250 of them with high confidence. These proteins include ion channel inhibitors, protease inhibitors, antimicrobial peptides, and components of the innate immune system. Our systematic approach can be extended to new genomes and to other biological classes of proteins. Using similar methodologies, we illustrate the successful identification of overlooked neuropeptide precursors. The systematic discovery of insect neuropeptides and short toxin-like proteins enables the development of new strategies for pest control and for manipulating insect behavior. We discuss these overlooked secreted short peptides with respect to their evolution and potential applications in biotechnology.
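The first step implied by the abstract, selecting short proteins from a proteome before any classification, can be sketched as follows. This is a minimal illustration and not the platform described in the chapter: the 100-residue cutoff, the function names, the toy sequences, and the choice of cysteine fraction as a feature are all assumptions made here for the example (many short animal toxins are cysteine-rich, but the abstract does not state which features the predictor uses).

```python
from collections import Counter

# Assumed cutoff for a "short" protein; the abstract does not give the threshold.
MAX_SHORT_LENGTH = 100


def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def short_proteins(records, max_length=MAX_SHORT_LENGTH):
    """Keep only the sequences at or below the assumed length cutoff."""
    return {name: seq for name, seq in records.items() if len(seq) <= max_length}


def simple_features(seq):
    """Compute two illustrative features (hypothetical, not the chapter's feature set)."""
    counts = Counter(seq)
    return {
        "length": len(seq),
        "cys_fraction": counts["C"] / len(seq) if seq else 0.0,
    }


# Toy input: both sequences are invented for illustration only.
toy_fasta = ">toy_short\nMKCCGCKC\n>toy_long\n" + "A" * 150
records = parse_fasta(toy_fasta)
short = short_proteins(records)
features = {name: simple_features(seq) for name, seq in short.items()}
```

In a real pipeline, feature vectors of this kind would be passed to a trained classifier that outputs a confidence score per sequence; that step is omitted here because the abstract does not specify the model.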
Abbreviations
- ANN: Artificial neural network
- AUC: Area under ROC curve
- CAFA: Critical Assessment of Functional Annotation
- ClanTox: Classifier of animal toxins
- CNS: Central nervous system
- CV: Cross validation
- ETH: Ecdysis-triggering hormone
- HMM: Hidden Markov model
- ICI: Ion channel inhibitor
- MFS: Major facilitator superfamily
- ML: Machine learning
- MS: Mass spectrometry
- nAChR: Nicotinic acetylcholine receptors
- NGF: Nerve growth factor
- NP: Neuropeptide
- NPP: Neuropeptide precursor
- OCLP: ω-Conotoxin-like protein
- PSSM: Position-specific scoring matrix
- SP: Signal peptide
- SVM: Support vector machine
- TIL: Trypsin inhibitor like
- TOLIPs: Toxin-like proteins
- TLS: Toxin-like stability
Acknowledgments
We thank Noam Kaplan for his seminal contribution to the work. Many of the ideas and the formulation of the problem emerged from stimulating discussions. N.K.’s beautiful thesis was used as a starting point for this chapter. The development and maintenance of ClanTox, NeuroPID, and the protein database have been a long-term effort of many people over the last 10 years. We thank Solange Karsenty for her contribution to the development of the ClanTox and NeuroPID platforms, and the system group in the School of Computer Science and Engineering for long-term support of our web servers.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Ofer, D., Rappoport, N., Linial, M. (2015). The Little Known Universe of Short Proteins in Insects: A Machine Learning Approach. In: Raman, C., Goldsmith, M., Agunbiade, T. (eds) Short Views on Insect Genomics and Proteomics. Entomology in Focus, vol 3. Springer, Cham. https://doi.org/10.1007/978-3-319-24235-4_8
DOI: https://doi.org/10.1007/978-3-319-24235-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24233-0
Online ISBN: 978-3-319-24235-4
eBook Packages: Biomedical and Life Sciences (R0)