Abstract
To study unknown proteins on a large scale, a reference system was set up for the three better-studied eukaryotic kingdoms, built from 36 proteomes chosen to be as taxonomically diverse as possible. Proteins from 362 other eukaryotic proteomes with no known homologue in this set were then analyzed, focusing in particular on singletons, that is, on proteins that also have no known homologue in their own proteome. Consistently, for a given species, no more than 12% of the singletons thus found are known at the protein level, according to UniProt. In addition, since AlphaFold2 relies on the information found in the alignment of homologous sequences, its predictions of their three-dimensional structure are poor. In metazoan species, the number of singletons rarely exceeds 1000 for the species closest to the reference system (divergence times below 75 Myr). Interestingly, in viridiplantae and fungi, larger numbers of singletons are found for such species, as if the timescale on which singletons are added to proteomes were different in metazoa and in the other eukaryotic kingdoms. Confirming this phenomenon will, however, require further studies of proteomes closer to those of the reference system.
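The selection of singletons described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function name `find_singletons`, the identifiers, and the input layout are hypothetical, and the list of homologous pairs is assumed to have been produced beforehand by a sequence-similarity search whose criteria are not specified here.

```python
from collections import defaultdict


def find_singletons(proteome, hits, reference_ids):
    """Return the proteins of `proteome` that have no homologue in the
    reference set and no homologue (other than themselves) in their own
    proteome.

    proteome      -- set of protein identifiers for one species (hypothetical)
    hits          -- iterable of (query_id, subject_id) pairs already judged
                     homologous by some upstream search (assumption)
    reference_ids -- set of identifiers of all reference-system proteins
    """
    partners = defaultdict(set)
    for query, subject in hits:
        if query != subject:  # a self-hit is not evidence of a homologue
            partners[query].add(subject)
            partners[subject].add(query)  # homology is treated as symmetric

    singletons = set()
    for protein in proteome:
        found = partners.get(protein, set())
        if found & reference_ids:
            continue  # has a homologue in the reference system
        if found & proteome:
            continue  # has a homologue within its own proteome
        singletons.add(protein)
    return singletons


# Toy data: A matches a reference protein, B and C match each other,
# D only matches itself.
toy_proteome = {"A", "B", "C", "D"}
toy_reference = {"R1", "R2"}
toy_hits = [("A", "R1"), ("B", "C"), ("D", "D")]
toy_singletons = find_singletons(toy_proteome, toy_hits, toy_reference)
```

In this toy case only D is retained: A is excluded by its reference homologue, while B and C exclude each other because they are homologous within the same proteome.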
Notes
Only proteins with at least 50 amino-acid residues were considered.
On June 23rd, 2020.
Most of them are likely to be "housekeeping" proteins.
The second version of AlphaFold.
Standing for predicted local-distance difference test.
The proteome of Eimeria mitis is a low-value outlier, according to the Complete Proteome Detector. Note that it no longer belongs to the list of UniProt reference proteomes (as of February 2023).
A small percentage of proteins are also classified as being uncertain (fifth degree).
Since this work was performed, the size of the proteomes of Leptonychotes weddellii and Meleagris gallopavo has increased by a factor of two.
Dictyostelium discoideum is an amoeba.
As of April 2022 (version 1).
With only 6727 proteins, the proteome of Saccharomyces cerevisiae was not considered in the present study.
Below 30.
Being 76% identical to an inorganic triphosphatase of a Duganella bacterium, this protein is, however, likely to be a contaminant.
Acknowledgements
I thank Joseph Parello for inspiring discussions and the referees for their careful and constructive reading of the manuscript.
Additional information
Handling editor: Cara Weisman.
About this article
Cite this article
Sanejouand, YH. On the Unknown Proteins of Eukaryotic Proteomes. J Mol Evol 91, 492–501 (2023). https://doi.org/10.1007/s00239-023-10116-1
DOI: https://doi.org/10.1007/s00239-023-10116-1