On the Unknown Proteins of Eukaryotic Proteomes

  • Original Article
  • Journal of Molecular Evolution

Abstract

To study unknown proteins on a large scale, a reference system was set up for the three best-studied eukaryotic kingdoms, built from 36 proteomes chosen to be as taxonomically diverse as possible. Proteins from 362 other eukaryotic proteomes with no known homologue in this set were then analyzed, focusing in particular on singletons, that is, on those proteins that also have no known homologue in their own proteome. Consistently, for a given species, no more than 12% of the singletons thus found are known at the protein level, according to UniProt. In addition, since AlphaFold2 relies on the information found in the alignment of homologous sequences, its predictions for the three-dimensional structure of singletons are poor. In the case of metazoan species, the number of singletons rarely exceeds 1000 for the species closest to the reference system (divergence times below 75 Myr). Interestingly, in the cases of viridiplantae and fungi, larger numbers of singletons are found for such species, as if the timescale on which singletons are added to proteomes were different in metazoa and in the other eukaryotic kingdoms. To confirm this phenomenon, however, further studies of proteomes closer to those of the reference system are needed.
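
The two-step selection described in the abstract can be illustrated with a minimal Python sketch. It is not the author's pipeline: the abstract does not say which homology search tool or thresholds were used, so the inputs here (a table of protein lengths, a list of proteins with a hit in the reference system, and a list of within-proteome hit pairs) are hypothetical and assumed to come from some prior search. The 50-residue cutoff follows the first entry of the Notes.

```python
from typing import Dict, Iterable, Set, Tuple

# Shorter proteins were not considered, according to the Notes.
MIN_LENGTH = 50


def find_singletons(
    lengths: Dict[str, int],
    reference_hits: Iterable[str],
    within_proteome_hits: Iterable[Tuple[str, str]],
    min_length: int = MIN_LENGTH,
) -> Set[str]:
    """Return proteins with no homologue in the reference set or in their own proteome.

    All three inputs are hypothetical: they stand for the output of a
    homology search that is assumed to have been run beforehand.
    """
    # Keep only proteins that are long enough to be considered.
    candidates = {name for name, n in lengths.items() if n >= min_length}

    # Step 1: discard proteins with at least one homologue in the reference system.
    unknown = candidates - set(reference_hits)

    # Step 2: discard proteins with a homologue elsewhere in the same proteome.
    # Self-hits (query == subject) and hits involving discarded short
    # proteins are ignored.
    with_paralogue: Set[str] = set()
    for query, subject in within_proteome_hits:
        if query != subject and query in candidates and subject in candidates:
            with_paralogue.add(query)
            with_paralogue.add(subject)

    return unknown - with_paralogue


# Toy example with made-up identifiers.
lengths = {"P1": 120, "P2": 300, "P3": 45, "P4": 210, "P5": 80}
reference_hits = ["P2"]                         # P2 has a homologue in the reference set
within_proteome_hits = [("P4", "P5"), ("P1", "P1")]  # P4 and P5 are homologous; P1 only hits itself
singletons = find_singletons(lengths, reference_hits, within_proteome_hits)
print(sorted(singletons))
```

In this toy case P3 is too short, P2 is known through the reference system, and P4 and P5 are homologous to each other, so only P1 remains as a singleton.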

Notes

  1. Only proteins with at least 50 amino-acid residues were considered.

  2. On June 23rd, 2020.

  3. Most of them are likely to be "housekeeping" proteins.

  4. The second version of AlphaFold.

  5. Standing for predicted local-distance difference test.

  6. The proteome of Eimeria mitis is a low-value outlier, according to the Complete Proteome Detector. Note that it no longer belongs to the list of reference proteomes of UniProt (as of February 2023).

  7. A small percentage of proteins are also classified as being uncertain (fifth degree).

  8. Since this work was performed, the size of the proteomes of Leptonychotes weddellii and Meleagris gallopavo has increased by a factor of two.

  9. Dictyostelium discoideum is an amoeba.

  10. As of April 2022 (version 1).

  11. With only 6727 proteins, the proteome of Saccharomyces cerevisiae was not considered in the present study.

  12. Below 30.

  13. Being 76% identical to an inorganic triphosphatase of a Duganella bacterium, this protein is, however, likely to be a contaminant.

Acknowledgements

I thank Joseph Parello for inspiring discussions, and the referees for their careful and constructive reading of the manuscript.

Author information

Corresponding author

Correspondence to Yves-Henri Sanejouand.

Additional information

Handling editor: Cara Weisman.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sanejouand, YH. On the Unknown Proteins of Eukaryotic Proteomes. J Mol Evol 91, 492–501 (2023). https://doi.org/10.1007/s00239-023-10116-1

  • DOI: https://doi.org/10.1007/s00239-023-10116-1
