Computational Prediction of De Novo Emerged Protein-Coding Genes

  • Nikolaos Vakirlis
  • Aoife McLysaght
Part of the Methods in Molecular Biology book series (MIMB, volume 1851)


De novo genes, that is, protein-coding genes originating from previously noncoding sequence, have gone from being considered impossibly unlikely to being recognized as an important source of genetic novelty in eukaryotic genomes. De novo gene evolution is clearly a rare but consistent feature of eukaryotic genomes and has been detected in every genome studied. However, different studies often use different computational methods, and the numbers and identities of the detected genes vary greatly. Here we present a coherent protocol for the computational identification of de novo genes by comparative genomics. The method uses homology searches, identification of syntenic regions, and ancestral sequence reconstruction to produce high-confidence candidates with robust evidence of de novo emergence. It is designed to be easy to apply with a basic knowledge of bioinformatic tools, and scalable so that it can be applied to both large and small datasets.
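
As a rough illustration of the comparative logic summarized above, the following Python sketch checks which species in a set of aligned syntenic sequences carry an intact open reading frame (ORF). It is not the chapter's protocol: the function names, the toy sequences, and the simplified criterion (ATG start, terminal stop codon, no in-frame stop, length divisible by three) are all assumptions made for illustration, and the sketch omits the homology search, synteny detection, and ancestral sequence reconstruction steps entirely.

```python
from typing import Dict, List

# Standard stop codons (nuclear genetic code); an assumption for this toy example.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_intact_orf(seq: str, min_codons: int = 3) -> bool:
    """Return True if the ungapped sequence reads ATG ... stop with no earlier in-frame stop."""
    s = seq.replace("-", "").upper()  # drop alignment gaps
    if len(s) % 3 != 0 or len(s) < 3 * min_codons:
        return False  # frameshifted or too short
    codons = [s[i:i + 3] for i in range(0, len(s), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # No premature stop codon between the start and the terminal stop.
    return not any(c in STOP_CODONS for c in codons[1:-1])


def orf_status(aligned: Dict[str, str]) -> Dict[str, bool]:
    """Map each species name to whether its syntenic sequence holds an intact ORF."""
    return {species: has_intact_orf(seq) for species, seq in aligned.items()}


def consistent_with_de_novo(status: Dict[str, bool], focal: str, outgroups: List[str]) -> bool:
    """True if the ORF is intact in the focal species and in none of the outgroups."""
    return status[focal] and not any(status[o] for o in outgroups)


# Hypothetical aligned sequences from a syntenic region (invented for illustration).
example = {
    "focal":     "ATGAAACCCGGGTAA",  # intact ORF
    "outgroup1": "ATAAAACCCGGGTAA",  # no start codon
    "outgroup2": "ATGAAATAAGGGTAA",  # premature stop codon
    "outgroup3": "ATGAAACC-GGGTAA",  # one-base deletion, frameshift
}

status = orf_status(example)
print(status)
print(consistent_with_de_novo(status, "focal", ["outgroup1", "outgroup2", "outgroup3"]))
```

A pattern like this (ORF present in the focal species, absent from the orthologous region in outgroups) is only suggestive on its own; distinguishing de novo gain from gene loss in the outgroups requires additional evidence of the kind the abstract describes.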

Key words

De novo genes; Gene birth; New gene evolution; Novel genes; ORF formation; Protein-coding genes; Genome-wide detection; Genome evolution



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Nikolaos Vakirlis (1)
  • Aoife McLysaght (1)
  1. Department of Genetics, Trinity College Dublin, Smurfit Institute of Genetics, University of Dublin, Dublin, Ireland
