Choosing the Best Gene Predictions with GeneValidator

Part of the Methods in Molecular Biology book series (MIMB, volume 1962)


GeneValidator is a tool for determining whether the characteristics of newly predicted protein-coding genes are consistent with those of similar sequences in public databases. For this, it runs up to seven comparisons per gene. Results are shown in an HTML report containing summary statistics and graphical visualizations that aim to be useful for curators. Results are also presented in CSV and JSON formats for automated follow-up analysis.

Here, we describe common usage scenarios of GeneValidator that use the JSON output results together with standard UNIX tools. We demonstrate how GeneValidator’s textual output can be used to filter and subset large gene sets effectively. First, we explain how low-scoring gene models can be identified and extracted for manual curation—for example, as input for genome browsers or gene annotation tools. Second, we show how GeneValidator’s HTML report can be regenerated from a filtered subset of GeneValidator’s JSON output. Subsequently, we demonstrate how GeneValidator’s GUI can be used to complement manual curation efforts. Additionally, we explain how GeneValidator can be used to merge information from multiple annotations by automatically selecting the higher-scoring gene model at each common gene locus. Finally, we show how GeneValidator analyses can be optimized when using large BLAST databases.

Key words

Genome annotation Gene prediction Gene validation GeneValidator 



This work was supported by the Natural Environment Research Council [grant NE/L00626X/1] and the Biotechnology and Biological Sciences Research Council [grant BB/K004204/1 and BB/M009513/1]. This research used Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT (


  1. 1.
    Yandell M, Ence D (2012) A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 13:329–342CrossRefGoogle Scholar
  2. 2.
    Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD et al (2018) GenBank. Nucleic Acids Res 46:D41–D47CrossRefGoogle Scholar
  3. 3.
    Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491CrossRefGoogle Scholar
  4. 4.
    Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32:767–769CrossRefGoogle Scholar
  5. 5.
    Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 19:189CrossRefGoogle Scholar
  6. 6.
    Schnoes AM, Brown SD, Dodevski I, Babbitt PC (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 5:e1000605CrossRefGoogle Scholar
  7. 7.
    Steijger T, Abril JF, Engström PG, Kokocinski F, RGASP Consortium, Hubbard TJ et al (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10:1177–1184CrossRefGoogle Scholar
  8. 8.
    Drăgan M-A, Moghul I, Priyam A, Bustos C, Wurm Y (2016) GeneValidator: identify problems with protein-coding gene predictions. Bioinformatics 32(10):1559–1561CrossRefGoogle Scholar
  9. 9.
    The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169CrossRefGoogle Scholar
  10. 10.
    Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, The UniProt Consortium (2015) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31:926–932CrossRefGoogle Scholar
  11. 11.
    Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G et al (2016) JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 17:66CrossRefGoogle Scholar
  12. 12.
    Lee E, Helt GA, Reese JT, Munoz-Torres MC, Childers CP, Buels RM et al (2013) Web Apollo: a web-based genomic annotation editing platform. Genome Biol 14:R93CrossRefGoogle Scholar
  13. 13.
    Priyam A, Woodcroft BJ, Rai V, Munagala A, Moghul I, Ter F et al (2015) Sequenceserver: a modern graphical user interface for custom BLAST databases. bioRxiv.
  14. 14.
    Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M et al (2015) Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol 16:549CrossRefGoogle Scholar
  15. 15.
    Bethesda (MD): National Center for Biotechnology Information (2008) BLAST® Command Line Applications User Manual [Internet] - Limiting a Search with a List of Identifiers. Accessed 13 Sept 2018
  16. 16.
    Wurm Y, Wang J, Riba-Grognuz O, Corona M, Nygaard S, Hunt BG et al (2011) The genome of the fire ant Solenopsis invicta. Proc Natl Acad Sci U S A 108(14):5679–5684CrossRefGoogle Scholar
  17. 17.
    Shen W, Xiong J (2019) TaxonKit: a cross-platform and efficient NCBI taxonomy toolkit. bioRxiv.
  18. 18.
    Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.UCL Cancer InstituteUniversity College LondonLondonUK
  2. 2.School of Biological and Chemical SciencesQueen Mary University of LondonLondonUK

Personalised recommendations