Advertisement

Whole-Genome Annotation with BRAKER

  • Katharina J. HoffEmail author
  • Alexandre Lomsadze
  • Mark BorodovskyEmail author
  • Mario Stanke
Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 1962)

Abstract

BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.

Key words

Protein-coding genes Gene prediction AUGUSTUS GeneMark-ES/ET RNA-Seq reads Protein mapping to genome Genome annotation pipeline BRAKER 

Notes

Acknowledgements

This work is supported in part by the US National Institutes of Health grant HG000783 to MB, by the German Research Foundation grant 1009/12-1 to MS and by the US National Institutes of Health grant GM128145 to MB and MS.

References

  1. 1.
    Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2015) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32(5):767–769CrossRefGoogle Scholar
  2. 2.
    Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 33(20):6494–6506CrossRefGoogle Scholar
  3. 3.
    Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 18:1979–1990. https://doi.org/10.1101/gr.081612.108 CrossRefGoogle Scholar
  4. 4.
    Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42(15):e119CrossRefGoogle Scholar
  5. 5.
    Stanke M, Schöffmann O, Dahms St, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinf 7:62CrossRefGoogle Scholar
  6. 6.
    Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B (2006) AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 3(34):W435–W439CrossRefGoogle Scholar
  7. 7.
    Stanke M, Steinkamp R, Waack S, Morgenstern B (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 32:W309–W312CrossRefGoogle Scholar
  8. 8.
    Hoff KJ, Stanke M (2013) WebAUGUSTUS – a web service for training AUGUSTUS and predicting genes in eukaryotes. Nucleic Acids Res 41(W1):W123–W128CrossRefGoogle Scholar
  9. 9.
    König S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple genomes. Bioinformatics 32(22):3388–3395PubMedPubMedCentralGoogle Scholar
  10. 10.
    Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24(5):637–644CrossRefGoogle Scholar
  11. 11.
    Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18(1):188–196CrossRefGoogle Scholar
  12. 12.
    Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf 12(1):491CrossRefGoogle Scholar
  13. 13.
    Abbott A (2005) Competition boosts bid to find human genes. Nature 435:134CrossRefGoogle Scholar
  14. 14.
    Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(1):S2CrossRefGoogle Scholar
  15. 15.
    Stanke M, Tzvetkova A, Morgenstern B (2006) AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol 7(1):S11CrossRefGoogle Scholar
  16. 16.
    Coghlan A, Fiedler T, McKay S, Flicek P, Harris T, Blasiar D, the nGASP Consortium, Stein L (2008) nGASP - the nematode genome annotation assessment project. BMC Bioinf 9(1):549CrossRefGoogle Scholar
  17. 17.
    Steijger T, Abril JF, Engstrom PG, Kokocinski F, Akerman M, Alioto T, Ambrosini G, Antonarakis SE, Behr J, Bohnert R, Bucher P, Cloonan N, Derrien T, Djebali S, Du J, Dudoit S, Gerstein M, Gingeras TR, Gonzalez D, Grimmond SM, Habegger L, Iseli C, Jean G, Kahles A, Lagarde J, Leng J, Lefebvre G, Lewis S, Mortazavi A, Niermann P, Rätsch G, Reymond A, Ribeca P, Richard H, Rougemont J, Rozowsky J, Sammeth M, Sboner A, Schulz MH, Searle SMJ, Solorzano ND, Solovyev V, Stanke M, Steijger T, Stevenson BJ, Stockinger H, Valsesia A, Weese D, White S, Wold BJ, Wu J, Wu TD, Zeller G, Zerbino D, Zhang MQ, Hubbard TJ, Guigo R, Harrow J, Bertone P (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10(12):1177–1184CrossRefGoogle Scholar
  18. 18.
    Keller O, Odronitz F, Stanke M, Kollmar M, Waack S (2008) Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinf 9(1):278CrossRefGoogle Scholar
  19. 19.
    Gremme G (2013) Computational gene structure prediction. PhD thesis, Universität HamburgGoogle Scholar
  20. 20.
    Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31(19):5654–5666CrossRefGoogle Scholar
  21. 21.
    Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) A basic local alignment search tool. J Mol Biol 215(3):403–410CrossRefGoogle Scholar
  22. 22.
    Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinf 10(1):421CrossRefGoogle Scholar
  23. 23.
    Barnett DW, Garrison EK, Quinlan AR, Strömberg MP, Marth GT (2011) BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27(12):1691–1692CrossRefGoogle Scholar
  24. 24.
    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079CrossRefGoogle Scholar
  25. 25.
    Chen N (2004) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinf 5(1):4.10. 1–4.10. 14Google Scholar
  26. 26.
    Price AL, Jones NC, Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1):i351–i358CrossRefGoogle Scholar
  27. 27.
    Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21CrossRefGoogle Scholar
  28. 28.
    Daehwan K, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36CrossRefGoogle Scholar
  29. 29.
    Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(6):873–881CrossRefGoogle Scholar
  30. 30.
    Kapustin Y, Souvorov A, Tatusova T, Lipman D (2008) Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 3(1):20CrossRefGoogle Scholar
  31. 31.
    Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, et al (2011) eggNOG v3. 0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res 40(D1):D284–D289CrossRefGoogle Scholar
  32. 32.
    Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV (2012) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res 41(D1):D358–D365CrossRefGoogle Scholar
  33. 33.
    Slater GSC, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinf 6(1):31CrossRefGoogle Scholar
  34. 34.
    Gotoh O (2008) Direct mapping and alignment of protein sequences onto genomic sequence. Bioinformatics 24(21):2438–2444CrossRefGoogle Scholar
  35. 35.
    Gotoh O (2008) A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res 36(8):2630–2638CrossRefGoogle Scholar
  36. 36.
    Iwata H, Gotoh O (2012) Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res 40(20):e161CrossRefGoogle Scholar
  37. 37.
    Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F (2016) Using intron position conservation for homology-based gene prediction. Nucleic Acids Res 44(9):e89CrossRefGoogle Scholar
  38. 38.
    Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinf 19(1):189CrossRefGoogle Scholar
  39. 39.
    Casper J, Zweig AS, Villarreal C, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, Lee CM, Lee BT, Karolchik D et al (2017) The UCSC genome browser database: 2018 update. Nucleic Acids Res 46(D1):D762–D769PubMedCentralGoogle Scholar
  40. 40.
    Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH (2009) JBrowse: a next-generation genome browser. Genome Res 19(9):1630–1638. https://doi.org/10.1101/gr.094607.109 CrossRefGoogle Scholar
  41. 41.
    Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA (2011) Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28(4):464–469CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.University of Greifswald, Institute of Mathematics and Computer ScienceGreifswaldGermany
  2. 2.Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical EngineeringAtlantaUSA
  3. 3.School of Computational Science and EngineeringAtlantaUSA
  4. 4.Moscow Institute of Physics and TechnologyDolgoprudnyRussia

Personalised recommendations