Whole-Genome Annotation with BRAKER

Hoff, Katharina J.; Lomsadze, Alexandre; Borodovsky, Mark; Stanke, Mario

doi:10.1007/978-1-4939-9173-0_5

Katharina J. Hoff³,
Alexandre Lomsadze⁴,
Mark Borodovsky^4,5,6 &
…
Mario Stanke³

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1962))

5753 Accesses
284 Citations
10 Altmetric

Abstract

BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.

The authors “Alexandre Lomsadze”, “Mark Borodovsky”, and “Mario Stanke” contributed equally.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Protocol: USD 49.95; Price excludes VAT (USA)

eBook: USD 149.00; Price excludes VAT (USA)

Hardcover Book: USD 199.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2015) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32(5):767–769
Article PubMed PubMed Central Google Scholar
Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 33(20):6494–6506
Article CAS PubMed PubMed Central Google Scholar
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 18:1979–1990. https://doi.org/10.1101/gr.081612.108
Article CAS PubMed PubMed Central Google Scholar
Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42(15):e119
Article PubMed PubMed Central Google Scholar
Stanke M, Schöffmann O, Dahms St, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinf 7:62
Article Google Scholar
Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B (2006) AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 3(34):W435–W439
Article Google Scholar
Stanke M, Steinkamp R, Waack S, Morgenstern B (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 32:W309–W312
Article CAS PubMed PubMed Central Google Scholar
Hoff KJ, Stanke M (2013) WebAUGUSTUS – a web service for training AUGUSTUS and predicting genes in eukaryotes. Nucleic Acids Res 41(W1):W123–W128
Article PubMed PubMed Central Google Scholar
König S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple genomes. Bioinformatics 32(22):3388–3395
PubMed PubMed Central Google Scholar
Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24(5):637–644
Article CAS PubMed Google Scholar
Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18(1):188–196
Article CAS PubMed PubMed Central Google Scholar
Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf 12(1):491
Article Google Scholar
Abbott A (2005) Competition boosts bid to find human genes. Nature 435:134
Article CAS PubMed Google Scholar
Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(1):S2
Article PubMed PubMed Central Google Scholar
Stanke M, Tzvetkova A, Morgenstern B (2006) AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol 7(1):S11
Article PubMed PubMed Central Google Scholar
Coghlan A, Fiedler T, McKay S, Flicek P, Harris T, Blasiar D, the nGASP Consortium, Stein L (2008) nGASP - the nematode genome annotation assessment project. BMC Bioinf 9(1):549
Article Google Scholar
Steijger T, Abril JF, Engstrom PG, Kokocinski F, Akerman M, Alioto T, Ambrosini G, Antonarakis SE, Behr J, Bohnert R, Bucher P, Cloonan N, Derrien T, Djebali S, Du J, Dudoit S, Gerstein M, Gingeras TR, Gonzalez D, Grimmond SM, Habegger L, Iseli C, Jean G, Kahles A, Lagarde J, Leng J, Lefebvre G, Lewis S, Mortazavi A, Niermann P, Rätsch G, Reymond A, Ribeca P, Richard H, Rougemont J, Rozowsky J, Sammeth M, Sboner A, Schulz MH, Searle SMJ, Solorzano ND, Solovyev V, Stanke M, Steijger T, Stevenson BJ, Stockinger H, Valsesia A, Weese D, White S, Wold BJ, Wu J, Wu TD, Zeller G, Zerbino D, Zhang MQ, Hubbard TJ, Guigo R, Harrow J, Bertone P (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10(12):1177–1184
Article CAS PubMed PubMed Central Google Scholar
Keller O, Odronitz F, Stanke M, Kollmar M, Waack S (2008) Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinf 9(1):278
Article Google Scholar
Gremme G (2013) Computational gene structure prediction. PhD thesis, Universität Hamburg
Google Scholar
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31(19):5654–5666
Article CAS PubMed PubMed Central Google Scholar
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) A basic local alignment search tool. J Mol Biol 215(3):403–410
Article CAS PubMed Google Scholar
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinf 10(1):421
Article Google Scholar
Barnett DW, Garrison EK, Quinlan AR, Strömberg MP, Marth GT (2011) BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27(12):1691–1692
Article CAS PubMed PubMed Central Google Scholar
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
Article PubMed PubMed Central Google Scholar
Chen N (2004) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinf 5(1):4.10. 1–4.10. 14
Google Scholar
Price AL, Jones NC, Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1):i351–i358
Article CAS PubMed Google Scholar
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
Article CAS PubMed Google Scholar
Daehwan K, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36
Article Google Scholar
Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(6):873–881
Article CAS PubMed PubMed Central Google Scholar
Kapustin Y, Souvorov A, Tatusova T, Lipman D (2008) Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 3(1):20
Article PubMed PubMed Central Google Scholar
Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, et al (2011) eggNOG v3. 0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res 40(D1):D284–D289
Article PubMed PubMed Central Google Scholar
Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV (2012) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res 41(D1):D358–D365
Article PubMed PubMed Central Google Scholar
Slater GSC, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinf 6(1):31
Article Google Scholar
Gotoh O (2008) Direct mapping and alignment of protein sequences onto genomic sequence. Bioinformatics 24(21):2438–2444
Article CAS PubMed Google Scholar
Gotoh O (2008) A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res 36(8):2630–2638
Article CAS PubMed PubMed Central Google Scholar
Iwata H, Gotoh O (2012) Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res 40(20):e161
Article CAS PubMed PubMed Central Google Scholar
Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F (2016) Using intron position conservation for homology-based gene prediction. Nucleic Acids Res 44(9):e89
Article PubMed PubMed Central Google Scholar
Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinf 19(1):189
Article Google Scholar
Casper J, Zweig AS, Villarreal C, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, Lee CM, Lee BT, Karolchik D et al (2017) The UCSC genome browser database: 2018 update. Nucleic Acids Res 46(D1):D762–D769
PubMed Central Google Scholar
Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH (2009) JBrowse: a next-generation genome browser. Genome Res 19(9):1630–1638. https://doi.org/10.1101/gr.094607.109
Article CAS PubMed PubMed Central Google Scholar
Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA (2011) Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28(4):464–469
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

This work is supported in part by the US National Institutes of Health grant HG000783 to MB, by the German Research Foundation grant 1009/12-1 to MS and by the US National Institutes of Health grant GM128145 to MB and MS.

Author information

Authors and Affiliations

University of Greifswald, Institute of Mathematics and Computer Science, Greifswald, Germany
Katharina J. Hoff & Mario Stanke
Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA, USA
Alexandre Lomsadze & Mark Borodovsky
School of Computational Science and Engineering, Atlanta, GA, 30332, USA
Mark Borodovsky
Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
Mark Borodovsky

Authors

Katharina J. Hoff
View author publications
You can also search for this author in PubMed Google Scholar
Alexandre Lomsadze
View author publications
You can also search for this author in PubMed Google Scholar
Mark Borodovsky
View author publications
You can also search for this author in PubMed Google Scholar
Mario Stanke
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Katharina J. Hoff or Mark Borodovsky .

Editor information

Editors and Affiliations

Group Systems Biology of Motor Proteins, Department NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Goettingen, Germany
Martin Kollmar

Rights and permissions

Reprints and permissions

Copyright information

About this protocol

Cite this protocol

Hoff, K.J., Lomsadze, A., Borodovsky, M., Stanke, M. (2019). Whole-Genome Annotation with BRAKER. In: Kollmar, M. (eds) Gene Prediction. Methods in Molecular Biology, vol 1962. Humana, New York, NY. https://doi.org/10.1007/978-1-4939-9173-0_5

Download citation

DOI: https://doi.org/10.1007/978-1-4939-9173-0_5
Published: 25 April 2019
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-4939-9172-3
Online ISBN: 978-1-4939-9173-0
eBook Packages: Springer Protocols

Publish with us

Policies and ethics