Crop Genome Annotation: A Case Study for the Brassica rapa Genome

  • Erli Pang
  • Huifeng Cao
  • Bowen Zhang
  • Kui Lin
Part of the Compendium of Plant Genomes book series (CPG)


Genome annotation is crucial for bridging the gap between sequence and biology. It is also a dynamic process of continuous improvement that deepens our understanding of the molecular biology of the genome. Deep RNA sequencing of eight Brassica rapa tissues should allow protein-coding genes to be predicted more accurately when this RNA evidence is incorporated into the analysis. We therefore used our in-house annotation pipeline to re-annotate the B. rapa genome at three levels: repetitive elements, protein-coding genes and non-coding RNA genes. In total, we identified 139.9 Mb of repetitive elements, 6,088 non-coding RNA genes and 45,149 protein-coding genes. These results, together with those published previously, provide a valuable resource for further understanding of B. rapa.


Gene Ontology; Long Terminal Repeat; Genome Annotation; Repetitive Element; Gene Predictor
These keywords were added by machine and not by the authors.



This work was supported by the National Natural Science Foundation of China (Grant: 31171235). We thank everyone who has contributed to building and maintaining the genome annotation pipeline in the Laboratory of Computational Molecular Biology at Beijing Normal University.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. College of Life Sciences, Beijing Normal University, Beijing, China
