Chromosome by Chromosome Sequencing

The rice genome sequence completed in 2004 by the International Rice Genome Sequencing Project was the first cereal genome sequence (IRGSP 2005), and the sorghum and maize genome sequences followed (Paterson et al. 2009; Schnable et al. 2009). Because the rice genome was sequenced BAC by BAC using Sanger sequencing, its error rate was less than one error in 10 kb. This accuracy was validated by resequencing the genome with next-generation sequencing (NGS) data (Kawahara et al. 2013), which showed that the rice genome sequence is the most accurate of the cereal genomes sequenced so far. The sorghum genome was determined by whole-genome shotgun sequencing, and the maize genome sequence was obtained by combining the minimum tiling path (MTP) method with BAC by BAC sequencing. However, these genome sequences were less accurate than the rice genome and remained fragmented into many scaffolds.

In Pooideae, the Brachypodium distachyon Bd21 genome was sequenced in 2010 by the whole-genome shotgun method, which was feasible because of its small genome size (272 Mb) (The International Brachypodium Initiative 2010). In contrast, sequencing of other Pooideae genomes, such as wheat (Triticum aestivum L.) and barley (Hordeum vulgare L.), has fallen behind because of the complexity of their genome structures. First, the wheat and barley genomes are 17 and 5.1 Gb in size, respectively, more than 40 times and 13 times larger than the rice genome. Second, repeat regions, which occupy more than 80 % of these genomes, hamper genome assembly. Third, since wheat is a hexaploid, it is quite hard to distinguish homoeologous sequences from the A, B and D genomes.

To overcome these problems, various new methods were applied. NGS technology enables large genomes to be assembled at low cost. Although individual NGS reads are only several hundred bp long, millions of reads can be obtained in one analysis (a total read length of up to several Gb), so that large genomes can be assembled. For the barley genome, BAC by BAC sequencing and NGS methods were combined, and 1.9 Gbp of sequence was released in 2012 (The International Barley Genome Sequencing Consortium 2012). In 2013, the genome sequences of Aegilops tauschii and T. urartu were determined (Jia et al. 2013; Ling et al. 2013; Luo et al. 2013). The wheat genome was also sequenced by whole-genome shotgun technology with NGS data (Brenchley et al. 2012). However, because of the hexaploidy, a whole-genome assembly comparable to those of the diploid Triticeae genomes was not achieved.

To reduce genome complexity, chromosome sorting by flow cytometry was developed in cereal genomics (Doležel et al. 2007). Because this method can reduce sample complexity, such as that caused by the hexaploid status of the wheat genome, the International Wheat Genome Sequencing Consortium (IWGSC) decided to apply this technology to its activities. Single chromosomes or chromosome arms were sorted by flow cytometry, and chromosome (arm)-specific BAC libraries were constructed. The progress of physical map construction and genome sequencing for each chromosome and chromosome arm can be followed on the IWGSC website (http://www.wheatgenome.org/) and the URGI wheat portal site (http://wheat-urgi.versailles.inra.fr/).

Survey Sequencing and Annotation of Chromosome 6B

Under the framework of the IWGSC, the sequencing project for chromosome 6B was started in Japan in 2011, and the first survey sequences of 6B were released in 2013 (Tanaka et al. 2014). In this analysis, DNA libraries of the sorted 6B chromosome arms were constructed and sequenced independently using the 454 GS-FLX Titanium (Roche, CT, USA). The sequence reads (454 reads) from each arm were assembled by GS assembler 2.7 (Roche). From more than 12 million reads for each arm, 234 and 273 Mbp were assembled, comprising 262,375 and 173,655 contigs for 6BS and 6BL, respectively. These assemblies correspond to 56.6 % and 54.9 % of the estimated lengths of the two arms (415 Mbp for 6BS and 498 Mbp for 6BL).
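
The coverage figures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and function names are assumptions made for this example, and only the numbers come from the text. Because the assembled lengths are given as rounded Mbp values, the recomputed percentages differ from the reported ones by a few tenths of a percent. The mean contig length is a simple average derived from the same figures, not a statistic reported in the survey.

```python
# Figures copied from the paragraph above; names and layout are illustrative.
ARMS = {
    "6BS": {"assembled_mbp": 234, "estimated_mbp": 415, "contigs": 262_375},
    "6BL": {"assembled_mbp": 273, "estimated_mbp": 498, "contigs": 173_655},
}


def coverage_percent(assembled_mbp, estimated_mbp):
    """Share of the estimated arm length represented in the assembly."""
    return 100.0 * assembled_mbp / estimated_mbp


def mean_contig_bp(assembled_mbp, contigs):
    """Average contig length in bp (a simple mean, not an N50)."""
    return assembled_mbp * 1_000_000 / contigs


for arm, v in ARMS.items():
    cov = coverage_percent(v["assembled_mbp"], v["estimated_mbp"])
    mean_len = mean_contig_bp(v["assembled_mbp"], v["contigs"])
    print(f"{arm}: {cov:.1f} % of estimated length, mean contig {mean_len:.0f} bp")
```

The short mean contig lengths that result illustrate how fragmented a survey assembly of a repeat-rich chromosome arm is.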

As described above, the wheat genome contains abundant repetitive elements. Known classes of repeat elements were detected using repeat libraries such as the TREP and MIPS repeat libraries. In addition, to detect novel repeat elements, we constructed a new repeat library with RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html). Using the repeat masking program censor (http://www.girinst.org/censor/index.php) (Jurka et al. 1996) with TREP and the new repeat library, 76.6 % and 85.5 % of the 6BS and 6BL assemblies were masked, respectively. Since 63.6 % and 72.2 % of the 6BS and 6BL assemblies were masked by the TREP library alone, around 13 % of each assembly may consist of novel repeat elements detected only by the new library.
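
The estimate of around 13 % follows from subtracting the TREP-only masked fraction from the fraction masked with both libraries. A minimal sketch of that subtraction is given below; the variable and function names are hypothetical, and only the four percentages come from the text.

```python
# Percent of each assembly masked, as given in the paragraph above.
MASKED = {
    "6BS": {"trep_plus_new": 76.6, "trep_only": 63.6},
    "6BL": {"trep_plus_new": 85.5, "trep_only": 72.2},
}


def novel_repeat_share(trep_plus_new, trep_only):
    """Percentage points of the assembly masked only by the new library."""
    # Rounded to one decimal to match the precision of the inputs.
    return round(trep_plus_new - trep_only, 1)


for arm, v in MASKED.items():
    share = novel_repeat_share(v["trep_plus_new"], v["trep_only"])
    print(f"{arm}: {share} percentage points masked only by the new library")
```

Note that the difference is expressed in percentage points of the assembly, not as a fraction of the repetitive sequence.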

After repeat detection, we identified transcribed regions by mapping publicly available transcripts. In addition to mRNAs and millions of ESTs in DDBJ/EMBL/GenBank, wheat full-length cDNAs (FLcDNAs) were available from TriFLDB (http://trifldb.psc.riken.jp/index.pl) (Mochida et al. 2009). By combining transcriptome mapping with an ab initio gene prediction program, 4,798 transcribed regions were determined. We found several genes known to be located on chromosome 6B, such as the α-gliadin gene, the stripe rust resistance gene Yr36, the grain protein content gene Gpc-B1, the α-amylase gene, the genes for three low-temperature-responsive dehydrins (Wcs120, Wcs66 and Wcor410), the flowering time gene TaHd1-2 and the vernalization-related gene TmVIL2.

Our assemblies also showed conservation of syntenic genes among monocots. First, 2,399 of the 2,573 high-confidence barley genes on chromosome 6H could be mapped to our assemblies (E value < 10^-5). Second, 3,772 syntenic loci were detected by a homology search with syntenic genes from chromosome 2 of O. sativa, chromosome 3 of B. distachyon and chromosome 4 of S. bicolor. Since 57.4 % of the syntenic regions had wheat transcriptome evidence, a significantly higher proportion than for non-syntenic regions (32.7 %), we concluded that wheat 6B has conserved synteny with the chromosomes of other grass species.

Our annotation pipeline included detection of RNA genes: rRNAs, tRNAs, and miRNAs. It is known that chromosome 6B has a locus for ribosomal DNA (rDNA) containing approximately 5,500 rRNA genes. Moreover, non-protein-coding RNAs, such as microRNAs (miRNAs), are now recognized as biologically important genetic components. We found that some RNA genes were associated with particular repetitive elements. For example, 83 of 131 tRNALys genes were located in the LTR retrotransposon Gypsy and in de novo repeats. Almost all predicted miRNAs were also located in repeat-masked regions, especially in the DNA transposons Mariner and CACTA. For rRNA genes, the quite small number of contigs containing rRNA genes can be explained by the high read depth of those contigs. Because of their high sequence similarity, rRNA regions were collapsed during assembly, so that only a few contigs with high read depth remained in our data. This result is quite similar to that for repetitive regions. These results suggest that RNA genes spread through the wheat genome together with the proliferation of transposons and repetitive elements.

Application of Chromosome 6B Sequences to Wheat Genomics

Deciphering genome sequences not only reveals a representative gene set containing many novel genes but also provides resources for genomics and breeding, such as marker information. In wheat, chromosome information is quite useful for distinguishing homoeologous genes. For example, there are three homoeologs of the flowering time gene: TaHd1-1, TaHd1-2 and TaHd1-3. Our 6B assembly can distinguish TaHd1-2, transcribed from 6B, from the other two homoeologs on 6A and 6D at the level of sequence similarity. In addition, since exon-intron structures are determined on the wheat genome, construction of transcript-based markers, such as PLUG markers, is easier and more accurate than before, when rice genome data had to be used.

Insertion site-based polymorphism (ISBP) markers can be constructed using genome sequences (Paux et al. 2010). A genome-wide survey of simple sequence repeats (SSRs) can be used to construct SSR markers in non-genic regions, which have not been the focus of transcript-based marker construction. As in the genome zipper analysis (Mayer et al. 2009, 2011), the virtual order of the markers can be inferred from the sequence homology of their flanking regions to closely related species, such as barley and Brachypodium. In fact, we found 16,728 SSRs in non-repetitive regions of 6B, and at least 1,354 of them were positioned on barley chromosome 6H. Since more than 80 % of these SSRs were located in intergenic regions of 6H, the new SSR markers can be used efficiently to fill gaps between known markers.
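
To illustrate what a genome-wide SSR survey involves, the following minimal sketch scans a sequence for perfect tandem repeats with a regular expression. It is not the tool or the parameter set used for the 6B analysis, which the text does not specify; the function name, the unit-length range (2 to 6 bp) and the minimum repeat count (4) are assumptions chosen for this example.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    Thresholds are illustrative defaults, not those of the 6B survey.
    """
    seq = seq.upper()
    # A lazily matched unit followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # Skip homopolymer runs reported as e.g. "AA" units.
            continue
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence: two flanking bases, five AC units, two flanking bases.
print(find_ssrs("GG" + "AC" * 5 + "TT"))
```

In a real survey, the flanking sequence of each hit would then be extracted for primer design and for homology searches against related genomes such as barley.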

The survey sequences of wheat chromosome 6B provided various types of novel information, e.g. repeat information, genome annotation including genes and RNA genes, and marker information. However, the current genomic sequences of chromosome 6B are fragmented and do not cover the chromosome completely, so the genome assembly needs to be improved. Sequencing of chromosome 6B is ongoing with the MTP method and BAC by BAC sequencing using Roche 454, and more accurate, physically positioned sequences will be available in the near future.