
14.1 Introduction

In the early 2000s, technological advances in DNA sequencing allowed the sequencing and the comparison of the genomes from several individuals of the same species (Medini et al. 2005). This helped fuel the notion that an individual genome is insufficient to serve as an appropriate genomic reference, since it does not capture the diversity that represents the species. The idea emerged of a “pan-genome” that encompasses the genomic information of several representative individuals. Pan-genomics was initially applied to many smaller and simple genomes of microbial species, particularly to understand presence/absence variation (PAV) in genes (Medini et al. 2005). The idea of the pan-genome has since been applied to diverse species across all taxonomic kingdoms and has evolved to consider all possible variation present between genomes, including non-genic, PAV, copy number, and structural variation (Jayakodi et al. 2021). Pan-genomics has also been applied more broadly to groups of related species or genera, for “super pan-genomes.” While still in its infancy, pan-genomics of crop species can be particularly valuable for harnessing genomic variants and increasing rates of crop improvement. The application of pan-genomes in crop breeding is gaining increased interest due to the importance of food security and the need for more efficient and effective breeding methods. To date, pan-genomes have been applied to the improvement of various crops, including barley, maize, rice, tomato, and soybean (Gao et al. 2019; Gui et al. 2022; Jayakodi et al. 2020; Liu et al. 2020; Shang et al. 2022; Zhao et al. 2018). Applications of pan-genomics for wheat improvement have also become possible since the completion and the public release of multiple high-quality reference genomes (Walkowiak et al. 2020).

Wheat is a crucial crop globally, with widespread cultivation and significant economic importance, supplying a fifth of global calories and protein (Dixon 2007; Shiferaw et al. 2013). To maintain food security in the context of exponential growth of the human population while facing new challenges (e.g., global warming and climate change) in production, it is essential to create new wheat varieties with increased yield, better quality, and resistance or tolerance to abiotic and biotic stress (Abberton et al. 2016; Batley and Edwards 2016). Early wheat improvement relied on traditional breeding methods, where wheat lines were phenotypically selected in field trials, which is both costly and labor intensive. As our understanding of wheat genetics improved, it became possible to identify major effect genes underlying qualitative traits and to select for these genes through marker-assisted selection (MAS, see also Chap. 9). Marker-assisted selection has been successfully applied to certain traits, particularly disease resistance (Miedaner and Korzun 2012). Unfortunately, many key traits, including yield, are complex and polygenic. Selection of quantitative traits that are more complex and are influenced by non-genic features, several genes, or gene interactions requires more advanced tools for making DNA-based selections. With the recent availability of high-quality genome assemblies and gene annotations for wheat, it has been possible to apply high-throughput genotyping arrays or genotyping-by-sequencing methods to gather genome-wide variation information and select for these complex traits at the whole-genome level, through genomic selection (GS) (Haile et al. 2021). Nevertheless, identifying key major effect genes as well as the mechanisms underpinning more complex traits requires a deeper understanding of the diversity of wheat and the impact of genomic variation on phenotypic traits.
It is critical to understand the diversity within wheat that is available to breeders in order to make breeding more efficient, identify suitable parents to use in targeted crosses, and select for the best possible combination of genes for rapid trait enhancement.

Despite its importance for food security, the application of genomics and pan-genomics for wheat improvement has been challenged by the large size and the complexity of its genome. The genome is composed of three separate diploid subgenomes, resulting in allohexaploidy (genome AABBDD), where the ‘A’ subgenome was derived from T. urartu, the ‘B’ subgenome from a species related to Ae. speltoides, and the ‘D’ subgenome from Ae. tauschii. The genome of modern bread wheat is estimated to be 17 gigabase-pairs (Gb) in length and is composed of ~ 90% repetitive elements. Recent achievements in genome sequencing and assembly technologies have enabled the release of multiple wheat genomes and tools to create a pan-genome, which is inspiring a new age of wheat breeding. In this review, we explore the concept of pan-genomes and a pan-genome of wheat, the history and evolution of the wheat genome and pan-genome, and the future outlook of wheat pan-genomics for research and applied breeding.

14.2 Motivations for Studying Pan-Genomes in Crop Breeding

During the last decade, there have been significant advancements in next-generation sequencing (NGS) technologies, which offer a direct view into DNA variation. These advancements have created numerous possibilities to investigate the connection between genotype and phenotype with greater precision than ever before. NGS has been used for various projects, including gene expression analysis, polymorphism detection, and the development of molecular markers (Barabaschi et al. 2012; Delseny et al. 2010). With the advent of affordable genome sequencing, breeders have started using NGS to sequence extensive groups of plants, which has enhanced the precision of identifying quantitative trait loci (QTL) and simplified the process of discovering genes. This has, in turn, formed the foundation for creating models to comprehend complex genotype–phenotype relationships at the whole-genome level. Over the past two decades, advancements in sequencing technologies, assembly techniques, and computational algorithms have enabled the release of genome sequences for over 700 plant species (Sun et al. 2022).

In parallel, advancements in using DNA-based tools for plant breeding, such as MAS and GS, have progressed significantly. Genomics approaches identified genomic markers associated with traits and were termed as QTL (Geldermann 1975). A single QTL can harbor many genes within the same locus (Beckmann and Soller 1983; Westman et al. 1997). MAS has been in use since the early 1990s and involves identifying genomic markers in silico, which are within causal genes for traits or are closely linked, which are then used to select individuals (Tanksley and Nelson 1996).

The development of reference genome assemblies has expedited the process of identifying candidate genes for in-demand traits. These assemblies serve as a basis for pinpointing single-nucleotide polymorphisms (SNPs), copy number variations (CNVs), and insertion–deletions (InDels) within an individual’s DNA sequence. These markers were used as the basis for conducting genome-wide association studies (GWAS) and genomic selection (GS), which involve comparing diversity panels with reference genomes to identify statistical associations between markers and traits (Crossa et al. 2017; Hayes and Goddard 2010; Varshney et al. 2009). Despite providing a greater insight into the diversity of plant species, particularly at the SNP level (Gore et al. 2009; McNally et al. 2009), reference genomes cover only a limited portion of the overall genomic space of a species and are inadequate in capturing variation across every individual within a given crop species (Bayer et al. 2020). A paradigm shift is occurring due to new advancements in genomics, which now take into account the significance and amount of structural variations (SVs) present in the pan-genome of crop species. This includes capturing all types of SVs, such as PAVs, CNVs, and repetitive elements or transposable elements (TEs), present throughout the entire genome of all individuals belonging to a plant species (Danilevicz et al. 2020; Golicz et al. 2016; Tao et al. 2019). By cataloging this variation and linking it to phenotypic/trait information, it is then possible to select parents and candidate wheat lines in breeding programs with more advanced knowledge and decision support tools, allowing for more efficient and targeted crop improvement.

14.3 Historical Challenges and Progress in Wheat Genome Sequencing and Assembly

Prior to the availability of NGS, whole-genome sequencing was performed using the Sanger sequencing technology. Due to a combination of several factors, including the cost and low throughput of Sanger sequencing, and the size and complexity of some large genomes, many genomes were first cloned into bacterial artificial chromosomes (BACs) that included a few hundred thousand base-pairs per clone. This allowed for each BAC to be sequenced and assembled in parallel and then stitched together to assemble larger, more complex genomes. After the release of the first human genome sequence in 2000, which was achieved through the use of BACs (Lander 2001; Venter et al. 2001), the Arabidopsis genome was the first plant genome to be sequenced using this approach. This was followed by the completion of multiple versions of the rice genome two years later (Goff et al. 2002; Yu et al. 2002). The wheat genome’s larger size, almost 40 times that of rice, and its complexity, which included a high proportion of repetitive sequences and homoeologous DNA copies from three subgenomes, made it economically unfeasible to employ a standard sequencing method. To tackle this challenge, the International Wheat Genome Sequencing Consortium (IWGSC) was established in 2005. The consortium divided the immense task among 20 countries based on chromosomes and chromosome arms. The approach employed genetic stocks that could be differentiated by flow cytometry on an individual chromosome basis (Consortium et al. 2014). Physical maps and minimum tiling paths were produced by fingerprinting BAC libraries, which were subsequently sequenced and assembled (Safár et al. 2010). Although the chromosome-by-chromosome approach was adopted, it took nearly ten years to implement and was only partially accomplished for a few chromosomes, including chromosome 3B (Paux et al. 2008).
Due to the large size of the hexaploid wheat genome, certain researchers have opted to pursue a different approach by focusing on the genomes of related diploid species, such as Ae. tauschii. This species has a much smaller genome size, approximately one-third of that of hexaploid wheat (~ 4.792 Gb) and does not have any interference from homoeologous DNA copies during physical mapping and eventual sequence assembly. Despite implementing this method, the initial use of regular agarose gels made the task seem overwhelming. However, to anchor contigs, higher throughput technologies such as SNaPshot BAC fingerprinting and Illumina Infinium SNP array were utilized. It took a decade to produce the first version of the Ae. tauschii physical map, which involved fingerprinting 461,706 BAC clones and assembling them into 2263 contigs. Afterward, 7185 molecular markers were utilized to anchor these contigs onto a genetic map (Luo et al. 2013). Despite some success with Ae. tauschii, the BAC approach had limited achievement in hexaploid wheat and the approach was slowly abandoned for wheat once more advanced DNA sequencing, sequencing library preparation, and genome assembly technologies became available.

In the 2000s, wheat genome sequencing was boosted by Illumina sequencing technologies, which were able to perform short read paired-end sequencing at high depth and low cost. The sequencing was first done on the diploid ancestors of common wheat due to their smaller genome size and early challenges of applying short read data to large polyploid genomes. The draft genome assembly for Ae. tauschii, the D genome donor for bread wheat, was completed using short read sequencing methods to about 90 × coverage (Jia et al. 2013). Approximately 83.4% of the genome was covered by the assembled scaffolds, and out of these, 65.9% were identified as transposable elements (TEs). Using RNA-seq data from different tissues, a total of 43,150 protein-encoding genes were identified. A comparable approach was employed to construct the genome sequence of the A genome contributor, T. urartu. The assembly that was obtained had a total length of 3.92 Gb, which corresponds to 79.35% of the estimated size of the A genome (4.94 Gb). However, due to subgenome interactions and evolutionary processes spanning around 10,000 years, the genomes of the progenitors are not able to fully depict their counterparts in the common wheat genome. Therefore, the sequencing of the common wheat genome was yet to be achieved.

The first sequencing of the common wheat genome for the landrace CHINESE SPRING was accomplished using Roche 454 pyrosequencing, specifically the GS FLX Titanium and GS FLX+ platforms, which were used to sequence the wheat genome to about 5 × coverage. Sequencing of related progenitors was also performed using various platforms, such as Illumina methods for sequencing of T. monococcum, a close relative of the A genome donor of bread wheat. Likewise, Ae. tauschii was sequenced using the Roche 454 sequencing platform. While whole-genome data was not yet available, cDNAs were sequenced from Ae. speltoides, which has a genome similar to the B genome. Using the SOLiD sequencing platform, additional short reads of CHINESE SPRING were generated. These yielded 95,000 predicted gene models, with most of them assigned to either the A, B, or D subgenome. Despite its high degree of fragmentation, the draft genome was still considered valuable, as it was the first wheat genome available for community use (Brenchley et al. 2012).

As the IWGSC adopted the chromosome-based BAC sequencing approach, progress was consistently made. As NGS became available, it was possible to sequence the BACs using more high-throughput methods. The approach involved developing sequencing libraries from the DNA of individual chromosomes or their arms and subsequently sequencing paired-end reads on the Illumina HiSeq 2000 platform. The assembly obtained, which resembled the 454 assembly, comprised approximately 500,000 contigs with N50 values ranging from 1.7 to 8.9 kb. Its total size was 10.2 Gb. These contigs, taken together, make up 61% of the estimated hexaploid wheat genome. Predictions were made for a total of 133,090 high confidence genes, as well as 890,576 low confidence genes. Using a genetic map, just over half of the high confidence genes were assigned genetic positions (Mascher et al. 2013), allowing them to be considered within the context of the telosome-based assembly resources for each chromosome arm. This led to the completion of a draft genome assembly of wheat, known as the IWGSC chromosome survey sequence (CSS) assembly (Consortium et al. 2014).

The IWGSC also accomplished a noteworthy feat when they generated a reference-level sequence of chromosome 3B (Choulet et al. 2014). This high-quality sequence was created using a minimum tiling path consisting of 8452 BACs, spanning 774 Mb, and containing 5326 protein-coding genes as well as 85% of TEs. Additionally, a molecular-genetic map (CHINESE SPRING x RENAN) was used for long-range orientation of DNA sequences. The assembly of chromosome 3B demonstrated the success of the chromosome-based BAC sequencing strategy, although the assembly remained approximately 7% incomplete.

14.4 The Completion of a Chromosome-Scale Assembly of Hexaploid Wheat

While evidence suggested the BAC sequencing approach could work for achieving a chromosome-based wheat genome assembly, the complexity of the genome, high repeat content, high transposon activity, large genome size, and allopolyploidy were continuing to hamper assembly efforts. Meanwhile, third-generation sequencing technologies, which were created by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), surfaced and progressed quickly. These techniques produce reads with substantially longer lengths and have been extensively employed, in combination with established assembly algorithms, to construct intricate and sizable plant genomes with unparalleled precision (Cheng et al. 2021; Koren et al. 2017; Niu et al. 2022). This led to a paradigm shift away from BAC sequencing and toward the direct shotgun sequencing of the genome using more advanced sequencing technologies and assembly algorithms.

A new assembly method called MaSuRCA was used to assemble wheat using a hybrid approach that combined the strengths of both PacBio long reads, which have high error rates, and Illumina short reads, which are more accurate. This method was initially used to create a genome assembly of Ae. tauschii (Zimin et al. 2017a). To obtain a comprehensive sequence coverage of the genome, a combination of sequencing methods was employed, including over 19 million PacBio reads providing approximately 38 × coverage of the D genome, 177 × coverage from Illumina HiSeq 2500 reads consisting of 200-base paired-end reads, and MiSeq reads consisting of 250-base paired-end reads. The sequencing libraries with a range of insert sizes yielded a total coverage of 200 × of the genome. The genome’s quality was validated through a comparison with optical maps and BAC assemblies that were produced independently. Subsequently, the pipeline was utilized to produce the initial near-complete hexaploid wheat genome for CHINESE SPRING (Zimin et al. 2017b). Triticum 1.0 was a genome assembly consisting of 829,839 contigs with a total size of 17.05 Gb, with a contig and scaffold N50 of 76.3 kb and 101.2 kb, respectively. Another method involved assembling long reads directly with the FALCON assembler, which produced FALCON Trit1.0 with a size of 12.94 Gb. Although this version was shorter than the MaSuRCA-assembled version, it had a longer contig N50 of 215.3 kb. Using the genome alignment tool MUMmer (Kurtz et al. 2004), the combination of Triticum 1.0 and Trit1.0 resulted in a final assembly that spans almost the entire wheat genome, with a size of 15.3 Gb and a contig N50 of 232.6 kb.
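The contig and scaffold N50 values quoted for these assemblies follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together span at least half of the assembly. The sketch below illustrates that calculation under this assumed definition; the function name and the toy contig lengths are hypothetical and are not taken from any of the cited assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)   # longest sequences first
    half_total = sum(ordered) / 2             # half of the assembly size
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # This sequence is the one that pushes the running sum past 50%.
            return length


# Toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

Because N50 depends only on the length distribution, an assembly with a smaller total size can still report a larger N50, as was the case when the FALCON assembly was compared with the MaSuRCA assembly.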

At the same time, an alternative approach was also taken to create the CHINESE SPRING genome assembly using short reads (Clavijo et al. 2017). The approach involved 1.1 billion 250-bp paired-end reads (33 × genome coverage) from CHINESE SPRING short insert libraries, and 68 × coverage of long insert libraries, yielding the TGACv1 version of the wheat genome assembly. This version spanned 13.43 Gb and accounted for over 78% of the wheat genome. In addition to the improved assembly, strand-specific Illumina RNA-seq and PacBio full-length cDNAs were combined to achieve better annotation. Although chromosome-level assembly was not attained, this new wheat genome assembly was now available for the broader scientific community to utilize, bringing the prospect of a high-quality reference genome into focus.

Shortly thereafter, a breakthrough was made with the release of new short read assemblers. NRGene’s DeNovoMagic (NRGene, Ness Ziona, Israel) algorithm and the TRITEX pipeline (Monat et al. 2019) for short read assemblies demonstrated that a shotgun whole-genome sequencing approach could succeed when combining different Illumina library sizes and preparation methods. The AABB genome of wild emmer wheat (WEW), which represents a reference-level genome of a polyploid wheat, was produced using the DeNovoMagic algorithm (Avni et al. 2017). By sequencing on Illumina HiSeq 2500 machines, a total of 2.1 terabase-pairs (Tb) of reads were generated from five libraries, corresponding to 176 × genome coverage. The insert sizes in the libraries ranged from 450 bp to 10 kb. The scaffolds were then consolidated using a high-density molecular-genetic linkage map and additional reads from a three-dimensional (3D) conformation capture Hi-C library. Ultimately, the final assembly was 10.5 Gb, accounting for 87.5% of the predicted tetraploid wheat genome. The annotation of 110,544 gene models provided strong evidence for the high quality of this genome assembly. Among these models, 58.8% (65,012) were identified as high confidence gene models, while the remaining 41.2% were of low confidence. This assembly successfully captured 98.4% of the total expected gene sets of WEW, as verified by BUSCO (Simão et al. 2015). Additionally, it was utilized for identifying the genes that played a role in the early domestication of wheat, as reported by Avni et al. (2017). After the completion of the WEW genome, bread wheat genome sequencing efforts quickly pivoted toward the same shotgun genomics approach. The successful completion of the bread wheat genome IWGSC RefSeq v1.0 was achieved using a combination of similar techniques and software. According to Consortium et al. (2018), DeNovoMAGIC2 utilized the complete genome as the primary framework and incorporated various sources of data such as physical maps, genotyping-by-sequencing data, and Hi-C data. The common wheat genome was assembled into 21 pseudomolecules at the chromosome scale, which were assigned to the subgenomes A, B, and D. This resulted in a genome assembly with a super-scaffold N50 of 22.8 Mb, and total length of 14.5 Gb. Using a similar assembly approach, the genome sequencing of durum wheat (DW) was completed shortly after (Maccaferri et al. 2019).
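The coverage and completeness figures reported for the WEW assembly can be related with simple arithmetic, assuming that coverage is the total sequenced bases divided by the estimated genome size and that completeness is the assembly size divided by the same estimate. The sketch below is only a consistency check on the rounded numbers quoted above; the function names are hypothetical.

```python
def sequencing_depth(total_bases, genome_size):
    """Average fold-coverage: sequenced bases per base of genome."""
    return total_bases / genome_size


def assembled_fraction(assembly_size, genome_size):
    """Share of the estimated genome represented in the assembly."""
    return assembly_size / genome_size


# Figures quoted in the text for wild emmer wheat (rounded values).
wew_assembly_size = 10.5e9      # 10.5 Gb final assembly
wew_fraction = 0.875            # reported as 87.5% of the predicted genome
wew_total_bases = 2.1e12        # 2.1 Tb of Illumina reads

# Genome size implied by the two assembly figures (about 12 Gb).
wew_genome_size = wew_assembly_size / wew_fraction
wew_depth = sequencing_depth(wew_total_bases, wew_genome_size)

print(f"implied genome size: {wew_genome_size / 1e9:.1f} Gb")
print(f"implied coverage: {wew_depth:.0f} x")
```

Under these assumptions the implied depth is about 175 ×, which agrees with the reported 176 × once the rounding of the published figures is taken into account.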

14.5 Progress Toward a Wheat Pan-Genome

In 2018, the IWGSC released the first reference-quality genome sequence for the wheat landrace CHINESE SPRING, which marked a significant change in the use of genomics as a research tool for wheat. The publication enabled the wider research community to have easy access to this tool (Consortium et al. 2018). The CHINESE SPRING genome assembly was a major milestone in wheat genomics research, and within a few years, it has already laid the foundation for countless studies dissecting the genome to understand wheat biology. However, CHINESE SPRING shares only a distant ancestral connection with the majority of current wheat varieties. Additionally, due to the considerable diversity present within the species, a single genome sequence is insufficient for fully representing its genetic makeup. Additional pan-genome information is required to identify new genetic diversity that can enhance traits and understand the mechanism behind the traits present in elite wheat cultivars. Fortunately, with new short-read assembly algorithms capable of shotgun sequencing, the path forward to additional genomes would no longer be a technical limitation.

Choosing crop genotypes for pan-genome analysis is a challenging task as the objective is to encompass a wide range of genetic variations using a limited number of representative genotypes for the particular species. This selection procedure necessitates the acquisition of genome-wide genotypic data from either entire genebank collections or representative subgroups that cover all significant germplasm groups within the species. Recent reports have described several genebank genomics studies on rice (Wang et al. 2018), barley (Milner et al. 2019), and wheat (Juliana et al. 2019). Soleimani et al. (2020) have described different methods that can be used to choose core sets for pan-genome analysis. One tool that aims to maximize diversity, representativeness, and allelic richness of core sets is Core Hunter (De Beukelaer et al. 2018). It achieves this by using various algorithms that operate on genetic distance matrices. To further customize the selection process, clustering of the diversity space through principal component analysis (Patterson et al. 2006) or model-based ancestry estimation (Alexander et al. 2009) can be used. Pan-genome panels offer the possibility of incorporating not only cultivated plant varieties but also wild progenitors or ancestors of polyploid species, for example, teosinte as a wild progenitor of maize and wild emmer or Aegilops tauschii as progenitors of wheat. These wild relatives are valuable out-groups and represent diversity available in the secondary and tertiary gene pools. These relatives could be included to determine the ancestral states of SVs or because of their significance in introgression breeding (Harlan and de Wet 1971). Besides emphasizing the incorporation of diverse global varieties of a crop, a pan-genome initiative might also choose specific genotypes that have a significant role in breeding and genetics. These could comprise founder genotypes of breeding programs, experimental population parents (Yu et al. 2008), or genotypes that can be genetically modified (Jain et al. 2019; Schreiber et al. 2020) to optimize the advantages for both research and breeding communities. These chosen accessions will serve as reference genotypes for future functional and genetic studies in pan-genomic research.
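To make the idea of selecting a core set from a genetic distance matrix more concrete, the sketch below uses a simple greedy farthest-first heuristic: it starts from the accession with the largest summed distance to all others and then repeatedly adds the accession that is farthest from those already chosen. This is an assumed illustration only; it is not the algorithm implemented in Core Hunter, and the accession labels and distances are invented toy values.

```python
def greedy_core_set(distances, size):
    """Pick `size` accessions from a symmetric distance matrix (list of lists)."""
    n = len(distances)
    if not 0 < size <= n:
        raise ValueError("size must be between 1 and the number of accessions")
    # Seed with the accession that is, in total, most distant from the others.
    first = max(range(n), key=lambda i: sum(distances[i]))
    selected = [first]
    while len(selected) < size:
        candidates = [i for i in range(n) if i not in selected]
        # Add the candidate whose nearest selected neighbour is farthest away.
        best = max(candidates,
                   key=lambda i: min(distances[i][j] for j in selected))
        selected.append(best)
    return selected


# Hypothetical pairwise distances (e.g., counts of differing markers).
names = ["A", "B", "C", "D", "E"]
toy_distances = [
    [0, 1, 9, 8, 6],
    [1, 0, 8, 9, 5],
    [9, 8, 0, 1, 5],
    [8, 9, 1, 0, 5],
    [6, 5, 5, 5, 0],
]
core_indices = greedy_core_set(toy_distances, 3)
core_names = [names[i] for i in core_indices]
print(core_names)
```

In this toy matrix, A and B are near-duplicates, as are C and D, so a three-member core set keeps one accession from each pair plus the intermediate accession E. Dedicated tools optimize additional criteria, such as allelic richness, that this heuristic ignores.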

The International 10 + Wheat Genomes Project (www.10wheatgenomes.com) was established in 2019 with the goal of creating reference-quality genome assemblies for at least ten diverse bread wheat cultivars. Using genomic diversity analysis of 3800 wheat samples, ten wheat lines were chosen and sequenced utilizing Illumina short read sequencing technologies, and then assembled using NRGene’s DeNovoMagic algorithm (NRGene, Ness Ziona, Israel). Subsequently, all these assemblies were organized into subgenome-aware pseudomolecules with the aid of Hi-C technology (van Berkum et al. 2010). Additionally, five other wheat varieties were also sequenced and assembled to the scaffold level using separate short-read assembly algorithms established at the Earlham Institute (Norwich, UK).

A gene projection strategy was implemented and applied to all assemblies to evaluate and compare the gene content of the newly sequenced lines in a fair and consistent manner, given the lack of genome-specific transcriptome data available at that time. This strategy involved using the CHINESE SPRING reference gene models and transferring them to all assemblies. Differences in gene content among the 10 + wheat reference genomes were observed, likely due to the complex breeding histories of the selected lines. These variations in gene content were found to be linked with adaptation to different environments and with efforts to enhance grain yield, quality, and resistance to abiotic and biotic stresses. Significant structural rearrangements and introgressions from wild relatives were observed upon comparing the pseudomolecule structures of the reference sequences. This underscores the importance of having multiple reference genomes of quality (at pseudomolecule level) instead of relying on resequencing approaches, as only chromosome-level assemblies can provide information on large- and small-scale structural rearrangements with a high degree of resolution and accuracy. The study conducted by Walkowiak et al. (2020) illustrates how the wheat pan-genomes can be utilized to study causal genes for traits, as the genomes were used to uncover the gene Sm1, known for conferring resistance against midge. With the availability of recently sequenced and compiled wheat reference genomes, there is an unprecedented opportunity to identify functional genes and enhance wheat breeding. The subsequent phase of the project will involve generating de novo gene predictions for all chromosome-scale assemblies using extensive transcriptome data. These data will offer a comprehensive understanding of the functional and regulatory arrangement of the wheat pan-genome.

While the 10 + Wheat Genomes Project provided the first insights into the wheat pan-genome, sequencing and assembly methods continued to evolve. Throughput increased for both PacBio and ONT sequencing platforms, leading to additional genome assemblies (Aury et al. 2022). Further, PacBio released its HiFi sequencing method based on circular consensus sequencing, which significantly improved sequencing and assembly accuracy. These long and accurate sequencing reads have led to the highest-quality genome assemblies of wheat achieved thus far. With the upcoming release of new long read sequencing technologies with high accuracy and output, such as the Revio platform from PacBio, it is expected that additional genomes for wheat will be released in the coming years. While no longer constrained by technological limitations in genome sequencing and assembly, the next chapter begins for integrating these data into a functional pan-genome that will drive future research and breeding.

14.6 A Functional Pan-Genome for Wheat Research and Applied Breeding

Pan-genome construction is the process of creating a comprehensive set of genetic information from a collection of related genomes. It is a complex task that requires multiple approaches and techniques. It involves assembling and annotating all genomic information and variants, and the resulting pan-genome can be used to understand genome and gene evolution, discover new genes and alleles, and investigate gene–gene interaction networks.

To construct a pan-genome, two primary methods can be utilized: whole-genome assembly and comparative genomics. Whole-genome assembly involves assembling all of the reads from a collection of genomes into a single, contiguous genome. The steps for whole-genome assembly are well-documented (Jung et al. 2020). The approach is most appropriate for genomes that are closely related and possess significant sequence similarity. It offers the benefit of an all-encompassing perspective on the species’ genetic variation, but it is often restricted by the number of genomes that can be sequenced. Comparative genomics (Pop et al. 2004), on the other hand, involves comparing and contrasting multiple genomes to identify shared and unique components. This method is most suitable for more distantly related genomes with lower sequence similarity.

The ability to assemble high-quality reference genomes for numerous plants simultaneously has been made possible by recent advancements in sequencing technologies and bioinformatic tools. Despite this progress, it is still challenging to perform combined analysis of multiple genomes or a subset of genomes and provide readily accessible genetic information to end-users, such as researchers and breeders (Li et al. 2020b). The comparison, analysis, and visualization of multiple reference genomes and their diversity necessitate powerful and specialized computational strategies and tools. De novo assembly, iterative assembly, and graph-based assembly methods have been employed to construct pan-genomes (Li et al. 2014; Liu and Tian 2020).

14.6.1 De Novo Assembly

Constructing a pan-genome can be achieved through the de novo assembly of genomes from multiple individuals, followed by comparative analysis to identify variant types and classify them as core or flexible genome components. This approach has been discussed by Mahmoud et al. (2019). Technological advancements in sequencing and assembly methods have enabled the generation of high-quality, chromosome-level plant genomes, including telomere-to-telomere genome assemblies (Miga et al. 2020). However, generating accurate genome assemblies can be costly, especially for large plant genomes, and may not be practical when dealing with hundreds of reference genomes for a single species (Hurgobin and Edwards 2017). Nevertheless, the 10 + Wheat Genomes Project was successful at the construction of several chromosome-scale assemblies. Along with these genomes were tools to visualize haplotype blocks representing shared or unique regions between the assemblies (http://www.crop-haplotypes.com/) (Brinton et al. 2020). Likewise, many of the wheat genomes had major introgressions or large structural variants, which could be visualized using synteny viewers (https://kiranbandi.github.io/10wheatgenomes/, http://10wheatgenomes.plantinformatics.io/).
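Once assemblies have been compared, the classification into core and flexible components reduces to a presence/absence tally. The sketch below assumes a simple rule in which a gene is core if it is present in every accession and flexible if it is present in only some of them; the gene and accession names are hypothetical, and real analyses typically apply more nuanced thresholds.

```python
def classify_genes(presence, accessions):
    """Split genes into core (in all accessions) and flexible (in some)."""
    all_accessions = set(accessions)
    core, flexible = [], []
    for gene, found_in in presence.items():
        hits = all_accessions & set(found_in)
        if hits == all_accessions:
            core.append(gene)        # present in every accession
        elif hits:
            flexible.append(gene)    # present in at least one, but not all
    return sorted(core), sorted(flexible)


# Hypothetical presence/absence calls for four genes in three lines.
toy_accessions = ["line1", "line2", "line3"]
toy_presence = {
    "geneA": {"line1", "line2", "line3"},
    "geneB": {"line1", "line3"},
    "geneC": {"line2"},
    "geneD": {"line1", "line2", "line3"},
}
core_genes, flexible_genes = classify_genes(toy_presence, toy_accessions)
print("core:", core_genes)
print("flexible:", flexible_genes)
```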

14.6.2 Iterative Assembly

The iterative assembly approach differs from de novo assembly in that it begins with a single reference genome, which is then used as a framework for the sequential alignment of reads from other samples. Any unmapped reads are subsequently assembled and incorporated into the reference genome to form a non-redundant pan-genome (Golicz et al. 2016). This technique is less expensive than de novo assembly since low sequencing depths can be used for each sample, allowing for the pooling of numerous samples. Nevertheless, the iterative assembly method may struggle to handle genomes that contain many repeat regions and cannot detect large structural variations that are not spanned by individual short reads (Jiao and Schneeberger 2017). Resequencing and iterative assembly methods have been applied to wheat (Montenegro et al. 2017; Watson-Haigh et al. 2018). However, evidence suggests that wheat has a very plastic genome due to its allopolyploidy and has abundant PAV, CNV, and SV that are important for trait variation (www.10wheatgenomes.com; Nilsen et al. 2020). Therefore, iterative assembly approaches, particularly low-coverage reference-based analyses, are highly limiting when exploring wheat pan-genomics.
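
The iterative procedure (map reads, collect unmapped reads, assemble them, append the new contigs) can be sketched as a toy Python example. This is a deliberately simplified illustration under strong assumptions: "mapping" is exact substring matching, "assembly" is a greedy suffix-prefix merge, and sequencing errors, repeats, and reverse complements are ignored. Real pipelines rely on dedicated read aligners and assemblers; all function names and sequences below are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Toy assembler: repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    # Drop reads fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


def iterative_pan_genome(reference: str, samples: List[List[str]], min_len: int = 3) -> List[str]:
    """Start from one reference; for each sample, add contigs built from unmapped reads."""
    pan = [reference]
    for reads in samples:
        # A read "maps" here if it occurs exactly in any sequence already in the pan-genome.
        unmapped = [r for r in reads if not any(r in seq for seq in pan)]
        pan.extend(greedy_assemble(unmapped, min_len))
    return pan


# Hypothetical data: sample 1 carries a novel segment, sample 2 adds nothing new.
reference = "ACGTACGGTCA"
sample_1 = ["ACGTAC", "TTGACCA", "ACCAGGT"]
sample_2 = ["GACCAG", "CGGTCA"]
pan_genome = iterative_pan_genome(reference, [sample_1, sample_2])
print(pan_genome)
```

Because each read must map or fail to map on its own, a rearrangement larger than a read leaves no unmapped reads in this scheme, which mirrors the limitation for large structural variants noted above.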

14.6.3 Graph-Based Assembly

Pan-genomes can also be constructed using graphs. The most commonly used graph for this purpose is the compacted de Bruijn graph, which integrates genetic information from different accessions of a species (Chikhi et al. 2016; Li et al. 2020a). In contrast, bidirected variation graphs capture genetic variation throughout a population and identify its potential positions on a reference genome. Compared to traditional linear genomes, graph-based pan-genomes have been shown to significantly mitigate reference bias (Garrison et al. 2018). However, graph-based pan-genomes are challenging to construct and apply due to several factors, including the intricate nature of plant genomes with their high repeat content and polyploidy. In addition, common downstream analysis tools and visualization techniques for graphs are still scarce, which further limits their use. Despite these challenges, graph-based genomes have strengths compared to other methods and may have more widespread applications for wheat research and breeding in the future, particularly as tools for graph-based assembly of more complex genomes improve.
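
To make the idea of a compacted de Bruijn graph concrete, the following toy Python sketch builds a de Bruijn graph from two short sequences that differ by a single base and then merges non-branching paths into unitigs. It is an illustration under simplifying assumptions (forward strand only, no reverse complements, no k-mer counts, isolated cycles ignored) and does not reflect the implementation of any specific tool; the function names and sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_de_bruijn(seqs: Iterable[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each observed k-mer adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
            _ = graph[kmer[1:]]  # make sure the target node exists, even without successors
    return dict(graph)


def compact_unitigs(graph: Dict[str, Set[str]]) -> List[str]:
    """Merge maximal non-branching paths into unitigs (the 'compaction' step)."""
    indeg: Dict[str, int] = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            indeg[v] += 1

    def is_internal(node: str) -> bool:
        # Exactly one way in and one way out: the node can be merged into a path.
        return indeg[node] == 1 and len(graph[node]) == 1

    unitigs = []
    for u in graph:
        if is_internal(u):
            continue  # unitigs start only at branching or terminal nodes
        for v in graph[u]:
            path = u + v[-1]
            while is_internal(v):
                (v,) = graph[v]  # follow the single successor
                path += v[-1]
            unitigs.append(path)
    return sorted(unitigs)


# Two hypothetical accessions that differ by one base (T vs A).
accessions = ["GACGTTGCA", "GACGATGCA"]
graph = build_de_bruijn(accessions, k=4)
unitigs = compact_unitigs(graph)
print(unitigs)
```

The shared flanks collapse into single unitigs while the variant site appears as two alternative branches (a "bubble"); adjacent unitigs overlap by k-1 bases at the branching nodes. Representing both alleles in this way, rather than privileging one as the reference, is what allows graph-based pan-genomes to reduce reference bias.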

14.6.4 Pan-Genome Annotation and Other Pan-Omics

Once the pan-genome has been assembled, several techniques can be used to annotate it. One technique is to use gene prediction software to identify genes in the pan-genome, using either homology-based or ab initio (de novo) gene prediction algorithms. There is a plethora of ab initio gene prediction software (Scalzitti et al. 2020), including Augustus (Stanke and Morgenstern 2005), Genscan (Burge and Karlin 1997), GeneID (Parra et al. 2000), GlimmerHMM (Majoros et al. 2004), and Snap (Korf 2004). Another technique is to use comparative genomics to identify conserved or novel gene families. This involves comparing the genomes of different species to identify shared and unique components. By comparing gene sequences between two species, it is possible to identify regions of similarity that may indicate similar functions. In wheat, comparative genomics has been used for identifying resistance genes (Marchal et al. 2020) and uncovering the molecular basis of nitrogen-use efficiency (Shi et al. 2022). In addition to annotating the gene space, there is increasing interest in expanding the annotation of the pan-genome to include the dynamics of gene expression (pan-transcriptomics), epigenomic modifications (epipan-genomics), and interaction networks between variants and genes, and in associating these directly with biological traits. Such a complete atlas of biological information will equip researchers and breeders with unprecedented tools for wheat research and improvement.

14.6.5 Applying the Pan-Genome to Breeding

After constructing and annotating the pan-genome, the subsequent step involves utilizing it for crop enhancement. The effectiveness of next-generation breeding technologies, such as transgenics and CRISPR-Cas9 gene editing, has been proven for wheat (Nilsen et al. 2020). However, regulatory challenges exist that may limit the widespread adoption of these methods for delivering new wheat cultivars. As a result, wheat breeding will likely involve generating biparental populations and screening progeny for some time to come. Gene discovery has certainly benefitted from the availability of pan-genomics resources for wheat, facilitating marker discovery that can be applied to MAS and making screening of parental lines and progeny more efficient (www.10wheatgenomes.com). With the availability of more genome assemblies that are representative of the genes and genomic variants that can be used in breeding, the need to generate additional high-quality genomes will likely lessen, as genomes can be imputed from lower coverage haplotype information, for example, from genotype-by-sequencing or high-throughput SNP arrays (Alipour et al. 2019). Having genomic information available for the parental materials used in crosses, even if imputed, will allow breeders to make stronger associations between traits of interest and variants within the genome, enabling more efficient and targeted genomic-based selections in the resulting progeny through GS.

14.7 Conclusion and Future Directions

Owing to its ability to identify novel genetic variations that can enhance crucial traits, the pan-genome serves as a valuable asset for crop breeding, specifically in wheat. Through consistent pan-genome research in crops, more robust and productive varieties are expected to be developed, resulting in benefits for farmers and consumers worldwide. While it is difficult to predict all possible future applications of pan-genomics to wheat breeding, the resources are now available to innovate. With recent advances in GS, artificial intelligence, and deep learning, one can only imagine the possibilities when applying these tools to pan-genomics, particularly if the pan-genomes are well annotated and have associated phenotypic data generated through applied breeding. This may not only be able to predict the performance of parents or offspring but could potentially help optimize designer genomes for specific purposes, environments, or stresses.