Genomic resources of Colletotrichum fungi: development and application

Anthracnose caused by Colletotrichum spp. is an economically important disease of many plants, including grain, vegetable, and fruit crops. Next-generation sequencing technologies have led to a dramatic growth in the size and availability of genomic data in public repositories. Beginning with genome sequencing projects of C. higginsianum and C. graminicola, many Colletotrichum spp. genomes have been sequenced due to their scientific and agricultural importance. Today, we can access more than a hundred genome assemblies of Colletotrichum spp. Utilizing those abundant genomic datasets would enable a better understanding of adaptation mechanisms of Colletotrichum spp. at the genomic level, which could help to control this important group of pathogens. In this review, we outline the development and application of genomic resources of Colletotrichum spp. with a focus on the benefits of genomic data-driven studies, including reverse-genetics, a range of comparative genomic analyses, species identification, taxonomy, and diagnosis, while describing the potential pitfalls of genome analysis. Further, we discuss future research directions that could allow a more comprehensive understanding of genomic diversity within the genus Colletotrichum.


Introduction
The genus Colletotrichum, in the phylum Ascomycota, comprises over 200 species that have been subclustered into fifteen species complexes (Talhinhas and Baroncelli 2021). Anthracnose caused by Colletotrichum spp. is an economically important disease on many ornamental plants as well as important grain, vegetable, and fruit crops (Cannon et al. 2012). In maize, anthracnose stalk rot caused by C. graminicola is responsible for 3.63% of estimated yield loss in the United States and Canada (Savary et al. 2019). With the total maize production in the United States and Ontario, Canada, al. 2004). In contrast, members of the C. gloeosporioides species complex tend to be post-harvest pathogens on a wide range of fruits, including avocado, banana, mango, coffee, and strawberry (Hyde et al. 2009). Whatever their host range, the majority of Colletotrichum spp. employ a hemibiotrophic lifestyle, characterized by the sequential development of a series of specialized cell types (Münch et al. 2008). The initial biotrophic phase is characterized by the development of bulbous primary hyphae in living host cells following penetration from melanized structures called appressoria. Host cell death is later induced in a subsequent necrotrophic phase, which is characterized by the production of filamentous secondary hyphae. Previous studies showed that hemibiotrophic infection in Colletotrichum spp. is orchestrated by various molecules: small secreted proteins called effectors manipulating the plant immune system (Irieda et al. 2019;Kleemann et al. 2012), carbohydrate-active enzymes (CAZymes) degrading host cell wall (Ben-Daniel et al. 2012;Yakoby et al. 2000), and secondary metabolites enhancing rigidity of appressoria or inhibiting plant hormone signaling (Dallery et al. 2020;Ludwig et al. 2014). Unlike obligate biotrophic pathogens (Spanu and Panstruga 2017), Colletotrichum spp. 
can be cultured axenically and are amenable to genetic manipulation, such as transformation and targeted knock-out mutagenesis (De Groot et al. 1998; Rikkerink et al. 1994; Rodriguez and Yoder 1987). Moreover, recently established CRISPR/Cas9 and marker recycling systems have made experimental designs more flexible and accessible (Nakamura et al. 2019; Yamada et al. 2021).
The steep drop in the cost of next-generation genome sequencing has led to a rapid expansion of publicly available genomic information, with the number of GenBank accessions increasing at an annual rate of about 40% (Sayers et al. 2019). Colletotrichum spp. are no exception to this trend. Due to agricultural and general scientific interest, more than a hundred Colletotrichum spp. genome assemblies have been generated over the last decade. In this review, we examine the literature that has contributed to the development and utilization of those genomic resources, and then propose future directions for the use of genomic data.

The dawn of Colletotrichum genomics
The genomics of Colletotrichum spp. began with the release of C. higginsianum IMI349063 and C. graminicola M1.001 genome assemblies generated by a whole-genome shotgun (WGS) approach (O'Connell et al. 2012). Subsequently, the genome assemblies of C. orbiculare 104-T, C. fructicola Nara-gc5 (considered as C. gloeosporioides when the genome was sequenced), and C. gloeosporioides Cg-14 were also published (Alkan et al. 2013; Gan et al. 2013). Comparative genomics studies using these genomic resources revealed pathogenicity-related gene repertoires, including effector candidate genes, CAZyme-encoding genes, and secondary metabolite gene clusters. Furthermore, comprehensive gene annotations made it possible to use RNA-seq analyses to understand transcriptome dynamics. Those studies demonstrated the sequential transcriptome changes that occur during the transition from one infection stage to another (Gan et al. 2013; O'Connell et al. 2012), as well as the regulation of gene expression by the pH-responsive transcription factor pacC (Alkan et al. 2013). The increasing number of publicly accessible host plant genome assemblies has also enabled dual RNA-seq, in which transcriptomic changes in both host and pathogen are analyzed simultaneously. Using this approach, Alkan et al. revealed concurrent changes in tomato and C. gloeosporioides gene expression during infection (Alkan et al. 2015).

Acceleration of gene characterization
The availability of genomic information facilitated forward genetic screening techniques using Colletotrichum spp. Rhizobium radiobacter (Agrobacterium tumefaciens)-mediated T-DNA insertional mutagenesis has been widely used to generate Colletotrichum spp. mutants to identify genes involved in pathogenicity (Tsuji et al. 2003;Takahara et al. 2004). Before the availability of genomic information, researchers employed a method called "genomic walking", where T-DNA flanking sequences are determined by thermal asymmetric interlaced polymerase chain reaction (PCR) or inverse PCR and then used to physically isolate a cosmid-type genomic clone that contains the sequences (Fujihara et al. 2010;Huser et al. 2009). In the post-genomic era, researchers can computationally search T-DNA flanking sequences against genome assemblies, which makes the identification of mutated genes relatively easy (Harata and Kubo 2014;Korn et al. 2015).
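As a minimal sketch of such a computational search (not the pipeline used in the cited studies), the Python snippet below locates a T-DNA flanking sequence in a genome assembly by exact matching on both strands and reports any annotated gene that overlaps the hit. In practice, a mismatch-tolerant alignment tool would normally be used instead of exact matching. The contig names, gene IDs, coordinates, and function names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


class Hit(NamedTuple):
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    strand: str


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_flank(genome: Dict[str, str], flank: str) -> List[Hit]:
    """Return every exact occurrence of `flank` on either strand."""
    flank = flank.upper()
    queries = [("+", flank)]
    rc = reverse_complement(flank)
    if rc != flank:  # avoid reporting palindromic flanks twice
        queries.append(("-", rc))
    hits = []
    for contig, seq in genome.items():
        seq = seq.upper()
        for strand, query in queries:
            pos = seq.find(query)
            while pos != -1:
                hits.append(Hit(contig, pos, pos + len(query), strand))
                pos = seq.find(query, pos + 1)
    return hits


def genes_at(hit: Hit, genes: List[Gene]) -> List[str]:
    """Return IDs of annotated genes whose interval overlaps the hit."""
    return [
        g.gene_id
        for g in genes
        if g.contig == hit.contig and g.start < hit.end and hit.start < g.end
    ]


# Hypothetical toy assembly and annotation (assumptions for illustration).
genome = {
    "contig_1": "TTTTACGGATCCAGTCAAAA",
    "contig_2": "GGGGGACTGGATCCGTCCCC",
}
genes = [Gene("geneA", "contig_1", 10, 20), Gene("geneB", "contig_2", 0, 4)]
flank = "ACGGATCCAGTC"

hits = find_flank(genome, flank)
for hit in hits:
    print(hit, genes_at(hit, genes))
```

In this toy case the flank is found on the forward strand of one contig, inside a gene, and on the reverse strand of the other, outside any gene. With real data, the gene coordinates would come from the annotation file that accompanies the assembly.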
The combination of effective genetic manipulation tools and genomic information made it feasible to conduct reverse genetics experiments with Colletotrichum spp., which yielded new insights into infection mechanisms. Reverse genetics studies using C. orbiculare have identified many of the signaling components required for infection, such as CoBFA1, CoTEM1, CoWHI2, and CoPSR1 (Fukada and Kubo 2015;Harata et al. 2016), homologs of the budding yeast Saccharomyces cerevisiae genes that are involved in cell cycle regulation and septum formation. Interestingly, the functions of those genes differ between C. orbiculare and other fungi, including S. cerevisiae, emphasizing the utility of the reverse genetics approach for illuminating the specific functional adaptations of genes in Colletotrichum spp. Genomic information from Colletotrichum spp. and other fungi was also used to explore the biological roles of proteins without known functional domains. Although the functions of hypothetical proteins including many putative effectors are hard to predict, comparative genomics analyses can provide valuable clues to narrow down the possible roles of genes of interest. Based on the hypothesis that Colletotrichum spp. use a set of common effectors during infection to support their hemibiotrophic lifestyle, Tsushima et al. explored conserved effector candidates in the genus Colletotrichum by comparing 24 ascomycete genomes. This study identified that a conserved effector candidate with no known functional domains, CEC3, induces nuclear expansion and host cell death (Tsushima et al. 2021). Many comparative genomics studies have provided lists of effector candidates categorized by their conservation patterns, for example, having no homolog in other Colletotrichum spp.
(species-specific) or other genera (genus-specific) (Baroncelli et al. 2016; Boufleur et al. 2021; Gan et al. 2016). Yet, the majority of them remain functionally uncharacterized. Experimental validation of these effector candidates is likely to be a focus of future studies.
Colletotrichum spp. genome assemblies also act as reference genomes for mapping next-generation sequencing (NGS) reads, which assists in the identification of single nucleotide polymorphisms (SNPs) that can be used as high-density genetic markers. Bhadauria et al. generated a genome assembly of C. lentis, performed quantitative trait locus (QTL) mapping, and successfully identified a virulence-governing minichromosome (Bhadauria et al. 2019). QTL mapping requires crossing parental lines with different phenotypes to generate progeny, which is only possible for the limited number of Colletotrichum spp. for which a crossing methodology has been established (Armstrong-Cho and Banniza 2006). Even if mating between different isolates is difficult under laboratory conditions, it is possible to conduct a genome-wide association study (GWAS) utilizing natural variation among unrelated individuals. A GWAS using 30 C. kahawae isolates identified four candidate genes that are potentially involved in signaling, detoxification, and gene expression, and that may contribute to virulence on coffee berries (Vieira et al. 2019). However, for supposedly asexual Colletotrichum spp. (Wilson et al. 2021), GWAS settings may need to be adjusted because near-clonal genetic backgrounds provide fewer markers that have been shuffled by recombination, which can limit the power of the study (Plissonneau et al. 2017).

Detection of adaptation signals on genomes
Colletotrichum spp. thrive in a variety of niches. Hence, they should have evolved distinct gene sets as a consequence of environmental adaptation. In this section, we discuss how genome data have been used to detect genes associated with survival in specific niches.
Although most copy number variations (CNVs) of genes are assumed to be detrimental, they can increase fitness by altering expression via dosage effects, or by compensating for deleterious effects of loss-of-function mutations, particularly under stressful conditions or in perturbed environments (Katju and Bergthorsson 2013). O'Connell et al. found that the dicot-infecting C. higginsianum encodes more CAZymes involved in pectin degradation than the monocot-adapted pathogen C. graminicola, suggesting adaptation to differences in host cell composition; eudicot cells accumulate more pectin than monocot cells (O'Connell et al. 2012). Members of the C. gloeosporioides and C. acutatum species complexes have more genes encoding pectin- and hemicellulose-degrading CAZymes than other members of the genus (Baroncelli et al. 2016; Gan et al. 2016). Despite belonging to phylogenetically separate branches within the genus, these two species complexes include important postharvest pathogens, suggesting a convergent gene expansion associated with their common infection strategy. Genes encoding secondary metabolite biosynthesis-related proteins are significantly more enriched in plant growth-promoting C. tofieldiae than in pathogenic Colletotrichum spp., suggesting that C. tofieldiae-specific secondary metabolites are responsible for the beneficial endophytic interactions with host plants.
An evolutionary arms race between hosts and pathogens generates strong positive selection on both parties, a process that leaves imprints on the genes involved (Möller and Stukenbrock 2017). A classical test to detect these signatures is based on the estimated ratio of non-synonymous to synonymous mutations (dN/dS also known as ω) (Goldman and Yang 1994; Nei and Gojobori 1986). Using this calculation, Rech et al. explored genomic variations among eight different C. graminicola isolates and found that CDSs encoding secondary metabolites and putative effectors tend to have higher dN/dS ratios, indicating positive selection (Rech et al. 2014). Similarly, comparative genomics analysis using six different Colletotrichum spp. showed that genes encoding predicted secreted proteins, which include effector candidates, are often enriched for positively selected genes compared to other genes (Gan et al. 2016). The genes under positive selection identified in these two studies include previously characterized effector genes, CgEP1 in C. graminicola and homologs of ChELP1 in C. higginsianum (Vargas genomics studies could only access a handful of fragmented genome assemblies (Alkan et al. 2013;Gan et al. 2013;O'Connell et al. 2012), it would be interesting to readdress the same questions with the abundant, high-quality genomic resources now available.
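As a sketch of the quantity behind the dN/dS test, the ratio can be written as below, in the spirit of the counting approach of Nei and Gojobori (1986). The symbols S_d and N_d (observed synonymous and non-synonymous differences between two aligned coding sequences) and S and N (the numbers of synonymous and non-synonymous sites) are labels introduced here for illustration. The Jukes-Cantor correction shown is one common choice and is not necessarily the exact procedure used in each cited study.

```latex
\begin{align}
p_S &= \frac{S_d}{S}, & p_N &= \frac{N_d}{N}, \\
d_S &= -\frac{3}{4}\ln\!\left(1 - \frac{4}{3}\,p_S\right), &
d_N &= -\frac{3}{4}\ln\!\left(1 - \frac{4}{3}\,p_N\right), \\
\omega &= \frac{d_N}{d_S}.
\end{align}
```

Values of ω above 1 are read as evidence of positive selection, values near 1 as neutral evolution, and values below 1 as purifying selection.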

Long-read sequencing and structural genomic variations
The advent of long-read sequencing technologies such as PacBio and Oxford Nanopore has revolutionized genomic studies. During de novo genome assembly using NGS reads of a few hundred bp, repetitive DNA sequences that are longer than the read length, for example transposable elements (TEs), can lead to gaps in the assembly or to misassembled rearrangements (Treangen and Salzberg 2011). As a result, repeat sequences were often underrepresented in genome assemblies (Alkan et al. 2011). Long-read sequencing technologies solved these problems by producing reads of more than several tens of kb that span repeat-rich regions and generate highly contiguous genome assemblies. Using a combination of PacBio long-read sequencing and optical mapping, a chromosome-level genome assembly of C. higginsianum (Zampounis et al. 2016) produced some remarkable findings, including the association between TEs and effector candidate genes or secondary metabolite gene clusters, a dispensable TE-rich small chromosome required for virulence, and large-scale genomic rearrangements mediated by TEs (Dallery et al. 2017; Plaumann et al. 2018; Tsushima et al. 2019). Recently, Gan et al. also reported that large-scale genomic rearrangements and multi-copy effector candidate gene clusters are frequently associated with repeat sequences such as telomeres and TEs within the C. gloeosporioides species complex. Historically, many studies have suggested the importance of repeat sequences in generating genomic variation in plant pathogenic fungi (Chuma et al. 2003; Crouch et al. 2008; Ikeda et al. 2002), yet these findings are often restricted to specific repeat sequence families due to the lack of comprehensive genomic information. A comparison of chromosome-level assemblies illustrates a generalized view of the role of repeat sequences in genomic evolution at single-nucleotide resolution.
Detection of structural genomic changes has raised the question of how repeat sequences contribute to pathogen fitness. When existing genomes are compared, identification of a responsible genomic variation(s) for a given phenotype is often difficult because of the high background noise within a natural population. Further analyses using chromosome modification techniques, such as deletion or transmission of chromosomes (He et al. 1998;Takahara et al. 2016), establishing the utility of dN/dS analysis for detecting effector genes that facilitate pathogenicity.
CNV and dN/dS ratio analyses usually target gene groups or homologous pairs. However, recently emerged genes may also play significant roles in the occupation of novel niches, as has been demonstrated for effector genes, which are often restricted to certain lineages (Fouché et al. 2018). Genus- or species-specific genes have been well documented by comparing multiple Colletotrichum and other fungal lineages (Buiate et al. 2017; Rao and Nandineni 2017). Moreover, recent studies identified strain-specific genes (Gan et al.
It is obvious that the quality and quantity of genomic information are crucial to conducting comparative genomics studies. Dallery et al. found that a previous C. higginsianum genome assembly had 2,699 split gene models and 2,289 missing gene models, which were recovered in the latest chromosome-level genome assembly (Dallery et al. 2017). An important caveat, then, is that incomplete or disrupted chromosomal regions can diminish the utility or reliability of comparative genomics analyses. In addition, successful genome analysis relies on the accuracy of gene annotations. Most fungal genome sequencing projects employ gene annotation pipelines to generate gene models using transcriptome data and/or known homologous protein sequences as guides for prediction. However, the identification of accurate gene models using automated methods is still challenging, especially for less conserved genes, such as orphan genes, which are found in only a single species (Li et al. 2022). Indeed, some studies reported transcript variants from Colletotrichum spp. that differ from their original models (Kumakura et al. 2021; Schliebner et al. 2014; Tsushima et al. 2021). We should note that predicted gene models for non-model fungi, including Colletotrichum spp., may lack experimental support, and that different annotation methods could produce variations among gene models. Since the first few Colletotrichum spp. 
agricultural industry has a great need for inspecting other aspects of pathogens, including lineage, aggressiveness, or fungicide sensitivity. Population genomics can aid in monitoring them all together by analyzing high-resolution genotypic data generated by mapping NGS reads to reference genomes. For example, the field pathogenomics approach, which utilizes RNA-seq data of infected host tissues from fields, provides a way to identify phylogenetic relationships between samples (Hubbard et al. 2015;Islam et al. 2016), to estimate a race based on genetic proximity to known isolates (Lewis et al. 2018;Tsushima et al. 2022), or to evaluate mutations in fungicide target genes (Cook et al. 2021). A wealth of sequence data should make the diagnosis of Colletotrichum spp. more accurate, flexible, and tailored in the future.

Toward Colletotrichum pangenomics
Genome assemblies were once generated only for a single representative isolate of each species. However, it is now easier to obtain and compare several genome assemblies from a single species due to the low cost of sequencing. This technological advancement, combined with increased computational power, has given rise to the concept of the pangenome, which characterizes the entire set of genomic sequences within a phylogenetic clade of interest (e.g. species) (Vernikos et al. 2015) (Fig. 1). Pangenomic studies determine the genomic diversity among all available datasets, spanning from highly conserved core sequences to sporadically arisen accessory sequences. Identification of such rapidly evolving accessory sequences is particularly important for the study of plant pathogens because they could contribute to selective advantages in the arms race with host plants (Badet and Croll 2020). Although reference-based genome alignments are practical, scalable, and demand less computational power, they can only detect genomic variation in sequences that are present in the chosen reference genome, which may be selected arbitrarily (Eizenga et al. 2020). This is an issue for analyzing fungal phytopathogen genomes because previous studies show that these genomes are highly plastic at the chromosomal level (Faino et al. 2016; Li et al. 2019; Tsushima et al. 2019). In the near future, it will be possible to compare many genome assemblies to investigate the specific selective pressures acting on structural variations.
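As a rough illustration of how core and accessory components can be separated, the Python sketch below classifies orthogroups from a presence/absence table. This is a gene-level simplification of the sequence-level pangenome concept. The isolate names, orthogroup IDs, and the three-way split into core, accessory, and isolate-specific groups are assumptions made for the example, not definitions taken from the cited studies.

```python
from typing import Dict, Set


def classify_pangenome(presence: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split orthogroups into core, accessory, and isolate-specific sets.

    `presence` maps each isolate name to the set of orthogroup IDs found in it.
    """
    if not presence:
        raise ValueError("at least one isolate is required")
    n_isolates = len(presence)

    # Count in how many isolates each orthogroup occurs.
    counts: Dict[str, int] = {}
    for groups in presence.values():
        for og in groups:
            counts[og] = counts.get(og, 0) + 1

    result: Dict[str, Set[str]] = {"core": set(), "accessory": set(), "specific": set()}
    for og, n in counts.items():
        if n == n_isolates:
            result["core"].add(og)       # present in every isolate
        elif n == 1:
            result["specific"].add(og)   # present in exactly one isolate
        else:
            result["accessory"].add(og)  # present in some but not all isolates
    return result


# Hypothetical presence/absence data (assumption for illustration).
presence = {
    "isolate_A": {"OG1", "OG2", "OG3"},
    "isolate_B": {"OG1", "OG2", "OG4"},
    "isolate_C": {"OG1", "OG5"},
}
result = classify_pangenome(presence)
pangenome_size = sum(len(groups) for groups in result.values())
print(result, pangenome_size)
```

The same counting logic extends to sequence-level units such as graph nodes or alignment blocks, although the input for those would come from dedicated pangenome tools rather than a simple table.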
Pangenome construction requires huge genomic datasets. How can these massive datasets be produced, archived, and examined? The most straightforward way is for individual research groups to generate genome assemblies. However, there is a limit to how much a single team can handle. A more realistic solution would be to harness open data resources with a common architecture and coordinating Plaumann et al. 2018) should help to artificially reproduce structural genomic changes and to examine their effects on Colletotrichum spp. pathogenicity.

Harnessing sequence data for species identification and diagnosis
Accumulation of genomic information has made phylogenetic analysis and species identification much more robust. A Colletotrichum isolate infecting shiso, or beefsteak plant (Perilla frutescens), was previously classified as C. destructivum based on morphology and the internal transcribed spacer (ITS) sequence (Kawaradani et al. 2008). However, phylogenetic analysis using concatenated multi-locus sequence data represented by ITS and four housekeeping genes from the newly-obtained genome assembly identified this isolate as a novel species, C. shisoi (Gan et al. 2019). Because ITS sequence frequently does not provide enough resolution for species identification, multi-locus phylogenetic analysis is commonly used for taxonomic placement of Colletotrichum spp. as a supplement to morphological examination (Talhinhas and Baroncelli 2021). However, in general, there are no standard protein-coding genes for fungal species identification (Houbraken et al. 2021). The sequence sets used for multi-locus phylogenetic analyses using Colletotrichum spp. differ depending on the efficacy of each locus to resolve species delimitation in individual species complexes (Jayawardena et al. 2016b). Phylogenetic analysis using universal single-copy orthologs across genomes (Shen et al. 2020), or average nucleotide identity (ANI) analysis, which has been used extensively in bacterial taxonomic assignments (Ciufo et al. 2018), may be used with great effect to assist with Colletotrichum species characterization and identification.
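Multi-locus analysis of the kind described above typically starts by concatenating per-locus alignments into one supermatrix before tree inference. The Python sketch below shows that step only, assuming each locus has already been aligned. The locus names, isolate names, and the choice to pad missing loci with gap characters are illustrative assumptions.

```python
from typing import Dict, List


def concatenate_loci(
    alignments: Dict[str, Dict[str, str]], locus_order: List[str]
) -> Dict[str, str]:
    """Concatenate per-locus alignments into one sequence per isolate.

    `alignments` maps locus name -> {isolate name -> aligned sequence}.
    Isolates missing from a locus are padded with gaps of that locus' width.
    """
    taxa = sorted({t for locus in locus_order for t in alignments[locus]})
    parts: Dict[str, List[str]] = {t: [] for t in taxa}
    for locus in locus_order:
        aln = alignments[locus]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in locus {locus!r} are not aligned to equal length")
        width = lengths.pop()
        for t in taxa:
            parts[t].append(aln.get(t, "-" * width))
    return {t: "".join(chunks) for t, chunks in parts.items()}


# Hypothetical aligned loci (assumption for illustration).
alignments = {
    "ITS": {"iso1": "ACGT", "iso2": "ACGA"},
    "TUB2": {"iso1": "GG-C", "iso2": "GGTC", "iso3": "GGTT"},
}
supermatrix = concatenate_loci(alignments, ["ITS", "TUB2"])
for name, seq in supermatrix.items():
    print(name, seq)
```

The resulting supermatrix would then be passed to a phylogenetic inference program. Which loci to include remains a per-species-complex decision, as noted above.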
Publicly available genomic resources of Colletotrichum spp. can also be used to generate diagnostic markers. Diagnosis of a causal pathogen is important to understand the economic impact caused by that pathogen and to take appropriate disease control measures. Yet, classification based on morphology and host species is less definitive, especially for pathogens that belong to the same species complex or show a similar host range. To solve this problem, Gan et al. developed diagnostic PCR markers to distinguish the four different members of the C. gloeosporioides species complex by comparing their genomes (Gan et al. 2017). Analysis using these PCR markers showed that C. fructicola has been the predominant species causing strawberry anthracnose in Chiba, Japan (Gan et al. 2017) and that C. fructicola was detected in 5.7% of tested weed leaves around strawberry nurseries in Nara, Japan, which could be an inoculum source of the disease (Hirayama et al. 2018). Apart from species diagnosis, the of genomic diversity of Colletotrichum spp., we advocate for collecting and analyzing a multitude of genomes, even within a single species.
Acknowledgements This work was supported in part by KAKENHI (22H00364 to KS).

Conflict of interest The authors have no conflicts of interest to declare.
Human and animal rights This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
authority. Sequence data deposited in public databases provide a great opportunity to investigate the genomic atlas of fungi, including Colletotrichum spp. Despite the benefits of analyzing public data, as users, we should be aware that those data were obtained in different ways, and that filtering inappropriate datasets is often required to reduce project-dependent biases (Sielemann et al. 2020). As sequence depositors, providing detailed, accurate, and timeless metadata along with the sequence is essential to assist integrative analyses by other researchers, and to eventually maximize the value of the data (Wilkinson et al. 2016). A possible bottleneck for expanding the available Colletotrichum spp. genomic data would be to find Colletotrichum strains that asymptomatically infect plants, or that grow in organic matter without hosts (Silva et al. 2017). We can easily overlook non-pathogenic Colletotrichum strains, although they may harbor intriguing traits, like promotion of plant growth (Ye et al. 2020). The whole-metagenome shotgun sequencing approach holds promise for obtaining novel genomic information of Colletotrichum spp. in nature. Assembling whole-metagenome sequence reads is still challenging due to low sequencing depth, especially for eukaryotes (Bandla et al. 2020; Regalado et al. 2020). However, some studies successfully recovered fungal metagenome-assembled genomes (MAGs) (West et al. 2018; Peng et al. 2021). In the future, application of long-read sequencing technologies could improve the quality and quantity of eukaryotic MAGs. To capture the global image