Harnessing Whole Genome Sequencing in Medical Mycology
Purpose of Review
Comparative genome sequencing studies of human fungal pathogens enable identification of genes and variants associated with virulence and drug resistance. This review describes current approaches, resources, and advances in applying whole genome sequencing to study clinically important fungal pathogens.
Genomes for some important fungal pathogens were only recently assembled, revealing gene family expansions in many species and extreme gene loss in one obligate species. The scale and scope of species sequenced is rapidly expanding, leveraging technological advances to assemble and annotate genomes with higher precision. By using iteratively improved reference assemblies or those generated de novo for new species, recent studies have compared the sequence of isolates representing populations or clinical cohorts. Whole genome approaches provide the resolution necessary for comparison of closely related isolates, for example, in the analysis of outbreaks or sampled across time within a single host.
Genomic analysis of fungal pathogens has enabled both basic research and diagnostic studies. The increased scale of sequencing can be applied across populations, and new metagenomic methods allow direct analysis of complex samples.
Keywords: Whole genome sequencing; Human fungal pathogens; Genome sequencing; Medical mycology; Fungal infections; Review
Whole genome sequencing provides a powerful lens for the investigation of fungal pathogens. In providing a comprehensive snapshot of the gene content and thereby the functional potential, the genomic studies of human pathogenic fungal species have revealed the repertoire of proteins that contribute to host interactions [1•, 2, 3•, 4], predicted metabolic capabilities and requirements [1•, 5•], uncovered the potential for sexual reproduction in some species [6, 7], and have been used as the platform to study specific genes as well as for systematic functional genomic approaches [8, 9, 10]. For diagnostic purposes, the complete sequence of clinical isolates can type the infecting species and the subgroup or lineage with high accuracy. In addition, the identification of specific mutations that are clinically actionable, including those that confer or promote resistance to antifungal drugs, has the potential to guide treatment decisions for individual patients.
The ease and falling cost of generating whole genome sequence has dramatically expanded the use of this data. The ability to sequence genomes on demand has broadened the scope of sequenced species and, importantly, allows rapid response to new or emerging pathogens. Further, comparing the genome-wide variation of clinical isolates from an outbreak with isolates from environmental or other associated samples can precisely determine how clonal an outbreak is across patients, measure how closely outbreak isolates match isolates from other sources, and establish transmission chains. For recurrent infections or in the context of prolonged outbreaks, genome sequence can trace how pathogens evolve over time, tracking the emergence of new genotypes and the spread of particularly virulent or drug-resistant groups.
Whole Genome Sequencing Approaches
Generation of a new genome assembly for a species, also known as de novo sequencing and assembly (Fig. 1), has been applied to generate reference genome assemblies for all of the major human fungal pathogens (Table 1). Many of these species have now achieved chromosome-scale assemblies, although gaps may remain at subtelomeres or within chromosomes, often in repetitive or low complexity regions. The choice of a genome sequencing strategy depends both on the properties of the genome and on the goal of the analysis. While the assembly of short reads from a single library can provide a good overview of gene content, repetitive sequences sharing high identity may not be resolved, resulting in gaps in the genome assembly. This issue can be overcome either by the generation of larger insert mate pair libraries for short read sequencing or by the incorporation of longer reads to provide linkage information; highly repetitive genomes may benefit from a strategy of exclusively long read sequencing and assembly. Similar to the prior use of physical or genetic maps in validating and anchoring draft assemblies to chromosomes, technologies such as Hi-C that map the three-dimensional space of a genome can also be used for higher order scaffolding of assemblies [13, 14]. In diploid genomes, heterozygosity may also impact genome assembly; most methods seek to generate a haploid version of the reference, and the generation of a consensus sequence from heterozygous regions can incorrectly merge haplotypes. A phased diploid assembly for Candida albicans was constructed using a panel of strains homozygous for specific chromosomes [15•]; a more general approach using long reads was recently reported. De novo assemblies of either haploid or diploid genomes can then be annotated by predicting the structure and function of protein coding genes, using de novo, homology-based, and evidence-based prediction algorithms.
The ability to generate deep coverage of RNA-Seq from multiple conditions enables a higher level of accuracy and validation of gene structures, and has been applied to systematically improve gene structures and predict alternatively spliced transcripts.
With the decreased cost of whole genome sequencing, de novo assemblies have been generated on demand to examine new species. This includes representing both rarely observed pathogens and many nonpathogenic species related to common pathogens, with the goal of using comparative genomic approaches to identify differences that could contribute to pathogenesis. The ability to rapidly generate genomes for new species is also important for the response to emerging pathogens and in the context of recent fungal outbreaks. Recent studies have also expanded our view within a single species by examining the genome of more than one “reference” isolate for some species, which can characterize differences in gene content between isolates of the same species. From a larger perspective, the increasing number and diversity of sequenced genomes enable a wide range of studies focused on comparisons of specific genes, as well as a set of references for alignment-based approaches including both metagenomics and sequencing of single isolates.
For species for which a high-quality reference assembly is available, re-sequencing is an alternative approach to identify genome-wide variants. Typically, short read sequence is generated from one or more isolates of the same species, reads are aligned to a reference assembly, and high-quality variants are identified from the alignments (Fig. 1). These methods have been applied to both haploid and diploid fungal genomes. The full-genome resolution and scalability of this approach make it ideal for examining transmission links in the context of an outbreak and pathogen evolution during the course of an infection, settings in which few variants may be expected. Variants can also be mapped to genes on the reference genome to infer changes in important genes involved in drug resistance (see below).
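The last step described above, mapping variant positions to annotated genes on the reference, can be sketched as a simple interval lookup. This is a minimal illustration rather than the method of any study cited here: the gene names, chromosome label, and coordinates are hypothetical placeholders, and the lookup assumes genes on a chromosome do not overlap.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def build_index(genes):
    """Group genes by chromosome, ordered by start coordinate."""
    index = {}
    for gene in sorted(genes, key=lambda g: (g.chrom, g.start)):
        starts, ordered = index.setdefault(gene.chrom, ([], []))
        starts.append(gene.start)
        ordered.append(gene)
    return index


def gene_at(index, chrom, pos) -> Optional[str]:
    """Return the name of the gene containing pos, or None if intergenic."""
    if chrom not in index:
        return None
    starts, ordered = index[chrom]
    # Rightmost gene whose start is <= pos (assumes non-overlapping genes).
    i = bisect_right(starts, pos) - 1
    if i >= 0 and ordered[i].start <= pos <= ordered[i].end:
        return ordered[i].name
    return None


# Hypothetical annotation and variant positions, for illustration only.
genes = [Gene("gene_A", "chr1", 100, 500), Gene("gene_B", "chr1", 800, 1200)]
index = build_index(genes)
variants = [("chr1", 250), ("chr1", 650), ("chr1", 800)]
hits = {v: gene_at(index, *v) for v in variants}
```

In practice, variant annotation tools also report the predicted effect on the coding sequence; this sketch only answers which gene, if any, a position falls in.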
In addition to these approaches that rely on sequencing of individual isolates, the increase of metagenomic sequencing has driven the development of methods to look directly at populations within a single sample. Also using a shotgun sequencing approach, the sequence of a pool of samples can be used to determine the species within a single sample, to categorize the gene content to suggest functional capacity, and recently to examine species level variation.
Recent Genome Sequencing Findings
With the advent of highly multi-parallel sequencing, the increased ease and low cost of generating whole genome sequence led to a dramatic expansion of the number of fungal genomes available. Notably, the 1000 fungal genomes project at the US Department of Energy Joint Genome Institute (http://1000.fungalgenomes.org) aims to provide a comprehensive representation of the fungal kingdom, where each family level division would be represented by at least two genomes. The pace of sequencing is already eclipsing the scale of this project, with over 2100 fungal genome assemblies available in NCBI (https://www.ncbi.nlm.nih.gov/genome/browse/#). Of these, only 812 have gene annotations deposited in NCBI, highlighting the more limited scope of easily available gene sets for comparative analysis.
Table 1. Genome assemblies for human fungal pathogens. Columns include genome size (Mb); species listed include Candida albicans, Candida glabrata, Cryptococcus neoformans, Cryptococcus gattii, Rhizopus oryzae, and Rhizopus delemar.
In addition to studies focusing on a single genome as representative of a species, multiple studies have used re-sequencing to examine variation across multiple isolates of a single species. Large studies of C. gattii have characterized the relationship of global isolates [34, 35] and identified a loss of function mutation in the mismatch repair gene MSH2 in one sublineage, VGIIa. One of the largest studies to date in C. neoformans var. grubii compared the sequence of 387 isolates from clinical and environmental origin; a genome-wide association study (GWAS) of variants associated with the isolation source identified virulence factors and stress response genes [36•]. A parallel GWAS of melanization in these isolates identified loss of function mutations in clinical isolates in the BZP4 transcription factor, which is required for melanin production [36•]; while melanin is a virulence factor in Cryptococcus, the presence of multiple loss of function mutations in clinical isolates suggests that loss of melanin production is observed clinically. A study of a panel of 20 clinical isolates of C. albicans characterized frequent loss of heterozygosity and pinpointed a loss of function of EFG1, a gene required for filamentous growth, in one isolate; this isolate further showed a competitive advantage during gastrointestinal growth over isolates that were isogenic except for the addition of a wild type copy of this gene, suggesting that this change could have provided an advantage during commensal growth. Additional studies in dimorphic fungi have defined new population subdivisions and the level of genetic exchange between these groups [2, 38].
Refining Phylogenetic Relationships
Genome sequence is also used to assess phylogenetic relationships between species and has helped resolve inconsistencies in species naming. The Phylogenetic Species Concept requires consistency across multiple gene trees, as single genes could be subject to recombination or introgression and not reflect the true species relationships. This can highlight conflicts in the naming of genera or species grouped by morphological and phenotypic information and suggest how to refine species boundaries. However, where to set species boundaries and what evidence justifies changes in species naming are debated [40, 41]. More straightforward cases are those where phylogenetic analysis highlighted inconsistencies in the current genus naming; a re-assessment of the Emmonsia genus including many newly reported clinical cases led to a proposed re-organization of the taxonomy of this group including a new genus name. Such assessments incorporate analysis of the support for phylogenetic subdivisions and the genetic distance between groups.
While phylogenies based on whole genome data may capture the same major phylogenetic relationships and subdivisions as those observed in phylogenies based on small numbers of loci, analysis of whole genome data provides a more comprehensive view of genetic exchange between subdivisions. For example, in C. neoformans, four well-separated lineages (VNI, VNII, VNB-I, and VNB-II) in whole genome phylogenies appear similar to those identified in multi-locus phylogenies, and at a finer scale, subdivision of VNI into three subgroups is also strongly supported from phylogenetic analysis of whole genome data. However, while recombination is limited between the four lineages, the level of recombination appears similar across VNI as within each subgroup, suggesting that the phylogenetic subdivisions within VNI do not reflect genetic isolation [36•]. Such analyses of level of genetic exchange and separation can help support or question subdivisions made based only on multi-gene phylogenies. In addition, these studies can highlight unusual phylogeographic patterns, which can motivate further population sampling to evaluate rare or unexpected subgroups.
Outbreaks and Emerging Species
Multiple species of fungi have been responsible for major outbreaks of infections in the USA within the last 10 years. In contrast to the predominant species that cause human fungal infections, many outbreaks have resulted from organisms that are not a common cause of infection, and consequently some of these species are not well studied or previously sequenced. For such cases, a primary goal for whole genome sequencing has been to generate a reference genome that could be used for identification of genome-wide variants across outbreak samples as well as for further genomic and transcriptomic studies of pathogenesis. Comparing the genomes of patient and environmental isolates from populations of these pathogens can help trace the origin and transmission patterns in an outbreak; if isolates from an outbreak and potential source show very few genome-wide differences, this supports a clonal outbreak mechanism with a strong link to the potential source. In addition, the gene set predicted from the genome of outbreak isolates can help develop biomarkers and new diagnostics, and can potentially guide our understanding of what enabled a strain to cause a suddenly high rate of severe infections.
One major treatment-acquired fungal outbreak in the USA resulted from injection of methylprednisolone, as a treatment for pain management, contaminated with the phaeoid fungus Exserohilum rostratum. As of January 2013, E. rostratum had caused more than 750 cases of phaeohyphomycotic meningitis and at least 61 deaths in 19 US states. A very similar but smaller fungal outbreak occurred 10 years previously, caused by a steroid contamination with Wangiella (Exophiala) dermatitidis. Both species of phaeoid fungi (black or dark brown pigmented) are infrequently the cause of superficial infections; however, in rare cases they result in systemic neurotropic infections. Whole genome sequencing of E. rostratum purified from clinical samples from patients injected with contaminated steroids and from steroid lots established that a clonal fungus was present in both patients and steroid lots. This analysis incorporated both de novo and re-sequencing approaches (Fig. 1): a reference assembly was generated for one of the outbreak strains, and SNPs were identified by aligning reads from all other samples to this assembly. Analysis of variants revealed that genomes from outbreak isolates were nearly identical, both those from 19 patients and the 6 from steroid lots; only two SNPs were found between any isolate from a patient and a compounding vial. By contrast, over 136,000 SNPs differentiated the outbreak isolates from other environmental isolates, though these were collected in years prior to the outbreak in different geographic regions. This genomic analysis provided strong evidence that the fungal strains found in all patients and in the suspected steroid vials were identical.
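The comparison underlying this kind of outbreak analysis, counting SNP differences between every pair of isolates, can be sketched as follows. This is an illustrative toy under stated assumptions, not the pipeline used in the cited study: the isolate names, sites, and alleles are invented, each isolate is represented as a mapping from variant site to called allele, and sites lacking a call in either isolate are skipped rather than counted as differences.

```python
from itertools import combinations


def snp_distance(calls_a, calls_b):
    """Count sites called in both isolates where the alleles differ."""
    shared_sites = calls_a.keys() & calls_b.keys()
    return sum(1 for site in shared_sites if calls_a[site] != calls_b[site])


def pairwise_distances(isolates):
    """Return {(name_a, name_b): SNP count} for every pair of isolates."""
    return {
        (a, b): snp_distance(isolates[a], isolates[b])
        for a, b in combinations(sorted(isolates), 2)
    }


# Invented genotype calls keyed by (chromosome, position), for illustration.
isolates = {
    "patient_1": {("chr1", 100): "A", ("chr1", 250): "G", ("chr2", 40): "T"},
    "vial_1": {("chr1", 100): "A", ("chr1", 250): "G", ("chr2", 40): "C"},
    "environmental_1": {("chr1", 100): "C", ("chr1", 250): "T", ("chr2", 40): "C"},
}
distances = pairwise_distances(isolates)
```

In a clonal outbreak, the patient-to-source distances are expected to be very small relative to distances to unrelated isolates, which is the pattern reported above for E. rostratum.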
One recent report demonstrated how whole genome sequencing can pinpoint isolate relationships and suggest new transmission patterns. To determine whether clinical cases of coccidioidomycosis in Washington State were the first reports of local exposure or resulted from transmission during travel to the southwestern USA, patient isolates were compared with environmental isolates from the local area in Washington and from the southwestern USA. Remarkably, whole genome sequencing revealed that Coccidioides immitis isolates from these patient cases were nearly identical to local soil isolates, differing by only three SNPs across the entire 28 Mb genome, suggestive of local transmission and potentially an expanded endemic area for this pathogen.
More recently, genomic analysis of drug-resistant Candida auris established that patient isolates from specific geographic regions are highly similar [48•, 49]. In one study, a de novo genome assembly was generated for one isolate of C. auris, and SNPs were identified in other isolates using the re-sequencing approach [48•]. While isolates from a given geographic area appear closely related, there is more variation between regions; in addition, drug-resistant isolates show candidate resistance mutations in ERG11, based on mapping such sites from C. albicans.
Outbreaks may also occur when there is a change in a pathogen that increases resistance to stress conditions or enables survival in a new environment. Mutations in the mismatch repair gene MSH2 initially identified in the outbreak lineage of C. gattii may enable more rapid adaptation to stressful conditions or new environments by allowing a higher rate of mutation. Loss of MSH2 has also been detected in Candida glabrata, where this appears to accelerate the acquisition of drug resistance; however, another recent study failed to find a correlation between MSH2 mutation and azole resistance [Dellière S et al. Fluconazole and Echinocandin Resistance of Candida glabrata Correlates Better with Antifungal Drug Exposure Rather than with MSH2 Mutator Genotype in a French Cohort of Patients Harboring Low Rates of Resistance. Front Microbiol. 2016 Dec 23;7:2038.], suggesting that other factors in the genetic background could play a role in drug resistance. These studies suggest that loss of DNA repair genes in fungi could result in a higher mutation rate that may provide an advantage under adverse conditions.
Evolution Within Patients
Where fungal infections persist in patients, studies of how the genome changes during chronic infection can highlight mechanisms of adaptation. Recent studies of C. albicans and Cryptococcus have used whole genome re-sequencing to identify how isolates of these species change during infection. One recent study examined serial isolates of C. albicans from 11 patients with oral candidiasis; this revealed that during clinical passage, isolates acquired new mutations, including some linked to host adaptation [51•]. In addition, genome regions showing loss of heterozygosity during passage include genes implicated in drug resistance. Another study of serial isolates of C. neoformans and C. gattii compared isolates during initial presentation of disease and after 120 days or more during a relapse. These cases were also highly clonal, demonstrating that a second independently infecting strain was not the origin of the relapse. The lower rate is consistent with a prior report; however, higher rates of change in some isolates in a separate study were suggested to result from changes in mismatch repair proteins. Analysis of wider sets of isolates is needed to validate whether such mutations are common in Cryptococcus.
Genome sequencing can type known drug resistance mutations, in some cases suggesting whether particular drugs will fail to control an infection. Whole genome variants could be screened for point mutations in specific drug targets that are highly correlated with resistance. For example, specific mutations in the target of azole drugs or in the transcription factors that control the expression of drug efflux transporters [55, 56] can be identified from whole genome sequence data only in isolates that display drug resistance. In addition, copy number variation of both drug targets and transporters can also lead to drug resistance; genomic regions showing higher sequencing read depth for such genes are also associated with drug resistance. Where this method could be applied to metagenomic population sequencing, it may be possible to detect early arising drug-resistant mutants that have not yet swept through the population and that precede treatment failure.
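The read-depth signal mentioned above can be illustrated with a simple normalization: each gene's mean depth is divided by the genome-wide median and scaled by the baseline ploidy. This is a rough sketch under stated assumptions, not a method from the studies cited here. The gene names and depth values are invented, the default ploidy of 2 and the threshold of one extra copy are arbitrary choices for illustration, and real analyses typically also correct for factors such as GC content and mappability.

```python
from statistics import median


def copy_number_estimates(gene_depth, baseline_ploidy=2):
    """Estimate copy number per gene from mean read depth.

    Depth is normalized to the median across all genes, which is assumed
    to represent the baseline copy number (baseline_ploidy).
    """
    baseline = median(gene_depth.values())
    if baseline <= 0:
        raise ValueError("median depth must be positive")
    return {
        gene: baseline_ploidy * depth / baseline
        for gene, depth in gene_depth.items()
    }


def flag_amplified(estimates, baseline_ploidy=2, min_extra_copies=1.0):
    """List genes whose estimate exceeds the baseline by min_extra_copies."""
    cutoff = baseline_ploidy + min_extra_copies
    return sorted(gene for gene, cn in estimates.items() if cn >= cutoff)


# Invented mean read depths per gene, for illustration only.
depths = {
    "gene_1": 50.0,
    "gene_2": 52.0,
    "gene_3": 48.0,
    "gene_4": 101.0,
    "gene_5": 50.0,
}
estimates = copy_number_estimates(depths)
amplified = flag_amplified(estimates)
```

Here the hypothetical gene_4 has roughly twice the median depth, so it is estimated at about four copies in a diploid background and is flagged as a candidate amplification.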
Direct genome sequencing of microbial samples can provide precise diagnostic information. With approaches that meet the need of rapid turnaround for clinical samples, similar methods can be applied to complex samples that contain multiple microbes in addition to human DNA. While analyzing the sequence of a set of organisms requires different approaches than single isolate sequencing, specialized methods can identify the potential pathogen sequence in metagenomic data from a mixed population [58, 59]. Metagenomic data can also be used to examine how much the sequence of a specific species varies within a single sample, separating out the contribution of different strains involved in mixed infections [60, 61]. With sufficient coverage or target enrichment, metagenomic sequence can also provide precise typing of specific mutations of clinical impact, such as those that promote or provide drug resistance. In this way, such approaches can tailor treatments to patients, particularly those at high risk of invasive infections, by using genomic data to predict how a patient will respond to a specific treatment.
Genome sequencing is becoming an increasingly common approach to study human fungal pathogens. Current studies seek to compare the genomes of hundreds of isolates of a given species and to examine pathogen populations at unprecedented scale. Sequencing is no longer a bottleneck; however, analysis approaches also need to continue to scale. Genome data offers the resolution needed to examine microevolution of isolates during the course of infection and to pinpoint the source and transmission networks involved in outbreaks. The use of genomic approaches in diagnosis may become more routine and offers the potential to provide additional clinically actionable information, such as predictions of the level of or potential for drug resistance. Even prior to some treatments, metagenomic information may identify fungi and other microbes that are a cause for concern. While real challenges exist to the wide implementation of microbial sequencing methods in the clinic, including turnaround time, ease of use and interpretation, and cost, these approaches enable a high-resolution view of the specific fungal isolates causing disease.
Compliance with Ethical Standards
Conflict of Interest
Christina A. Cuomo was supported by Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Grant Number U19AI110818 to the Broad Institute.
Human and Animal Rights and Informed Consent
This article does not contain any studies with human or animal subjects performed by the author.
Papers of particular interest, published recently, have been highlighted as: • Of importance
- 1.• Ma L, Chen Z, Huang DW, Kutty G, Ishihara M, Wang H, et al. Genome analysis of three Pneumocystis species reveals adaptation mechanisms to life exclusively in mammalian hosts. Nat Commun. 2016;7:10740. Chromosomal assemblies of Pneumocystis jirovecii, P. murina, and P. carinii. Comparative analysis revealed loss of enzymes responsible for chitin and higher order mannose synthesis, suggesting a more flexible cell wall structure that may also impact detection by the host immune system.
- 2. Muñoz JF, Farrer RA, Desjardins CA, Gallo JE, Sykes S, Sakthikumar S, et al. Genome diversity, recombination, and virulence across the major lineages of Paracoccidioides. mSphere. 2016;1.
- 5.• Cissé OH, Pagni M, Hauser PM. De novo assembly of the Pneumocystis jirovecii genome from a single bronchoalveolar lavage fluid specimen from a patient. MBio. 2012;4:e00428–12. First genome assembly for Pneumocystis jirovecii, utilizing purification and host sequence filtering to assemble this obligate pathogen.
- 11. Cuomo CA, Birren BW. The fungal genome initiative and lessons learned from genome sequencing. Methods Enzymol. 2010;470:Chapter 34.
- 14. Lieberman-Aiden, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293.
- 15.• Muzzey D, Schwartz K, Weissman JS, Sherlock G. Assembly of a phased diploid Candida albicans genome facilitates allele-specific measurements and provides a simple model for repeat and indel structure. Genome Biol. 2013;14:R97. Production of a phased diploid assembly for Candida albicans, enabling allele-specific analysis.
- 24. Janbon G, Ormerod KL, Paulet D, Byrnes EJ, Yadav V, Chatterjee G, et al. Analysis of the genome and transcriptome of Cryptococcus neoformans var. grubii reveals complex RNA expression and microevolution leading to virulence attenuation. PLoS Genet. 2014;10:e1004261.
- 25. D’Souza CA, Kronstad JW, Taylor G, Warren R, Yuen M, Hu G, et al. Genome variation in Cryptococcus gattii, an emerging pathogen of immunocompetent hosts. mBio. 2011;2.
- 28. Histoplasma Genome Project | Broad Institute [Internet]. [cited 2017 Mar 21]. Available from: https://www.broadinstitute.org/fungal-genome-initiative/histoplasma-genome-project
- 30. Nierman WC, Fedorova-Abrams ND, Andrianopoulos A. Genome sequence of the AIDS-associated pathogen Penicillium marneffei (ATCC18224) and its near taxonomic relative Talaromyces stipitatus (ATCC10500). Genome Announc. 2015;3.
- 36.• Desjardins C, Giamberardino C, Sykes S, Yu C-H, Tenor J, Chen Y, et al. Population genomics and the evolution of virulence in the fungal pathogen Cryptococcus neoformans. bioRxiv. 2017;118323. Large population genomic study in Cryptococcus establishing the major subdivisions in the populations and associated phenotypes and application of GWAS to identify variants associated with clinical origin and phenotypes.
- 40. Hagen F, Khayhan K, Theelen B, Kolecka A, Polacheck I, Sionov E, et al. Recognition of seven species in the Cryptococcus gattii/Cryptococcus neoformans species complex. Fungal Genet Biol. 2015.
- 41. Kwon-Chung KJ, Bennett JE, Wickes BL, Meyer W, Cuomo CA, Wollenburg KR, et al. The case for adopting the “species complex” nomenclature for the etiologic agents of Cryptococcosis. mSphere. 2017;2.
- 43. Dukik K, Muñoz JF, Jiang Y, Feng P, Sigler L, Stielow JB, et al. Novel taxa of thermally dimorphic systemic pathogens in the Ajellomycetaceae (Onygenales). Mycoses. 2017.
- 45. Centers for Disease Control and Prevention. Exophiala infection from contaminated injectable steroids prepared by a compounding pharmacy - United States, July-November 2002. Morb Mortal Week Rep. 2002;51:1109–12.
- 48.• Lockhart SR, Etienne KA, Vallabhaneni S, Farooqi J, Chowdhary A, Govender NP, et al. Simultaneous emergence of multidrug-resistant Candida auris on 3 continents confirmed by whole-genome sequencing and epidemiological analyses. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 2017;64:134–40. Population genomic study of Candida auris outbreak isolates and characterization of geographic substructure and the basis of drug resistance; example of application of a de novo assembly and re-sequencing approach.
- 51.• Ford CB, Funt JM, Abbey D, Issi L, Guiducci C, Martinez DA, et al. The evolution of drug resistance in clinical isolates of Candida albicans. eLife. 2015;4:e00662. Initial study of microevolution and the acquisition of new phenotypes in serial isolates of Candida albicans.
- 53. Rhodes J, Beale MA, Vanhove M, Jarvis JN, Kannambath S, Simpson JA, et al. A population genomics approach to assessing the genetic basis of within-host microevolution underlying recurrent cryptococcal meningitis infection. G3 Bethesda Md. 2017.
- 55. Dunkel N, Blass J, Rogers PD, Morschhäuser J. Mutations in the multi-drug resistance regulator MRR1, followed by loss of heterozygosity, are the main cause of MDR1 overexpression in fluconazole-resistant Candida albicans strains. Mol Microbiol. 2008;69:827–40.
- 56. Coste A, Turner V, Ischer F, Morschhäuser J, Forche A, Selmecki A, et al. A mutation in Tac1p, a transcription factor regulating CDR1 and CDR2, is coupled with loss of heterozygosity at chromosome 5 to mediate antifungal resistance in Candida albicans. Genetics. 2006;172:2139–56.
- 61. Truong DT, Tett A, Pasolli E, Huttenhower C, Segata N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. [Internet]. 2017 [cited 2017 Mar 14]; Available from: http://genome.cshlp.org/content/early/2017/03/10/gr.216242.116