Functional & Integrative Genomics, Volume 4, Issue 4, pp 207–218

A comparative genomic analysis of ESTs from Ustilago maydis

  • Ryan Austin
  • Nicholas J. Provart
  • Nuno T. Sacadura
  • Kimberly G. Nugent
  • Mohan Babu
  • Barry J. Saville
Original Paper

DOI: 10.1007/s10142-004-0118-x

Cite this article as:
Austin, R., Provart, N.J., Sacadura, N.T. et al. Funct Integr Genomics (2004) 4: 207. doi:10.1007/s10142-004-0118-x


A large-scale comparative genomic analysis of unisequence sets obtained from an Ustilago maydis EST collection was performed against publicly available EST and genomic sequence datasets from 21 species. We annotated 70% of the collection based on similarity to known sequences and recognized protein signatures. Distinct grouping of the ESTs, defined by the presence or absence of similar sequences in the species examined, allowed the identification of U. maydis sequences present only (1) in fungal species, (2) in plants but not animals, (3) in animals but not plants, or (4) in all three eukaryotic lineages assessed. We also identified 215 U. maydis genes that are found in the ascomycete but not in the basidiomycete genome sequences searched. Candidate genes were identified for further functional characterization. These include 167 basidiomycete-specific sequences, 58 fungal pathogen-specific sequences (including 37 basidiomycete pathogen-specific sequences), and 18 plant pathogen-specific sequences, as well as two sequences present only in other plant pathogen and plant species.


Fungal comparative genomics EST analysis Ustilago maydis 


The evolution of plant pathogenesis as a means by which fungi obtain nutrients requires adaptation to the in planta environment. To investigate how this adaptation may have influenced the evolution of the fungal plant pathogen genome, we are comparing genes present in Ustilago maydis to those present in other eukaryotes. U. maydis is a basidiomycete plant pathogen and the causal agent of common smut of corn, Zea mays. It has emerged as a model for fungal plant pathogenesis by the smut and rust fungi (Saville and Leong 1992; Martinez-Espinoza et al. 2002). These two fungal lineages branch basally from the rest of the basidiomycetes (Swann and Taylor 1993) and are among the most devastating plant pathogens worldwide (Agrios 1997; Martinez-Espinoza et al. 2002). Both groups of fungi require growth in a plant to complete their life cycles; in fact, the rust fungi cannot be cultured outside the plant. In contrast, U. maydis is readily cultured on defined media, amenable to molecular manipulation—including homologous gene replacement—and has established protocols for genetic analysis and characterization of gene function. We created EST libraries from U. maydis diploid filaments and germinating teliospores as a rapid means of identifying genes (Nugent et al. 2004; Sacadura and Saville 2003). EST collections have also been catalogued from several other fungal and oomycete plant pathogens and plant–pathogen interactions (Trail et al. 2003; Karlsson et al. 2003; Kruger et al. 2002; Kim et al. 2001; Thomas et al. 2001; Qutob et al. 2000; Keon et al. 2000; Kamoun et al. 1999). The majority of these studies involve EST comparisons to the National Center for Biotechnology Information (NCBI) databases, using BLAST. These analyses provide an identification of genes based on sequence similarity but do not provide for a ready link between fungal pathogenesis and the presence or absence of genes. 
Identifying gain and loss of genes from various eukaryotic lineages through species-by-species comparison allows a correlation of gene presence with an organism’s life cycle. We investigated this approach by performing comparative analysis of U. maydis ESTs to species-based genomic and EST sequence databases with a goal of identifying genes that are present in all eukaryotes, in fungi, in pathogenic fungi, in basidiomycete fungi, and in basidiomycete pathogenic fungi.

The U. maydis genome has been sequenced by the Whitehead Institute/MIT Genome Sequencing Center (Ma et al. 2003; Whitehead Institute/MIT Center for Genome Research Ustilago maydis database). The value of this sequence data will be fully realized with thorough annotation. EST libraries created from different U. maydis cell types and developmental stages (Nugent et al. 2004; Sacadura and Saville 2003) will aid in this annotation. The unisequence set used here has the potential to represent 64% of the estimated 6,522-gene-coding capacity of the U. maydis genome. With this degree of coverage, the unisequence set was used to assess the gene complement of this model pathogenic fungus and identify the conservation of sequences present in U. maydis across different eukaryotic lineages. We carried out pair-wise comparisons with 21 genome and EST databases to determine which members of the U. maydis unisequence set of ESTs (uniESTs) have sequences similar to genes in these other organisms. We also carried out a twofold annotation of the U. maydis unigene set based on similarity with known proteins and protein-signature scanning. This allowed the annotation of 70% of uniESTs with respect to protein characteristics and probable function. This annotation and identification of conserved gene groups furthers our understanding of the evolution of pathogenic fungi and fungal genomes and forms a basis for the functional analysis of identified genes.

Materials and methods

The EST collection

RNA was isolated from two distinct cellular states of U. maydis, a forced diploid, FBD12 (Banuett and Herskowitz 1989), growing as filaments on charcoal media plates and the germinating diploid teliospore. The procedures followed for the construction of these libraries are described in Nugent et al. (2004) and Sacadura and Saville (2003), respectively. Briefly, total RNA was isolated from the filamentous growth form, using TRIzol reagent (GIBCO BRL, Gaithersburg, Md.); poly(A)+ RNA was purified using Oligotex mRNA Spin-Column protocol (Qiagen, Valencia, Calif.), and cDNA libraries were constructed using a Superscript Plasmid System for cDNA Synthesis and Cloning (GIBCO–BRL). Germinating teliospores were lysed in a French pressure cell. Total RNA was isolated using an RNeasy Mini Kit (Qiagen), and a cDNA library was created using a Creator SMART cDNA Library Construction Kit (Clontech, Palo Alto, Calif.). Single-pass sequencing was performed from the 3′ end of cDNA [corresponding to the 5′, non-poly(A)+ tail end of the transcript], using plasmid DNA as a template and ABI PRISM Big Dye Terminator Chemistry, version 3 (Applied Biosystems, Foster City, Calif.). The resulting extension products were separated and analyzed on an ABI PRISM 3100 Genetic Analyzer. The clustering software of SeqManII was used to assemble ESTs into overlapping contigs, using default settings. This produced 4,221 single ESTs or overlapping contigs. The uniESTs therefore represent a set of unique gene sequences, each consisting of either a single EST or a contig sequence assembled from a group of ESTs (see Electronic Supplementary Material, Table 3).


All uniESTs in the collection were queried against the NCBI non-redundant (nr) protein reference library, using the NCBI standalone blastall program and the BLASTX algorithm at default settings (Altschul et al. 1997). Within each uniEST, the existence of a top high-scoring pair (HSP) with an E-value below 10−5 was taken as indicative of significant similarity (Rubin et al. 2000), and annotation information from the homologue was used to annotate the uniEST. The SEALS program blast2gi (Walker and Koonin 1997) was used to create a complete annotation file based on BLAST results, and a Shell script was written to retrieve the full annotation information directly from the nr database, which was otherwise truncated during BLAST analysis.

Protein fingerprint and motif analysis of each uniEST was performed using an InterProScan wrapper program. InterProScan (Zdobnov and Apweiler 2001) was downloaded from the European Bioinformatics Institute (InterPro), and release 6.0 of the InterPro database was used (Mulder et al. 2003). An interface to the InterProScan application (pbsInterProScan) was written in Perl in order to distribute jobs across the Botany Beowulf Cluster at the University of Toronto, using the OpenPBS load-balancing system. Results were summarized using Perl and shell scripts, and all significant InterProScan results were appended to the annotation of the uniESTs.

BLAST analysis

All uniESTs were queried against all datasets obtained for each species (see Table 1 for dataset sources), using the NCBI standalone blastall program and the tBLASTX algorithm with default settings on the Botany Beowulf Cluster at the University of Toronto. Individual BLAST jobs for each uniEST were distributed across the Beowulf cluster using a wrapper program (bBlastall) written in Perl and interfaced with the OpenPBS load-balancing system. Results were delivered to a respective species directory, one file for each uniEST blast result.
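
The job layout described above (one tBLASTX search per uniEST per species, one output file per search in a species directory) can be sketched in Python as follows. This is not the bBlastall wrapper used in the study, which was written in Perl and submitted jobs through OpenPBS. The function name, directory names, and file extensions are hypothetical; only the legacy blastall options -p, -d, -i, and -o are taken from the NCBI standalone program. The sketch only builds the command lines and does not run them.

```python
from pathlib import PurePosixPath


def build_blast_commands(uniest_ids, species_dbs,
                         query_dir="queries", out_root="results"):
    """Return one blastall argument list per (species, uniEST) pair.

    species_dbs maps a species label to the path of its formatted
    nucleotide database (hypothetical paths in this sketch).
    """
    commands = []
    for species, db_path in species_dbs.items():
        for uid in uniest_ids:
            query = PurePosixPath(query_dir) / f"{uid}.fasta"
            # One result file per uniEST, grouped in a species directory.
            out = PurePosixPath(out_root) / species / f"{uid}.tblastx"
            commands.append([
                "blastall", "-p", "tblastx",
                "-d", db_path,
                "-i", str(query),
                "-o", str(out),
            ])
    return commands


cmds = build_blast_commands(["um0001", "um0002"], {"N_crassa": "db/ncrassa"})
```

Each argument list could then be handed to a scheduler or to a local process pool; that step is left out because it depends on the cluster configuration.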
Table 1

Datasets used in study

Species, disease caused | Data type | Data origin
Ustilago maydis, common smut of corn | | Saville lab, Department of Botany, University of Toronto
Phanerochaete chrysosporium, white rot pathogen | | The Joint Genome Institute, Data as of 02/16/02
Cryptococcus neoformans, cryptococcal meningitis | | The Institute for Genomic Research (TIGR), Dataset: JEC21_9X.fa
Schizosaccharomyces pombe | | The Sanger Institute, Data as of 11/28/02
Saccharomyces cerevisiae | | The National Center for Biotechnology Information (NCBI), Data as of 11/06/02
Neurospora crassa | | Fungal Genome Initiative, Data as of 11/24/02
Blumeria graminis, barley powdery mildew | | Consortium for the Functional Genomics of Microbial Eukaryotes (COGEME), Data as of 02/21/03
Botryotinia fuckeliana, grey mould | | COGEME, Data as of 02/21/03
Cladosporium fulvum, leaf mould of tomato | | COGEME, Data as of 02/21/03
Colletotrichum trifolii, alfalfa anthracnose disease | | COGEME, Data as of 02/21/03
Gibberella zeae, Fusarium head blight of wheat | | COGEME, Data as of 02/21/03
Fusarium sporotrichioides, Fusarium head blight of wheat | | COGEME, Data as of 02/21/03
Leptosphaeria maculans, blackleg of oilseed rape | | COGEME, Data as of 02/21/03
Mycosphaerella graminicola, wheat leaf spot | | COGEME, Data as of 02/21/03
Magnaporthe grisea, rice blast disease | | COGEME, Data as of 02/21/03
Verticillium dahliae, vascular wilt | | COGEME, Data as of 02/21/03
Phytophthora infestans, late blight in potato | | COGEME, Data as of 02/21/03
Phytophthora sojae, stem and root rot on soybean | | COGEME, Data as of 02/21/03
Caenorhabditis elegans | | The Sanger Institute, Data as of 02/21/03
Drosophila melanogaster | | NCBI, Data as of 02/21/03
Arabidopsis thaliana | | The Arabidopsis Information Resource (TAIR), Data as of 11/06/02
Zea mays | | TIGR, Dataset: ZMGI.TCs.092602.Z

Deduced protein-coding sequence datasets used for InterProScan study
A. thaliana | | TAIR, Data as of 08/30/03
S. cerevisiae | | Saccharomyces Genome Database, Stanford University, Dataset: orf_trans.fasta.gz, as of 08/30/03
N. crassa | | The Broad Institute, Dataset: neuorospora_3_protein.tgz, as of 08/30/03

The top HSP and associated E-value were parsed from all uniEST output files returned for each species and written to a species-specific summary file, using the SEALS blast2gi program (Walker and Koonin 1997). A second summary file was generated by normalizing all E-values based on the cumulative size of all databases and selecting those HSPs with an E-value less than 10−5.
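
As an illustration of the parsing step, the following minimal Python sketch keeps one top HSP per uniEST from a list of (uniEST, subject, E-value) records. It is not the SEALS blast2gi program used in the study. The record layout, the function name, the example identifiers, and the use of the lowest E-value as the definition of the top HSP are assumptions made for the example.

```python
def top_hsp_per_query(hsp_rows):
    """Keep the HSP with the lowest E-value for each query uniEST.

    hsp_rows is an iterable of (query_id, subject_id, e_value) tuples;
    this layout is hypothetical and stands in for parsed BLAST output.
    """
    best = {}
    for query_id, subject_id, e_value in hsp_rows:
        # Replace the stored HSP only when the new one is strictly better.
        if query_id not in best or e_value < best[query_id][1]:
            best[query_id] = (subject_id, e_value)
    return best


rows = [
    ("um0001", "subjA", 1e-12),
    ("um0001", "subjB", 1e-40),
    ("um0002", "subjC", 0.3),
]
summary = top_hsp_per_query(rows)
```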

Adjustment of the E-value for comparative analysis was based upon the formula for calculating the E-value,
$$ E = m \times n \times 2^{-S'} $$
where m is the species database size, n is the sequence length, and S′ is the normalized bit score. The normalized bit score is independent of database size (Altschul et al. 1990) and can thus be ignored in the normalization equation. Normalization of E to create En was performed with the formula
$$ E_{\text{n}} = \frac{m' \times m \times n \times 2^{-S'}}{m} = \frac{m' \times E}{m} $$
where En is the normalized E-value and m′ is the sum total of all species dataset sizes. Dividing by m removes the size of the individual species dataset, and multiplying by m′ replaces it with the combined size. This allowed for the treatment of each dataset as a subset of one large database, and the comparison of E-values for each uniEST across species was consequently standardized and unaffected by variability in size between species datasets.
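
The normalization can be sketched in a few lines of Python. The function names and the example species dataset size of 40,000,000 nucleotides are hypothetical; the merged size of 572,524,341 nucleotides is the value reported in the legend of Fig. 3.

```python
import math

TOTAL_NTS = 572_524_341  # merged size of all species datasets (Fig. 3 legend)


def normalize_evalue(e_value, species_db_size, total_db_size=TOTAL_NTS):
    """Rescale a per-species E-value as if the search had been run
    against one database of size total_db_size: E_n = m' * E / m."""
    if species_db_size <= 0:
        raise ValueError("species_db_size must be positive")
    return e_value * total_db_size / species_db_size


def is_significant(e_norm, cutoff=1e-5):
    """Apply the 10^-5 cutoff to a normalized E-value."""
    return e_norm < cutoff


# Hypothetical hit: raw E-value 1e-7 against a 40 Mb species dataset.
e_norm = normalize_evalue(1e-7, 40_000_000)
```

Because m′ is never smaller than m, normalization can only increase an E-value, so a hit that passes the cutoff against a small dataset may no longer pass it after rescaling.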

An overall summary file listing the uniEST IDs and associated E-values (E-value <10−5) for all species was created using Perl. Annotation information was then incorporated and all results processed using a Venn selection program (Botany Beowulf Cluster Webtools at University of Toronto Botany Department) to display the data in an easily interpretable format, which could be imported into MS Excel and sorted.

Gene family identification

The UNIX pattern matching tool grep allowed for the identification of names of identified expanded gene families (and substrings thereof) within the nr annotation, while the InterProScan annotation file was searched for respective InterPro motif numbers found within each family. InterProScan was also run on the genomic sequence of Cryptococcus neoformans, and on deduced protein sequence datasets for Arabidopsis, Neurospora crassa, and Saccharomyces cerevisiae, as obtained from the sources listed in Table 1.
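
A rough Python equivalent of this pattern-matching step is sketched below. The study used grep on two separate annotation files; the sketch merges both searches into one function for brevity. The function name, the example annotations, the placeholder identifier IPR000000, and the choice of case-insensitive matching are all assumptions.

```python
import re


def find_family_members(annotations, name_patterns, interpro_ids=()):
    """Return sorted uniEST IDs whose annotation text contains any of the
    family names (case-insensitive substring) or any InterPro ID.

    annotations maps a uniEST ID to its annotation text (hypothetical).
    """
    name_re = None
    if name_patterns:
        # Escape each name so it is matched literally, as fixed-string grep would.
        joined = "|".join(re.escape(p) for p in name_patterns)
        name_re = re.compile(joined, re.IGNORECASE)
    hits = []
    for uid, text in annotations.items():
        by_name = bool(name_re and name_re.search(text))
        by_ipr = any(ipr in text for ipr in interpro_ids)
        if by_name or by_ipr:
            hits.append(uid)
    return sorted(hits)


annotations = {
    "um0001": "ABC transporter, putative",
    "um0002": "hypothetical protein",
    "um0003": "IPR000000 example domain",
}
members = find_family_members(annotations, ["abc transporter"], ["IPR000000"])
```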


The analysis strategy used is presented in Fig. 1. Sequence datasets from a total of 21 species including two basidiomycetes, 13 ascomycetes, two oomycetes, two plants, and two animals were selected for comparative analysis. The basidiomycetes, in addition to U. maydis, include Phanerochaete chrysosporium and C. neoformans. The ascomycetes include Schizosaccharomyces pombe, S. cerevisiae, N. crassa, and ten fungal pathogens from the Consortium for the Functional Genomics of Microbial Eukaryotes (COGEME) (Soanes et al. 2002; Table 1). The oomycete pathogens were also from COGEME (Table 1). Sequences of the animals and plants studied were from Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, and Z. mays (for dataset information, see Table 1).
Fig. 1

Project methodology. The three-pronged approach involved, from left to right, carrying out BLAST searches against various EST and genomic sequence databases from fungi and other eukaryotes, [IPR] Scanning for the presence of conserved motifs, and BLAST searches against the NR database

Annotation by comparison to nr

Sequence comparison against the NCBI nr database allowed for the annotation of 2,372 uniESTs. Among these are 1,685 annotations derived from 120 fungal species, 338 from 39 animal species, 183 from 80 bacterial species, and 131 annotations from 18 plant species. Of the 4,221 uniESTs, 1,834 did not exhibit significant similarity (E-value > 10−5) to genes in the nr database.

InterProScan annotation

Additional information on uniESTs was obtained by protein-signature scanning (Zdobnov and Apweiler 2001). Domain presence in U. maydis was compared with that in the fungi S. cerevisiae, C. neoformans, and N. crassa as well as the plant A. thaliana. InterProScan was used for sequence comparison to the InterPro database (Zdobnov and Apweiler 2001). Submission of each uniEST to an InterProScan analysis facilitated the annotation of 2,173 uniESTs with significant protein signatures and associated InterPro numbers. The InterPro database amalgamates signature information from Hidden Markov models, regular expressions, fingerprints and profiles for protein families, and domains from public domain database projects including Pfam, PROSITE, ProDom, PRINTS, TIGRFAMS, and SMART (Apweiler et al. 2001). Of the 2,173 annotated sequences, 557 were not annotated during the sequence comparison against the NCBI nr database. A total of 8,905 protein signatures were identified within the 2,173 uniESTs, and 16% of the uniESTs possessed only one identified domain. The 134 most frequently occurring domain signatures with the [IPR] designation present in the uniESTs, preceded by the number of times they were identified, are listed in Table 2, along with the corresponding numbers found in the other fungi and A. thaliana. While the U. maydis database is not as comprehensive as the genomic databases of the other organisms, the relative levels of various domains can be assessed. There is a higher number of GTPase, GTP-binding protein, and zinc finger Tim10/DDP-type signatures in the U. maydis sequences relative to the other basidiomycete, C. neoformans, whereas ABC transporters, serine/threonine protein kinases, major facilitator superfamily, DEAD/DEAH box helicases, RING, and CCHC zinc finger protein domains are underrepresented in the U. maydis sequences. The levels of heat shock protein Hsp70, mitochondrial carriers, calcium-binding EF-hands, and forkhead-associated (FHA) domains are similar between these basidiomycetes. The relative number of Zn-finger, RING, and FHA domains in the basidiomycetes is lower than that in the ascomycetes investigated.
Table 2

The 134 most frequent [IPR] protein signatures identified using the InterProScan application and the InterPro database in 2,173 U. maydis uniESTs possessing a signature. As a comparison, the number of occurrences of each in A. thaliana, S. cerevisiae, and N. crassa deduced coding sequences, and in translated C. neoformans genomic contigs is provided. These are denoted A, S, N, and C, respectively. U. maydis is denoted U. The background color indicates number identified, ranging from 0 (white) to 100 or more (dark yellow)

BLAST analysis to species sequence datasets

We used the tBLASTX algorithm (Altschul et al. 1997) to identify similar sequences in the sequence datasets of the 21 species that were examined. Rubin et al. (2000) indicated an E-value of 10−6 or smaller as a cutoff for sequence similarity. We performed a verification of the 10−5 value we have used as indicative of similarity significance by plotting the number of significant HSPs against E-values ranging from 10−25 to 1 (data not shown). Increasing the E-value cutoff resulted in a gradual increase in the number of significant HSPs. Above an E-value cutoff of ~10−4, however, the number of significant HSPs begins to increase dramatically until a value of 1, at which point all U. maydis sequences had “matches” identified. As the E-value of 10−5 fell within the range of gradual increase of HSP numbers, it was taken as an adequate indicator of significant similarity.
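
The cutoff verification amounts to counting, for each candidate threshold, how many top HSPs fall below it. A minimal Python sketch is given here; the function name and the five example E-values are hypothetical and are not data from the study.

```python
def count_hits_by_cutoff(top_evalues, cutoffs):
    """Map each E-value cutoff to the number of top HSPs below it."""
    return {c: sum(1 for e in top_evalues if e < c) for c in cutoffs}


# Hypothetical top-HSP E-values for five uniESTs against one species.
example_evalues = [1e-30, 1e-12, 1e-6, 1e-3, 0.5]
counts = count_hits_by_cutoff(example_evalues, [1e-25, 1e-10, 1e-5, 1e-4, 1])
```

Plotting these counts against the cutoffs gives the kind of curve described above, in which a slow rise at stringent cutoffs is followed by a steep rise as the cutoff approaches 1.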

The default amino acid substitution matrix employed by the NCBI tBLASTX algorithm for scoring amino acid alignments is BLOSUM62, developed by Henikoff and Henikoff (1992). Since we were drawing comparisons between species separated by varying evolutionary distances, the application of the BLOSUM matrices based on an expected percent identity between proteins across distantly related species was investigated. tBLASTX analysis was performed on representatives from each of three taxonomic groups, with BLOSUM matrices reflecting their expected percent similarity with U. maydis sequence. Therefore, for the species of A. thaliana, D. melanogaster, and P. chrysosporium, we applied the BLOSUM45, BLOSUM62, and BLOSUM80 matrices, respectively. The results of significant HSP number versus E-value were plotted across the E-value range 10−25 to 1 and compared at the 10−5 cutoff level to assess the effect each matrix had on the number of significant HSPs. Results revealed a slight rise in the number of significant HSPs for P. chrysosporium under the BLOSUM80 matrix and a slight drop in the number of significant HSPs for A. thaliana under BLOSUM45 (data not shown). However, variation in the number of uniESTs possessing a significant match was minimal in both species comparisons, and the BLOSUM62 matrix was deemed appropriate for the tBLASTX analysis of all 21 species. The consistent use of one matrix offered a means of standardizing the analysis across all species.

In these comparisons, it is important to note that the U. maydis sequences do not represent the complete genome. The U. maydis genome has been estimated to contain 6,700 coding sequences (Kämper et al. 2001). With this estimate, the 4,221 uniESTs may represent as much as 63% of the genome. The Broad Institute’s annotation of the genome revealed 6,522 genes, suggesting that the uniESTs could represent 65% of the genes; however, 325 uniESTs did not have a match in the genome sequence (data not shown). Hybridizations carried out with 70 of these (Babu et al., unpublished) have confirmed that they are U. maydis genes, indicating that the posted annotated genome sequence lacks data for some regions.
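
The two coverage figures follow directly from the counts given in the text, as the short Python check below shows. The function name is hypothetical, and the calculation assumes, as the text does, that each uniEST corresponds to a distinct gene, which makes both figures upper bounds.

```python
UNIESTS = 4221


def coverage_percent(n_uniests, n_genes):
    """Upper-bound percentage of genes represented by the uniEST set."""
    return 100.0 * n_uniests / n_genes


kaemper_estimate = round(coverage_percent(UNIESTS, 6700))   # 6,700 coding sequences
broad_annotation = round(coverage_percent(UNIESTS, 6522))   # 6,522 annotated genes
print(kaemper_estimate, broad_annotation)  # prints: 63 65
```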

All uniEST sequences from U. maydis were queried against the sequence datasets of all species of comparison, using the tBLASTX algorithm. The resultant BLAST output for each uniEST from each species was then parsed for the top HSP and its associated E-value. E-values were normalized against the sum total size of all datasets, and a table listing the number of uniESTs possessing a top HSP below the E-values of 10−5, 10−10, 10−20, 10−50, and 10−100 was produced (Table 3). Of the 4,221 uniESTs from U. maydis, the proportions with similar sequences at the 10−5 level are P. chrysosporium, 51.4%; N. crassa, 47.2%; C. neoformans, 46.8%; S. pombe, 40.6%; D. melanogaster, 31.7%; A. thaliana, 35.3%; and S. cerevisiae, 19.7% (Table 3). The numbers of uniESTs with similarity to sequences of other organisms are portrayed in Fig. 2. There is a high number of similar sequences (E-value <10−5) in the basidiomycete datasets and a range in the number of similar sequences in the ascomycete datasets. The basidiomycete datasets are genomic, and the ascomycete datasets are a mixture of genomic and EST sequences. However, there is variation in the number of similar sequences even among the genomic datasets, with S. pombe having 1,713 similar sequences and S. cerevisiae 830.
Table 3

All uniEST sequences underwent tBLASTX against all other species datasets (size and number of sequences for each species shown in two right-most columns). The E-value from the top high-scoring pair (HSP) from all results was normalized by dividing by the size of the species dataset and multiplying by the sum total of all species datasets (represented by Total Nts in the table). The numbers of top HSPs possessing an E-value below each cutoff are displayed as numbers and percentages for each species. Species marked with an asterisk indicate genomic datasets. Percentages were calculated as the number of uniESTs possessing a homologous sequence over the total number of uniESTs from U. maydis (4,221)

Columns: Species | E-value <10−5 | E-value <10−10 | E-value <10−20 | E-value <10−50 | E-value <10−100 | N seqs | Total nts

Rows: P. chrysosporium*, C. neoformans*, S. cerevisiae*, S. pombe*, B. graminis, C. fulvum, C. trifolii, F. sporotrichioides, G. zeae, L. maculans, M. graminicola, M. grisea, N. crassa*, V. dahliae, B. fuckeliana, P. infestans, P. sojae, D. melanogaster*, C. elegans EST, A. thaliana, Z. mays, Total Nts

Fig. 2

Graphical representation of the number of high-scoring pairs (HSPs) possessing significant similarity with a sequence within each respective species dataset below an E-value of 10−5. The smaller EST datasets show a deficit in the number of HSPs identified due to their incomplete nature. Organisms are color-coded based on taxa: pink basidiomycetes, yellow ascomycetes, purple oomycetes, blue and green animals, and orange plants

In an effort to determine sequences that are conserved amongst various species lineages and thus provide an indication of conservation and potential homologous relationships, we present a list of 2,560 uniESTs with E-values ≤10−5 in at least one species (Fig. 3; Electronic Supplementary Material, Table 1). All uniESTs were documented with the E-values for the best HSP from each species. These results were then sorted and grouped based on the existence of sequence matches relative to the uniESTs of U. maydis. The species compared are listed vertically as indicated in Fig. 3a. The arrangement of the uniEST data presented in Fig. 3b is based upon the number of species in which a given uniEST had a similar sequence. That is, those uniESTs in the upper left of Fig. 3b have the highest number of species matches, and the uniESTs in the bottom right of Fig. 3b a single match or no matches. Striking in this representation is a group of uniESTs with sequence similarity in a large number of species—the predominantly yellow sections—and stretches of uniESTs similar only to a single other species—the yellow-fading-to-dark lines in the second last row of Fig. 3b.
Fig. 3

tBLASTX was used on the 4,221 U. maydis uniESTs to obtain the datasets indicated. BLASTs were performed for each database separately, and E-values were then recalculated based on a merged database size of 572,524,341 nucleotides. A cutoff of 10−5 or smaller was used as an indicator of similarity. The E-value for the best HSP for each organism was then taken if it met the cutoff criterion and was log2-transformed and plotted according to the color-scale shown on the bottom left. Blue areas are indicative of an E-value larger than 10−5 or of no hit. The organisms are ordered in the image on the right within each stripe from top to bottom as listed in the table in a, and the U. maydis ESTs are ordered left to right according to the number of organisms having a homologue (i.e., first 19 have homologues in 20 of 21 organisms listed) and then by the E-value of the Phanerochaete chrysosporium homologue, unless not present. Results are truncated after 2,750 uniESTs
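
The ordering of uniESTs described in this legend can be sketched as a two-level sort. The Python below is an illustration only: the function name and example values are hypothetical, and the direction of the secondary sort (strongest P. chrysosporium match first) and the placement of uniESTs lacking a P. chrysosporium match at the end of their group are assumptions, since the legend does not state them.

```python
import math

CUTOFF = 1e-5


def order_for_heatmap(evalues, tie_break_species="P. chrysosporium",
                      cutoff=CUTOFF):
    """Order uniEST IDs by number of species with a hit (descending),
    then by the tie-break species E-value (ascending, missing last).

    evalues maps uniEST ID -> {species: normalized best-HSP E-value}.
    """
    def sort_key(uid):
        hits = {s: e for s, e in evalues[uid].items() if e <= cutoff}
        tie = hits.get(tie_break_species, math.inf)
        return (-len(hits), tie, uid)

    return sorted(evalues, key=sort_key)


example = {
    "a": {"P. chrysosporium": 1e-20, "N. crassa": 1e-8},
    "b": {"P. chrysosporium": 1e-50, "N. crassa": 1e-8},
    "c": {"N. crassa": 1e-30},
    "d": {"N. crassa": 0.1},
}
ordered = order_for_heatmap(example)
```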

Further analysis of U. maydis uniESTs grouped the sequences based upon the eukaryotic taxa within which similarity was found. This information can be obtained by sorting the data in Electronic Supplementary Material, Table 1, and is summarized here and in Electronic Supplementary Material, Table 2. The groups include uniESTs with similar sequences present in all the eukaryotic taxa searched (1,233), those sequences with similarity only in the fungi (520), and those not present in any other database (orphans, 1,555). In calculating the numbers for these categories, those uniESTs possessing nr annotation that referenced species outside of the categorization were removed.

The classification of uniEST sequences conserved across all eukaryotes was defined by at least one similar sequence in each of the basidiomycetes, ascomycetes, plants, and animals. The oomycetes were excluded from this selection, as both of the oomycete EST collections contained fewer than 1,500 sequences and would have biased the grouping. Under these criteria, 1,233 uniESTs (29%) were present in all the eukaryotic taxa. We were able to annotate all uniESTs in the eukaryotic group with the nr database annotation results.

U. maydis sequences found to be present only within fungal species totaled 520 (see Electronic Supplementary Material, Table 2). Among these, 58 sequences are present only in U. maydis and another pathogenic fungus, either ascomycete or basidiomycete.

The number of orphan uniESTs—those showing no similarity with any other organism or the nr database—was 1,555 (37%). However, the number of U. maydis uniESTs lacking a similarity to sequences in the nr database was 1,834 (43%). Thus, after our cross-species identification of homologous sequences, we were able to reduce the number of orphan genes by another 279 uniESTs (6%), the majority of which shared similarity strictly with sequences in the basidiomycete datasets (152 uniESTs) or with sequences in a basidiomycete and/or an ascomycete (111 uniESTs).
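
The three groupings described above can be expressed as a small classification rule. The Python sketch below is a simplified illustration and not the Venn selection program used in the study. The function and constant names are hypothetical, the species-to-taxon table lists only a few of the 21 species, and the additional filtering on nr annotations that reference species outside a category is not modelled.

```python
TAXON_OF = {
    "P. chrysosporium": "basidiomycete", "C. neoformans": "basidiomycete",
    "N. crassa": "ascomycete", "S. cerevisiae": "ascomycete",
    "A. thaliana": "plant", "Z. mays": "plant",
    "D. melanogaster": "animal", "C. elegans": "animal",
    "P. infestans": "oomycete",
}
FUNGAL = {"basidiomycete", "ascomycete"}
# Oomycetes are not required for the all-eukaryote group.
REQUIRED = {"basidiomycete", "ascomycete", "plant", "animal"}


def classify(hit_species, has_nr_hit=False):
    """Assign a uniEST to a group from the species in which it has a
    significant hit and whether it has a significant nr hit."""
    taxa = {TAXON_OF[s] for s in hit_species}
    if REQUIRED <= taxa:
        return "all eukaryotes"
    if taxa and taxa <= FUNGAL:
        return "fungi only"
    if not taxa and not has_nr_hit:
        return "orphan"
    return "other"


group = classify(["P. chrysosporium", "N. crassa", "A. thaliana",
                  "C. elegans", "P. infestans"])
```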

Comparing the U. maydis ESTs to the genomic datasets of P. chrysosporium, C. neoformans, N. crassa, S. cerevisiae, S. pombe, and D. melanogaster, it was possible to determine presence and absence of U. maydis sequences. Among the fungal specific sequences there are 167 U. maydis uniESTs that are present in another basidiomycete but absent from the ascomycetes and 122 U. maydis uniESTs that are present in the ascomycetes but absent from the other basidiomycete species.


We are investigating the genome of the model fungal plant pathogen U. maydis to gain insight into the evolution of a fungal-pathogen genome and to identify genes that are unique to fungal pathogens. We have compared a set of uniESTs representing as much as 64% of the coding capacity of the U. maydis genome to the non-redundant database of GenBank as well as to species-specific sequence databases for ten plant pathogenic fungi, a human pathogenic fungus, four non-pathogenic fungi, two oomycetes, two plants, and two animals. This combination of general searches and species-by-species comparisons allows the annotation of sequences and, within the limitations of currently available sequence data, the identification of sequences specific to different eukaryotic lineages.

BLAST searches indicated that 43% of the uniESTs lack similarity to an entry in the nr protein database. This is in accordance with prior studies that found the percentage of orphan genes in U. maydis to be 40% (Kämper et al. 2001), 38% (Sacadura and Saville 2003), and 41.2% (Nugent et al. 2004). The degree of annotation was extended through the identification of protein motifs, using InterProScan searches of the InterPro databases. This extension resulted in annotation of 69.4% of the uniESTs. The InterProScan searches identified motif families, allowing comparison of the U. maydis uniESTs to the genomic databases of A. thaliana, S. cerevisiae, N. crassa, and C. neoformans (Table 2). Previous studies showed that fungi, represented only by S. cerevisiae, had reduced sizes of motif families relative to the animals D. melanogaster and C. elegans (Rubin et al. 2000). In the analysis presented here, the S. cerevisiae searches also show reduced sizes of families compared to the plant, A. thaliana, but comparable family size to the other fungi. While the use of ESTs means the analysis is incomplete with regard to U. maydis, some trends emerge. The two most abundant U. maydis protein domains identified are antifreeze protein type 1 and proline-rich regions. These are also abundant in the other organisms analysed here (Table 2). Proline-rich regions are frequently found in transcription activators and have been implicated as a critical component of the protein recognition domain in numerous protein–protein interactions (Kay et al. 2000). The numbers of protein kinases, serine/threonine protein kinases, DEAD/DEAH box helicases, and sugar transporters are lower in U. maydis than in all other fungi investigated. While it will be necessary to rule out that these numbers result from using EST data from only two growth stages, these differences may reflect differences in the growth or life cycle in U. maydis. It is not immediately obvious that U. maydis would require less signal-transduction capacity, RNA unwinding/transport, or sugar transport than the other fungi investigated; however, it may be that the obligate in planta growth requirement of U. maydis has led to or is the result of gene reduction in some families of proteins.

Loss of family members may also apply to the U. maydis RING, CCHC, and C-x8-C-x5-C-x3-H-type zinc finger proteins; however, it appears that the Tim10/DDP-type zinc finger proteins are at an equivalent level to that of all organisms investigated. These protein families are involved in protein–DNA, protein–protein, and protein–lipid interactions (Matthews and Sunde 2002), and all but the Tim10/DDP-type have large numbers of members in A. thaliana compared to any of the fungi. Interestingly, U. maydis may have C-x8-C-x5-C-x3-H-type and Tim10/DDP-type zinc finger protein family sizes more similar to the ascomycete S. cerevisiae than to the basidiomycete C. neoformans. This relationship is also reflected in Rab-type, Ras small GTPase domain numbers. If these relative numbers hold up with further U. maydis sequence analysis, it may indicate retention of genes by U. maydis that have been lost in the other basidiomycetes. In contrast, the number of bipartite nuclear localization signal domains is higher in the basidiomycetes U. maydis and C. neoformans relative to the ascomycetes, and the FHA domain numbers are lower in the basidiomycetes versus the ascomycetes. These changes may reflect evolutionary gain and loss of gene family members that, in turn, may represent adaptations to life-cycle differences between these two fungal lineages. In general, these motif-family analyses support the concept of a mosaic U. maydis genome structure.

Because we compared the U. maydis uniESTs to other species-specific databases, we assessed the effect of different E-value cutoff thresholds and selected an E-value of 10⁻⁵ as the cutoff to determine the presence of a given U. maydis sequence in another organism. While this cutoff may have been too conservative and may have eliminated genuinely similar sequences from further analysis, its use, together with that of BLOSUM62, provides a compromise for comparing sequences among species of differing degrees of relatedness.
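
As a rough illustration of how a presence/absence call at a fixed E-value cutoff might be made, the sketch below scans tab-separated BLAST hits and records which queries have at least one hit at or below the cutoff. It assumes the 12-column tabular layout written by BLAST+ with `-outfmt 6`, which may differ from the output format used in the original analysis; the function name and the sample rows are hypothetical.

```python
from typing import Iterable, Set

E_VALUE_CUTOFF = 1e-5  # cutoff stated in the text


def queries_with_hits(lines: Iterable[str], cutoff: float = E_VALUE_CUTOFF) -> Set[str]:
    """Return the query IDs that have at least one hit with E-value <= cutoff.

    Assumes 12-column tab-separated BLAST output in which column 1 is the
    query ID and column 11 is the E-value (an assumption, see lead-in).
    """
    present: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows rather than guess at their layout
        if float(fields[10]) <= cutoff:
            present.add(fields[0])
    return present


# Hypothetical rows: query, subject, %identity, length, mismatches, gap opens,
# qstart, qend, sstart, send, evalue, bitscore
sample = [
    "um001\tsubjA\t62.5\t120\t45\t0\t1\t360\t10\t129\t3e-30\t125",
    "um002\tsubjB\t35.0\t60\t39\t0\t5\t184\t40\t99\t0.002\t38.1",
    "um003\tsubjC\t48.0\t90\t47\t0\t1\t270\t1\t90\t1e-5\t52.0",
]

print(sorted(queries_with_hits(sample)))         # queries called "present" at 1e-5
print(sorted(queries_with_hits(sample, 1e-20)))  # stricter cutoff, fewer queries
```

Rerunning the same function with several cutoffs is one simple way to see how sensitive the presence/absence calls are to the threshold chosen.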

The BLAST algorithm used had an impact on the results of our analyses, possibly because BLASTX inserts gaps and tBLASTX does not. During the functional annotation of U. maydis uniESTs, the BLASTX algorithm yielded 880 similar sequences with A. thaliana and 1,107 with S. cerevisiae at an E-value cutoff of 10⁻²⁰. At the same E-value cutoff, the tBLASTX searches against genomic datasets revealed 864 and 476 similar sequences in A. thaliana and S. cerevisiae, respectively. Although different databases are being searched in each instance, this drop in similar sequences for S. cerevisiae seems disproportionate. A comparison of U. maydis uniESTs to N. crassa and S. cerevisiae sequences revealed a similar drop with regard to the S. cerevisiae sequences. BLASTX searches at an E-value cutoff of 10⁻⁵ against N. crassa and S. cerevisiae sequences revealed 2,318 and 1,927 matches, respectively, as compared to 2,192 and 973 matches, respectively, when using a tBLASTX search. (Note that these numbers for the tBLASTX searches are not corrected, unlike those listed in Table 3.) Such a dramatic drop in number for one ascomycete but not the other is not what might be expected. However, S. cerevisiae is thought to have undergone extensive genomic shuffling and remodeling (Wolfe and Shields 1997; Keogh et al. 1998; Seoighe et al. 2000), and it is possible that the low number of similar sequences in the tBLASTX analyses reflects this genomic change. If this is the case, the tBLASTX algorithm may considerably under-report the incidence of similar sequences in comparisons with organisms whose genomes have undergone such rearrangements. While this may be detrimental in some analyses, BLASTX versus tBLASTX comparisons may turn out to be predictive of genome rearrangement.
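
The size of the drop can be made concrete by expressing each tBLASTX count as a fraction of the corresponding BLASTX count. The short sketch below does this for the four pairs of counts quoted above. The `retention` helper is a hypothetical name, and the ratio is only a descriptive summary: as noted, the two searches used different databases, so it does not isolate the effect of the algorithm.

```python
# Counts quoted in the text, as (BLASTX matches, tBLASTX matches).
counts = {
    ("A. thaliana", "1e-20"): (880, 864),
    ("S. cerevisiae", "1e-20"): (1107, 476),
    ("N. crassa", "1e-5"): (2318, 2192),
    ("S. cerevisiae", "1e-5"): (1927, 973),
}


def retention(blastx: int, tblastx: int) -> float:
    """Fraction of BLASTX matches that were also reported by tBLASTX."""
    return tblastx / blastx


for (species, cutoff), (bx, tbx) in counts.items():
    print(f"{species} (E <= {cutoff}): {retention(bx, tbx):.1%} retained")
```

On these figures roughly 98% of the A. thaliana matches and 95% of the N. crassa matches are retained, against roughly 43% and 50% for S. cerevisiae at the two cutoffs.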

In the species-by-species comparison, 1,243 uniESTs (29%) were present in all eukaryotic taxa; all of these also had similar sequences in the NCBI nr database. This might be expected if one considers that most highly conserved genes would tend to belong to functional groups involved in basal cellular processes across all organisms. Such genes would generally exhibit higher transcript levels (Mata and Bahler 2003), facilitating their identification and study, as well as possessing a greater chance of being isolated, cloned, and sequenced in at least one of the numerous organisms frequently studied. The discovery that 29% of our uniESTs belonged to the eukaryotic classification is slightly low but not unexpected. Several factors may explain this, including an under-reported number of similar sequences due to the level of stringency we employed in our analysis, a lack of publicly available genetic data with which to compare, a bias in the percentage of uniESTs in our collection that represent housekeeping and eukaryotic conserved genes, or a high degree of species-specific or fungal-specific genes within the U. maydis genome. In apparent contrast to this interpretation are the two uniESTs with similar sequences among the fungi, plants, and animals and yet without InterProScan or nr sequence similarity.

There were 235 uniESTs with sequence similarity to at least one plant species but to neither animal species. There were also 184 uniESTs with similarity among animal sequences but not plant sequences. One might expect more genes to be similar in the animals because of their closer evolutionary relationship with fungi. While the number of sequences in each group may be skewed as a result of sampling, it is conceivable that a portion of the genes conserved in plants, while lost from the animal and many fungal lineages, were conserved in U. maydis due to a life cycle that exists partly within a plant host. The ability of a plant pathogen to express proteins orthologous to endogenous host proteins could have advantages with respect to defense avoidance, tissue targeting, or the manipulation of cellular processes within the host.

It has been suggested that horizontal gene transfer may be more prevalent in fungi than in other eukaryotic lineages (Rosewich and Kistler 2000). We observed a possible example: a gene present in Z. mays, U. maydis, and Gibberella zeae. As both U. maydis and G. zeae are pathogens of Z. mays, the pathogens may have traded pathogenicity genes, as has been observed among prokaryotic pathogens (Doolittle 1998). Horizontal transfer has been suggested between prokaryotic species and N. crassa (Braun et al. 2000), and the occurrence of selfish genetic elements laterally transferring between fungi, as well as between fungi and plants, has been documented (Vaughn et al. 1995; Holst-Jensen et al. 1999). Hoffmann et al. (1994) presented experimental evidence of plasmid transfer between a pathogenic fungus and its host plant. Further investigation of the identified gene in both U. maydis and G. zeae may provide additional insight into horizontal gene transfer in fungi.

In the species-by-species comparisons, the basidiomycete fungus P. chrysosporium possessed the largest number of sequences similar to the U. maydis uniESTs, followed by the ascomycete N. crassa and the other basidiomycete, C. neoformans (Fig. 2). The greater number of similar sequences in an ascomycete than in another basidiomycete could be a result of the lack of gap insertion in the tBLASTX algorithm. It could also reflect limitations in the sequence databases compared, or a great deal of similarity between the ascomycetes and U. maydis, which branched early from the rest of the basidiomycetes. Among the smaller EST datasets, the number of uniESTs with similarity was relatively low. This may be a result of the incomplete nature of the EST datasets. The observation that N. crassa possesses the second-highest number of significant matches, while S. cerevisiae possesses the lowest, may reflect the non-gapped tBLASTX search used or gene loss in S. cerevisiae. If the latter is correct, then our analysis may further support the hypothesis that the unicellular fungi arose from multicellular ancestors (Braun et al. 2000).

Comparisons on a species level also revealed 520 uniESTs (12.3% of the total uniESTs investigated) with similarity only among the fungi. There may be genes among these that have retained phylogenetic signatures dating to the separation of fungi and animals, or genes with signatures representing further changes leading to the current state of U. maydis. Within this 12.3%, there are 167 uniESTs similar only among basidiomycetes and 36 uniESTs similar only between U. maydis and the other basidiomycete pathogen, C. neoformans. In the context of a greater number of similar sequences being present in the non-pathogenic basidiomycete P. chrysosporium or even the non-pathogenic ascomycete N. crassa, these 36 genes are of particular interest for further analysis. There are also 122 uniESTs with similar sequences in ascomycetes but not in the basidiomycetes. A portion of these sequences may represent genes that were present in an ancestor of the basidiomycetes and the ascomycetes and that have been retained in U. maydis and the ascomycetes but lost in the other basidiomycetes. As a group, these fungal-specific uniESTs may provide information on chromosomal change in fungi over times of divergence and, possibly, during the development of pathogenesis.
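
The groupings discussed above follow from a simple presence/absence rule applied to each uniEST. The sketch below shows one way such a rule could be written. It is not the pipeline used in this study: the lineage table covers only a few of the species named in the text, the animal entries are placeholders, and the function name, the uniEST identifiers, and the "plants and animals only" fallback category are assumptions made for illustration.

```python
from typing import Dict, Set

# Illustrative lineage table. Plant and fungal names come from the text;
# "animal_A" and "animal_B" are placeholders, not the species actually searched.
LINEAGE: Dict[str, str] = {
    "A. thaliana": "plant",
    "Z. mays": "plant",
    "animal_A": "animal",
    "animal_B": "animal",
    "N. crassa": "fungus",
    "S. cerevisiae": "fungus",
    "C. neoformans": "fungus",
    "P. chrysosporium": "fungus",
}


def classify(species_with_hit: Set[str]) -> str:
    """Assign a uniEST to a group from the set of species in which it has a hit."""
    lineages = {LINEAGE[s] for s in species_with_hit}
    in_plant = "plant" in lineages
    in_animal = "animal" in lineages
    in_fungus = "fungus" in lineages
    if in_plant and in_animal:
        # The fallback label is an assumption; the text does not name this case.
        return "all three lineages" if in_fungus else "plants and animals only"
    if in_plant:
        return "plants but not animals"
    if in_animal:
        return "animals but not plants"
    if in_fungus:
        return "fungi only"
    return "no similar sequence"


# Hypothetical uniEST identifiers mapped to the species with a similar sequence.
hits: Dict[str, Set[str]] = {
    "um001": {"A. thaliana", "animal_A", "N. crassa"},
    "um002": {"Z. mays", "N. crassa"},
    "um003": {"animal_B"},
    "um004": {"C. neoformans", "P. chrysosporium"},
    "um005": set(),
}

groups = {est: classify(found) for est, found in hits.items()}
for est, group in groups.items():
    print(est, "->", group)
```

Finer groups, such as the basidiomycete-only set, could be derived in the same way by adding a second table that maps each fungal species to its phylum.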

This work used available databases, including the fungal-pathogen EST databases of COGEME, to map potential evolutionary relationships of U. maydis uniESTs and to provide a base for investigating genes that may be linked to the pathogenic lifestyle. As more sequence data become accessible to researchers via public domain sequencing initiatives, the genomic characteristics and evolutionary significance of fungal pathogens will become increasingly well defined. This study provides a base for future investigation into the nature of genomic change associated with the evolution of the U. maydis genome. The established protocols for the analysis of gene function available for U. maydis make it a valuable model in which to investigate fungal pathogenesis. This work provides directions for this pursuit.


We would like to acknowledge the assistance of Kristen Choffe in the creation of cDNA libraries and sequencing. Funding was provided by NSERC Canada to B. J. Saville.

Copyright information

© Springer-Verlag 2004

Authors and Affiliations

  • Ryan Austin (1)
  • Nicholas J. Provart (1)
  • Nuno T. Sacadura (2)
  • Kimberly G. Nugent (2)
  • Mohan Babu (2)
  • Barry J. Saville (2)
  1. Department of Botany, University of Toronto, Toronto, Canada
  2. Department of Botany, University of Toronto at Mississauga, Mississauga, Canada
