Background

Secondary metabolites produced by fungi are a rich source of medically useful compounds because of their pharmaceutical and toxicological properties [1]. While secondary metabolites are not required for an organism’s growth or primary metabolism, they may provide important benefits in its environmental niche. For example, A. nidulans laeA mutants defective in the production of secondary metabolites are ingested more readily by the fungivorous arthropod, Folsomia candida, suggesting that secondary metabolite production can protect fungi from predation [2].

The Aspergilli are producers of a wide variety of secondary metabolites of considerable medical, industrial, agricultural and economic importance. For example, the antibiotic penicillin is produced by A. nidulans and the genes involved in the penicillin biosynthetic pathway have been extensively studied [3-5]. Sterigmatocystin (ST), an aflatoxin (AF) precursor, and many of the genes that are involved in its biosynthesis have also been extensively studied in A. nidulans [6-10]. AF is a secondary metabolite produced mainly by Aspergillus species growing in foodstuffs [11], and it is of both medical and economic importance as contaminated food sources are toxic to humans and animals when ingested. Gliotoxin is an extremely toxic secondary metabolite produced by several Aspergillus species during infection [12, 13]. The ability of this toxin to modulate the host immune system and induce apoptosis in a variety of cell-types has been most studied in the ubiquitous fungal pathogen, A. fumigatus [14, 15].

The availability of Aspergillus genomic sequences has greatly facilitated the identification of numerous genes involved in the production of other secondary metabolites. Based on the number of predicted secondary metabolite biosynthesis genes and the fact that the expression of many secondary metabolite gene clusters is cryptic [16], meaning that expression is not evident under standard experimental conditions, there appears to be the potential for production of many more secondary metabolites than currently known [17]. Secondary metabolite biosynthetic genes often occur in clusters that tend to be sub-telomerically located and are coordinately regulated under certain laboratory conditions [18-20]. Typically, a secondary metabolite biosynthetic gene cluster contains a gene encoding one of several key “backbone” enzymes of the secondary metabolite biosynthetic process: a polyketide synthase (PKS), a non-ribosomal peptide synthetase (NRPS), a polyketide synthase/non-ribosomal peptide synthetase hybrid (PKS-NRPS), a prenyltransferase known as dimethylallyl tryptophan synthase (DMATS) and/or a diterpene synthase (DTS).

Comparative sequence analysis based on known backbone enzymes has been used to identify potential secondary metabolite biosynthetic gene clusters for subsequent experimental verification. One approach for experimental verification is the deletion of genes with suspected roles in secondary metabolite biosynthesis followed by identification of the specific secondary metabolite profiles of the mutants by thin layer chromatography, NMR or other methods [7, 8]. For example, the deletion of A. fumigatus encA, which encodes an ortholog of the A. nidulans non-reducing PKS (NR-PKS) mdpG, followed by analysis of culture extracts using high-performance liquid chromatography (HPLC) enabled the recent identification of endocrocin and its biosynthetic pathway intermediates [21]. Similarly, the deletion of the gene encoding the PKS, easB, enabled the identification of the emericellamide biosynthetic pathway of A. nidulans [22]. Another approach is the overexpression of predicted transcriptional regulators of secondary metabolism gene clusters with subsequent analysis of the gene expression and secondary metabolite profiles of the resulting strains, which has facilitated the identification of numerous secondary metabolites and the genes responsible for their synthesis [23, 24]. For example, overexpression of laeA in A. nidulans, a global transcriptional regulator of secondary metabolism production, coupled with microarray analysis, facilitated the delineation of the cluster responsible for production of the anti-tumor compound, terrequinone A [18]. Thus, genome sequence analysis, coupled with targeted experimentation, has been a highly effective strategy for identifying novel secondary metabolites and the genes involved in their synthesis.

The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a web-based resource that provides centralized access to gene and protein sequences, analysis tools and manually curated information derived from the published scientific literature for A. nidulans, A. fumigatus, A. niger and A. oryzae [25, 26]. AspGD curators read the published experimental literature to record information including gene names and synonyms, write free-text descriptions of each gene, record phenotypes and assign terms that describe functional information about genes and proteins using the Gene Ontology (GO; http://www.geneontology.org). These annotations are an important resource for the scientific research community, used both for reference on individual genes of interest as well as for analysis of results from microarray, proteomic experiments, or other screens that produce large lists of genes.

The GO is a structured vocabulary for describing the functions associated with gene products [27]. GO terms describe the activity of a gene product (Molecular Function; MF) within the cell, the biological process (Biological Process; BP) in which a gene product is involved and the location within the cell (Cellular Component; CC) where the gene product is observed [28]. Evidence codes are assigned to GO annotations based on the type of available experimental evidence.

At the start of this project most of the terms needed to describe secondary metabolite biosynthetic genes or regulators of secondary metabolism did not yet exist in the GO. Thus, in order to provide an improved annotation of secondary metabolite biosynthetic genes and their regulatory proteins, we developed new GO terms for secondary metabolite production in collaboration with the GO Consortium, and reannotated the entire set of genes associated with secondary metabolism in AspGD. We then performed a comprehensive analysis of the secondary metabolism biosynthetic genes and their orthologs across the genomes of A. nidulans, A. fumigatus, A. niger and A. oryzae and now provide a set of manually annotated secondary metabolite gene clusters. We anticipate that these new, more precise annotations will encourage the rapid and efficient experimental verification of novel secondary metabolite biosynthetic gene clusters in Aspergillus and the identification of the corresponding secondary metabolites.

Results

Identifying genes for reannotation

Many branches of the GO, such as apoptosis and cardiac development [29], have recently been expanded and revised to include new terms that are highly specific to these processes. The secondary metabolism literature has expanded over the last several years, allowing AspGD curators to make annotations to an increasing number of genes with roles in secondary metabolism. During routine curation, it became apparent that hundreds of Aspergillus genes that were candidates for annotation to the GO term ‘secondary metabolic process’ had the potential for more granular annotations, since, in many cases, the specific secondary metabolite produced by a gene product is known. At the inception of this project, only terms for ‘aflatoxin biosynthetic process,’ ‘penicillin biosynthetic process’ and ‘sterigmatocystin biosynthetic process,’ the 3 most well-studied secondary metabolites to date, were present in the GO (Additional file 1).

Candidate genes for reannotation were identified as those that had pre-existing GO annotations to ‘secondary metabolic process’ or curated mutant phenotypes that impact secondary metabolite production. For example, numerous genes in AspGD are annotated with mutant phenotypes affecting the production of secondary metabolites such as asperthecin [30], austinol and dehydroaustinol [31], emericellin [32], fumiquinazolines [33], orsellinic acid [34], pseurotin A [35], shamixanthones [32, 36] and violaceol [37] among others. These genes were then analyzed and a list of new GO terms was generated to annotate these genes more specifically (Table 1 and Additional file 1).

Table 1 Number of Aspergillus genes with manual and computational GO annotations to ‘secondary metabolic process’

We also used published SMURF (Secondary Metabolite Unknown Regions Finder) predictions [38] to annotate additional candidate gene cluster backbone enzymes (i.e., PKS, NRPS, DMATS). SMURF is highly accurate at predicting most of these cluster backbone enzymes; across the four species of Aspergillus analyzed it identified a total of 105 genes as encoding PKS or PKS-like enzymes, 65 genes encoding NRPS or NRPS-like enzymes, 8 genes encoding putative hybrid PKS-NRPS enzymes and 15 DMATS. Note that DTS genes are not predicted by the SMURF algorithm. The AspGD Locus Summary pages now indicate these annotations based on the cluster backbone predictions generated by SMURF and by direct experimental characterization from the secondary metabolism literature.

Expansion of the secondary metabolism branch of the GO

To improve the accuracy of the AspGD GO annotation in the area of secondary metabolite production, a branch of the GO in which terms were sparse, we worked in collaboration with the GO Consortium to add new, more specific terms to the BP aspect of the ontology, and then used many of these new GO terms to annotate the Aspergillus genes that had experimentally determined mutant phenotype data associated with one or more secondary metabolites. We focused on the BP annotations because the relevant processes are well-represented in the experimental literature, whereas experimental data to support CC annotations are relatively sparse in the secondary metabolism literature. Adequate MF terms exist for the PKS and NRPS enzymes, but annotations to them in AspGD are mostly based on computationally determined domain matches and InterPro2GO annotations, or on annotations with Reviewed Computational Analysis (RCA) as the evidence code, meaning that these functions are predicted, rather than directly characterized through experiments.

The new GO annotations that we have added now precisely specify the secondary metabolite produced. For example, mdpG is known to influence the production of arugosin, emodin, monodictyphenone, orsellinic acid, shamixanthones and sterigmatocystin in A. nidulans. The gene was formerly annotated to the fairly nonspecific parental term ‘secondary metabolic process’ (GO:0019748), but because the secondary metabolites produced by this protein are known and published, it is now annotated to the new and more informative child terms ‘arugosin biosynthetic process’ (GO:1900587), ‘emodin biosynthetic process’ (GO:1900575), ‘monodictyphenone biosynthetic process’ (GO:1900815), ‘o-orsellinic acid biosynthetic process’ (GO:1900584), ‘shamixanthone biosynthetic process’ (GO:1900793) and ‘sterigmatocystin biosynthetic process’ (GO:0045461).

In total, we added 290 new BP terms to the GO for 48 secondary metabolites produced by one or more Aspergillus species. There are over 400 Aspergillus genes in AspGD that have been manually or computationally annotated to more specific secondary metabolism BP terms, based on over 260 publications (Table 2). A complete list of the GO terms for secondary metabolic process annotations is available in Additional file 1. The addition of new terms is ongoing as new secondary metabolites and their biosynthetic genes are identified and described in the scientific literature. The process of adding new GO terms depends on the elucidation of the structure of the secondary metabolite as the structure is required for new ChEBI (Chemical Entities of Biological Interest; http://www.ebi.ac.uk/chebi/) terms to be assigned, and these chemical compound terms are a prerequisite for GO term assignments involving chemical compounds. These new and improved GO terms provide researchers with valuable clues to aid in the identification of proteins involved in the production of specific classes of Aspergillus secondary metabolites.

Table 2 GO terms used for secondary metabolism annotations at AspGD

Predictive annotation using orthology relationships in conjunction with experimentally-based GO term assignments

Manual curation of the genes of one species can be used to computationally annotate the uncharacterized genes in another species based on orthology relationships. The use of GO to describe gene products facilitates comparative analysis of functions of orthologous genes throughout the tree of life, including orthologous genes within the filamentous fungi. To augment the manual GO curation in AspGD, we leveraged orthology relationships to assign GO annotations to genes that lacked manual annotations of their own but which had an experimentally characterized ortholog in AspGD, the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org) or PomBase (http://www.pombase.org). A total of 492 GO annotations were made to secondary metabolism-related genes in A. nidulans, A. fumigatus, A. niger and A. oryzae based on their orthology relationships (Table 3). Files listing these orthology relationships are available for download at http://www.aspergillusgenome.org/download/homology/orthologs/ and the files describing all GO term annotations for each gene product in AspGD are available at http://www.aspergillusgenome.org/download/go/. A list of all genes annotated to the secondary metabolic process branch of the GO and their associated annotations can be obtained through the AspGD Advanced Search Tool (http://www.aspergillusgenome.org/cgi-bin/search/featureSearch).
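The orthology-based transfer described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AspGD pipeline: the `Annotation` record, the function `transfer_annotations`, and the gene names `srcGene1`, `tgtGene1` and `tgtGene2` are hypothetical. The sketch also assumes that only annotations carrying an experimental evidence code are propagated, that genes with existing annotations of their own are skipped, and that transferred annotations are labeled with the electronic evidence code IEA.

```python
from dataclasses import dataclass

# GO evidence codes that denote direct experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IMP", "IGI", "IPI", "IEP"}


@dataclass(frozen=True)
class Annotation:
    gene: str
    go_id: str
    evidence: str
    source: str = ""  # ortholog the annotation was inferred from, if any


def transfer_annotations(annotations, orthologs):
    """Propose GO annotations for genes that have none of their own.

    annotations: list of existing Annotation records
    orthologs:   dict mapping a target gene to a list of its orthologs
    """
    by_gene = {}
    for ann in annotations:
        by_gene.setdefault(ann.gene, []).append(ann)

    transferred = []
    for target, sources in orthologs.items():
        if target in by_gene:  # assumption: annotated genes are left alone
            continue
        seen = set()
        for src in sources:
            for ann in by_gene.get(src, []):
                # Only experimentally supported annotations are propagated.
                if ann.evidence not in EXPERIMENTAL_CODES or ann.go_id in seen:
                    continue
                seen.add(ann.go_id)
                transferred.append(Annotation(target, ann.go_id, "IEA", src))
    return transferred


# Hypothetical genes; the GO identifiers are those quoted in the text.
curated = [
    Annotation("srcGene1", "GO:1900575", "IMP"),  # emodin biosynthetic process
    Annotation("srcGene1", "GO:0019748", "IEA"),  # predicted, not transferred
    Annotation("tgtGene2", "GO:0045461", "IMP"),  # target already annotated
]
ortholog_map = {"tgtGene1": ["srcGene1"], "tgtGene2": ["srcGene1"]}
predicted = transfer_annotations(curated, ortholog_map)
```

In this toy input only `tgtGene1` receives a transferred annotation, because `tgtGene2` already carries an annotation and the IEA annotation on `srcGene1` is not experimental.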

Table 3 Number of GO annotations for secondary metabolism that were transferred to and between Aspergillus species under curation at AspGD

Manual annotation of computationally predicted gene clusters

Algorithms such as SMURF [38] and antiSMASH (antibiotics and Secondary Metabolite Analysis SHell) [39] can be used to predict fungal secondary metabolite gene clusters. Both of these algorithms are based on the identification of backbone enzymes, usually one or more polyketide synthase (PKS), non-ribosomal peptide synthetase (NRPS), hybrid PKS-NRPS, NRPS-like enzyme or dimethylallyl tryptophan synthase (DMATS), and the use of a training set of experimentally characterized clusters. Adjacent genes are then scanned for the presence of common secondary metabolite gene domains and boundaries are predicted for each cluster. We used the pre-computed gene clusters for A. nidulans, A. fumigatus, A. niger and A. oryzae that were identified at the J. Craig Venter Institute (JCVI) with the SMURF algorithm [38]. We also used the antiSMASH algorithm [39] on these genomes to make gene cluster predictions and added 5 additional clusters for A. nidulans based on the presence of DTS/ent-kaurene synthase backbone enzymes.

Altogether, a total of 266 non-redundant clusters were predicted by SMURF and antiSMASH: 71 for A. nidulans, 39 for A. fumigatus, 81 for A. niger and 75 for A. oryzae (Tables 4, 5, 6, 7). Neither SMURF nor antiSMASH predicts DTS-based clusters, so these clusters were manually identified based on their annotations. Because clusters with other types of non-PKS and non-NRPS backbone enzymes were included in the antiSMASH predictions and SMURF only analyzes PKS, NRPS or DMATS-based clusters, antiSMASH identified more clusters than SMURF in every species except for A. niger (Table 8). For clusters identified by both algorithms, there were no cases where both the left and right boundary predictions were the same, although a small number of single boundary predictions did coincide with each other (Tables 4, 5, 6, 7). Both the experimentally and manually (see below) predicted clusters tend to be smaller than the SMURF and antiSMASH algorithms predict, as the algorithms are designed to err on the side of inclusivity while the manual boundaries are designed to provide increased precision of the cluster boundaries through the examination of inter- and intra-cluster genome synteny alignments across multiple Aspergillus species. SMURF was previously reported to overpredict boundaries by about 4 genes [38] and we found that antiSMASH performed similarly. Figure 1 shows an example of the disparity between these two prediction programs in cluster boundary determination and how intra- and inter-species cluster synteny data used in our analysis aids in the manual predictions of secondary metabolite gene cluster boundaries (see below).
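The kind of boundary comparison described above can be illustrated with a short Python sketch. It uses the locus ranges reported for the AN3497 cluster in the Figure 2 legend; the function names `parse_range`, `compare` and `extra_genes` are hypothetical. As a simplifying assumption, the numeric part of each A. nidulans locus name is treated as a position in a run of consecutive genes, which need not hold for every genomic region or for other naming schemes.

```python
import re


def parse_range(text):
    """Parse a locus range such as 'AN3491-AN3506' into (3491, 3506).

    Assumption: the digits of the locus name order the genes along the
    chromosome and adjacent genes differ by one.
    """
    left, right = (int(re.sub(r"\D", "", part)) for part in text.split("-"))
    return min(left, right), max(left, right)


def compare(pred_a, pred_b):
    """Report shared boundaries and the number of genes in both predictions."""
    a, b = parse_range(pred_a), parse_range(pred_b)
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    return {
        "same_left": a[0] == b[0],
        "same_right": a[1] == b[1],
        "overlap": overlap,
    }


def extra_genes(prediction, reference):
    """Genes predicted beyond the reference on the (left, right) side.

    A negative value means the prediction stops short of the reference.
    """
    p, r = parse_range(prediction), parse_range(reference)
    return r[0] - p[0], p[1] - r[1]


smurf = "AN3491-AN3506"
antismash = "AN3485-AN3503"
manual = "AN3490-AN3497"

agreement = compare(smurf, antismash)
smurf_extra = extra_genes(smurf, manual)
antismash_extra = extra_genes(antismash, manual)
```

Under these assumptions the two algorithmic predictions share neither boundary, and both extend several genes past the manually predicted right boundary.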

Figure 1

Genomic context of the predicted An03g05680 cluster of A. niger viewed with the Sybil multiple genome browser. Boundary predictions for A. niger CBS 513.88 identify predicted clusters in A. niger ATCC 1015, A. acidus and A. brasiliensis by matching orthologous protein clusters in Sybil. The red bar delineates the manually predicted cluster boundary based on cluster synteny between 2 A. niger strains and 2 additional Aspergillus species. The blue bar indicates the extent of the SMURF cluster prediction and the green bar indicates the antiSMASH-predicted boundaries.

Table 4 A. nidulans secondary metabolite biosynthetic gene clusters determined by SMURF, antiSMASH and by manual annotation or experimental characterization
Table 5 A. fumigatus secondary metabolite biosynthetic gene clusters determined by SMURF, antiSMASH and by manual annotation or experimental characterization
Table 6 A. niger secondary metabolite biosynthetic gene clusters determined by SMURF, antiSMASH and by manual annotation or experimental characterization
Table 7 A. oryzae secondary metabolite biosynthetic gene clusters determined by SMURF, antiSMASH and by manual annotation or experimental characterization
Table 8 Number of gene clusters predicted by SMURF, antiSMASH, or manual and experimental methods

Andersen et al. [16] recently reported another strategy for identifying the extent of secondary metabolite gene cluster boundaries. Their method uses genome-wide microarray expression studies from A. nidulans to identify coregulated genes surrounding secondary metabolite gene cluster backbone enzymes. Since secondary metabolite gene clusters often show cryptic expression under many laboratory growth conditions, this study generated expression data from cultures grown on a wide variety of media (to maximize the possibility of expression), and combined these data with previously generated expression data to analyze a superset of 44 expression conditions [16]. Their analysis produced a list of 53 predicted secondary metabolite gene clusters of A. nidulans, some of which show clear patterns of coregulated expression while some of the expressed backbone enzymes showed no correlation with adjacent genes. Five of these were DTS-based gene clusters not identified by the SMURF or antiSMASH algorithms. These data have been curated at AspGD and were used as a criterion for our manual cluster boundary predictions (see below). An example of the inpA- and inpB-containing gene cluster determined by this criterion is shown in Figure 2. The gene clusters of A. nidulans with all of the boundary predictions made with ‘expression pattern’ as the primary evidence are listed in Table 4. The total number of boundaries predicted using this criterion is summarized in Table 9.
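The idea of delimiting a cluster by coregulation with its backbone gene can be sketched as follows. This is a simplified stand-in, not the published method of Andersen et al.: it extends a span outward from the backbone gene for as long as each neighboring gene's expression profile has a Pearson correlation with the backbone profile at or above a cutoff. The gene names, expression values and the cutoff of 0.8 are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpressed_span(profiles, backbone_index, threshold=0.8):
    """Return the (leftmost, rightmost) gene coregulated with the backbone.

    profiles: list of (gene_name, [expression per condition]) in
              chromosomal order.
    """
    reference = profiles[backbone_index][1]
    left = backbone_index
    while left > 0 and pearson(profiles[left - 1][1], reference) >= threshold:
        left -= 1
    right = backbone_index
    while (right < len(profiles) - 1
           and pearson(profiles[right + 1][1], reference) >= threshold):
        right += 1
    return profiles[left][0], profiles[right][0]


# Hypothetical expression values across four growth conditions.
example = [
    ("g1", [5, 5, 5, 5]),    # constitutive, not coregulated
    ("g2", [1, 2, 8, 9]),
    ("g3", [1, 3, 9, 10]),   # backbone enzyme gene
    ("g4", [2, 3, 10, 12]),
    ("g5", [9, 8, 2, 1]),    # anticorrelated
]
span = coexpressed_span(example, backbone_index=2)
```

A contiguous-extension rule like this stops at the first uncorrelated neighbor, so it would miss clusters that contain an embedded, differently regulated gene; that limitation is one reason expression data are combined with other criteria.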

Figure 2

A. nidulans AN3497 gene cluster predicted based on the gene expression analysis of Andersen et al. 2013. Red bar indicates manually predicted cluster boundary (AN3490-AN3497) based on expression pattern and aligned with orthologous clusters of A. versicolor and A. sydowii. Blue bar indicates SMURF boundary prediction (AN3491-AN3506) and green bar indicates the antiSMASH-predicted boundary (AN3485-AN3503).

Table 9 Summary of primary criteria used for making manual secondary metabolite gene cluster boundary predictions

To generate a high-quality set of candidate secondary metabolite biosynthetic gene clusters, we used SMURF and antiSMASH as the source of cluster predictions, along with manually predicted DTS clusters and then manually refined the gene cluster boundaries. Manual cluster boundary annotations (Tables 4, 5, 6, 7 and Additional files 2, 3, 4, 5) were made based on several criteria: published experimental data (including gene expression studies), synteny between clustered genes among different species indicated by the presence of conserved gene cluster boundaries (Figure 1), functional annotation of predicted genes within and adjacent to clusters and increases in intergenic distance between boundary genes and adjacent genes, which we frequently observed (Figure 3). We determined that gene clusters tend to be conserved between species and that breaks in cluster synteny frequently indicate a cluster boundary. To the best of our knowledge, no gene cluster prediction algorithm or research group has used genomic comparisons between species for large-scale cluster predictions. We used the Sybil viewer [51], which displays alignments of orthologous genes across multiple species in their genomic context, to manually examine potential boundaries and to compare synteny between clusters of different species and/or strains (Figure 1) and the adjacent syntenic regions outside each predicted cluster. The genome sequence is available for two strains each of A. fumigatus (Af293 and A1163) and A. niger (CBS513.88 and ATCC 1015), which allowed us to consider cluster synteny, which approached 100%, between these strains in addition to the orthology between Aspergillus species.

Figure 3

Conserved cluster synteny between the gliotoxin cluster of A. fumigatus and the orthologous cluster of Neosartorya fischeri. The predicted gene cluster is indicated with a red bar. The left border of the Afu6g09650 cluster shows a small increase in intergenic distance while the right border shows a large change in intergenic distance. Both borders are examples of interspecies cluster synteny. Red bar indicates experimentally determined cluster boundary (Afu6g09630 - Afu6g09740). Blue bar indicates SMURF boundary prediction (Afu6g09580 - Afu6g09740) and green bar indicates the antiSMASH-predicted boundary (Afu6g09520 - Afu6g09745).

AspGD displays and provides sequence resources for 15 Aspergillus genomes and related species. A given genome is typically particularly closely related to that of one or two of the other species; the A. fumigatus genome best matches that of Neosartorya fischeri (see Sybil syntenic genomic context in Additional file 3), A. niger best matches A. acidus and A. brasiliensis (Additional file 4) and A. oryzae best matches A. flavus (Additional file 5). Unlike A. fumigatus, A. niger and A. oryzae, A. nidulans lacks such a closely related species in AspGD with sufficient synteny to enable routine use of cluster orthology in boundary determination. Therefore, we used other criteria such as published gene expression patterns [16], increases in intergenic distance and changes from secondary metabolism-related gene annotations to non-secondary metabolism-related gene annotations (described below) for making these predictions in A. nidulans (Figure 1). The numbers of manually predicted gene clusters in each of these additional species, determined by observing breaks in gene cluster synteny (see Methods), are summarized in Table 9.

In some cases, the functional annotation of the putative gene cluster members was informative in predicting cluster boundaries, especially for A. nidulans, which often lacked cluster synteny with other species present in AspGD. In addition to genes encoding the core backbone enzymes, clusters typically include one or more acyl transferase, oxidoreductase, hydrolase, cytochrome P450, transmembrane transporter and a transcription factor. We manually inspected each cluster and the genomic region surrounding it; changes in functional annotations from typical secondary metabolism annotations to annotations atypical of secondary metabolic processes were frequently observed upon traversing a cluster boundary (Additional files 2, 3, 4, 5) and this was used as an additional criterion for boundary prediction, especially in cases where inter- or intra-species clustering or published gene expression data were not available. In some instances, genes with functional annotations unrelated to secondary metabolism are embedded within a cluster. For example, A. nidulans bglD (AN7915) encodes a glucosidase present in the F9775 biosynthetic gene cluster (Additional file 2). In a cclAΔ strain background in which histone 3 lysine 4 methylation is impaired, the expression of cryptic secondary metabolite clusters, such as F9775, is activated [52]. The activation of bglD expression was observed along with other genes in the F9775 cluster and based on this pattern of coregulation, bglD is included as a member of this cluster [52]. It is unclear, however, whether bglD actually plays a role in F9775 biosynthesis. The gene encoding translation elongation factor 1 gamma, stcT, is a member of the ST gene cluster (stc) of A. nidulans. Its inclusion in the stc cluster was based on its pattern of coregulation with 24 other genes, some of which have experimentally determined roles in A. nidulans ST biosynthesis, or are orthologous to A. parasiticus proteins involved in AF production, for which ST is a precursor [46]. We also observed a gene, AN2546, that is expressed, and is predicted to encode a glycosylphosphatidylinositol (GPI)-anchored protein [53], located in the emericellamide cluster (Additional file 2); however, an AN2546 deletion strain still produces emericellamide, thus its inclusion in the cluster is based on its genomic location and expression pattern rather than function. These examples indicate that some genes are located within clusters and yet may not contribute to secondary metabolite production. The frequency and significance of unrelated genes that have become incorporated into a secondary metabolism gene cluster remains unclear; experimental verification is needed to further assess these. In cases where the cluster synteny data were compelling, cluster synteny was given higher precedence than functional annotation in the delineation of the cluster boundaries.

Increases in the distance between predicted boundary genes and the gene directly adjacent to a boundary (which we refer to as intergenic distance) were frequently observed. An example with a large intergenic distance at the right boundary is shown in the A. fumigatus gliotoxin (gli) cluster (Figure 3). However, we found that more subtle increases in intergenic distance were only somewhat reliable when compared to boundaries with experimental evidence. We therefore only based a cluster boundary prediction on an increase in intergenic distance in a small number of cases where no other data were available (Table 9).
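The intergenic-distance criterion can be expressed as a simple scan over gene coordinates. The sketch below is only an illustration of the idea, not the procedure used for the manual annotations (which relied on visual inspection): it flags a candidate boundary wherever the gap between adjacent genes exceeds a multiple of the median gap in the region. The gene names, coordinates and the factor of 3 are hypothetical.

```python
from statistics import median


def intergenic_gaps(genes):
    """Gaps between adjacent genes.

    genes: list of (name, start, end) tuples sorted by start coordinate
           on a single chromosome.
    Returns a list of (left_gene, right_gene, gap_in_bp).
    """
    return [
        (genes[i][0], genes[i + 1][0], max(0, genes[i + 1][1] - genes[i][2]))
        for i in range(len(genes) - 1)
    ]


def candidate_boundaries(genes, factor=3.0):
    """Flag gaps larger than `factor` times the median gap in the region."""
    gaps = intergenic_gaps(genes)
    typical = median([gap for _, _, gap in gaps])
    if typical <= 0:
        return []
    return [entry for entry in gaps if entry[2] > factor * typical]


# Hypothetical coordinates (bp) for six adjacent genes.
region = [
    ("gA", 1000, 2500),
    ("gB", 3000, 4200),
    ("gC", 4700, 6000),
    ("gD", 6400, 7800),
    ("gE", 12800, 14000),
    ("gF", 14500, 15500),
]
boundaries = candidate_boundaries(region)
```

A large jump such as the one between `gD` and `gE` is easy to flag, whereas subtle increases fall below any reasonable cutoff, which is consistent with the observation that small increases were only somewhat reliable.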

Discussion

AspGD provides high-quality manual and computational gene structure and function annotations for A. nidulans, A. fumigatus, A. niger and A. oryzae, along with sequence analysis and visualization resources for these and additional Aspergilli and related species. Among fungal databases, AspGD is the only resource performing comprehensive manual literature curation for Aspergillus species. AspGD contains curated data covering the entire corpus of experimental literature for A. nidulans, A. fumigatus, A. niger and A. oryzae, with phenotype and GO annotations for every gene described in the literature for these species, including those related to secondary metabolism. The direct, manual curation of genes from the literature forms the basis for the computational annotations at AspGD. Collected in a centralized, freely accessible database, this information is an indispensable resource for researchers.

During the course of curation, we identified gaps in the set of GO terms that were available in the Biological Process branch of the ontology. To improve the GO annotations for secondary metabolite biosynthetic genes, we added new, more specific BP terms to the GO and used these new terms for direct annotation of Aspergillus genes. These terms include the specific secondary metabolite in each GO term name. Because ‘secondary metabolic process’ (GO:0019748) and ‘regulation of secondary metabolite biosynthetic process’ (GO:0043455) map to different branches in the GO hierarchy, complete annotation of transcriptional regulators of secondary metabolite biosynthetic gene clusters, such as laeA, requires an additional annotation to the regulatory term that we also added for each secondary metabolite.

GO annotations facilitate predictions of gene function across multiple species and, as part of this project, we used orthology relationships between experimentally characterized A. nidulans, A. fumigatus, A. niger and A. oryzae genes to provide orthology-based GO predictions for the unannotated secondary metabolism-related genes in AspGD. The prediction and complete cataloging of these candidate secondary metabolism-related genes will facilitate future experimental studies and, ultimately, the identification of all secondary metabolites and the corresponding secondary metabolism genes in Aspergillus and other species.

The SMURF and antiSMASH algorithms are efficient at predicting gene clusters on the basis of the presence of certain canonical backbone enzymes; however, disparities between boundaries predicted by these methods became obvious when the clusters predicted by each method were aligned. While there was an extensive overlap between the two sets of identified clusters, in most cases the cluster boundaries predicted by SMURF and antiSMASH were different, requiring manual refinement.

The data analysis of Andersen et al. [16] used a clustering matrix to identify superclusters, defined as clusters with similar expression, independent of chromosomal location, that are predicted to participate in cross-chemistry between clusters to synthesize a single secondary metabolite. They identified seven superclusters of A. nidulans. Two known meroterpenoid clusters that exhibit cross-chemistry, and are located on separate chromosomes, are the austinol (aus) clusters involved in the synthesis of austinol and dehydroaustinol [31, 37]. The biosynthesis of prenyl xanthones in A. nidulans is dependent on three separate gene clusters [36]. This was apparent because the mdpG gene cluster was shown to be required for the synthesis of the anthraquinone emodin, monodictyphenone, and related compounds. Emodin and monodictyphenone are precursors of prenyl xanthones and the mdpG cluster lacked a prenyltransferase, required for prenyl xanthone synthesis [36]. A search of the A. nidulans genome for prenyltransferases that may participate in prenyl xanthone synthesis predicts seven prenyltransferases. Two strains (ΔxptA and ΔxptB) with mutated prenyltransferase genes at chromosomal locations distant from the mdpG cluster, have been described as being defective in prenyl xanthone synthesis. Therefore, while a total of 266 unique clusters were identified in our analysis, published data indicate that some of these clusters may function as superclusters that display cross-chemistry synthesis of a single secondary metabolite or group of related secondary metabolites [16, 31, 36].

Our manual annotation of secondary metabolite gene clusters in four Aspergillus species complements the computational prediction methods for identifying fungal secondary metabolites and the genes responsible for their biosynthesis. Implicit in our interspecies cluster synteny analysis is the prediction of secondary metabolite gene clusters orthologous to those in our curated species. For example, A. nidulans gene clusters most closely matched those in A. versicolor, thus identifying several new predicted A. versicolor gene clusters by orthology and interspecies cluster synteny with the predicted A. nidulans clusters (Additional file 2).

Conclusions

These new curated data, based on both computational analysis and manual evaluation of the Aspergillus genomes, provide researchers with a comprehensive set of annotated secondary metabolite gene clusters, together with functional annotation of the secondary metabolite gene products, within AspGD. We anticipate that these new data will promote research in this important and complex area of Aspergillus biology.

Methods

Generation of new GO terms

The Gene Ontology Consortium requires that any compounds within BP term names in the GO be cataloged in the Chemical Entities of Biological Interest (ChEBI) database (http://www.ebi.ac.uk/chebi/). To enable the creation of the new GO terms, we first requested and were assigned ChEBI identifiers for all secondary metabolites recorded in AspGD. Once ChEBI identifiers were assigned, we requested the relevant GO terms from the GO Consortium through TermGenie (http://go.termgenie.org/): biosynthetic process, metabolic process and catabolic process terms for each new secondary metabolic process term and regulation of secondary metabolic process term (Additional file 1).

Orthologous protein predictions

Jaccard-clustering, which groups together highly similar proteins within a genome of interest, was used to make ortholog predictions between the Aspergillus species and is described in detail at http://sybil.sourceforge.net/documentation.html#jaccard. Briefly, the first step of this algorithm identifies highly similar proteins within each genome of interest. The resulting groups (“clusters”) from multiple genomes are themselves grouped in the second step to form orthologous groups (“Jaccard Orthologous Clusters”). The corresponding genes can be subsequently analyzed in their genomic context to visually identify conserved synteny blocks that are displayed in the Sybil genome viewer (aspgd.broadinstitute.org). The ortholog predictions for all AspGD species are available for download at http://www.aspergillusgenome.org/download/homology/orthologs/. Orthologous protein predictions between Saccharomyces cerevisiae, Schizosaccharomyces pombe and the Aspergillus protein sets were made by pair-wise comparisons using the InParanoid software [54]. InParanoid was chosen based on compatibility with the existing ortholog analysis pipeline at AspGD, and comparable accuracy when compared with alternative methods [55]. Stringent cutoffs were used: BLOSUM80 and an InParanoid score of 100% (parameters: -F "m S" -M BLOSUM80). The data from this comparison are available for download at http://www.aspergillusgenome.org/download/homology/.
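
The first step of the procedure described above can be illustrated with a minimal Python sketch. It is not the Sybil implementation: the representation of each protein as a set of similarity hits, the single-linkage grouping and the cutoff of 0.6 are all assumptions made for illustration, and the protein identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two sets: |A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_jaccard(hits, cutoff=0.6):
    """Group proteins of one genome whose hit sets overlap strongly.

    hits maps a protein ID to the set of proteins it matches (itself included).
    The cutoff is an illustrative value, not one taken from the paper.
    """
    parent = {p: p for p in hits}

    def find(p):
        # Union-find lookup with path compression.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Link every pair of proteins whose coefficient reaches the cutoff.
    for p, q in combinations(hits, 2):
        if jaccard(hits[p], hits[q]) >= cutoff:
            parent[find(p)] = find(q)

    groups = {}
    for p in hits:
        groups.setdefault(find(p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical proteins: A1 and A2 share all hits, A3 matches only itself.
example_hits = {"A1": {"A1", "A2"}, "A2": {"A1", "A2"}, "A3": {"A3"}}
clusters = cluster_by_jaccard(example_hits)
```

In the second step, groups built this way in each genome would be linked across genomes to form the Jaccard Orthologous Clusters; that step is omitted here because its details are given only in the Sybil documentation.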

Orthology- and domain-based GO transfer

To augment the annotations for all genes, including secondary metabolism related genes, we used manual and domain-based GO annotations to annotate the predicted orthologs that lacked direct experimental characterization. Ortholog predictions for A. nidulans, A. fumigatus, A. niger and A. oryzae were made based on the characterized proteins of S. cerevisiae, S. pombe and the other Aspergillus species in AspGD. Candidate GO annotations to be used as the basis for these inferences are limited to those with experimental evidence, that is, with evidence codes of IDA (Inferred from Direct Assay), IPI (Inferred from Physical Interaction), IGI (Inferred from Genetic Interaction) or IMP (Inferred from Mutant Phenotype). Annotations that are themselves predicted in S. cerevisiae, S. pombe or in Aspergillus, either based on sequence similarity or by some other methods, are excluded from this group to avoid transitive propagation of predictions. Also excluded from the predicted annotation set are annotations that are redundant with existing, manually curated annotations or those that assign a related but less specific GO term. The orthology-based GO assignments are given the evidence code IEA (Inferred from Electronic Annotation) and displayed with the source species and name of the gene from which they were derived, along with a hyperlink to the appropriate gene page at AspGD, SGD or PomBase. The new annotations that have been manually assigned or electronically transferred from S. cerevisiae and S. pombe to A. nidulans, A. fumigatus, A. niger and A. oryzae, and between the Aspergillus species are summarized in Table 3.
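
The filtering rules in this paragraph can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the AspGD pipeline: the function name, the data structures and the placeholder GO identifiers are hypothetical, and the GO hierarchy is reduced to a precomputed map from each term to its less specific ancestors.

```python
# Evidence codes that qualify an annotation as experimentally supported.
EXPERIMENTAL_CODES = {"IDA", "IPI", "IGI", "IMP"}


def transferable_annotations(source_annotations, existing_terms, ancestors):
    """Select ortholog annotations that may be transferred to a target gene.

    source_annotations: iterable of (go_id, evidence_code) on the characterized ortholog.
    existing_terms: GO IDs already manually curated on the target gene.
    ancestors: dict mapping a GO ID to the set of its less specific ancestor IDs.
    Returns a list of (go_id, "IEA") pairs.
    """
    # A candidate is redundant if it equals a curated term or one of its ancestors.
    covered = set(existing_terms)
    for term in existing_terms:
        covered |= ancestors.get(term, set())

    transferred = []
    for go_id, code in source_annotations:
        if code not in EXPERIMENTAL_CODES:
            continue  # predicted annotations are not propagated further
        if go_id in covered:
            continue  # redundant with, or less specific than, a curated term
        transferred.append((go_id, "IEA"))
    return transferred


# Placeholder identifiers, not real GO terms.
source = [
    ("GO:specific", "IMP"),  # already curated on the target gene
    ("GO:general", "IDA"),   # ancestor of a curated term
    ("GO:other", "ISS"),     # sequence-similarity evidence, excluded
    ("GO:new", "IGI"),       # experimental and not redundant
]
result = transferable_annotations(
    source, {"GO:specific"}, {"GO:specific": {"GO:general"}}
)
```

Only the last candidate survives both filters, and it is recorded with the IEA evidence code.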

Domain-based GO transfers were assigned a lower precedence than orthology-based transfers. IprScan predicts InterPro domains based on protein sequences [56]. The InterPro2GO mapping file (http://www.ebi.ac.uk/interpro) was used to map GO annotations to genes with the corresponding domain predictions. A domain-based GO prediction was made only if it was not redundant with an existing manually curated or orthology-based GO term, or one of its parental terms, that was already assigned to an orthologous protein.

Finally, descriptions for genes lacking manual or GO-based annotations were constructed from the manual GO terms assigned to characterized orthologs. GO annotations were included with the following precedence: BP, followed by MF, and then CC. For genes that lacked experimental characterization and characterized orthologs, but had functionally characterized InterPro domains, descriptions were generated from the domain-based GO annotations. The same precedence rules applied as to the descriptions generated using orthology-based GO information. For genes that lacked experimental characterization and characterized orthologs, and without functionally characterized InterPro domains, but had uncharacterized orthologs, the descriptions simply list the orthology relationship because no inferred GO information was available.
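
The fallback order for generating descriptions can be sketched as follows. This is a hedged illustration only: the function name, the input dictionaries and the wording of the output strings are assumptions, not the text actually displayed in AspGD, and the gene name in the example is hypothetical.

```python
# Order in which GO aspects contribute to a description.
ASPECT_ORDER = ("BP", "MF", "CC")


def describe(ortholog_go, domain_go, ortholog_names):
    """Build a description for a gene lacking manual annotation.

    ortholog_go / domain_go: dict mapping an aspect ("BP", "MF", "CC")
    to a list of GO term names inferred by orthology or by domain.
    ortholog_names: names of orthologs, used only when no GO term is available.
    """
    # Orthology-based terms take precedence over domain-based terms.
    for label, go_by_aspect in (("orthology", ortholog_go), ("domain", domain_go)):
        terms = [t for aspect in ASPECT_ORDER for t in go_by_aspect.get(aspect, [])]
        if terms:
            return f"Inferred by {label}: " + "; ".join(terms)
    # No inferred GO information: list the orthology relationship only.
    if ortholog_names:
        return "Ortholog of " + ", ".join(ortholog_names)
    return ""


by_orthology = describe(
    {"CC": ["nucleus"], "BP": ["sterigmatocystin biosynthetic process"]}, {}, []
)
by_domain = describe({}, {"MF": ["transferase activity"]}, [])
by_name = describe({}, {}, ["geneX"])  # geneX is a hypothetical gene name
```

The aspect order is applied identically to both sources, which mirrors the statement that the same precedence rules hold for orthology-based and domain-based descriptions.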

Secondary metabolic gene cluster analysis and annotation

The pre-computed results file (smurf_output_precomputed_08.13.08.zip) was downloaded from the SMURF website (http://jcvi.org/smurf/index.php). Version 1.2.1 of the antiSMASH program [39] was downloaded from http://antismash.secondarymetabolites.org/ and run locally on the chromosome and/or contig sequences of A. nidulans FGSC A4, A. fumigatus Af293, A. niger CBS 513.88 and A. oryzae RIB40. Details of the parameters the antiSMASH program uses to predict boundaries are described in Medema et al. 2011 [39] and those for SMURF are described in Khaldi et al. 2010 [38]. The secondary metabolic gene clusters predicted by these programs were manually analyzed and annotated using functional data available for each gene in AspGD. Cluster membership was determined based on physical proximity of candidate genes to cluster backbone genes. Adjacent genes were added to the cluster if they had functional annotations common to known secondary metabolism genes. In cases where backbone genes had Jaccard orthologs in other species (see above), we required orthology between all other cluster members. Confirmation of orthology between clusters was facilitated by use of the Sybil multiple genome browser, which can be used to evaluate synteny between species. We visually evaluated synteny by examining whether a gene that was putatively in a cluster had orthologs in the other species. Where a gene in the species in which the cluster was identified no longer had adjacent orthologs in the other species, we inferred a break in synteny. Cluster boundaries were also determined by changes in common functional annotation, or by an increase in intergenic distances. tRNAs and other non-coding RNAs were excluded from the cluster boundary analysis. Annotated images of the orthologous gene clusters are included in Additional files 2, 3, 4, 5.
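
Two of the boundary criteria described above, shared functional annotation and intergenic distance, can be expressed as a simple outward walk from the backbone gene. The Python sketch below is an illustration only: the annotation in this study was manual, no numerical distance threshold is given in the text, and the `max_gap` value, the `Gene` record and the gene names are hypothetical. The synteny criterion is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    start: int
    end: int
    sm_like: bool  # annotation in common with known secondary metabolism genes


def extend_cluster(genes, backbone_index, max_gap=5000):
    """Extend a cluster outward from its backbone gene.

    genes: protein-coding genes of one chromosome sorted by start coordinate,
    with tRNAs and other non-coding RNAs already removed.
    max_gap: assumed upper bound on intergenic distance within a cluster.
    """
    left = right = backbone_index
    # Walk upstream until the annotation changes or the gap grows too large.
    while left > 0:
        candidate, inner = genes[left - 1], genes[left]
        if not candidate.sm_like or inner.start - candidate.end > max_gap:
            break
        left -= 1
    # Walk downstream under the same two conditions.
    while right < len(genes) - 1:
        inner, candidate = genes[right], genes[right + 1]
        if not candidate.sm_like or candidate.start - inner.end > max_gap:
            break
        right += 1
    return [g.name for g in genes[left:right + 1]]


chromosome = [
    Gene("g1", 0, 1000, True),        # excluded: 8 kb gap to g2
    Gene("g2", 9000, 10000, True),
    Gene("pks", 10500, 17000, True),  # backbone gene
    Gene("g4", 17500, 18500, True),
    Gene("g5", 19000, 20000, False),  # excluded: unrelated annotation
]
members = extend_cluster(chromosome, backbone_index=2)
```

In practice such an automated pass would only propose candidate boundaries, which would still need the manual review and synteny inspection described in this section.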