Introduction

Accidental oil spills have become an important global problem because of the serious environmental damage caused by soil and water contamination [1]. Because oil is a complex mixture of aromatic and aliphatic hydrocarbons of different molecular weights, its removal from the environment is difficult and its persistence is prolonged [2]. These compounds have gained considerable attention because of harmful features such as resistance to degradation, bioaccumulation, and carcinogenic activity. Their persistence in the environment increases with their molecular weight, so there is a need to develop technologies or processes able to degrade these compounds or to transform them into less toxic molecules [3].

The ability of several organisms, primarily microorganisms (bacteria, fungi, and microalgae), to degrade these toxic substances has been extensively studied in recent decades [1, 4, 5, 6, 7, 8, 9]. The main goal is to improve the decontamination of the environment via bioremediation, which encompasses technologies that transform compounds into less harmful or harmless forms with less use of chemicals, energy, and time [10, 11]. Microbial bioremediation is very effective because of the catabolic activity of microorganisms; many species of bacteria, fungi, and microalgae have demonstrated the ability to degrade hydrocarbons. This process involves the breakdown of organic molecules through biotransformation into less complex metabolites, or mineralization to water, carbon dioxide, or methane [3].

Several strategies have been employed to study these microorganisms and to understand the processes they carry out. Among these, genomics has allowed the recognition of promoters, genes, and degradation pathways, which informs the construction of more efficient degradative strains relevant to bioremediation processes [12, 13]. Genome sequencing of hydrocarbon-degrading organisms has allowed the identification of several genes involved in the metabolism and catabolism of aliphatic, aromatic alcohols, and other similar compounds, as well as some metal resistance genes [14]. However, far fewer fungal genomes than bacterial genomes have been sequenced: as of July 2017, the GenBank database held 103,076 sequenced prokaryotic genomes but only 4503 genomes from eukaryotes.

Scedosporium apiospermum (teleomorph: Pseudallescheria apiosperma [15]) is a fungus belonging to the phylum Ascomycota that has been isolated from various environments, usually those influenced by human activity [16]. This fungus was reported as a hydrocarbon-degrading microorganism in 1998 because of its ability to degrade polluting compounds such as phenol and p-cresol [17]. One year later, its ability to degrade phenylbenzoate and its derivatives was elucidated [18]. In recent years, studies have emerged on the degradation of more complex compounds, such as toluene [19], polycyclic aromatic hydrocarbons (PAHs) [20], long-chain aliphatic hydrocarbons, and mixtures of these contaminants (unpublished results from our group) [21]. Additionally, the ability of the fungus to regenerate granular activated carbon once it has been saturated with phenol was shown in our laboratory (unpublished results).

Therefore, Scedosporium apiospermum presents a wide range of opportunities in bioremediation, and sequencing its genome can allow the identification of promoters, genes, and hydrocarbon degradation pathways. Indeed, the genomic analysis of this fungus can improve the understanding of the functional dynamics of microbial degradation of contaminants and help improve conditions for effective decontamination processes in different environments [2]. On the other hand, this fungus has been recognized as a potent etiologic agent of severe infections in immunocompromised and, occasionally, immunocompetent patients [22]. For this reason, in 2014, the genome of an isolate from a cystic fibrosis patient (clinical strain) was sequenced with the aim of gaining knowledge of its pathogenic mechanisms [23].

Thus, our objective was the complete characterization of the genome of the S. apiospermum environmental strain HDO1. In order to analyze the genes and pathways involved in the degradation process and to assess the unique components of its genome compared to the clinical strain and other sister species, we sequenced, assembled, annotated, and fully characterized the environmental strain’s genome.

Organism information

Classification and features

Scedosporium apiospermum environmental strain HDO1 was isolated as a contaminant from assays on bacterial strains able to grow in crude oil (API gravity 33) as the sole carbon source. It was selected for sequencing because of its capability to grow in laboratory cultures containing aliphatic hydrocarbons of crude oil, naphthalene, phenanthrene, phenol, and mixtures of these compounds. The fungal isolate was grown on potato dextrose agar (PDA; OXOID LTD, Hampshire, UK) plates for 7 days at 30 °C. The optimal growth temperature was 30–37 °C. Identification was based on the following morphological characteristics: obverse and reverse colony color (according to the color chart of Küppers, H. [24]), colony texture, size, presence of diffusible pigments, hyphal characteristics, and conidial arrangement. The colonies reached a diameter of 7 cm on PDA at 25 °C after 7 days and were cottony textured and greyish-white (N00, C00-A00) with a yellowish-white reverse. No diffusible pigment was observed. The mycelium was hyaline, septate, and thin. Unbranched conidiophores with long-necked, bottle-shaped phialides were observed. Conidia were hyaline, approximately 5 μm in diameter, and occurred in basipetal chains leaving long hyaline annellides (Fig. 1).

For the molecular characterization, the fungus was grown in Sabouraud broth at 25 °C and 150 rpm for 7 days, and the biomass obtained was lyophilized for at least 12 h. Fungal genomic DNA was extracted from 100 mg of lyophilized and pulverized mycelia using the CTAB and phenol/chloroform/isoamyl alcohol method [25]. The universal primers used for amplification of the ITS region were ITS4 (5′-TCCTCCGCTTATTGATATGC-3′) and ITS5 (5′-GGAAGTAAAAGTCGTAACAAGG-3′) [26]. Sanger sequencing was performed by Macrogen (South Korea). The nucleotide sequences obtained were compared with the non-redundant database of the National Center for Biotechnology Information (NCBI) using the tBlastx program with default parameters, and the ITS region sequences were assigned to the fungus Scedosporium apiospermum with an E-value of 0.0, 100% query coverage, and 100% identity. The sequence obtained is deposited in the NCBI GenBank nr database under the accession number JQ003882.1.

Fig. 1 Micrographs of Scedosporium apiospermum. (a) Optical microscopy of hyphae and conidia from a PDA culture, at 100× total magnification; lactophenol cotton blue wet mount preparation. (b) Scanning electron microscopy of hyphae and conidia from a liquid culture grown in minimal salt medium plus crude oil as the sole carbon and energy source

A phylogenetic analysis was performed using the large subunit (LSU) rRNA gene, the internal transcribed spacer (ITS), and the elongation factor 1-α (TEF) sequences obtained from GenBank. Species from the Microascaceae family were included [27, 28] and are described in Additional file 1: Table S3. Individual gene regions (LSU, ITS, and TEF) were aligned using MAFFT v. 7.187 [29]. Maximum likelihood analyses were performed using RAxML v.7.6.3 [30] as implemented on the CIPRES portal [31]. The sequence alignment was partitioned into three subsets, each under a specified model of nucleotide substitution chosen with PartitionFinder [32]. Separate estimates of the shape parameter, GTR rates, and base frequencies were allowed for each partition. The majority rule criterion implemented in RAxML [33] (autoMRE) was used to assess clade support by bootstrap. The resulting trees were plotted using FigTree v. 1.4.2 [34]. Microascus longirostris and Scopulariopsis brevicaulis were used as outgroups. The environmental strain HDO1 used in this study clustered with the clinical strain IHEM 14462 with good support, and together they form the sister group of Trichurus spiralis CBS635.78 (Fig. 2). The whole group is contained within the Wardomycopsis lineage described by Sandoval-Denis, M. et al. [28]. A summary of the classification and general features of S. apiospermum is given in Table 1.
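The partitioning step described above can be illustrated with a minimal sketch, assuming toy sequences and hypothetical taxon labels rather than the alignments used in the study. It joins per-gene alignments into one matrix, pads missing genes with gaps, and prints the coordinate range of each partition in the `DNA, name = start-end` style used by RAxML partition files. The function name `concatenate` is an illustrative choice, not part of any tool cited here.

```python
# Toy alignments (hypothetical sequences, not data from the study).
# Each dict maps a taxon label to its aligned sequence for one gene region.
alignments = {
    "LSU": {"HDO1": "ACGT-A", "IHEM14462": "ACGTTA", "outgroup": "ACGATA"},
    "ITS": {"HDO1": "GGCA", "IHEM14462": "GGCA"},
    "TEF": {"HDO1": "TTGACC", "IHEM14462": "TTGACC", "outgroup": "TTAACC"},
}


def concatenate(alignments):
    """Join per-gene alignments into one matrix and record partition ranges."""
    taxa = sorted({taxon for aln in alignments.values() for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []
    start = 1  # partition coordinates are 1-based and inclusive
    for gene, aln in alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are not aligned to equal length")
        length = lengths.pop()
        for taxon in taxa:
            # A taxon lacking this gene is padded with gap characters.
            supermatrix[taxon] += aln.get(taxon, "-" * length)
        partitions.append((gene, start, start + length - 1))
        start += length
    return supermatrix, partitions


supermatrix, partitions = concatenate(alignments)
for gene, first, last in partitions:
    print(f"DNA, {gene} = {first}-{last}")
```

Each partition range can then be given its own substitution model parameters, which is the idea behind estimating shape, rates, and base frequencies separately per subset.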

Fig. 2 Phylogenetic analysis of S. apiospermum HDO1. Estimated relationships of S. apiospermum HDO1 with S. apiospermum IHEM 14462 and other species from the Microascaceae family. The tree shows the concatenated analysis of the internal transcribed spacer, the large subunit, and the elongation factor gene regions. Sequences from reference strains were used (Additional file 1: Table S3). Support values represent bootstrap support values (maximum likelihood)

Table 1 Classification and general features of Scedosporium apiospermum strain HDO1

Genome sequencing information

Genome project history

The genome of isolate HDO1 was sequenced by NovoGene Technology Bioinformatics Co., Ltd. (Hong Kong). The whole genome shotgun project of S. apiospermum has been deposited in the NCBI database under the accession number MVOQ00000000, belonging to BioProject PRJNA357602. A summary of the project and information about the genome sequence are shown in Table 2.

Table 2 Project information

Growth conditions and genomic DNA preparation

The fungus was grown in liquid culture (YPG: 1% yeast extract, 1% peptone, and 2% glucose) at 30 °C for 7 days, followed by vacuum filtration, lyophilization, and maceration to obtain a homogeneous sample. DNA was extracted by the CTAB and phenol/chloroform/isoamyl alcohol method [25]. DNA quality was analyzed by Nanodrop 2000 (Thermo Fisher Scientific, MA, USA) and agarose gel electrophoresis (0.8%). DNA quantity was determined by Qubit 2.0 (Invitrogen, CA, USA).

Genome sequencing and assembly

Genome sequencing of the strain was performed using high-throughput Illumina technology on a HiSeq 2500 with two libraries: a 250 bp paired-end library and a 5 kbp mate-pair library. Quality trimming of reads was performed using Trimmomatic 0.23 [35], and quality control was performed using FastQC 0.11.2 [36]. Coverage and depth of sequencing were analyzed by mapping the reads using Bowtie2 2.2.4 [37]; the SAM files were converted to BAM files using SAMtools 1.1 [38] and visualized using Tablet 1.15.09.1 [39]. The genome was assembled de novo using ABySS 1.0.9.20 [40] with a k-mer size of 64, scaffolds were generated with SSPACE BASIC 2.0 [41], and gaps were reduced using GapFiller 1.1 [42]. Assembly statistics were obtained using Quast 2.3 (Additional file 1: Table S1) [43]. Repetitive elements were identified with RepeatMasker 4.0.5 [44].

The draft genome of S. apiospermum strain HDO1 was assembled from a total of 97,208,043 reads using the ABySS assembler [40]. The assembly yielded 178 scaffolds (larger than 500 bp) with a genome size of 44.2 Mbp, a G + C content of 49.91%, and a mean depth of 541X. The genome assembly statistics are shown in Table 3. Non-coding repeats identified with RepeatMasker [44] accounted for 1.93% of the genome. The most abundant classes were simple repeats (0.89%) and low complexity regions (0.25%). The complete report of the annotation results for the non-coding repeat sequences can be seen in Additional file 1: Table S2. The assembly features obtained for the draft sequence were similar to those of other fungal genome sequencing projects [23, 45, 46].
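Two of the summary statistics mentioned above, the G + C content and scaffold contiguity as measured by N50, can be computed from a set of scaffolds as in the following minimal sketch. It assumes toy sequences rather than the HDO1 assembly; the function names `n50` and `gc_percent` are illustrative, and excluding ambiguous bases (N) from the G + C denominator is an assumption made here, not a documented setting of the tools used in the study.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty assembly")


def gc_percent(sequences):
    """G + C percentage over unambiguous bases (A, C, G, T) only."""
    gc = 0
    acgt = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        acgt += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / acgt


# Toy scaffolds (hypothetical, for illustration only).
scaffolds = ["ACGTACGTAC", "GGGCCC", "ATAT", "NNGC"]
scaffold_n50 = n50([len(s) for s in scaffolds])
scaffold_gc = gc_percent(scaffolds)
print(f"N50 = {scaffold_n50} bp, G + C = {scaffold_gc:.2f}%")
```

For a real assembly the scaffold sequences would be read from the FASTA file, and a length cutoff such as the 500 bp threshold mentioned above would be applied before computing the statistics.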

Table 3 Genomic statistics

Genome annotation

Gene prediction and structural annotation were conducted using Augustus 3.0.3 [47]. Functional annotation was performed using Blast2GO 3.1 [48]. Briefly, a BLASTx search against the National Center for Biotechnology Information “nr” database [49] was conducted. Then, results were classified among Gene Ontology categories [50]. Protein classification was made using the COG [51], KOG (Eukaryotic Orthologous Groups) [52], and EggNOG [53] databases on the Blast2GO v4.0 platform [48]. Annotated genes were mapped against the Kyoto Encyclopedia of Genes and Genomes (KEGG) [54] for functional analysis and assignment of Enzyme Codes. A total of 11,195 protein-encoding genes were predicted using Augustus [47]. Functional annotation showed a total of 8595 (76.0% of predicted genes) sequences with a predicted function using BLASTx [49]. Then, InterProScan [55] and Gene Ontology [56] permitted the annotation of 7934 (70.3%) sequences with GO terms, whilst the remaining genes were annotated as hypothetical (17.1%) and unknown function proteins (5.0%). A total of 7978 (70.8%) genes contained Pfam [57] domains and 1333 had signal peptide domains. The transmembrane helices in the proteins were predicted with the TMHMM server v.2.0 in the online portal [58]. The ribosomal RNA genes were predicted with the RNAmmer 1.2 server [59] and by alignment with the genes predicted for Neurospora crassa in the FungiDB database [60]; the same database was used for pseudogene prediction by comparison with the pseudogenes predicted for Neurospora crassa. The statistics of the genome annotation are shown in Table 3. A total of 4789 (42.5%) genes were assigned to the KOG [61] categories; most of them (60%) were assigned to one or more functional groups and the rest were assigned to the function unknown group (Table 4). KEGG pathway analysis assigned an enzyme code to 2645 (23.5%) genes and revealed specific genes involved in the pathways of hydrocarbon degradation. These hydrocarbons are chloroalkane/alkene, chlorocyclohexane and chlorobenzene, benzoate, aminobenzoate, fluorobenzoate, toluene, caprolactam, geraniol, naphthalene, styrene, atrazine, dioxin, xylene, ethylbenzene, and polycyclic aromatic hydrocarbons. The analysis also revealed the presence of genes involved in the metabolism of xenobiotics by cytochrome P450 and in the synthesis and degradation of ketone bodies. These results are shown in Fig. 3.

Table 4 Number of genes associated with general COG functional categories
Fig. 3 Distribution of the hydrocarbon degradation genes in KEGG pathways. The bars represent the number of genes mapped to KEGG pathways related to hydrocarbon degradation. Most of the genes were mapped to the pathways for benzoate and its derivatives, such as aminobenzoate and fluorobenzoate

Genome properties

The assembled genome of strain HDO1 has a size of 44,188,879 bp (distributed in 178 scaffolds) with a G + C content of 49.91%; the genome size and the G + C content were similar to those of the draft genome reported for strain IHEM 14462 [22] (Table 5). A total of 11,278 genes were predicted; among these, 11,184 were identified as protein-coding genes (99.16% of the total genes), 92 as RNA genes (0.81%), and 2 as pseudogenes (0.02%) (Table 3). Some other features of the predicted genes are shown in Table 4. The number of chromosomes could not be elucidated.

Table 5 Genomic features comparison between HDO1 strain and IHEM 14462 strain [22]

Insights from the genomic sequence

Comparative genomics

Reads were mapped against the clinical strain IHEM 14462 using Bowtie2 2.2.4 [37]. The SAM files were converted to BAM files using SAMtools 1.1 [38] and visualized using Tablet 1.15.09.1 [39], resulting in an overall alignment of 92.75%. The genomes of the environmental strain HDO1 and the clinical strain IHEM 14462 were compared using MAUVE 20150226 [62]. The genome sequence of strain HDO1 aligned with the sequence of strain IHEM 14462 over 88.1% of its length. The MAUVE [62] alignment showed a high level of similarity between the clinical and the environmental strains (Fig. 4). A total of 508 locally collinear blocks (LCBs), which correspond to the homologous regions shared by the two sequences, were found, and a few of them were in reverse orientation after eight reordering cycles. From the ordered output FASTA file obtained with MAUVE [62], a new alignment was made with Nucmer at the nucleotide level (maximum gap between two adjacent matches in a cluster of 90 bp and a minimum length of a maximal exact match of 20 bp) and with Promer at the amino acid level (maximum gap between two adjacent matches in a cluster of 30 amino acids and a minimum length of a maximal exact match of 6 amino acids). Nucmer and Promer alignments were plotted using Mummerplot; all three tools belong to the MUMmer 3.0 suite [63] (Fig. 5). This analysis revealed that forward matches predominate in the largest scaffolds of the HDO1 genome sequence, whereas reverse matches are more common in the smallest scaffolds. The differences and similarities seen at the nucleotide level showed the same trends at the amino acid level. These analyses and their corresponding plots also made it possible to identify rearrangements, insertions, and deletions between the two genomes.
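The distinction between forward and reverse matches in such a comparison can be illustrated with a minimal sketch based on exact k-mer seeds. This is a deliberate simplification: the MUMmer tools work with maximal exact matches that are then clustered and extended, which this toy code does not attempt. The sequences, the seed length, and the function names are hypothetical choices for illustration.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_positions(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index


def seed_matches(reference, query, k):
    """Return (forward, reverse) lists of (reference_pos, query_pos) exact seeds."""
    index = kmer_positions(reference, k)
    forward, reverse = [], []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        # Same strand: the query k-mer occurs as-is in the reference.
        forward += [(i, j) for i in index.get(kmer, [])]
        # Opposite strand: the reverse complement occurs in the reference.
        reverse += [(i, j) for i in index.get(reverse_complement(kmer), [])]
    return forward, reverse


# Toy sequences (hypothetical): the query shares one seed on the same strand
# and one seed on the opposite strand with the reference.
reference = "AAACCCGT"
query = "AACCACGG"
forward, reverse = seed_matches(reference, query, k=4)
print("forward seeds:", forward)
print("reverse seeds:", reverse)
```

In a dot plot, each forward seed would be drawn as a point on a rising diagonal and each reverse seed on a falling one, which is how inversions between two assemblies become visible.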

Fig. 4 MAUVE [62] alignment of the draft genome sequence of strain HDO1 and the draft genome sequence of strain IHEM 14462. The figure represents the locally collinear blocks (LCBs) that both sequences share, connected by lines to show their positions in the genomes. The sequence of strain HDO1 is shown at the top and the re-ordered sequence of strain IHEM 14462 [23] at the bottom. Blocks drawn below the center line indicate regions in reverse orientation relative to the HDO1 sequence

Fig. 5 Dot plot analysis comparing the genomes of strains HDO1 and IHEM 14462. (a) Comparison at the nucleotide level. (b) Comparison at the protein level. The plots show the alignment of the genome sequence of strain IHEM 14462 (y axis) against the HDO1 genome sequence (x axis). Red lines and dots represent forward matches between the two genome sequences, while blue ones represent reverse matches

A thorough comparative analysis showed some important differences between the draft genome sequences of the clinical strain and the environmental strain sequenced here. These differences were evident in the genome size of the assemblies and the number of predicted genes (Table 5). Indeed, our assembly had 783,135 bp (1.77% of the genome size) and 276 coding sequences more than the clinical strain. The remarkable difference in the number of annotated genes involved in hydrocarbon degradation pathways could be attributed to the pipeline followed to annotate genes. For the clinical strain, the CDSs found were annotated against the TrEMBL database [64], which only comprises UniProtKB/Swiss-Prot, while in this study we used the nr (non-redundant protein sequences) database of NCBI, which has a wider coverage because it comprises sequences obtained from other databases such as GenPept, TPA, PIR, PRF, PDB, NCBI RefSeq, and UniProtKB/Swiss-Prot [65]. Since the repetitive elements of the genome were estimated as only 1.93%, it is highly probable that the difference in size can be attributed to some of the elements involved in functional categories.
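The size difference quoted above can be checked with a short calculation using only the figures given in the text. The clinical assembly size below is derived from those two figures and is an implied value, not one reported in this section; the variable names are illustrative.

```python
hdo1_size_bp = 44_188_879   # assembled size of HDO1 (from the text)
extra_bp = 783_135          # additional bases relative to the clinical strain (from the text)

# Implied size of the clinical assembly (derived, not quoted from the source).
implied_clinical_size_bp = hdo1_size_bp - extra_bp

# Share of the HDO1 assembly that the additional bases represent.
extra_fraction_pct = 100 * extra_bp / hdo1_size_bp

print(f"implied clinical assembly size: {implied_clinical_size_bp:,} bp")
print(f"additional sequence in HDO1: {extra_fraction_pct:.2f}% of the assembly")
```

The percentage is expressed relative to the HDO1 assembly, which matches the 1.77% stated above.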

Genes involved in hydrocarbon biodegradation pathways

Several genes involved in hydrocarbon biodegradation pathways were annotated in the genome of the environmental strain. Table 6 shows the genes previously reported in the clinical strain [23]. The results revealed that some genes are involved in several degradation pathways, principally those corresponding to aromatic hydrocarbon metabolism (polycyclic aromatic hydrocarbons and phenolic compounds) and the cytochrome P450 system. The number of these genes annotated for each strain is also given in the table, and these values show a higher number of genes in the environmental strain HDO1. The genes found solely in the draft sequence of strain HDO1 are reported in Table 7. These include genes belonging to the aromatic hydrocarbon degradation pathways, which complete pathways in which genes found in both strains are also involved. Genes involved in the degradation of other organic compounds such as toluene, lignin, and xylenol were also found (Table 6).

Table 6 Annotated genes involved in hydrocarbons degradation pathways
Table 7 Annotated genes found only in the HDO1 strain

The complete annotation of the genome was carried out, with particular attention to the genes belonging to major protein families involved in fungal catabolism of organic pollutants. We could identify genes coding for proteins that have the ability to oxidize aromatic compounds, such as dioxygenases and monooxygenases. Among these, we could predict dioxygenases such as 2-nitropropane dioxygenase, extracellular dioxygenase (EC:1.13.11), gentisate 1,2-dioxygenase, intradiol ring-cleavage dioxygenase (EC:1.13.11), lignostilbene dioxygenase (EC:1.13.11.43), catechol 1,2-dioxygenase (EC:1.13.11.1), biphenyl-2,3-diol 1,2-dioxygenase, aromatic ring-opening dioxygenase, and 4-hydroxyphenylpyruvate dioxygenase (EC:1.13.11.27). These enzymes have great importance because, along with NADH-dependent flavin reductase and [2Fe-2S] redox centers, they catalyze the transformation of several aromatic compounds to dihydrodiols [66], allowing the complete mineralization of these compounds to CO2 and H2O (with the participation of other specific enzymes). Another enzyme family identified among the annotated genes was cytochrome P450. These enzymes have an interesting catabolic potential because they lack strict substrate specificity and can catalyze epoxidation and hydroxylation of several organic pollutants such as dioxins, nonylphenol, and PAHs [67]. Genes coding for extracellular proteins such as laccases and tyrosinase (known as phenoloxidase enzymes), which can degrade several groups of organic compounds because of their non-specific action, were annotated in the genome. These enzymes produce organic radicals through one-electron abstraction; those free radicals can be transformed by several reactions, including ether cleavage in dioxins and quinone formation from PAHs and chlorophenols [68]. These extracellular enzymes are extremely important because of their potential in biotechnological applications [69, 70]. Moreover, several oxidoreductases, hydrolases, dehydroxylases, isomerases, and transferases were also predicted in the studied strain. However, extracellular enzymes such as lignin and manganese peroxidases have not yet been identified.

Catabolic proteins of S. apiospermum involved in the phenol, p-cresol, and phenylbenzoate degradation pathways previously reported by Claußen and Schmidt [17, 18], such as phenol 2-monooxygenase and catechol 1,2-dioxygenase, were identified. However, hydroquinone hydroxylase, 4-hydroxybenzoate 3-hydroxylase, hydroxyquinol 1,2-dioxygenase, protocatechuate 3,4-dioxygenase, and maleylacetate reductase could not be found, suggesting that these proteins may be among those annotated as hypothetical or of unknown function, or that their genes may lie in the gap regions of the genome assembly.

Conclusions

The draft genome sequence of the environmental strain S. apiospermum HDO1, isolated from bacterial bioremediation assays in crude oil, was described here. The structural and functional information of the genome sequence of S. apiospermum advances the understanding of the ability of this fungus to degrade several kinds of xenobiotic compounds, mainly several families of hydrocarbons, and offers an opportunity to propose the use of the fungus or its enzymes in controlled bioremediation or bioaugmentation processes.